Bringing Tailscale up fails #43

Closed
jjo93sa opened this issue Oct 4, 2020 · 30 comments · Fixed by #59
Labels
bug This bug is confirmed and can be reproduced.

Comments

@jjo93sa

jjo93sa commented Oct 4, 2020

Describe the bug
Been using this role for a while. Recently, I guess something in Tailscale changed, because there's now a problem bringing up the Tailscale connection, which fails every time (I'm using tailscale_args to set subnet routes).

To Reproduce
Steps to reproduce the behavior:

  1. Go to use this role to install Tailscale
  2. See failure at start-up

Expected behavior
Tailscale should start.

Screenshots
N/A.

Desktop (please complete the following information):

  • OS: Ubuntu 18.04, 20.04

Additional context
I think Tailscale introduced extra logging as part of "tailscale up", such that the test

tailscale_status.stdout | length == 0

fails, even when the application has successfully started. As a workaround, I've made the test >=, but this is clearly a hack.

@jjo93sa jjo93sa added the bug:needs-reproduction A reported bug that needs to be confirmed and reproduced. label Oct 4, 2020
@artis3n
Owner

artis3n commented Oct 4, 2020

Thanks for this issue - I noticed this recently as well and haven't had time to look into it. Something has definitely changed and the role needs to be updated.

@artis3n artis3n added bug This bug is confirmed and can be reproduced. and removed bug:needs-reproduction A reported bug that needs to be confirmed and reproduced. labels Oct 4, 2020
@artis3n
Owner

artis3n commented Oct 5, 2020

tailscale up seems to always pass in CI, but I have encountered similar consistent failures on new bare metal hosts using this role recently. That is unfortunate. Probably won't happen until this weekend, but I'll spin up a bunch of spot instances and see if I can nail down the exact failing assumption and make it reproducible for testing.

@jjo93sa
Author

jjo93sa commented Oct 5, 2020 via email

@artis3n
Owner

artis3n commented Oct 6, 2020

If the machine is already connected, tailscale up returns exit code 0 with no stdout content. If --authkey has some issue during that first call, then tailscale up throws an OAuth URL to open in your browser and waits for that out of band auth to succeed. I am guessing that --authkey is perhaps not formatting correctly anymore and then the task fails timing out for the OAuth grant. This is a guess I need to isolate and reproduce.

@jjo93sa
Author

jjo93sa commented Oct 8, 2020

I've tried to repeat the Tailscale install using this role on a clean VM. When tailscale status is issued within task "Check if Tailscale is connected", I get the following logging in Ansible's debug output:

"stdout": "[L+V9o] tx= 0 rx= 0 10.10.20.251:41641, 172.17.0.1:41641", "stdout_lines": [ "[L+V9o] tx= 0 rx= 0 10.10.20.251:41641, 172.17.0.1:41641"

In the following task, "Bring Tailscale Up" there is the original test I described:

when: tailscale_status.stdout | length == 0

And this is what is causing this task to be skipped:

skipping: [gargantua] => { "changed": false, "skip_reason": "Conditional result was False" }

Thus, whatever test we apply to the stdout string captured in the "Check if Tailscale is connected" task needs to take into account that newer Tailscale versions print text on success where they once printed nothing. From the output I see, it is difficult to determine what in the output of tailscale status indicates success. I do note that the exit code when running it manually is 0; perhaps Tailscale is a good citizen and uses non-zero return codes to indicate failure. It certainly seems to, from my (very inexperienced) understanding of the CLI code.

I've therefore opted (in the interim) to use the following conditional, which is tested to work:

when: tailscale_status.rc == 0
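
To make the difference between the two conditionals concrete, here is a minimal Python sketch. It only mimics the two `when:` expressions quoted in this thread; the function names are hypothetical stand-ins, not part of the role, and the sample strings are the outputs reported above.

```python
# Hypothetical stand-ins for the two Ansible `when:` expressions in this thread.

def original_condition(stdout: str) -> bool:
    """Mirrors `tailscale_status.stdout | length == 0`."""
    return len(stdout) == 0


def interim_condition(rc: int) -> bool:
    """Mirrors `tailscale_status.rc == 0`."""
    return rc == 0


# Older releases reportedly printed nothing from `tailscale status` here.
old_stdout = ""
# Newer releases print a line for the local node (output copied from above).
new_stdout = "[L+V9o] tx= 0 rx= 0 10.10.20.251:41641, 172.17.0.1:41641"

print(original_condition(old_stdout))  # the "Bring Tailscale Up" task would run
print(original_condition(new_stdout))  # the task would be skipped
print(interim_condition(0))            # the task would run whenever rc is 0
```

Note that the `rc`-based check only distinguishes anything if `tailscale status` actually returns a non-zero code in some state, which this sketch does not establish.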

HTH

@artis3n
Owner

artis3n commented Oct 8, 2020

That is helpful, thank you! I have not had a chance to test myself yet.

I did file this against Tailscale a few months ago and I was seeing exit code 0 in all circumstances, so unless that has changed I do not think we can rely on exit code status. I think we'll be able to write a less brittle conditional check, though. Hoping for some time this weekend to dig into it.

@jjo93sa
Author

jjo93sa commented Oct 8, 2020 via email

@artis3n
Owner

artis3n commented Oct 11, 2020

Brand new Ubuntu 20.04 AMI:

image

So the when conditional is still accurate. However:

ubuntu@ip-172-31-87-21:~$ sudo tailscale up --help
USAGE
  up [flags]

"tailscale up" connects this machine to your Tailscale network,
triggering authentication if necessary.

The flags passed to this command are specific to this machine. If you don't
specify any flags, options are reset to their default.

FLAGS
  -accept-dns true                           accept DNS configuration from the admin panel
  -accept-routes false                       accept routes advertised by other Tailscale nodes
  -advertise-routes ...                      routes to advertise to other nodes (comma-separated, e.g. 10.0.0.0/8,192.168.0.0/24)
  -advertise-tags ...                        ACL tags to request (comma-separated, e.g. eng,montreal,ssh)
  -authkey ...                               node authorization key
  -enable-derp true                          enable the use of DERP servers
  -host-routes true                          install host routes to other Tailscale nodes
  -hostname ...                              hostname to use instead of the one provided by the OS
  -login-server https://login.tailscale.com  base URL of control server
  -netfilter-mode on                         netfilter mode (one of on, nodivert, off)
  -shields-up false                          don't allow incoming connections
  -snat-subnet-routes true                   source NAT traffic to local routes advertised with -advertise-routes

The syntax for the flags has changed. Instead of --authkey it is -authkey...

But that doesn't seem to matter, I can successfully auth with --

image

However with both uses of authkey, putting in an invalid auth key hangs the process until I manually quit (or, I am supposing, Ansible times out).

image
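
For reference, a hang like this can be bounded by running the command under a timeout. The following is a minimal Python sketch of that idea, not something the role does; `run_with_timeout` is a hypothetical helper, and a sleeping Python process stands in for a `tailscale up` call that never returns.

```python
import subprocess
import sys


def run_with_timeout(cmd, seconds):
    """Run cmd and return (finished, returncode).

    finished is False and returncode is None when the command
    did not exit within `seconds` and was killed.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=seconds)
    except subprocess.TimeoutExpired:
        return False, None
    return True, proc.returncode


# Stand-in for a command that blocks waiting for out-of-band authentication.
hanging = [sys.executable, "-c", "import time; time.sleep(30)"]
finished, rc = run_with_timeout(hanging, 1)
print(finished, rc)
```

With a bound like this, an invalid key would surface as a timeout instead of blocking until someone quits the process manually.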

Now on to debugging the Ansible role directly on an instance to see what's going on. The commands run by the role should be working correctly, unless some formatting issue appeared out of nowhere with the authkey variable. I think that is unlikely.

Can you elaborate on the circumstances where you saw non-zero return codes from tailscale status?

@artis3n
Owner

artis3n commented Oct 11, 2020

> When tailscale status is issued within task "Check if Tailscale is connected", I get the following logging in Ansible's debug output

The output from that command as you describe it is an already authenticated Tailscale node, so the role correctly skips running up.

@artis3n
Owner

artis3n commented Oct 11, 2020

Testing steps:

Ubuntu -

sudo apt install python3 python3-pip ansible
ansible-galaxy install artis3n.tailscale
  • Copy molecule/default/converge.yml and modify the tailscale_auth_key appropriately
  • Set hosts: localhost and connection: local on the playbook

Amazon Linux 2:

sudo yum install python python-pip
pip install ansible
ansible-galaxy install artis3n.tailscale
  • Copy molecule/default/converge.yml and modify the tailscale_auth_key appropriately
  • Set hosts: localhost and connection: local on the playbook

Could not reproduce on Ubuntu 20.04 or Amazon Linux 2. The role successfully auth'd and connected the machine from a blank slate. I experienced an issue on PopOS 20.04 on a personal host but didn't dig into it. Can't reproduce that now.

@artis3n artis3n added bug:needs-reproduction A reported bug that needs to be confirmed and reproduced. and removed bug This bug is confirmed and can be reproduced. labels Oct 11, 2020
@jjo93sa
Author

jjo93sa commented Oct 12, 2020

The logging I provided in #43 (comment) was generated on a fresh Ubuntu 18.04 installation running on a KVM host, using this role executed as part of a playbook run with verbose logging. I could try it again, but stop the role before the tailscale status task and run those final steps manually to see if there's any logging/difference.

@jjo93sa
Author

jjo93sa commented Oct 12, 2020

OK, I tried this again. I spun up a brand new Ubuntu 20.04.1 KVM guest. I applied the never tag to the tasks "Check if Tailscale is connected" and "Bring Tailscale Up" in the role, so Ansible did not execute them. I then logged into the VM and generated this output by executing the tasks manually:

Screenshot 2020-10-12 at 17 59 26

This shows the logging that I see with tailscale status. It also shows the rc is 0.

So, now I run the tailscale up command with my authkey, and then status again.

Screenshot 2020-10-12 at 18 20 51

Confirmed that the link is up: I can ping help.ipn.dev AOK. For reference,

ansible@gargantua:~$ sudo tailscale version
1.1.527-gf4f1e2e09

This experience is consistent for me on 18.04 and 20.04, both as KVM guests and on Raspberry Pi "bare metal" installs.

Let me know if any other testing would be useful.

@artis3n
Owner

artis3n commented Oct 12, 2020

Do the manual setup commands for Tailscale work in your virtualized Ubuntu host? I will try the role against an Ubuntu VM tonight. This role does not yet support Raspberry Pi.

@jjo93sa
Author

jjo93sa commented Oct 12, 2020 via email

@artis3n
Owner

artis3n commented Oct 12, 2020

The purpose of running the status command is to check whether you have authenticated to Tailscale - so I don't think an alternate command would be better. I will try to reproduce on an Ubuntu VM. The return code is always 0 for status, so I can't use that. If needed, I can more intelligently regex on the status stdout to check whether the server is authenticated to Tailscale, but I'd really like to nail down this as a reproducible case on my end to make that happen correctly.

@jjo93sa
Author

jjo93sa commented Oct 12, 2020 via email

@artis3n
Owner

artis3n commented Oct 12, 2020

That check is for idempotency - to not attempt to re-authenticate if the node is already authenticated.

@jjo93sa
Author

jjo93sa commented Oct 13, 2020

Ah, yes, I knew that, sorry.

I'm checking through the Tailscale CLI status code, and this commit is where I think the output was changed. In particular, these lines were added:

if statusArgs.self && st.Self != nil {
  printPS(st.Self)
}

With comment: "cmd/tailscale: add local node's information to status output (by default)"
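
As a rough model of what that change does to the output (a Python sketch, not the actual Go code; `status_lines`, `self_node`, `peers`, and `show_self` are hypothetical names, and the real status output is more involved), printing the local node by default means the output is no longer empty even when there are no peers to list:

```python
def status_lines(self_node, peers, show_self=True):
    """Simplified model: optionally emit the local node's line, then peer lines."""
    lines = []
    if show_self and self_node is not None:
        lines.append(self_node)
    lines.extend(peers)
    return lines


local = "[L+V9o] tx= 0 rx= 0"

# Modelling the behavior before the commit: no self line, no peers, empty output.
print(status_lines(local, [], show_self=False))
# Modelling the behavior after the commit: the self line is always present.
print(status_lines(local, []))
```

Under this model, any check that treats empty stdout as "not connected yet" stops matching once the self line is printed.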

@artis3n
Owner

artis3n commented Oct 21, 2020

Can you pass a copy of the way you invoke the role? e.g. to match

> there's now a problem bringing up the Tailscale connection, which fails every time (I'm using tailscale_args to set subnet routes)

@jjo93sa
Author

jjo93sa commented Oct 22, 2020

This is what I have at the moment.

- name: Task 4 - Install Tailscale
  include_role:
    name: ansible-role-tailscale
  vars:
    release_stability: stable
    tailscale_args: "--accept-routes=false --advertise-routes={{ tailscale_subnets | join(',') }}"
    tailscale_auth_key: !vault |
              $ANSIBLE_VAULT;1.1;AES256
              35363...6361
  tags: [ tailscale, always ]

@artis3n
Owner

artis3n commented Oct 24, 2020

This is my output on a guest Ubuntu 20.04 VM.
image

@artis3n
Owner

artis3n commented Oct 24, 2020

---
- name: Test
  hosts: localhost
  connection: local
  tasks:
    - name: "Include artis3n.tailscale"
      include_role:
        name: artis3n.tailscale
      vars:
        tailscale_auth_key: !vault |
            $ANSIBLE_VAULT;1.2;AES256;tailscale
            ....

@artis3n
Owner

artis3n commented Oct 24, 2020

Similar success with (reverted to clean snapshot):

---
- name: Test
  hosts: localhost
  connection: local
  tasks:
    - name: "Include artis3n.tailscale"
      include_role:
        name: artis3n.tailscale
      vars:
        tailscale_args: "--accept-routes=false --advertise-routes=10.0.0.0/24,10.0.1.0/24"
        tailscale_auth_key: !vault |
            $ANSIBLE_VAULT;1.2;AES256;tailscale
            ....

@artis3n
Owner

artis3n commented Oct 25, 2020

Try running ansible-galaxy list and ensure you are using version v1.6.1 of this role

@jjo93sa
Author

jjo93sa commented Oct 25, 2020

Thanks for your help on this one. I've also been trying some experiments. I created a new Ubuntu 20.04 x86_64 VM under VMware Fusion. I executed the installation commands from the Tailscale "Getting Started" page.

Using the stable branch, I get the same result as you: no logging upon execution of tailscale status. The version of Tailscale was 1.0.5 (IIRC)

Using the unstable branch, I get the logging as described earlier (status command result doesn't depend on sudo):

Screenshot 2020-10-25 at 09 14 27

So, it might seem that this is the solution to the problem. The only concern I have is that I'm sure I've tried both stable and unstable branches using the Ansible role, and experienced the problem in both cases. I guess that's the next thing to test? However, if the unstable branch is likely to migrate to stable, it might be worth addressing that in this role ahead of time?

@artis3n
Owner

artis3n commented Oct 25, 2020

I am not sure why you are seeing that on the stable branch when executing the role, but I'm not able to reproduce that.

But if this is on the unstable branch, this is a good heads up that a fix will be needed. I will play with that but I'd rather see it become behavior on the stable branch first before I invest a lot of time in resolving it. This behavior may not make it into stable.

@artis3n
Owner

artis3n commented Oct 30, 2020

Aaand tailscale 1.2.0 is released so let's see what happens with renewed testing

@artis3n
Owner

artis3n commented Nov 1, 2020

1.2.0 doesn't appear to have broken the role, so I still cannot reproduce this. I'm going to leave this open in case you manage to identify why you are seeing this behavior. FWIW the VM testing I did with Ubuntu 20.04 was with VMware Workstation 15.x.

@artis3n artis3n added cannot-reproduce The issue as described has not been reproduced, despite attempts to reproduce the issue bug This bug is confirmed and can be reproduced. and removed bug:needs-reproduction A reported bug that needs to be confirmed and reproduced. cannot-reproduce The issue as described has not been reproduced, despite attempts to reproduce the issue labels Nov 1, 2020
@artis3n
Owner

artis3n commented Nov 7, 2020

Got it! On Tailscale 1.2.2 I am seeing the behavior in this issue. So it did make its way from unstable to stable.

TASK [artis3n.tailscale : Tailscale Version] ***********************************
ok: [instance] => {
    "tailscale_version.stdout": "1.2.2\n  tailscale commit: 76c2982d8832b9a70305a24abcc600486e39b523\n  go version: go1.15.4"
}

TASK [artis3n.tailscale : Bring Tailscale Up] **********************************
skipping: [instance]

TASK [artis3n.tailscale : Print Status if Tailscale Up Is Skipped - Please Include in GitHub Issue] ***
ok: [instance] => {
    "tailscale_status": {
        "changed": false,
        "cmd": [
            "tailscale",
            "status"
        ],
        "delta": "0:00:00.006949",
        "end": "2020-11-07 15:25:24.602192",
        "failed": false,
        "rc": 0,
        "start": "2020-11-07 15:25:24.595243",
        "stderr": "",
        "stderr_lines": [],
        "stdout": "[L+V9o]                                            tx=       0 rx=       0       ",
        "stdout_lines": [
            "[L+V9o]                                            tx=       0 rx=       0       "
        ]
    }
}

artis3n added a commit that referenced this issue Nov 7, 2020
@artis3n artis3n closed this as completed in 795afd4 Nov 7, 2020
@artis3n
Owner

artis3n commented Nov 7, 2020

Merging the PR auto-closed this issue - v1.8.0 is now on Ansible Galaxy. That version should fix this issue. Re-open if that is not the case.
