
Issue 827305

Starred by 1 user

Issue metadata

Status: Fixed
Owner: ----
Closed: Jul 19
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug




Dead ChromeOS swarming builders?

Project Member Reported by dgarr...@chromium.org, Mar 29 2018

Issue description

We appear to have a few builders which rebooted, but did not reconnect afterwards.

None of them have been in this state for long, but I still find it surprising.

We reboot these builders frequently, and the first one I examined had an uptime equal to how long it's been dead.


https://chrome-swarming.appspot.com/botlist?c=id&c=os&c=task&c=status&f=pool%3AChromeOS&f=status%3Adead&l=100&s=id%3Aasc
 
Hum.... the list of dead builders is similar in size but changing.

Is there some sanity fallback that reboots after 20 minutes of being disconnected? Should we work to understand why they aren't connecting at startup, or just let this go?
Cc: mar...@chromium.org
What is our current reboot policy for swarming builders?

PS: This may be related to  https://crbug.com/825387 .

Hum:

1) When/why exactly is a builder marked as Dead?
2) When do they reboot?

My current belief for 2 is that they reboot after a build failure, or after 20 minutes as Dead.
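The belief stated above can be written down as a small predicate, which makes it easier to compare against what the bots actually do. This is only a sketch of the reporter's guess, not the confirmed Swarming policy; the function name, its parameters, and the constant are hypothetical.

```python
# Assumption: 20 minutes is the threshold guessed in this thread,
# not a value read from the Swarming bot code.
DEAD_REBOOT_THRESHOLD_MINUTES = 20


def should_reboot(last_build_failed: bool, minutes_dead: float) -> bool:
    """Return True if the guessed policy would reboot the builder.

    Guessed policy: reboot after a build failure, or once the builder
    has been marked Dead for at least the threshold.
    """
    return last_build_failed or minutes_dead >= DEAD_REBOOT_THRESHOLD_MINUTES
```

If this guess is right, a builder that stays Dead well past the threshold (for example, for days) points at a failed reboot rather than at the policy itself.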


Cc: pprabhu@chromium.org jclinton@chromium.org
Thanks.

Any idea why they are going away and then recovering?

Comment 7 by jkop@chromium.org, Apr 2 2018

Owner: mar...@chromium.org
Status: Assigned (was: Untriaged)
https://chrome-swarming.appspot.com/bot?id=swarm-cros-13&selected=1&show_all_events=true&sort_stats=total%3Adesc

There's something really wrong with the bot.
https://screenshot.googleplex.com/6jKQyOtdFEz

Restarting the host sometimes takes 20 minutes. Restarting the bot takes >1 min. I assume it's induced by some hooks; Prathmesh had noted that querying the device was slow.
When I investigated a machine that was restarted but not connected, I found that it had rebooted and come up, but just not connected, and it reliably connects after 20 minutes of uptime (not sure if that's another reboot or not).

Where can I look for swarming client logs to see what's going on in a not-yet connected machine?

Comment 10 by maruel@google.com, Apr 7 2018

/b/s/logs
Owner: dgarr...@chromium.org
Status: Started (was: Assigned)
Thanks!
Status: Available (was: Started)
Components: -Infra>Client>ChromeOS Infra>Client>ChromeOS>CI
Owner: ----
I think it's worth understanding this, but I haven't been digging since the builders recover on their own. However, there are always a few machines in this state.
We have quite a few in this state now, and a few that appear to still be running a task after a week or more.

Looking at swarm-cros-65, which has been dead for two days, I find:

1364 2018-07-17 19:21:40.221 I: rmtree(/tmp/pip_build_root)
1364 2018-07-17 19:21:40.221 D: make_tree_deleteable(/tmp/pip_build_root)
1364 2018-07-17 19:21:40.251 W: Swallowing make_tree_deleteable() error: [Errno 1] Operation not permitted: '/tmp/pip_build_root'
1364 2018-07-17 19:21:46.259 E: /tmp/pip_build_root
Traceback (most recent call last):
  File "/b/swarming/swarming_bot.1.zip/config/bot_config.py", line 118, in _delete_globs
    os_utilities.rmtree(path)
  File "/b/swarming/swarming_bot.1.zip/api/os_utilities.py", line 1083, in rmtree
    file_path.rmtree(path)
  File "/usr/lib/python2.7/shutil.py", line 254, in rmtree
    os.rmdir(path)
OSError: [Errno 1] Operation not permitted: '/tmp/pip_build_root'
1364 2018-07-17 19:21:46.294 D: GOOGAPPUID = sha1(2018-07-17-swarm-cros-65:/b/swarming) % 1000 = 942
1364 2018-07-17 19:21:50.518 D: "POST /swarming/api/v1/bot/event HTTP/1.1" 200 2
1364 2018-07-17 19:21:50.519 D: Request https://chrome-swarming.appspot.com/swarming/api/v1/bot/event succeeded
1364 2018-07-17 19:21:50.520 I: ts_mon hook_name='on_bot_shutdown' pool=u'cores:32|cpu:x86-64-Haswell_GCE|cpu:x86-64-avx2|gpu:none|image:chromeos-trusty-17090600-04964e5cafc|inside_docker:0|kvm:1|machine_type:n1-highmem-32|os:Linux|os:Ubuntu-14.04|pool:ChromeOS|python:2.7.6|role:precq|role:tryjob|zone:us-central1-b'
1364 2018-07-17 19:21:50.520 I: on_bot_shutdown(): 0.001s
1364 2018-07-17 19:21:50.521 I: Skipping setup_bot, SWARMING_EXTERNAL_BOT_SETUP is set
1364 2018-07-17 19:21:50.521 I: Restarting machine with command sudo -n /sbin/shutdown -f -r now (Internal failure)
1364 2018-07-17 19:21:50.613 I: Restart command exited successfully
1364 2018-07-17 19:21:50.614 I: Restarting machine with command sudo -n /sbin/shutdown -r now (Internal failure)
1364 2018-07-17 19:21:50.676 I: Restart command exited successfully
1364 2018-07-17 19:21:50.677 I: Sleeping for 300

swarm-cros-68 (the next dead bot) appears to have died from a kernel panic, which should reasonably leave the machine hung and dead.
maruel:

I seem to find some that have kernel panics (our problem), and some that failed the shutdown.

I left swarm-cros-65 alone, in case you want to look at it more closely.
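For triaging bot logs like the swarm-cros-65 excerpt above, a small script can pull out the error entries, the restart attempts, and the last thing the bot logged. This is a minimal sketch: the line format (`<pid> <date> <time> <level>: <message>`) is inferred from the excerpt only, and the names `parse_entries` and `summarize` are hypothetical, not part of the Swarming bot.

```python
import re
from typing import Dict, Iterable, Iterator, NamedTuple

# Assumed line shape, inferred from the excerpt in this thread:
#   "<pid> <YYYY-MM-DD> <HH:MM:SS.mmm> <D|I|W|E>: <message>"
LOG_LINE = re.compile(
    r"^(?P<pid>\d+) (?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) (?P<level>[DIWE]): (?P<message>.*)$"
)


class Entry(NamedTuple):
    pid: int
    timestamp: str
    level: str
    message: str


def parse_entries(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield one Entry per matching line; traceback lines are skipped."""
    for line in lines:
        m = LOG_LINE.match(line.rstrip("\n"))
        if m:
            yield Entry(
                int(m["pid"]),
                f'{m["date"]} {m["time"]}',
                m["level"],
                m["message"],
            )


def summarize(lines: Iterable[str]) -> Dict[str, object]:
    """Collect error entries, restart attempts, and the final entry."""
    entries = list(parse_entries(lines))
    return {
        "errors": [e for e in entries if e.level == "E"],
        "restarts": [
            e for e in entries
            if e.message.startswith("Restarting machine with command")
        ],
        "last": entries[-1] if entries else None,
    }
```

A log whose restart entries are followed only by a `Sleeping for ...` entry, with nothing newer, would match the "failed the shutdown" pattern described here; a log that simply stops without any restart entry would need a different explanation, such as the kernel panic seen on swarm-cros-68.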

This should help with the pip_build_root issue:
https://chrome-internal-review.googlesource.com/652984
https://chrome-internal-review.googlesource.com/653267
should have helped, done as part of issue 864726.
So... you believe this is updated in newer swarm bot service versions?

We need to get new build images working to be able to get a new swarm bot service version.

Okay.
No, the swarming bot is automatically deployed; you don't need to reimage. This should not have been happening since yesterday (the 18th; your log is from the 17th).
That suggests I should just reboot all of the dead bots and they'll stay fixed this time? Nice!
Status: Fixed (was: Available)
Reinstancing now. Will revisit if this comes back.
