Dead ChromeOS swarming builders?
Issue description
We appear to have a few builders that rebooted but did not reconnect afterwards. None of them have been in this state for long, but I still find it surprising. We reboot these builders frequently, and the first one I examined had an uptime equal to the time it has been marked dead.
https://chrome-swarming.appspot.com/botlist?c=id&c=os&c=task&c=status&f=pool%3AChromeOS&f=status%3Adead&l=100&s=id%3Aasc
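For readers who cannot open the bot list, the filters that the link applies can be read straight out of its query string. This is a minimal sketch using only the Python standard library. That `f` holds the filters and `c` the displayed columns is an assumption inferred from the values in the URL, not from Swarming documentation.

```python
from urllib.parse import urlsplit, parse_qs

# The bot list link from the issue description, copied verbatim.
url = (
    "https://chrome-swarming.appspot.com/botlist"
    "?c=id&c=os&c=task&c=status"
    "&f=pool%3AChromeOS&f=status%3Adead&l=100&s=id%3Aasc"
)

# parse_qs percent-decodes values and groups repeated keys into lists.
query = parse_qs(urlsplit(url).query)

# Assumption: each "f" value is a "key:value" filter.
filters = dict(item.split(":", 1) for item in query["f"])

# Assumption: each "c" value names a displayed column.
columns = query["c"]

print(filters)  # {'pool': 'ChromeOS', 'status': 'dead'}
print(columns)  # ['id', 'os', 'task', 'status']
```

In other words, the link lists bots in the ChromeOS pool whose status is dead.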
,
Mar 30 2018
What is our current reboot policy for swarming builders? PS: This may be related to https://crbug.com/825387 .
,
Mar 30 2018
Hum: 1) When/why exactly is a builder marked as Dead? 2) When do they reboot? My current belief for 2 is that they reboot after a build failure, or after 20 minutes as Dead.
,
Mar 30 2018
,
Mar 30 2018
Thanks. Any idea why they are going away and then recovering?
,
Apr 2 2018
,
Apr 7 2018
https://chrome-swarming.appspot.com/bot?id=swarm-cros-13&selected=1&show_all_events=true&sort_stats=total%3Adesc
There's something really wrong with the bot. https://screenshot.googleplex.com/6jKQyOtdFEz
Restarting the host sometimes takes 20 minutes. Restarting the bot takes >1 min. I assume it's induced by some hooks; Prathmesh had noted that querying the device was slow.
,
Apr 7 2018
When I investigated a machine that was restarted but not connected, I found that it had rebooted and come up, but just had not connected. It will reliably connect after 20 minutes of uptime (not sure if that's another reboot or not). Where can I look for swarming client logs to see what's going on in a not-yet-connected machine?
,
Apr 7 2018
/b/s/logs
,
Apr 7 2018
Thanks!
,
Apr 25 2018
,
Apr 27 2018
,
Apr 27 2018
I think it's worth understanding this, but I haven't been digging since the builders recover on their own. However, there are always a few machines in this state.
,
Jul 19
We have quite a few in this state now, and a few that appear to still be running a task after a week or more.
Looking at swarm-cros-65, which has been dead for two days, I find:
1364 2018-07-17 19:21:40.221 I: rmtree(/tmp/pip_build_root)
1364 2018-07-17 19:21:40.221 D: make_tree_deleteable(/tmp/pip_build_root)
1364 2018-07-17 19:21:40.251 W: Swallowing make_tree_deleteable() error: [Errno 1] Operation not permitted: '/tmp/pip_build_root'
1364 2018-07-17 19:21:46.259 E: /tmp/pip_build_root
Traceback (most recent call last):
  File "/b/swarming/swarming_bot.1.zip/config/bot_config.py", line 118, in _delete_globs
    os_utilities.rmtree(path)
  File "/b/swarming/swarming_bot.1.zip/api/os_utilities.py", line 1083, in rmtree
    file_path.rmtree(path)
  File "/usr/lib/python2.7/shutil.py", line 254, in rmtree
    os.rmdir(path)
OSError: [Errno 1] Operation not permitted: '/tmp/pip_build_root'
1364 2018-07-17 19:21:46.294 D: GOOGAPPUID = sha1(2018-07-17-swarm-cros-65:/b/swarming) % 1000 = 942
1364 2018-07-17 19:21:50.518 D: "POST /swarming/api/v1/bot/event HTTP/1.1" 200 2
1364 2018-07-17 19:21:50.519 D: Request https://chrome-swarming.appspot.com/swarming/api/v1/bot/event succeeded
1364 2018-07-17 19:21:50.520 I: ts_mon hook_name='on_bot_shutdown' pool=u'cores:32|cpu:x86-64-Haswell_GCE|cpu:x86-64-avx2|gpu:none|image:chromeos-trusty-17090600-04964e5cafc|inside_docker:0|kvm:1|machine_type:n1-highmem-32|os:Linux|os:Ubuntu-14.04|pool:ChromeOS|python:2.7.6|role:precq|role:tryjob|zone:us-central1-b'
1364 2018-07-17 19:21:50.520 I: on_bot_shutdown(): 0.001s
1364 2018-07-17 19:21:50.521 I: Skipping setup_bot, SWARMING_EXTERNAL_BOT_SETUP is set
1364 2018-07-17 19:21:50.521 I: Restarting machine with command sudo -n /sbin/shutdown -f -r now (Internal failure)
1364 2018-07-17 19:21:50.613 I: Restart command exited successfully
1364 2018-07-17 19:21:50.614 I: Restarting machine with command sudo -n /sbin/shutdown -r now (Internal failure)
1364 2018-07-17 19:21:50.676 I: Restart command exited successfully
1364 2018-07-17 19:21:50.677 I: Sleeping for 300
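When checking several dead bots, it may help to pull the last restart attempt out of each log mechanically instead of reading it by eye. The following is a minimal sketch, not part of the swarming bot. The line layout (`<pid> <date> <time> <level>: <message>`) is inferred from the excerpt above, and the helper names `parse_line` and `last_restart_attempt` are hypothetical.

```python
import re
from datetime import datetime

# Assumed layout, inferred from the excerpt: "<pid> <date> <time> <level>: <message>".
LOG_LINE = re.compile(
    r"^(?P<pid>\d+) "
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>[A-Z]): (?P<msg>.*)$"
)


def parse_line(line):
    """Return (timestamp, level, message), or None for lines such as tracebacks."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
    return ts, match.group("level"), match.group("msg")


def last_restart_attempt(lines):
    """Return (time of the last restart attempt, entries logged after it), or None."""
    entries = [e for e in map(parse_line, lines) if e is not None]
    result = None
    for index, (ts, _level, msg) in enumerate(entries):
        if msg.startswith("Restarting machine with command"):
            result = (ts, len(entries) - index - 1)
    return result


sample = [
    "1364 2018-07-17 19:21:50.521 I: Restarting machine with command sudo -n /sbin/shutdown -f -r now (Internal failure)",
    "1364 2018-07-17 19:21:50.613 I: Restart command exited successfully",
    "1364 2018-07-17 19:21:50.614 I: Restarting machine with command sudo -n /sbin/shutdown -r now (Internal failure)",
    "1364 2018-07-17 19:21:50.676 I: Restart command exited successfully",
    "1364 2018-07-17 19:21:50.677 I: Sleeping for 300",
]

when, trailing = last_restart_attempt(sample)
print(when.isoformat(), trailing)  # 2018-07-17T19:21:50.614000 2
```

A log that ends a couple of entries after a restart attempt, as this one does, is consistent with a shutdown command that returned success but never took the machine down.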
,
Jul 19
swarm-cros-68 (the next dead bot) appears to have died from a kernel panic, which should reasonably leave the machine hung and dead.
,
Jul 19
maruel: I seem to find some that have kernel panics (our problem), and some that failed the shutdown. I left swarm-cros-65 alone, in case you want to look at it more closely.
,
Jul 19
This should help with the pip_build_root issue: https://chrome-internal-review.googlesource.com/652984
https://chrome-internal-review.googlesource.com/653267 should have helped, done as part of issue 864726.
,
Jul 19
So... you believe this is updated in newer swarm bot service versions? We need to get new build images working to be able to get a new swarm bot service version. Okay.
,
Jul 19
No, the swarming bot is automatically deployed, so you don't need to reimage. This should not have been happening since yesterday (the 18th; your log is from the 17th).
,
Jul 19
That suggests I should just reboot all of the dead bots and they'll stay fixed this time? Nice!
,
Jul 19
Reinstancing now. Will revisit if this comes back.
Comment 1 by dgarr...@chromium.org
, Mar 29 2018