skylab bot died because no space left on drone
Issue description

Example bot: https://chrome-swarming.appspot.com/bot?id=chromeos-skylab-bot-b6dd746d-22c6-4f40-98f8-ad6eafa55cc2&selected=1&sort_stats=total%3Adesc

Bot died event:

Failed to call hook on_after_task(): [Errno 28] No space left on device
Traceback (most recent call last):
  File "/var/run/skylab_swarming/d4699/swarming_bot.1.zip/bot_code/bot_main.py", line 293, in _call_hook_safe
    return _call_hook(chained, botobj, name, *args)
  File "/var/run/skylab_swarming/d4699/swarming_bot.1.zip/bot_code/bot_main.py", line 164, in hook
    return func(chained, botobj, name, *args, **kwargs)
  File "/var/run/skylab_swarming/d4699/swarming_bot.1.zip/bot_code/bot_main.py", line 272, in _call_hook
    ret = hook(botobj, *args, **kwargs)
  File "/var/run/skylab_swarming/d4699/swarming_bot.1.zip/config/bot_config.py", line 633, in on_after_task
    cleaned_up = _chromium_cleanup(bot)
  File "/var/run/skylab_swarming/d4699/swarming_bot.1.zip/config/bot_config.py", line 263, in _chromium_cleanup
    return _delete_globs(globs)
  File "/var/run/skylab_swarming/d4699/swarming_bot.1.zip/config/bot_config.py", line 118, in _delete_globs
    os_utilities.rmtree(path)
  File "/var/run/skylab_swarming/d4699/swarming_bot.1.zip/api/os_utilities.py", line 1083, in rmtree
    file_path.rmtree(path)
  File "/var/run/skylab_swarming/d4699/swarming_bot.1.zip/utils/file_path.py", line 1203, in rmtree
    (root, len(errors), delay))
IOError: [Errno 28] No space left on device

Calling stack:
  0  api/bot.py:230:post_error()
  1  bot_code/bot_main.py:298:_call_hook_safe()
  2  bot_code/bot_main.py:939:_run_manifest()
  3  bot_code/bot_main.py:1179:_poll_server()
  4  bot_code/bot_main.py:1070:_run_bot_inner()
  5  bot_code/bot_main.py:966:_run_bot()
  6  bot_code/bot_main.py:1402:main()
  7  __main__.py:166:CMDstart_bot()
  8  __main__.py:254:main()
  9  __main__.py:266:<module>()
  10 runpy.py:72:_run_code()
  11 runpy.py:162:_run_module_as_main()
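For anyone triaging a similar [Errno 28] failure, here is a minimal sketch of checking how full the filesystem behind a bot directory is. It is not part of the Swarming bot code. The function names and the 0.85 threshold are hypothetical, and it assumes Python 3 (`shutil.disk_usage` is not available in the Python 2 runtime shown in the traceback).

```python
import shutil


def used_fraction(used_bytes, total_bytes):
    """Return the fraction of a partition that is in use (0.0 to 1.0)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes


def is_nearly_full(path, threshold=0.85):
    """Return True if the filesystem holding `path` is at or above `threshold`.

    `threshold` is a hypothetical alerting level, not a value from this bug.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return used_fraction(usage.used, usage.total) >= threshold
```

Calling `is_nearly_full("/run")` on a drone would report whether the partition is close to the point where hooks such as on_after_task() start failing.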
Jul 11
None of the bots by itself consumed too much space:

chromeos-test@pprabhu-skylab-drone-2:/run/skylab_swarming$ du -hs *
31M d1883
63M d2352
60M d2926
60M d3140
62M d3774
30M d3869
63M d3898
29M d4699
58M d4940
26M d5036
29M d5056
30M d5068
2.2M d5301
60M d5527
63M d6146
28M d6244
63M d6441
63M d6733
30M d6800
30M d7095
60M d7318
13M d7561
30M d7632
30M d7647
19M d7795
62M d8206
57M d8255
0 d828
28M d8310
59M d8521
30M d8595
31M d8977
0 d9202
60M d9430
6.2M d9634
0 d9693
chromeos-test@pprabhu-skylab-drone-2:/run/skylab_swarming$ du -hs .
1.4G .

But overall, the bots consumed 1.4G of the 1.6G available on /run/.

There are two problems here.

(1) We don't have anything cleaning up bot working directories after a bot is shut down (the bots restart with a randomly generated working directory each time). The old directories are left in place so that we can easily debug any issues with bot restarts. We should probably clean up very old directories though.

(2) 1.6G is too small a partition, given that we have 95G available on the disk.
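A rough sketch of what "clean up very old directories" could look like, assuming Python 3. It is not the fix that landed for this bug. The function names, the `keep` parameter, and the age cutoff are all hypothetical. Directory mtime is only a proxy for "bot no longer running", so a real version would need to exclude the working directories of live bots (here via `keep`).

```python
import os
import shutil
import time


def find_stale_dirs(root, max_age_seconds, keep=(), now=None):
    """Return subdirectories of `root` whose mtime is older than the cutoff.

    `keep` lists directory names to skip, e.g. those of running bots.
    """
    now = time.time() if now is None else now
    stale = []
    for entry in os.scandir(root):
        if entry.name in keep or not entry.is_dir(follow_symlinks=False):
            continue
        age = now - entry.stat(follow_symlinks=False).st_mtime
        if age > max_age_seconds:
            stale.append(entry.path)
    return sorted(stale)


def remove_stale_dirs(root, max_age_seconds, keep=(), dry_run=True):
    """Delete stale directories; with dry_run=True only report them."""
    stale = find_stale_dirs(root, max_age_seconds, keep=keep)
    if not dry_run:
        for path in stale:
            shutil.rmtree(path, ignore_errors=True)
    return stale
```

For example, `remove_stale_dirs("/run/skylab_swarming", 7 * 86400)` would list directories untouched for a week without deleting anything. Defaulting to a dry run keeps recent directories available for debugging bot restarts.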
Jul 11
I've cleared up the immediate problem by restarting all bots on that drone. Working on the suggested fixes now.
Jul 12
The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/a22969335bb7ce5037170d77d765a382fe3f5063

commit a22969335bb7ce5037170d77d765a382fe3f5063
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jul 12 20:02:29 2018
Jul 12
The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/07d2ca3d093fc465e522e92a9dbd32e1dddf1c26

commit 07d2ca3d093fc465e522e92a9dbd32e1dddf1c26
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jul 12 20:11:43 2018
Jul 12
We'll revisit cleaning up the Swarming working directories if needed. The Swarming directories are minuscule compared to the autotest logs left on the system, and they're now on the same partition. Let's not add directory rotation complexity until it's needed.
Comment 1 by pprabhu@chromium.org, Jul 11