Issue 862765

Issue metadata

Status: Fixed
Owner:
Closed: Jul 12
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




skylab bot died because no space left on drone

Project Member Reported by pprabhu@chromium.org, Jul 11

Issue description

Example bot: https://chrome-swarming.appspot.com/bot?id=chromeos-skylab-bot-b6dd746d-22c6-4f40-98f8-ad6eafa55cc2&selected=1&sort_stats=total%3Adesc

Bot died event: 
Failed to call hook on_after_task(): [Errno 28] No space left on device
Traceback (most recent call last):
  File "/var/run/skylab_swarming/d4699/swarming_bot.1.zip/bot_code/bot_main.py", line 293, in _call_hook_safe
    return _call_hook(chained, botobj, name, *args)
  File "/var/run/skylab_swarming/d4699/swarming_bot.1.zip/bot_code/bot_main.py", line 164, in hook
    return func(chained, botobj, name, *args, **kwargs)
  File "/var/run/skylab_swarming/d4699/swarming_bot.1.zip/bot_code/bot_main.py", line 272, in _call_hook
    ret = hook(botobj, *args, **kwargs)
  File "/var/run/skylab_swarming/d4699/swarming_bot.1.zip/config/bot_config.py", line 633, in on_after_task
    cleaned_up = _chromium_cleanup(bot)
  File "/var/run/skylab_swarming/d4699/swarming_bot.1.zip/config/bot_config.py", line 263, in _chromium_cleanup
    return _delete_globs(globs)
  File "/var/run/skylab_swarming/d4699/swarming_bot.1.zip/config/bot_config.py", line 118, in _delete_globs
    os_utilities.rmtree(path)
  File "/var/run/skylab_swarming/d4699/swarming_bot.1.zip/api/os_utilities.py", line 1083, in rmtree
    file_path.rmtree(path)
  File "/var/run/skylab_swarming/d4699/swarming_bot.1.zip/utils/file_path.py", line 1203, in rmtree
    (root, len(errors), delay))
IOError: [Errno 28] No space left on device
Calling stack:
0 api/bot.py:230:post_error()
1 bot_code/bot_main.py:298:_call_hook_safe()
2 bot_code/bot_main.py:939:_run_manifest()
3 bot_code/bot_main.py:1179:_poll_server()
4 bot_code/bot_main.py:1070:_run_bot_inner()
5 bot_code/bot_main.py:966:_run_bot()
6 bot_code/bot_main.py:1402:main()
7 __main__.py:166:CMDstart_bot()
8 __main__.py:254:main()
9 __main__.py:266:<module>()
10 runpy.py:72:_run_code()
11 runpy.py:162:_run_module_as_main()
 
Cc: xixuan@chromium.org ayatane@chromium.org
The problem is that /run is full. We store the Swarming bot working directories under /run, but that partition is only 1.6G on these machines.

Labels: -Pri-2 Pri-1
None of the bots by itself consumed too much space:

chromeos-test@pprabhu-skylab-drone-2:/run/skylab_swarming$ du -hs *
31M     d1883
63M     d2352
60M     d2926
60M     d3140
62M     d3774
30M     d3869
63M     d3898
29M     d4699
58M     d4940
26M     d5036
29M     d5056
30M     d5068
2.2M    d5301
60M     d5527
63M     d6146
28M     d6244
63M     d6441
63M     d6733
30M     d6800
30M     d7095
60M     d7318
13M     d7561
30M     d7632
30M     d7647
19M     d7795
62M     d8206
57M     d8255
0       d828
28M     d8310
59M     d8521
30M     d8595
31M     d8977
0       d9202
60M     d9430
6.2M    d9634
0       d9693

chromeos-test@pprabhu-skylab-drone-2:/run/skylab_swarming$ du -hs .
1.4G    .

But overall, the bots consumed 1.4G of the 1.6G available on /run.
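
For anyone wanting to script the same check, here is a minimal sketch in Python 3. The helper names `dir_size_bytes` and `summarize_usage` are hypothetical and not part of Swarming or the drone tooling. It sums apparent file sizes, so its numbers will differ somewhat from `du`, which reports allocated blocks.

```python
import os
import shutil


def dir_size_bytes(path):
    """Sum the apparent sizes of all files under path, without following symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # The file vanished or is unreadable; skip it.
                pass
    return total


def summarize_usage(parent):
    """Return (per-directory sizes, bytes used by all children, partition size).

    Hypothetical helper: `parent` would be something like /run/skylab_swarming.
    """
    sizes = {}
    for entry in os.scandir(parent):
        if entry.is_dir(follow_symlinks=False):
            sizes[entry.name] = dir_size_bytes(entry.path)
    partition_total = shutil.disk_usage(parent).total
    return sizes, sum(sizes.values()), partition_total
```

Comparing the summed child sizes against the partition total makes the "many small directories" failure mode visible even when no single directory looks large.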

There are two problems here.
(1) Nothing cleans up a bot's working directory after the bot shuts down (each bot restarts with a new, randomly generated working directory).
This is deliberate, so that we can easily debug any issues with bot restarts.
We should probably clean up very old directories, though.

(2) 1.6G is too small for this partition, given that 95G is available on the disk.
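
Cleaning up very old directories, as suggested in (1), could look roughly like the Python 3 sketch below. This is an illustration under stated assumptions, not the fix that landed: `prune_old_workdirs` and its parameters are hypothetical, directory age is approximated by the directory's own mtime (which only changes when entries directly inside it are added or removed), and the caller is assumed to pass the names of directories used by running bots in `keep`.

```python
import os
import shutil
import time


def prune_old_workdirs(parent, max_age_seconds, keep=(), now=None, dry_run=True):
    """Remove subdirectories of parent whose mtime is older than max_age_seconds.

    keep: directory names that must never be removed (assumed to be the
          working directories of bots that are still running).
    Returns the sorted names of the directories selected for removal.
    With dry_run=True nothing is deleted; the selection is only reported.
    """
    now = time.time() if now is None else now
    selected = []
    # Collect candidates first so nothing is deleted while scanning.
    for entry in os.scandir(parent):
        if not entry.is_dir(follow_symlinks=False) or entry.name in keep:
            continue
        age = now - entry.stat(follow_symlinks=False).st_mtime
        if age > max_age_seconds:
            selected.append(entry.name)
    if not dry_run:
        for name in selected:
            shutil.rmtree(os.path.join(parent, name), ignore_errors=True)
    return sorted(selected)
```

Defaulting to `dry_run=True` keeps the debugging value of old directories intact until someone has reviewed what would be removed.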
Status: Started (was: Assigned)
I've cleared up the immediate problem by restarting all bots on that drone.

Working on the suggested fixes now.
Project Member

Comment 4 by bugdroid1@chromium.org, Jul 12

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/a22969335bb7ce5037170d77d765a382fe3f5063

commit a22969335bb7ce5037170d77d765a382fe3f5063
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jul 12 20:02:29 2018

Project Member

Comment 5 by bugdroid1@chromium.org, Jul 12

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/07d2ca3d093fc465e522e92a9dbd32e1dddf1c26

commit 07d2ca3d093fc465e522e92a9dbd32e1dddf1c26
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jul 12 20:11:43 2018

Status: Fixed (was: Started)
We'll revisit cleaning up the Swarming working directories if needed. They are minuscule compared to the autotest logs left on the system, and the two now share a partition. Let's not add directory-rotation complexity until it is needed.
