
Issue 823448

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




VMTest "No space left on device" running SimpleTestUpdateAndVerify on betty-nyc-android-pfq

Project Member Reported by sha...@chromium.org, Mar 19 2018

Issue description

https://luci-milo.appspot.com/buildbot/chromeos/betty-nyc-android-pfq/1547

2018/03/19 09:10:01 - cros_generate_test_payloads.py - INFO    : Dumping /b/c/cbuild/repository/src/build/images/betty/latest-cbuildbot/update.cache
2018/03/19 09:10:01 - cros_build_lib.py - INFO    : RunCommand: /b/c/cbuild/repository/src/platform/crostestutils/au_test_harness/cros_au_test_harness.py '--base_image=/b/c/cbuild/repository/src/build/images/betty/latest-cbuildbot/chromiumos_test_image.bin' '--target_image=/b/c/cbuild/repository/src/build/images/betty/latest-cbuildbot/chromiumos_test_image.bin' '--board=betty' '--type=vm' '--remote=0.0.0.0' --verbose '--jobs=1' '--ssh_private_key=/b/c/cbuild/repository/src/build/images/betty/latest-cbuildbot/id_rsa' '--test_prefix=SimpleTestUpdateAndVerify' '--test_results_root=/b/c/cbuild/repository/chroot/tmp/cbuildbotj3itQa/pfq_suite/test_harness' --no_graphics --whitelist_chrome_crashes in /b/c/cbuild/repository/src/scripts
2018/03/19 09:10:01 - cros_au_test_harness.py - INFO    : Loading update cache from /b/c/cbuild/repository/src/build/images/betty/latest-cbuildbot/update.cache
2018/03/19 09:10:01 - dev_server_wrapper.py - DEBUG   : Retrieving http://127.0.0.1:8080/check_health
2018/03/19 09:10:01 - cros_build_lib.py - DEBUG   : RunCommand: /b/c/cbuild/repository/chromite/bin/cros_sdk --no-ns-pid -- sudo 'CROS_CACHEDIR=/b/c/cbuild/repository/.cache' 'CROS_SUDO_KEEP_ALIVE=unknown' -- start_devserver --pidfile /tmp/cbuildbotj3itQa/pfq_suite/test_harness/devserver_wrapperVRpkGK --logfile /tmp/cbuildbotj3itQa/pfq_suite/test_harness/dev_server.log '--port=8080' --critical_update in /b/c/cbuild/repository
2018/03/19 09:10:06 - dev_server_wrapper.py - DEBUG   : Retrieving http://127.0.0.1:8080/check_health
E
======================================================================
ERROR: SimpleTestUpdateAndVerify (crostestutils.au_test_harness.au_test.AUTest)
Test that updates to itself.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/b/c/cbuild/repository/src/platform/crostestutils/au_test_harness/../../crostestutils/au_test_harness/au_test.py", line 220, in SimpleTestUpdateAndVerify
    target_image_path = self.worker.PrepareBase(self.target_image_path)
  File "/b/c/cbuild/repository/src/platform/crostestutils/au_test_harness/../../crostestutils/au_test_harness/vm_au_worker.py", line 59, in PrepareBase
    shutil.copy(self.vm_image_path, private_image_path)
  File "/usr/lib/python2.7/shutil.py", line 119, in copy
    copyfile(src, dst)
  File "/usr/lib/python2.7/shutil.py", line 84, in copyfile
    copyfileobj(fsrc, fdst)
  File "/usr/lib/python2.7/shutil.py", line 52, in copyfileobj
    fdst.write(buf)
IOError: [Errno 28] No space left on device
----------------------------------------------------------------------
 
Cc: kroot@chromium.org domlasko...@chromium.org
Labels: -Pri-3 Pri-1
FWIW, the "no space left on device" error is now occurring earlier in betty-nyc-android-pfq runs such as:

https://luci-milo.appspot.com/buildbot/chromeos/betty-nyc-android-pfq/1550

Pending 42/64, Building 7/7, [Time 20:39:14 | Elapsed 4m39.7s | Load 88.47 64.92 33.3]
Completed chromeos-base/google-breakpad-2017.12.23.132100-r134 (in 4m39.5s)
Pending 42/64, Building 6/6, [Time 20:39:43 | Elapsed 5m8.7s | Load 66.77 61.
20:49:45: WARNING: Killing tasks: [<_BackgroundTask(_BackgroundTask-7, started)>]
20:49:45: WARNING: Killing 34469 (sig=24 SIGXCPU)
20:49:45: WARNING: RunCommand: pstree -Apals 34469
20:49:45: WARNING: RunCommand: lsof -p 34469
20:49:45: WARNING: RunCommand: gdb --nx -q -p 34469 -ex 'set prompt'
20:49:52: INFO: Refreshing due to a 401 (attempt 1/2)
20:49:52: INFO: Refreshing access_token
20:50:15: WARNING: Killing 34469 (sig=15 SIGTERM)
  File "chromite/bin/cbuildbot", line 169, in <module>
    DoMain()
  File "chromite/bin/cbuildbot", line 165, in DoMain
    commandline.ScriptWrapperMain(FindTarget)
  File "/b/c/cbuild/repository/chromite/lib/commandline.py", line 911, in ScriptWrapperMain
    ret = target(argv[1:])
  File "/b/c/cbuild/repository/chromite/scripts/cbuildbot.py", line 1016, in main
    _RunBuildStagesWrapper(options, site_config, build_config)
  File "/b/c/cbuild/repository/chromite/scripts/cbuildbot.py", line 174, in _RunBuildStagesWrapper
    if not builder.Run():
  File "/b/c/cbuild/repository/chromite/cbuildbot/builders/generic_builders.py", line 327, in Run
    self.RunStages()
  File "/b/c/cbuild/repository/chromite/cbuildbot/builders/simple_builders.py", line 600, in RunStages
    super(DistributedBuilder, self).RunStages()
  File "/b/c/cbuild/repository/chromite/cbuildbot/builders/simple_builders.py", line 439, in RunStages
    self._RunDefaultTypeBuild()
  File "/b/c/cbuild/repository/chromite/cbuildbot/builders/simple_builders.py", line 426, in _RunDefaultTypeBuild
    self.RunBuildStages()
  File "/b/c/cbuild/repository/chromite/cbuildbot/builders/simple_builders.py", line 420, in RunBuildStages
    queue.put([builder_run, board])
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/b/c/cbuild/repository/chromite/lib/parallel.py", line 751, in BackgroundTaskRunner
    queue.put(_AllTasksComplete())
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/b/c/cbuild/repository/chromite/lib/parallel.py", line 547, in ParallelTasks
    task_errors = task.Wait()
  File "/b/c/cbuild/repository/chromite/lib/parallel.py", line 362, in Wait
    traceback.print_stack()
Warning: Short write for /b/c/cbuild/repository/cbuildbot_logs/cbuildbot.log/0.
Unhandled exception occured in tee:
Traceback (most recent call last):
  File "/b/c/cbuild/repository/chromite/cbuildbot/tee.py", line 145, in run
    _tee(input_fd, output_files, self._complain)
  File "/b/c/cbuild/repository/chromite/cbuildbot/tee.py", line 87, in _tee
    _output(data, output_files, complain)
  File "/b/c/cbuild/repository/chromite/cbuildbot/tee.py", line 74, in _output
    _output(warning, output_files, False)
  File "/b/c/cbuild/repository/chromite/cbuildbot/tee.py", line 59, in _output
    offset += os.write(f.fileno(), line[offset:])
OSError: [Errno 28] No space left on device
cbuildbot: Signaled to shutdown: caught 15 signal.

Bumping priority since this will continue to block Android PFQ, which has been red for a while now.
Components: Infra>Client>ChromeOS
Owner: jrbarnette@chromium.org
Status: Started (was: Untriaged)
I've looked at the logs; it appears that the build server is indeed out
of space (as opposed to either the target image or the VM).

On the builder itself, df shows the root file system is full:
chrome-bot@build1-m2:(Linux 14.04):~$ df -m /
Filesystem     1M-blocks    Used Available Use% Mounted on
/dev/sda4        3360711 3202642         0 100% /

That pretty much cements the conclusion.

I'm trying to figure out where the space has gone, and how to clean up.

chrome-bot@build1-m2:(Linux 14.04):/b$ df -m /
Filesystem     1M-blocks    Used Available Use% Mounted on
/dev/sda4        3360711 3202642         0 100% /
chrome-bot@build1-m2:(Linux 14.04):/b$ du -sm /b/c
3061593	/b/c
chrome-bot@build1-m2:(Linux 14.04):/b$ echo $(( 3061593000 / 3360711 ))
910

So, about 91% of the total disk space in the file system is under /b/c
(the 910 above is in parts per thousand).

chrome-bot@build1-m2:(Linux 14.04):~$ du -sm /b/c/cbuild/repository/.cache
2912930	/b/c/cbuild/repository/.cache
chrome-bot@build1-m2:(Linux 14.04):~$ echo $(( 2912930000 / 3360711 ))
866

That is, about 87% of the total disk space is under that one directory.
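
For reference, the same shares can be worked out without the scaled integer division used in the shell. This is a minimal sketch that uses only the figures from the df and du output above; the function name percent_of is illustrative and not part of any builder tooling.

```python
def percent_of(part_mb, total_mb):
    """Return part as a percentage of total, rounded to one decimal place."""
    return round(100.0 * part_mb / total_mb, 1)


TOTAL_MB = 3360711   # 1M-blocks reported by `df -m /`
B_C_MB = 3061593     # `du -sm /b/c`
CACHE_MB = 2912930   # `du -sm /b/c/cbuild/repository/.cache`

# Share of the whole file system taken by each directory.
print(percent_of(B_C_MB, TOTAL_MB))
print(percent_of(CACHE_MB, TOTAL_MB))
```

Both results agree with the rounded 91% and 87% figures quoted above.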

It's named ".cache", so it sounds safe to delete.  But I don't
know if it's truly safe, or if that will actually fix the problem...

Digging further, the culprits are under /b/c/cbuild/repository/.cache/distfiles/host.
And the biggest files there have names like cheets_x86_64-target_files-4664449.zip.
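
One way to find the biggest files under a directory like this is to walk the tree and sort by size. The sketch below is an assumption about how that search could be done, not a record of the commands actually run on the builder; largest_files is a hypothetical helper name.

```python
import os


def largest_files(root, limit=10):
    """Return up to `limit` (size_bytes, path) pairs under root, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat measures a symlink itself rather than its target.
                sizes.append((os.lstat(path).st_size, path))
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    sizes.sort(reverse=True)
    return sizes[:limit]


if __name__ == "__main__":
    # Path taken from the comment above; prints nothing if it does not exist.
    for size, path in largest_files("/b/c/cbuild/repository/.cache/distfiles/host"):
        print("%12d  %s" % (size, path))
```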

I'm more confident that it's safe to delete these; that should
allow stuff to move forward.  However, I fully expect that this
problem will recur if we don't add some more aggressive cleanup.
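
As an illustration of what "more aggressive cleanup" could look like, here is a minimal sketch that only selects the oldest files in a directory until a target number of bytes is covered, and leaves the actual deletion to the caller. It is not the permanent fix and not existing builder code; select_oldest and its parameters are hypothetical, and whether file age is the right eviction policy for distfiles is an assumption.

```python
import os


def select_oldest(directory, bytes_to_free):
    """Pick the oldest regular files in directory until their sizes
    add up to at least bytes_to_free. Nothing is deleted here."""
    entries = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        # Only plain files; skip subdirectories and symlinks.
        if os.path.isfile(path) and not os.path.islink(path):
            st = os.stat(path)
            entries.append((st.st_mtime, st.st_size, path))
    entries.sort()  # oldest modification time first

    chosen, total = [], 0
    for _mtime, size, path in entries:
        if total >= bytes_to_free:
            break
        chosen.append(path)
        total += size
    return chosen
```

Returning the candidate list instead of deleting keeps the sketch safe to run as a dry run; a caller could print the list for review before passing each path to os.remove.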

OK, the disk space leakage has been previously identified; see
bug 814989.

I'm holding this bug open as the "get things moving again" bug.
I'll go update that bug as the "we need a permanent fix" bug.

Status: Fixed (was: Started)
Space should be sufficiently cleaned up for now:

chrome-bot@build1-m2:(Linux 14.04):~$ df -k /
Filesystem      1K-blocks      Used  Available Use% Mounted on
/dev/sda4      3441367600 926841916 2339691152  29% /
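
To compare the two df readings in this bug: the earlier one is in 1M-blocks and this one is in 1K-blocks, so the units need converting first. The sketch below uses only the numbers shown in the df output; it assumes that df's Use% is computed as used / (used + available), rounded up, which is how GNU df behaves as far as I know. The helper name df_use_percent is illustrative.

```python
import math


def df_use_percent(used, available):
    """Use% as df appears to report it: used / (used + available), rounded up."""
    return int(math.ceil(100.0 * used / (used + available)))


# Before cleanup (`df -m /`) and after cleanup (`df -k /`).
print(df_use_percent(3202642, 0))
print(df_use_percent(926841916, 2339691152))

# Space freed, converting the 1K-block figure to 1M-blocks first.
used_before_mb = 3202642
used_after_mb = 926841916 / 1024.0
freed_tib = (used_before_mb - used_after_mb) / 1024.0 / 1024.0
print(round(freed_tib, 2))
```

Under that assumption the two percentages come out as 100 and 29, matching the df output, and the cleanup freed roughly 2.2 TiB.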
