
Issue 676430

Starred by 1 user

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug




Paygen stage should make it more obvious when builder is out of space

Project Member Reported by diand...@chromium.org, Dec 21 2016

Issue description

In bug #675646, we are postulating that the build slave was out of disk space.

Ideally, something in the logs should have made this more obvious.

Specifically, the error that we saw was:

===

@@@STEP_FAILURE@@@
06:40:28: ERROR: <type 'exceptions.IOError'>: [Errno 28] No space left on device
Traceback (most recent call last):
  File "/b/cbuild/internal_master/chromite/lib/parallel.py", line 602, in TaskRunner
    task(*x, **task_kwargs)
  File "/b/cbuild/internal_master/chromite/lib/parallel.py", line 800, in <lambda>
    fn = lambda idx, task_args: out_queue.put((idx, task(*task_args)))
  File "/b/cbuild/internal_master/chromite/lib/paygen/paygen_build_lib.py", line 268, in _GenerateSinglePayload
    dry_run=dry_run)
  File "/b/cbuild/internal_master/chromite/lib/paygen/paygen_payload_lib.py", line 837, in CreateAndUploadPayload
    dry_run=dry_run).Run()
  File "/b/cbuild/internal_master/chromite/lib/paygen/paygen_payload_lib.py", line 707, in Run
    self._drm(self._VerifyPayload)
  File "/b/cbuild/internal_master/chromite/lib/paygen/dryrun_lib.py", line 45, in __call__
    return self.Run(func, *args, **kwargs)
  File "/b/cbuild/internal_master/chromite/lib/paygen/dryrun_lib.py", line 82, in Run
    return self._Call(func, *args, **kwargs)
  File "/b/cbuild/internal_master/chromite/lib/paygen/dryrun_lib.py", line 86, in _Call
    return func(*args, **kwargs)
  File "/b/cbuild/internal_master/chromite/lib/paygen/paygen_payload_lib.py", line 681, in _VerifyPayload
    self._ApplyPayload(payload, is_delta)
  File "/b/cbuild/internal_master/chromite/lib/paygen/paygen_payload_lib.py", line 641, in _ApplyPayload
    payload.Apply(bspatch_path=bspatch_path, **part_files)
  File "/b/cbuild/internal_master/src/platform/dev/host/lib/update_payload/payload.py", line 321, in Apply
    old_rootfs_part=old_rootfs_part)
  File "/b/cbuild/internal_master/src/platform/dev/host/lib/update_payload/applier.py", line 569, in Run
    self.payload.manifest.old_rootfs_info)
  File "/b/cbuild/internal_master/src/platform/dev/host/lib/update_payload/applier.py", line 517, in _ApplyToPartition
    new_part_file, new_part_info.size)
  File "/b/cbuild/internal_master/src/platform/dev/host/lib/update_payload/applier.py", line 458, in _ApplyOperations
    self._ApplyReplaceOperation(op, op_name, data, new_part_file, part_size)
  File "/b/cbuild/internal_master/src/platform/dev/host/lib/update_payload/applier.py", line 269, in _ApplyReplaceOperation
    part_file.write(out_data[data_start:data_end])
IOError: [Errno 28] No space left on device

<type 'exceptions.IOError'>: [Errno 28] No space left on device

===

With this error, it's unclear to the oblivious sheriff / deputy / trooper whether we ran out of space for something Chrome OS related (like we blew out the image size) or whether we ran out of space on the build slave.

===

A proposal is to catch errors _somewhere_ in that call stack (maybe just catch IOErrors?) and then print the output of "df -h" to the logs.  Plausibly you could even look for "100%" somewhere in the text and print a hint that the build slave might be out of space.
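
A minimal sketch of that proposal, assuming a wrapper can be placed around the failing call. The names run_with_space_hint, disk_usage_report and full_filesystems are hypothetical and are not part of chromite or paycheck. The wrapper only adds logging and then re-raises, so the stage still fails as before.

```python
import errno
import subprocess


def disk_usage_report():
    """Return the output of `df -h` as text, or a note if df cannot be run."""
    try:
        return subprocess.check_output(['df', '-h']).decode('utf-8', 'replace')
    except (OSError, subprocess.CalledProcessError) as e:
        return 'could not run df -h: %s' % e


def full_filesystems(df_output):
    """Return the df lines (header skipped) that contain a 100% column."""
    return [line for line in df_output.splitlines()[1:]
            if '100%' in line.split()]


def run_with_space_hint(func, *args, **kwargs):
    """Call func; on ENOSPC, log df output and a hint, then re-raise."""
    try:
        return func(*args, **kwargs)
    except (IOError, OSError) as e:
        if e.errno == errno.ENOSPC:
            report = disk_usage_report()
            print('Caught "No space left on device"; df -h reports:')
            print(report)
            for line in full_filesystems(report):
                print('HINT: this filesystem is full: %s' % line)
        raise
```

Printing the whole df line rather than a fixed "build slave is out of space" message lets the reader see which mount point is at 100%, so a full builder disk can be told apart from a full image.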

===

See bug #676152 for an example of a different script that does something similar. In that case it's a bash script that catches things, but Python should be able to catch exceptions too.




 
Also please note the comment in issue 675645.
Owner: de...@chromium.org
That error does look like we really ran out of disk space on the builder.

"IOError: [Errno 28] No space left on device" is a generic Python exception raised by the paycheck script (which is running as a library). Paycheck is traditionally maintained by the update engine team.

A generic error handler might not be as useful as it sounds. In general, it wouldn't run until after stage-specific cleanup had finished. In this case, that cleanup would probably have freed up ~20 gigs of space, so a "df -h" printed at that point would likely no longer show the full filesystem.
Components: Build
Components: -Build Infra>Client>ChromeOS>Build
Labels: -Pri-2 Pri-3
Owner: ahass...@chromium.org
Status: Available (was: Untriaged)

Comment 5 by nxia@chromium.org, Jun 8 2018

Cc: -nxia@chromium.org
Status: WontFix (was: Available)
Given #2, I'm going to close this.
Status: Available (was: WontFix)
It's an ongoing annoyance. We run out of space on the disk image, and everyone thinks the build bot ran out, but none of the build bots have had less than 1T free in years.

There really is a filesystem out of space, but it's the loopback mounted image we are generating, not the builder. Even Deymo misread it, and he should know better.
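
One way to make that distinction visible in the logs is to report free space for the filesystem that actually holds the file being written, next to the builder's own work directory. This is a sketch only: free_bytes, same_filesystem and describe_space are hypothetical helpers, and it assumes the path of the failing write is known at the point where the error is caught.

```python
import os


def free_bytes(path):
    """Bytes available to unprivileged users on the filesystem holding path."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize


def same_filesystem(path_a, path_b):
    """True if both paths live on the same device."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def describe_space(path):
    """One log line with the free space on path's filesystem."""
    return '%s: %.1f GiB free' % (path, free_bytes(path) / float(1 << 30))
```

If same_filesystem() is false for the write target and the builder's work directory, and only the target reports close to zero free space, the log points at the image rather than the builder.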
Owner: ----
