
Issue 675646 link

Starred by 2 users

Issue metadata

Status: Archived
Owner:
Closed: Aug 3
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocked on:
issue 676413




no space left on device for wizpig and caroline

Project Member Reported by semenzato@chromium.org, Dec 19 2016

Issue description

(At least) two release builds failed with "no space left on device"

https://uberchromegw.corp.google.com/i/chromeos/builders/caroline-release/builds/225/steps/Paygen/logs/stdio

https://uberchromegw.corp.google.com/i/chromeos/builders/wizpig-release/builds/681/steps/Paygen/logs/stdio

By the way, it's not clear (that I could tell) what device is out of space.  I assume it's the drone/shard running autoserv?  Would it be difficult to log the actual host name/address?

Also, will it recover automatically or does it need help?

Thanks!

 

Comment 2 by sbasi@chromium.org, Dec 20 2016

Cc: nxia@chromium.org dgarr...@chromium.org d...@chromium.org
Looks like it's the builders themselves that are running out of space, not the drones/shards.

NingNing, Don, Dan: any idea what's up?

Comment 3 by d...@chromium.org, Dec 20 2016

Generally in paygen this means that the disk image that has been allocated is too small to contain the content that is being written to it. It generally doesn't mean that the slave's disk is full.

The disk usage is high, but it's been pretty consistently high for almost a month now. If it's hit a tipping point, we need to re-evaluate disk space expectations. A trooper is the appropriate person to do this. But please check paygen sizes first.

https://viceroy.corp.google.com/chrome_infra/Machines/per_machine?duration=7d&hostname=cros-beefy20-c2&refresh=-1

Comment 4 by sbasi@chromium.org, Dec 21 2016

Issue 676370 has been merged into this issue.
Cc: alliewood@chromium.org senj@chromium.org de...@chromium.org garnold@chromium.org
Hrm.  Yesterday we were fighting against images that were too big, but those failures happened in "build_image".  Bug #676152 was a good example.

In that bug it was easy (ish) to figure out what was going on due to good debug output.  In this case we are lacking.

Looking at the file throwing the exception (src/platform/dev/host/lib/update_payload/applier.py), I found some people who might know more about paygen.  Hopefully one of them can help us here.

===

The bit of code that is throwing the exception looks like:

      # Make sure it's not a fake (signature) operation.
      if start_block != common.PSEUDO_EXTENT_MARKER:
        data_end = data_start + count

        # Make sure we're not running past partition boundary.
        if (start_block + num_blocks) * block_size > part_size:
          raise PayloadError(
              '%s: extent (%s) exceeds partition size (%d)' %
              (ex_name, common.FormatExtent(ex, block_size),
               part_size))

        # Make sure that we have enough data to write.
        if data_end >= data_length + block_size:
          raise PayloadError(
              '%s: more dst blocks than data (even with padding)' %
              ex_name)

        # Pad with zeros if necessary.
        if data_end > data_length:
          padding = data_end - data_length
          out_data += '\0' * padding

        self.payload.payload_file.seek(start_block * block_size)
        part_file.seek(start_block * block_size)
        part_file.write(out_data[data_start:data_end])

It seems like something _tried_ to see if we were out of space before writing.  ...but I guess that didn't work?


Comment 6 by d...@chromium.org, Dec 21 2016

That or the builder is actually running out of space, in which case this should be punted to Troopers. It's important to figure out which, though, b/c trooper has no idea how to handle CrOS image size space issues.
Right.  I wonder if we could somehow catch the exception thrown here and run "df -h"...
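
One way the "catch the exception and run df -h" idea might look is sketched below. This is not paygen code: `describe_disk` and `run_with_disk_report` are hypothetical names, and the sketch assumes Python 3 (for `shutil.disk_usage`) and that the failure surfaces as an `OSError` with `errno.ENOSPC`. It also logs the host name, which would answer the "which device is out of space?" question from the issue description.

```python
import errno
import logging
import shutil
import socket


def describe_disk(path):
    """Return a one-line summary of the host name and disk usage for path."""
    usage = shutil.disk_usage(path)
    pct = 100.0 * usage.used / usage.total
    return '%s: %s is %.1f%% full (%d MiB free)' % (
        socket.gethostname(), path, pct, usage.free // (1024 * 1024))


def run_with_disk_report(func, path):
    """Call func(); on ENOSPC, log disk usage for path, then re-raise."""
    try:
        return func()
    except OSError as e:
        if e.errno == errno.ENOSPC:
            logging.error('No space left on device. %s', describe_disk(path))
        raise
```

Wrapping the write step as `run_with_disk_report(lambda: part_file.write(data), work_dir)` (with `work_dir` standing in for wherever the image is written) would leave the original traceback intact and add one log line saying which machine and filesystem filled up.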

Comment 8 by sbasi@chromium.org, Dec 21 2016

Cc: aaboagye@chromium.org
Aseda, Aviv says you've been helping Don with paygen stuff. Anything you can chime in with here?
The three failures above are:
* 2 failures on slave cros-beefy20-c2
* 1 failure on slave cros-beefy82-c2

Using the charts from #3, it appears that both are hitting around 98% disk usage.  It sure seems to me like the slave is probably running out of disk space.


Can we clean _something_ off those disks and see if the problem goes away?  Note also that disk usage goes from 60% (idle) to 98% (max), which means our build is using roughly 40% of the space on the disks of these builders.  Historically I don't think we've optimized to reduce transitory disk space, but maybe it's time?


Even if it turns out that the builders are out of disk space, it seems like changing the scripts to make it more obvious what's happening is a good idea.

Comment 10 by d...@chromium.org, Dec 21 2016

With that theory, please punt to troopers and ask them to respawn these images. That will start with a vanilla disk. Through the same process, it will likely refill eventually, but it is a quick and easy fix to this immediate problem.

Comment 11 by senj@chromium.org, Dec 21 2016

The code verifies the correctness of the update payload, so the check makes sure the payload won't write past the partition size when Chromebooks apply the update.
There's no check for builder disk space, and I think the builder is out of space.
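
For contrast with the partition-bounds check quoted earlier, a check on the builder's own disk might look roughly like the sketch below. The names `InsufficientDiskSpace` and `check_free_space` are hypothetical and nothing like this exists in applier.py as far as this thread shows. It assumes Python 3 and that the caller can estimate how many bytes it is about to write.

```python
import shutil


class InsufficientDiskSpace(Exception):
    """Raised when the host filesystem lacks room for a planned write."""


def check_free_space(directory, required_bytes):
    """Raise if the filesystem holding directory has too little free space.

    Returns the number of free bytes when the check passes.
    """
    free = shutil.disk_usage(directory).free
    if free < required_bytes:
        raise InsufficientDiskSpace(
            '%s: need %d bytes but only %d are free' %
            (directory, required_bytes, free))
    return free
```

A check like this is only advisory, since other processes can consume space between the check and the write, but it would turn a bare "no space left on device" into a message that names the directory and the shortfall.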
Trooper request: bug #676413
Before reimaging, would it be possible to run "du /" and take a look at the output with "sort -nr"?  It may give obvious clues.
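
If running "du" on the builder isn't convenient, a rough Python equivalent of "du | sort -nr" is sketched below. The function names `dir_sizes` and `largest_dirs` are made up for illustration. Unlike du, this sums apparent file sizes (`st_size`) rather than allocated blocks, does not de-duplicate hard links, and does not stay on one filesystem, so treat the numbers as approximate.

```python
import os


def dir_sizes(root):
    """Map each directory under root to the total bytes of files beneath it."""
    sizes = {}
    # Walk bottom-up so each child's total is known before its parent's.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        total = 0
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # File vanished or is unreadable; skip it.
        for name in dirnames:
            # Symlinked directories are not walked, so they count as 0.
            total += sizes.get(os.path.join(dirpath, name), 0)
        sizes[dirpath] = total
    return sizes


def largest_dirs(root, count=20):
    """Return the count largest (size, path) pairs, biggest first."""
    pairs = [(size, path) for path, size in dir_sizes(root).items()]
    return sorted(pairs, reverse=True)[:count]
```

Printing `largest_dirs('/')` before the reimage would give the same kind of clue as the sorted du output: the top entries are the directories worth looking at first.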

Comment 14 by d...@chromium.org, Dec 21 2016

RE #13, you should request that in the bug linked in #12.
So this is forked to 3 bugs:
* trooper request: bug #676413
* request for better error message: bug #676430
* why do we eat 40% disk space: bug #676433

The last two could use owners.

Comment 16 by sbasi@chromium.org, Dec 21 2016

Blockedon: 676413
Lulu release build failed on cros-beefy44-c2.


Comment 18 by nxia@chromium.org, Jun 8 2018

Cc: -nxia@chromium.org
Status: Assigned (was: Untriaged)
This bug has an owner, thus, it's been triaged. Changing status to "assigned".
Status: Archived (was: Assigned)
