no space left on device for wizpig and caroline
Issue description

(At least) two release builds failed with "no space left on device":

https://uberchromegw.corp.google.com/i/chromeos/builders/caroline-release/builds/225/steps/Paygen/logs/stdio
https://uberchromegw.corp.google.com/i/chromeos/builders/wizpig-release/builds/681/steps/Paygen/logs/stdio

By the way, it's not clear (that I could tell) which device is out of space. I assume it's the drone/shard running autoserv? Would it be difficult to log the actual host name/address? Also, will it recover automatically, or does it need help? Thanks!
,
Dec 20 2016
Looks like it's the builders themselves that are running out of space, not the drones/shards. NingNing, Don, Dan, any idea what's up?
,
Dec 20 2016
Generally in paygen this means that the disk image that has been allocated is too small to contain the content being written to it; it doesn't usually mean that the slave's disk is full. Disk usage is high, but it's been pretty consistently high for almost a month now. If it's hit a tipping point, we need to re-evaluate disk space expectations; a trooper is the appropriate person to do this. But please check paygen sizes first.

https://viceroy.corp.google.com/chrome_infra/Machines/per_machine?duration=7d&hostname=cros-beefy20-c2&refresh=-1
,
Dec 21 2016
Issue 676370 has been merged into this issue.
,
Dec 21 2016
Hrm. Yesterday we were fighting against images that were too big, but those failures happened in "build_image"; bug #676152 was a good example. In that bug it was easy (ish) to figure out what was going on due to good debug output. In this case we are lacking.

Looking at the file throwing the exception (src/platform/dev/host/lib/update_payload/applier.py), I found a few people who might know something more about paygen. Hopefully one of them can help us here.

===

The bit of code that is throwing the exception looks like:

  # Make sure it's not a fake (signature) operation.
  if start_block != common.PSEUDO_EXTENT_MARKER:
    data_end = data_start + count

    # Make sure we're not running past partition boundary.
    if (start_block + num_blocks) * block_size > part_size:
      raise PayloadError(
          '%s: extent (%s) exceeds partition size (%d)' %
          (ex_name, common.FormatExtent(ex, block_size), part_size))

    # Make sure that we have enough data to write.
    if data_end >= data_length + block_size:
      raise PayloadError(
          '%s: more dst blocks than data (even with padding)')

    # Pad with zeros if necessary.
    if data_end > data_length:
      padding = data_end - data_length
      out_data += '\0' * padding

    self.payload.payload_file.seek(start_block * block_size)
    part_file.seek(start_block * block_size)
    part_file.write(out_data[data_start:data_end])

It seems like something _tried_ to see if we were out of space before writing. ...but I guess that didn't work?
,
Dec 21 2016
That, or the builder is actually running out of space, in which case this should be punted to Troopers. It's important to figure out which, though, because the trooper has no idea how to handle CrOS image size issues.
,
Dec 21 2016
Right. I wonder if we could somehow catch the exception thrown here and run "df -h"...
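A minimal sketch of what that could look like, assuming a write path similar to the part_file handle in the snippet above (the wrapper name and logging calls here are illustrative, not existing applier.py code):

  import errno
  import logging
  import subprocess

  def _write_with_space_diagnostics(part_file, data):
    """Write data; if the device runs out of space, log 'df -h' first."""
    try:
      part_file.write(data)
    except (IOError, OSError) as e:
      if e.errno == errno.ENOSPC:
        # Capture filesystem state so the Paygen log shows which disk
        # on the builder actually filled up.
        try:
          logging.error('No space left on device; df -h:\n%s',
                        subprocess.check_output(['df', '-h']))
        except OSError:
          logging.error('No space left on device; running df -h failed too.')
      raise

That would at least make it obvious from the Paygen log whether it's the builder's own disk that filled up, rather than a payload/partition size problem.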
,
Dec 21 2016
Aseda, Aviv says you've been helping Don with paygen stuff; anything you can chime in with here?
,
Dec 21 2016
The three failures above are:
* 2 failures on slave cros-beefy20-c2
* 1 failure on slave cros-beefy82-c2

Using the charts from #3, it appears that both are hitting roughly 98% disk usage. It sure seems to me like the slave is probably running out of disk space. Can we clean _something_ off those disks and see if the problem goes away?

Note also that disk usage goes from 60% (idle) to 98% (max), which means our build is using about 40% of the space on the disks of these builders. Historically I don't think we've optimized to reduce transitory disk space, but maybe it's time?

Even if it turns out that the builders are out of disk space, it seems like changing the scripts to make it more obvious what's happening is a good idea.
,
Dec 21 2016
With that theory, please punt to troopers and ask them to respawn these images. That will start with a vanilla disk. Through the same process, it will likely refill eventually, but it is a quick and easy fix to this immediate problem.
,
Dec 21 2016
The code was verifying the correctness of the update payload, so the check is to make sure the payload won't write past the partition size when the Chromebooks get the update. There's no check for builder disk space, and I think the builder is out of space.
,
Dec 21 2016
Trooper request: bug #676413
,
Dec 21 2016
Before reimaging, would it be possible to run "du /" and take a look at the output with "sort -nr"? It may give obvious clues.
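For reference, a rough equivalent that could be run on the builder from Python (a sketch only; largest_directories is not an existing tool):

  import subprocess

  def largest_directories(root='/', top_n=20):
    """Return the top_n largest directories under root, biggest first (KiB)."""
    # du will complain about unreadable paths; ignore stderr and exit status.
    proc = subprocess.Popen(['du', '-xk', root],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, _ = proc.communicate()
    entries = []
    for line in out.decode('utf-8', 'replace').splitlines():
      size, _, path = line.partition('\t')
      if size.isdigit():
        entries.append((int(size), path))
    entries.sort(reverse=True)
    return entries[:top_n]

  for size_kb, path in largest_directories():
    print('%12d KiB  %s' % (size_kb, path))

That's essentially "du -xk / | sort -nr | head" with the output kept around, which makes it easy to paste the biggest offenders into the bug.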
,
Dec 21 2016
RE #13, you should request that in the bug linked in #12.
,
Dec 21 2016
So this is forked into 3 bugs:
* trooper request: bug #676413
* request for a better error message: bug #676430
* why do we eat 40% of disk space: bug #676433

The last two could use owners.
,
Dec 21 2016
,
Dec 22 2016
Lulu release build failed on cros-beefy44-c2.
,
Jun 8 2018
,
Aug 3
This bug has an owner, so it has been triaged. Changing status to "assigned".