test results are devouring shard inodes
Issue description

CQ run for veyron_speedy failed because the devserver seems to be out of disk space:
https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_speedy-paladin/builds/6413

provision_AutoUpdate ABORT: [Errno 28] No space left on device: '/usr/local/autotest/results/hosts/chromeos4-row4-rack12-host15/4178960-provision/20172908110428/provision_AutoUpdate/status.log'
provision [ FAILED ]
provision FAIL:
provision [ FAILED ]
provision FAIL:
provision [ FAILED ]
provision FAIL:
provision [ FAILED ]
provision ABORT: [Errno 28] No space left on device: '/usr/local/autotest/results/hosts/chromeos4-row4-rack10-host21/4179009-provision/20172908111054/provision_AutoUpdate/debug/provision_AutoUpdate.DEBUG'
provision [ FAILED ]
provision ABORT: [Errno 28] No space left on device: '/usr/local/autotest/results/hosts/chromeos4-row4-rack11-host13/4179010-provision/20172908111054/provision_AutoUpdate/debug/provision_AutoUpdate.DEBUG'
provision [ FAILED ]
provision ABORT: [Errno 28] No space left on device: '/usr/local/autotest/results/hosts/chromeos4-row4-rack12-host18/4179011-provision/20172908111054/provision_AutoUpdate/debug/provision_AutoUpdate.DEBUG'
provision [ FAILED ]
provision ABORT: [Errno 28] No space left on device: '/usr/local/autotest/results/hosts/chromeos4-row4-rack10-host20/4179014-provision/20172908111054/provision_AutoUpdate/debug/provision_AutoUpdate.DEBUG'
provision [ FAILED ]
provision ABORT: [Errno 28] No space left on device: '/usr/local/autotest/results/hosts/chromeos4-row4-rack11-host6/4179015-provision/20172908111054/provision_AutoUpdate/debug/provision_AutoUpdate.DEBUG'
provision [ FAILED ]
...
,
Aug 29 2017
Somewhat related bug, lots of files in /tmp: https://bugs.chromium.org/p/chromium/issues/detail?id=759875
However, the number of files in /tmp is trivial compared to the total; I believe /usr/local/autotest/results is eating up most of it.
,
Aug 29 2017
I've deleted some old results to free up some breathing space. It looks like disk usage started rising midday on 2017-08-24: http://shortn/_OyvsTIt2ec
gs_offloader does appear to be running.
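(For anyone needing to do the same on another shard: freeing space by hand amounts to deleting old result directories, ideally ones gs_offloader has already uploaded. A minimal sketch, not the exact command run here; the 30-day cutoff and depth are assumptions, and this does not check offload status:
$ find /usr/local/autotest/results -mindepth 1 -maxdepth 1 -type d -mtime +30 | head
$ find /usr/local/autotest/results -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +
The first command is a dry-run listing; only run the second once the list looks sane.)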
,
Aug 29 2017
Pcon query: http://shortn/_qpXsoLBhE5
chromeos-server100 looks like it will die soon too: https://viceroy.corp.google.com/chromeos/machines?hostname=chromeos-server100&duration=8d&refresh=-1#_VG_fQiMacdj
Viceroy for chromeos-server100 suggests that this started at around the same time. The metrics are cut off due to https://bugs.chromium.org/p/chromium/issues/detail?id=757494
,
Aug 29 2017
http://shortn/_LYeEBA91LI
Looks like the affected machines are all gs_offloading normally. There is a drop in uploaded jobs on the 20th, four days before the inode growth. Offloading metrics come back on the 24th, the same time the inode growth started. I'm not sure if that's meaningful yet.
,
Aug 29 2017
Push to prod on the 24th. Suspicious:

autotest: git log --oneline 00cd7dc6a..6b45b8c0d
5272c9ecc [autotest] Use new suite_args in CTS/GTS
0f714f84a autotest: add metric to record 'all devservers in subnet X are down'.
83636918b [autotest] Delete unused DelayedCallTask
ac39cc336 [autotest] Protect ACTS result files from result throttling.
523638859 autotest: remove devserver server-side package from site-packages.
754403d11 autotest: Improve error message when choking on unmanaged pool
f583ceaa5 [autotest] Make Container.set_hostname reliable.
0c432d6a8 autotest: Ignore failures in cleanup during crash collection

chromite: git log --oneline 0f965afc..95f30c12
95f30c12 Revert "parallel_emerge: Work around Portage library bug with usepkg"
6dc75de9 cloud_trace: Record span metrics
30e9cd90 cloud_trace: Env variable for context
893f101b Update experimental builders at beginning of UpdateSlaveStatus
dd9052b1 cidb: add CIDB connection for Google App Engine
2fa1ce2a Update config settings by config-updater.
6c86c300 Update config settings by config-updater.
ace2aa28 cbuildbot: Add EbuildLogs step before RunParallelSteps
f56c0aba PreCQLauncher: cbuildbot --remote -> cros tryjob.
313d5c5d cros tryjob: Add --timeout and --sanity-check-build options.
cb26e28f Run git reset --hard in case of git error.
3c993467 cros flash: increase visibility of the downloaded image version reporting
80bba8ad cbuildbot: move parsed buildbot configs into options object earlier
8fbc96cc cbuildbot: Add step to archive ebuild logs
97dc2a5e cros tryjob: Confirm unknown build configs.
d4ee9add log changed CLs descriptive to log and build page
84ce4ef4 [unibuild] Add reef-uni to the pre-cq test battery
478cf117 chromeos_config: finish switching VMTest to betty.
,
Aug 29 2017
(
80 log files
x 5 {chrome,vmlog,ui,power_manager,update_engine}
+ 40 log files
x 49 shutdown.*
)
x 4 {sysinfo,crashinfo.*,provision_FirmwareUpdate/sysinfo,provision_FirmwareUpdate/sysinfo/reboot_current}/var
= a hell of a lot of files for a single test (just shy of 10k)
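Quick sanity check on that arithmetic, using the numbers above:
$ echo $(( (80 * 5 + 40 * 49) * 4 ))
9440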
,
Aug 29 2017
It looks like /var/log isn't being cleaned up between tests, so each test on a DUT is uploading more and more logs.
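A cheap way to see that growth from the shard side is to compare file counts across successive jobs for one DUT. A sketch, assuming the standard results layout (host name taken from the failures in the description):
$ for d in /usr/local/autotest/results/hosts/chromeos4-row4-rack12-host15/*/
> do
>   printf '%7d  %s\n' "$(find "$d" -type f | wc -l)" "$d"
> done
If /var/log really isn't being reset, the per-job count should climb steadily.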
,
Aug 30 2017
Further investigation is pointing strongly to some sort of change in
the amount of provision_FirmwareUpdate results that we're offloading.
Several shards are at risk of running out of inodes, and the problem
is likely to grow over time.
We've been attending to chromeos-server97 as one of the shards at
particular risk: on that shard, a large share of the content that
isn't offloading was traced to five DUTs:
chromeos1-row1-rack10-host4
chromeos4-row11-rack11-host11
chromeos4-row11-rack11-host15
chromeos4-row12-rack11-host11
chromeos4-row12-rack11-host13
These five DUTs have been locked, to stop the bleeding on that shard.
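For the record, the locking goes through atest. A sketch of the equivalent command (the --lock_reason flag name is from memory and may differ by autotest version; the reason text is just an example):
$ atest host mod --lock --lock_reason 'crbug.com/760254' \
>     chromeos1-row1-rack10-host4 chromeos4-row11-rack11-host11 \
>     chromeos4-row11-rack11-host15 chromeos4-row12-rack11-host11 \
>     chromeos4-row12-rack11-host13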
,
Aug 30 2017
,
Aug 30 2017
In order to stop the bleeding globally, we've locked all DUTs that run FAFT. We'll let them sit overnight, and see if the shards start to recover (or at least, don't get worse). Assuming that pans out, we'll figure out a better plan in the morning (Pacific coast time). If it doesn't pan out, the non-Pacific time sheriffs will have to do something, or else endure the failures.
,
Aug 30 2017
While checking today's provision failures caused by enabling throttling, I found all the DUTs involved (e.g. chromeos1-row1-rack3-host3, chromeos1-row2-rack3-host6, chromeos1-row2-rack4-host4) locked due to this bug. Is it a coincidence that all of this afternoon's 'frequently failed provision' DUTs are ones that run FAFT?
,
Aug 30 2017
,
Aug 30 2017
Here is a small dashboard showing the severity of the problem: https://pcon.corp.google.com/p#phobbs/Gs%20Offloader
The individual graphs:
Shard free inodes: http://shortn/_kzCOHhztZL
Percent of inodes consumed per 24h: http://shortn/_WZhRVQ5cz7
GS Offloader queue length: http://shortn/_3ckoZsWwUt
,
Aug 30 2017
,
Aug 30 2017
> Is it a coincidence that all of this afternoon's 'frequently failed provision' DUTs are ones that run FAFT?

Per c#11, I locked all the DUTs in FAFT pools. The criterion for locking was "in a FAFT pool", not "frequently failed provision." Whatever caused provisioning to fail when we enabled throttling should be unrelated to this bug, since this bug started at around 12:23 on 8/24.
,
Aug 30 2017
I've been doing more digging on chromeos-server14.mtv, looking at the accumulated un-offloaded results. Only about 20% of the directories include provision_FirmwareUpdate, and looking at the sizes, they cover a range; not consistently high or low. Also, looking at the graphs from c#14, most servers haven't leveled off, which is what we had been hoping to see. The upshot of it all is that the evidence for "provision_FirmwareUpdate is causing the problem" is weak. Something has gotten bigger, and provision_FirmwareUpdate is part of the problem, but it's not the whole problem, and may be as much a victim as a cause. Updating the summary to reflect that.
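That 20% figure is the kind of breakdown you can reproduce with a rough count over the backlog (a sketch, assuming the standard results layout; the two counts aren't perfectly apples-to-apples, but close enough for a ballpark):
$ total=$(ls -d /usr/local/autotest/results/*/ | wc -l)
$ with_fw=$(find /usr/local/autotest/results -maxdepth 3 -type d -name provision_FirmwareUpdate | wc -l)
$ echo "$with_fw of $total queued directories contain provision_FirmwareUpdate"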
,
Aug 30 2017
Going back to first principles: This problem started at
very close to 12:23 on 8/24; the graphs are unambiguous.
This time corresponds to a push to prod (see c#6).
That means that this problem, essentially, was caused by
one of the changes in the push.
Below is the full blamelist:
6b45b8c0d Make DUT root filesystem read-writable if not
5272c9ecc [autotest] Use new suite_args in CTS/GTS
069aac925 Check for unique (for each test step) crash files
0f714f84a autotest: add metric to record 'all devservers in subnet X are down'.
7f3c6fae7 drone_manager: Don't reimplement "min(xs, key=f)"
e14162206 drone_manager: Fix drone/active_processes metric
4ea9564e7 subcommand: Remove useless "lambda_function"
83636918b [autotest] Delete unused DelayedCallTask
fcb03452c Enable suite for Bluetooth LDAC tests
ac39cc336 [autotest] Protect ACTS result files from result throttling.
523638859 autotest: remove devserver server-side package from site-packages.
754403d11 autotest: Improve error message when choking on unmanaged pool
de0246d93 security_SandboxedServices: Add ARC services.
c4474c25e [moblab] Add new suite to support the USB camera qualification.
4097374a0 platform_FilePerms: Add mount points for host-verified art files
cb234fda9 cr50_stress_experimental: add firmware_Cr50SetBoardId
7751a2a9e cr50_stress_experimental: add firmware_Cr50BID
b78ffc73a Add control file for NoSim and LockedSim tests.
f583ceaa5 [autotest] Make Container.set_hostname reliable.
b079090fc platform_FilePerms: Add network namespace rules
b5ba22569 cr50_test: change debug filename
d94992cf1 network_WiFi_RxFrag: whirlwind APs don't support fragmentation
0c432d6a8 autotest: Ignore failures in cleanup during crash collection
353837109 cr50_stress_experimental: add firmware_Cr50Update.erase_nvmem
0e8a13a91 firmware_Cr50BID: use original for universal or bid image
6795d4d05 firmware_Cr50BID: change original to universal
33b5f04bc Let VirtualFilesystemImage restore loop device ownership and permissions.
092adc0b4 Enable valid_job_urls_only=True
898bd550e migrate to /run and /run/lock
Returning to the status quo ante of 8/24 isn't really an option, but
we can start reverting suspicious CLs. The following have descriptions
suggesting a relationship to results gathering, so they should get a
first look:
069aac925 Check for unique (for each test step) crash files
ac39cc336 [autotest] Protect ACTS result files from result throttling.
0c432d6a8 autotest: Ignore failures in cleanup during crash collection
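If anyone wants to test that theory before sending reverts, a scratch branch with just those three backed out is cheap (a sketch; any conflicts would need resolving by hand):
$ git checkout -b inode-regression-check 6b45b8c0d
$ git revert --no-edit 069aac925 ac39cc336 0c432d6a8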
,
Aug 30 2017
,
Aug 30 2017
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/e0bfe4f708a33698e50798bdf5cd628c57af3cf5

commit e0bfe4f708a33698e50798bdf5cd628c57af3cf5
Author: Dan Shi <dshi@google.com>
Date: Wed Aug 30 17:20:31 2017
,
Aug 30 2017
This was exacerbated (mainly caused) by this change: https://chrome-internal-review.googlesource.com/c/chromeos/chromeos-admin/+/434777/
dshi reverted it and is doing a forced Puppet run; this should fix the problem entirely.
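For reference, forcing a Puppet run on an affected server is the usual agent invocation (a sketch; the exact invocation depends on how the fleet is managed):
$ sudo puppet agent --test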
,
Aug 30 2017
Issue 760635 has been merged into this issue.
,
Aug 30 2017
,
Aug 31 2017
Shards have been recovering for a few hours now
,
Aug 31 2017
As an example, there are still 17,000+ jobs not yet uploaded on chromeos-server104. The average drain rate is around 2+ seconds per job, so the uploading should hopefully all finish around noon MTV time.
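Rough math behind that ETA, assuming the queue drains serially at ~2 seconds per job:
$ echo 'scale=1; 17000 * 2 / 3600' | bc
9.4
i.e. a bit over nine hours of offloading.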
,
Aug 31 2017
> "we've locked all DUTs that run FAFT" Can hosts(mainly chromeos1-*) be unlocked now?
,
Aug 31 2017
,
Aug 31 2017
; atest host mod --unlock $(atest host list --locked | awk '/crbug.com.760254/ {print $1}')
Unlocked hosts:
chromeos1-row1-rack1-host6, chromeos1-row1-rack10-host2,
chromeos1-row1-rack10-host4, chromeos1-row1-rack10-host6,
chromeos1-row1-rack11-host3, chromeos1-row1-rack11-host4,
chromeos1-row1-rack2-host3, chromeos1-row1-rack2-host5,
chromeos1-row1-rack3-host2, chromeos1-row1-rack3-host3,
chromeos1-row1-rack3-host4, chromeos1-row1-rack3-host6,
chromeos1-row1-rack4-host2, chromeos1-row1-rack4-host3,
chromeos1-row1-rack4-host4, chromeos1-row1-rack4-host5,
chromeos1-row1-rack9-host6, chromeos1-row2-rack11-host4,
chromeos1-row2-rack3-host5, chromeos1-row2-rack3-host6,
chromeos1-row2-rack4-host2, chromeos1-row2-rack4-host4,
chromeos1-row2-rack4-host5, chromeos1-row2-rack4-host6,
chromeos1-row2-rack5-host6, chromeos2-row7-rack10-host13,
chromeos2-row8-rack10-host11, chromeos4-row10-rack6-host13,
chromeos4-row10-rack7-host11, chromeos4-row10-rack7-host13,
chromeos4-row11-rack10-host11, chromeos4-row11-rack11-host11,
chromeos4-row11-rack11-host15, chromeos4-row12-rack10-host11,
chromeos4-row12-rack11-host11, chromeos4-row12-rack11-host13,
chromeos4-row12-rack11-host15, chromeos4-row12-rack2-host11,
chromeos4-row12-rack2-host13, chromeos4-row12-rack3-host11,
chromeos4-row12-rack4-host10, chromeos4-row12-rack4-host12,
chromeos4-row12-rack4-host13, chromeos4-row12-rack4-host14,
chromeos4-row2-rack11-host11, chromeos4-row2-rack11-host18,
chromeos4-row2-rack4-host11, chromeos4-row2-rack4-host13,
chromeos4-row3-rack10-host10, chromeos4-row3-rack10-host11,
chromeos4-row3-rack10-host12, chromeos4-row3-rack10-host14,
chromeos4-row3-rack10-host15, chromeos4-row3-rack10-host16,
chromeos4-row3-rack4-host11, chromeos4-row3-rack4-host13,
chromeos4-row3-rack4-host15, chromeos4-row3-rack5-host10,
chromeos4-row3-rack5-host12, chromeos4-row3-rack5-host13,
chromeos4-row3-rack6-host13, chromeos4-row3-rack6-host16,
chromeos4-row3-rack7-host12, chromeos4-row3-rack9-host10,
chromeos4-row4-rack10-host14, chromeos4-row4-rack11-host11,
chromeos4-row4-rack11-host14, chromeos4-row4-rack12-host17,
chromeos4-row4-rack13-host11, chromeos4-row4-rack13-host13,
chromeos4-row4-rack4-host19, chromeos4-row4-rack5-host11,
chromeos4-row4-rack5-host12, chromeos4-row4-rack5-host17,
chromeos4-row4-rack6-host10, chromeos4-row4-rack6-host13,
chromeos4-row4-rack6-host22, chromeos4-row4-rack7-host11,
chromeos4-row4-rack7-host12, chromeos4-row4-rack7-host15,
chromeos4-row4-rack7-host16, chromeos4-row5-rack5-host15,
chromeos4-row5-rack6-host13, chromeos4-row5-rack7-host11,
chromeos4-row5-rack7-host13, chromeos4-row6-rack1-host11,
chromeos4-row6-rack1-host13, chromeos4-row6-rack11-host12,
chromeos4-row6-rack11-host13, chromeos4-row6-rack2-host11,
chromeos4-row6-rack4-host13, chromeos4-row6-rack6-host11,
chromeos4-row6-rack6-host13, chromeos4-row8-rack3-host6,
chromeos4-row8-rack4-host13, chromeos4-row8-rack5-host13,
chromeos4-row9-rack9-host11, chromeos4-row9-rack9-host14,
chromeos4-row9-rack9-host4, chromeos6-row1-rack5-host1
,
Sep 5 2017
,
Sep 5 2017
Fixed by reverting https://chrome-internal-review.googlesource.com/c/chromeos/chromeos-admin/+/434777/
Comment 1 by jrbarnette@chromium.org, Aug 29 2017
Labels: -Pri-1 Pri-0
I did a spot check, and I believe the problem is not the devserver, but rather, the shard, chromeos-server14.mtv. More to the point, on that server, there's this:

$ for o in -m -i
> do
>   df $o /
> done
Filesystem     1M-blocks      Used Available Use% Mounted on
/dev/xvda1       2132843   1459897    564581  73% /
Filesystem        Inodes     IUsed     IFree IUse% Mounted on
/dev/xvda1     138690560 138690560         0  100% /

IOW, we're out of inodes. P0, because it's causing CQ failures, and it'll continue to do so.