Project: chromium
Status: Verified
Owner:
Closed: Sep 5
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocking:
issue 759780



test results are devouring shard inodes
Project Member Reported by mka@chromium.org, Aug 29
CQ run for veyron_speedy failed because the devserver seems to be out of disk space:

https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_speedy-paladin/builds/6413

  provision_AutoUpdate                         ABORT: [Errno 28] No space left on device: '/usr/local/autotest/results/hosts/chromeos4-row4-rack12-host15/4178960-provision/20172908110428/provision_AutoUpdate/status.log'
  provision                                  [ FAILED ]
  provision                                    FAIL: 
  provision                                  [ FAILED ]
  provision                                    FAIL: 
  provision                                  [ FAILED ]
  provision                                    FAIL: 
  provision                                  [ FAILED ]
  provision                                    ABORT: [Errno 28] No space left on device: '/usr/local/autotest/results/hosts/chromeos4-row4-rack10-host21/4179009-provision/20172908111054/provision_AutoUpdate/debug/provision_AutoUpdate.DEBUG'
  provision                                  [ FAILED ]
  provision                                    ABORT: [Errno 28] No space left on device: '/usr/local/autotest/results/hosts/chromeos4-row4-rack11-host13/4179010-provision/20172908111054/provision_AutoUpdate/debug/provision_AutoUpdate.DEBUG'
  provision                                  [ FAILED ]
  provision                                    ABORT: [Errno 28] No space left on device: '/usr/local/autotest/results/hosts/chromeos4-row4-rack12-host18/4179011-provision/20172908111054/provision_AutoUpdate/debug/provision_AutoUpdate.DEBUG'
  provision                                  [ FAILED ]
  provision                                    ABORT: [Errno 28] No space left on device: '/usr/local/autotest/results/hosts/chromeos4-row4-rack10-host20/4179014-provision/20172908111054/provision_AutoUpdate/debug/provision_AutoUpdate.DEBUG'
  provision                                  [ FAILED ]
  provision                                    ABORT: [Errno 28] No space left on device: '/usr/local/autotest/results/hosts/chromeos4-row4-rack11-host6/4179015-provision/20172908111054/provision_AutoUpdate/debug/provision_AutoUpdate.DEBUG'
  provision                                  [ FAILED ]
...
 
Components: -Infra Infra>Client>ChromeOS
Labels: -Pri-1 Pri-0
I did a spot check, and I believe the problem is not the devserver
but rather the shard, chromeos-server14.mtv.

More to the point, on that server, there's this:
    $ for o in -m -i
    > do
    > df $o /
    > done
    Filesystem     1M-blocks    Used Available Use% Mounted on
    /dev/xvda1       2132843 1459897    564581  73% /
    Filesystem        Inodes     IUsed IFree IUse% Mounted on
    /dev/xvda1     138690560 138690560     0  100% /

IOW, we're out of inodes.
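
For anyone digging on a shard: a quick way to see where the inodes are going (sketch; the results path is taken from the ABORT messages above, and plain `find | wc -l` avoids depending on GNU `du --inodes`):

```shell
# Inodes are consumed per filesystem object, so counting entries per
# subdirectory shows which trees are eating them.
RESULTS=/usr/local/autotest/results   # path from the failure logs above
for d in "$RESULTS"/*/; do
  printf '%8d %s\n' "$(find "$d" | wc -l)" "$d"
done | sort -rn | head
```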

P0, because it's causing CQ failures, and it'll continue to do so.

Somewhat related bug, lots of files in /tmp: https://bugs.chromium.org/p/chromium/issues/detail?id=759875


However, the number in /tmp is trivial compared to the total, I believe /usr/local/autotest/results is eating up most of it
I've deleted some old results to free up some breathing space.

It looks like disk usage started rising midday on 2017-08-24.

http://shortn/_OyvsTIt2ec

gs_offloader does appear to be running.
Pcon query http://shortn/_qpXsoLBhE5

chromeos-server100 looks like it will die soon, too.

https://viceroy.corp.google.com/chromeos/machines?hostname=chromeos-server100&duration=8d&refresh=-1#_VG_fQiMacdj

Viceroy for chromeos-server100 suggests that this started at around the same time.

The metrics are cut off due to https://bugs.chromium.org/p/chromium/issues/detail?id=757494


http://shortn/_LYeEBA91LI

Looks like the affected machines are all gs_offloading normally.  There is a drop in uploaded jobs on the 20th, four days before the inode growth.  Offloading metrics come back on the 24th, at the same time the inode growth started.

I'm not sure if that's meaningful yet.
The push to prod on the 24th is suspicious.

autotest:
git log --oneline 00cd7dc6a..6b45b8c0d
5272c9ecc [autotest] Use new suite_args in CTS/GTS
0f714f84a autotest: add metric to record 'all devservers in subnet X are down'.
83636918b [autotest] Delete unused DelayedCallTask
ac39cc336 [autotest] Protect ACTS result files from result throttling.
523638859 autotest: remove devserver server-side package from site-packages.
754403d11 autotest: Improve error message when choking on unmanaged pool
f583ceaa5 [autotest] Make Container.set_hostname reliable.
0c432d6a8 autotest: Ignore failures in cleanup during crash collection


chromite:
git log --oneline 0f965afc..95f30c12
95f30c12 Revert "parallel_emerge: Work around Portage library bug with usepkg"
6dc75de9 cloud_trace: Record span metrics
30e9cd90 cloud_trace: Env variable for context
893f101b Update experimental builders at beginning of UpdateSlaveStatus
dd9052b1 cidb: add CIDB connection for Google App Engine
2fa1ce2a Update config settings by config-updater.
6c86c300 Update config settings by config-updater.
ace2aa28 cbuildbot: Add EbuildLogs step before RunParallelSteps
f56c0aba PreCQLauncher: cbuildbot --remote -> cros tryjob.
313d5c5d cros tryjob: Add --timeout and --sanity-check-build options.
cb26e28f Run git reset --hard in case of git error.
3c993467 cros flash: increase visibility of the downloaded image version reporting
80bba8ad cbuildbot: move parsed buildbot configs into options object earlier
8fbc96cc cbuildbot: Add step to archive ebuild logs
97dc2a5e cros tryjob: Confirm unknown build configs.
d4ee9add log changed CLs descriptive to log and build page
84ce4ef4 [unibuild] Add reef-uni to the pre-cq test battery
478cf117 chromeos_config: finish switching VMTest to betty.
(80 log files x 5 {chrome,vmlog,ui,power_manager,update_engine}
 + 40 log files x 49 shutdown.*)
x 4 {sysinfo,crashinfo.*,provision_FirmwareUpdate/sysinfo,provision_FirmwareUpdate/sysinfo/reboot_current}/var
= a hell of a lot of files for a single test (just shy of 10k)
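Spelling out the arithmetic above, just to double-check the "shy of 10k" claim:

```shell
# (80 log files x 5 log types + 40 files x 49 shutdown.* snapshots),
# all multiplied by the 4 sysinfo/crashinfo trees listed above.
echo $(( (80 * 5 + 40 * 49) * 4 ))   # prints 9440
```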
It looks like /var/log isn't being cleaned up between tests, so each test on a DUT uploads more and more logs.
Cc: dchan@chromium.org
Summary: provision_FirmwareUpdate is devouring shard inodes (was: devserver for veyron_speedy out of disk space)
Further investigation points strongly to some sort of change in
the amount of provision_FirmwareUpdate results that we're offloading.

Several shards are at risk of running out of inodes, and the problem
is likely to grow over time.

We've been attending to chromeos-server97 as one of the shards at
particular risk:  On that shard, a large amount of the problem with
contents not offloading was traced to five DUTs:
    chromeos1-row1-rack10-host4
    chromeos4-row11-rack11-host11
    chromeos4-row11-rack11-host15
    chromeos4-row12-rack11-host11
    chromeos4-row12-rack11-host13

These five DUTs have been locked, to stop the bleeding on that shard.

Cc: alemate@chromium.org kitching@chromium.org yusukes@chromium.org
In order to stop the bleeding globally, we've locked all DUTs that
run FAFT.  We'll let them sit overnight, and see if the shards start
to recover (or at least, don't get worse).

Assuming that pans out, we'll figure out a better plan in the morning
(Pacific coast time).

If it doesn't pan out, the non-Pacific time sheriffs will have to do
something, or else endure the failures.

While checking today's provision failure drop caused by enabling throttling, I found all of the DUTs (like chromeos1-row1-rack3-host3, chromeos1-row2-rack3-host6, chromeos1-row2-rack4-host4) locked due to this bug.

Is it a coincidence that all of this afternoon's 'frequently failed provision' DUTs are ones that run FAFT?
Cc: waihong@chromium.org
Here is a small dashboard showing the severity of the problem: https://pcon.corp.google.com/p#phobbs/Gs%20Offloader

The individual graphs:

  Shard free inodes: http://shortn/_kzCOHhztZL
  Percent of inodes consumed per 24h: http://shortn/_WZhRVQ5cz7
  GS Offloader queue length: http://shortn/_3ckoZsWwUt
Cc: dshi@chromium.org
> It's a coincidence that all 'frequently failed provision' DUTs in this afternoon are those who run FAFT?

Per c#11, I locked all the DUTs in FAFT pools.  The criterion
for locking was "in a FAFT pool", not "frequently failed provision."

Whatever caused provisioning to fail when we enabled throttling
should be unrelated to this bug, since this bug started at around
12:23 on 8/24.
Summary: test results are devouring shard inodes (was: provision_FirmwareUpdate is devouring shard inodes)
I've been doing more digging on chromeos-server14.mtv, looking at
the accumulated un-offloaded results.  Only about 20% of the directories
include provision_FirmwareUpdate, and looking at the sizes, they cover
a range; not consistently high or low.

Also, looking at the graphs from c#14, most servers haven't leveled off,
which is what we were hoping for.

The upshot of it all is that the evidence for "provision_FirmwareUpdate
is causing the problem" is weak.  Something has gotten bigger, and
provision_FirmwareUpdate is part of the problem, but it's not the whole
problem, and may be as much a victim as a cause.

Updating the summary to reflect it.

Going back to first principles:  This problem started at
very close to 12:23 on 8/24; the graphs are unambiguous.
This time corresponds to a push to prod (see c#6).
That means that this problem, essentially, was caused by
one of the changes in the push.

Below is the full blamelist:
    6b45b8c0d Make DUT root filesystem read-writable if not
    5272c9ecc [autotest] Use new suite_args in CTS/GTS
    069aac925 Check for unique (for each test step) crash files
    0f714f84a autotest: add metric to record 'all devservers in subnet X are down'.
    7f3c6fae7 drone_manager: Don't reimplement "min(xs, key=f)"
    e14162206 drone_manager: Fix drone/active_processes metric
    4ea9564e7 subcommand: Remove useless "lambda_function"
    83636918b [autotest] Delete unused DelayedCallTask
    fcb03452c Enable suite for Bluetooth LDAC tests
    ac39cc336 [autotest] Protect ACTS result files from result throttling.
    523638859 autotest: remove devserver server-side package from site-packages.
    754403d11 autotest: Improve error message when choking on unmanaged pool
    de0246d93 security_SandboxedServices: Add ARC services.
    c4474c25e [moblab] Add new suite to support the USB camera qualification.
    4097374a0 platform_FilePerms: Add mount points for host-verified art files
    cb234fda9 cr50_stress_experimental: add firmware_Cr50SetBoardId
    7751a2a9e cr50_stress_experimental: add firmware_Cr50BID
    b78ffc73a Add control file for NoSim and  LockedSim tests.
    f583ceaa5 [autotest] Make Container.set_hostname reliable.
    b079090fc platform_FilePerms: Add network namespace rules
    b5ba22569 cr50_test: change debug filename
    d94992cf1 network_WiFi_RxFrag: whirlwind APs don't support fragmentation
    0c432d6a8 autotest: Ignore failures in cleanup during crash collection
    353837109 cr50_stress_experimental: add firmware_Cr50Update.erase_nvmem
    0e8a13a91 firmware_Cr50BID: use original for universal or bid image
    6795d4d05 firmware_Cr50BID: change original to universal
    33b5f04bc Let VirtualFilesystemImage restore loop device ownership and permissions.
    092adc0b4 Enable valid_job_urls_only=True
    898bd550e migrate to /run and /run/lock

Returning to the status quo ante of 8/24 isn't really an option, but
we can start reverting suspicious CLs.  The following have descriptions
suggesting a relationship to results gathering, so they should get first
look:
    069aac925 Check for unique (for each test step) crash files
    ac39cc336 [autotest] Protect ACTS result files from result throttling.
    0c432d6a8 autotest: Ignore failures in cleanup during crash collection

Cc: kmshelton@chromium.org
Project Member Comment 20 by bugdroid1@chromium.org, Aug 30
The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/e0bfe4f708a33698e50798bdf5cd628c57af3cf5

commit e0bfe4f708a33698e50798bdf5cd628c57af3cf5
Author: Dan Shi <dshi@google.com>
Date: Wed Aug 30 17:20:31 2017

This was exacerbated (mainly caused) by this: https://chrome-internal-review.googlesource.com/c/chromeos/chromeos-admin/+/434777/

dshi reverted it and is doing a forced Puppet run; this should fix the problem entirely.
Issue 760635 has been merged into this issue.
Cc: ihf@chromium.org
Labels: -Pri-0 Pri-1
Shards have been recovering for a few hours now.
Cc: cywang@chromium.org
As an example, there are still 17,000+ jobs not yet uploaded on chromeos-server104. The average consumption rate is around 2+ seconds per job, so the uploading should hopefully all be finished around noon MTV time.
> "we've locked all DUTs that run FAFT"
Can hosts (mainly chromeos1-*) be unlocked now?

Cc: sontis@chromium.org pgangishetty@chromium.org
; atest host mod --unlock $(atest host list --locked | awk '/crbug.com.760254/ {print $1}')
Unlocked hosts: 
	chromeos1-row1-rack1-host6, chromeos1-row1-rack10-host2,
	chromeos1-row1-rack10-host4, chromeos1-row1-rack10-host6,
	chromeos1-row1-rack11-host3, chromeos1-row1-rack11-host4,
	chromeos1-row1-rack2-host3, chromeos1-row1-rack2-host5,
	chromeos1-row1-rack3-host2, chromeos1-row1-rack3-host3,
	chromeos1-row1-rack3-host4, chromeos1-row1-rack3-host6,
	chromeos1-row1-rack4-host2, chromeos1-row1-rack4-host3,
	chromeos1-row1-rack4-host4, chromeos1-row1-rack4-host5,
	chromeos1-row1-rack9-host6, chromeos1-row2-rack11-host4,
	chromeos1-row2-rack3-host5, chromeos1-row2-rack3-host6,
	chromeos1-row2-rack4-host2, chromeos1-row2-rack4-host4,
	chromeos1-row2-rack4-host5, chromeos1-row2-rack4-host6,
	chromeos1-row2-rack5-host6, chromeos2-row7-rack10-host13,
	chromeos2-row8-rack10-host11, chromeos4-row10-rack6-host13,
	chromeos4-row10-rack7-host11, chromeos4-row10-rack7-host13,
	chromeos4-row11-rack10-host11, chromeos4-row11-rack11-host11,
	chromeos4-row11-rack11-host15, chromeos4-row12-rack10-host11,
	chromeos4-row12-rack11-host11, chromeos4-row12-rack11-host13,
	chromeos4-row12-rack11-host15, chromeos4-row12-rack2-host11,
	chromeos4-row12-rack2-host13, chromeos4-row12-rack3-host11,
	chromeos4-row12-rack4-host10, chromeos4-row12-rack4-host12,
	chromeos4-row12-rack4-host13, chromeos4-row12-rack4-host14,
	chromeos4-row2-rack11-host11, chromeos4-row2-rack11-host18,
	chromeos4-row2-rack4-host11, chromeos4-row2-rack4-host13,
	chromeos4-row3-rack10-host10, chromeos4-row3-rack10-host11,
	chromeos4-row3-rack10-host12, chromeos4-row3-rack10-host14,
	chromeos4-row3-rack10-host15, chromeos4-row3-rack10-host16,
	chromeos4-row3-rack4-host11, chromeos4-row3-rack4-host13,
	chromeos4-row3-rack4-host15, chromeos4-row3-rack5-host10,
	chromeos4-row3-rack5-host12, chromeos4-row3-rack5-host13,
	chromeos4-row3-rack6-host13, chromeos4-row3-rack6-host16,
	chromeos4-row3-rack7-host12, chromeos4-row3-rack9-host10,
	chromeos4-row4-rack10-host14, chromeos4-row4-rack11-host11,
	chromeos4-row4-rack11-host14, chromeos4-row4-rack12-host17,
	chromeos4-row4-rack13-host11, chromeos4-row4-rack13-host13,
	chromeos4-row4-rack4-host19, chromeos4-row4-rack5-host11,
	chromeos4-row4-rack5-host12, chromeos4-row4-rack5-host17,
	chromeos4-row4-rack6-host10, chromeos4-row4-rack6-host13,
	chromeos4-row4-rack6-host22, chromeos4-row4-rack7-host11,
	chromeos4-row4-rack7-host12, chromeos4-row4-rack7-host15,
	chromeos4-row4-rack7-host16, chromeos4-row5-rack5-host15,
	chromeos4-row5-rack6-host13, chromeos4-row5-rack7-host11,
	chromeos4-row5-rack7-host13, chromeos4-row6-rack1-host11,
	chromeos4-row6-rack1-host13, chromeos4-row6-rack11-host12,
	chromeos4-row6-rack11-host13, chromeos4-row6-rack2-host11,
	chromeos4-row6-rack4-host13, chromeos4-row6-rack6-host11,
	chromeos4-row6-rack6-host13, chromeos4-row8-rack3-host6,
	chromeos4-row8-rack4-host13, chromeos4-row8-rack5-host13,
	chromeos4-row9-rack9-host11, chromeos4-row9-rack9-host14,
	chromeos4-row9-rack9-host4, chromeos6-row1-rack5-host1
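
To double-check that nothing is still locked against this bug, the same atest/awk pattern from the unlock one-liner can be reused (empty output means everything was unlocked):

```shell
# Hosts whose lock reason still references this bug; should print nothing
# after the unlock above.
atest host list --locked | awk '/crbug.com.760254/ {print $1}'
```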

Blocking: 759780