
Issue 659306

Starred by 2 users

Issue metadata

Status: Untriaged
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Feature




autotest: reboot machine when /tmp becomes full

Project Member Reported by gwendal@chromium.org, Oct 25 2016

Issue description

A disk qual failed because /tmp became full but the machine was not rebooted.

chrome-os-partner:48483
gs://chromeos-moblab-wistron/results/00:50:b6:59:4b:ff/a32f17fa-83b4-11e6-9efc-0050b6594bff/199-moblab

10/09 19:04:23.544 WARNI|      abstract_ssh:0443| trying scp, rsync failed: Command <rsync -L  --timeout=1800 --rsh='/usr/bin/ssh -a -x    -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o Serve>
* Command:.
    rsync -L  --timeout=1800 --rsh='/usr/bin/ssh -a -x    -o
    StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes
    -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3
    -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22' -az --no-o --no-g
    "/tmp/tmplcnVmZ" "root@192.168.231.110:"/tmp/sysinfo/autoserv-
    r2GqUc/global_config.ini""
Exit status: 11
Duration: 0.145009994507

stderr:
rsync: write failed on "/tmp/sysinfo/autoserv-r2GqUc/global_config.ini": No space left on device (28)
rsync error: error in file IO (code 11) at receiver.c(393) [receiver=3.1.1]
10/09 19:04:23.548 DEBUG|      abstract_ssh:0446| Trying scp.
...


As part of the cleanup process, when a test fails because scp to /tmp on the DUT fails, should we try to reboot the DUT?
 

Comment 1 by sbasi@chromium.org, Oct 26 2016

Cc: jean@chromium.org ntang@chromium.org stephenlin@chromium.org
You should run this past the P-eng team as they drive MobLab now. Rebooting would kill any other parallel running tests.

My suggestion is that MobLab not use /tmp, as there's more space elsewhere on the device. So update this test/caller to be smarter.

An alternative idea is to ensure /tmp has more space, maybe by bind-mounting the larger drive over /tmp at startup or something...
I believe the issue here is /tmp filled up _on the DUT_.
And, IIUC, that means the fix is to add a verifier that checks
"is there space in /tmp?".  When it fails, it should trigger
reboot as a repair action.

gwendal@ - can you confirm?

I note also that we already have a need for a common verifier that
checks for various "out of space" conditions; see bug 596131.
If we're going to address this problem with a verifier, we should
make sure to fix that bug at the same time.
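
For illustration only, here is a rough sketch of what a shared "out of space" check with reboot as the repair action might look like. It is not autotest's actual verifier/repair API: the names low_space_paths and verify_or_repair, the threshold values, and the reboot callback are all hypothetical. It also checks the local filesystem via shutil.disk_usage; a real verifier would have to run the check on the DUT (e.g. over ssh), which is not shown here.

```python
import shutil


def low_space_paths(min_free_bytes_by_path, disk_usage=shutil.disk_usage):
    """Return the paths whose free space is below the required minimum.

    min_free_bytes_by_path maps a path (e.g. "/tmp") to the minimum number
    of free bytes we want there. Both the mapping and the thresholds are
    assumptions made for this sketch.
    """
    failing = []
    for path, min_free in sorted(min_free_bytes_by_path.items()):
        if disk_usage(path).free < min_free:
            failing.append(path)
    return failing


def verify_or_repair(min_free_bytes_by_path, reboot,
                     disk_usage=shutil.disk_usage):
    """Run the space check; call reboot() once if any path is too full.

    reboot is a hypothetical callback standing in for whatever repair
    action the framework provides. Returns the list of failing paths.
    """
    failing = low_space_paths(min_free_bytes_by_path, disk_usage)
    if failing:
        reboot()
    return failing
```

Taking a mapping of paths to thresholds, rather than hard-coding /tmp, is meant to let the same check cover the other out-of-space conditions mentioned above. Whether rebooting is acceptable when other tests are running in parallel is a separate question, raised in comment 1.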
