
Issue 863601

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Nov 16
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




Banon device failing multiple tests with the "[Errno 28] No space left on device" error

Project Member Reported by jmuppala@chromium.org, Jul 13

Issue description

Logs:
https://stainless.corp.google.com/search?view=list&first_date=2018-06-16&last_date=2018-07-01&suite=wifi_matfunc%7Cwifi_release&board=banon&status=FAIL&status=ERROR&status=ABORT&exclude_cts=false&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=false



Sample failure:
Command: 
    rsync -L  --timeout=1800 --rsh='/usr/bin/ssh -a -x  -o Protocol=2 -o
    StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes
    -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3
    -o ConnectionAttempts=4 -l root -p 22' -az --no-o --no-g  "/tmp/tmpYFoSRj"
    "root@chromeos15-row1-rack6-host4:"/tmp/sysinfo/autoserv-
    yot9hR/global_config.ini""
Exit status: 11
Duration: 0.367069005966

stderr:
rsync: write failed on "/tmp/sysinfo/autoserv-yot9hR/global_config.ini": No space left on device (28)
rsync error: error in file IO (code 11) at receiver.c(393) [receiver=3.1.2]
07/01 18:29:45.584 DEBUG|      abstract_ssh:0549| Trying scp.
07/01 18:29:45.585 DEBUG|             utils:0218| Running 'scp -rq  -o StrictHostKeyChecking=no -o UserKnownHostsFile=/tmp/tmpExxBTl -P 22 "/tmp/tmpYFoSRj" 'root@chromeos15-row1-rack6-host4:"/tmp/sysinfo/autoserv-yot9hR/global_config.ini"''
07/01 18:29:45.993 DEBUG|          ssh_host:0301| Running (ssh) 'nohup /tmp/sysinfo/autoserv-yot9hR/bin/autotestd /tmp/autoserv-LzqWfv -H autoserv --verbose --hostname=chromeos15-row1-rack6-host4 --user=chromeos-test /tmp/sysinfo/autoserv-yot9hR/control.autoserv >/dev/null 2>/dev/null &' from '_do_run|execute_control|execute_section|_execute_daemon|run|run_very_slowly'
07/01 18:29:46.446 DEBUG|          ssh_host:0301| Running (ssh) '/tmp/sysinfo/autoserv-yot9hR/bin/autotestd_monitor /tmp/autoserv-LzqWfv 0 0' from '_do_run|execute_control|execute_section|_execute_daemon|run|run_very_slowly'
07/01 18:29:47.153 DEBUG|          autotest:1281| Traceback (most recent call last):
07/01 18:29:47.191 INFO |          autotest:1340| Traceback (most recent call last):
07/01 18:29:47.200 DEBUG|          autotest:1281|   File "/tmp/sysinfo/autoserv-yot9hR/bin/autotestd_monitor", line 12, in <module>
07/01 18:29:47.200 INFO |          autotest:1340|   File "/tmp/sysinfo/autoserv-yot9hR/bin/autotestd_monitor", line 12, in <module>
07/01 18:29:47.200 DEBUG|          autotest:1281|     print >> stderr, 'Entered autotestd_monitor.'
07/01 18:29:47.200 INFO |          autotest:1340|     print >> stderr, 'Entered autotestd_monitor.'
07/01 18:29:47.200 DEBUG|          autotest:1281| IOError: [Errno 28] No space left on device
07/01 18:29:47.201 INFO |          autotest:1340| IOError: [Errno 28] No space left on device
07/01 18:29:47.202 DEBUG|          autotest:0956| Result exit status is 1.
07/01 18:29:47.203 DEBUG|             utils:0218| Running 'ping chromeos15-row1-rack6-host4 -w1 -c1'
07/01 18:29:47.218 DEBUG|             utils:0286| [stdout] PING chromeos15-row1-rack6-host4.cros.corp.google.com (100.115.124.153) 56(84) bytes of data.
07/01 18:29:47.219 DEBUG|             utils:0286| [stdout] 64 bytes from 100.115.124.153: icmp_seq=1 ttl=55 time=3.47 ms
07/01 18:29:47.219 DEBUG|             utils:0286| [stdout] 
07/01 18:29:47.219 DEBUG|             utils:0286| [stdout] --- chromeos15-row1-rack6-host4.cros.corp.google.com ping statistics ---
07/01 18:29:47.219 DEBUG|             utils:0286| [stdout] 1 packets transmitted, 1 received, 0% packet loss, time 0ms
07/01 18:29:47.219 DEBUG|             utils:0286| [stdout] rtt min/avg/max/mdev = 3.473/3.473/3.473/0.000 ms
07/01 18:29:47.219 INFO |        server_job:0216| END ABORT	----	----	timestamp=1530494987	localtime=Jul 01 18:29:47	Autotest client terminated unexpectedly: DUT is pingable, could not determine if an un-expected reboot occured during the test.
07/01 18:29:47.220 DEBUG|          autotest:1108| Autotest job finishes running. Below is the post-processing operations.
07/01 18:29:47.229 DEBUG|          ssh_host:0301| Running (ssh) 'true' from 'collect_client_job_results|wait_up|is_up|ssh_ping|run|run_very_slowly'
07/01 18:29:47.640 DEBUG|      abstract_ssh:0670| Host chromeos15-row1-rack6-host4 is now up
07/01 18:29:47.641 DEBUG|            runner:0089| result tools are already deployed to chromeos15-row1-rack6-host4.
07/01 18:29:47.641 DEBUG|            runner:0100| Getting directory summary for /tmp/sysinfo/autoserv-yot9hR/results/default
07/01 18:29:47.649 DEBUG|          ssh_host:0301| Running (ssh) '/usr/local/autotest/result_tools/utils.py -p /tmp/sysinfo/autoserv-yot9hR/results/default -m 20000' from '_do_run|execute_control|collect_client_job_results|run_on_client|run|run_very_slowly'
07/01 18:29:48.052 DEBUG|             utils:0286| [stdout] 2018-07-01 18:29:47,986 Running result_tools/utils on path: /tmp/sysinfo/autoserv-yot9hR/results/default
07/01 18:29:48.053 DEBUG|             utils:0286| [stdout] 2018-07-01 18:29:47,986 Throttle result size to : 19 MB
07/01 18:29:48.098 ERROR|             utils:0286| [stderr] Traceback (most recent call last):
07/01 18:29:48.098 ERROR|             utils:0286| [stderr]   File "/usr/local/autotest/result_tools/utils.py", line 428, in <module>
07/01 18:29:48.098 ERROR|             utils:0286| [stderr]     main()
07/01 18:29:48.098 ERROR|             utils:0286| [stderr]   File "/usr/local/autotest/result_tools/utils.py", line 424, in main
07/01 18:29:48.098 ERROR|             utils:0286| [stderr]     execute(options.path, options.max_size_KB)
07/01 18:29:48.099 ERROR|             utils:0286| [stderr]   File "/usr/local/autotest/result_tools/utils.py", line 377, in execute
07/01 18:29:48.099 ERROR|             utils:0286| [stderr]     (free_space, len(summary_json)))
07/01 18:29:48.099 ERROR|             utils:0286| [stderr] utils_lib.NotEnoughDiskError: Not enough disk space after saving the summary file. Available free disk: 0 bytes. Summary file size: 1201 bytes.
07/01 18:29:48.100 ERROR|            runner:0121| Non-critical failure: Failed to create directory summary for /tmp/sysinfo/autoserv-yot9hR/results/default.
Traceback (most recent call last):
  File "/usr/local/autotest/client/bin/result_tools/runner.py", line 114, in run_on_client
    timeout=_BUILD_DIR_SUMMARY_TIMEOUT)
  File "/usr/local/autotest/server/hosts/ssh_host.py", line 323, in run
    return self.run_very_slowly(*args, **kwargs)
  File "/usr/local/autotest/server/hosts/ssh_host.py", line 312, in run_very_slowly
    ssh_failure_retry_ok)
  File "/usr/local/autotest/server/hosts/ssh_host.py", line 262, in _run
    raise error.AutoservRunError("command execution error", result)
AutoservRunError: command execution error
* Command: 
    /usr/bin/ssh -a -x   -o Protocol=2 -o StrictHostKeyChecking=no -o
    UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o
    ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4
    -l root -p 22 chromeos15-row1-rack6-host4 "export LIBC_FATAL_STDERR_=1; if
    type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\"
    \"server[stack::collect_client_job_results|run_on_client|run] ->
    ssh_run(/usr/local/autotest/result_tools/utils.py -p /tmp/sysinfo
    /autoserv-yot9hR/results/default -m 20000)\";fi;
    /usr/local/autotest/result_tools/utils.py -p /tmp/sysinfo/autoserv-
    yot9hR/results/default -m 20000"
Exit status: 1
Duration: 0.437424898148

stdout:
2018-07-01 18:29:47,986 Running result_tools/utils on path: /tmp/sysinfo/autoserv-yot9hR/results/default
2018-07-01 18:29:47,986 Throttle result size to : 19 MB
stderr:
Traceback (most recent call last):
  File "/usr/local/autotest/result_tools/utils.py", line 428, in <module>
    main()
  File "/usr/local/autotest/result_tools/utils.py", line 424, in main
    execute(options.path, options.max_size_KB)
  File "/usr/local/autotest/result_tools/utils.py", line 377, in execute
    (free_space, len(summary_json)))
utils_lib.NotEnoughDiskError: Not enough disk space after saving the summary file. Available free disk: 0 bytes. Summary file size: 1201 bytes.
07/01 18:29:48.101 DEBUG|      abstract_ssh:0413| get_file. source: /tmp/sysinfo/autoserv-yot9hR/results/default/, dest: /usr/local/autotest/results/213361755-chromeos-test, delete_dest: False,preserve_perm: True, preserve_symlinks:True
07/01 18:29:48.102 DEBUG|      abstract_ssh:0425| Using Rsync.
07/01 18:29:48.102 DEBUG|             utils:0218| Running 'rsync -l  --timeout=1800 --rsh='/usr/bin/ssh -a -x  -o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 22' -az --no-o --no-g  root@chromeos15-row1-rack6-host4:"/tmp/sysinfo/autoserv-yot9hR/results/default/" "/usr/local/autotest/results/213361755-chromeos-test"'
07/01 18:29:48.494 DEBUG|        server_job:1370| Client state file /usr/local/autotest/results/213361755-chromeos-test/control.autoserv.state not found
07/01 18:29:48.811 DEBUG|          base_job:0399| Persistent state client.* deleted
07/01 18:29:48.845 DEBUG|          autotest:1122| Autotest job finishes.
07/01 18:29:48.845 ERROR|               log:0027| post-test iteration server sysinfo error:
07/01 18:29:48.846 ERROR|         traceback:0013| Traceback (most recent call last):
07/01 18:29:48.846 ERROR|         traceback:0013|   File "/usr/local/autotest/client/common_lib/log.py", line 25, in decorated_func
07/01 18:29:48.846 ERROR|         traceback:0013|     fn(*args, **dargs)
07/01 18:29:48.847 ERROR|         traceback:0013|   File "/usr/local/autotest/server/test.py", line 76, in wrapper
07/01 18:29:48.847 ERROR|         traceback:0013|     func(self, mytest, host, at, outputdir)
07/01 18:29:48.847 ERROR|         traceback:0013|   File "/usr/local/autotest/server/test.py", line 214, in after_iteration_hook
07/01 18:29:48.848 ERROR|         traceback:0013|     results_dir=self.job.resultdir)
07/01 18:29:48.848 ERROR|         traceback:0013|   File "/usr/local/autotest/server/autotest.py", line 479, in run
07/01 18:29:48.848 ERROR|         traceback:0013|     client_disconnect_timeout, use_packaging=use_packaging)
07/01 18:29:48.848 ERROR|         traceback:0013|   File "/usr/local/autotest/server/autotest.py", line 562, in _do_run
07/01 18:29:48.849 ERROR|         traceback:0013|     client_disconnect_timeout=client_disconnect_timeout)
07/01 18:29:48.849 ERROR|         traceback:0013|   File "/usr/local/autotest/server/autotest.py", line 1054, in execute_control
07/01 18:29:48.849 ERROR|         traceback:0013|     logger, client_disconnect_timeout)
07/01 18:29:48.849 ERROR|         traceback:0013|   File "/usr/local/autotest/server/autotest.py", line 999, in execute_section
07/01 18:29:48.850 ERROR|         traceback:0013|     raise err
07/01 18:29:48.850 ERROR|         traceback:0013| AutotestRunError: client job was aborted
07/01 18:29:48.851 DEBUG|              test:0420| after_iteration_hooks completed
07/01 18:29:48.851 INFO |       wifi_client:1318| ======= WiFi autotest complete. Cleaning up... =======



 
Cc: akhouderchah@chromium.org briannorris@chromium.org grundler@chromium.org kirtika@chromium.org
It looks like a couple of other devices are running into this issue as well. It also seems to affect only devices in the wificell pools.

https://stainless.corp.google.com/search?view=matrix&row=hostname&col=build&suite=%5Ewifi%5C_matfunc%24&reason=No+space+left+on+device&exclude_cts=true&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=false&days=15


chromeos15-row1-rack6-host4 - banon
chromeos15-row1-rack6-host2 - wizpig
chromeos15-row1-rack10-host2 - edgar



Adding other folks to help understand what may be causing these devices to run out of space.
Labels: -Pri-3 Pri-2
Cc: -akhouderchah@chromium.org
Owner: akhouderchah@chromium.org
Status: Assigned (was: Untriaged)
I experienced this yesterday. In the short term, running `rm -rf /tmp/sysinfo/autoserv*` on the DUT should get the tests running again.

It seems like for server tests, autotest is being unconditionally installed on the test machine(s) (see autotest_lib.server.test._install), with each autoserv-* folder being roughly 30M in size. It makes sense to first compare timestamps between any existing autotest installations on the test machine and the timestamp of the autotest installation on the machine calling the test before proceeding with installation. At the very least, existing autotest installations should be removed before sending over another one. I would be willing to take this one if no one is opposed.

What I'm wondering is why this functionality has only now started causing out of space errors. Anyone have some insight into that? 
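
To check how much space the leftover installations actually take before deleting anything, something like the following could be run on the DUT. This is only a sketch in Python 3: the function names `dir_size_bytes` and `report_leftovers` are made up for illustration and are not part of autotest, and the default path and pattern are simply the ones mentioned above.

```python
import glob
import os
import shutil


def dir_size_bytes(path):
    """Sum the sizes of regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def report_leftovers(parent="/tmp/sysinfo", pattern="autoserv-*"):
    """Return ({leftover_dir: size_in_bytes}, free_bytes_on_parent_fs).

    Hypothetical helper: the defaults are taken from the paths seen in
    this bug and may differ on other setups.
    """
    leftovers = sorted(glob.glob(os.path.join(parent, pattern)))
    sizes = {d: dir_size_bytes(d) for d in leftovers if os.path.isdir(d)}
    free = shutil.disk_usage(parent).free if os.path.isdir(parent) else None
    return sizes, free


if __name__ == "__main__":
    sizes, free = report_leftovers()
    for path, size in sizes.items():
        print("%10d bytes  %s" % (size, path))
    print("free on filesystem: %s bytes" % free)
```

If the summed sizes are close to the capacity of the /tmp filesystem, that would support the theory that stale autoserv-* folders are what is filling it up.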
Project Member

Comment 4 by bugdroid1@chromium.org, Jul 17

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/0c31f63aa8845799054bce592e2ea5219449942e

commit 0c31f63aa8845799054bce592e2ea5219449942e
Author: Alex Khouderchah <akhouderchah@chromium.org>
Date: Tue Jul 17 19:11:27 2018

autotest: Perform server test shutdown even when receiving SIGINT

While server tests, when run normally, will clean up their temporary
installation and some large output directories, the same is not true
when a user uses ctrl-c to force-close the test.

This change modifies server tests to catch signals like SIGTERM and
SIGBREAK, and to run the cleanup as expected.

BUG= chromium:863601 
TEST=Ran server tests to completion, ran a modified server test with
     a syntax error, and interrupted running server tests with SIGINT.
     The first two cases remain unchanged, as the test framework
     performed cleanup as expected. Ensured that the third test case was
     actually triggering a cleanup.

Change-Id: Ida0334ebdf964fa4ee0ae730a2598686e8909c96
Reviewed-on: https://chromium-review.googlesource.com/1138649
Commit-Ready: Alex Khouderchah <akhouderchah@chromium.org>
Tested-by: Alex Khouderchah <akhouderchah@chromium.org>
Reviewed-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/0c31f63aa8845799054bce592e2ea5219449942e/server/test.py
[modify] https://crrev.com/0c31f63aa8845799054bce592e2ea5219449942e/client/common_lib/utils.py

Status: Fixed (was: Assigned)
As an update, the unconditional installation of autotest in server tests only proved to be an issue when the test was sent a signal like SIGINT, since the test cleans up its files when it ends normally or with an exception. This might explain why we weren't seeing "No space left on device" errors before: they would only occur after multiple test runs had been forced to stop without cleaning up.

That being said, this change will not fix an existing out-of-space condition; it will only prevent new ones. Depending on how the DUT clears its /tmp directory, it still might be necessary to run `rm -rf /tmp/autoserv-* /tmp/sysinfo/autoserv-*`.
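
For illustration, the general idea of running cleanup even when a signal arrives can be sketched as below. This is not the code of the change itself; `TerminatedBySignal` and `run_with_cleanup` are hypothetical names, and the sketch assumes Python 3 on a POSIX system with the handlers installed from the main thread.

```python
import os
import signal


class TerminatedBySignal(Exception):
    """Raised in the main thread when one of the handled signals arrives."""


def _raise_on_signal(signum, frame):
    # Turn the signal into an ordinary exception so that try/finally
    # blocks further up the stack get a chance to run.
    raise TerminatedBySignal("received signal %d" % signum)


def run_with_cleanup(work, cleanup, signums=(signal.SIGINT, signal.SIGTERM)):
    """Run work() and always call cleanup(), even if a listed signal arrives.

    Hypothetical helper, not an autotest API.
    """
    previous = {s: signal.signal(s, _raise_on_signal) for s in signums}
    try:
        return work()
    finally:
        try:
            cleanup()
        finally:
            # Restore the handlers that were active before this call.
            for signum, handler in previous.items():
                if handler is not None:
                    signal.signal(signum, handler)
```

One limitation of this sketch is that a second signal arriving while `cleanup()` is running would interrupt the cleanup, so a real implementation would likely need to block or defer signals during that phase.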
Project Member

Comment 6 by bugdroid1@chromium.org, Jul 23

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/b7f9680f44f32e59c8d42d4b5e93ad7f022722b2

commit b7f9680f44f32e59c8d42d4b5e93ad7f022722b2
Author: Allen Li <ayatane@chromium.org>
Date: Mon Jul 23 20:14:56 2018

Revert "autotest: Perform server test shutdown even when receiving SIGINT"

This reverts commit 0c31f63aa8845799054bce592e2ea5219449942e.

Reason for revert: Suspect for causing autoserv leak, crbug.com/866543

Original change's description:
> autotest: Perform server test shutdown even when receiving SIGINT
> 
> While server tests, when run normally, will clean up their temporary
> installation and some large output directories, the same is not true
> when a user uses ctrl-c to force-close the test.
> 
> This change modifies server tests to catch signals like SIGTERM and
> SIGBREAK, and to run the cleanup as expected.
> 
> BUG= chromium:863601 
> TEST=Ran server tests to completion, ran a modified server test with
>      a syntax error, and interrupted running server tests with SIGINT.
>      The first two cases remain unchanged, as the test framework
>      performed cleanup as expected. Ensured that the third test case was
>      actually triggering a cleanup.
> 
> Change-Id: Ida0334ebdf964fa4ee0ae730a2598686e8909c96
> Reviewed-on: https://chromium-review.googlesource.com/1138649
> Commit-Ready: Alex Khouderchah <akhouderchah@chromium.org>
> Tested-by: Alex Khouderchah <akhouderchah@chromium.org>
> Reviewed-by: Xixuan Wu <xixuan@chromium.org>

Bug:  chromium:863601 
Change-Id: Ie9c80954de622294ecd1d311e6a75d8ff3f6d597
Reviewed-on: https://chromium-review.googlesource.com/1147321
Reviewed-by: Allen Li <ayatane@chromium.org>
Commit-Queue: Allen Li <ayatane@chromium.org>
Tested-by: Allen Li <ayatane@chromium.org>

[modify] https://crrev.com/b7f9680f44f32e59c8d42d4b5e93ad7f022722b2/server/test.py
[modify] https://crrev.com/b7f9680f44f32e59c8d42d4b5e93ad7f022722b2/client/common_lib/utils.py

Status: Assigned (was: Fixed)
Project Member

Comment 8 by bugdroid1@chromium.org, Aug 10

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/c44e777d9c6ee299d66ab1a9f6d9e6b3a392bd30

commit c44e777d9c6ee299d66ab1a9f6d9e6b3a392bd30
Author: Alex Khouderchah <akhouderchah@chromium.org>
Date: Fri Aug 10 05:04:21 2018

autotest: Remove existing autoserv dirs on startup

While server tests, when run normally, will clean up their temporary
installation and some large output directories, the same is not true
when a user uses ctrl-c to force-close the test.

This change modifies server tests to remove existing /tmp/autoserv-* and
/tmp/sysinfo/autoserv-* directories before creating new ones, such that
existing left-over directories will not cause a DUT's /tmp filesystem to
run out of space.

BUG= chromium:863601 
TEST=Ran server tests to completion, ran a modified server test with
     a syntax error, and interrupted running server tests with SIGINT.
     The first two cases remain unchanged, as the test framework
     performed cleanup as expected. Ensured that the third test case was
     actually triggering a cleanup.
TEST=Used get_tmp_dir to create multiple temp dirs, including a set of
     nested temporary directories. Then alternated between printing
     self.tmp_dirs and calling delete_all_tmp_dirs with various parent
     directories to ensure the expected behavior was occuring both on
     the host and with regards to the contents of self.tmp_dirs.

Change-Id: I82a1619d4c8976547792f3cac84b6ed41148b484
Reviewed-on: https://chromium-review.googlesource.com/1147500
Commit-Ready: Alex Khouderchah <akhouderchah@chromium.org>
Tested-by: Alex Khouderchah <akhouderchah@chromium.org>
Reviewed-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/c44e777d9c6ee299d66ab1a9f6d9e6b3a392bd30/server/test.py
[modify] https://crrev.com/c44e777d9c6ee299d66ab1a9f6d9e6b3a392bd30/server/hosts/remote.py
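
The approach described in this commit message (remove stale directories first, then create a new one) can be sketched locally as follows. This is an illustrative Python 3 sketch only, not the actual change to server/test.py or server/hosts/remote.py: `fresh_tmp_dir` is a hypothetical name, and it operates on the local filesystem rather than on a remote host. It also assumes that no other test is using the same parent directory at the same time, since a concurrent run's directory would be deleted too.

```python
import glob
import os
import shutil
import tempfile


def fresh_tmp_dir(parent, prefix="autoserv-"):
    """Delete leftover <prefix>* directories in parent, then create a new one.

    Hypothetical helper; assumes exclusive use of `parent`.
    """
    os.makedirs(parent, exist_ok=True)
    for stale in glob.glob(os.path.join(parent, prefix + "*")):
        # Only remove real directories; leave files and symlinks alone.
        if os.path.isdir(stale) and not os.path.islink(stale):
            shutil.rmtree(stale, ignore_errors=True)
    return tempfile.mkdtemp(prefix=prefix, dir=parent)
```

With this ordering, a directory left behind by an interrupted run costs space only until the next run starts, so leftovers cannot accumulate across runs.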

Status: Fixed (was: Assigned)
Sounds like comment #8 means it's Fixed?
