New issue
Advanced search Search tips

Issue 757861 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Aug 2017
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocking:
issue 757846



Sign in to add a comment

autoserv: crash collection crash when removing destination file

Project Member Reported by pprabhu@chromium.org, Aug 22 2017

Issue description

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/136377018-chromeos-test/chromeos6-row1-rack6-host5/debug/

This hides real errors, and in this case I suspect left the logs with some weird indentation that parsing then failed to parse --> the job got aborted.


The first log is an "informational" stacktrace, followed by the real one:

08/22 03:53:54.716 ERROR|            runner:0121| Non-critical failure: Failed to create directory summary for /dev/pstore.
Traceback (most recent call last):
  File "/usr/local/autotest/client/bin/result_tools/runner.py", line 114, in run_on_client
    timeout=_BUILD_DIR_SUMMARY_TIMEOUT)
  File "/usr/local/autotest/server/hosts/ssh_host.py", line 316, in run
    return self.run_very_slowly(*args, **kwargs)
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 482, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/server/hosts/ssh_host.py", line 307, in run_very_slowly
    ssh_failure_retry_ok)
  File "/usr/local/autotest/server/hosts/ssh_host.py", line 258, in _run
    raise error.AutoservRunError("command execution error", result)
AutoservRunError: command execution error
* Command: 
    /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_k_Osf4ssh-
    master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
    -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
    ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22
    chromeos6-row1-rack6-host5 "export LIBC_FATAL_STDERR_=1; if type
    \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\"
    \"server[stack::run_on_client|run|wrapper] ->
    ssh_run(/usr/local/autotest/result_tools/utils.py -p /dev/pstore -m
    20000)\";fi; /usr/local/autotest/result_tools/utils.py -p /dev/pstore -m
    20000"
Exit status: 1
Duration: 0.229590892792

stdout:
2017-08-22 03:53:53,673 Running result_tools/utils on path: /dev/pstore
2017-08-22 03:53:53,673 Throttle result size to : 19 MB
stderr:
Traceback (most recent call last):
  File "/usr/local/autotest/result_tools/utils.py", line 424, in <module>
    main()
  File "/usr/local/autotest/result_tools/utils.py", line 420, in main
    execute(options.path, options.max_size_KB)
  File "/usr/local/autotest/result_tools/utils.py", line 373, in execute
    (free_space, len(summary_json)))
utils_lib.NotEnoughDiskError: Not enough disk space after saving the summary file. Available free disk: 0 bytes. Summary file size: 63 bytes.
08/22 03:53:54.717 DEBUG|      abstract_ssh:0363| get_file. source: /dev/pstore, dest: /usr/local/autotest/results/136377018-chromeos-test/chromeos6-row1-rack6-host5/crashinfo.chromeos6-row1-rack6-host5/pstore, delete_dest: False,preserve_perm: False, preserve_symlinks:False
08/22 03:53:54.717 DEBUG|      abstract_ssh:0375| Using Rsync.
08/22 03:53:54.717 DEBUG|             utils:0212| Running 'rsync -L  --timeout=1800 --rsh='/usr/bin/ssh -a -x   -o ControlPath=/tmp/_autotmp_k_Osf4ssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22' -az --no-o --no-g  root@chromeos6-row1-rack6-host5:"/dev/pstore" "/usr/local/autotest/results/136377018-chromeos-test/chromeos6-row1-rack6-host5/crashinfo.chromeos6-row1-rack6-host5/pstore"'
08/22 03:53:55.264 DEBUG|          ssh_host:0296| Running (ssh) 'rm -rf /dev/pstore' from 'wrapper|get_crashinfo|collect_log_file|run|wrapper|run_very_slowly'
08/22 03:53:55.450 ERROR|             utils:0280| [stderr] rm: cannot remove '/dev/pstore': Device or resource busy
08/22 03:53:55.453 DEBUG|     ssh_multiplex:0118| Nuking ssh master_job
08/22 03:53:56.457 DEBUG|     ssh_multiplex:0123| Cleaning ssh master_tempdir
08/22 03:53:56.458 ERROR|         traceback:0013| Traceback (most recent call last):
08/22 03:53:56.458 ERROR|         traceback:0013|   File "/usr/local/autotest/server/autoserv", line 530, in run_autoserv
08/22 03:53:56.459 ERROR|         traceback:0013|     use_packaging=(not no_use_packaging))
08/22 03:53:56.459 ERROR|         traceback:0013|   File "/usr/local/autotest/server/server_job.py", line 891, in run
08/22 03:53:56.459 ERROR|         traceback:0013|     self._collect_crashes(namespace, collect_crashinfo)
08/22 03:53:56.459 ERROR|         traceback:0013|   File "/usr/local/autotest/server/server_job.py", line 759, in _collect_crashes
08/22 03:53:56.460 ERROR|         traceback:0013|     self._execute_code(crash_control_file, namespace)
08/22 03:53:56.460 ERROR|         traceback:0013|   File "/usr/local/autotest/server/server_job.py", line 1357, in _execute_code
08/22 03:53:56.460 ERROR|         traceback:0013|     execfile(code_file, namespace, namespace)
08/22 03:53:56.461 ERROR|         traceback:0013|   File "/usr/local/autotest/server/control_segments/crashinfo", line 18, in <module>
08/22 03:53:56.461 ERROR|         traceback:0013|     job.parallel_simple(crashinfo, machines, log=False)
08/22 03:53:56.461 ERROR|         traceback:0013|   File "/usr/local/autotest/server/server_job.py", line 666, in parallel_simple
08/22 03:53:56.462 ERROR|         traceback:0013|     log=log, timeout=timeout, return_results=return_results)
08/22 03:53:56.462 ERROR|         traceback:0013|   File "/usr/local/autotest/server/subcommand.py", line 98, in parallel_simple
08/22 03:53:56.462 ERROR|         traceback:0013|     function(arg)
08/22 03:53:56.462 ERROR|         traceback:0013|   File "/usr/local/autotest/server/control_segments/crashinfo", line 12, in crashinfo
08/22 03:53:56.463 ERROR|         traceback:0013|     crashcollect.get_crashinfo(host, test_start_time)
08/22 03:53:56.463 ERROR|         traceback:0013|   File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 482, in wrapper
08/22 03:53:56.463 ERROR|         traceback:0013|     return fn(*args, **kwargs)
08/22 03:53:56.464 ERROR|         traceback:0013|   File "/usr/local/autotest/server/crashcollect.py", line 220, in get_crashinfo
08/22 03:53:56.464 ERROR|         traceback:0013|     clean=True)
08/22 03:53:56.464 ERROR|         traceback:0013|   File "/usr/local/autotest/server/crashcollect.py", line 122, in collect_log_file
08/22 03:53:56.464 ERROR|         traceback:0013|     host.run('rm -rf %s' % (pipes.quote(log_path),))
08/22 03:53:56.465 ERROR|         traceback:0013|   File "/usr/local/autotest/server/hosts/ssh_host.py", line 316, in run
08/22 03:53:56.465 ERROR|         traceback:0013|     return self.run_very_slowly(*args, **kwargs)
08/22 03:53:56.465 ERROR|         traceback:0013|   File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 482, in wrapper
08/22 03:53:56.466 ERROR|         traceback:0013|     return fn(*args, **kwargs)
08/22 03:53:56.466 ERROR|         traceback:0013|   File "/usr/local/autotest/server/hosts/ssh_host.py", line 307, in run_very_slowly
08/22 03:53:56.466 ERROR|         traceback:0013|     ssh_failure_retry_ok)
08/22 03:53:56.467 ERROR|         traceback:0013|   File "/usr/local/autotest/server/hosts/ssh_host.py", line 258, in _run
08/22 03:53:56.467 ERROR|         traceback:0013|     raise error.AutoservRunError("command execution error", result)
08/22 03:53:56.468 ERROR|         traceback:0013| AutoservRunError: command execution error
08/22 03:53:56.468 ERROR|         traceback:0013| * Command: 
08/22 03:53:56.468 ERROR|         traceback:0013|     /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_k_Osf4ssh-
08/22 03:53:56.468 ERROR|         traceback:0013|     master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
08/22 03:53:56.469 ERROR|         traceback:0013|     -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
08/22 03:53:56.469 ERROR|         traceback:0013|     ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22
08/22 03:53:56.469 ERROR|         traceback:0013|     chromeos6-row1-rack6-host5 "export LIBC_FATAL_STDERR_=1; if type
08/22 03:53:56.469 ERROR|         traceback:0013|     \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\"
08/22 03:53:56.470 ERROR|         traceback:0013|     \"server[stack::collect_log_file|run|wrapper] -> ssh_run(rm -rf
08/22 03:53:56.470 ERROR|         traceback:0013|     /dev/pstore)\";fi; rm -rf /dev/pstore"
08/22 03:53:56.470 ERROR|         traceback:0013| Exit status: 1
08/22 03:53:56.470 ERROR|         traceback:0013| Duration: 0.136911869049


The quickest fix is to ignore AutotestHostErrrors here: https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/server/crashcollect.py?type=cs&q=LOG_PSTORE_DIRS&sq=package:chromeos+file:src/third_party/autotest/files&l=129
 
Blocking: 757846
Owner: pprabhu@chromium.org
Status: Started (was: Untriaged)
https://chromium-review.googlesource.com/626949
Project Member

Comment 3 by bugdroid1@chromium.org, Aug 23 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/0c432d6a895a8b8bd4dcd3a4887da229dcd0ea4b

commit 0c432d6a895a8b8bd4dcd3a4887da229dcd0ea4b
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Wed Aug 23 02:05:10 2017

autotest: Ignore failures in cleanup during crash collection

crash collection is best effort and errors in cleanup should not fail a
test. We already ignore failures, but then try to cleanup some more
without ignoring errors. FixIt.

BUG= chromium:757861 
TEST=None

Change-Id: I922d6e6d761fbc7d577f7cc542e3a97e7ad1e2bf
Reviewed-on: https://chromium-review.googlesource.com/626949
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Allen Li <ayatane@chromium.org>

[modify] https://crrev.com/0c432d6a895a8b8bd4dcd3a4887da229dcd0ea4b/server/crashcollect.py

Status: Fixed (was: Started)

Sign in to add a comment