New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 691854 link

Starred by 2 users

Issue metadata

Status: Archived
Owner:
Closed: Jul 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug



Sign in to add a comment

site_sysinfo breaking backward compatibility causes autoserv crashes

Project Member Reported by shuqianz@chromium.org, Feb 14 2017

Issue description

Example build: https://uberchromegw.corp.google.com/i/chromeos/builders/cyan-paladin/builds/1660

The failed HWTest stage for this build:
http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=101129882

If you click any of the failed test in this suite and checked the debug log, you will find:

Running (ssh) 'true'
02/13 17:13:47.959 DEBUG|      abstract_ssh:0587| Host chromeos4-row6-rack9-host12 is now up
02/13 17:13:47.960 DEBUG|      abstract_ssh:0346| get_file. source: /usr/local/autotest/results/default/, dest: /usr/local/autotest/results/101132185-chromeos-test/chromeos4-row6-rack9-host12, delete_dest: False,preserve_perm: True, preserve_symlinks:True
02/13 17:13:47.960 DEBUG|      abstract_ssh:0357| Using Rsync.
02/13 17:13:47.960 DEBUG|        base_utils:0185| Running 'rsync -l  --timeout=1800 --rsh='/usr/bin/ssh -a -x   -o ControlPath=/tmp/_autotmp_1eRKtissh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22' -az --no-o --no-g root@chromeos4-row6-rack9-host12:"/usr/local/autotest/results/default/" "/usr/local/autotest/results/101132185-chromeos-test/chromeos4-row6-rack9-host12"'
02/13 17:13:48.401 DEBUG|        server_job:1370| Client state file /usr/local/autotest/results/101132185-chromeos-test/chromeos4-row6-rack9-host12/control.autoserv.state not found
02/13 17:13:48.402 DEBUG|          base_job:0392| Persistent state client.* deleted
02/13 17:13:48.403 ERROR|        server_job:0808| Exception escaped control file, job aborting:
Traceback (most recent call last):
  File "/usr/local/autotest/server/server_job.py", line 800, in run
    self._execute_code(server_control_file, namespace)
  File "/usr/local/autotest/server/server_job.py", line 1300, in _execute_code
    execfile(code_file, namespace, namespace)
  File "/usr/local/autotest/results/101132185-chromeos-test/chromeos4-row6-rack9-host12/control.srv", line 10, in <module>
    job.parallel_simple(run_client, machines)
  File "/usr/local/autotest/server/server_job.py", line 624, in parallel_simple
    return_results=return_results)
  File "/usr/local/autotest/server/subcommand.py", line 93, in parallel_simple
    function(arg)
  File "/usr/local/autotest/results/101132185-chromeos-test/chromeos4-row6-rack9-host12/control.srv", line 7, in run_client
    at.run(control, host=host, use_packaging=use_packaging)
  File "/usr/local/autotest/server/autotest.py", line 375, in run
    client_disconnect_timeout, use_packaging=use_packaging)
  File "/usr/local/autotest/server/autotest.py", line 458, in _do_run
    client_disconnect_timeout=client_disconnect_timeout)
  File "/usr/local/autotest/server/autotest.py", line 892, in execute_control
    logger, client_disconnect_timeout)
  File "/usr/local/autotest/server/autotest.py", line 837, in execute_section
    raise err
AutotestRunError: client job was aborted
02/13 17:13:48.404 INFO |        server_job:0183| INFO	----	----	timestamp=1487034828	job_abort_reason=client job was aborted	localtime=Feb 13 17:13:48	client job was aborted
02/13 17:13:48.407 ERROR|   logging_manager:0626| {'builds': "{'cros-version': 'cyan-paladin/R58-9281.0.0-rc3'}", 'job_started': 1487034802, 'hostname': 'chromeos4-row6-rack9-host12', 'status_version': 1, 'label': 'cyan-paladin/R58-9281.0.0-rc3/arc-bvt-cq/cheets_ContainerSmokeTest', 'parent_job_id': 101129882, 'drone': 'chromeos-server51.cbf.corp.google.com', 'user': 'chromeos-test', 'suite': 'arc-bvt-cq', 'job_queued': 1487034391, 'retry_original_job_id': 101130036, 'experimental': 'False', 'build': 'cyan-paladin/R58-9281.0.0-rc3'}
02/13 17:13:48.408 ERROR|   logging_manager:0626| MACHINE NAME: chromeos4-row6-rack9-host12
02/13 17:13:48.408 ERROR|   logging_manager:0626| MACHINE GROUP: cyan
02/13 17:13:48.408 ERROR|   logging_manager:0626| parsing partial test ---- SERVER_JOB
02/13 17:13:48.408 ERROR|   logging_manager:0626| 
02/13 17:13:48.408 ERROR|   logging_manager:0626| STATUS: INFO	----	----	kernel=3.18.0-13908-gcc004d0	localtime=Feb 13 17:13:30	timestamp=1487034810
02/13 17:13:48.409 ERROR|   logging_manager:0626| 
02/13 17:13:48.409 ERROR|   logging_manager:0626| STATUS: END ABORT	----	----	timestamp=1487034827	localtime=Feb 13 17:13:47	Autotest client terminated unexpectedly: DUT is pingable, could not determine if an un-expected reboot occured during the test.
02/13 17:13:48.409 ERROR|   logging_manager:0626| unexpected extra indentation, ignoring
02/13 17:13:48.409 ERROR|   logging_manager:0626| 
02/13 17:13:48.409 ERROR|   logging_manager:0626| STATUS: INFO	----	----	timestamp=1487034828	job_abort_reason=client job was aborted	localtime=Feb 13 17:13:48	client job was aborted

Similar failures occur on other boards
 
Summary: Several paladin HWTest failed due to 'AutotestRunError: client job was aborted' (was: Several paladin HWTest failed due to 'control.autoserv.state not found')
Also found from the log:
...
'job_queued': 1487032857, 'experimental': 'False', 'build': 'daisy_skate-paladin/R58-9281.0.0-rc3'}
02/13 17:00:31.525 ERROR|   logging_manager:0626| MACHINE NAME: chromeos4-row9-rack7-host13
02/13 17:00:31.526 ERROR|   logging_manager:0626| MACHINE GROUP: daisy_skate
02/13 17:00:31.526 ERROR|   logging_manager:0626| parsing partial test ---- SERVER_JOB
02/13 17:00:31.526 ERROR|   logging_manager:0626| 
02/13 17:00:31.527 ERROR|   logging_manager:0626| STATUS: INFO	----	----	kernel=3.8.11	localtime=Feb 13 17:00:12	timestamp=1487034012
02/13 17:00:31.527 ERROR|   logging_manager:0626| 
02/13 17:00:31.528 ERROR|   logging_manager:0626| STATUS: END ABORT	----	----	timestamp=1487034030	localtime=Feb 13 17:00:30	Autotest client terminated unexpectedly: DUT is pingable, could not determine if an un-expected reboot occured during the test.
02/13 17:00:31.528 ERROR|   logging_manager:0626| unexpected extra indentation, ignoring
02/13 17:00:31.528 ERROR|   logging_manager:0626| 
02/13 17:00:31.529 ERROR|   logging_manager:0626| STATUS: INFO	----	----	timestamp=1487034031	job_abort_reason=client job was aborted	localtime=Feb 13 17:00:31	client job was aborted
02/13 17:00:31.529 ERROR|   logging_manager:0626| parsing test ---- SERVER_JOB
Labels: -Pri-0 Pri-2
Owner: ayatane@chromium.org
Probably caused by bad CL: https://chromium-review.googlesource.com/c/437654/
02/13 17:14:00.122 INFO |               job:0200| Writing results to /usr/local/autotest/results/default
02/13 17:14:00.125 CRITI|               job:1212| JOB ERROR (autotest bug?): No module named site_sysinfo
Traceback (most recent call last):
  File "/usr/local/autotest/bin/job.py", line 1182, in runjob
    myjob = job(control=control, drop_caches=drop_caches, options=options)
  File "/usr/local/autotest/bin/job.py", line 1230, in __init__
    base_client_job.__init__(self, *args, **kwargs)
  File "/usr/local/autotest/bin/job.py", line 130, in __init__
    self._pre_record_init(control, options)
  File "/usr/local/autotest/bin/job.py", line 207, in _pre_record_init
    self._load_state()
  File "/usr/local/autotest/bin/job.py", line 869, in _load_state
    self._state.set_backing_file(self._state_file)
  File "/usr/local/autotest/common_lib/base_job.py", line 313, in set_backing_file
    self._synchronize_backing_file()
  File "/usr/local/autotest/common_lib/base_job.py", line 137, in wrapped_method
    return method(self, *args, **dargs)
  File "/usr/local/autotest/common_lib/base_job.py", line 158, in wrapped_method
    self._read_from_backing_file()
  File "/usr/local/autotest/common_lib/base_job.py", line 281, in _read_from_backing_file
    self.read_from_file(self._backing_file, merge=merge_backing_file)
  File "/usr/local/autotest/common_lib/base_job.py", line 228, in read_from_file
    on_disk_state = pickle.load(open(file_path))
ImportError: No module named site_sysinfo

tl;dr the pickle gets created by a version of Autotest that still has site_sysinfo, and then the client (DUT) tries to unpack it using the build of Autotest that doesn't have site_sysinfo any more.

Comment 7 by dshi@chromium.org, Feb 16 2017

right, the client side of Autotest is "versioned". We need some backwards compatibility.
No, nm, this above was different root cause I think.

Components: Infra>Client>ChromeOS
Labels: -Pri-2 Pri-3
Labels: cq-annot
Summary: site_sysinfo breaking backward compatibility causes autoserv crashes (was: Several paladin HWTest failed due to 'AutotestRunError: client job was aborted')
Status: Fixed (was: Started)

Comment 16 by dchan@chromium.org, Jan 22 2018

Status: Archived (was: Fixed)

Sign in to add a comment