Repair fails - /usr/bin/python: bad interpreter: No such file or directory
Issue description

In chromeos15-row13a-rack3-host10 I am getting the errors below:
https://ubercautotest.corp.google.com/afe/#tab_id=view_host&object_id=8747
https://paste.googleplex.com/5868757989720064

AutoservRunError: command execution error
* Command:
/usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_bSiFP7ssh-master/socket
-o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
-o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900
-o ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 22
chromeos15-row13a-rack3-host10 "export LIBC_FATAL_STDERR_=1; if type
\"logger\" > /dev/null 2>&1; then logger -tag \"autotest\"
\"server[stack::collect_logs|run_on_client|run] ->
ssh_run(/usr/local/autotest/result_tools/utils.py -p /var/log -m 20000)\";fi;
/usr/local/autotest/result_tools/utils.py -p /var/log -m 20000"
Exit status: 126
Duration: 0.804832935333
stderr:
bash: /usr/local/autotest/result_tools/utils.py: /usr/bin/python: bad interpreter: No such file or directory
10/03 08:17:21.030 ERROR| utils:0287| [stderr] bash: /usr/local/autotest/result_tools/utils.py: /usr/bin/python: bad interpreter: No such file or directory
10/03 08:17:21.032 ERROR| runner:0121| Non-critical failure: Failed to cleanup directory summary for /var/log.
Traceback (most recent call last):
  File "/usr/local/autotest/client/bin/result_tools/runner.py", line 97, in run_on_client
    timeout=_CLEANUP_DIR_SUMMARY_TIMEOUT)
  File "/usr/local/autotest/server/hosts/ssh_host.py", line 335, in run
    return self.run_very_slowly(*args, **kwargs)
  File "/usr/local/autotest/server/hosts/ssh_host.py", line 324, in run_very_slowly
    ssh_failure_retry_ok)
  File "/usr/local/autotest/server/hosts/ssh_host.py", line 268, in _run
    raise error.AutoservRunError("command execution error", result)
AutoservRunError: command execution error
* Command:
/usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_bSiFP7ssh-master/socket
-o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
-o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900
-o ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 22
chromeos15-row13a-rack3-host10 "export LIBC_FATAL_STDERR_=1; if type
\"logger\" > /dev/null 2>&1; then logger -tag \"autotest\"
\"server[stack::collect_logs|run_on_client|run] ->
ssh_run(/usr/local/autotest/result_tools/utils.py -p /var/log -d)\";fi;
/usr/local/autotest/result_tools/utils.py -p /var/log -d"
Exit status: 126
Duration: 0.701218128204

-----------------------------

In chromeos15-row13a-rack2-host7 and chromeos15-row13b-rack3-host8 I am seeing the error below, similar to the earlier bug:

RootFSUpdateError: Failed to install device image using payload at http://100.90.15.229:8082/update/eve-release/R70-11021.34.0 on chromeos15-row13b-rack5-host2.
: command execution error
* Command:
/usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_RctnGgssh-master/socket
-o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
-o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900
-o ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 22
chromeos15-row13b-rack5-host2 "export LIBC_FATAL_STDERR_=1; if type
\"logger\" > /dev/null 2>&1; then logger -tag \"autotest\"
\"server[stack::_run|_base_update_handler_no_retry|run] ->
ssh_run(/usr/bin/update_engine_client --update
--omaha_url=http://100.90.15.229:8082/update/eve-release/R70-11021.34.0)\";fi;
/usr/bin/update_engine_client --update
--omaha_url=http://100.90.15.229:8082/update/eve-release/R70-11021.34.0"
Exit status: 1
Duration: 16.4377510548
stderr:
[1002/220026:INFO:update_engine_client.cc(486)] Forcing an update by setting app_version to ForcedUpdate.
[1002/220026:INFO:update_engine_client.cc(488)] Initiating update check and install.
[1002/220026:INFO:update_engine_client.cc(517)] Waiting for update to complete.
[1002/220040:ERROR:update_engine_client.cc(232)] Update failed, current operation is UPDATE_STATUS_IDLE, last error code is ErrorCode::kOmahaResponseInvalid(34)
10/02 22:02:13.825 ERROR| control:0074| Provision failed due to Exception.
Traceback (most recent call last):
  File "/usr/local/autotest/results/hosts/chromeos15-row13b-rack5-host2/2483684-provision/20180210202133/control.srv", line 53, in provision_machine
    provision.Provision)
  File "/usr/local/autotest/server/cros/provision.py", line 400, in run_special_task_actions
    task.run_task_actions(job, host, labels)
  File "/usr/local/autotest/server/cros/provision.py", line 173, in run_task_actions
    raise SpecialTaskActionException()
SpecialTaskActionException
10/02 22:02:13.826 ERROR| server_job:0825| Exception escaped control file, job aborting:
Traceback (most recent call last):
  File "/usr/local/autotest/server/server_job.py", line 817, in run
    self._execute_code(server_control_file, namespace)
  File "/usr/local/autotest/server/server_job.py", line 1340, in _execute_code
    execfile(code_file, namespace, namespace)
  File "/usr/local/autotest/results/hosts/chromeos15-row13b-rack5-host2/2483684-provision/20180210202133/control.srv", line 108, in <module>
    job.parallel_simple(provision_machine, machines)
  File "/usr/local/autotest/server/server_job.py", line 619, in parallel_simple
    log=log, timeout=timeout, return_results=return_results)
  File "/usr/local/autotest/server/subcommand.py", line 98, in parallel_simple
    function(arg)
  File "/usr/local/autotest/results/hosts/chromeos15-row13b-rack5-host2/2483684-provision/20180210202133/control.srv", line 99, in provision_machine
    raise Exception('')
https://paste.googleplex.com/4555855005483008

-----

Seems like they are pointing to the devservers which were provisioned in https://bugs.chromium.org/p/chromium/issues/detail?id=889557. Can you please have a look?
Oct 3
Matt, Sridhar, please add details on the failures observed and how many boards are affected.
Oct 3
chromeos15-row13b-rack2-host7
chromeos15-row13b-rack3-host3
are failing with the errors below:
bash: /usr/local/autotest/result_tools/utils.py: /usr/bin/python: bad interpreter: No such file or directory
10/03 08:17:21.030 ERROR| utils:0287| [stderr] bash: /usr/local/autotest/result_tools/utils.py: /usr/bin/python: bad interpreter: No such file or directory
10/03 08:17:21.032 ERROR| runner:0121| Non-critical failure: Failed to cleanup directory summary for /var/log.
Traceback (most recent call last):
File "/usr/local/autotest/client/bin/result_tools/runner.py", line 97, in run_on_client
timeout=_CLEANUP_DIR_SUMMARY_TIMEOUT)
File "/usr/local/autotest/server/hosts/ssh_host.py", line 335, in run
return self.run_very_slowly(*args, **kwargs)
File "/usr/local/autotest/server/hosts/ssh_host.py", line 324, in run_very_slowly
ssh_failure_retry_ok)
File "/usr/local/autotest/server/hosts/ssh_host.py", line 268, in _run
raise error.AutoservRunError("command execution error", result)
AutoservRunError: command execution error
* Command:
/usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_bSiFP7ssh-master/socket
-o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
-o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 22
chromeos15-row13a-rack3-host10 "export LIBC_FATAL_STDERR_=1; if type \"logger\" >
/dev/null 2>&1; then logger -tag \"autotest\"
\"server[stack::collect_logs|run_on_client|run] ->
ssh_run(/usr/local/autotest/result_tools/utils.py -p /var/log -d)\";fi;
/usr/local/autotest/result_tools/utils.py -p /var/log -d"
Exit status: 126
Duration: 0.701218128204
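Exit status 126 with "bad interpreter" means the kernel could not exec the interpreter named in the script's shebang line, i.e. /usr/bin/python is missing on the DUT. A minimal local sketch of the failure mode (the throwaway script below is a stand-in, not the real utils.py):

```shell
# Create a throwaway script whose shebang points at a missing interpreter.
tmp=$(mktemp)
printf '#!/usr/bin/no-such-python\nprint("hello")\n' > "$tmp"
chmod +x "$tmp"
# Executing it makes bash report "bad interpreter" and exit 126, as in the log.
bash -c "$tmp" 2>&1
status=$?
echo "status=$status"
rm -f "$tmp"
```

On the DUT itself, checking `ls -l /usr/bin/python*` against the first line of /usr/local/autotest/result_tools/utils.py would confirm which side is wrong.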
Oct 3
chromeos15-row13a-rack2-host12
chromeos15-row13a-rack3-host10
RootFSUpdateError: Failed to install device image using payload at http://100.90.15.229:8082/update/squawks-release/R70-11021.34.0 on chromeos15-row13a-rack3-host10. : command execution error
* Command:
/usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_yare44ssh-master/socket
-o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
-o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 22
chromeos15-row13a-rack3-host10 "export LIBC_FATAL_STDERR_=1; if type \"logger\" >
/dev/null 2>&1; then logger -tag \"autotest\"
\"server[stack::_run|_base_update_handler_no_retry|run] ->
ssh_run(/usr/bin/update_engine_client --update
--omaha_url=http://100.90.15.229:8082/update/squawks-release/R70-11021.34.0)\";fi;
/usr/bin/update_engine_client --update
--omaha_url=http://100.90.15.229:8082/update/squawks-release/R70-11021.34.0"
Exit status: 1
Duration: 0.951797008514
Lots of repairs are failing with the errors above; I think all of them are pointing to the new devservers provisioned yesterday.
Oct 3
There is issue 891764, where deploying the Chrome OS image fails, so it might be related: possibly the provision job fails first, then a repair job is initiated, which also fails because of the bad state of the image. No?
Oct 3
Can we have more details on the scope of this issue? 29 hosts in my pools are currently in a repairing or repair-failed state:

$ atest host list chromeos15-row13* | grep -c Repair
19
$ atest host list chromeos15-audiobox* | grep -c Repair
10

40 boards are in such a state across the whole chromeos15 lab. Can somebody get to the bottom of this and find out whether the new devserver(s) have anything to do with it, and what is actually happening?
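The counting pattern above can be illustrated with canned data (the host names and states below are made up; real output requires `atest` and lab access):

```shell
# Stand-in for `atest host list chromeos15-row13*` output.
hosts='chromeos15-row13a-rack2-host7   Repairing
chromeos15-row13a-rack3-host10  Repair Failed
chromeos15-row13b-rack5-host2   Ready'
# grep -c counts matching lines; "Repairing" and "Repair Failed" both
# contain "Repair", while "Ready" does not.
count=$(printf '%s\n' "$hosts" | grep -c Repair)
echo "hosts in repair states: $count"
```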
Oct 3
Per this same log, a powerwash was performed:
10/03 08:20:42.204 ERROR| repair:0507| Repair failed: Powerwash and then re-install the stable build via AU
Why would a repair job do a powerwash?
Oct 3
I will look at the devserver logs to see if we can figure something out.
Oct 3
> Why would repair job do powerwash?

Some bad builds can have a bug that leaves behind file system corruption. The fix for such bad builds is "power wash to scrub the bad file system data, and install a new build to prevent the bug from re-corrupting the DUT." That's what happens with "repair.powerwash".
Oct 4
I logged on to chromeos15-infra-devserver15; though it was just provisioned yesterday, it runs a very *old* version of devserver.

chromeos-test@chromeos15-infra-devserver15:~/chromiumos/src/platform/dev$ git log -1
commit b066b06cbdfa25fa153353e0e1587ddbe382b35b
Author: Dan Shi <dshi@google.com>
Date: Fri May 26 14:31:13 2017 -0700
Force Launch Control API to return enough results for artifact lookup.
Oct 4
I synced the version of devserver on chromeos15-infra-devserver{15,16,17,18,20} to:
chromeos-test@chromeos15-infra-devserver20:~/chromiumos/src/platform/dev$ git log -1
commit d69ceef729eaa0510c3bf34f2eab2612eb4cdde9
Author: Nicolas Boichat <drinkcat@chromium.org>
Date: Fri Sep 14 14:28:49 2018 -0700
dut-console: Escape sequence is <enter>~., not ~.<enter>
This is the version all other devservers run. The devserver processes were also restarted.
But on chromeos15-infra-devserver20, I got a 'read-only file system' error when syncing the repo. Working on it.
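Version skew like this could be caught by comparing each devserver checkout's HEAD against the expected commit. A sketch of that check (the throwaway repo below stands in for ~/chromiumos/src/platform/dev, which only exists on the devservers):

```shell
# Stand-in repo with a single commit, in place of the real devserver checkout.
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" -c user.name=tester -c user.email=tester@example.com \
    commit -q --allow-empty -m 'stand-in for the dut-console fix commit'
# On a real devserver, `expected` would be d69ceef729eaa0510c3bf34f2eab2612eb4cdde9.
expected=$(git -C "$repo" rev-parse HEAD)
deployed=$(git -C "$repo" rev-parse HEAD)
if [ "$deployed" = "$expected" ]; then
    echo "devserver checkout is up to date"
else
    echo "stale checkout: $deployed"
fi
rm -rf "$repo"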
Oct 4
> But for chromeos15-infra-devserver20, I got error of 'readonly file systems' when sync the repo. Working on it.

s/devserver20/devserver19/
Oct 4
Fixed the FS issue on chromeos15-infra-devserver19. Can you please check and let me know? Is the FS issue present on chromeos15-infra-devserver20 as well?
Oct 5
The devserver on chromeos15-infra-devserver19 has been added back. The issue on all other devservers should also have been fixed. Feel free to reopen if you don't think so.
Comment 1 by mjayapal@chromium.org, Oct 3