Issue 891765

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Oct 5
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocked on:
issue 889557




Repair fails - /usr/bin/python: bad interpreter: No such file or directory

Project Member Reported by mjayapal@chromium.org, Oct 3

Issue description

On chromeos15-row13a-rack3-host10 I am getting the errors below, which can be seen at:

https://ubercautotest.corp.google.com/afe/#tab_id=view_host&object_id=8747

https://paste.googleplex.com/5868757989720064

AutoservRunError: command execution error
* Command: 
    /usr/bin/ssh -a -x  -o ControlPath=/tmp/_autotmp_bSiFP7ssh-master/socket
    -o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
    -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
    ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 22 chromeos15
    -row13a-rack3-host10 "export LIBC_FATAL_STDERR_=1; if type \"logger\" >
    /dev/null 2>&1; then logger -tag \"autotest\"
    \"server[stack::collect_logs|run_on_client|run] ->
    ssh_run(/usr/local/autotest/result_tools/utils.py -p /var/log -m
    20000)\";fi; /usr/local/autotest/result_tools/utils.py -p /var/log -m
    20000"
Exit status: 126
Duration: 0.804832935333

stderr:
bash: /usr/local/autotest/result_tools/utils.py: /usr/bin/python: bad interpreter: No such file or directory
10/03 08:17:21.030 ERROR|             utils:0287| [stderr] bash: /usr/local/autotest/result_tools/utils.py: /usr/bin/python: bad interpreter: No such file or directory
10/03 08:17:21.032 ERROR|            runner:0121| Non-critical failure: Failed to cleanup directory summary for /var/log.
Traceback (most recent call last):
  File "/usr/local/autotest/client/bin/result_tools/runner.py", line 97, in run_on_client
    timeout=_CLEANUP_DIR_SUMMARY_TIMEOUT)
  File "/usr/local/autotest/server/hosts/ssh_host.py", line 335, in run
    return self.run_very_slowly(*args, **kwargs)
  File "/usr/local/autotest/server/hosts/ssh_host.py", line 324, in run_very_slowly
    ssh_failure_retry_ok)
  File "/usr/local/autotest/server/hosts/ssh_host.py", line 268, in _run
    raise error.AutoservRunError("command execution error", result)
AutoservRunError: command execution error
* Command: 
    /usr/bin/ssh -a -x  -o ControlPath=/tmp/_autotmp_bSiFP7ssh-master/socket
    -o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
    -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
    ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 22 chromeos15
    -row13a-rack3-host10 "export LIBC_FATAL_STDERR_=1; if type \"logger\" >
    /dev/null 2>&1; then logger -tag \"autotest\"
    \"server[stack::collect_logs|run_on_client|run] ->
    ssh_run(/usr/local/autotest/result_tools/utils.py -p /var/log -d)\";fi;
    /usr/local/autotest/result_tools/utils.py -p /var/log -d"
Exit status: 126
Duration: 0.701218128204
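
For reference, exit status 126 together with the "bad interpreter" message means the shebang of /usr/local/autotest/result_tools/utils.py points at /usr/bin/python, which does not exist on the DUT. A quick manual check (a sketch only; the host name is taken from the log above and the usual ssh options are omitted) would be:

$ ssh root@chromeos15-row13a-rack3-host10 'head -1 /usr/local/autotest/result_tools/utils.py; ls -l /usr/bin/python*'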



-----------------------------
On chromeos15-row13a-rack2-host7 and chromeos15-row13b-rack3-host8 I am seeing the error below, similar to bug

RootFSUpdateError: Failed to install device image using payload at http://100.90.15.229:8082/update/eve-release/R70-11021.34.0 on chromeos15-row13b-rack5-host2. : command execution error
* Command: 
    /usr/bin/ssh -a -x  -o ControlPath=/tmp/_autotmp_RctnGgssh-master/socket
    -o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
    -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
    ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 22 chromeos15
    -row13b-rack5-host2 "export LIBC_FATAL_STDERR_=1; if type \"logger\" >
    /dev/null 2>&1; then logger -tag \"autotest\"
    \"server[stack::_run|_base_update_handler_no_retry|run] ->
    ssh_run(/usr/bin/update_engine_client --update
    --omaha_url=http://100.90.15.229:8082/update/eve-
    release/R70-11021.34.0)\";fi; /usr/bin/update_engine_client --update
    --omaha_url=http://100.90.15.229:8082/update/eve-release/R70-11021.34.0"
Exit status: 1
Duration: 16.4377510548

stderr:
[1002/220026:INFO:update_engine_client.cc(486)] Forcing an update by setting app_version to ForcedUpdate.
[1002/220026:INFO:update_engine_client.cc(488)] Initiating update check and install.
[1002/220026:INFO:update_engine_client.cc(517)] Waiting for update to complete.
[1002/220040:ERROR:update_engine_client.cc(232)] Update failed, current operation is UPDATE_STATUS_IDLE, last error code is ErrorCode::kOmahaResponseInvalid(34)
10/02 22:02:13.825 ERROR|           control:0074| Provision failed due to Exception.
Traceback (most recent call last):
  File "/usr/local/autotest/results/hosts/chromeos15-row13b-rack5-host2/2483684-provision/20180210202133/control.srv", line 53, in provision_machine
    provision.Provision)
  File "/usr/local/autotest/server/cros/provision.py", line 400, in run_special_task_actions
    task.run_task_actions(job, host, labels)
  File "/usr/local/autotest/server/cros/provision.py", line 173, in run_task_actions
    raise SpecialTaskActionException()
SpecialTaskActionException
10/02 22:02:13.826 ERROR|        server_job:0825| Exception escaped control file, job aborting:
Traceback (most recent call last):
  File "/usr/local/autotest/server/server_job.py", line 817, in run
    self._execute_code(server_control_file, namespace)
  File "/usr/local/autotest/server/server_job.py", line 1340, in _execute_code
    execfile(code_file, namespace, namespace)
  File "/usr/local/autotest/results/hosts/chromeos15-row13b-rack5-host2/2483684-provision/20180210202133/control.srv", line 108, in <module>
    job.parallel_simple(provision_machine, machines)
  File "/usr/local/autotest/server/server_job.py", line 619, in parallel_simple
    log=log, timeout=timeout, return_results=return_results)
  File "/usr/local/autotest/server/subcommand.py", line 98, in parallel_simple
    function(arg)
  File "/usr/local/autotest/results/hosts/chromeos15-row13b-rack5-host2/2483684-provision/20180210202133/control.srv", line 99, in provision_machine
    raise Exception('')

https://paste.googleplex.com/4555855005483008

-----

It seems like they are pointing to the devservers that were provisioned in https://bugs.chromium.org/p/chromium/issues/detail?id=889557.

Can you please have a look?
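
One way to verify whether the devserver at 100.90.15.229 is healthy and reachable is to hit its check_health RPC (a sketch, assuming these devservers expose the standard devserver check_health endpoint; the port is taken from the omaha_url in the log above):

$ curl http://100.90.15.229:8082/check_health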

 
Cc: sontis@chromium.org matthewjoseph@chromium.org
Cc: pgangishetty@chromium.org
Components: Infra>Client>ChromeOS
Labels: -Pri-3 OS-Chrome Pri-2
Matt, Sridhar,
can you please add details on the failures observed and how many boards are affected?


chromeos15-row13b-rack2-host7 
chromeos15-row13b-rack3-host3

Both hosts are failing with the errors below:

bash: /usr/local/autotest/result_tools/utils.py: /usr/bin/python: bad interpreter: No such file or directory
10/03 08:17:21.030 ERROR|             utils:0287| [stderr] bash: /usr/local/autotest/result_tools/utils.py: /usr/bin/python: bad interpreter: No such file or directory
10/03 08:17:21.032 ERROR|            runner:0121| Non-critical failure: Failed to cleanup directory summary for /var/log.
Traceback (most recent call last):
  File "/usr/local/autotest/client/bin/result_tools/runner.py", line 97, in run_on_client
    timeout=_CLEANUP_DIR_SUMMARY_TIMEOUT)
  File "/usr/local/autotest/server/hosts/ssh_host.py", line 335, in run
    return self.run_very_slowly(*args, **kwargs)
  File "/usr/local/autotest/server/hosts/ssh_host.py", line 324, in run_very_slowly
    ssh_failure_retry_ok)
  File "/usr/local/autotest/server/hosts/ssh_host.py", line 268, in _run
    raise error.AutoservRunError("command execution error", result)
AutoservRunError: command execution error
* Command: 
    /usr/bin/ssh -a -x  -o ControlPath=/tmp/_autotmp_bSiFP7ssh-master/socket
    -o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
    -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
    ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 22 chromeos15
    -row13a-rack3-host10 "export LIBC_FATAL_STDERR_=1; if type \"logger\" >
    /dev/null 2>&1; then logger -tag \"autotest\"
    \"server[stack::collect_logs|run_on_client|run] ->
    ssh_run(/usr/local/autotest/result_tools/utils.py -p /var/log -d)\";fi;
    /usr/local/autotest/result_tools/utils.py -p /var/log -d"
Exit status: 126
Duration: 0.701218128204
Labels: -Pri-2 Pri-1
chromeos15-row13a-rack2-host12
chromeos15-row13a-rack3-host10

RootFSUpdateError: Failed to install device image using payload at http://100.90.15.229:8082/update/squawks-release/R70-11021.34.0 on chromeos15-row13a-rack3-host10. : command execution error
* Command: 
    /usr/bin/ssh -a -x  -o ControlPath=/tmp/_autotmp_yare44ssh-master/socket
    -o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
    -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
    ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 22 chromeos15
    -row13a-rack3-host10 "export LIBC_FATAL_STDERR_=1; if type \"logger\" >
    /dev/null 2>&1; then logger -tag \"autotest\"
    \"server[stack::_run|_base_update_handler_no_retry|run] ->
    ssh_run(/usr/bin/update_engine_client --update
    --omaha_url=http://100.90.15.229:8082/update/squawks-
    release/R70-11021.34.0)\";fi; /usr/bin/update_engine_client --update
    --omaha_url=http://100.90.15.229:8082/update/squawks-
    release/R70-11021.34.0"
Exit status: 1
Duration: 0.951797008514

Lots of repairs are failing with the following errors; I think all of them are pointing to the new devservers provisioned yesterday.


Cc: jrbarnette@chromium.org jkop@chromium.org
There is issue 891764, which fails to deploy the Chrome OS image, so it might be related: possibly the provision job fails first, then a repair job is initiated, which also fails because of the bad state of the image. No?

Cc: harpreet@chromium.org dchan@chromium.org
Summary: Repair fails - /usr/bin/python: bad interpreter: No such file or directory (was: Repair fails )
Can we have more details on the scope of this issue? At this time 29 hosts in my pools are in the Repairing or Repair Failed state.
$ atest host list chromeos15-row13* | grep -c Repair
19
$ atest host list chromeos15-audiobox* | grep -c Repair
10

40 boards are in such a state across the whole chromeos15 lab.
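
To get the actual host names rather than just counts (a sketch; it assumes the first column of atest host list output is the hostname and that the Status column contains "Repair"):

$ atest host list chromeos15-row13* | grep Repair | awk '{print $1}'
$ atest host list chromeos15-audiobox* | grep Repair | awk '{print $1}'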


Can somebody get to the bottom of this and find out whether the new devserver(s) have anything to do with this, and what is actually happening?

Comment 8 Deleted

Per this same log, a powerwash was performed:
10/03 08:20:42.204 ERROR|            repair:0507| Repair failed: Powerwash and then re-install the stable build via AU

Why would the repair job do a powerwash?
I will look at the devserver logs to see if we can figure something out.
> Why would the repair job do a powerwash?

Some bad builds can have a bug that leaves behind file system corruption.
The fix for such bad builds is "power wash to scrub the bad file system
data, and install a new build to prevent the bug from re-corrupting the
DUT."  That's what happens with "repair.powerwash".

I logged on to chromeos15-infra-devserver15; though it was just provisioned yesterday, it runs a very *old* version of devserver.

chromeos-test@chromeos15-infra-devserver15:~/chromiumos/src/platform/dev$ git log -1
commit b066b06cbdfa25fa153353e0e1587ddbe382b35b
Author: Dan Shi <dshi@google.com>
Date:   Fri May 26 14:31:13 2017 -0700

    Force Launch Control API to return enough results for artifact lookup.

I synced the devserver checkout on chromeos15-infra-devserver{15,16,17,18,20} to the following version:
chromeos-test@chromeos15-infra-devserver20:~/chromiumos/src/platform/dev$ git log -1
commit d69ceef729eaa0510c3bf34f2eab2612eb4cdde9
Author: Nicolas Boichat <drinkcat@chromium.org>
Date:   Fri Sep 14 14:28:49 2018 -0700

    dut-console: Escape sequence is <enter>~., not ~.<enter>

This is the version running on all other devservers. The devserver processes were also restarted.
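
Roughly, the sync on each devserver looked like the following (a sketch under assumptions: the git remote setup and the way the devserver process gets restarted on these hosts are not shown in this thread):

$ cd ~/chromiumos/src/platform/dev
$ git fetch && git checkout d69ceef729eaa0510c3bf34f2eab2612eb4cdde9
$ # then restart the devserver process so it serves the new code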

But on chromeos15-infra-devserver20, I got a 'read-only file system' error when syncing the repo. Working on it.
> But on chromeos15-infra-devserver20, I got a 'read-only file system' error when syncing the repo. Working on it.

s/devserver20/devserver19/
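
A quick way to confirm the read-only root mount and see why the kernel remounted it (a sketch; the actual recovery steps used on devserver19 are not described in this thread):

$ ssh chromeos-test@chromeos15-infra-devserver19 'mount | grep " / "'
$ ssh chromeos-test@chromeos15-infra-devserver19 'dmesg | grep -i "read-only" | tail -n 5'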
Blockedon: 889557
Fixed the FS issue on chromeos15-infra-devserver19. Can you please check and let me know?

Is the FS issue present on chromeos15-infra-devserver20 also?
Status: Fixed (was: Assigned)
The devserver on chromeos15-infra-devserver19 has been added back. The issue on all other devservers should also have been fixed.

Feel free to reopen if you don't think so.
