New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 693648 link

Starred by 1 user

Issue metadata

Status: WontFix
Owner:
Last visit > 30 days ago
Closed: Mar 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug



Sign in to add a comment

some autoserv.DEBUG log timestamps seem wrong

Project Member Reported by semenzato@chromium.org, Feb 17 2017

Issue description

How can this happen?

https://uberchromegw.corp.google.com/i/chromeos/builders/ninja-release/builds/866
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/101766331-chromeos-test/chromeos4-row3-rack9-host2/debug/

file autoserv.DEBUG:

02/17 06:23:52.879 DEBUG|          ssh_host:0284| Running (ssh) 'test ! -e /var/log/messages || cp -f /var/log/messages /var/tmp/messages.autotest_start'
02/17 06:23:52.951 INFO |      abstract_ssh:0809| Starting master ssh connection '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_ajPK9Gssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row3-rack9-host2'
02/17 06:23:52.951 DEBUG|        base_utils:0185| Running '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_ajPK9Gssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row3-rack9-host2'
02/17 06:23:58.014 INFO |      abstract_ssh:0824| Timed out waiting for master-ssh connection to be established.
02/17 06:25:01.341 ERROR|        base_utils:0280| [stderr] ssh: connect to host chromeos4-row3-rack9-host2 port 22: Connection timed out
02/17 06:25:01.343 INFO |            remote:0073| Failed to copy /var/log/messages at startup: ('ssh timed out', * Command: 
    /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_ajPK9Gssh-
    master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
    -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
    ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22
    chromeos4-row3-rack9-host2 "export LIBC_FATAL_STDERR_=1; if type
    \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\"
    \"server[stack::create_target_machine|create_host|job_start] ->
    ssh_run(test ! -e /var/log/messages || cp -f /var/log/messages
    /var/tmp/messages.autotest_start)\";fi; test ! -e /var/log/messages || cp
    -f /var/log/messages /var/tmp/messages.autotest_start"
Exit status: 255
Duration: 63.2917709351

There is a log entry at 06:23:51 about starting the master ssh connection, then about 6 seconds later at 06:23:58 another entry about the timeout.  That's too soon, we would expect the delay to be at least 30 seconds.  The next log line has a more reasonable time stamp.
 
Owner: jrbarnette@chromium.org
I can't explain this behavior, and at first blush, my estimate
is that the ratio of "likely value of a correct answer" divided
by "time required to find the correct answer" is too small to
justify an investigation.

Assuming the right response isn't "WontFix", what do we hope to
gain by resolving this?

As a practical matter, the logs for the DUT at the time of the failure
show this:
    2017-02-17 06:51:49  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row3-rack9-host2/433453-provision/
    2017-02-17 06:28:22  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row3-rack9-host2/433415-repair/
    2017-02-17 06:23:48  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row3-rack9-host2/433401-provision/
    2017-02-17 06:04:09  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/101765844-chromeos-test/
    2017-02-17 05:53:45  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row3-rack9-host2/433338-provision/

Basically, the failure here happened because the DUT actually went
offline, and it was fixed by disconnecting, the reconnecting AC power
(ninja devices have no battery).

Yeah we can close this, but I still need to understand the timeout logic.
Status: WontFix (was: Untriaged)

Sign in to add a comment