New issue
Advanced search Search tips

Issue 731923 link

Starred by 1 user

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug



Sign in to add a comment

Early failures of autotestd on DUT leads to autoserv collecting logs from the (incorrect) last test run

Project Member Reported by ihf@chromium.org, Jun 9 2017

Issue description

Lifted from
https://bugs.chromium.org/p/chromium/issues/detail?id=721867#c21

On this build the "graphics_dEQP" link from
https://luci-milo.appspot.com/buildbot/chromeos/caroline-paladin/206
[Test-Logs]: graphics_dEQP: FAIL: Autotest client terminated unexpectedly: DUT rebooted during the test run.

leads to "cheets_StartAndroid.stress" failure
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/122453172-chromeos-test/chromeos2-row8-rack1-host20/cheets_StartAndroid.stress/

Retry or dashboard logic going wrong?
 
Owner: pprabhu@chromium.org
Status: Started (was: Untriaged)
Picking this up as secondary deputy. This is a routine infra request that deserves within-a-week progress (confusing critical UI for users).

Note that the link off of the "suite details" view in viceroy is similarly broken.

Going to the actual failing test on cautotest:
http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=122453172
The job_id is actually correct!

Somehow, we ended up putting logs from the wrong test in this location.
The suite job also claims that that job_id belongs to 

06/09 12:23:41.578 DEBUG|             suite:1129| Scheduling graphics_dEQP.gles31.info
06/09 12:23:41.911 DEBUG|             suite:1474| Adding job keyval for graphics_dEQP.gles31.info=122453172-chromeos-test
Instead, we find logs for the test cheets_StartAndroid.stress.0

It's results should be in:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/122453098-chromeos-test/chromeos2-row8-rack1-host20/

It's on the same DUT,
and look into debug/autoserv.DEBUG, we find that pids are different, so this is not a case of incorrectly uploaded test logs from the same test twice.
Oh fun,
autoserv.DEBUG claims that it's running graphics_dEQP test.
Also, autoserv ran ~13:08 on 06/09

but the client.DEBUG claims that it ran cheets_StartAndroid.stress at ~12:34 on 06/09

In fact, we went and collect the client results / logs from the DUT for an earlier test running cheets_StartAndroid.stress instead of graphis_dQEP.gles31


Labels: -M-61 Hotlist-Fixit
Owner: ----
Status: Available (was: Started)
Summary: Early failures of autotestd on DUT leads to autoserv collecting logs from the (incorrect) last test run (was: Wrong links on buildbot to test failures)
- autoserv tried to launch autotestd on the DUT
- failed to start (exit code 255)
- autoserv tried to ping DUT / see if it's dead
- autoserv found the DUT had died / rebooted
- autoserv gave up
- post-job tasks then collected logs from the DUT from /usr/local/autotest/results/default

BUT /usr/local/autotest/results/default had never been setup properly, since we didn't get far enough to do that.

As a result, we ended up collecting logs from the last test run on the DUT.

Relevant log from autoserv:

06/09 12:53:24.545 DEBUG|          ssh_host:0286| Running (ssh) '/usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-hRB5MI 0 0'
06/09 12:53:25.000 DEBUG|     site_autotest:0194| Entered autotestd_monitor.
06/09 12:53:25.001 INFO |          autotest:1177| Entered autotestd_monitor.
06/09 12:53:25.001 DEBUG|     site_autotest:0194| Finished launching tail subprocesses.
06/09 12:53:25.001 INFO |          autotest:1177| Finished launching tail subprocesses.
06/09 12:53:25.002 DEBUG|     site_autotest:0194| Finished waiting on autotestd to start.
06/09 12:53:25.002 INFO |          autotest:1177| Finished waiting on autotestd to start.
06/09 12:53:25.534 DEBUG|     site_autotest:0194| AUTOTEST_STATUS::START	----	----	timestamp=1497038004	localtime=Jun 09 12:53:24	
06/09 12:53:25.537 INFO |        server_job:0199| START	----	----	timestamp=1497038004	localtime=Jun 09 12:53:24	
06/09 12:53:25.578 DEBUG|     site_autotest:0194| AUTOTEST_STATUS::	START	graphics_dEQP	graphics_dEQP	timestamp=1497038005	localtime=Jun 09 12:53:25	
06/09 12:53:25.579 INFO |        server_job:0199| 	START	graphics_dEQP	graphics_dEQP	timestamp=1497038005	localtime=Jun 09 12:53:25	
06/09 13:08:25.810 DEBUG|          autotest:0815| Result exit status is 255.
06/09 13:08:25.811 DEBUG|             utils:0203| Running 'ping chromeos2-row8-rack1-host20 -w1 -c1'
06/09 13:08:25.987 DEBUG|             utils:0298| [stdout] PING chromeos2-row8-rack1-host20.cros.corp.google.com (100.115.230.141) 56(84) bytes of data.
06/09 13:08:25.988 DEBUG|             utils:0298| [stdout] 64 bytes from 100.115.230.141: icmp_seq=1 ttl=59 time=0.810 ms
06/09 13:08:25.989 DEBUG|             utils:0298| [stdout] 
06/09 13:08:25.989 DEBUG|             utils:0298| [stdout] --- chromeos2-row8-rack1-host20.cros.corp.google.com ping statistics ---
06/09 13:08:25.989 DEBUG|             utils:0298| [stdout] 1 packets transmitted, 1 received, 0% packet loss, time 0ms
06/09 13:08:25.989 DEBUG|             utils:0298| [stdout] rtt min/avg/max/mdev = 0.810/0.810/0.810/0.000 ms
06/09 13:08:25.992 DEBUG|          ssh_host:0286| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi'
06/09 13:08:26.012 INFO |      abstract_ssh:0795| Master ssh connection to chromeos2-row8-rack1-host20 is down.
06/09 13:08:26.013 DEBUG|      abstract_ssh:0756| Nuking master_ssh_job.
06/09 13:08:26.013 DEBUG|      abstract_ssh:0762| Cleaning master_ssh_tempdir.
06/09 13:08:26.014 INFO |      abstract_ssh:0809| Starting master ssh connection '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_OOcoyyssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos2-row8-rack1-host20'
06/09 13:08:26.014 DEBUG|             utils:0203| Running '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_OOcoyyssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos2-row8-rack1-host20'
06/09 13:08:27.924 DEBUG|             utils:0298| [stdout] 7bac4261-36b9-40eb-9239-449dd002cd94
06/09 13:08:27.969 INFO |        server_job:0199| 		FAIL	----	----	timestamp=1497038907	localtime=Jun 09 13:08:27	Autotest client terminated unexpectedly: DUT rebooted during the test run.
06/09 13:08:27.970 INFO |        server_job:0199| 	END FAIL	----	----	timestamp=1497038907	localtime=Jun 09 13:08:27	
06/09 13:08:27.971 INFO |        server_job:0199| END GOOD	----	----	timestamp=1497038907	localtime=Jun 09 13:08:27	
06/09 13:08:27.971 DEBUG|          autotest:0962| Autotest job finishes running. Below is the post-processing operations.


-------------------------
- This will only occur when the test in question fails anyway
  - and only if autotestd dies very early.
- This will not affect the correctness of the results.

So adding it to Fixit. We have enough information to go from on this bug now.
Components: -Infra>Client>ChromeOS Infra>Client>ChromeOS>Test
Labels: -Pri-2 Pri-3

Sign in to add a comment