Issue 713004

Starred by 5 users

Issue metadata

Status: Untriaged
Owner: ----
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocked on:
issue 713856




Tests passed but got aborted by AutotestAbort

Reported by nxia@chromium.org, Apr 19 2017

Issue description

https://luci-milo.appspot.com/buildbot/chromeos/x86-zgb-paladin/9650

login_SameSessionTwice

http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=113290478

04/18 22:35:53.773 DEBUG|          autotest:0960| Autotest job finishes.
04/18 22:35:53.774 ERROR|        server_job:0809| Exception escaped control file, job aborting:
Traceback (most recent call last):
  File "/usr/local/autotest/server/server_job.py", line 801, in run
    self._execute_code(server_control_file, namespace)
  File "/usr/local/autotest/server/server_job.py", line 1301, in _execute_code
    execfile(code_file, namespace, namespace)
  File "/usr/local/autotest/results/113290478-chromeos-test/chromeos6-row2-rack6-host16/control.srv", line 10, in <module>
    job.parallel_simple(run_client, machines)
  File "/usr/local/autotest/server/server_job.py", line 625, in parallel_simple
    return_results=return_results)
  File "/usr/local/autotest/server/subcommand.py", line 93, in parallel_simple
    function(arg)
  File "/usr/local/autotest/results/113290478-chromeos-test/chromeos6-row2-rack6-host16/control.srv", line 7, in run_client
    at.run(control, host=host, use_packaging=use_packaging)
  File "/usr/local/autotest/server/autotest.py", line 381, in run
    client_disconnect_timeout, use_packaging=use_packaging)
  File "/usr/local/autotest/server/autotest.py", line 464, in _do_run
    client_disconnect_timeout=client_disconnect_timeout)
  File "/usr/local/autotest/server/autotest.py", line 944, in execute_control
    raise error.AutotestRunError(msg)
AutotestRunError: Aborting - unexpected final status message from client on chromeos6-row2-rack6-host16: 	END GOOD	login_SameSessionTwice	login_SameSessionTwice	timestamp=1492579730	localtime=Apr 18 22:28:50	

 

Comment 1 by nxia@chromium.org, Apr 19 2017

Cc: owenlin@chromium.org nxia@chromium.org tbroch@chromium.org zhihongyu@chromium.org
Seeing random Chrome login test failures and Chrome crashes.

Comment 2 by nxia@chromium.org, Apr 19 2017

Labels: -Pri-2 Pri-1
The other login test failures are crbug.com/712390 and crbug.com/712958.
It looks like the same issue as crbug.com/712991 (Peach-pit Paladin). The test itself passed, but the job failed during post-processing.

==============
04/18 22:29:47.322 WARNI|        base_utils:0912| run process timeout (30) fired on: /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_koCfCbssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos6-row2-rack6-host16 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::wait_up|is_up|ssh_ping] -> ssh_run(true)\";fi; true"
04/18 22:29:48.330 DEBUG|          ssh_host:0212| retrying ssh command after timeout
04/18 22:30:18.569 WARNI|        base_utils:0912| run process timeout (30) fired on: /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_koCfCbssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos6-row2-rack6-host16 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::wait_up|is_up|ssh_ping] -> ssh_run(true)\";fi; true"
04/18 22:30:19.578 DEBUG|          ssh_host:0218| retry 2: restarting master connection
04/18 22:30:19.578 DEBUG|      abstract_ssh:0744| Restarting master ssh connection
04/18 22:30:19.578 DEBUG|      abstract_ssh:0756| Nuking master_ssh_job.
04/18 22:30:20.581 DEBUG|      abstract_ssh:0762| Cleaning master_ssh_tempdir.
04/18 22:30:20.582 INFO |      abstract_ssh:0809| Starting master ssh connection '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_GgMOwMssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos6-row2-rack6-host16'
04/18 22:30:20.583 DEBUG|        base_utils:0185| Running '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_GgMOwMssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos6-row2-rack6-host16'
04/18 22:30:42.035 DEBUG|      abstract_ssh:0587| Host chromeos6-row2-rack6-host16 is now up
04/18 22:30:42.036 DEBUG|      abstract_ssh:0346| get_file. source: /usr/local/autotest/results/default/, dest: /usr/local/autotest/results/113290478-chromeos-test/chromeos6-row2-rack6-host16, delete_dest: False,preserve_perm: True, preserve_symlinks:True
04/18 22:30:42.037 DEBUG|      abstract_ssh:0357| Using Rsync.
04/18 22:30:42.037 DEBUG|        base_utils:0185| Running 'rsync -l  --timeout=1800 --rsh='/usr/bin/ssh -a -x   -o ControlPath=/tmp/_autotmp_GgMOwMssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22' -az --no-o --no-g root@chromeos6-row2-rack6-host16:"/usr/local/autotest/results/default/" "/usr/local/autotest/results/113290478-chromeos-test/chromeos6-row2-rack6-host16"'
04/18 22:35:50.989 DEBUG|          ssh_host:0284| Running (ssh) 'echo A > /usr/local/autotest/tmp/_autotmp_JHXY3Dharness-fifo/autoserv.fifo'
04/18 22:35:51.498 DEBUG|          autotest:0805| Result exit status is 255.

Comment 4 by nxia@chromium.org, Apr 19 2017

Summary: Tests passed but got aborted by AutotestAbort (was: x86-zgb-paladin failed at login_SameSessionTwice)
https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/server/autotest.py?type=cs&q=execute_control&sq=package:%5Echromeos_(internal%7Cpublic)$&l=921


In the "status" file, the test output is correct:
START	----	----	timestamp=1492579661	localtime=Apr 18 22:27:41	
	START	login_SameSessionTwice	login_SameSessionTwice	timestamp=1492579663	localtime=Apr 18 22:27:43	
		GOOD	login_SameSessionTwice	login_SameSessionTwice	timestamp=1492579730	localtime=Apr 18 22:28:50	completed successfully
	END GOOD	login_SameSessionTwice	login_SameSessionTwice	timestamp=1492579730	localtime=Apr 18 22:28:50	
END GOOD	----	----	timestamp=1492580150	localtime=Apr 18 22:35:50	

but in the status.log, the test shows as aborted:

INFO	----	----	kernel=3.8.11	localtime=Apr 18 22:27:00	timestamp=1492579620	
START	----	----	timestamp=1492579661	localtime=Apr 18 22:27:41	
	START	login_SameSessionTwice	login_SameSessionTwice	timestamp=1492579663	localtime=Apr 18 22:27:43	
		GOOD	login_SameSessionTwice	login_SameSessionTwice	timestamp=1492579730	localtime=Apr 18 22:28:50	completed successfully
	END GOOD	login_SameSessionTwice	login_SameSessionTwice	timestamp=1492579730	localtime=Apr 18 22:28:50	
	ABORT	----	----	timestamp=1492580152	localtime=Apr 18 22:35:52	Autotest client terminated unexpectedly: DUT is pingable, SSHable and did NOT restart un-expectedly. We probably lost connectivity during the test.
END ABORT	----	----	timestamp=1492580152	localtime=Apr 18 22:35:52	
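
The difference between the two files above comes down to the last top-level END line. A minimal sketch of reading that out of a tab-separated status file follows; it assumes only the layout visible in the excerpts (leading tabs for nesting, then status, subdir, test name, key=value fields and an optional free-text reason). The names `StatusLine`, `parse_status_line` and `final_job_status` are hypothetical and are not the tko parser's API.

```python
from typing import Iterable, NamedTuple, Optional


class StatusLine(NamedTuple):
    indent: int    # nesting depth, taken from the number of leading tabs
    status: str    # e.g. "START", "GOOD", "END GOOD", "ABORT", "END ABORT"
    subdir: str
    testname: str
    reason: str    # trailing free-text field, empty if absent


def parse_status_line(line: str) -> Optional[StatusLine]:
    """Split one tab-separated status line into its fields."""
    stripped = line.lstrip("\t")
    if not stripped.strip():
        return None
    indent = len(line) - len(stripped)
    parts = stripped.rstrip("\n").split("\t")
    if len(parts) < 3:
        return None
    rest = parts[3:]
    # Assumption: key=value fields always contain "=", the reason does not.
    reason = rest[-1] if rest and "=" not in rest[-1] else ""
    return StatusLine(indent, parts[0], parts[1], parts[2], reason)


def final_job_status(lines: Iterable[str]) -> Optional[str]:
    """Return the status word of the last top-level END line, if any."""
    result = None
    for line in lines:
        parsed = parse_status_line(line)
        if parsed and parsed.indent == 0 and parsed.status.startswith("END "):
            result = parsed.status[len("END "):]
    return result
```

Fed the "status" file lines, `final_job_status` would report GOOD; fed the status.log lines, it would report ABORT, which is the discrepancy described in this comment.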

Comment 5 by nxia@chromium.org, Apr 19 2017

Another example:

https://luci-milo.appspot.com/buildbot/chromeos/wolf-paladin/14094

platform_DMVerityCorruption_CLIENT_JOB.0

http://cautotest/tko/retrieve_logs.cgi?job=/results/113330784-chromeos-test/



INFO	----	----	kernel=3.8.11	localtime=Apr 19 06:58:28	timestamp=1492610308	
START	----	----	timestamp=1492610413	localtime=Apr 19 07:00:13	
	START	platform_DMVerityCorruption	platform_DMVerityCorruption	timestamp=1492610413	localtime=Apr 19 07:00:13	
		GOOD	platform_DMVerityCorruption	platform_DMVerityCorruption	timestamp=1492610427	localtime=Apr 19 07:00:27	completed successfully
	END GOOD	platform_DMVerityCorruption	platform_DMVerityCorruption	timestamp=1492610427	localtime=Apr 19 07:00:27	
	ABORT	----	----	timestamp=1492610565	localtime=Apr 19 07:02:45	Autotest client terminated unexpectedly: DUT is pingable, SSHable and did NOT restart un-expectedly. We probably lost connectivity during the test.
END ABORT	----	----	timestamp=1492610565	localtime=Apr 19 07:02:45

Comment 6 by nxia@chromium.org, Apr 19 2017

another example:

https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/5683

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/113331037-chromeos-test/chromeos2-row1-rack8-host1/moblab_RunSuite/debug/

	START	----	----	timestamp=1492610688	localtime=Apr 19 07:04:48	
		GOOD	----	sysinfo.iteration.before	timestamp=1492610689	localtime=Apr 19 07:04:49	
		ABORT	----	----	timestamp=1492610692	localtime=Apr 19 07:04:52	Autotest client terminated unexpectedly: DUT is pingable, SSHable and did NOT restart un-expectedly. We probably lost connectivity during the test.
	END ABORT	----	----	timestamp=1492610692	localtime=Apr 19 07:04:52	

Comment 7 by nxia@chromium.org, Apr 19 2017

Cc: shuqianz@chromium.org

Comment 8 by nxia@chromium.org, Apr 19 2017

Labels: -Pri-1 Pri-0

Comment 9 by nxia@chromium.org, Apr 19 2017

Following up on #4:

Why wasn't the status "END GOOD	----	----	timestamp=1492580150	localtime=Apr 18 22:35:50" added to status.log correctly?
Cc: dshi@chromium.org jrbarnette@chromium.org
The original ZGB failure is the same as the one described in
 bug 712464 .

Every time I've seen this, it's also been accompanied by this
error message:
    Autotest client terminated unexpectedly: DUT is pingable,
    SSHable and did NOT restart un-expectedly. We probably lost
    connectivity during the test

I suspect we have two issues:
 1) Something (a bug in the lab?) causes us to lose connectivity
    to DUTs.
 2) Our retry logic does the wrong thing with the ABORT status.

Comment 12 by nxia@chromium.org, Apr 19 2017

How was the status transferred to the shard? The DUT may have lost its connection when the test finished, so the last 'END' log line wasn't reported.
Splitting this bug into two smaller bugs:

1. The tko parser parses out two statuses for the same test, which makes the test retry twice.
2. The DUT loses its connection during client-side tests, which prevents the server from getting the right test status.

I will focus on investigating 1. Currently I think it may be due to badly formatted lines in the test's status.log that contain sensitive phrases like "END GOOD" and mislead the parser:

INFO	----	----	timestamp=1492610586	job_abort_reason=Aborting - unexpected final status message from client on chromeos4-row1-rack7-host9:  END GOOD platform_DMVerityCorruption platform_DMVerityCorruption timestamp=1492610427 localtime=Apr 19 07:00:27  	localtime=Apr 19 07:03:06	Aborting - unexpected final status message from client on chromeos4-row1-rack7-host9:         END GOOD        platform_DMVerityCorruption        platform_DMVerityCorruption        timestamp=1492610427        localtime=Apr 19 07:00:27            

investigation continuing.




We're pursuing a theory that the failures are caused by bandwidth
spikes while uploading crashes caused by bug 712102.

Comment 15 by nxia@chromium.org, Apr 20 2017

Cc: x...@chromium.org
Cc: hidehiko@chromium.org
Looks like a dup of crbug.com/712567?

Comment 17 by bugdroid1@chromium.org, Apr 20 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/ba28516abd333289ce2e4e4069d024f54f1d7be1

commit ba28516abd333289ce2e4e4069d024f54f1d7be1
Author: xixuan <xixuan@chromium.org>
Date: Thu Apr 20 00:48:28 2017

autotest: remove sensitive lines from error msg for further tko parsing.

The lines got from client side test may include phrase like 'END GOOD',
which will make tko parser parse the same test twice. This CL removes it
from raised error msg, so that it won't be recorded in status.log later.

BUG=chromium:713004
TEST=Ran unittest

Change-Id: Ibefc27eddd3b7c1df12aac2f66a5dec57b03c5f9
Reviewed-on: https://chromium-review.googlesource.com/482542
Reviewed-by: Ningning Xia <nxia@chromium.org>
Commit-Queue: Xixuan Wu <xixuan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/ba28516abd333289ce2e4e4069d024f54f1d7be1/server/autotest.py
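
As a rough illustration of the idea in this commit message (not the actual change to server/autotest.py, which is not shown in this thread), the sketch below builds the abort message and leaves out the client's last line when that line looks like a status-log entry. The keyword list, the regular expression, and the function names are all assumptions made for the example.

```python
import re

# Assumption: a status-log entry starts with optional whitespace, one of
# these keywords, and then a tab. The keyword list is illustrative only.
_STATUS_LINE = re.compile(
    r"^\s*(?:START|INFO|(?:END )?(?:GOOD|FAIL|ABORT|ERROR|WARN))\t"
)


def looks_like_status_line(line: str) -> bool:
    """Return True if the line resembles a status-log entry."""
    return _STATUS_LINE.match(line) is not None


def build_abort_message(hostname: str, last_line: str) -> str:
    """Build the abort message, omitting text a status parser could misread."""
    msg = "Aborting - unexpected final status message from client on %s" % hostname
    text = last_line.strip()
    if text and not looks_like_status_line(last_line):
        msg += ": " + text
    return msg
```

With a last line such as the "END GOOD" entry quoted earlier, the resulting message would contain only the host name, so nothing resembling a status entry ends up in status.log.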


We're working to pin chrome to 59.0.3065.0, which is the last
chrome version before the trouble started.  There have been two
manual uprevs since then.  We suspect that the uprev to 59.0.3068.1
has an undiscovered bug.

Comment 21 by nxia@chromium.org, Apr 20 2017

18:08nxia
https://chrome-internal-review.googlesource.com/c/358148
18:09nxia
https://chromium-review.googlesource.com/c/482707/
18:09nxia
the last PFQ uprev was a manual uprev, so cros_pinchrome discards the last stable Chrome version 59.0.3065.0 and is going to pin it back to 59.0.3064.0
18:10nxia
59.0.3064.0 was only uprevved one day before 59.0.3065.0
18:10nxia
so we're going to revert to 59.0.3064.0.
18:11nxia
and I'll file a bug for cros_pinchrome (it was written before the cros_uprevchrome tool, so it doesn't handle manual uprev versions well)

Comment 22 by bugdroid1@chromium.org, Apr 20 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/05be40f3153e8aaafe77785c66d759e6a4cec52f

commit 05be40f3153e8aaafe77785c66d759e6a4cec52f
Author: Ningning Xia <nxia@google.com>
Date: Thu Apr 20 01:18:18 2017

Chrome: Pin to version 59.0.3064.0_rc-r1

DO NOT REVERT THIS CL.
In general, reverting chrome (un)pin CLs does not do what you expect.
Instead, use `cros pinchrome` to generate new CLs.

BUG=chromium:713004
TEST=None
CQ-DEPEND=*I27cd69e3102272f369cc20d0230a7b12e4a6c394

Change-Id: I13eb0e2c4cfb4f6898a7e124fd5be013021d030b
Reviewed-on: https://chromium-review.googlesource.com/482707
Reviewed-by: Richard Barnette <jrbarnette@google.com>
Reviewed-by: Xiaoqian Dai <xdai@chromium.org>
Tested-by: Ningning Xia <nxia@chromium.org>

[add] https://crrev.com/05be40f3153e8aaafe77785c66d759e6a4cec52f/profiles/default/linux/package.mask/chromepin
[modify] https://crrev.com/05be40f3153e8aaafe77785c66d759e6a4cec52f/chromeos-base/chromeos-chrome/Manifest
[modify] https://crrev.com/05be40f3153e8aaafe77785c66d759e6a4cec52f/chromeos/binhost/target/amd64-generic-LATEST_RELEASE_CHROME_BINHOST.conf
[modify] https://crrev.com/05be40f3153e8aaafe77785c66d759e6a4cec52f/chromeos/binhost/target/daisy-LATEST_RELEASE_CHROME_BINHOST.conf
[rename] https://crrev.com/05be40f3153e8aaafe77785c66d759e6a4cec52f/chromeos-base/chromium-source/chromium-source-59.0.3064.0_rc-r1.ebuild
[modify] https://crrev.com/05be40f3153e8aaafe77785c66d759e6a4cec52f/chromeos/binhost/target/arm-generic-LATEST_RELEASE_CHROME_BINHOST.conf
[modify] https://crrev.com/05be40f3153e8aaafe77785c66d759e6a4cec52f/chromeos/binhost/target/x86-generic-LATEST_RELEASE_CHROME_BINHOST.conf
[modify] https://crrev.com/05be40f3153e8aaafe77785c66d759e6a4cec52f/chromeos/binhost/target/veyron_jerry-LATEST_RELEASE_CHROME_BINHOST.conf
[rename] https://crrev.com/05be40f3153e8aaafe77785c66d759e6a4cec52f/chromeos-base/chromeos-chrome/chromeos-chrome-59.0.3064.0_rc-r1.ebuild
[modify] https://crrev.com/05be40f3153e8aaafe77785c66d759e6a4cec52f/chromeos/binhost/host/amd64-LATEST_RELEASE_CHROME_BINHOST.conf


Comment 23 by bugdroid1@chromium.org, Apr 20 2017

Comment 24 by ihf@chromium.org, Apr 20 2017

Cc: ihf@chromium.org
I have no idea if this Chrome is still in sync with Android.

Comment 25 by ihf@chromium.org, Apr 20 2017

Same with other changes that went in since that version. Just pinning Chrome to an old version may be pretty undefined. I wish you would have waited a little longer. But good luck. 

Comment 26 by nxia@chromium.org, Apr 20 2017

I'll monitor the CQ. I'll unpin Chrome if this version has issues.
Cc: englab-sys-cros@google.com
 Issue 712991  has been merged into this issue.
Cc: ayatane@chromium.org
I am so confused by the logs and code: it looks like the timeout is recovered after "restarting master ssh connection".
04/18 22:30:19.578 DEBUG|      abstract_ssh:0744| Restarting master ssh connection

So the real issue is 
04/18 22:35:51.498 DEBUG|          autotest:0805| Result exit status is 255.

But I am not sure which command failed.

Given that this only happens in paladin, and that I saw a bunch of ayatane's CLs touching autotest:
are you aware of any of your changes that could cause this?

I am going to throttle the tree and see if it gets better. 
The tree had already been throttled.

Comment 32 by nxia@chromium.org, Apr 20 2017

security_SandboxStatus timeout ( crbug.com/713531 ) is the only failure in the last CQ run master-paladin/14349. 

https://luci-milo.appspot.com/buildbot/chromeos/veyron_mighty-paladin/5056

https://luci-milo.appspot.com/buildbot/chromeos/master-paladin/14349
SandboxStatus was timing out in crbug.com/706939. Maybe another bad Chrome change got in?
To clarify: issue 706939 was successfully fixed.
> To clarify: issue 706939 was successfully fixed.

Yeah, but we pinned Chrome back to 59.0.3064.0, which may
have been current around the time of the bug (it was built
around 4/7).  We should figure out whether that bug was in
that version of Chrome.  :-(

In other news, the network load that we were blaming on
"bad Chrome causing crashes" hasn't gone down.  That suggests
that either 1) pinning Chrome didn't stop the crashes or
2) the crashes weren't the source of the load.

The CQ hasn't been green since pinning Chrome, but that could
be due to problems other than this bug...

Comment 36 by x...@chromium.org, Apr 20 2017

Confirmed https://codereview.chromium.org/2783723002 is in 59.0.3064.0, but the revert CL https://codereview.chromium.org/2812743002/ is not in 59.0.3064.0.
> In other news, the network load that we were blaming on
> "bad Chrome causing crashes" hasn't gone down.  That suggests
> that either 1) pinning Chrome didn't stop the crashes or
> 2) the crashes weren't the source of the load.

Taking another look at this data, I see that the spike in network
load is outbound only.  If the problem were Chrome crashes, we'd
expect to see similar (leading) spikes in inbound traffic as we
first copied _from_ the DUTs and then trailing spikes as we copied
_to_ GS.

Also, the start of the network spikes coincided with the start of
higher disk write activity.  That suggests we suddenly started
generating new data locally on the shards, which is then offloaded.
That would suggest a change in autoserv or a server-side test.

Comment 38 by x...@chromium.org, Apr 20 2017

Cc: steve...@chromium.org
For the tko retrying the test twice, we have 2 solutions:

1. The tko parser, when parsing a client-side test, is intended to parse the status.log in https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/113330784-chromeos-test/chromeos4-row1-rack7-host9/ into 2 entries: one for SERVER_JOB, one for CLIENT_JOB.0.

2. The retry mechanism allows a second retry for the same job (https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/server/cros/dynamic_suite/suite.py?q=has_following_retry&sq=package:%5Echromeos_(internal%7Cpublic)$&l=188),
but checks that the job's status is not 'RETRIED' (https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/server/cros/dynamic_suite/suite.py?q=has_following_retry&sq=package:%5Echromeos_(internal%7Cpublic)$&l=152),
which I feel is a bug, but I don't know whether there is some other design consideration behind it.

We can either stop status.log from printing lines like 'INFO	----	----	timestamp=1492610586	job_abort_reason=Aborting...', or change the retry mechanism if there are no other considerations behind it. I prefer the second one.
https://chromium-review.googlesource.com/c/483030/ is intended to fix "retrying twice".

Comment 41 by nxia@chromium.org, Apr 20 2017

The chrome version was just unpinned  crbug.com/713531 

Comment 42 by bugdroid1@chromium.org, Apr 20 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/bf854f881b4abfc79bc1eb2b9dd5aa368b40a136

commit bf854f881b4abfc79bc1eb2b9dd5aa368b40a136
Author: xixuan <xixuan@chromium.org>
Date: Thu Apr 20 19:20:24 2017

autotest: Retry only once for the same job.

Current retry strategy retry twice for the same job, and fail to execute
the second one. This CL changes the retry to only execute once.

BUG=chromium:713004
TEST=Ran unittest.

Change-Id: I4f1352963c60e80e11c2319eae52b6677288ab9c
Reviewed-on: https://chromium-review.googlesource.com/483030
Reviewed-by: Ningning Xia <nxia@chromium.org>
Commit-Queue: Xixuan Wu <xixuan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/bf854f881b4abfc79bc1eb2b9dd5aa368b40a136/server/cros/dynamic_suite/suite.py
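
To make the "retry only once for the same job" idea concrete, here is a small hypothetical sketch; it is not the code in suite.py. It assumes a per-job record with a state and a remaining retry budget, and refuses to schedule another retry for a job already marked RETRIED, even if a second result for that job is parsed later. The class and method names are invented for the example.

```python
class RetryTracker:
    """Tracks which jobs have already been retried (illustrative only)."""

    NOT_ATTEMPTED = "NOT_ATTEMPTED"
    RETRIED = "RETRIED"

    def __init__(self, retry_max_by_job):
        # job_id -> {"state": ..., "retry_max": remaining retry budget}
        self._jobs = {
            job_id: {"state": self.NOT_ATTEMPTED, "retry_max": retry_max}
            for job_id, retry_max in retry_max_by_job.items()
        }

    def should_retry(self, job_id):
        """A job is retried only if it is known, unretried, and has budget."""
        entry = self._jobs.get(job_id)
        return (
            entry is not None
            and entry["state"] == self.NOT_ATTEMPTED
            and entry["retry_max"] > 0
        )

    def record_retry(self, old_job_id, new_job_id):
        """Mark the old job as retried and register its replacement."""
        old = self._jobs[old_job_id]
        old["state"] = self.RETRIED
        self._jobs[new_job_id] = {
            "state": self.NOT_ATTEMPTED,
            "retry_max": old["retry_max"] - 1,
        }
```

Under these assumptions, a second failing result reported for the same job id finds the job already in the RETRIED state and does not trigger another retry.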

Update on what's known:

At this point, we believe the underlying symptom (the errors
that say "probably lost connectivity") is caused by too much
network load on the shard serving the test.  We believe the
extra load is coming from Chrome crashes (see  bug 713856 ).
We believe that the asymmetric ingress/egress numbers are caused
by failed offloads being retried (the retries try to transmit
the entire results every time).

Pinning Chrome should have stopped the crashes, but there are
multiple sources of test requests, and some of them are still
requesting testing for builds that contain the Chrome bug.  That
means that load has gone down, but not yet vanished.

We've closed the lab to all canary builds that could contain the
bug; that will stop scheduled suites from using the bad builds.
We also aborted an R59 branch build that was seeing the problem.
We may need to close the lab to those builds as well.

Comment 44 by ihf@chromium.org, Apr 20 2017

Have you considered not rsyncing the chrome*.core files instead of closing the lab?

Comment 45 by ihf@chromium.org, Apr 20 2017

rsync --exclude 'chrome.201704*core'
when pulling /autotest/results/default/ will ignore chrome cores at the DUT.
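
A sketch of what that suggestion could look like when assembling the rsync command, assuming a helper that builds the argument list; the helper name and its parameters are hypothetical, and the `--rsh` ssh options seen in the log above are omitted for brevity. Only standard rsync flags that already appear in the logged command, plus `--exclude`, are used.

```python
import shlex


def build_rsync_command(host, source, dest, excludes=()):
    """Assemble an rsync argument list that skips the given patterns."""
    cmd = ["rsync", "-l", "--timeout=1800", "-az", "--no-o", "--no-g"]
    for pattern in excludes:
        # Each pattern becomes its own --exclude option.
        cmd.append("--exclude=%s" % pattern)
    cmd.append("root@%s:%s" % (host, source))
    cmd.append(dest)
    return cmd


cmd = build_rsync_command(
    "chromeos6-row2-rack6-host16",
    "/usr/local/autotest/results/default/",
    "/usr/local/autotest/results/113290478-chromeos-test/chromeos6-row2-rack6-host16",
    excludes=["chrome.201704*core"],
)
# Shell-quoted form, suitable for logging.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Passing the command as a list avoids shell expansion of the `*` in the pattern, so rsync itself does the matching.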
The lab is only closed to builds that are known to be bad.

Finding all the code that might be syncing chrome core files is
more work and more risk, and when things got better, we'd have to
revert the change, too.

For now, the lab will remain closed to all ToT builds between 9455.0.0
and 9477.0.0.

Comment 47 by ihf@chromium.org, Apr 20 2017

Then let me unpin Chrome again so we get some nontrivial builds going.
> Then let me unpin Chrome again so we get some nontrivial builds going.

Chrome is already unpinned.  And the PFQ is testing against the latest
as well.  And we can't afford to do a manual uprev until we know for
sure that we've eliminated the source of the crashes.

I'm not sure if/how we will "know for sure".

I've been looking, or attempting to look anyway, but have not identified any crashes on the PFQ. This is not to say that they are not happening, just that if they are, they do not appear to be causing failures, and we don't seem to have any reasonable way to identify them.

I have some vague recollection of querying logs once for chrome crashes, but can not for the life of me remember how.


Comment 50 by bugdroid1@chromium.org, Apr 21 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/cb21b2937f929fbde6ca7656df36703294a642c6

commit cb21b2937f929fbde6ca7656df36703294a642c6
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Fri Apr 21 00:16:04 2017

[autotest] Don't include /var/spool/crash in test results.

This (temporarily?) excludes /var/spool/crash from sysinfo in
test results.  Shards running tests are currently overwhelmed,
apparently by the volume of Chrome crashes.  Shut them off, to
see if we can make progress.

BUG=chromium:713004
TEST=None

Change-Id: I8c0a74771251c68ca70ac1129fa9cfa3013b539e
Reviewed-on: https://chromium-review.googlesource.com/483960
Reviewed-by: Steven Bennetts <stevenjb@chromium.org>
Reviewed-by: Richard Barnette <jrbarnette@google.com>
Tested-by: Richard Barnette <jrbarnette@chromium.org>

[modify] https://crrev.com/cb21b2937f929fbde6ca7656df36703294a642c6/client/bin/site_sysinfo.py

Comment 51 by nxia@chromium.org, Apr 21 2017

Blockedon: 713856

Comment 52 by nxia@chromium.org, Apr 21 2017

Re #49: have you looked into the failed test examples in this bug (#1 ~ #6)? You can look at the Chrome crash dump files offloaded to GS; most of the examples have dump files in the bucket. If we can make sure all the crashes are gone, we're good.

Comment 53 by nxia@chromium.org, Apr 22 2017

 Issue 712958  has been merged into this issue.

Comment 54 by nxia@chromium.org, Apr 22 2017

Cc: xixuan@chromium.org
 Issue 712390  has been merged into this issue.

Comment 55 by bugdroid1@chromium.org, Apr 22 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/158b180499131e4a5ef2dc7b4c357702aeba9838

commit 158b180499131e4a5ef2dc7b4c357702aeba9838
Author: Ilja H. Friedel <ihf@chromium.org>
Date: Sat Apr 22 01:59:53 2017

Revert "[autotest] Don't include /var/spool/crash in test results."

This reverts commit cb21b2937f929fbde6ca7656df36703294a642c6.

BUG=chromium:713004
TEST=None

Change-Id: I9e5f0ee0c29d5cc9868c19577a8f0ecf6ce182ca
Reviewed-on: https://chromium-review.googlesource.com/483847
Reviewed-by: Ilja H. Friedel <ihf@chromium.org>
Tested-by: Ilja H. Friedel <ihf@chromium.org>

[modify] https://crrev.com/158b180499131e4a5ef2dc7b4c357702aeba9838/client/bin/site_sysinfo.py


Comment 56 by bugdroid1@chromium.org, Apr 22 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/3cf74ed60b2419125306a1e0e433d82d0c8edc3e

commit 3cf74ed60b2419125306a1e0e433d82d0c8edc3e
Author: Ilja H. Friedel <ihf@chromium.org>
Date: Sat Apr 22 02:40:06 2017

Reland "[autotest] Don't include /var/spool/crash in test results."

The uprev to 60.0.3077.0 did not contain the crash fix yet. It will be in 60.0.30787.0

This reverts commit 158b180499131e4a5ef2dc7b4c357702aeba9838.

BUG=chromium:713004
TEST=None

Change-Id: I148b9f582ec70d6edf03feb852987785f8db9835
Reviewed-on: https://chromium-review.googlesource.com/484813
Reviewed-by: Ilja H. Friedel <ihf@chromium.org>
Tested-by: Ilja H. Friedel <ihf@chromium.org>

[modify] https://crrev.com/3cf74ed60b2419125306a1e0e433d82d0c8edc3e/client/bin/site_sysinfo.py

Comment 57 by nxia@chromium.org, Apr 24 2017

Cc: chingcodes@chromium.org
Labels: -Pri-0 Pri-1
Now the lab is experiencing crbug.com/714571, which is unrelated to this issue. Downgrading this to P1; we will keep monitoring.

So far we have:

(1) skipped offloading crash core files,
(2) blocked lab tests on bad CrOS/Chrome versions, and
(3) uprev'ed Chrome versions.
 

Comment 58 by nxia@chromium.org, Jun 8 2018

Cc: -nxia@chromium.org
