Status: Untriaged
Owner: ----
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocked on:
issue 713856



Tests passed but got aborted by AutotestAbort
Project Member Reported by nxia@chromium.org, Apr 19
https://luci-milo.appspot.com/buildbot/chromeos/x86-zgb-paladin/9650

login_SameSessionTwice

http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=113290478

04/18 22:35:53.773 DEBUG|          autotest:0960| Autotest job finishes.
04/18 22:35:53.774 ERROR|        server_job:0809| Exception escaped control file, job aborting:
Traceback (most recent call last):
  File "/usr/local/autotest/server/server_job.py", line 801, in run
    self._execute_code(server_control_file, namespace)
  File "/usr/local/autotest/server/server_job.py", line 1301, in _execute_code
    execfile(code_file, namespace, namespace)
  File "/usr/local/autotest/results/113290478-chromeos-test/chromeos6-row2-rack6-host16/control.srv", line 10, in <module>
    job.parallel_simple(run_client, machines)
  File "/usr/local/autotest/server/server_job.py", line 625, in parallel_simple
    return_results=return_results)
  File "/usr/local/autotest/server/subcommand.py", line 93, in parallel_simple
    function(arg)
  File "/usr/local/autotest/results/113290478-chromeos-test/chromeos6-row2-rack6-host16/control.srv", line 7, in run_client
    at.run(control, host=host, use_packaging=use_packaging)
  File "/usr/local/autotest/server/autotest.py", line 381, in run
    client_disconnect_timeout, use_packaging=use_packaging)
  File "/usr/local/autotest/server/autotest.py", line 464, in _do_run
    client_disconnect_timeout=client_disconnect_timeout)
  File "/usr/local/autotest/server/autotest.py", line 944, in execute_control
    raise error.AutotestRunError(msg)
AutotestRunError: Aborting - unexpected final status message from client on chromeos6-row2-rack6-host16: 	END GOOD	login_SameSessionTwice	login_SameSessionTwice	timestamp=1492579730	localtime=Apr 18 22:28:50	

 
Cc: owenlin@chromium.org nxia@chromium.org tbroch@chromium.org zhihongyu@chromium.org
We are seeing random Chrome login test failures and Chrome crashes.
Labels: -Pri-2 Pri-1
The other login test failures are crbug.com/712390 and crbug.com/712958.
It looks like the same issue as crbug.com/712991 (Peach-pit Paladin). The test itself passed, but it failed during post-processing.

==============
04/18 22:29:47.322 WARNI|        base_utils:0912| run process timeout (30) fired on: /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_koCfCbssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos6-row2-rack6-host16 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::wait_up|is_up|ssh_ping] -> ssh_run(true)\";fi; true"
04/18 22:29:48.330 DEBUG|          ssh_host:0212| retrying ssh command after timeout
04/18 22:30:18.569 WARNI|        base_utils:0912| run process timeout (30) fired on: /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_koCfCbssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos6-row2-rack6-host16 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::wait_up|is_up|ssh_ping] -> ssh_run(true)\";fi; true"
04/18 22:30:19.578 DEBUG|          ssh_host:0218| retry 2: restarting master connection
04/18 22:30:19.578 DEBUG|      abstract_ssh:0744| Restarting master ssh connection
04/18 22:30:19.578 DEBUG|      abstract_ssh:0756| Nuking master_ssh_job.
04/18 22:30:20.581 DEBUG|      abstract_ssh:0762| Cleaning master_ssh_tempdir.
04/18 22:30:20.582 INFO |      abstract_ssh:0809| Starting master ssh connection '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_GgMOwMssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos6-row2-rack6-host16'
04/18 22:30:20.583 DEBUG|        base_utils:0185| Running '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_GgMOwMssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos6-row2-rack6-host16'
04/18 22:30:42.035 DEBUG|      abstract_ssh:0587| Host chromeos6-row2-rack6-host16 is now up
04/18 22:30:42.036 DEBUG|      abstract_ssh:0346| get_file. source: /usr/local/autotest/results/default/, dest: /usr/local/autotest/results/113290478-chromeos-test/chromeos6-row2-rack6-host16, delete_dest: False,preserve_perm: True, preserve_symlinks:True
04/18 22:30:42.037 DEBUG|      abstract_ssh:0357| Using Rsync.
04/18 22:30:42.037 DEBUG|        base_utils:0185| Running 'rsync -l  --timeout=1800 --rsh='/usr/bin/ssh -a -x   -o ControlPath=/tmp/_autotmp_GgMOwMssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22' -az --no-o --no-g root@chromeos6-row2-rack6-host16:"/usr/local/autotest/results/default/" "/usr/local/autotest/results/113290478-chromeos-test/chromeos6-row2-rack6-host16"'
04/18 22:35:50.989 DEBUG|          ssh_host:0284| Running (ssh) 'echo A > /usr/local/autotest/tmp/_autotmp_JHXY3Dharness-fifo/autoserv.fifo'
04/18 22:35:51.498 DEBUG|          autotest:0805| Result exit status is 255.
Summary: Tests passed but got aborted by AutotestAbort (was: x86-zgb-paladin failed at login_SameSessionTwice)
https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/server/autotest.py?type=cs&q=execute_control&sq=package:%5Echromeos_(internal%7Cpublic)$&l=921
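
For context, here is a minimal, hedged sketch of the kind of check in execute_control() that raised the AutotestRunError above. This is not the actual autotest.py code; the expected-prefix set and the helper name are assumptions for illustration only.

class AutotestRunError(Exception):
    """Stand-in for autotest's error.AutotestRunError."""

def check_final_status(last_line, hostname):
    # The server expects the client's final status message to close the
    # whole client job (a top-level "END ..." entry with '----'
    # placeholders). A per-test line such as
    # "END GOOD\tlogin_SameSessionTwice\t..." is treated as unexpected
    # and aborts the job, which is the failure shown above.
    if not last_line.startswith(('END GOOD\t----', 'END ABORT\t----',
                                 'END FAIL\t----')):
        raise AutotestRunError(
            'Aborting - unexpected final status message from client on '
            '%s: %s' % (hostname, last_line))

check_final_status('END GOOD\t----\t----\ttimestamp=1492580150', 'host')  # passes
# check_final_status('END GOOD\tlogin_SameSessionTwice\t...', 'host')  # raises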


In the "status" file, the test output is correct:
START	----	----	timestamp=1492579661	localtime=Apr 18 22:27:41	
	START	login_SameSessionTwice	login_SameSessionTwice	timestamp=1492579663	localtime=Apr 18 22:27:43	
		GOOD	login_SameSessionTwice	login_SameSessionTwice	timestamp=1492579730	localtime=Apr 18 22:28:50	completed successfully
	END GOOD	login_SameSessionTwice	login_SameSessionTwice	timestamp=1492579730	localtime=Apr 18 22:28:50	
END GOOD	----	----	timestamp=1492580150	localtime=Apr 18 22:35:50	

but in the status.log, the test shows as aborted:

INFO	----	----	kernel=3.8.11	localtime=Apr 18 22:27:00	timestamp=1492579620	
START	----	----	timestamp=1492579661	localtime=Apr 18 22:27:41	
	START	login_SameSessionTwice	login_SameSessionTwice	timestamp=1492579663	localtime=Apr 18 22:27:43	
		GOOD	login_SameSessionTwice	login_SameSessionTwice	timestamp=1492579730	localtime=Apr 18 22:28:50	completed successfully
	END GOOD	login_SameSessionTwice	login_SameSessionTwice	timestamp=1492579730	localtime=Apr 18 22:28:50	
	ABORT	----	----	timestamp=1492580152	localtime=Apr 18 22:35:52	Autotest client terminated unexpectedly: DUT is pingable, SSHable and did NOT restart un-expectedly. We probably lost connectivity during the test.
END ABORT	----	----	timestamp=1492580152	localtime=Apr 18 22:35:52	
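
For reference, the tab-delimited status lines above have a simple structure: leading tabs give the nesting depth, then a status keyword, a subdir, a test name, and key=value fields. The snippet below is an illustrative parser of that structure, not the TKO parser itself.

def parse_status_line(line):
    # Leading tabs encode the nesting level of the entry.
    indent = len(line) - len(line.lstrip('\t'))
    fields = line.strip().split('\t')
    status, subdir, testname = fields[0], fields[1], fields[2]
    # Remaining fields are either key=value pairs or free-form reason text.
    keyvals = dict(f.split('=', 1) for f in fields[3:] if '=' in f)
    return {'indent': indent, 'status': status, 'subdir': subdir,
            'test': testname, 'keyvals': keyvals}

line = ('\tABORT\t----\t----\ttimestamp=1492580152\t'
        'localtime=Apr 18 22:35:52\tAutotest client terminated unexpectedly')
print(parse_status_line(line))
# {'indent': 1, 'status': 'ABORT', 'subdir': '----', 'test': '----', ...}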
Another example:

https://luci-milo.appspot.com/buildbot/chromeos/wolf-paladin/14094

platform_DMVerityCorruption_CLIENT_JOB.0

http://cautotest/tko/retrieve_logs.cgi?job=/results/113330784-chromeos-test/



INFO	----	----	kernel=3.8.11	localtime=Apr 19 06:58:28	timestamp=1492610308	
START	----	----	timestamp=1492610413	localtime=Apr 19 07:00:13	
	START	platform_DMVerityCorruption	platform_DMVerityCorruption	timestamp=1492610413	localtime=Apr 19 07:00:13	
		GOOD	platform_DMVerityCorruption	platform_DMVerityCorruption	timestamp=1492610427	localtime=Apr 19 07:00:27	completed successfully
	END GOOD	platform_DMVerityCorruption	platform_DMVerityCorruption	timestamp=1492610427	localtime=Apr 19 07:00:27	
	ABORT	----	----	timestamp=1492610565	localtime=Apr 19 07:02:45	Autotest client terminated unexpectedly: DUT is pingable, SSHable and did NOT restart un-expectedly. We probably lost connectivity during the test.
END ABORT	----	----	timestamp=1492610565	localtime=Apr 19 07:02:45
Another example:

https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/5683

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/113331037-chromeos-test/chromeos2-row1-rack8-host1/moblab_RunSuite/debug/

	START	----	----	timestamp=1492610688	localtime=Apr 19 07:04:48	
		GOOD	----	sysinfo.iteration.before	timestamp=1492610689	localtime=Apr 19 07:04:49	
		ABORT	----	----	timestamp=1492610692	localtime=Apr 19 07:04:52	Autotest client terminated unexpectedly: DUT is pingable, SSHable and did NOT restart un-expectedly. We probably lost connectivity during the test.
	END ABORT	----	----	timestamp=1492610692	localtime=Apr 19 07:04:52	

Cc: shuqianz@chromium.org
Labels: -Pri-1 Pri-0
Following up on #4:

Why wasn't the status line "END GOOD	----	----	timestamp=1492580150	localtime=Apr 18 22:35:50" added to the status.log correctly?
Cc: dshi@chromium.org jrbarnette@chromium.org
The original ZGB failure is the same as the one described in
 bug 712464 .

Every time I've seen this, it's also been accompanied by this
error message:
    Autotest client terminated unexpectedly: DUT is pingable,
    SSHable and did NOT restart un-expectedly. We probably lost
    connectivity during the test

I suspect we have two issues:
 1) Something (a bug in the lab?) causes us to lose connectivity
    to DUTs.
 2) Our retry logic does the wrong thing with the ABORT status.

How was the status transferred to the shard? The DUT may lose its connection when the test finishes, so the last 'END' log isn't reported.
Splitting this bug into two smaller bugs:

1. The TKO parser parses out two statuses for the same test, which makes the test retry twice.
2. The DUT loses its connection during client-side tests, which makes the server fail to get the right test status.

I will focus on investigating 1. Currently I think it may be due to badly formatted lines in the test's status.log, which contain sensitive phrases like "END GOOD" and mislead the parser:

INFO	----	----	timestamp=1492610586	job_abort_reason=Aborting - unexpected final status message from client on chromeos4-row1-rack7-host9:  END GOOD platform_DMVerityCorruption platform_DMVerityCorruption timestamp=1492610427 localtime=Apr 19 07:00:27  	localtime=Apr 19 07:03:06	Aborting - unexpected final status message from client on chromeos4-row1-rack7-host9:         END GOOD        platform_DMVerityCorruption        platform_DMVerityCorruption        timestamp=1492610427        localtime=Apr 19 07:00:27            

investigation continuing.
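
To illustrate the parsing hazard and one way around it (a hypothetical sketch, not the actual autotest change): scrub any embedded status keywords out of the abort reason before it is written to status.log, so the parser cannot mistake the error text for a second result for the same test.

STATUS_WORDS = ('START', 'END GOOD', 'END FAIL', 'END ABORT',
                'GOOD', 'FAIL', 'ABORT')

def scrub_error_message(msg):
    # Drop any line of the error message that looks like a status-log
    # entry, so it cannot be re-parsed as a test result later.
    kept = []
    for line in msg.splitlines():
        if any(line.lstrip().startswith(word) for word in STATUS_WORDS):
            continue
        kept.append(line)
    return '\n'.join(kept)

msg = ('Aborting - unexpected final status message from client:\n'
       '\tEND GOOD\tplatform_DMVerityCorruption\tplatform_DMVerityCorruption')
print(scrub_error_message(msg))  # only the first line survives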




We're pursuing a theory that the failures are caused by bandwidth
spikes while uploading crashes caused by bug 712102.

Cc: x...@chromium.org
Cc: hidehiko@chromium.org
Looks like a dup of crbug.com/712567?
Project Member Comment 17 by bugdroid1@chromium.org, Apr 20
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/ba28516abd333289ce2e4e4069d024f54f1d7be1

commit ba28516abd333289ce2e4e4069d024f54f1d7be1
Author: xixuan <xixuan@chromium.org>
Date: Thu Apr 20 00:48:28 2017

autotest: remove sensitive lines from error msg for further tko parsing.

The lines got from client side test may include phrase like 'END GOOD',
which will make tko parser parse the same test twice. This CL removes it
from raised error msg, so that it won't be recorded in status.log later.

BUG=chromium:713004
TEST=Ran unittest

Change-Id: Ibefc27eddd3b7c1df12aac2f66a5dec57b03c5f9
Reviewed-on: https://chromium-review.googlesource.com/482542
Reviewed-by: Ningning Xia <nxia@chromium.org>
Commit-Queue: Xixuan Wu <xixuan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/ba28516abd333289ce2e4e4069d024f54f1d7be1/server/autotest.py

We're working to pin chrome to 59.0.3065.0, which is the last
chrome version before the trouble started.  There have been two
manual uprevs since then.  We suspect that the uprev to 59.0.3068.1
has an undiscovered bug.

18:08 nxia
https://chrome-internal-review.googlesource.com/c/358148
18:09 nxia
https://chromium-review.googlesource.com/c/482707/
18:09 nxia
The last PFQ uprev was a manual uprev, so cros_pinchrome discards the last stable Chrome version 59.0.3065.0 and is going to pin back to 59.0.3064.0.
18:10 nxia
59.0.3064.0 was uprevved only one day before 59.0.3065.0.
18:10 nxia
So we're going to revert to 59.0.3064.0.
18:11 nxia
And I'll file a bug for cros_pinchrome (it was written before the cros_uprevchrome tool), since it doesn't handle manually uprevved versions quite well.
Project Member Comment 22 by bugdroid1@chromium.org, Apr 20
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/05be40f3153e8aaafe77785c66d759e6a4cec52f

commit 05be40f3153e8aaafe77785c66d759e6a4cec52f
Author: Ningning Xia <nxia@google.com>
Date: Thu Apr 20 01:18:18 2017

Chrome: Pin to version 59.0.3064.0_rc-r1

DO NOT REVERT THIS CL.
In general, reverting chrome (un)pin CLs does not do what you expect.
Instead, use `cros pinchrome` to generate new CLs.

BUG=chromium:713004
TEST=None
CQ-DEPEND=*I27cd69e3102272f369cc20d0230a7b12e4a6c394

Change-Id: I13eb0e2c4cfb4f6898a7e124fd5be013021d030b
Reviewed-on: https://chromium-review.googlesource.com/482707
Reviewed-by: Richard Barnette <jrbarnette@google.com>
Reviewed-by: Xiaoqian Dai <xdai@chromium.org>
Tested-by: Ningning Xia <nxia@chromium.org>

[add] https://crrev.com/05be40f3153e8aaafe77785c66d759e6a4cec52f/profiles/default/linux/package.mask/chromepin
[modify] https://crrev.com/05be40f3153e8aaafe77785c66d759e6a4cec52f/chromeos-base/chromeos-chrome/Manifest
[modify] https://crrev.com/05be40f3153e8aaafe77785c66d759e6a4cec52f/chromeos/binhost/target/amd64-generic-LATEST_RELEASE_CHROME_BINHOST.conf
[modify] https://crrev.com/05be40f3153e8aaafe77785c66d759e6a4cec52f/chromeos/binhost/target/daisy-LATEST_RELEASE_CHROME_BINHOST.conf
[rename] https://crrev.com/05be40f3153e8aaafe77785c66d759e6a4cec52f/chromeos-base/chromium-source/chromium-source-59.0.3064.0_rc-r1.ebuild
[modify] https://crrev.com/05be40f3153e8aaafe77785c66d759e6a4cec52f/chromeos/binhost/target/arm-generic-LATEST_RELEASE_CHROME_BINHOST.conf
[modify] https://crrev.com/05be40f3153e8aaafe77785c66d759e6a4cec52f/chromeos/binhost/target/x86-generic-LATEST_RELEASE_CHROME_BINHOST.conf
[modify] https://crrev.com/05be40f3153e8aaafe77785c66d759e6a4cec52f/chromeos/binhost/target/veyron_jerry-LATEST_RELEASE_CHROME_BINHOST.conf
[rename] https://crrev.com/05be40f3153e8aaafe77785c66d759e6a4cec52f/chromeos-base/chromeos-chrome/chromeos-chrome-59.0.3064.0_rc-r1.ebuild
[modify] https://crrev.com/05be40f3153e8aaafe77785c66d759e6a4cec52f/chromeos/binhost/host/amd64-LATEST_RELEASE_CHROME_BINHOST.conf

Cc: ihf@chromium.org
I have no idea if this Chrome is still in sync with Android.
Same with other changes that went in since that version. Just pinning Chrome to an old version may be pretty undefined. I wish you would have waited a little longer. But good luck. 
I'll monitor the CQ and unpin Chrome if the version has issues.
Cc: englab-sys-cros@google.com
 Issue 712991  has been merged into this issue.
Cc: ayatane@chromium.org
I am confused by the logs and the code: it looks like the timeout recovered after "Restarting master ssh connection":
04/18 22:30:19.578 DEBUG|      abstract_ssh:0744| Restarting master ssh connection

So the real issue is 
04/18 22:35:51.498 DEBUG|          autotest:0805| Result exit status is 255.

But I am not sure which command failed.

This only happens in paladins, and I see there are a bunch of ayatane's CLs touching autotest.
Are you aware of any of your changes that could have caused this?

I am going to throttle the tree and see if it gets better. 
The tree had already been throttled.
SandboxStatus was timing out in crbug.com/706939. Maybe another bad Chrome change got in?
To clarify: issue 706939 was successfully fixed.
> To clarify: issue 706939 was successfully fixed.

Yeah, but we pinned Chrome back to 59.0.3064.0, which may
have been current around the time of the bug (it was built
around 4/7).  We should figure out whether that bug was in
that version of Chrome.  :-(

In other news, the network load that we were blaming on
"bad Chrome causing crashes" hasn't gone down.  That suggests
that either 1) pinning Chrome didn't stop the crashes or
2) the crashes weren't the source of the load.

The CQ hasn't been green since pinning Chrome, but that could
be due to problems other than this bug...

Confirmed https://codereview.chromium.org/2783723002 is in 59.0.3064.0, but the revert CL https://codereview.chromium.org/2812743002/ is not in 59.0.3064.0.
> In other news, the network load that we were blaming on
> "bad Chrome causing crashes" hasn't gone down.  That suggests
> that either 1) pinning Chrome didn't stop the crashes or
> 2) the crashes weren't the source of the load.

Taking another look at this data, I see that the spike in network
load is outbound only.  If the problem were Chrome crashes, we'd
expect to see similar (leading) spikes in inbound traffic as we
first copied _from_ the DUTs and then trailing spikes as we copied
_to_ GS.

Also, the start of the network spikes coincided with the start of
higher disk write activity.  That suggests we suddenly started
generating new data locally on the shards, which is then offloaded.
That would suggest a change in autoserv or a server-side test.

Cc: steve...@chromium.org
For TKO retrying the test twice, we have two solutions:

1. The TKO parser for client-side tests is intended to parse the status.log in https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/113330784-chromeos-test/chromeos4-row1-rack7-host9/ into two entries: one for SERVER_JOB and one for CLIENT_JOB.0.

2. The retry mechanism allows a second retry for the same job (https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/server/cros/dynamic_suite/suite.py?q=has_following_retry&sq=package:%5Echromeos_(internal%7Cpublic)$&l=188),
but only checks that the job's status is not 'RETRIED' (https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/server/cros/dynamic_suite/suite.py?q=has_following_retry&sq=package:%5Echromeos_(internal%7Cpublic)$&l=152),
which I feel is a bug, though I don't know whether there is some other design intent behind it.

We can either stop status.log from printing lines like 'INFO	----	----	timestamp=1492610586	job_abort_reason=Aborting...', or change the retry mechanism if there is no other consideration behind it. I prefer the second option.
https://chromium-review.googlesource.com/c/483030/ was made to fix the "retrying twice" issue.
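
A rough sketch of the retry-once idea behind that CL; the Job fields and the 'RETRIED' status mirror the discussion above, but the names and the real suite.py logic are assumptions here.

from collections import namedtuple

Job = namedtuple('Job', ['id', 'status'])

def should_schedule_retry(job, retries_seen):
    # Never retry a job that was itself already retried; a duplicate
    # ABORT entry parsed for the same test must not queue a second retry.
    if job.status == 'RETRIED':
        return False
    # Allow at most one retry per original job id.
    return retries_seen.get(job.id, 0) < 1

print(should_schedule_retry(Job(1, 'ABORT'), {}))      # True: retry once
print(should_schedule_retry(Job(1, 'ABORT'), {1: 1}))  # False: already retried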
The Chrome version was just unpinned: crbug.com/713531
Project Member Comment 42 by bugdroid1@chromium.org, Apr 20
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/bf854f881b4abfc79bc1eb2b9dd5aa368b40a136

commit bf854f881b4abfc79bc1eb2b9dd5aa368b40a136
Author: xixuan <xixuan@chromium.org>
Date: Thu Apr 20 19:20:24 2017

autotest: Retry only once for the same job.

Current retry strategy retry twice for the same job, and fail to execute
the second one. This CL changes the retry to only execute once.

BUG=chromium:713004
TEST=Ran unittest.

Change-Id: I4f1352963c60e80e11c2319eae52b6677288ab9c
Reviewed-on: https://chromium-review.googlesource.com/483030
Reviewed-by: Ningning Xia <nxia@chromium.org>
Commit-Queue: Xixuan Wu <xixuan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/bf854f881b4abfc79bc1eb2b9dd5aa368b40a136/server/cros/dynamic_suite/suite.py

Update on what's known:

At this point, we believe the underlying symptom (the errors
that say "probably lost connectivity") is caused by too much
network load on the shard serving the test.  We believe the
extra load is coming from Chrome crashes (see  bug 713856 ).
We believe that the asymmetric ingress/egress numbers are caused
by failed offloads being retried (the retries try to transmit
the entire results every time).

Pinning Chrome should have stopped the crashes, but there are
multiple sources of test requests, and some of them are still
requesting testing for builds that contain the Chrome bug.  That
means that load has gone down, but not yet vanished.

We've closed the lab to all canary builds that could contain the
bug; that will stop scheduled suites from using the bad builds.
We also aborted an R59 branch build that was seeing the problem.
We may need to close the lab to those builds as well.

Have you considered not rsyncing the chrome*.core files instead of closing the lab?
rsync --exclude 'chrome.201704*core'
when pulling /autotest/results/default/ will ignore chrome cores at the DUT.
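
For reference, a hedged sketch of how that exclude would slot into the rsync pull shown in the earlier get_file log; the helper below is illustrative only, not autotest's get_file.

import subprocess

def pull_results(host, remote_dir, local_dir, ssh_cmd):
    # Mirrors the rsync options from the log above, plus the suggested
    # exclude so Chrome core dumps stay on the DUT.
    cmd = [
        'rsync', '-l', '--timeout=1800', '--rsh=%s' % ssh_cmd,
        '-az', '--no-o', '--no-g',
        '--exclude', 'chrome.201704*core',
        'root@%s:%s' % (host, remote_dir),
        local_dir,
    ]
    subprocess.check_call(cmd)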
The lab is only closed to builds that are known to be bad.

Finding all the code that might be syncing chrome core files is
more work and more risk, and when things got better, we'd have to
revert the change, too.

For now, the lab will remain closed to all ToT builds between 9455.0.0
and 9477.0.0.

Then let me unpin Chrome again so we get some nontrivial builds going.
> Then let me unpin Chrome again so we get some nontrivial builds going.

Chrome is already unpinned.  And the PFQ is testing against the latest
as well.  And we can't afford to do a manual uprev until we know for
sure that we've eliminated the source of the crashes.

I'm not sure if/how we will "know for sure".

I've been looking, or attempting to look anyway, but have not identified any crashes on the PFQ. This is not to say that they are not happening, just that if they are, they do not appear to be causing failures, and we don't seem to have any reasonable way to identify them.

I have some vague recollection of querying logs once for chrome crashes, but can not for the life of me remember how.

Project Member Comment 50 by bugdroid1@chromium.org, Apr 21
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/cb21b2937f929fbde6ca7656df36703294a642c6

commit cb21b2937f929fbde6ca7656df36703294a642c6
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Fri Apr 21 00:16:04 2017

[autotest] Don't include /var/spool/crash in test results.

This (temporarily?) excludes /var/spool/crash from sysinfo in
test results.  Shards running tests are currently overwhelmed,
apparently by the volume of Chrome crashes.  Shut them off, to
see if we can make progress.

BUG=chromium:713004
TEST=None

Change-Id: I8c0a74771251c68ca70ac1129fa9cfa3013b539e
Reviewed-on: https://chromium-review.googlesource.com/483960
Reviewed-by: Steven Bennetts <stevenjb@chromium.org>
Reviewed-by: Richard Barnette <jrbarnette@google.com>
Tested-by: Richard Barnette <jrbarnette@chromium.org>

[modify] https://crrev.com/cb21b2937f929fbde6ca7656df36703294a642c6/client/bin/site_sysinfo.py

Blockedon: 713856
Re #49: have you looked into the failed test examples in this bug (#1 ~ #6)? You can look into the Chrome crash dump files offloaded to GS; most of them have dump files in the bucket. If we can make sure all the crashes are gone, we're good.
 Issue 712958  has been merged into this issue.
Cc: xixuan@chromium.org
 Issue 712390  has been merged into this issue.
Project Member Comment 55 by bugdroid1@chromium.org, Apr 22
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/158b180499131e4a5ef2dc7b4c357702aeba9838

commit 158b180499131e4a5ef2dc7b4c357702aeba9838
Author: Ilja H. Friedel <ihf@chromium.org>
Date: Sat Apr 22 01:59:53 2017

Revert "[autotest] Don't include /var/spool/crash in test results."

This reverts commit cb21b2937f929fbde6ca7656df36703294a642c6.

BUG=chromium:713004
TEST=None

Change-Id: I9e5f0ee0c29d5cc9868c19577a8f0ecf6ce182ca
Reviewed-on: https://chromium-review.googlesource.com/483847
Reviewed-by: Ilja H. Friedel <ihf@chromium.org>
Tested-by: Ilja H. Friedel <ihf@chromium.org>

[modify] https://crrev.com/158b180499131e4a5ef2dc7b4c357702aeba9838/client/bin/site_sysinfo.py

Project Member Comment 56 by bugdroid1@chromium.org, Apr 22
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/3cf74ed60b2419125306a1e0e433d82d0c8edc3e

commit 3cf74ed60b2419125306a1e0e433d82d0c8edc3e
Author: Ilja H. Friedel <ihf@chromium.org>
Date: Sat Apr 22 02:40:06 2017

Reland "[autotest] Don't include /var/spool/crash in test results."

The uprev to 60.0.3077.0 did not contain the crash fix yet. It will be in 60.0.30787.0

This reverts commit 158b180499131e4a5ef2dc7b4c357702aeba9838.

BUG=chromium:713004
TEST=None

Change-Id: I148b9f582ec70d6edf03feb852987785f8db9835
Reviewed-on: https://chromium-review.googlesource.com/484813
Reviewed-by: Ilja H. Friedel <ihf@chromium.org>
Tested-by: Ilja H. Friedel <ihf@chromium.org>

[modify] https://crrev.com/3cf74ed60b2419125306a1e0e433d82d0c8edc3e/client/bin/site_sysinfo.py

Cc: chingcodes@chromium.org
Labels: -Pri-0 Pri-1
Now the lab is experiencing crbug.com/714571, which is unrelated to this issue. Downgrading this to P1; we will keep monitoring.

We have:

(1) skipped offloading crash core files,
(2) blocked lab tests on bad CrOS/Chrome versions, and
(3) uprevved Chrome versions.
 