Tests passed but got aborted by AutotestAbort
Issue description

https://luci-milo.appspot.com/buildbot/chromeos/x86-zgb-paladin/9650
login_SameSessionTwice
http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=113290478

04/18 22:35:53.773 DEBUG| autotest:0960| Autotest job finishes.
04/18 22:35:53.774 ERROR| server_job:0809| Exception escaped control file, job aborting:
Traceback (most recent call last):
  File "/usr/local/autotest/server/server_job.py", line 801, in run
    self._execute_code(server_control_file, namespace)
  File "/usr/local/autotest/server/server_job.py", line 1301, in _execute_code
    execfile(code_file, namespace, namespace)
  File "/usr/local/autotest/results/113290478-chromeos-test/chromeos6-row2-rack6-host16/control.srv", line 10, in <module>
    job.parallel_simple(run_client, machines)
  File "/usr/local/autotest/server/server_job.py", line 625, in parallel_simple
    return_results=return_results)
  File "/usr/local/autotest/server/subcommand.py", line 93, in parallel_simple
    function(arg)
  File "/usr/local/autotest/results/113290478-chromeos-test/chromeos6-row2-rack6-host16/control.srv", line 7, in run_client
    at.run(control, host=host, use_packaging=use_packaging)
  File "/usr/local/autotest/server/autotest.py", line 381, in run
    client_disconnect_timeout, use_packaging=use_packaging)
  File "/usr/local/autotest/server/autotest.py", line 464, in _do_run
    client_disconnect_timeout=client_disconnect_timeout)
  File "/usr/local/autotest/server/autotest.py", line 944, in execute_control
    raise error.AutotestRunError(msg)
AutotestRunError: Aborting - unexpected final status message from client on chromeos6-row2-rack6-host16: END GOOD login_SameSessionTwice login_SameSessionTwice timestamp=1492579730 localtime=Apr 18 22:28:50
,
Apr 19 2017
,
Apr 19 2017
It looks like the same issue as crbug.com/712991 (Peach-pit Paladin). The test itself passed, but the job failed during post-processing.

==============
04/18 22:29:47.322 WARNI| base_utils:0912| run process timeout (30) fired on: /usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_koCfCbssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos6-row2-rack6-host16 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::wait_up|is_up|ssh_ping] -> ssh_run(true)\";fi; true"
04/18 22:29:48.330 DEBUG| ssh_host:0212| retrying ssh command after timeout
04/18 22:30:18.569 WARNI| base_utils:0912| run process timeout (30) fired on: /usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_koCfCbssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos6-row2-rack6-host16 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::wait_up|is_up|ssh_ping] -> ssh_run(true)\";fi; true"
04/18 22:30:19.578 DEBUG| ssh_host:0218| retry 2: restarting master connection
04/18 22:30:19.578 DEBUG| abstract_ssh:0744| Restarting master ssh connection
04/18 22:30:19.578 DEBUG| abstract_ssh:0756| Nuking master_ssh_job.
04/18 22:30:20.581 DEBUG| abstract_ssh:0762| Cleaning master_ssh_tempdir.
04/18 22:30:20.582 INFO | abstract_ssh:0809| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_GgMOwMssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos6-row2-rack6-host16'
04/18 22:30:20.583 DEBUG| base_utils:0185| Running '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_GgMOwMssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos6-row2-rack6-host16'
04/18 22:30:42.035 DEBUG| abstract_ssh:0587| Host chromeos6-row2-rack6-host16 is now up
04/18 22:30:42.036 DEBUG| abstract_ssh:0346| get_file. source: /usr/local/autotest/results/default/, dest: /usr/local/autotest/results/113290478-chromeos-test/chromeos6-row2-rack6-host16, delete_dest: False,preserve_perm: True, preserve_symlinks:True
04/18 22:30:42.037 DEBUG| abstract_ssh:0357| Using Rsync.
04/18 22:30:42.037 DEBUG| base_utils:0185| Running 'rsync -l --timeout=1800 --rsh='/usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_GgMOwMssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22' -az --no-o --no-g root@chromeos6-row2-rack6-host16:"/usr/local/autotest/results/default/" "/usr/local/autotest/results/113290478-chromeos-test/chromeos6-row2-rack6-host16"'
04/18 22:35:50.989 DEBUG| ssh_host:0284| Running (ssh) 'echo A > /usr/local/autotest/tmp/_autotmp_JHXY3Dharness-fifo/autoserv.fifo'
04/18 22:35:51.498 DEBUG| autotest:0805| Result exit status is 255.
,
Apr 19 2017
https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/server/autotest.py?type=cs&q=execute_control&sq=package:%5Echromeos_(internal%7Cpublic)$&l=921

In the "status" file, the test output is correct:

START ---- ---- timestamp=1492579661 localtime=Apr 18 22:27:41
START login_SameSessionTwice login_SameSessionTwice timestamp=1492579663 localtime=Apr 18 22:27:43
GOOD login_SameSessionTwice login_SameSessionTwice timestamp=1492579730 localtime=Apr 18 22:28:50 completed successfully
END GOOD login_SameSessionTwice login_SameSessionTwice timestamp=1492579730 localtime=Apr 18 22:28:50
END GOOD ---- ---- timestamp=1492580150 localtime=Apr 18 22:35:50

but in the status.log, the test shows as aborted:

INFO ---- ---- kernel=3.8.11 localtime=Apr 18 22:27:00 timestamp=1492579620
START ---- ---- timestamp=1492579661 localtime=Apr 18 22:27:41
START login_SameSessionTwice login_SameSessionTwice timestamp=1492579663 localtime=Apr 18 22:27:43
GOOD login_SameSessionTwice login_SameSessionTwice timestamp=1492579730 localtime=Apr 18 22:28:50 completed successfully
END GOOD login_SameSessionTwice login_SameSessionTwice timestamp=1492579730 localtime=Apr 18 22:28:50
ABORT ---- ---- timestamp=1492580152 localtime=Apr 18 22:35:52 Autotest client terminated unexpectedly: DUT is pingable, SSHable and did NOT restart un-expectedly. We probably lost connectivity during the test.
END ABORT ---- ---- timestamp=1492580152 localtime=Apr 18 22:35:52
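The difference between the two files can be made explicit by reducing each line to its leading status keyword. The following is only a minimal Python 3 sketch, not autotest code: the helper name `status_keywords` is hypothetical, and the sample text is the pair of logs quoted in this comment, with whitespace-separated fields assumed.

```python
# Sample logs copied from the comment above (fields separated by whitespace).
CLIENT_STATUS = """\
START ---- ---- timestamp=1492579661 localtime=Apr 18 22:27:41
START login_SameSessionTwice login_SameSessionTwice timestamp=1492579663 localtime=Apr 18 22:27:43
GOOD login_SameSessionTwice login_SameSessionTwice timestamp=1492579730 localtime=Apr 18 22:28:50 completed successfully
END GOOD login_SameSessionTwice login_SameSessionTwice timestamp=1492579730 localtime=Apr 18 22:28:50
END GOOD ---- ---- timestamp=1492580150 localtime=Apr 18 22:35:50
"""

SERVER_STATUS_LOG = """\
INFO ---- ---- kernel=3.8.11 localtime=Apr 18 22:27:00 timestamp=1492579620
START ---- ---- timestamp=1492579661 localtime=Apr 18 22:27:41
START login_SameSessionTwice login_SameSessionTwice timestamp=1492579663 localtime=Apr 18 22:27:43
GOOD login_SameSessionTwice login_SameSessionTwice timestamp=1492579730 localtime=Apr 18 22:28:50 completed successfully
END GOOD login_SameSessionTwice login_SameSessionTwice timestamp=1492579730 localtime=Apr 18 22:28:50
ABORT ---- ---- timestamp=1492580152 localtime=Apr 18 22:35:52 Autotest client terminated unexpectedly: DUT is pingable, SSHable and did NOT restart un-expectedly. We probably lost connectivity during the test.
END ABORT ---- ---- timestamp=1492580152 localtime=Apr 18 22:35:52
"""


def status_keywords(text):
    """Return the leading status keyword of every non-empty line.

    "END GOOD" / "END ABORT" are kept as two-word keywords so that the
    closing line of a group can be told apart from a plain GOOD/ABORT.
    """
    keywords = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "END" and len(fields) > 1:
            keywords.append("END " + fields[1])
        else:
            keywords.append(fields[0])
    return keywords


if __name__ == "__main__":
    print("status:    ", status_keywords(CLIENT_STATUS))
    print("status.log:", status_keywords(SERVER_STATUS_LOG))
```

Both files agree up to the test's own END GOOD line. Only the job-level closing entry differs: the client wrote END GOOD, while the server recorded ABORT followed by END ABORT.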
,
Apr 19 2017
Another example:
https://luci-milo.appspot.com/buildbot/chromeos/wolf-paladin/14094
platform_DMVerityCorruption_CLIENT_JOB.0
http://cautotest/tko/retrieve_logs.cgi?job=/results/113330784-chromeos-test/

INFO ---- ---- kernel=3.8.11 localtime=Apr 19 06:58:28 timestamp=1492610308
START ---- ---- timestamp=1492610413 localtime=Apr 19 07:00:13
START platform_DMVerityCorruption platform_DMVerityCorruption timestamp=1492610413 localtime=Apr 19 07:00:13
GOOD platform_DMVerityCorruption platform_DMVerityCorruption timestamp=1492610427 localtime=Apr 19 07:00:27 completed successfully
END GOOD platform_DMVerityCorruption platform_DMVerityCorruption timestamp=1492610427 localtime=Apr 19 07:00:27
ABORT ---- ---- timestamp=1492610565 localtime=Apr 19 07:02:45 Autotest client terminated unexpectedly: DUT is pingable, SSHable and did NOT restart un-expectedly. We probably lost connectivity during the test.
END ABORT ---- ---- timestamp=1492610565 localtime=Apr 19 07:02:45
,
Apr 19 2017
Another example:
https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/5683
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/113331037-chromeos-test/chromeos2-row1-rack8-host1/moblab_RunSuite/debug/

START ---- ---- timestamp=1492610688 localtime=Apr 19 07:04:48
GOOD ---- sysinfo.iteration.before timestamp=1492610689 localtime=Apr 19 07:04:49
ABORT ---- ---- timestamp=1492610692 localtime=Apr 19 07:04:52 Autotest client terminated unexpectedly: DUT is pingable, SSHable and did NOT restart un-expectedly. We probably lost connectivity during the test.
END ABORT ---- ---- timestamp=1492610692 localtime=Apr 19 07:04:52
,
Apr 19 2017
,
Apr 19 2017
,
Apr 19 2017
Following up on #4: why wasn't the status "END GOOD ---- ---- timestamp=1492580150 localtime=Apr 18 22:35:50" added to status.log correctly?
,
Apr 19 2017
,
Apr 19 2017
The original ZGB failure is the same as the one described in bug 712464. Every time I've seen this, it's also been accompanied by this error message:

Autotest client terminated unexpectedly: DUT is pingable, SSHable and did NOT restart un-expectedly. We probably lost connectivity during the test

I suspect we have two issues:
1) Something (a bug in the lab?) causes us to lose connectivity to DUTs.
2) Our retry logic does the wrong thing with the ABORT status.
,
Apr 19 2017
How was the status transferred to the shard? The DUT may have lost its connection when the test finished, so the last 'END' log line wasn't reported.
,
Apr 19 2017
Splitting this bug into 2 smaller bugs:
1. The tko parser parses out 2 statuses for the same test, which makes the test retry twice.
2. The DUT loses its connection during client-side tests, which prevents the server from getting the right test status.

I will focus on investigating 1. Currently I think it may be due to badly formatted lines in the test's status.log that contain sensitive phrases like "END GOOD" and mislead the parser:

INFO ---- ---- timestamp=1492610586 job_abort_reason=Aborting - unexpected final status message from client on chromeos4-row1-rack7-host9: END GOOD platform_DMVerityCorruption platform_DMVerityCorruption timestamp=1492610427 localtime=Apr 19 07:00:27 localtime=Apr 19 07:03:06 Aborting - unexpected final status message from client on chromeos4-row1-rack7-host9: END GOOD platform_DMVerityCorruption platform_DMVerityCorruption timestamp=1492610427 localtime=Apr 19 07:00:27

Investigation continuing.
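To illustrate the hypothesis (and only the hypothesis; this is not the tko parser's actual logic, which is not shown in this thread), here is a small Python 3 sketch. A hypothetical scanner that looks for "END GOOD <test>" anywhere in a line counts the test as finished twice, because the phrase also appears inside the job_abort_reason text. A scanner anchored at the start of the line counts it once. The function name `count_test_ends` is made up, and the sample INFO line is abridged from the one quoted above.

```python
import re

# Lines taken from the status.log quoted in this thread (INFO line abridged).
STATUS_LOG = """\
START platform_DMVerityCorruption platform_DMVerityCorruption timestamp=1492610413 localtime=Apr 19 07:00:13
GOOD platform_DMVerityCorruption platform_DMVerityCorruption timestamp=1492610427 localtime=Apr 19 07:00:27 completed successfully
END GOOD platform_DMVerityCorruption platform_DMVerityCorruption timestamp=1492610427 localtime=Apr 19 07:00:27
INFO ---- ---- timestamp=1492610586 job_abort_reason=Aborting - unexpected final status message from client on chromeos4-row1-rack7-host9: END GOOD platform_DMVerityCorruption platform_DMVerityCorruption timestamp=1492610427 localtime=Apr 19 07:00:27
"""

# "END GOOD" followed by the test name.
END_MARKER = re.compile(r"END GOOD\s+(\S+)")


def count_test_ends(log_text, anchored):
    """Count "END GOOD <test>" occurrences per test name.

    anchored=True  -> the marker must start the line (re.match).
    anchored=False -> the marker may appear anywhere (re.search), which is
                      the hypothetical failure mode being illustrated.
    """
    counts = {}
    for line in log_text.splitlines():
        found = END_MARKER.match(line) if anchored else END_MARKER.search(line)
        if found:
            name = found.group(1)
            counts[name] = counts.get(name, 0) + 1
    return counts


if __name__ == "__main__":
    print("unanchored:", count_test_ends(STATUS_LOG, anchored=False))
    print("anchored:  ", count_test_ends(STATUS_LOG, anchored=True))
```

If the real parser behaves like the unanchored variant for this kind of line, one test would yield two results, which would match the double retry described in item 1.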
,
Apr 20 2017
We're pursuing a theory that the failures are caused by bandwidth spikes while uploading crashes caused by bug 712102.
,
Apr 20 2017
,
Apr 20 2017
,
Apr 20 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/ba28516abd333289ce2e4e4069d024f54f1d7be1

commit ba28516abd333289ce2e4e4069d024f54f1d7be1
Author: xixuan <xixuan@chromium.org>
Date: Thu Apr 20 00:48:28 2017

autotest: remove sensitive lines from error msg for further tko parsing.

The lines got from client side test may include phrase like 'END GOOD', which will make tko parser parse the same test twice. This CL removes it from raised error msg, so that it won't be recorded in status.log later.

BUG=chromium:713004
TEST=Ran unittest
Change-Id: Ibefc27eddd3b7c1df12aac2f66a5dec57b03c5f9
Reviewed-on: https://chromium-review.googlesource.com/482542
Reviewed-by: Ningning Xia <nxia@chromium.org>
Commit-Queue: Xixuan Wu <xixuan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/ba28516abd333289ce2e4e4069d024f54f1d7be1/server/autotest.py
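The idea in this commit message can be sketched roughly as follows. This is an assumption-laden Python 3 illustration, not the code of the CL: the helper `build_abort_message` is hypothetical, and the actual change to server/autotest.py may look different. The point is that the client's raw status line is kept out of the exception text and sent to the debug log instead, so it cannot end up in status.log.

```python
import logging

logger = logging.getLogger("abort_message_sketch")


def build_abort_message(hostname, last_client_line):
    """Return an abort message that omits the client's raw status line.

    The raw line (which may contain phrases such as 'END GOOD') is only
    written to the debug log, not embedded in the returned message.
    """
    logger.debug("Unexpected final status line from %s: %r",
                 hostname, last_client_line)
    return ("Aborting - unexpected final status message from client on %s"
            % hostname)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    line = ("END GOOD login_SameSessionTwice login_SameSessionTwice "
            "timestamp=1492579730 localtime=Apr 18 22:28:50")
    print(build_abort_message("chromeos6-row2-rack6-host16", line))
```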
,
Apr 20 2017
We're working to pin chrome to 59.0.3065.0, which is the last chrome version before the trouble started. There have been two manual uprevs since then. We suspect that the uprev to 59.0.3068.1 has an undiscovered bug.
,
Apr 20 2017
18:08 nxia https://chrome-internal-review.googlesource.com/c/358148
18:09 nxia https://chromium-review.googlesource.com/c/482707/
18:09 nxia the last PFQ uprev was a manual uprev, so cros_pinchrome discards the last stable chrome version 59.0.3065.0 and is going to pin it back to 59.0.3064.0
18:10 nxia 59.0.3064.0 was only upreved one day before 59.0.3065.0
18:10 nxia so we're going to revert to 59.0.3064.0.
18:11 nxia and I'll file a bug for cros_pinchrome (it was written before the cros_uprevchrome tool), so it didn't handle the manual uprev version quite well
,
Apr 20 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/05be40f3153e8aaafe77785c66d759e6a4cec52f

commit 05be40f3153e8aaafe77785c66d759e6a4cec52f
Author: Ningning Xia <nxia@google.com>
Date: Thu Apr 20 01:18:18 2017

Chrome: Pin to version 59.0.3064.0_rc-r1

DO NOT REVERT THIS CL. In general, reverting chrome (un)pin CLs does not do what you expect. Instead, use `cros pinchrome` to generate new CLs.

BUG=chromium:713004
TEST=None
CQ-DEPEND=*I27cd69e3102272f369cc20d0230a7b12e4a6c394
Change-Id: I13eb0e2c4cfb4f6898a7e124fd5be013021d030b
Reviewed-on: https://chromium-review.googlesource.com/482707
Reviewed-by: Richard Barnette <jrbarnette@google.com>
Reviewed-by: Xiaoqian Dai <xdai@chromium.org>
Tested-by: Ningning Xia <nxia@chromium.org>

[add] https://crrev.com/05be40f3153e8aaafe77785c66d759e6a4cec52f/profiles/default/linux/package.mask/chromepin
[modify] https://crrev.com/05be40f3153e8aaafe77785c66d759e6a4cec52f/chromeos-base/chromeos-chrome/Manifest
[modify] https://crrev.com/05be40f3153e8aaafe77785c66d759e6a4cec52f/chromeos/binhost/target/amd64-generic-LATEST_RELEASE_CHROME_BINHOST.conf
[modify] https://crrev.com/05be40f3153e8aaafe77785c66d759e6a4cec52f/chromeos/binhost/target/daisy-LATEST_RELEASE_CHROME_BINHOST.conf
[rename] https://crrev.com/05be40f3153e8aaafe77785c66d759e6a4cec52f/chromeos-base/chromium-source/chromium-source-59.0.3064.0_rc-r1.ebuild
[modify] https://crrev.com/05be40f3153e8aaafe77785c66d759e6a4cec52f/chromeos/binhost/target/arm-generic-LATEST_RELEASE_CHROME_BINHOST.conf
[modify] https://crrev.com/05be40f3153e8aaafe77785c66d759e6a4cec52f/chromeos/binhost/target/x86-generic-LATEST_RELEASE_CHROME_BINHOST.conf
[modify] https://crrev.com/05be40f3153e8aaafe77785c66d759e6a4cec52f/chromeos/binhost/target/veyron_jerry-LATEST_RELEASE_CHROME_BINHOST.conf
[rename] https://crrev.com/05be40f3153e8aaafe77785c66d759e6a4cec52f/chromeos-base/chromeos-chrome/chromeos-chrome-59.0.3064.0_rc-r1.ebuild
[modify] https://crrev.com/05be40f3153e8aaafe77785c66d759e6a4cec52f/chromeos/binhost/host/amd64-LATEST_RELEASE_CHROME_BINHOST.conf
,
Apr 20 2017
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/overlays/chromeos-partner-overlay/+/84f7acd0870d893cafca32723344e817925952fd

commit 84f7acd0870d893cafca32723344e817925952fd
Author: Ningning Xia <nxia@google.com>
Date: Thu Apr 20 01:18:22 2017
,
Apr 20 2017
I have no idea if this Chrome is still in sync with Android.
,
Apr 20 2017
Same with other changes that went in since that version. Just pinning Chrome to an old version may be pretty undefined. I wish you would have waited a little longer. But good luck.
,
Apr 20 2017
I'll monitor the CQ. I'll unpin Chrome if this version has issues.
,
Apr 20 2017
,
Apr 20 2017
Issue 712991 has been merged into this issue.
,
Apr 20 2017
I am so confused by the logs and code. It looks like the timeout was recovered after "restarting master ssh connection":

04/18 22:30:19.578 DEBUG| abstract_ssh:0744| Restarting master ssh connection

So the real issue is:

04/18 22:35:51.498 DEBUG| autotest:0805| Result exit status is 255.

But I am not sure which command failed. This only happens in paladin, and I saw a bunch of ayatane's CLs about autotest. Are you aware of any of your changes that could cause this? I am going to throttle the tree and see if it gets better.
,
Apr 20 2017
The tree had already been throttled.
,
Apr 20 2017
security_SandboxStatus timeout (crbug.com/713531) is the only failure in the last CQ run master-paladin/14349.
https://luci-milo.appspot.com/buildbot/chromeos/veyron_mighty-paladin/5056
https://luci-milo.appspot.com/buildbot/chromeos/master-paladin/14349
,
Apr 20 2017
SandboxStatus was timing out in crbug.com/706939. Maybe another bad Chrome change got in?
,
Apr 20 2017
To clarify: issue 706939 was successfully fixed.
,
Apr 20 2017
> To clarify: issue 706939 was successfully fixed.

Yeah, but we pinned Chrome back to 59.0.3064.0, which may have been current around the time of the bug (it was built around 4/7). We should figure out whether that bug was in that version of Chrome. :-(

In other news, the network load that we were blaming on "bad Chrome causing crashes" hasn't gone down. That suggests that either 1) pinning Chrome didn't stop the crashes or 2) the crashes weren't the source of the load.

The CQ hasn't been green since pinning Chrome, but that could be due to problems other than this bug...
,
Apr 20 2017
Confirmed https://codereview.chromium.org/2783723002 is in 59.0.3064.0, but the revert CL https://codereview.chromium.org/2812743002/ is not in 59.0.3064.0.
,
Apr 20 2017
> In other news, the network load that we were blaming on
> "bad Chrome causing crashes" hasn't gone down. That suggests
> that either 1) pinning Chrome didn't stop the crashes or
> 2) the crashes weren't the source of the load.

Taking another look at this data, I see that the spike in network load is outbound only. If the problem were Chrome crashes, we'd expect to see similar (leading) spikes in inbound traffic as we first copied _from_ the DUTs and then trailing spikes as we copied _to_ GS.

Also, the start of the network spikes coincided with the start of higher disk write activity. That suggests we suddenly started generating new data locally on the shards, which is then offloaded. That would suggest a change in autoserv or a server-side test.
,
Apr 20 2017
,
Apr 20 2017
On tko causing the test to be retried twice, there are two things to note:

1. When parsing a client-side test, the tko parser is intended to parse the status.log in https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/113330784-chromeos-test/chromeos4-row1-rack7-host9/ into 2 entries: one for SERVER_JOB and one for CLIENT_JOB.0.
2. The retry mechanism allows a second retry for the same job (https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/server/cros/dynamic_suite/suite.py?q=has_following_retry&sq=package:%5Echromeos_(internal%7Cpublic)$&l=188), but then checks that the job's status is not 'RETRIED' (https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/server/cros/dynamic_suite/suite.py?q=has_following_retry&sq=package:%5Echromeos_(internal%7Cpublic)$&l=152). I think this is a bug, but I don't know whether there is some other design reason behind it.

We can either stop status.log from printing lines like 'INFO ---- ---- timestamp=1492610586 job_abort_reason=Aborting...', or change the retry mechanism if nothing else depends on its current behavior. I prefer the second option.
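The second option, retrying a given job at most once even when the parser reports two entries for it, could look roughly like the Python 3 sketch below. All names here (`RetryTracker`, `schedule_retries`) are hypothetical and are not the classes in suite.py; the job id and the SERVER_JOB / CLIENT_JOB.0 labels are taken from this thread.

```python
class RetryTracker:
    """Hypothetical bookkeeping: allow at most one retry per original job id."""

    def __init__(self):
        self._retried = set()

    def should_retry(self, job_id):
        # A job that has already been retried must not be retried again.
        return job_id not in self._retried

    def mark_retried(self, job_id):
        self._retried.add(job_id)


def schedule_retries(failed_results, tracker):
    """Return the (job_id, entry) pairs for which a retry is scheduled.

    failed_results is a list of (job_id, entry_name) pairs; the same job id
    may appear more than once when the parser yields several entries.
    """
    scheduled = []
    for job_id, entry_name in failed_results:
        if tracker.should_retry(job_id):
            tracker.mark_retried(job_id)
            scheduled.append((job_id, entry_name))
    return scheduled


if __name__ == "__main__":
    # One job, two parsed entries, as described in item 1 above.
    results = [(113330784, "SERVER_JOB"), (113330784, "CLIENT_JOB.0")]
    print(schedule_retries(results, RetryTracker()))
```

Keying the decision on the job id rather than on the individual parsed entry is the design choice that keeps a duplicated status from triggering a second retry.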
,
Apr 20 2017
https://chromium-review.googlesource.com/c/483030/ is the fix for "retrying twice".
,
Apr 20 2017
The Chrome version was just unpinned (crbug.com/713531).
,
Apr 20 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/bf854f881b4abfc79bc1eb2b9dd5aa368b40a136

commit bf854f881b4abfc79bc1eb2b9dd5aa368b40a136
Author: xixuan <xixuan@chromium.org>
Date: Thu Apr 20 19:20:24 2017

autotest: Retry only once for the same job.

Current retry strategy retry twice for the same job, and fail to execute the second one. This CL changes the retry to only execute once.

BUG=chromium:713004
TEST=Ran unittest.
Change-Id: I4f1352963c60e80e11c2319eae52b6677288ab9c
Reviewed-on: https://chromium-review.googlesource.com/483030
Reviewed-by: Ningning Xia <nxia@chromium.org>
Commit-Queue: Xixuan Wu <xixuan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/bf854f881b4abfc79bc1eb2b9dd5aa368b40a136/server/cros/dynamic_suite/suite.py
,
Apr 20 2017
Update on what's known:

At this point, we believe the underlying symptom (the errors that say "probably lost connectivity") is caused by too much network load on the shard serving the test. We believe the extra load is coming from Chrome crashes (see bug 713856). We believe that the asymmetric ingress/egress numbers are caused by failed offloads being retried (the retries try to transmit the entire results every time).

Pinning Chrome should have stopped the crashes, but there are multiple sources of test requests, and some of them are still requesting testing for builds that contain the Chrome bug. That means that load has gone down, but not yet vanished.

We've closed the lab to all canary builds that could contain the bug; that will stop scheduled suites from using the bad builds. We also aborted an R59 branch build that was seeing the problem. We may need to close the lab to those builds as well.
,
Apr 20 2017
Have you considered not rsyncing the chrome*.core files instead of closing the lab?
,
Apr 20 2017
rsync --exclude 'chrome.201704*core' when pulling /autotest/results/default/ will ignore chrome cores at the DUT.
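As a rough illustration of this suggestion, the Python 3 sketch below builds an rsync argument list with an `--exclude` pattern. It is not autotest's `get_file` implementation: the helper `build_rsync_pull` and the local destination path are hypothetical, the ssh `--rsh` options seen in the logs are omitted for brevity, and the command is only printed, never executed. `shlex.join` requires Python 3.8 or later.

```python
import shlex


def build_rsync_pull(host, remote_dir, local_dir, exclude_patterns=()):
    """Build an rsync argv that pulls remote_dir from host into local_dir.

    Each entry in exclude_patterns becomes one --exclude=PATTERN argument.
    Passing a list (instead of a shell string) avoids quoting problems with
    the '*' in the pattern.
    """
    cmd = ["rsync", "-az", "--timeout=1800"]
    for pattern in exclude_patterns:
        cmd.append("--exclude=" + pattern)
    cmd.append("root@%s:%s" % (host, remote_dir))
    cmd.append(local_dir)
    return cmd


if __name__ == "__main__":
    cmd = build_rsync_pull(
        "chromeos6-row2-rack6-host16",
        "/usr/local/autotest/results/default/",
        "/tmp/results-sketch",            # hypothetical local destination
        exclude_patterns=["chrome.201704*core"],
    )
    # Print a shell-quoted form for inspection only.
    print(shlex.join(cmd))
```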
,
Apr 20 2017
The lab is only closed to builds that are known to be bad. Finding all the code that might be syncing chrome core files is more work and more risk, and when things got better, we'd have to revert the change, too. For now, the lab will remain closed to all ToT builds between 9455.0.0 and 9477.0.0.
,
Apr 20 2017
Then let me unpin Chrome again so we get some nontrivial builds going.
,
Apr 20 2017
> Then let me unpin Chrome again so we get some nontrivial builds going. Chrome is already unpinned. And the PFQ is testing against the latest as well. And we can't afford to do a manual uprev until we know for sure that we've eliminated the source of the crashes.
,
Apr 20 2017
I'm not sure if/how we will "know for sure". I've been looking, or attempting to look anyway, but have not identified any crashes on the PFQ. This is not to say that they are not happening, just that if they are, they do not appear to be causing failures, and we don't seem to have any reasonable way to identify them. I have some vague recollection of querying logs once for Chrome crashes, but cannot for the life of me remember how.
,
Apr 21 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/cb21b2937f929fbde6ca7656df36703294a642c6

commit cb21b2937f929fbde6ca7656df36703294a642c6
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Fri Apr 21 00:16:04 2017

[autotest] Don't include /var/spool/crash in test results.

This (temporarily?) excludes /var/spool/crash from sysinfo in test results. Shards running tests are currently overwhelmed, apparently by the volume of Chrome crashes. Shut them off, to see if we can make progress.

BUG=chromium:713004
TEST=None
Change-Id: I8c0a74771251c68ca70ac1129fa9cfa3013b539e
Reviewed-on: https://chromium-review.googlesource.com/483960
Reviewed-by: Steven Bennetts <stevenjb@chromium.org>
Reviewed-by: Richard Barnette <jrbarnette@google.com>
Tested-by: Richard Barnette <jrbarnette@chromium.org>

[modify] https://crrev.com/cb21b2937f929fbde6ca7656df36703294a642c6/client/bin/site_sysinfo.py
,
Apr 21 2017
,
Apr 21 2017
Re #49, have you looked into the failed test examples in this bug (#1 ~ #6)? You can look at the Chrome crash dump files offloaded to GS. Most of the examples have dump files in the bucket. If we can make sure all the crashes are gone, we're good.
,
Apr 22 2017
Issue 712958 has been merged into this issue.
,
Apr 22 2017
,
Apr 22 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/158b180499131e4a5ef2dc7b4c357702aeba9838

commit 158b180499131e4a5ef2dc7b4c357702aeba9838
Author: Ilja H. Friedel <ihf@chromium.org>
Date: Sat Apr 22 01:59:53 2017

Revert "[autotest] Don't include /var/spool/crash in test results."

This reverts commit cb21b2937f929fbde6ca7656df36703294a642c6.

BUG=chromium:713004
TEST=None
Change-Id: I9e5f0ee0c29d5cc9868c19577a8f0ecf6ce182ca
Reviewed-on: https://chromium-review.googlesource.com/483847
Reviewed-by: Ilja H. Friedel <ihf@chromium.org>
Tested-by: Ilja H. Friedel <ihf@chromium.org>

[modify] https://crrev.com/158b180499131e4a5ef2dc7b4c357702aeba9838/client/bin/site_sysinfo.py
,
Apr 22 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/3cf74ed60b2419125306a1e0e433d82d0c8edc3e

commit 3cf74ed60b2419125306a1e0e433d82d0c8edc3e
Author: Ilja H. Friedel <ihf@chromium.org>
Date: Sat Apr 22 02:40:06 2017

Reland "[autotest] Don't include /var/spool/crash in test results."

The uprev to 60.0.3077.0 did not contain the crash fix yet. It will be in 60.0.30787.0

This reverts commit 158b180499131e4a5ef2dc7b4c357702aeba9838.

BUG=chromium:713004
TEST=None
Change-Id: I148b9f582ec70d6edf03feb852987785f8db9835
Reviewed-on: https://chromium-review.googlesource.com/484813
Reviewed-by: Ilja H. Friedel <ihf@chromium.org>
Tested-by: Ilja H. Friedel <ihf@chromium.org>

[modify] https://crrev.com/3cf74ed60b2419125306a1e0e433d82d0c8edc3e/client/bin/site_sysinfo.py
,
Apr 24 2017
The lab is now experiencing crbug.com/714571, which is unrelated to this issue. Downgrading this to P1; we will keep monitoring. So far we have (1) skipped offloading crash core files, (2) blocked lab tests on bad CrOS/Chrome versions, and (3) uprev'ed Chrome versions.
,
Jun 8 2018
Comment 1 by nxia@chromium.org
, Apr 19 2017