
Issue 779209


Issue metadata

Status: Unconfirmed
Owner: ----
Cc:
EstimatedDays: ----
NextAction: ----
OS: Linux
Pri: 2
Type: Feature




Stop on first failure in VMTests

Project Member Reported by pprabhu@chromium.org, Oct 27 2017

Issue description

Filed for https://luci-milo.appspot.com/buildbot/chromeos/betty-arc64-paladin/828

45 minutes is too long for a VMTest run, and I think we're waiting around for the same failures over and over again. Perhaps we should fail fast in VMTests on the CQ (that is, fail at the first error encountered)?

10/27 13:57:21.904 INFO | test_runner_utils:0199| autoserv| run process timeout (299.999954939) fired on: /usr/bin/ssh -a -x  -F /dev/null -i /dev/null  -o ControlPath=/tmp/_autotmp_FWlc45ssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 9228 127.0.0.1 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::execute_section|_execute_daemon|run] -> ssh_run(/usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-tHCEyq 0 0)\";fi; /usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-tHCEyq 0 0"

10/27 14:02:38.024 INFO | test_runner_utils:0199| autoserv| Running 'rsync -L  --timeout=1800 --rsh='/usr/bin/ssh -a -x  -F /dev/null -i /dev/null -o ControlPath=/tmp/_autotmp_3WhkZfssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 9228' -az --no-o --no-g  "/build/betty-arc64/usr/local/build/autotest/packages/packages.checksum" "root@127.0.0.1:"/usr/local/autotest/packages.checksum""'
10/27 14:02:38.125 INFO | test_runner_utils:0199| autoserv| Running (ssh) 'echo B > /usr/local/autotest/tmp/_autotmp_3039WVharness-fifo/autoserv.fifo' from '_wait_for_commands|process_output|write|_process_line|run|run_very_slowly'
10/27 14:05:17.680 INFO | test_runner_utils:0199| autoserv| AUTOTEST_STATUS::		FAIL	security_NetworkListeners	security_NetworkListeners	timestamp=1509131116	localtime=Oct 27 14:05:16	Android did not boot!

10/27 14:09:20.291 INFO | test_runner_utils:0199| autoserv| Running (ssh) 'echo B > /usr/local/autotest/tmp/_autotmp_0hb4e3harness-fifo/autoserv.fifo' from '_wait_for_commands|process_output|write|_process_line|run|run_very_slowly'
10/27 14:12:35.687 INFO | test_runner_utils:0199| autoserv| AUTOTEST_STATUS::		GOOD	login_Cryptohome	login_Cryptohome	timestamp=1509131554

10/27 14:22:48.897 INFO | test_runner_utils:0199| autoserv| Running (ssh) 'echo B > /usr/local/autotest/tmp/_autotmp_i2C_5iharness-fifo/autoserv.fifo' from '_wait_for_commands|process_output|write|_process_line|run|run_very_slowly'
10/27 14:34:17.647 INFO | test_runner_utils:0199| autoserv| AUTOTEST_STATUS::		FAIL	login_CryptohomeIncognito	login_CryptohomeIncognito	
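
To put a number on how long the run kept going after the first failure, the AUTOTEST_STATUS lines above can be parsed. This is only a rough sketch: the regular expression is inferred from the three status lines in this excerpt (GOOD/FAIL only), the function names are made up for illustration, and the year is an assumption taken from the report date because the log timestamps do not carry one.

```python
import re
from datetime import datetime

# Line shape inferred from the excerpt in this issue, not from an autotest spec:
# "<MM/DD HH:MM:SS.mmm> ... AUTOTEST_STATUS::<whitespace><STATUS><tab><test>..."
STATUS_RE = re.compile(
    r"^(?P<stamp>\d\d/\d\d \d\d:\d\d:\d\d\.\d+)\s.*AUTOTEST_STATUS::\s+"
    r"(?P<status>GOOD|FAIL)\s+(?P<test>\S+)"
)


def parse_status_lines(lines, year=2017):
    """Return (timestamp, status, test_name) for each matching status line.

    `year` is an assumption: the log timestamps omit it.
    """
    results = []
    for line in lines:
        match = STATUS_RE.match(line)
        if match:
            stamp = datetime.strptime(
                "%d %s" % (year, match.group("stamp")), "%Y %m/%d %H:%M:%S.%f"
            )
            results.append((stamp, match.group("status"), match.group("test")))
    return results


def seconds_after_first_failure(results):
    """Seconds between the first FAIL and the last reported status, or None."""
    failures = [stamp for stamp, status, _ in results if status == "FAIL"]
    if not failures:
        return None
    return (results[-1][0] - failures[0]).total_seconds()
```

Fed the three status lines above, this reports roughly 29 minutes between the first FAIL (security_NetworkListeners) and the last one (login_CryptohomeIncognito).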
 
Cc: akes...@chromium.org dgarr...@chromium.org davidri...@chromium.org jrbarnette@chromium.org
Labels: -Type-Bug -Pri-3 Pri-2 Type-Feature
Owner: pprabhu@chromium.org
Summary: Stop on first failure in VMTests (was: betty vmtest failure mode is slow)
Looking at the worst offender (12 minutes) login_CryptohomeIncognito: https://pantheon.corp.google.com/storage/browser/chromeos-image-archive/betty-arc64-paladin/R64-10071.0.0-rc3/vm_test_results_1/smoke_suite/test_harness/all/SimpleTestVerify/1_autotest_tests/results-30-login_CryptohomeIncognito/debug/

The client test itself is written with timeouts that push it to 10 minutes.

This is less of a problem in the lab where we shard tests across boards.
The root cause of all these failures is that ARC wasn't booting (that CQ run failed, and all CLs got kicked out anyway).

In VMTests, we can't shard, and I think it's difficult to optimize VMTests for speed while at the same time optimizing for robustness in the lab.

I think we should simply fail fast in VMTests.
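
As a rough illustration of what fail-fast would mean for a serial run: stop scheduling tests as soon as one fails, instead of running the rest of the suite. `run_tests` and `fail_fast` are hypothetical names used for this sketch only; this is not the actual VMTest or autotest runner code.

```python
def run_tests(tests, fail_fast=False):
    """Run tests in order and return a list of (name, passed) tuples.

    `tests` is an iterable of (name, callable) pairs, where the callable
    returns a truthy value on success. With fail_fast=True, the loop stops
    right after the first failing test, so later tests are never started.
    """
    results = []
    for name, run in tests:
        passed = bool(run())
        results.append((name, passed))
        if fail_fast and not passed:
            break
    return results


if __name__ == "__main__":
    # Stand-ins for the three tests in the log above; the lambdas just
    # replay the outcomes reported there.
    suite = [
        ("security_NetworkListeners", lambda: False),
        ("login_Cryptohome", lambda: True),
        ("login_CryptohomeIncognito", lambda: False),
    ]
    for name, passed in run_tests(suite, fail_fast=True):
        print(name, "GOOD" if passed else "FAIL")
```

With fail-fast enabled, only the first test would run in the scenario from the log, since it is the first to fail.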
Impact assessment: the betty and betty-arc64 paladins become the slowest builders when failures happen in VMTest and there are no failures on reef/reef-uni (which are currently experimental).
e.g.: https://viceroy.corp.google.com/chromeos/build_details?build_config=master-paladin&build_number=16733
https://viceroy.corp.google.com/chromeos/build_details?build_config=master-paladin&build_number=16732

I haven't seen it very frequently (yet), but the VMTest time is pushing the limits.
Cc: zamorzaev@chromium.org
Owner: ----
+Alex, who is working on redoing how VMTests are run.
