Issue metadata
Sign in to add a comment
|
CTS test flake killed CQ run |
||||||||||||||||||||||
Issue descriptionHi ARC Constable, Test flakes are killing us, collectively, so please treat this as a P1. This CQ run was killed by a CTS test flake: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8935091831380015664 Specifically: https://luci-milo.appspot.com/buildbot/chromeos/veyron_minnie-paladin/6996 The error message is: 09/17 19:51:37.871 DEBUG| utils:0219| Running 'rsync -L --timeout=1800 --rsh='/usr/bin/ssh -a -x -o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 22' -az --no-o --no-g root@chromeos4-row9-rack11-host11:"/tmp/autoserv-6xwIEU/sysinfo.pickle" "/tmp/tmpMUrH0e"' 09/17 19:51:38.934 DEBUG| test:0391| after_iteration_hooks completed 09/17 19:51:39.031 WARNI| test:0606| The test failed with the following exception Traceback (most recent call last): File "/usr/local/autotest/client/common_lib/test.py", line 600, in _exec _call_test_function(self.execute, *p_args, **p_dargs) File "/usr/local/autotest/client/common_lib/test.py", line 800, in _call_test_function return func(*args, **dargs) File "/usr/local/autotest/client/common_lib/test.py", line 464, in execute postprocess_profiled_run, args, dargs) File "/usr/local/autotest/client/common_lib/test.py", line 371, in _call_run_once self.run_once(*args, **dargs) File "/usr/local/autotest/server/site_tests/cheets_CTS_N/cheets_CTS_N.py", line 164, in run_once precondition_commands=precondition_commands) File "/usr/local/autotest/server/cros/tradefed_test.py", line 1100, in _run_tradefed_with_retries commands) File "/usr/local/autotest/server/cros/tradefed_test.py", line 920, in _run_and_parse_tradefed output = self._run_tradefed(commands) File "/usr/local/autotest/server/site_tests/cheets_CTS_N/cheets_CTS_N.py", line 98, in _run_tradefed stderr_tee=utils.TEE_TO_LOGS) File "/usr/local/autotest/server/cros/tradefed_test.py", line 772, in _run return utils.run(*args, **kwargs) File "/usr/local/autotest/client/common_lib/utils.py", line 751, in run timeout)[0] File "/usr/local/autotest/client/common_lib/utils.py", line 833, in join_bg_jobs "Command(s) did not complete within %d seconds" % timeout) CmdTimeoutError: Command </tmp/autotest-tradefed-install_WEdaoH/8ff0dfe9d5ce36d93c5867cc13002e48/android-cts-7.1_r20-linux_x86-arm/android-cts/tools/cts-tradefed run commandAndExit cts --module CtsAccountManagerTestCases --precondition-arg skip-media-download --disable-reboot --logcat-on-failure -s 'chromeos4-row9-rack11-host11:22'> failed, rc=-15, Command(s) did not complete within 360 seconds * Command: /tmp/autotest-tradefed-install_WEdaoH/8ff0dfe9d5ce36d93c5867cc13002e48 /android-cts-7.1_r20-linux_x86-arm/android-cts/tools/cts-tradefed run commandAndExit cts --module CtsAccountManagerTestCases --precondition-arg skip-media-download --disable-reboot --logcat-on-failure -s 'chromeos4-row9-rack11-host11:22' Exit status: -15 Duration: 361.128597021
,
Sep 18
,
Sep 18
+Ilja, +Po-hsien Not a dup of Bug 885186 (which was test crash+failure, and moreover was on ARC-P). This one is timeout. It's too strange that the command is taking 4 minutes to start, before even running a single line of test case... 09/17 19:42:47.198 DEBUG| utils:0219| Running '/tmp/autotest-tradefed-install_WEdaoH/8ff0dfe9d5ce36d93c5867cc13002e48/android-cts-7.1_r20-linux_x86-arm/android-cts/tools/cts-tradefed run commandAndExit cts --module CtsAccountManagerTestCases --precondition-arg skip-media-download --disable-reboot --logcat-on-failure -s 'chromeos4-row9-rack11-host11:22'' 09/17 19:46:35.168 DEBUG| utils:0287| [stdout] Android Compatibility Test Suite 7.1_r20 (4892579) 09/17 19:46:35.392 DEBUG| utils:0287| [stdout] Non-interactive mode: Running initial command then exiting. 09/17 19:46:59.964 DEBUG| utils:0287| [stdout] Using commandline arguments as starting command: [run, commandAndExit, cts, --module, CtsAccountManagerTestCases, --precondition-arg, skip-media-download, --disable-reboot, --logcat-on-failure, -s, chromeos4-row9-rack11-host11:22] 09/17 19:47:01.607 DEBUG| utils:0287| [stdout] 09-17 19:47:01 I/DeviceManager: Detected new device chromeos4-row9-rack11-host11:22 09/17 19:48:28.724 DEBUG| utils:0287| [stdout] 09-17 19:48:28 I/TestInvocation: Starting invocation for 'cts' on build '4892579' cts --module CtsAccountManagerTestCases --precondition-arg skip-media-download --disable-reboot --logcat-on-failure -s chromeos4-row9-rack11-host11:22 Compatibility Test Suite 4892579 CTS 1537238906807 /tmp/autotest-tradefed-install_WEdaoH/8ff0dfe9d5ce36d93c5867cc13002e48/android-cts-7.1_r20-linux_x86-arm/android-cts/tools/../.. 7.1_r20 cts https://androidpartner.googleapis.com/v1/dynamicconfig/suites/CTS/modules/{module}/version/{version}?key=AIzaSyAbwX5JRlmsLeygY2WWihpIJPXFLueOQ3U 2018.09.17_19.48.26 on device chromeos4-row9-rack11-host11:22 09/17 19:48:33.937 DEBUG| utils:0287| [stdout] 09-17 19:48:33 I/ResultReporter: Initializing result directory 09/17 19:48:33.937 DEBUG| utils:0287| [stdout] 09-17 19:48:33 I/ResultReporter: Results Directory: /tmp/autotest-tradefed-install_WEdaoH/8ff0dfe9d5ce36d93c5867cc13002e48/android-cts-7.1_r20-linux_x86-arm/android-cts/tools/../../android-cts/results/2018.09.17_19.48.26 09/17 19:48:34.072 DEBUG| utils:0287| [stdout] 09-17 19:48:33 I/ResultReporter: Created log dir /tmp/autotest-tradefed-install_WEdaoH/8ff0dfe9d5ce36d93c5867cc13002e48/android-cts-7.1_r20-linux_x86-arm/android-cts/tools/../../android-cts/logs/2018.09.17_19.48.26 09/17 19:48:47.750 WARNI| utils:0938| run process timeout (360) fired on: /tmp/autotest-tradefed-install_WEdaoH/8ff0dfe9d5ce36d93c5867cc13002e48/android-cts-7.1_r20-linux_x86-arm/android-cts/tools/cts-tradefed run commandAndExit cts --module CtsAccountManagerTestCases --precondition-arg skip-media-download --disable-reboot --logcat-on-failure -s 'chromeos4-row9-rack11-host11:22'
,
Sep 18
At least the server that run the test was quite busy during the period (19:42 - 19:48) https://viceroy.corp.google.com/chromeos/machines?duration=3h&hostname=cros-full-0010&refresh=-1&scheduler_host=cros-full-0036&sentinel_host=chromeos-server156&staging_master=chromeos-staging-master2&utc_end=1537243684 but I'm not sure it explains this delay fully.
,
Sep 18
Yes, the suite picture attached already is a strong hint for super slow server. Find a job here: https://stainless.corp.google.com/browse/chromeos-autotest-results/239069303-chromeos-test/hostless/ Say this one https://stainless.corp.google.com/browse/chromeos-autotest-results/239069315-chromeos-test/ ssp logs says it runs on 09/17 19:28:05.747 DEBUG| autoserv:0714| autoserv is running in drone cros-full-0010.mtv.corp.google.com. Test log 09/17 19:32:36.726 INFO | cheets_CTS_N:0088| RUN: ./cts-tradefed run commandAndExit cts --module CtsAccelerationTestCases --precondition-arg skip-media-download --disable-reboot --logcat-on-failure -s chromeos4-row9-rack11-host22:22 09/17 19:32:36.734 DEBUG| utils:0219| Running '/tmp/autotest-tradefed-install_Md45va/8ff0dfe9d5ce36d93c5867cc13002e48/android-cts-7.1_r20-linux_x86-arm/android-cts/tools/cts-tradefed run commandAndExit cts --module CtsAccelerationTestCases --precondition-arg skip-media-download --disable-reboot --logcat-on-failure -s 'chromeos4-row9-rack11-host22:22'' 09/17 19:36:02.998 DEBUG| utils:0287| [stdout] Android Compatibility Test Suite 7.1_r20 (4892579) 09/17 19:36:03.342 DEBUG| utils:0287| [stdout] Non-interactive mode: Running initial command then exiting. 09/17 19:36:43.196 DEBUG| utils:0287| [stdout] Using commandline arguments as starting command: [run, commandAndExit, cts, --module, CtsAccelerationTestCases, --precondition-arg, skip-media-download, --disable-reboot, --logcat-on-failure, -s, chromeos4-row9-rack11-host22:22] 09/17 19:37:10.854 DEBUG| utils:0287| [stdout] 09-17 19:37:10 I/DeviceManager: Detected new device chromeos4-row9-rack11-host22:22 and after it hit its internal 6 minutes timeout and already failed it took 10+ minutes to clean up: 09/17 19:39:36.383 DEBUG| test:0386| Starting after_iteration_hooks for cheets_CTS_N.CtsAccelerationTestCases [...] 09/17 19:50:59.014 DEBUG| logging_manager:0627| Logging subprocess finished Viceroy tells us that cros-full-0010 had CPU at 100% and a load spike to 8 around that time (also attachment). https://viceroy.corp.google.com/chromeos/machines/?hostname=cros-full-0010&duration=6h&refresh=-1&scheduler_host=cros-full-0036&sentinel_host=chromeos-server156&staging_master=chromeos-staging-master2&utc_end=1537257454
,
Sep 18
Aviv, what do we know about the current state of the Ganeti servers? Ever since about September 4 we see load spikes on the cros-full-0010 server. Could this be skylab?
,
Sep 18
Looking at cros-full-0003 it starts being overloaded around 08/28 (same for 0004, 0005). So this is not a single change. But it may roughly coincide with the ganeti problems that supposedly were fixed (issue 878403). So it might be they just dialed the problems back a bit and we have now servers that perform less way during peak needs?
,
Sep 20
-> rasputin@ from ganeti Did anything change around Aug 28 that might affect load or virtualization performance on our ganeti instances?
,
Sep 24
Aug 28 is the same day that https://bugs.chromium.org/p/chromium/issues/detail?id=878403 happened, but that was fixed. This still being bad is coincidental. The node this instance is on is heavily loaded and the cluster is overallocated. As mentioned in https://bugs.chromium.org/p/chromium/issues/detail?id=878403, the solution is to move instances to different clusters, but that is blocked on testing that in https://bugs.chromium.org/p/chromium/issues/detail?id=882562. The faster we run those tests the faster we can move some of your instances to less-loaded clusters. :) Btw, a big reason for the higher load over the last 1-2 months is chromecast folks pushing their eurekabuild instances really hard in preparation for the Oct 9 event.
,
Sep 24
,
Sep 26
I saw another load spike today on cros-full-0005 at 17:26 killing multiple server jobs. https://viceroy.corp.google.com/chromeos/machines/?hostname=cros-full-0005&duration=246613&refresh=-1&scheduler_host=cros-full-0036&sentinel_host=chromeos-server156&staging_master=chromeos-staging-master2 http://cautotest-prod/afe/#tab_id=view_job&object_id=241800306
,
Sep 26
Saw same thing with latest CQ run cheets_CTS_N.CtsAppUsageHostTestCases: ERROR: Command </tmp/autotest-tradefed-install_5_qs5k/b82bb2d2a0c6a6fa4e3c8fd7d875dfc7/android-cts-7.1_r21-linux_x86-arm/android-cts/tools/cts-tradefed run commandAndExit cts --module CtsAppUsageHostTestCases --disable-reboot --logcat-on-failure -s 'chromeos6-row2-rack24-host19:22'> failed, rc=-15, Command(s) did not complete within 360 seconds
,
Oct 1
,
Oct 9
,
Oct 18
,
Dec 12
|
|||||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||||
Comment 1 by jclinton@chromium.org
, Sep 18