New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 885199 link

Starred by 2 users

Issue metadata

Status: ExternalDependency
Merged: issue 885186
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocked on:
issue 882562

Blocking:
issue 889403



Sign in to add a comment

CTS test flake killed CQ run

Project Member Reported by jclinton@chromium.org, Sep 18

Issue description

Hi ARC Constable, 

Test flakes are killing us, collectively, so please treat this as a P1. This CQ run was killed by a CTS test flake:

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8935091831380015664

Specifically: https://luci-milo.appspot.com/buildbot/chromeos/veyron_minnie-paladin/6996

The error message is:
09/17 19:51:37.871 DEBUG|             utils:0219| Running 'rsync -L  --timeout=1800 --rsh='/usr/bin/ssh -a -x  -o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 22' -az --no-o --no-g  root@chromeos4-row9-rack11-host11:"/tmp/autoserv-6xwIEU/sysinfo.pickle" "/tmp/tmpMUrH0e"'
09/17 19:51:38.934 DEBUG|              test:0391| after_iteration_hooks completed
09/17 19:51:39.031 WARNI|              test:0606| The test failed with the following exception
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/test.py", line 600, in _exec
    _call_test_function(self.execute, *p_args, **p_dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 800, in _call_test_function
    return func(*args, **dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 464, in execute
    postprocess_profiled_run, args, dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 371, in _call_run_once
    self.run_once(*args, **dargs)
  File "/usr/local/autotest/server/site_tests/cheets_CTS_N/cheets_CTS_N.py", line 164, in run_once
    precondition_commands=precondition_commands)
  File "/usr/local/autotest/server/cros/tradefed_test.py", line 1100, in _run_tradefed_with_retries
    commands)
  File "/usr/local/autotest/server/cros/tradefed_test.py", line 920, in _run_and_parse_tradefed
    output = self._run_tradefed(commands)
  File "/usr/local/autotest/server/site_tests/cheets_CTS_N/cheets_CTS_N.py", line 98, in _run_tradefed
    stderr_tee=utils.TEE_TO_LOGS)
  File "/usr/local/autotest/server/cros/tradefed_test.py", line 772, in _run
    return utils.run(*args, **kwargs)
  File "/usr/local/autotest/client/common_lib/utils.py", line 751, in run
    timeout)[0]
  File "/usr/local/autotest/client/common_lib/utils.py", line 833, in join_bg_jobs
    "Command(s) did not complete within %d seconds" % timeout)
CmdTimeoutError: Command </tmp/autotest-tradefed-install_WEdaoH/8ff0dfe9d5ce36d93c5867cc13002e48/android-cts-7.1_r20-linux_x86-arm/android-cts/tools/cts-tradefed run commandAndExit cts --module CtsAccountManagerTestCases --precondition-arg skip-media-download --disable-reboot --logcat-on-failure -s 'chromeos4-row9-rack11-host11:22'> failed, rc=-15, Command(s) did not complete within 360 seconds
* Command: 
    /tmp/autotest-tradefed-install_WEdaoH/8ff0dfe9d5ce36d93c5867cc13002e48
    /android-cts-7.1_r20-linux_x86-arm/android-cts/tools/cts-tradefed run
    commandAndExit cts --module CtsAccountManagerTestCases --precondition-arg
    skip-media-download --disable-reboot --logcat-on-failure -s
    'chromeos4-row9-rack11-host11:22'
Exit status: -15
Duration: 361.128597021

 
Cc: kazu@chromium.org
Mergedinto: 885186
Status: Duplicate (was: Assigned)
Cc: ihf@chromium.org pwang@chromium.org
Status: Assigned (was: Duplicate)
+Ilja, +Po-hsien

Not a dup of  Bug 885186  (which was test crash+failure, and moreover was on ARC-P).
This one is timeout.

It's too strange that the command is taking 4 minutes to start, before even running a single line of test case...

09/17 19:42:47.198 DEBUG|             utils:0219| Running '/tmp/autotest-tradefed-install_WEdaoH/8ff0dfe9d5ce36d93c5867cc13002e48/android-cts-7.1_r20-linux_x86-arm/android-cts/tools/cts-tradefed run commandAndExit cts --module CtsAccountManagerTestCases --precondition-arg skip-media-download --disable-reboot --logcat-on-failure -s 'chromeos4-row9-rack11-host11:22''
09/17 19:46:35.168 DEBUG|             utils:0287| [stdout] Android Compatibility Test Suite 7.1_r20 (4892579)
09/17 19:46:35.392 DEBUG|             utils:0287| [stdout] Non-interactive mode: Running initial command then exiting.
09/17 19:46:59.964 DEBUG|             utils:0287| [stdout] Using commandline arguments as starting command: [run, commandAndExit, cts, --module, CtsAccountManagerTestCases, --precondition-arg, skip-media-download, --disable-reboot, --logcat-on-failure, -s, chromeos4-row9-rack11-host11:22]
09/17 19:47:01.607 DEBUG|             utils:0287| [stdout] 09-17 19:47:01 I/DeviceManager: Detected new device chromeos4-row9-rack11-host11:22
09/17 19:48:28.724 DEBUG|             utils:0287| [stdout] 09-17 19:48:28 I/TestInvocation: Starting invocation for 'cts' on build '4892579' cts --module CtsAccountManagerTestCases --precondition-arg skip-media-download --disable-reboot --logcat-on-failure -s chromeos4-row9-rack11-host11:22 Compatibility Test Suite 4892579 CTS 1537238906807 /tmp/autotest-tradefed-install_WEdaoH/8ff0dfe9d5ce36d93c5867cc13002e48/android-cts-7.1_r20-linux_x86-arm/android-cts/tools/../.. 7.1_r20 cts https://androidpartner.googleapis.com/v1/dynamicconfig/suites/CTS/modules/{module}/version/{version}?key=AIzaSyAbwX5JRlmsLeygY2WWihpIJPXFLueOQ3U 2018.09.17_19.48.26 on device chromeos4-row9-rack11-host11:22
09/17 19:48:33.937 DEBUG|             utils:0287| [stdout] 09-17 19:48:33 I/ResultReporter: Initializing result directory
09/17 19:48:33.937 DEBUG|             utils:0287| [stdout] 09-17 19:48:33 I/ResultReporter: Results Directory: /tmp/autotest-tradefed-install_WEdaoH/8ff0dfe9d5ce36d93c5867cc13002e48/android-cts-7.1_r20-linux_x86-arm/android-cts/tools/../../android-cts/results/2018.09.17_19.48.26
09/17 19:48:34.072 DEBUG|             utils:0287| [stdout] 09-17 19:48:33 I/ResultReporter: Created log dir /tmp/autotest-tradefed-install_WEdaoH/8ff0dfe9d5ce36d93c5867cc13002e48/android-cts-7.1_r20-linux_x86-arm/android-cts/tools/../../android-cts/logs/2018.09.17_19.48.26
09/17 19:48:47.750 WARNI|             utils:0938| run process timeout (360) fired on: /tmp/autotest-tradefed-install_WEdaoH/8ff0dfe9d5ce36d93c5867cc13002e48/android-cts-7.1_r20-linux_x86-arm/android-cts/tools/cts-tradefed run commandAndExit cts --module CtsAccountManagerTestCases --precondition-arg skip-media-download --disable-reboot --logcat-on-failure -s 'chromeos4-row9-rack11-host11:22'
Cc: erosky@chromium.org
Owner: akes...@chromium.org
Yes, the suite picture attached already is a strong hint for super slow server.

Find a job here:
https://stainless.corp.google.com/browse/chromeos-autotest-results/239069303-chromeos-test/hostless/

Say this one
https://stainless.corp.google.com/browse/chromeos-autotest-results/239069315-chromeos-test/
ssp logs  says it runs on
09/17 19:28:05.747 DEBUG|          autoserv:0714| autoserv is running in drone cros-full-0010.mtv.corp.google.com.

Test log
09/17 19:32:36.726 INFO |      cheets_CTS_N:0088| RUN: ./cts-tradefed run commandAndExit cts --module CtsAccelerationTestCases --precondition-arg skip-media-download --disable-reboot --logcat-on-failure -s chromeos4-row9-rack11-host22:22
09/17 19:32:36.734 DEBUG|             utils:0219| Running '/tmp/autotest-tradefed-install_Md45va/8ff0dfe9d5ce36d93c5867cc13002e48/android-cts-7.1_r20-linux_x86-arm/android-cts/tools/cts-tradefed run commandAndExit cts --module CtsAccelerationTestCases --precondition-arg skip-media-download --disable-reboot --logcat-on-failure -s 'chromeos4-row9-rack11-host22:22''
09/17 19:36:02.998 DEBUG|             utils:0287| [stdout] Android Compatibility Test Suite 7.1_r20 (4892579)
09/17 19:36:03.342 DEBUG|             utils:0287| [stdout] Non-interactive mode: Running initial command then exiting.
09/17 19:36:43.196 DEBUG|             utils:0287| [stdout] Using commandline arguments as starting command: [run, commandAndExit, cts, --module, CtsAccelerationTestCases, --precondition-arg, skip-media-download, --disable-reboot, --logcat-on-failure, -s, chromeos4-row9-rack11-host22:22]
09/17 19:37:10.854 DEBUG|             utils:0287| [stdout] 09-17 19:37:10 I/DeviceManager: Detected new device chromeos4-row9-rack11-host22:22

and after it hit its internal 6 minutes timeout and already failed it took 10+ minutes to clean up:
09/17 19:39:36.383 DEBUG|              test:0386| Starting after_iteration_hooks for cheets_CTS_N.CtsAccelerationTestCases
[...]
09/17 19:50:59.014 DEBUG|   logging_manager:0627| Logging subprocess finished


Viceroy tells us that cros-full-0010 had CPU at 100% and a load spike to 8 around that time (also attachment).
https://viceroy.corp.google.com/chromeos/machines/?hostname=cros-full-0010&duration=6h&refresh=-1&scheduler_host=cros-full-0036&sentinel_host=chromeos-server156&staging_master=chromeos-staging-master2&utc_end=1537257454



885199_suite.png
92.0 KB View Download
885199_CPU.png
20.5 KB View Download
885199_load.png
17.2 KB View Download
Aviv, what do we know about the current state of the Ganeti servers? Ever since about September 4 we see load spikes on the cros-full-0010 server. Could this be skylab?
885199_load_long_time.png
101 KB View Download
Looking at cros-full-0003 it starts being overloaded around 08/28 (same for 0004, 0005). So this is not a single change. But it may roughly coincide with the ganeti problems that supposedly were fixed (issue 878403). So it might be they just dialed the problems back a bit and we have now servers that perform less way during peak needs?
Cc: akes...@chromium.org
Owner: rasputin@google.com
-> rasputin@ from ganeti

Did anything change around Aug 28 that might affect load or virtualization performance on our ganeti instances?
Owner: akes...@chromium.org
Aug 28 is the same day that https://bugs.chromium.org/p/chromium/issues/detail?id=878403 happened, but that was fixed. This still being bad is coincidental.

The node this instance is on is heavily loaded and the cluster is overallocated. As mentioned in https://bugs.chromium.org/p/chromium/issues/detail?id=878403, the solution is to move instances to different clusters, but that is blocked on testing that in https://bugs.chromium.org/p/chromium/issues/detail?id=882562. The faster we run those tests the faster we can move some of your instances to less-loaded clusters. :)

Btw, a big reason for the higher load over the last 1-2 months is chromecast folks pushing their eurekabuild instances really hard in preparation for the Oct 9 event.


Cc: rasputin@google.com
Saw same thing with latest CQ run

cheets_CTS_N.CtsAppUsageHostTestCases: ERROR: Command </tmp/autotest-tradefed-install_5_qs5k/b82bb2d2a0c6a6fa4e3c8fd7d875dfc7/android-cts-7.1_r21-linux_x86-arm/android-cts/tools/cts-tradefed run commandAndExit cts --module CtsAppUsageHostTestCases --disable-reboot --logcat-on-failure -s 'chromeos6-row2-rack24-host19:22'> failed, rc=-15, Command(s) did not complete within 360 seconds
Blocking: 889403
Blockedon: 882562
Owner: ----
Status: ExternalDependency (was: Assigned)
Components: -Infra>Client>ChromeOS>CI

Sign in to add a comment