betty-pre-cq flaky due to GCE server unresponsiveness (cros-standard149-c2) |
|||||||||
Issue descriptionI logged into cros-standard149-c2. This is a machine with 32 cores, lots of RAM, generally half speed of Z840 in my benchmarks. Running VMTest it sometimes spikes the CPU and the whole server becomes fully unresponsive for 2-4 minutes (on a second shell). If this happens while tradefed runs, or something else with a heartbeat/timeout we get test failures. Otherwise it may just continue. cros-standard149-c2 https://pantheon.corp.google.com/compute/instancesDetail/zones/us-central1-b/instances/cros-standard149-c2?q=search&project=chromeos-bot&graph=GCE_CPU&duration=P1D Start VM: ./bin/cros_start_vm --no_graphics --image_path=../build/images/betty/latest/chromiumos_qemu_image.bin Run tests: test_that --iterations 100 --board=betty localhost:9222 cheets_CTS_N.x86.CtsDramTestCases | tee t (each iteration is one little peak in the attached picture) 1) Example hang during Chrome login - test passes due to 5 minute Chrome login timeout > hang time (rightmost large peak in attached picture) 15:51:53 INFO | autoserv| END GOOD ---- ---- timestamp=1513893070 localtime=Dec 21 15:51:10 15:51:53 INFO | autoserv| Got lock of exit_code_file. 15:51:53 INFO | autoserv| Released lock of exit_code_file and closed it. 15:51:55 INFO | autoserv| Client complete 15:51:56 INFO | autoserv| Running outside of lab, adding extra debug options. 15:51:56 INFO | autoserv| Ensure Android is running (timeout=300)... 15:51:56 INFO | autoserv| Not using --dont_override_profile to start Chrome. [Nearly 4 minutes to login instead of 30-60s. Server fully unresponsive on a second terminal running top.] 15:55:43 INFO | autoserv| Setting up adb connection. 15:55:43 INFO | autoserv| [stderr] SELinux: Loaded file_contexts contexts from /file_contexts.bin. 15:55:47 INFO | autoserv| adb connect localhost:9222: 15:55:47 INFO | autoserv| * daemon not running. starting it now on port 5037 * 15:55:47 INFO | autoserv| * daemon started successfully * 15:55:47 INFO | autoserv| connected to localhost:9222 15:55:47 INFO | autoserv| 15:55:47 INFO | autoserv| adb devices: 15:55:47 INFO | autoserv| List of devices attached 15:55:47 INFO | autoserv| localhost:9222 offline 15:55:47 INFO | autoserv| 15:55:47 INFO | autoserv| 15:55:48 INFO | autoserv| adb connect localhost:9222: 15:55:48 INFO | autoserv| already connected to localhost:9222 15:55:48 INFO | autoserv| 15:55:48 INFO | autoserv| adb devices: 15:55:48 INFO | autoserv| List of devices attached 15:55:48 INFO | autoserv| localhost:9222 device 15:55:48 INFO | autoserv| 15:55:48 INFO | autoserv| 15:55:48 INFO | autoserv| Successfully setup adb connection. 15:55:48 INFO | autoserv| Disabling the adb install dialog. 15:55:49 INFO | autoserv| Disable adb dialog: 15:55:49 INFO | autoserv| Cleaning up repository. 15:55:49 INFO | autoserv| Running module.CtsDramTestCases: 15:55:49 INFO | autoserv| RUN: ./cts-tradefed list results 15:55:50 INFO | autoserv| END: ./cts-tradefed list results 15:55:50 INFO | autoserv| 15:55:50 INFO | autoserv| RUN: ./cts-tradefed run commandAndExit cts --module CtsDramTestCases --precondition-arg skip-media-download --log-level-display=DEBUG --screenshot-o 15:56:54 INFO | autoserv| RESULT: run(t=26, p=26, f=0, ne=0, w=0) 2) Example hang during tradefed run, test fails due to 3 minutes of hang (leftmost large peak in attached picture). 14:17:12 INFO | autoserv| Ensure Android is running (timeout=300)... 14:17:12 INFO | autoserv| Not using --dont_override_profile to start Chrome. [Login takes 32s - no problem.] 14:17:44 INFO | autoserv| Setting up adb connection. 14:17:45 INFO | autoserv| [stderr] SELinux: Loaded file_contexts contexts from /file_contexts.bin. 14:17:48 INFO | autoserv| adb connect localhost:9222: 14:17:48 INFO | autoserv| * daemon not running. starting it now on port 5037 * 14:17:48 INFO | autoserv| * daemon started successfully * 14:17:48 INFO | autoserv| connected to localhost:9222 14:17:48 INFO | autoserv| 14:17:48 INFO | autoserv| adb devices: 14:17:48 INFO | autoserv| List of devices attached 14:17:48 INFO | autoserv| localhost:9222 device 14:17:48 INFO | autoserv| 14:17:48 INFO | autoserv| 14:17:48 INFO | autoserv| Successfully setup adb connection. 14:17:48 INFO | autoserv| Disabling the adb install dialog. 14:17:49 INFO | autoserv| Disable adb dialog: 14:17:50 INFO | autoserv| Cleaning up repository. 14:17:50 INFO | autoserv| Running module.CtsDramTestCases: 14:17:50 INFO | autoserv| RUN: ./cts-tradefed list results 14:17:50 INFO | autoserv| END: ./cts-tradefed list results 14:17:50 INFO | autoserv| 14:17:50 INFO | autoserv| RUN: ./cts-tradefed run commandAndExit cts --module CtsDramTestCases --precondition-arg skip-media-download --log-level-display=DEBUG --screenshot-on-failure --disable-reboot --logcat-on-failure [3 minutes of nothing instead of 1 minute of test] 14:21:08 INFO | autoserv| END: ./cts-tradefed run commandAndExit cts --module CtsDramTestCases --precondition-arg skip-media-download --log-level-display=DEBUG --screenshot-on-failure --disable-reboot --logcat-on-failure 14:21:08 INFO | autoserv| 14:21:08 INFO | autoserv| Tradefed identified results and logs with 2017.12.21_14.17.51. 14:21:08 INFO | autoserv| Collecting tradefed testResult.xml and logs to /tmp/test_that_results_gWQDCN/results-006-cheets_CTS_N.arm.CtsDramTestCases/cheets_CTS_N.CtsDramTestCases/results/android-cts. 14:21:08 INFO | autoserv| tests=0, passed=0, failed=0, not_executed=0, waived=0 14:21:08 INFO | autoserv| RESULT: run(t=0, p=0, f=0, ne=0, w=0) [...] 14:21:26 INFO | autoserv| TestFail: Error: Could not find any tests in module. Next 30 jobs/tests were all good. So tradefed_test could have retried and passed. But still, the server shouldn't be AWOL.
,
Dec 21 2017
,
Jan 4 2018
Should we loop in the GCE team? KVM support on GCE is not fully public yet.
,
Jan 5 2018
scottvw@ can direct you to where/how to provide feedback to the GCE team related to this.
,
Jan 8 2018
,
Jan 9 2018
,
Jan 9 2018
,
Jan 9 2018
,
Jan 12 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/crosutils/+/ed1d0ab255c364eabbcfa9c842bca304949e5c59 commit ed1d0ab255c364eabbcfa9c842bca304949e5c59 Author: Ilja H. Friedel <ihf@chromium.org> Date: Fri Jan 12 03:37:38 2018 cros_vm: disable ksmd before starting VM. ksmd often leads to soft kernel lockups on 3.13 at least when running VMs on GCE. Disabling it seems to avoid those. This can increase memory usage but we don't really care (2/4GB). BUG=b:71727384, chromium:797109 TEST=./bin/cros_start_vm --no_graphics --image_path=../build/images/${BOARD}/latest/chromiumos_qemu_image.bin INFO : Disabling currently enabled ksmd to work around kernel lockups. Change-Id: I51b73b77c1b864e1d080c4f33419de00a17d1139 Reviewed-on: https://chromium-review.googlesource.com/861306 Commit-Ready: Ilja H. Friedel <ihf@chromium.org> Tested-by: Ilja H. Friedel <ihf@chromium.org> Reviewed-by: Stéphane Marchesin <marcheu@chromium.org> [modify] https://crrev.com/ed1d0ab255c364eabbcfa9c842bca304949e5c59/lib/cros_vm_lib.sh
,
Jan 12 2018
,
Jan 12 2018
Can someone who already knows all the details say if bug #795976 is a dupe of this one?
,
Jan 12 2018
It probably is.
,
Jan 17 2018
The ratio between VMTest attempt 1 and VMTest attempt 2 used to be about 3..4 to 1. Now it crashed to about 30..40 to 1. Probably all fixed. https://pcon.corp.google.com/p#chrome-infra/queryplayground?yAxisMin=0&yAxisMin2=0&oldHeatmap=false&query=CBCCAaMCCpgCCBCCAZICCvoBCAEY9wOKAdUBCsEBCA96vAEKuQEqJwoTbWV0cmljOmJ1aWxkX2NvbmZpZxIQCAEiDGJldHR5LXByZS1jcSodCglob3N0X25hbWUSDggCIgphdXRvZ2VuOi4qGAEqHgoLbWV0cmljOm5hbWUSDQgBIglCb290c3RyYXAYATJPChltb25hcmNoLmFjcXVpc2l0aW9ucy5UYXNrEjIKMC9jaHJvbWUvaW5mcmEvY2hyb21lb3MvY2J1aWxkYm90L3N0YWdlL2R1cmF0aW9ucyCAmJq8BCoCIAEwgMDd7sECkgEZCgYIARigjQYSABoLbWV0cmljOm5hbWVgABITCFESCwgBGet8DXBWNDI_UgIIHxIGCBZSAggf&names=Query%201&duration=12w&repository=force&title=betty-pre-cq%20stage-hours%20(1%20week%20window)&yAxisLabel=stage-hours&endtime=1516215633&showEditor=true
,
Jan 17 2018
,
Sep 6
|
|||||||||
►
Sign in to add a comment |
|||||||||
Comment 1 by ihf@chromium.org
, Dec 21 2017