New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 797109 link

Starred by 4 users

Issue metadata

Status: Verified
Owner:
Closed: Jan 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



Sign in to add a comment

betty-pre-cq flaky due to GCE server unresponsiveness (cros-standard149-c2)

Project Member Reported by ihf@chromium.org, Dec 21 2017

Issue description

I logged into cros-standard149-c2. This is a machine with 32 cores, lots of RAM, generally half speed of Z840 in my benchmarks. Running VMTest it sometimes spikes the CPU and the whole server becomes fully unresponsive for 2-4 minutes (on a second shell). If this happens while tradefed runs, or something else with a heartbeat/timeout we get test failures. Otherwise it may just continue.

cros-standard149-c2
https://pantheon.corp.google.com/compute/instancesDetail/zones/us-central1-b/instances/cros-standard149-c2?q=search&project=chromeos-bot&graph=GCE_CPU&duration=P1D

Start VM:
./bin/cros_start_vm --no_graphics --image_path=../build/images/betty/latest/chromiumos_qemu_image.bin

Run tests:
test_that --iterations 100 --board=betty localhost:9222 cheets_CTS_N.x86.CtsDramTestCases | tee t

(each iteration is one little peak in the attached picture)


1) Example hang during Chrome login - test passes due to 5 minute Chrome login timeout > hang time (rightmost large peak in attached picture)

15:51:53 INFO | autoserv| END GOOD      ----    ----    timestamp=1513893070    localtime=Dec 21 15:51:10
15:51:53 INFO | autoserv| Got lock of exit_code_file.
15:51:53 INFO | autoserv| Released lock of exit_code_file and closed it.
15:51:55 INFO | autoserv| Client complete
15:51:56 INFO | autoserv| Running outside of lab, adding extra debug options.
15:51:56 INFO | autoserv| Ensure Android is running (timeout=300)...
15:51:56 INFO | autoserv| Not using --dont_override_profile to start Chrome.

[Nearly 4 minutes to login instead of 30-60s. Server fully unresponsive on a second terminal running top.]

15:55:43 INFO | autoserv| Setting up adb connection.
15:55:43 INFO | autoserv| [stderr] SELinux: Loaded file_contexts contexts from /file_contexts.bin.
15:55:47 INFO | autoserv| adb connect localhost:9222:
15:55:47 INFO | autoserv| * daemon not running. starting it now on port 5037 *
15:55:47 INFO | autoserv| * daemon started successfully *
15:55:47 INFO | autoserv| connected to localhost:9222
15:55:47 INFO | autoserv| 
15:55:47 INFO | autoserv| adb devices:
15:55:47 INFO | autoserv| List of devices attached
15:55:47 INFO | autoserv| localhost:9222        offline
15:55:47 INFO | autoserv| 
15:55:47 INFO | autoserv| 
15:55:48 INFO | autoserv| adb connect localhost:9222:
15:55:48 INFO | autoserv| already connected to localhost:9222
15:55:48 INFO | autoserv| 
15:55:48 INFO | autoserv| adb devices:
15:55:48 INFO | autoserv| List of devices attached
15:55:48 INFO | autoserv| localhost:9222        device
15:55:48 INFO | autoserv| 
15:55:48 INFO | autoserv| 
15:55:48 INFO | autoserv| Successfully setup adb connection.
15:55:48 INFO | autoserv| Disabling the adb install dialog.
15:55:49 INFO | autoserv| Disable adb dialog:
15:55:49 INFO | autoserv| Cleaning up repository.
15:55:49 INFO | autoserv| Running module.CtsDramTestCases:
15:55:49 INFO | autoserv| RUN: ./cts-tradefed list results
15:55:50 INFO | autoserv| END: ./cts-tradefed list results
15:55:50 INFO | autoserv| 
15:55:50 INFO | autoserv| RUN: ./cts-tradefed run commandAndExit cts --module CtsDramTestCases --precondition-arg skip-media-download --log-level-display=DEBUG --screenshot-o
15:56:54 INFO | autoserv| RESULT: run(t=26, p=26, f=0, ne=0, w=0)


2) Example hang during tradefed run, test fails due to 3 minutes of hang (leftmost large peak in attached picture).

14:17:12 INFO | autoserv| Ensure Android is running (timeout=300)...
14:17:12 INFO | autoserv| Not using --dont_override_profile to start Chrome.

[Login takes 32s - no problem.]

14:17:44 INFO | autoserv| Setting up adb connection.
14:17:45 INFO | autoserv| [stderr] SELinux: Loaded file_contexts contexts from /file_contexts.bin.
14:17:48 INFO | autoserv| adb connect localhost:9222:
14:17:48 INFO | autoserv| * daemon not running. starting it now on port 5037 *
14:17:48 INFO | autoserv| * daemon started successfully *
14:17:48 INFO | autoserv| connected to localhost:9222
14:17:48 INFO | autoserv| 
14:17:48 INFO | autoserv| adb devices:
14:17:48 INFO | autoserv| List of devices attached
14:17:48 INFO | autoserv| localhost:9222        device
14:17:48 INFO | autoserv| 
14:17:48 INFO | autoserv| 
14:17:48 INFO | autoserv| Successfully setup adb connection.
14:17:48 INFO | autoserv| Disabling the adb install dialog.
14:17:49 INFO | autoserv| Disable adb dialog:
14:17:50 INFO | autoserv| Cleaning up repository.
14:17:50 INFO | autoserv| Running module.CtsDramTestCases:
14:17:50 INFO | autoserv| RUN: ./cts-tradefed list results
14:17:50 INFO | autoserv| END: ./cts-tradefed list results
14:17:50 INFO | autoserv| 
14:17:50 INFO | autoserv| RUN: ./cts-tradefed run commandAndExit cts --module CtsDramTestCases --precondition-arg skip-media-download --log-level-display=DEBUG --screenshot-on-failure --disable-reboot --logcat-on-failure

[3 minutes of nothing instead of 1 minute of test]

14:21:08 INFO | autoserv| END: ./cts-tradefed run commandAndExit cts --module CtsDramTestCases --precondition-arg skip-media-download --log-level-display=DEBUG --screenshot-on-failure --disable-reboot --logcat-on-failure
14:21:08 INFO | autoserv| 
14:21:08 INFO | autoserv| Tradefed identified results and logs with 2017.12.21_14.17.51.
14:21:08 INFO | autoserv| Collecting tradefed testResult.xml and logs to /tmp/test_that_results_gWQDCN/results-006-cheets_CTS_N.arm.CtsDramTestCases/cheets_CTS_N.CtsDramTestCases/results/android-cts.
14:21:08 INFO | autoserv| tests=0, passed=0, failed=0, not_executed=0, waived=0
14:21:08 INFO | autoserv| RESULT: run(t=0, p=0, f=0, ne=0, w=0)
[...]
14:21:26 INFO | autoserv| TestFail: Error: Could not find any tests in module.


Next 30 jobs/tests were all good. So tradefed_test could have retried and passed. But still, the server shouldn't be AWOL.


 
cros-standard149-c2.png
24.9 KB View Download

Comment 1 by ihf@chromium.org, Dec 21 2017

Cc: ihf@chromium.org ayatane@chromium.org norvez@chromium.org jkop@chromium.org jrbarnette@chromium.org kinaba@chromium.org
 Issue 794707  has been merged into this issue.

Comment 2 by ihf@chromium.org, Dec 21 2017

Cc: pwang@chromium.org
Should we loop in the GCE team? KVM support on GCE is not fully public yet.
Cc: nxia@chromium.org
scottvw@ can direct you to where/how to provide feedback to the GCE team related to this.
Owner: jrbarnette@chromium.org

Comment 6 by ihf@chromium.org, Jan 9 2018

Owner: ihf@chromium.org

Comment 7 by ihf@chromium.org, Jan 9 2018

Status: ExternalDependency (was: Started)
I created b/71727384 with some more detail. 

Comment 8 by ihf@chromium.org, Jan 9 2018

Cc: scottvw@google.com
Project Member

Comment 9 by bugdroid1@chromium.org, Jan 12 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/crosutils/+/ed1d0ab255c364eabbcfa9c842bca304949e5c59

commit ed1d0ab255c364eabbcfa9c842bca304949e5c59
Author: Ilja H. Friedel <ihf@chromium.org>
Date: Fri Jan 12 03:37:38 2018

cros_vm: disable ksmd before starting VM.

ksmd often leads to soft kernel lockups on 3.13 at least when running
VMs on GCE. Disabling it seems to avoid those. This can increase memory
usage but we don't really care (2/4GB).

BUG=b:71727384,  chromium:797109 
TEST=./bin/cros_start_vm --no_graphics --image_path=../build/images/${BOARD}/latest/chromiumos_qemu_image.bin
INFO    : Disabling currently enabled ksmd to work around kernel lockups.

Change-Id: I51b73b77c1b864e1d080c4f33419de00a17d1139
Reviewed-on: https://chromium-review.googlesource.com/861306
Commit-Ready: Ilja H. Friedel <ihf@chromium.org>
Tested-by: Ilja H. Friedel <ihf@chromium.org>
Reviewed-by: Stéphane Marchesin <marcheu@chromium.org>

[modify] https://crrev.com/ed1d0ab255c364eabbcfa9c842bca304949e5c59/lib/cros_vm_lib.sh

Comment 10 by ihf@chromium.org, Jan 12 2018

Status: Fixed (was: ExternalDependency)
Can someone who already knows all the details say if  bug #795976  is a dupe of this one?
It probably is.

Comment 14 by ihf@chromium.org, Jan 17 2018

betty vmtest2.png
531 KB View Download
Labels: -Restrict-View-Google

Sign in to add a comment