
Issue 625225

Starred by 4 users

Issue metadata

Status: Archived
Owner:
Closed: Oct 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocked on:
issue 626198
issue 626504




HWTest [sanity] and [arc-bvt-cq] sometimes slow on cheets builders

Project Member Reported by steve...@chromium.org, Jul 1 2016

Issue description

While investigating pfq builders timing out, I noticed that the HWTest [sanity] stage was taking about twice as long as usual with no obvious explanation.

Here is an example builder:
https://uberchromegw.corp.google.com/i/chromeos.chrome/builders/cyan-cheets-tot-chrome-pfq-cheets-informational/builds/340

HWTest [sanity]:
  start:    1:41:28 
  duration: 1:00:38 
  finish:   2:42:06 
HWTest [arc-bvt-cq]:
  start:    2:42:12 
  duration: 2:24:56 
  finish:   5:07:08

The fact that both [sanity] and [arc-bvt-cq] took much longer than normal (a typical run is 0:32 for [sanity] and 0:45 for [arc-bvt-cq]) leads me to suspect a load issue.
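
For a rough sense of scale, here is a back-of-envelope comparison of these durations against the typical figures (a minimal sketch; it assumes the "typical" times are hours:minutes, i.e. 0:32 means 32 minutes):

def minutes(h, m, s=0):
    # Convert an h:mm:ss duration to minutes.
    return h * 60 + m + s / 60.0

# ~1.9x slower than a typical [sanity] run.
print(minutes(1, 0, 38) / minutes(0, 32))
# ~3.2x slower than a typical [arc-bvt-cq] run.
print(minutes(2, 24, 56) / minutes(0, 45))

So in this particular build, [arc-bvt-cq] regressed noticeably more than [sanity].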

Comparing the stdio from the [sanity] test to a typical run, the output looks similar, but with a mysterious time gap between:

Testing ended at 2016-07-01 05:58:52

and

06:30:12: INFO: Running cidb query on pid 26734 ...



 
Looking at status.log for 'Suite job' and 'dummy_PassServer.sanity' shows both of them ending at 05:58.

And looking at autoserv.DEBUG, the last log line is at 06:04, still 25 minutes before the stage completes.
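
For illustration, a minimal sketch of how one might scan a stage's stdio for silent gaps like this (not an existing chromite/autotest tool; the leading HH:MM:SS timestamp format and the file name are assumptions, and midnight rollover is ignored):

import re
from datetime import datetime, timedelta

TS = re.compile(r'^(\d{2}:\d{2}:\d{2}): ')

def find_gaps(path, threshold=timedelta(minutes=10)):
    # Print log lines that follow a silent period longer than `threshold`.
    prev = None
    with open(path) as f:
        for line in f:
            m = TS.match(line)
            if not m:
                continue
            ts = datetime.strptime(m.group(1), '%H:%M:%S')
            if prev is not None and ts - prev > threshold:
                print('gap of %s before: %s' % (ts - prev, line.rstrip()))
            prev = ts

find_gaps('stdio.log')  # hypothetical local copy of the stage's stdio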

Labels: Hotlist-CrOS-Gardener
Summary: HWTest [sanity] and [[arc-bvt-cq] sometimes slow on cheets builders (was: HWTest [sanity] sometimes slow (cheets only?))
Summary: HWTest [sanity] and [arc-bvt-cq] sometimes slow on cheets builders (was: HWTest [sanity] and [[arc-bvt-cq] sometimes slow on cheets builders)
Here is a snippet of the output we are seeing in the [arc-bvt-cq] stage:

06:30:41: INFO: RunCommand: /b/cbuild.cyan-cheets/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpF6_CDZ/tmpZZLsNb/temp_summary.json --raw-cmd --task-name cyan-cheets-tot-chrome-pfq-cheets-informational/R53-8530.0.0-b340-arc-bvt-cq --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 18600 --io-timeout 18600 --hard-timeout 18600 --expiration 1200 -- /usr/local/autotest/site_utils/run_suite.py --build cyan-cheets-tot-chrome-pfq-cheets-informational/R53-8530.0.0-b340 --board cyan-cheets --suite_name arc-bvt-cq --pool bvt --num 3 --file_bugs True --priority PFQ --timeout_mins 250 --retry True --minimum_duts 3 --suite_min_duts 3 --offload_failures_only False -m 68216040
06:48:57: INFO: Refreshing due to a 401 (attempt 1/2)
06:48:57: INFO: Refreshing access_token
06:54:06: INFO: Refreshing due to a 401 (attempt 1/2)
06:54:06: INFO: Refreshing access_token
07:48:57: INFO: Refreshing due to a 401 (attempt 1/2)
07:48:57: INFO: Refreshing access_token
07:55:06: INFO: Refreshing due to a 401 (attempt 1/2)
07:55:06: INFO: Refreshing access_token
08:49:57: INFO: Refreshing due to a 401 (attempt 1/2)
08:49:57: INFO: Refreshing access_token
08:54:41: WARNING: Killing tasks: [<_BackgroundTask(_BackgroundTask-5:7:4, started)>]
08:54:41: WARNING: Killing 24555 (sig=24 SIGXCPU)


That is a gap of more than 2 hours; clearly something is wrong here, but it is not clear to me what.

There is some extra debugging info in the output; I will see if I can use it to track down a prior or existing bug that may be the root cause.
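
As a quick cross-check of the configured timeouts against the observed wall clock (my reading of the numbers above; the conclusion that the kill came from the builder's own deadline rather than from swarming or run_suite is an assumption):

# Configured timeouts from the swarming.py command above.
swarming_timeout_s = 18600        # --timeout / --io-timeout / --hard-timeout
run_suite_timeout_s = 250 * 60    # --timeout_mins 250

# Observed: launched 06:30:41, killed 08:54:41.
observed_s = (8 * 3600 + 54 * 60 + 41) - (6 * 3600 + 30 * 60 + 41)

print(observed_s)                                              # 8640 s == 2 h 24 min
print(observed_s < run_suite_timeout_s < swarming_timeout_s)   # True

So the SIGXCPU kill fired well before either configured timeout would have, which is consistent with an outer build deadline (see the master timeout messages later in this bug) rather than a swarming or run_suite timeout.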

Cc: vapier@chromium.org pho...@chromium.org
+vapier@, phobbs@ for chromite debugging.


Cc: jrbarnette@chromium.org
+jrbarnette@ (infra deputy)

Cc: cywang@chromium.org khmel@chromium.org

Comment 9 Deleted

As I describe in Issue 624736, this is likely an issue in the cautotest scheduler.
Cc: bhthompson@chromium.org jamescook@chromium.org
Issue 626035 has been merged into this issue.
Cc: shey...@chromium.org
+sheyang@ (infra trooper this week).
Cc: sbasi@chromium.org
+sbasi@
BTW, here is jrbarnette@'s comment related to this issue:
https://bugs.chromium.org/p/chromium/issues/detail?id=621352#c12 
I looked into the veyron_minnie_cheets-chrome-pfq arc-bvt-cq log; it shows all tests completed and passed.
http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=68617660
No errors in autoserv.ERROR; autoserv.INFO shows that all tests started/ended/passed.
https://00e9e64bac1031d2536fcf643641c0ec69bc052977313af27c-apidata.googleusercontent.com/download/storage/v1_internal/b/chromeos-autotest-results/o/68617660-chromeos-test%2Fhostless%2Fdebug%2Fautoserv.INFO?qk=AD5uMEuULhgNpBkmsxkxunpSOUOxm2d3XfHTe_W62tGgzbeRcugqN4UvHPN1h-fyJPZ6tNlrReRddPvT7CK7xgAEbVhlghYQImb513Ro2dBniDw0c8ZMi6TkHgN2TK8JtW7Gd20vNDA6hrkDK9VQaXoGnV8W5K6CKh3yIYGjoIzSFre_GUD-fBzK_S15n6m6VpjM8bMkIA2nmXwfouiz8SbQPd7R4F7C-q9qxGq_XFI9qYiCZQtzsX887dhlK-B-NGeLE7mXAyQ_U2J5b_AfEfWEU4p5r0iDRS8pGyDDujgmWCHgTRv9BPajM5RtdHPAanT5ZV4T-vb1_GRwgGdgT58_H88DTpBSSQeQ6yvDpXIEfsHt2QS_EtCCCY8f2wnbZbhemXpPH0suhD8AdToVfEv1OAKNemRB9Vls4W9rLrTWbvkDwUfj9WFbeR_bkTXbigiz1_hZg4kEUHnGu1xJeGWQRhFj-BeFE_F9j0qY4hcZvJYbVQxHrOMn5e9RuSNtKiMOzSZMXwdOX0131NBvrKboAX2zCK3MApjMrJ_qVU28MTCPYjOwUU_wOV_xX9hljf46lCW_2mZnuJYczpcUCZQlr_6p6WnTkvq_5OfULMeWjQUdfhF1DfljnMpofFhO46AvCVtqxvE3yJg6nA4g52NEDYgLTJFlFzC-8VKgyPnBYIBtQxAr83gkRTy6A5rYaU-rp4nODkWzPO77jTlwFfM8przvLH37YMAohIs8de41lvFlPaCxwTvqtXcWkbRaJRLXLYQPdTS5R_0DuXMCDZe_Oy16R0NCPS0TSdNIrxS_lHKyr89-qt-ViPDw4wWqSXHqwC3SQJbfZzQJd4mfjusglNpYdxvTozSXnFCEi23_czfmYyIl6Beo57OK0uMDw6id1k1TQLr0h_L-cRRZ57J-CL443cH_Rw

Two tests, video_ChromeHWDecodeUsed.vp9 and video_ChromeCameraMJpegHWDecodeUsed, look like they were skipped due to "Test skipped, no DUTs with expected labels found". I am not sure whether this is somehow related to the timeout issue.
"Test skipped, no DUTs with expected labels found" would only be a problem if all the tests said that, which would indicate misconfigured DUT pools. More likely these tests require some hardware component that is missing on veyron_minnie and so are not required to run there.
This happened again on veyron_minnie_cheets-chrome-pfq
https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_minnie-cheets-chrome-pfq/builds/396

When I saw the build failure, I checked the details; it looks like one of the tests, desktopui_ExitOnSupervisedUserCrash, was still running at the time. It eventually completed successfully, but only after the timeout limit had passed.

07/06 16:28:46.735 INFO |        server_job:0128| START	68676125-chromeos-test/chromeos4-row9-rack10-host16/desktopui_ExitOnSupervisedUserCrash	desktopui_ExitOnSupervisedUserCrash	timestamp=1467847612	localtime=Jul 06 16:26:52	
07/06 16:28:46.736 INFO |        server_job:0128| 	GOOD	68676125-chromeos-test/chromeos4-row9-rack10-host16/desktopui_ExitOnSupervisedUserCrash	desktopui_ExitOnSupervisedUserCrash	timestamp=1467847649	localtime=Jul 06 16:27:29	completed successfully
07/06 16:28:46.736 INFO |        server_job:0128| END GOOD	68676125-chromeos-test/chromeos4-row9-rack10-host16/desktopui_ExitOnSupervisedUserCrash	desktopui_ExitOnSupervisedUserCrash	timestamp=1467847649	localtime=Jul 06 16:27:29	
07/06 16:28:46.736 DEBUG|             suite:1095| Adding status keyval for desktopui_ExitOnSupervisedUserCrash=68676125-chromeos-test

The master reported a timeout well before the test started:
@@@STEP_FAILURE@@@
16:00:52: ERROR: Timeout occurred- waited 16061 seconds, failing. Timeout reason: This build has reached the timeout deadline set by the master. Either this stage or a previous one took too long (see stage timing historical summary in ReportStage) or the build failed to start on time.

@@@STEP_FAILURE@@@
16:00:52: ERROR: Timeout occurred- waited 15965 seconds, failing. Timeout reason: This build has reached the timeout deadline set by the master. Either this stage or a previous one took too long (see stage timing historical summary in ReportStage) or the build failed to start on time.
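
For scale, those waits convert to (simple arithmetic on the numbers above):

print(16061 / 3600.0)   # ~4.46 hours
print(15965 / 3600.0)   # ~4.43 hours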

Somehow, the tests are running too slowly on the DUTs.
The infra team is fixing an issue in the cautotest scheduler that might be the root cause of these flaky suite timeouts, so please wait and keep watching the PFQs.
Looks good on cyan-cheets-chrome-pfq
https://uberchromegw.corp.google.com/i/chromeos/builders/cyan-cheets-chrome-pfq/builds/392
It took less than 1 hour to finish the HWTest.

veyron_minnie-cheets-chrome-pfq has a failed test due to an unrelated issue, but it didn't time out.
https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_minnie-cheets-chrome-pfq/builds/398

Will keep watching.
To update: the infra team found a suspiciously bad query that had been running on our database for 2 weeks. After killing it, the test scheduler sped up, and we expect this also sped up our Apache-based RPCs.

For more information, PTAL at crbug.com/626198.
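
For reference, a minimal sketch of how one might look for such long-running queries on a MySQL database (the host, credentials, and one-hour threshold are placeholders; this is not the query the infra team actually ran):

import MySQLdb

conn = MySQLdb.connect(host='AUTOTEST_DB_HOST', user='readonly',
                       passwd='...', db='information_schema')
cur = conn.cursor()
# List non-idle queries that have been executing for more than an hour.
cur.execute(
    "SELECT id, user, time, LEFT(info, 120) "
    "FROM processlist "
    "WHERE command != 'Sleep' AND time > 3600 "
    "ORDER BY time DESC")
for row in cur.fetchall():
    print(row)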
sbasi@, thanks for tracking this down. I will keep watching for the timeout this week. If it does not recur in future runs, I will close this issue.
Blockedon: 626198
https://uberchromegw.corp.google.com/i/chromeos/builders/cyan-cheets-chrome-pfq/builds/394 failed for the arc-bvt-cq tests with a timeout error in the log. The Autotest details show that most tests were aborted and only a few succeeded.
http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=68798997

Could this be related to the timeout issue we are fixing here?
Honestly, it looks like you're overloading the cyan-cheets pool:bvt.

There are only 4 DUTs, and they were all running tests non-stop.

You also had a second arc-bvt-cq suite contending for the same DUTs at roughly the same time:
http://cautotest/afe/#tab_id=view_job&object_id=68793207

for cyan-cheets-tot-chrome-pfq-cheets-informational


What is that builder?

cyan-cheets-tot-chrome-pfq-cheets-informational is picking up CLs and building continuously; it looks like we need to add more DUTs.
> There are only 4 DUTs, and they were all running tests non-stop.

D'oh! I reduced the count to 4 because of bug 625080. That said, the change happened on Wednesday afternoon, and this problem was reported on the Friday before. So supply may not have been the original problem.

Owner: jen...@chromium.org
Status: Assigned (was: Untriaged)
It would not surprise me if we had more than one problem. The symptom is pretty generic (tests taking too long), so it is very possible that one problem (issue 626198) was masking the other problem (not enough DUTs) until the first problem was fixed.

Do we have more DUTs available, and/or should we create a separate issue to track that?

Assigning to the active gardener to track progress on the underlying causes.

I've increased pool:bvt for cyan-cheets back to 8. That means that, by and large, tests requiring pool:suites will starve.

I note that in general, we've had notably worse performance for the past couple of weeks. Those problems cleared up starting somewhere around Wednesday afternoon. That could explain this problem.

Blockedon: 626504
Two problems have been identified for the slow bvt tests. One is the original issue tied to issue 626198, the long-running SQL query; the other is a one-time collision in which both the informational bot and the PFQ bot were running at the same time and contending for DUTs, for which I opened a separate issue 626504. Please move the DUT discussion to issue 626504.
Cc: jen...@chromium.org
Owner: ihf@chromium.org
The timeout happened again:
https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_minnie-cheets-chrome-pfq/builds/412/steps/HWTest%20%5Barc-bvt-cq%5D/logs/stdio

This issue was tracked down in crbug.com/626198, but it has happened again. ihf@, please contact jrbarnette@ or this week's deputy to investigate.
The same symptom occurred on cyan-cheets at the same time:

https://uberchromegw.corp.google.com/i/chromeos/builders/cyan-cheets-chrome-pfq/builds/407

ihf@ - could you investigate these two failures and see if you can determine the root cause? It may be that there is another issue independent of the two mentioned in comment #30.


Comment 33 by ihf@chromium.org, Jul 12 2016

Status: Started (was: Assigned)
I sent out an email with the results of my investigation. I see very negative scaling going from running the 20 tests on 1 DUT to running them on 7 DUTs. The tests themselves run in constant time (the lab is comparable to my local machine), but the drone/reset/provision overhead results in an overall slowdown when using multiple DUTs.
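
To illustrate the shape of that result, a toy model (all numbers are illustrative, not measurements; the assumption that provision/reset is effectively serialized through the drone is mine):

import math

def wall_time_min(n_duts, n_tests=20, test_min=3.0, provision_min=12.0):
    # Toy model: per-DUT provision/reset overhead is serialized through the
    # drone, then the tests themselves run in parallel across the DUTs.
    return n_duts * provision_min + math.ceil(n_tests / float(n_duts)) * test_min

print(wall_time_min(1))   # 12 + 20*3 = 72 min
print(wall_time_min(7))   # 84 + 3*3  = 93 min -> slower despite 7x the hardware

With a large enough per-DUT overhead, adding DUTs makes the overall suite slower even though each test's runtime is unchanged, which matches the negative scaling described above.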
Ping - please provide an update on your high-priority bug. This bug is stale. Is it really P-1?
Status: Archived (was: Started)
I believe the issues discussed here are either resolved or tracked in other, still-active bugs.
