HWTest [sanity] and [arc-bvt-cq] sometimes slow on cheets builders
Issue description: While investigating PFQ builders timing out, I noticed that the HWTest [sanity] stage was taking about twice as long as usual with no obvious explanation. Here is an example builder: https://uberchromegw.corp.google.com/i/chromeos.chrome/builders/cyan-cheets-tot-chrome-pfq-cheets-informational/builds/340

HWTest [sanity]: start 1:41:28, duration 1:00:38, finish 2:42:06
HWTest [arc-bvt-cq]: start 2:42:12, duration 2:24:56, finish 5:07:08

The fact that both [sanity] and [arc-bvt-cq] took much longer than normal (a typical run is 0:32 for [sanity] and 0:45 for [arc-bvt-cq]) leads me to suspect a load issue. Comparing the stdio from the [sanity] test to a typical run, the output looks similar, but with a mysterious time gap between:

Testing ended at 2016-07-01 05:58:52

and

06:30:12: INFO: Running cidb query on pid 26734 ...
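For reference, a minimal sketch of the slowdown arithmetic above (a hypothetical helper, not part of chromite or autotest), assuming stage durations formatted as H:MM:SS:

    # Hypothetical helper: compare observed HWTest stage durations against
    # typical durations and report the slowdown factor.
    def parse_hms(duration):
        """Parse 'H:MM:SS' or 'M:SS' into seconds."""
        parts = [int(p) for p in duration.split(':')]
        while len(parts) < 3:
            parts.insert(0, 0)
        hours, minutes, seconds = parts
        return hours * 3600 + minutes * 60 + seconds

    # Observed vs. typical durations taken from the builds discussed above.
    stages = {
        'HWTest [sanity]': ('1:00:38', '0:32:00'),
        'HWTest [arc-bvt-cq]': ('2:24:56', '0:45:00'),
    }
    for name, (observed, typical) in stages.items():
        factor = float(parse_hms(observed)) / parse_hms(typical)
        print('%s: %.1fx slower than a typical run' % (name, factor))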
Jul 1 2016
And looking at autoserv.DEBUG, the last log line is at 06:04, still 25 minutes before the stage completes.
Jul 1 2016
Here is a snippet of the output we are seeing in the [arc-bvt-cq] stage:

06:30:41: INFO: RunCommand: /b/cbuild.cyan-cheets/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpF6_CDZ/tmpZZLsNb/temp_summary.json --raw-cmd --task-name cyan-cheets-tot-chrome-pfq-cheets-informational/R53-8530.0.0-b340-arc-bvt-cq --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 18600 --io-timeout 18600 --hard-timeout 18600 --expiration 1200 -- /usr/local/autotest/site_utils/run_suite.py --build cyan-cheets-tot-chrome-pfq-cheets-informational/R53-8530.0.0-b340 --board cyan-cheets --suite_name arc-bvt-cq --pool bvt --num 3 --file_bugs True --priority PFQ --timeout_mins 250 --retry True --minimum_duts 3 --suite_min_duts 3 --offload_failures_only False -m 68216040
06:48:57: INFO: Refreshing due to a 401 (attempt 1/2)
06:48:57: INFO: Refreshing access_token
06:54:06: INFO: Refreshing due to a 401 (attempt 1/2)
06:54:06: INFO: Refreshing access_token
07:48:57: INFO: Refreshing due to a 401 (attempt 1/2)
07:48:57: INFO: Refreshing access_token
07:55:06: INFO: Refreshing due to a 401 (attempt 1/2)
07:55:06: INFO: Refreshing access_token
08:49:57: INFO: Refreshing due to a 401 (attempt 1/2)
08:49:57: INFO: Refreshing access_token
08:54:41: WARNING: Killing tasks: [<_BackgroundTask(_BackgroundTask-5:7:4, started)>]
08:54:41: WARNING: Killing 24555 (sig=24 SIGXCPU)

That is a time period of more than 2 hours. Clearly something is wrong here, but it is not clear to me what. There is some extra debugging info in the output; I will see if I can use that to track down a prior or existing bug that may be the root cause.
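For what it's worth, stalls like these can be located by scanning the stage's stdio for unusually large gaps between consecutive timestamps. A minimal sketch (a hypothetical standalone script, assuming the 'HH:MM:SS:' prefixes shown in the output above):

    # Hypothetical helper: flag gaps larger than a threshold between
    # consecutive 'HH:MM:SS:' timestamps in a cbuildbot stdio log.
    import re
    import sys

    TS_RE = re.compile(r'^(\d{2}):(\d{2}):(\d{2}):')
    GAP_THRESHOLD_SECS = 15 * 60

    prev_secs = None
    for line in open(sys.argv[1]):
        match = TS_RE.match(line)
        if not match:
            continue
        hours, minutes, seconds = (int(g) for g in match.groups())
        secs = hours * 3600 + minutes * 60 + seconds
        if prev_secs is not None:
            gap = secs - prev_secs
            if gap < 0:  # log crossed midnight
                gap += 24 * 3600
            if gap > GAP_THRESHOLD_SECS:
                print('%d minute gap before: %s' % (gap // 60, line.strip()))
        prev_secs = secs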
Jul 1 2016
+vipier@, phobbs@ for chromite debugging.
Jul 1 2016
+jrbarnette@ (infra deputy)
Jul 4 2016
As I describe in Issue 624736, this is likely an issue in the cautotest scheduler.
Jul 6 2016
+sheyang@ (infra trooper this week).
Jul 6 2016
+sbasi@
Jul 6 2016
Also see a similar timeout issue on veyron_minnie-cheets-chrome-pfq: https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_minnie-cheets-chrome-pfq/builds/395/steps/HWTest%20%5Barc-bvt-cq%5D/logs/stdio
Jul 6 2016
btw, here is jrbarnette@'s comment related to this issue: https://bugs.chromium.org/p/chromium/issues/detail?id=621352#c12
Jul 6 2016
I looked into the veyron_minnie-cheets-chrome-pfq arc-bvt-cq log; it shows all tests completed and passed. http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=68617660

There is no error in autoserv.ERROR, and autoserv.INFO shows all tests started, ended, and passed. https://00e9e64bac1031d2536fcf643641c0ec69bc052977313af27c-apidata.googleusercontent.com/download/storage/v1_internal/b/chromeos-autotest-results/o/68617660-chromeos-test%2Fhostless%2Fdebug%2Fautoserv.INFO?qk=AD5uMEuULhgNpBkmsxkxunpSOUOxm2d3XfHTe_W62tGgzbeRcugqN4UvHPN1h-fyJPZ6tNlrReRddPvT7CK7xgAEbVhlghYQImb513Ro2dBniDw0c8ZMi6TkHgN2TK8JtW7Gd20vNDA6hrkDK9VQaXoGnV8W5K6CKh3yIYGjoIzSFre_GUD-fBzK_S15n6m6VpjM8bMkIA2nmXwfouiz8SbQPd7R4F7C-q9qxGq_XFI9qYiCZQtzsX887dhlK-B-NGeLE7mXAyQ_U2J5b_AfEfWEU4p5r0iDRS8pGyDDujgmWCHgTRv9BPajM5RtdHPAanT5ZV4T-vb1_GRwgGdgT58_H88DTpBSSQeQ6yvDpXIEfsHt2QS_EtCCCY8f2wnbZbhemXpPH0suhD8AdToVfEv1OAKNemRB9Vls4W9rLrTWbvkDwUfj9WFbeR_bkTXbigiz1_hZg4kEUHnGu1xJeGWQRhFj-BeFE_F9j0qY4hcZvJYbVQxHrOMn5e9RuSNtKiMOzSZMXwdOX0131NBvrKboAX2zCK3MApjMrJ_qVU28MTCPYjOwUU_wOV_xX9hljf46lCW_2mZnuJYczpcUCZQlr_6p6WnTkvq_5OfULMeWjQUdfhF1DfljnMpofFhO46AvCVtqxvE3yJg6nA4g52NEDYgLTJFlFzC-8VKgyPnBYIBtQxAr83gkRTy6A5rYaU-rp4nODkWzPO77jTlwFfM8przvLH37YMAohIs8de41lvFlPaCxwTvqtXcWkbRaJRLXLYQPdTS5R_0DuXMCDZe_Oy16R0NCPS0TSdNIrxS_lHKyr89-qt-ViPDw4wWqSXHqwC3SQJbfZzQJd4mfjusglNpYdxvTozSXnFCEi23_czfmYyIl6Beo57OK0uMDw6id1k1TQLr0h_L-cRRZ57J-CL443cH_Rw

Two tests (video_ChromeHWDecodeUsed.vp9 and video_ChromeCameraMJpegHWDecodeUsed) look like they were skipped due to "Test skipped, no DUTs with expected labels found". I am not sure if this is somehow related to the timeout issue.
Jul 6 2016
"Test skipped, no DUTs with expected labels found" would only be a problem if all the tests said that due to the DUT Pools being misconfigured. Likely these tests require some hardware component that's missing and not required to run on veyron_minnie.
Jul 6 2016
This happened again on veyron_minnie-cheets-chrome-pfq: https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_minnie-cheets-chrome-pfq/builds/396

When I saw the build failure, I checked the details. It looks like one of the tests, desktopui_ExitOnSupervisedUserCrash, was still running at the time. It eventually completed successfully, but only after the timeout limit had passed.

07/06 16:28:46.735 INFO | server_job:0128| START 68676125-chromeos-test/chromeos4-row9-rack10-host16/desktopui_ExitOnSupervisedUserCrash desktopui_ExitOnSupervisedUserCrash timestamp=1467847612 localtime=Jul 06 16:26:52
07/06 16:28:46.736 INFO | server_job:0128| GOOD 68676125-chromeos-test/chromeos4-row9-rack10-host16/desktopui_ExitOnSupervisedUserCrash desktopui_ExitOnSupervisedUserCrash timestamp=1467847649 localtime=Jul 06 16:27:29 completed successfully
07/06 16:28:46.736 INFO | server_job:0128| END GOOD 68676125-chromeos-test/chromeos4-row9-rack10-host16/desktopui_ExitOnSupervisedUserCrash desktopui_ExitOnSupervisedUserCrash timestamp=1467847649 localtime=Jul 06 16:27:29
07/06 16:28:46.736 DEBUG| suite:1095| Adding status keyval for desktopui_ExitOnSupervisedUserCrash=68676125-chromeos-test

The master reported the timeout well before the test started:

@@@STEP_FAILURE@@@
16:00:52: ERROR: Timeout occurred- waited 16061 seconds, failing. Timeout reason: This build has reached the timeout deadline set by the master. Either this stage or a previous one took too long (see stage timing historical summary in ReportStage) or the build failed to start on time.
@@@STEP_FAILURE@@@
16:00:52: ERROR: Timeout occurred- waited 15965 seconds, failing. Timeout reason: This build has reached the timeout deadline set by the master. Either this stage or a previous one took too long (see stage timing historical summary in ReportStage) or the build failed to start on time.

Somehow, the tests run too slowly on the DUTs.
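In other words, the master's deadline is a wall-clock budget for the whole build, so a test that is still running when the budget expires is reported as a timeout even if it later passes. A minimal sketch of that kind of check (hypothetical code, not the actual cbuildbot implementation), assuming a build start time and a total budget in seconds:

    # Hypothetical sketch of a build-wide deadline check: the budget is
    # measured from the build start, independent of whether individual
    # tests are still making progress on the DUTs.
    import time

    class DeadlineExceeded(Exception):
        pass

    def wait_for_stage(stage_is_done, build_start, budget_secs, poll_secs=60):
        """Poll until the stage finishes or the build-wide budget runs out."""
        while not stage_is_done():
            waited = time.time() - build_start
            if waited > budget_secs:
                raise DeadlineExceeded(
                    'Timeout occurred- waited %d seconds, failing.' % waited)
            time.sleep(poll_secs)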
Jul 7 2016
The infra team is fixing an issue in the cautotest scheduler which might be the root cause of this suite-timeout flakiness, so please wait and keep watching the PFQs.
Jul 7 2016
Looks good on cyan-cheets-chrome-pfq: https://uberchromegw.corp.google.com/i/chromeos/builders/cyan-cheets-chrome-pfq/builds/392 It took less than 1 hour to finish the HWTest stages. veyron_minnie-cheets-chrome-pfq has a failed test due to another issue, but it didn't time out: https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_minnie-cheets-chrome-pfq/builds/398 Will keep watching.
Jul 7 2016
To update: the infra team found a suspiciously bad query that had been running on our database for 2 weeks. After killing it, the test scheduler sped up, and we expect this also sped up our Apache-based RPCs. For more information, PTAL at crbug.com/626198.
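For future reference, one way to spot a runaway query like that on a MySQL-backed autotest database is to look at the server's process list for statements that have been running for a long time. A minimal sketch (assumes the MySQLdb driver, read access to information_schema, and placeholder connection details; not the query or tooling the infra team actually used):

    # Hypothetical check for long-running queries on a MySQL server.
    # Connection parameters below are placeholders.
    import MySQLdb

    conn = MySQLdb.connect(host='db-host', user='readonly', passwd='',
                           db='information_schema')
    cursor = conn.cursor()
    cursor.execute(
        'SELECT id, user, time, LEFT(info, 120) '
        'FROM PROCESSLIST '
        'WHERE command != %s AND time > %s '
        'ORDER BY time DESC', ('Sleep', 3600))
    for query_id, user, secs, statement in cursor.fetchall():
        print('query %s by %s has been running for %ds: %s'
              % (query_id, user, secs, statement))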
Jul 7 2016
sbasi@, thanks for tracking this down. I will keep watching for the timeout this week. If it doesn't happen in future runs, I will close this bug.
Jul 7 2016
https://uberchromegw.corp.google.com/i/chromeos/builders/cyan-cheets-chrome-pfq/builds/394 failed in the arc-bvt-cq tests with most tests aborted; a timeout error is visible in the log. The Autotest details show most tests aborted and only a few succeeded: http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=68798997 Could this be related to the timeout issue we are fixing here?
Jul 7 2016
Honestly, it looks like you're overloading the cyan-cheets pool:bvt. There are only 4 DUTs and they were all running tests non-stop. You also had a second arc-bvt-cq suite you were contending with at roughly the same time: http://cautotest/afe/#tab_id=view_job&object_id=68793207 for cyan-cheets-tot-chrome-pfq-cheets-informational. What is that builder?
Jul 7 2016
cyan-cheets-tot-chrome-pfq-cheets-informational is picking up CLs and building continuously; it looks like we need to add more DUTs.
Jul 8 2016
> There are only 4 DUTs and they were all running tests non-stop.

D'oh! I reduced the count to 4 because of bug 625080. That said, the change happened on Wednesday afternoon, and this problem was reported on the Friday before. So, supply maybe wasn't the original problem.
Jul 8 2016
It would not surprise me if we had more than one problem. The symptom is pretty generic (tests taking too long), so it is very possible that one problem (issue 626198) was masking the other problem (not enough DUTs) until the first problem was fixed. Do we have more DUTs available, and/or should we create a separate issue to track that? Assigning to the active gardener to track progress on the underlying causes.
Jul 8 2016
I've increased pool:bvt for cyan-cheets back to 8. That means that by and large, tests requiring pool:suites will starve. I note that in general, we've had notably worse performance for the past couple of weeks. Those problems cleared up starting somewhere around Wednesday afternoon. That could explain this problem.
Jul 8 2016
Two problems have been identified for the slow bvt tests. One is the original issue tied to issue 626198, the long-running SQL query; the other is a one-time collision in which both the informational bot and the PFQ bot ran at the same time and contended for DUTs, for which I opened a separate issue, 626504. Please move the DUT discussion to issue 626504.
Jul 11 2016
The timeout happened again: https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_minnie-cheets-chrome-pfq/builds/412/steps/HWTest%20%5Barc-bvt-cq%5D/logs/stdio This issue was tracked down in crbug.com/626198, but it has happened again now. ihf@, please contact jrbarnette@ or this week's deputy to investigate.
Jul 11 2016
The same symptom occurred on cyan-cheets at the same time: https://uberchromegw.corp.google.com/i/chromeos/builders/cyan-cheets-chrome-pfq/builds/407 ihf@ - could you investigate these two failures and see if you can determine the root cause? It may be that there is another issue independent of the two mentioned in comment #30.
Jul 12 2016
I sent out an email with the results of my investigation. I see very negative scaling when going from running the 20 tests on 1 DUT to running them on 7 DUTs. The tests themselves run in constant time (lab compared to my local machine), but the drone/reset/provision overhead results in an overall slowdown from using multiple DUTs.
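To give a rough sense of that accounting, here is a toy model (all durations are hypothetical placeholders, not lab measurements) of how per-job reset and per-DUT provision overhead can dominate the total DUT time of a suite of short tests and grow as more DUTs are used:

    # Toy accounting of test time vs. provision/reset overhead for a suite
    # of 20 short tests. All durations are hypothetical placeholders; the
    # point is that per-job reset plus per-DUT provision overhead can
    # dominate the total DUT time, and the overhead share grows with the
    # number of DUTs used.
    NUM_TESTS = 20        # from the suite discussed above
    TEST_SECS = 120       # hypothetical runtime of one test
    RESET_SECS = 180      # hypothetical per-job reset before each test
    PROVISION_SECS = 600  # hypothetical one-time provision per DUT

    for num_duts in (1, 7):
        test_time = NUM_TESTS * TEST_SECS
        overhead = NUM_TESTS * RESET_SECS + num_duts * PROVISION_SECS
        total = test_time + overhead
        print('%d DUT(s): %d total DUT-minutes, %d%% of it overhead'
              % (num_duts, total / 60, 100 * overhead / total))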
Jan 18 2017
Ping - please provide an update to your high priority bug. This bug is stale. Is it really P-1?
Oct 6 2017
I believe the issues discussed here are either resolved or tracked in other bugs that are still active.