paygen tests from the previous canary delayed bvt-arc from the next canary, causing timeout
Issue description: For instance:
https://uberchromegw.corp.google.com/i/chromeos/builders/edgar-release/builds/1803
https://uberchromegw.corp.google.com/i/chromeos/builders/squawks-release/builds/2083
The logs mention "infra failure" and "lab failure".
Dec 28 2017
I don't understand what exactly the ask of this bug is. What you're seeing are suite aborts because the suites didn't finish in time. Example:
http://cautotest-prod.corp.google.com/afe/#tab_id=view_job&object_id=165748502
So, not all tests passed: some were aborted because they didn't run in the given time.
Dec 28 2017
Let's keep drilling into that example: https://viceroy.corp.google.com/chromeos/suite_details?job_id=165748502
So there were 6 DUTs, as required for the bvt suite (is it the same for arc? I don't know; +ihf, +jrbarnette). Also, one DUT went through one failure-repair cycle. We should be able to deal with that in the allotted time. There is a possibility that tests ran slower than usual here and that caused the timeout, but I'll need to dig into some data to determine either way. I do see a large number of autoupdate tests, each taking 20+ minutes. They are pretty much hogging the DUTs towards the latter half of the suite. Is that intentional?
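As a rough sanity check on the "hogging" idea, here is a back-of-envelope calculation (Python). The 6 DUTs and the 20+ minute durations come from this comment; the suite window, the number of long autoupdate jobs, and their average duration are assumptions for illustration only, not measured values.

duts = 6                      # from the suite (required for bvt)
suite_window_min = 190        # assumed ~3h10m suite window, see the timed-out run below
dut_minutes_available = duts * suite_window_min

long_autoupdate_jobs = 8      # hypothetical count of 20+ minute autoupdate jobs
avg_long_job_min = 25         # assumed average duration
consumed = long_autoupdate_jobs * avg_long_job_min

print("DUT-minutes available:", dut_minutes_available)                  # 1140
print("DUT-minutes used by long jobs:", consumed)                       # 200
print("fraction of the window:", round(consumed / dut_minutes_available, 2))  # 0.18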
Dec 28 2017
I don't yet have enough evidence to support that infra was actually to blame here (e.g. the shard being too slow), so I'm not taking ownership yet.
Dec 29 2017
I don't quite see why this would NOT be an infra problem, but let's look. Let's look at edgar-release build 1803. The timing diagram is here:
https://viceroy.corp.google.com/chromeos/build_details?waterfall=chromeos&build_config=edgar-release&build_number=1803&milestone_version=&experimental=False&duration=1d&refresh=-1
10:12 Build starts (cleanup)
10:52 buildpackages starts
11:58 buildimage starts
12:43 bvt-arc starts
15:53 bvt-arc is timed out
So bvt-arc started 2h 31m into the build, then ran for 3h 10m and was timed out. Comparing with successful build 1802:
02:12 start
05:02 bvt-arc start
07:14 bvt-arc ends
Interestingly, in the earlier build getting to bvt-arc took longer (2h 50m instead of 2h 31m) but the tests ran faster (finished in about 2h 10m instead of timing out at 3h 10m).
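For reference, the durations above can be recomputed mechanically; a minimal sketch (Python), using only the timestamps quoted in this comment:

from datetime import datetime

# Phase timestamps for edgar-release 1803, copied from the timeline above.
fmt = "%H:%M"
times = {name: datetime.strptime(s, fmt) for name, s in [
    ("build_start", "10:12"),
    ("buildpackages", "10:52"),
    ("buildimage", "11:58"),
    ("bvt_arc_start", "12:43"),
    ("bvt_arc_timeout", "15:53"),
]}

print("time to bvt-arc start:", times["bvt_arc_start"] - times["build_start"])                 # 2:31:00
print("bvt-arc runtime before timeout:", times["bvt_arc_timeout"] - times["bvt_arc_start"])    # 3:10:00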
Dec 29 2017
The only way this wouldn't be caused by slowness is if we added more tests to the suite in question. The working assumption is that it is slowness (since there were enough DUTs in each case).
Dec 29 2017
Ah. I noticed one thing that I think is quite relevant: we're running bvt-arc and paygen tests in parallel, and in the failing case some of the paygen tests got intermixed with the bvt-arc tests. These tests (the autoupdate tests I was complaining about earlier) take a long time, hogged the DUTs, and caused the bvt-arc suite to time out. paygen tests have a longer timeout and should have run later. Compare:
good case:
build timeline: https://viceroy.corp.google.com/chromeos/build_details?build_config=edgar-release&build_number=1802
bvt-arc suite timeline: https://viceroy.corp.google.com/chromeos/suite_details?job_id=165673594
bad case:
build timeline: https://viceroy.corp.google.com/chromeos/build_details?waterfall=chromeos&build_config=edgar-release&build_number=1803&milestone_version=&experimental=False&duration=1d&refresh=-1
bvt-arc suite timeline: https://viceroy.corp.google.com/chromeos/suite_details?job_id=165751331
Dec 29 2017
For the bad case (build 1803), here are the logs from the suite job for bvt-arc:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/165751331-chromeos-test/hostless/debug/
and for the paygen-canary-test suite:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/165761928-chromeos-test/hostless/debug/
We see that the bvt-arc tests were all scheduled ~12:45:
12/27 12:45:19.986 DEBUG| suite:1120| Scheduling cheets_GTS.5.1_r2.GtsPlacementTestCases
12/27 12:45:20.446 DEBUG| suite:1390| Adding job keyval for cheets_GTS.5.1_r2.GtsPlacementTestCases=165751485-chromeos-test
12/27 12:45:20.446 DEBUG| suite:1120| Scheduling cheets_GTS.5.1_r2.GtsNetTestCases
12/27 12:45:20.894 DEBUG| suite:1390| Adding job keyval for cheets_GTS.5.1_r2.GtsNetTestCases=165751488-chromeos-test
...
12/27 12:45:37.762 DEBUG| suite:1390| Adding job keyval for cheets_CameraOrientation=165751625-chromeos-test
12/27 12:45:37.763 DEBUG| suite:1192| Scheduled 38 tests, writing the total to keyval.
12/27 12:45:37.763 DEBUG| dynamic_suite:0622| Waiting on suite.
whereas the canary tests were scheduled ~13:20:
12/27 13:19:17.451 DEBUG| suite:1172| Discovered 4 tests.
12/27 13:19:17.454 INFO | server_job:0218| INFO ---- Start paygen_au_canary timestamp=1514409557 localtime=Dec 27 13:19:17
12/27 13:19:17.455 DEBUG| suite:1120| Scheduling autoupdate_EndToEndTest_paygen_au_canary_full_10255.0.0
12/27 13:19:17.893 DEBUG| suite:1390| Adding job keyval for autoupdate_EndToEndTest_paygen_au_canary_full_10255.0.0=165762248-chromeos-test
12/27 13:19:17.893 DEBUG| suite:1120| Scheduling autoupdate_EndToEndTest_paygen_au_canary_full_10254.0.0
12/27 13:19:18.313 DEBUG| suite:1390| Adding job keyval for autoupdate_EndToEndTest_paygen_au_canary_full_10254.0.0=165762250-chromeos-test
12/27 13:19:18.314 DEBUG| suite:1120| Scheduling autoupdate_EndToEndTest_paygen_au_canary_delta_10254.0.0
12/27 13:19:18.735 DEBUG| suite:1390| Adding job keyval for autoupdate_EndToEndTest_paygen_au_canary_delta_10254.0.0=165762253-chromeos-test
12/27 13:19:18.735 DEBUG| suite:1120| Scheduling autoupdate_EndToEndTest_paygen_au_canary_delta_10255.0.0
12/27 13:19:19.170 DEBUG| suite:1390| Adding job keyval for autoupdate_EndToEndTest_paygen_au_canary_delta_10255.0.0=165762256-chromeos-test
12/27 13:19:19.171 DEBUG| suite:1192| Scheduled 4 tests, writing the total to keyval.
12/27 13:19:19.171 DEBUG| dynamic_suite:0622| Waiting on suite.
Still, they ended up getting mixed up. I can't quite explain this. Although it is not impossible, the autotest scheduler is in practice FIFO, all other things being equal, and in this case the scheduled test jobs should have had the same job dependencies.
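For anyone re-doing this comparison, a sketch (Python) of pulling the "Scheduling" lines out of the two debug logs. The regex is inferred from the excerpts above (the real log format may differ), and the file names are placeholders for the downloaded hostless/debug logs.

import re

# Matches lines like:
# 12/27 12:45:19.986 DEBUG| suite:1120| Scheduling cheets_GTS.5.1_r2.GtsPlacementTestCases
SCHED_RE = re.compile(
    r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) DEBUG\|\s*suite:\d+\|\s*Scheduling (\S+)")

def scheduled_tests(log_path):
    """Return (timestamp, test_name) for every job the suite scheduled."""
    with open(log_path) as f:
        return [m.groups() for m in (SCHED_RE.match(line) for line in f) if m]

# Placeholder file names for the bvt-arc and paygen suite debug logs.
for path in ("bvt-arc.DEBUG", "paygen_au_canary.DEBUG"):
    for ts, test in scheduled_tests(path):
        print(path, ts, test)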
Dec 29 2017
The timeout must be specific to the tests, because I see much longer builds. For instance, build 1719 took a total of 6h 35m and succeeded. Maybe we should just increase the bvt-arc timeout? Why is there so much variability though?
Dec 29 2017
In fact, we see that _no_ bvt-arc tests were run during a 40-minute stretch in the middle of the suite in the bad case, which single-handedly explains the later timeout: https://viceroy.corp.google.com/chromeos/suite_details?job_id=165673594
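A sketch (Python) of how such a dead stretch could be detected mechanically, given (start, end) times for each test job in the suite, e.g. scraped from the suite details page; the interval data and the 10-minute threshold are hypothetical.

from datetime import timedelta

def idle_gaps(intervals, min_gap=timedelta(minutes=10)):
    """Return (gap_start, gap_end) stretches where no job in `intervals` was running."""
    intervals = sorted(intervals)          # sort by start time
    gaps = []
    covered_until = intervals[0][1]
    for start, end in intervals[1:]:
        if start - covered_until >= min_gap:
            gaps.append((covered_until, start))
        covered_until = max(covered_until, end)
    return gaps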
Dec 29 2017
OK, no priority inversion. I looked at one of the interjecting paygen jobs; its parent suite is:
http://cautotest/afe/#tab_id=view_job&object_id=165683967
Compare that with the bvt-arc suite:
http://cautotest-prod.corp.google.com/afe/#tab_id=view_job&object_id=165751331
That paygen job is a delayed job from the _last_ canary. :'( So, paygen jobs got so delayed from the last canary that they got run in the middle of the bvt-arc from the next one, killing it.
Dec 29 2017
The overall problem is still either capacity (6 DUTs are not enough for the load we're putting on bvt) or throughput (there were DUTs sitting around but infra was not using them well). We need to determine which.
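One way to distinguish the two, given per-DUT busy intervals over the suite window (again hypothetical data, e.g. scraped from the suite details page): high utilization despite the timeout points at capacity, low utilization points at throughput. A minimal sketch (Python):

from datetime import timedelta

def dut_utilization(busy_by_dut, window_start, window_end):
    """Fraction of the suite window each DUT spent running jobs."""
    window = window_end - window_start
    util = {}
    for dut, intervals in busy_by_dut.items():
        busy = timedelta()
        for start, end in intervals:
            start, end = max(start, window_start), min(end, window_end)
            if end > start:
                busy += end - start
        util[dut] = busy / window   # timedelta / timedelta -> float in Python 3
    return util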
Dec 29 2017
OK, I will stop working on this. This is what I was about to submit, but you've already answered those questions. Thanks :)
------------------------------------
Sorry, I had not noticed #8 and #9 when I entered #10. You're ahead. I was just checking devserver overload for edgar 1803 (it's chromeos4-devserver4) and there is no overload.
So bvt-arc timed out because it was sharing DUTs with paygen? For that build I see bvt-arc starting at 12:43 and paygen at 12:53.
https://viceroy.corp.google.com/chromeos/build_details?waterfall=chromeos&build_config=edgar-release&build_number=1803&milestone_version=&experimental=False&duration=1d&refresh=-1
That's very similar to successful builds. Paygen starts 9 or 10 minutes later.
Dec 29 2017
Note issue 796439 comment #2. Richard claims there were idle DUTs during the delay.
Jan 4 2018
Similar failures for:
cave-release:
Build details: https://luci-milo.appspot.com/buildbot/chromeos/cave-release/1824
Build timing: https://viceroy.corp.google.com/chromeos/build_details?build_id=2181758
bvt-arc timing: https://viceroy.corp.google.com/chromeos/suite_details?build_id=2181758
veyron_minnie-release:
Build details: https://luci-milo.appspot.com/buildbot/chromeos/veyron_minnie-release/1825
Build timing: https://viceroy.corp.google.com/chromeos/build_details?build_id=2181783
bvt-arc timing: https://viceroy.corp.google.com/chromeos/suite_details?build_id=2181783
snappy-release:
Build details: https://luci-milo.appspot.com/buildbot/chromeos/snappy-release/1344
Build timing: https://viceroy.corp.google.com/chromeos/build_details?build_id=2181701
bvt-arc timing: https://viceroy.corp.google.com/chromeos/suite_details?build_id=2181701
Jan 4 2018
> Note issue 796439 comment #2. Richard claims there were idle DUTs during the delay.
The defining feature of bug 796439 isn't the idle DUTs. The defining feature is that all of the scheduled tests ran to successful completion on the shard, but the master never saw results from the shard and reported the jobs as aborted. The DUTs were idle merely because they had finished work.
Someone should check whether that's what's happening here. Honor of first investigation goes to the primary deputy.
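A sketch (Python) of that check: compare per-job status as seen on the shard's AFE with what the master recorded. How the two status maps are fetched is left out since the data lives on internal services, and the status strings used here are assumptions.

def lost_shard_results(shard_status, master_status):
    """Jobs that completed on the shard but that the master reported as aborted."""
    return sorted(job_id for job_id, status in shard_status.items()
                  if status == "Completed" and master_status.get(job_id) == "Aborted")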
Jan 4 2018
It doesn't look like the tests got a DUT even on the shard: http://chromeos-server104.mtv.corp.google.com/afe/#tab_id=view_job&object_id=165751495
Jan 17 2018
I no longer know what this bug is about. Later analysis (and the bug summary) suggests that tests were slow to start because of other, earlier testing that still wasn't done. But that wasn't the cause of the first failure. It's also not clear to me that there are any problems still remaining to be fixed...
As a practical matter, we've had problems recently with shards that start to run slow because of a problematic query. See bug 798489. That became a particular issue shortly after this bug was filed, and the shards for edgar and squawks (server104 and server112, respectively) are specifically named as problems. So, this is likely a dup of that.
In any event: WontFix, for want of a current, actionable problem.