Remove never- or seldom-failing tests from bvt-inline/bvt-cq
Issue description
dshi@ came up with a dremel query to answer the question "what tests have failed on paladins lately". Results for the 9-day and 90-day histories (those are the precomputed dremel tables) are pasted below.
I propose we remove tests from bvt-inline/bvt-cq if they are not in these sets (or are not in them often enough to justify failing the CQ), and move them to bvt-perbuild.
dremel> SELECT name, failed from
-> (SELECT
-> name,
-> sum(if(status=1, 1, 0)) as failed
-> FROM sponge.testcase.last9days
-> WHERE target.invocation.user = 'chromeos-test'
-> AND name NOT LIKE '%SERVER_JOB%'
-> AND name NOT LIKE '%CLIENT_JOB%'
-> AND target.build_target LIKE '%paladin%'
-> AND name <> 'provision'
-> AND status=1
-> GROUP BY name);
+-----------------------------------------------------------------+--------+
| name | failed |
+-----------------------------------------------------------------+--------+
| login_CryptohomeIncognito | 2 |
| jetstream_LocalApi | 36 |
| jetstream_WanCustomDns | 28 |
| jetstream_GuestFirewall | 382 |
| cheets_CTS_N.CtsOpenGLTestCases | 42 |
| security_StatefulPermissions | 59 |
| security_ProfilePermissions.login | 111 |
| cheets_CTS_N.CtsAccountManagerTestCases | 42 |
| jetstream_NetworkInterfaces | 30 |
| jetstream_DiagnosticReport | 24 |
| video_ChromeRTCHWDecodeUsed.mjpeg | 2 |
| graphics_dEQP | 4 |
| login_OwnershipApi | 2 |
| desktopui_MashLogin | 2 |
| cheets_SELinuxTest | 34 |
| cheets_ContainerSmokeTest | 12 |
| video_ChromeHWDecodeUsed.h264 | 4 |
| cheets_DownloadsFilesystem | 36 |
| moblab_RunSuite | 19 |
| provision_AutoUpdate.double | 2 |
| hardware_RamFio | 1 |
| platform_DebugDaemonGetPerfData | 6 |
| cheets_StartAndroid.stress | 2 |
| desktopui_ExitOnSupervisedUserCrash | 2 |
| cheets_CTS_N.CtsOpenGlPerf2TestCases | 40 |
| login_MultipleSessions | 26 |
| graphics_Idle | 2 |
| cheets_SettingsBridge | 4 |
| p2p_ServeFiles | 2 |
| jetstream_ApiServerAttestation | 27 |
| security_NetworkListeners | 2 |
| cheets_CTS_N.CtsDramTestCases | 44 |
| jetstream_ApiServerDeveloperConfiguration | 34 |
| platform_CUPSDaemon | 2 |
| cheets_CTS.com.android.cts.dram | 18 |
| cheets_KeyboardTest | 2 |
| login_RemoteOwnership | 2 |
| jetstream_GcdCommands | 383 |
| cheets_GTS.GtsPlacementTestCases | 4 |
| cheets_CTS.android.core.tests.libcore.package.harmony_java_math | 25 |
| login_UserPolicyKeys | 2 |
| video_ChromeHWDecodeUsed.vp8 | 2 |
| cheets_GTS.GtsNetTestCases | 6 |
| jetstream_GuestInterfaces | 388 |
| jetstream_PrioritizedDevice | 28 |
| cheets_GTS.GtsAdminTestCases | 6 |
+-----------------------------------------------------------------+--------+
WARNING: 1.7% of data was not scanned (see "settings min_completion_ratio")
46 rows in result set (47.90 sec)
Scan rate: 500.27M rows/sec, SWE cost: 2.82737s
dremel> SELECT name, failed from
(SELECT
name,
sum(if(status=1, 1, 0)) as failed
FROM sponge.testcase.last90days
WHERE target.invocation.user = 'chromeos-test'
AND name NOT LIKE '%SERVER_JOB%'
AND name NOT LIKE '%CLIENT_JOB%'
AND target.build_target LIKE '%paladin%'
AND name <> 'provision'
AND status=1
GROUP BY name);
+-----------------------------------------------------------------+--------+
| name | failed |
+-----------------------------------------------------------------+--------+
| security_SandboxLinuxUnittests | 44 |
| jetstream_LocalApi | 709 |
| telemetry_LoginTest | 69 |
| jetstream_WanCustomDns | 607 |
| login_LoginSuccess | 65 |
| desktopui_ExitOnSupervisedUserCrash.arc | 4 |
| security_Minijail0 | 52 |
| logging_CrashSender | 5 |
| video_VideoSanity.vp9 | 2 |
| logging_UserCrash | 15 |
| login_UserPolicyKeys | 3 |
| login_OwnershipNotRetaken | 128 |
| cheets_CTS.android.core.tests.libcore.package.harmony_java_math | 155 |
| logging_UdevCrash | 4 |
| login_OwnershipTaken | 133 |
| jetstream_PrioritizedDevice | 705 |
| security_RootCA | 2 |
| login_MultipleSessions | 290 |
| cheets_DownloadsFilesystem | 102 |
| graphics_SanAngeles | 33 |
| platform_FilePerms | 36 |
| video_ChromeHWDecodeUsed.vp8 | 30 |
| security_ProfilePermissions.login | 115 |
| cheets_SELinuxTest | 116 |
| desktopui_MashLogin | 85 |
| login_OwnershipApi | 314 |
| graphics_Idle.arc | 6 |
| audio_CrasSanity | 10 |
| security_NetworkListeners | 102 |
| security_AccountsBaseline | 112 |
| cheets_CTS_N.CtsAccountManagerTestCases | 50 |
| platform_DebugDaemonGetPerfData | 102 |
| jetstream_GuestInterfaces | 241 |
| security_ModuleLocking | 2 |
| security_StatefulPermissions | 59 |
| hardware_RamFio | 3 |
| login_GuestAndActualSession | 219 |
| security_SandboxStatus | 245 |
| cheets_GTS.GtsAdminTestCases | 10 |
| platform_CheckCriticalProcesses | 36 |
| security_RuntimeExecStack | 2 |
| cheets_StartAndroid.stress | 2 |
| jetstream_GcdCommands | 217 |
| security_SandboxedServices | 73 |
| platform_MemCheck | 60 |
| security_SuidBinaries.fscap | 1 |
| cheets_NotificationTest | 86 |
| jetstream_ApiServerAttestation | 738 |
| video_ChromeRTCHWDecodeUsed | 365 |
| login_LogoutProcessCleanup | 2 |
| cheets_MediaPlayerVideoHWDecodeUsed | 44 |
| video_VideoSanity.h264 | 2 |
| desktopui_ExitOnSupervisedUserCrash | 72 |
| cheets_SettingsBridge | 68 |
| cheets_KeyboardTest | 75 |
| security_DbusOwners | 2 |
| platform_Perf | 6 |
| login_MultiUserPolicy | 2 |
| cheets_CTS_N.CtsOpenGLTestCases | 42 |
| cheets_CTS_N.CtsOpenGlPerf2TestCases | 44 |
| desktopui_Respawn | 2 |
| cheets_ContainerSmokeTest | 90 |
| cheets_CTSHelper.stress | 6 |
| jetstream_ApiServerDeveloperConfiguration | 727 |
| platform_CUPSDaemon | 34 |
| security_SuidBinaries.suid | 4 |
| jetstream_NetworkInterfaces | 107 |
| video_VideoSanity | 118 |
| graphics_dEQP | 26 |
| dummy_Fail.Fail | 4 |
| video_ChromeHWDecodeUsed.vp9 | 18 |
| hardware_Memtester.quick | 7 |
| p2p_ServeFiles | 22 |
| video_ChromeRTCHWEncodeUsed | 18 |
| cheets_GTS.GtsNetTestCases | 12 |
| video_ChromeHWDecodeUsed.h264 | 93 |
| login_Cryptohome | 6 |
| cheets_CTS_N.CtsDramTestCases | 54 |
| video_ChromeRTCHWDecodeUsed.vp8 | 4 |
| login_RemoteOwnership | 318 |
| login_RetrieveActiveSessions | 10 |
| dummy_Fail.Error | 4 |
| cheets_GTS.GtsPlacementTestCases | 10 |
| video_ChromeHWDecodeUsed | 177 |
| jetstream_DiagnosticReport | 721 |
| login_OwnershipRetaken | 311 |
| moblab_RunSuite | 58 |
| provision_AutoUpdate.double | 59 |
| jetstream_GuestFirewall | 161 |
| login_CryptohomeIncognito | 4 |
| telemetry_LoginTest.arc | 5 |
| video_ChromeRTCHWDecodeUsed.mjpeg | 4 |
| security_RestartJob | 2 |
| graphics_Idle | 59 |
| security_ProfilePermissions.guest | 2 |
| security_ptraceRestrictions | 17 |
| cheets_CTS.com.android.cts.dram | 87 |
| platform_ToolchainOptions | 7 |
| graphics_Gbm | 10 |
| security_ASLR | 2 |
+-----------------------------------------------------------------+--------+
WARNING: 2.0% of data was not scanned (see "settings min_completion_ratio")
100 rows in result set (565.47 sec)
Scan rate: 554.60M rows/sec, SWE cost: 33.103s
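For illustration, a minimal sketch (not the actual tooling) of the proposal: take the per-test failure counts from the queries above and flag any bvt-inline/bvt-cq test whose 90-day failure count falls below a threshold as a candidate to move to bvt-perbuild. The data excerpt and the current-suite list below are stand-ins, and the threshold of 5 reflects the "fewer than 5-10 failures in the last 90 days" heuristic discussed later in this thread.

# Hypothetical excerpt of the 90-day query results above.
FAILED_LAST_90_DAYS = {
    "security_SandboxLinuxUnittests": 44,
    "login_LoginSuccess": 65,
    "cheets_StartAndroid.stress": 2,
    "security_RootCA": 2,
    # ... remaining rows from the 90-day table would go here.
}

# Hypothetical list of tests currently in bvt-inline/bvt-cq; the real
# membership lives in the autotest control files, not here.
CURRENT_BVT_TESTS = [
    "login_LoginSuccess",
    "security_RootCA",
    "cheets_StartAndroid.stress",
    "platform_SomeTestThatNeverFails",  # made-up name for a never-failing test
]

def perbuild_candidates(bvt_tests, failures, threshold=5):
    """Return tests that never failed, or failed fewer than `threshold` times."""
    return sorted(t for t in bvt_tests if failures.get(t, 0) < threshold)

for test in perbuild_candidates(CURRENT_BVT_TESTS, FAILED_LAST_90_DAYS):
    print("candidate for bvt-perbuild:", test)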
,
May 15 2017
I think we probably should have a meeting to discuss this. Before making any decisions, it would be useful to know exactly what tests are being slated for removal, and how much time would be saved by removing them. Also, can we look back longer than 90 days, say a year?
,
May 15 2017
afaik the dremel tables cover those two fixed histories (9 and 90 days). Authoritative data is in TKO for 6 months, but dshi struggled to find an efficient way to query it. Note that all (re)moved tests would still run in bvt-perbuild and should be treated as release blockers.
,
May 15 2017
I am generally not in favor of diluting test coverage, especially from the BVTs, since that's what gets the most attention. Is this being driven by attempts to reduce CQ time? How much time do we expect to save?
,
May 15 2017
I am open to reorganizing the BVT, especially together with the PFQ efforts to reorganize coverage there. I am also open to moving the graphics_* tests into bvt-perbuild, as we are going to use an alternate alerting system going forward and can detect and fix failures out of band using the upcoming graphics rotation. I am not open to removing cheets_StartAndroid, CTS and GTS tests from the PFQ though. In fact I am fairly sure your query isn't working as intended, as these tests did detect this failure last week: https://luci-milo.appspot.com/buildbot/chromeos/veyron_mighty-paladin/5247
[Test-Logs]: cheets_CTS_N.CtsAccountManagerTestCases: retry_count: 2, FAIL: Error: Could not find any tests in module.
[Test-Logs]: cheets_CTS_N.CtsDramTestCases: retry_count: 2, FAIL: Error: Could not find any tests in module.
[Test-Logs]: cheets_CTS_N.CtsOpenGLTestCases: retry_count: 2, FAIL: Error: Could not find any tests in module.
[Test-Logs]: cheets_CTS_N.CtsOpenGlPerf2TestCases: retry_count: 2, FAIL: Error: Could not find any tests in module.
,
May 15 2017
RE #4: Yes, the desire is to reduce CQ time. RE #6: dshi, what's the time lag on this dremel data? Is it a few days or a week behind?
,
May 15 2017
Sponge data should be up to date. I don't think there is any delay once the result is parsed; basically, no delay compared to TKO data.
,
May 15 2017
Mhh, maybe I misunderstood. You want to remove tests that are not in either list above. Which tests are those? Also, why does jetstream_* fail so often and why does it not pose problems?
,
May 15 2017
That said, to reduce time in the CQ/PFQs and increase the reliability of uprevs, I am OK with reducing coverage to the absolute minimum. This will require individual teams to watch for daily regressions, work they traditionally have been unable to do reliably. The minimum viable testing in my mind is to cover gating problems (a failing gate would shut out many more tests in bvt-perbuild or the daily/weekly suites). I consider these the gates: 1) The system still updates (the kernel boots). 2) The system still updates after an update (the new updater is working). 3) Chrome starts and logs in *reliably*. 4) Android starts *reliably*. 5) GPU acceleration is still available. 6) CTS/GTS testing is not broken. But there may be more gates.
,
May 15 2017
The jetstream tests are only on the whirlwind paladin which is currently marked experimental. Yes, I want to remove tests that are not in the lists above. And possibly also some of the ones that ARE in the list but with fewer than 5-10 failures in the last 90 days.
,
May 15 2017
Just doing this by numbers has no meaning.
,
May 15 2017
I spot-checked two random builds to try to understand how long the BVTs take to run and found that the time varies a lot. The first build below took about 22 minutes vs. 1 hour 4 minutes for the second. Is most of the time delta here due to long scheduler ticks? https://viceroy.corp.google.com/chromeos/suite_details?build_id=1517015 https://viceroy.corp.google.com/chromeos/suite_details?build_id=1515486
,
May 15 2017
This may be scheduler ticks, as the same tests take longer in one view than the other. There is also great inefficiency in starting any autotest server test (ssp), which spins up an lxc instance from scratch and adds about 3-5 minutes to each autotest server test. See issue 720219 (Ben is going to look into reducing that). Not caching files when starting jobs also eats a lot of time. There are many options to make autotest work faster. Reducing test coverage reduces cycle time without having to make autotest faster (why not do both?). The main benefit of reducing test coverage in the CQ/PFQ, in my mind, is that it also reduces flakes and makes troubleshooting by sheriffs/gardeners simpler, as the remaining few tests will be better understood.
,
May 15 2017
You're comparing elm to kevin, so it's not quite apples-to-apples. Also, the build timeline says the comparison is more like 41m vs. 1:33: https://viceroy.corp.google.com/chromeos/build_details?build_id=1517015 https://viceroy.corp.google.com/chromeos/build_details?build_id=1515486 As far as tick time goes, one of those is on shard 98, the other on 99. Both shards are ticking at a similar rate: http://shortn/_t6VvtiK3iw
144 chromeos-server98.mtv.corp.google.com board:kevin, board:veyron_minnie, board:caroline
145 chromeos-server99.mtv.corp.google.com board:elm, board:oak, board:chell
,
May 16 2017
davidriley: would it be possible to come up with a distribution of BVT test times to help understand how much board-to-board variance vs. run-to-run variance there is?
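One possible shape for that analysis, as a rough sketch with invented numbers (the real samples would come from the viceroy/sponge suite timings): split the spread of HWTest suite durations into a board-to-board component and a run-to-run component.

from collections import defaultdict
from statistics import mean, pvariance

# (board, suite_seconds) records; the values below are made up for illustration.
SAMPLES = [
    ("kevin", 1350), ("kevin", 1420), ("kevin", 3900),
    ("elm", 2500), ("elm", 2650), ("elm", 5600),
    ("veyron_speedy", 2100), ("veyron_speedy", 2300), ("veyron_speedy", 4200),
]

by_board = defaultdict(list)
for board, seconds in SAMPLES:
    by_board[board].append(seconds)

grand_mean = mean(s for _, s in SAMPLES)
# Board-to-board variance: spread of each board's mean around the grand mean.
between = mean((mean(v) - grand_mean) ** 2 for v in by_board.values())
# Run-to-run variance: average of each board's own variance.
within = mean(pvariance(v) for v in by_board.values())

print("board-to-board variance: %.0f, run-to-run variance: %.0f" % (between, within))

A large run-to-run component would point at scheduling/infra effects rather than genuine board differences.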
,
May 16 2017
Re c#17: bvt-inline (important): http://shortn/_a7F0kB6MEg bvt-cq: http://shortn/_J4IFR6G85C arc-bvt-cq: http://shortn/_UOzbTYqMVT The bucketing at the high end unfortunately sucks a bit.
,
May 16 2017
How should I read this? Anything between an hour and four hours?
,
May 16 2017
I find the graphs in #18 very telling. The median time has actually trended down, but the variance has increased and the upper tail has gone up. The CQ cares more about the upper tail, since it usually has to wait for one of the slowest slaves.
,
May 16 2017
Luigi, could you please help analyze the time spent in the HWTest phase of the BVT across boards, and for each board across multiple runs? Once we have that data, let's revisit whether we want/need to re-partition the tests and, if so, which ones.
,
May 16 2017
I am taking a look. About comment #11: do we need AU tests in the CQ? 1. AU tests are among the flakiest (how many of the provision_AutoUpdate failures counted above are false negatives?). 2. They are also long tests. 3. Do we rely on AU working on ToT? I never did. (For canaries, sure.) 4. The AU code doesn't change much. Also, I am not sure that "tests that failed 0 times" alone is a good enough measure. How bad is it to ignore the failure of a specific test? Will the system boot, will it stay up long enough, will the display work, will we be able to ssh into the system, etc.? And now back to understanding the variance.
,
May 16 2017
#17: these graphs aren't exactly self-explanatory (to me, at least). 1. I can see that the x-axis is time/date. 2. What is the y-axis? Test time in seconds? 3. Are the grey blocks the number of samples falling in each time interval and value range? Are these all successful tests? There is a 5x range between the slowest and the fastest ones. That seems a little high.
,
May 16 2017
1. Yes. 2. Stage (i.e. total HWTest suite) time in seconds. 3. Yes. I think the 5x range is real.
,
May 17 2017
c#24, c#25: it would be good to find examples of a few runs where the timing disparity is 5x and analyze those closely. Is it possible to run a query for test times > x minutes and < y minutes? Comparing the two random runs below, the first run completed in 40 minutes while the second one completed in 33 minutes. The delta seems to be due to how the tests got sharded -- in the first run the longer-running tests got scheduled later and ended up becoming the long poles. Could we do smarter scheduling where the longer tests in a suite get scheduled first? https://viceroy.corp.google.com/chromeos/suite_details?build_id=1522370 https://viceroy.corp.google.com/chromeos/suite_details?build_id=1522383
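For what it's worth, "schedule long tests first" is the classic longest-processing-time-first heuristic. A toy sketch (the durations are invented; in practice they would come from historical suite timings):

import heapq

TEST_DURATIONS = {  # minutes, hypothetical
    "provision_AutoUpdate.double": 12,
    "cheets_CTS_N.CtsDramTestCases": 9,
    "video_ChromeHWDecodeUsed.h264": 4,
    "login_LoginSuccess": 2,
    "security_NetworkListeners": 1,
}

def schedule_longest_first(durations, num_duts):
    """Assign tests to DUTs, longest first, always onto the least-loaded DUT."""
    duts = [(0.0, i, []) for i in range(num_duts)]  # (load_minutes, dut_id, tests)
    heapq.heapify(duts)
    for test, minutes in sorted(durations.items(), key=lambda kv: -kv[1]):
        load, dut_id, tests = heapq.heappop(duts)
        tests.append(test)
        heapq.heappush(duts, (load + minutes, dut_id, tests))
    return sorted(duts, key=lambda d: d[1])

for load, dut_id, tests in schedule_longest_first(TEST_DURATIONS, num_duts=2):
    print("DUT %d: %.0f min -> %s" % (dut_id, load, tests))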
,
May 17 2017
c#27: to put that 7-minute delta into better context: excluding provisioning, just the test run time in the first run above was 26 minutes vs. 18 minutes in the second run, so close to 50% overhead.
,
May 17 2017
#27: running slow tests first was forked to Issue 723705.
,
May 18 2017
We've been looking at a veyron_speedy build and discussed it in an email thread, whose relevant parts I am copying here.
----------------
I am looking at veyron_speedy-paladin build 5295. In viceroy, this reports a duration of 1:10:01 for the bvt-cq test, starting at 1:25:58, ending at 2:35:59. http://okaybye2.mtv.corp.google.com:5381/chromeos/build_details?build_config=veyron_speedy-paladin&build_number=5295 (viceroy has a bug right now so I am using the okaybye2 mirror)
However, the logdog for that build measures things differently: https://luci-logdog.appspot.com/v/?s=chromeos%2Fbb%2Fchromeos%2Fveyron_speedy-paladin%2F5295%2F%2B%2Frecipes%2Fsteps%2FHWTest__bvt-cq_%2F0%2Fstdout
Suite timings:
Downloads started at 2017-05-17 01:26:09
Payload downloads ended at 2017-05-17 01:26:17
Suite started at 2017-05-17 01:27:22
Artifact downloads ended (at latest) at 2017-05-17 01:27:25
Testing started at 2017-05-17 01:27:29
Testing ended at 2017-05-17 02:02:59
The last few lines of the logdog indeed show the later time stamp:
...
3347815 Reset started on: 2017-05-17 01:50:26 status PASS
118343967 veyron_speedy-paladin/R60-9559.0.0-rc2/bvt-cq/platform_ToolchainOptions started on: 2017-05-17 01:45:06 status Completed
3347733 Provision started on: 2017-05-17 01:28:35 status PASS
05-17-2017 [02:35:51] Output below this line is for buildbot consumption:
Will return from run_suite with status: OK
02:35:55: INFO: No json dump found, no HWTest results to report
02:35:55: INFO: Running cidb query on pid 16533, repr(query) starts with <sqlalchemy.sql.expression.Insert object at 0x7f66bdffffd0>
02:35:59: INFO: cidb query succeeded after 1 retries
02:35:59: INFO: Running cidb query on pid 16533, repr(query) starts with <sqlalchemy.sql.expression.Update object at 0x7f66bd424350>
************************************************************
** Finished Stage HWTest [bvt-cq] - Wed, 17 May 2017 02:35:59 -0700 (PDT)
************************************************************
01:20:03: INFO: Created cidb engine bot@173.194.81.53 for pid 15940
01:20:03: INFO: Running cidb query on pid 15940, repr(query) starts with <sqlalchemy.sql.expression.Insert object at 0x7f66bdea53d0>
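To make the gap in the timestamps above explicit, a quick arithmetic check (all timestamps copied from the viceroy/logdog excerpts):

from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
stage_start   = datetime.strptime("2017-05-17 01:25:58", FMT)  # viceroy stage start
testing_start = datetime.strptime("2017-05-17 01:27:29", FMT)  # "Testing started"
testing_end   = datetime.strptime("2017-05-17 02:02:59", FMT)  # "Testing ended"
stage_end     = datetime.strptime("2017-05-17 02:35:59", FMT)  # "Finished Stage HWTest [bvt-cq]"

print("stage total:    ", stage_end - stage_start)      # 1:10:01
print("actual testing: ", testing_end - testing_start)  # 0:35:30
print("post-test tail: ", stage_end - testing_end)      # 0:33:00

So roughly half of the reported 1:10:01 stage time is spent after testing ended (parsing, reporting, cidb bookkeeping, or simply waiting), which is the part examined in the comments below.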
,
May 18 2017
Ilja's comments:
Tests start at 01:48 https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/118010716-chromeos-test/hostless/debug/
05/17 01:28:54.020 DEBUG| dynamic_suite:0606| Waiting on suite.
05/17 01:48:02.644 INFO | server_job:0184| START 118010962-chromeos-test/chromeos4-row4-rack12-host18/graphics_GLMark2 graphics_GLMark2 timestamp=1495010594 localtime=May 17 01:43:14
And ended by 2:07
05/17 02:06:35.736 DEBUG| suite:1474| Adding job keyval for graphics_SanAngeles=118011508-chromeos-test
05/17 02:07:16.270 DEBUG| dynamic_suite:0608| Finished waiting on suite. Returning from _perform_reimage_and_run.
--------------
David's comments:
I've not dug into why things might differ but Viceroy is reporting stage timings. If you click the Suite button off the build details it will bring up details about all the individual tests. In particular in this case, I'm guessing it's getting hit by ongoing P0 crbug.com/723645
-------------------------
Ilja:
That says it starts about 1:42 and ends about 2:04. Time is a wonderfully fuzzy thing. I mean I get that provisioning starts at 1:28 and some pages count provisions as tests. But even the suite page shows 2 visualizations with different times for different tests, for example http://okaybye2.mtv.corp.google.com:5381/chromeos/suite_details?build_id=1524310 video_ChromeHWDecodeUsed.vp8: top visualization 02:01:39 - 02:03:19, bottom visualization 02:01:35 - 02:03:48. And clicking on either one gets me to https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/118011235-chromeos-test so yes, they are the same job.
-------------
Ilja:
Last work I see is in parse.log ending 2:11:30. Then the GS filestamps are from 2:27. I assume after that chromeos-server3.hot.corp.google.com is taking an espresso or two lattes before checking off the work in the DB?
********************************************************************************
02:08:43 05/17/17> ['nice', '-n', '10', '/usr/local/autotest/tko/parse', '--write-pidfile', '--record-duration', '--suite-report', '-l', '2', '-r', '-o', u'/usr/local/autotest/results/118010716-chromeos-test/hostless']
********************************************************************************
tko parser: [...]
02:11:30: INFO: Attempting refresh to obtain initial access_token
02:11:30: INFO: Refreshing access_token
tko parser: Successfully export timeline report to gcloud
-------------------
Ilja:
Notice chromeos-server3.hot.corp.google.com is a 32-core machine with no workload to speak of, so it is waiting for who knows what. https://viceroy.corp.google.com/chromeos/machines?hostname=chromeos-server3&board=terra&duration=8d&mdb_role=chrome-infra&pool=managed%3Acts&refresh=-1&status=Running&topstreams=40
---------------
Ilja:
Actually, the server is an interesting kind of idle, with a lot of leaked processes. I am going to reboot it.
ihf@chromeos-server3:~$ ps -ef | grep autoserv | wc
    459   11448  192575
ihf@chromeos-server3:~$ ps -ef | grep dhclient | wc
    613    4914   35059
,
May 18 2017
I think some/most of the time is getting tied up in doing AFE/TKO queries in order to generate the report at the end and JSON files for export to cloud datastore. If those databases are not being responsive to basic queries, then stuff all starts falling down.
,
May 18 2017
Thanks David. Do you have any pointers to relevant logs or code to help understand this? If there aren't any relevant logs, maybe there should be some?
,
May 18 2017
I just realized that the suite report finishes at 2:11 (from the data in parse.log, which ihf@ also points out). So there are 3 minutes to do the TKO parse and datastore export. There are some reports generated, and I'm not sure where. You can probably start from cbuildbot/stages/test_stages.py and look for the run_suite call (I think there are two, one in scripts, and then another in autotest).
,
May 22 2017
To return to the high-level picture: servers sometimes get slow. (After a reboot they are often surprisingly fast.) The slowness comes from either handling too many jobs or hitting a networking slowdown, possibly disk I/O or CPU limits, or a VM kernel bug as in issue 724396. The expensive operations right now are usually network file transfers, a tar xvf, or a gsutil operation. Presumably they sometimes slow down even more due to ganeti virtualization. Starting Chrome takes 20 seconds, starting Android another 20 seconds, rebooting is 40 seconds, but at least none of these cost server resources. What can be done? For server code, avoid downloads and cache/share files in a safe way. Cache trees in the untarred state to avoid repeated untarring. Be cautious about how much gets uploaded to GS. Notice that the cheapest CTS test used to run 10 minutes a few months ago (mostly server overhead); now we are down to 3 minutes when servers are not crawling. Running faster condenses I/O operations into a shorter time frame. And we run more and more server tests, due to enabling more and more boards. Notice CTS is not the only server test; provision and autoupdate, for instance, are too. Similarly, the code that collects log files and uploads results is essentially a server operation. So the goal here is to make *all* server code more efficient. Or at least get a barge full of servers shipped to MTV.
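A hedged sketch of the "cache trees in the untarred state" idea only; the cache path is hypothetical and the locking/eviction story is deliberately simplified.

import hashlib
import os
import tarfile

CACHE_ROOT = "/usr/local/autotest/cache/untarred"  # hypothetical location

def cached_untar(tarball_path):
    """Return a directory with the extracted tarball, extracting at most once."""
    with open(tarball_path, "rb") as f:
        # Reading the whole file is fine for a sketch; real code would hash in chunks.
        digest = hashlib.sha256(f.read()).hexdigest()
    dest = os.path.join(CACHE_ROOT, digest)
    done_marker = dest + ".done"
    if not os.path.exists(done_marker):
        # Real code would also guard against concurrent extractions of the same tree.
        os.makedirs(dest, exist_ok=True)
        with tarfile.open(tarball_path) as tar:
            tar.extractall(dest)
        open(done_marker, "w").close()
    return dest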
,
May 22 2017
That said, the graphics team will reduce the number of tests running in bvt-inline/bvt-cq dramatically. None of them are server tests, but at this stage not regressing the ability to start Android and to run CTS reliably is more important. https://chromium-review.googlesource.com/#/c/510112/ I can help with going through other tests in the BVT and moving them to bvt-perbuild, where they can be monitored by individual teams. The main focus will be on tests which exercise Chrome.
,
May 22 2017
I had a very similar discussion with sosa@ about four years ago wrt. security tests. TL;DR: why would we want to reduce test coverage? Moving security tests outside of inline/cq essentially moves all security regression work onto the 3.5-person security team. Is that what we want to do for one of the S pillars of Chrome OS? I never understood why the first reaction to "our infra is not quite working" is "let's reduce the number of tests we run!" instead of "let's make our infra better".
,
May 22 2017
+David. David James has come up with some recommendations for speeding up the CQ, but I am not sure where/when/how he's sharing them. One of the recommendations (sorry for the spoiler!) is to keep track of the tests that each individual CL passes across multiple CQ runs. Since most CQ failures are infra flake, after two failed CQ runs most CLs will have passed all tests and can be merged. This will miss CLs that cause or increase flake, but it's good enough. Since I don't see a consensus that we should remove tests, I am taking myself off this bug. I am still looking into unexplained CQ latency, but I'll find another bug for that.
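A minimal sketch of that recommendation, with made-up data structures (not David's actual design): remember which tests each CL has already passed across CQ runs, and treat a CL as mergeable once it has accumulated a pass for every required test, even if the runs it rode in failed overall because of other CLs or infra flake.

from collections import defaultdict

REQUIRED_TESTS = {"login_LoginSuccess", "security_SandboxStatus", "graphics_dEQP.bvt"}

passed_by_cl = defaultdict(set)  # cl_id -> set of tests it has passed so far

def record_cq_run(cl_ids, test_results):
    """test_results: {test_name: True/False} for one CQ run containing cl_ids."""
    passed = {t for t, ok in test_results.items() if ok}
    for cl in cl_ids:
        passed_by_cl[cl] |= passed

def mergeable(cl_id):
    return REQUIRED_TESTS <= passed_by_cl[cl_id]

# Example: two CQ runs, each failing a different test; the CL has still passed
# every required test at least once across runs and could be merged.
record_cq_run(["cl/123"], {"login_LoginSuccess": True,
                           "security_SandboxStatus": True,
                           "graphics_dEQP.bvt": False})
record_cq_run(["cl/123"], {"login_LoginSuccess": True,
                           "security_SandboxStatus": False,
                           "graphics_dEQP.bvt": True})
print(mergeable("cl/123"))  # True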
,
May 22 2017
> I had a very similar discussion with sosa@ about four years ago wrt. security tests. TL;DR: why would we want to reduce test coverage? Moving security tests outside of inline/cq essentially moves all security regression work onto the 3.5-person security team.

The point is to use finite resources efficiently. Regressions have a cost, but so does running tests. I'm also not suggesting we remove all such tests. But how many regressions have been caught in 90 days, and did catching them require ~20 distinct tests?

> I never understood why the first reaction to "our infra is not quite working" is "let's reduce the number of tests we run!" instead of "let's make our infra better".

PS: It's not our first reaction. In fact, compared to the other ongoing efforts to improve infra, I think an audit of which tests are providing value is long overdue.
,
May 22 2017
(was referring in particular to security_* tests in #40)
,
May 22 2017
One thing that is true is that we have been operating under the assumption that it's mostly equivalent to run one longer test or several shorter tests. This bug implies that assumption is not true. I wouldn't mind combining some tests if that would clearly save resources (I'd love to look at some data for this, BTW). I also challenge the arbitrary "90-day" rule. We had one of the security_ tests catch a full regression of the Chrome sandbox (i.e. the sandbox was accidentally fully disabled). Catching that at CQ time is pretty invaluable. In the spirit of being proactive, I'd love to see better numbers on per-test overhead, and if those suggest we could do better by combining tests, I can do the work of combining them.
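For illustration of the "combining tests" idea only: an autotest-style control file that runs several quick client-side checks back to back on one provisioned DUT, so the suite pays one scheduling/provisioning round trip instead of several. The combined NAME and the choice of tests are hypothetical, and whether these particular tests can safely share a DUT's state is an open question.

# Sketch of a combined autotest control file. The `job` object is injected by
# the autotest harness when the control file is executed; this is not a
# standalone script.
AUTHOR = "security team"
NAME = "security_QuickChecks"        # hypothetical combined entry
ATTRIBUTES = "suite:bvt-cq"
TIME = "SHORT"
TEST_TYPE = "client"
DOC = """
Runs a handful of cheap security regression checks in a single job so the
suite pays for one scheduling round trip instead of several.
"""

job.run_test('security_ASLR')
job.run_test('security_RuntimeExecStack')
job.run_test('security_SandboxStatus')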
,
May 22 2017
WRT combining tests: We know that a significant percentage of infra failures are related to provisioning (devserver timeouts). Unfortunately we don't have actual numbers on this and they are not trivial to query. Also, they tend to cascade, i.e. once the devserver is overloaded we tend to get a lot of failures for a while. So, any effort to reduce the amount of provisioning we do during testing is a pretty big win in my book.
,
May 22 2017
To give an example, a recent cyan-chrome-pfq builder deployed tests to 5 different DUTs in the arc-bvt-cq suite: https://uberchromegw.corp.google.com/i/chromeos/builders/cyan-chrome-pfq/builds/1175
,
May 22 2017
Jorge, the idea here is to move the working of the CQ in the direction of how changes are landed and tested on the Chromium waterfall. I am happy to discuss details with you. The basic problem to me is that the inner loop of the CQ is a very expensive place for testing. In particular, the way it is implemented, it keeps DUTs in the lab reserved and idle most of the time (while the builders are building). If we can free these DUTs for informational testing, we can have a higher test throughput in the lab. It has been agreed that the CQ cannot run zero tests ("if it builds, it ships") because reverting on Chrome OS is much slower than on the Chromium waterfall. Instead the CQ needs to contain a minimum amount of tests that ensure more testing can happen after the CQ is done and commits the changes. If you want, call them meta tests, or gate-keeping tests. For instance, the reason we run CTS tests in the CQ is not to avoid regressing these particular tests, but to avoid regressing the ability to run CTS tests in informational suites afterwards (and cause CTS-untestable builds for days or weeks). Similarly, the reason we test updating from the update is to ensure we can keep the lab alive, not that users have a great experience. As for Chrome, the ability to log in gates many other tests, so even though logging in is totally boring, we need to be able to do so reliably. As for security tests, there might be a gate-keeping functionality there, like the sandbox regression you mention. I will have to look at all of them (but maybe you know?). Many of the security failures that I have seen are change-detector tests though (say, do the same services as yesterday, as described in the whitelist, still run today). I don't think these should be causing CQ runs to fail. A sheriff/gardener can examine such a failure out of band (and whitelist the new services). What we should do is run security tests more often in informational suites, say on the Chromium waterfall, so the granularity to detect their failure is lower and closer to the source. But it is true this means more work for the security team. Maybe this means you can have your own rotation?
,
May 23 2017
My experience is that before we had all the regression security_ tests, my day consisted exclusively of fixing regressions. So, we will have to agree to disagree, but I don't want to go back to those days and not being able to do actual work. And I'd appreciate it if folks didn't force me to go back to those days either =) Like I said before, I am happy to combine tests into more logical units, but we need security regression tests in order to keep the security team productive. We're 3.5 people, already have enough rotation work dealing with incoming security bugs. I still feel we're discussing the usefulness of regression tests. We shouldn't. They're useful. We already know that. But for them to be useful, they need to happen before things get committed. You mention the Chrome waterfall, but most bots in the Chrome waterfall are *not* FYI bots, they're CQ-blocking bots. You also say "In particular the way it is implemented it keeps DUTs in the lab reserved and idle most of the time". Isn't the fix for that to change the way things are implemented, rather than remove our test coverage? Moreover, these tests should be failing in the PreCQ stage, never in the CQ -- all of them test things that can be tested in VMTests. So I don't buy the argument about blocking the CQ for these things. Another thing we could do is move some of these tests to the ImageTest stage. Some tests (like AccountsBaseline or some of the FS checks) just require image data, not a running system. Alternatively, we can be very aggressive about reverting CLs -- but I don't want to have to fix code or test out of band. That would kill my productivity.
,
May 23 2017
I tend to agree with jorgelo@. I am not too gung-ho about removing tests from the CQ. We can & should revisit what is currently in the CQ and remove tests that are not adding much value and perhaps add tests that add value. Diluting test coverage and then having developers play roulette each time they repo sync should not be the answer to infra issues. BTW, davidjames@ has a few proposals on how to improve CQ efficacy in the doc below that should help. PTAL: https://docs.google.com/document/d/1EKrUevrK7-gp7Jo_I5ND3BAJI2uBIH9TUS3igbknEYA/edit#
,
May 23 2017
Without touching on other points, I strongly agree that tests should run in the pre-CQ (e.g. via VMTest or ImageTest) whenever possible. I suspect that most security tests that can run in the pre-CQ could be removed from the CQ afterwards (to reduce the chances of flakes or infra issues).
,
May 23 2017
Agreed on doing more in pre-CQ -- failing CLs can't cause collateral damage there unlike the CQ where multiple unrelated CLs get batched together.
,
May 23 2017
Re 48: we have this beautifully complicated monster: One CQ, two PFQs, a pre-CQ, but no pre-PFQs... I don't think moving a security test, which tests Chrome, out of the CQ into the pre-CQ works as you intend it. You will still need the same test in the Chrome PFQ, which means it needs to belong to another distinctly run logical group. And you need to cover the same hardware configurations in all places, otherwise you don't prevent breakage, you just push it from the CQ to the PFQ where the gardener is forced to clean it up (instead of the sheriff who owns the CQ, or the security team who best knows those tests).
,
May 23 2017
Re 46: VMTest may be a very good place for the security tests to be (and it detects failures early), except that VMs are not hardware-specific (Intel only, one kernel only, etc.). But the fundamental problem with the CQ is that the rate of incoming CLs and the rate of incoming errors overwhelms the rate at which anybody is able to test. This starts with builds taking 60-90 minutes, continues with reimaging Chromebooks taking 15 minutes, and extends to each test running 1-3 minutes. This means each CQ step takes 2 hours in good times, more in bad times. And it is hard for any automated system to distinguish good from (possibly several) bad CLs (say 100 CLs a day, 2-3 bad ones, 6-8 runs a day; call this a variation of Shannon's theorem). That is really a fundamental problem, which the Chromium waterfall attacked by making builds fast, making infra overhead low, making infra scale, making tests short and making tests not flaky. I am still waiting for the day that portage becomes fast. I do. I have been working with Stephane to reduce infra server overhead, but there is lots of low-hanging fruit that everyone can help with. Making tests not flaky (by improving the product or fixing the test) is the test owner's responsibility. And making tests fast? Autotest is a pig. Even 1-second client tests take 40 seconds of server/DUT time. Sample (duration 1 s): https://wmatrix.googleplex.com/testrun/unfiltered?test_ids=486011675 But really it is at least 38s, possibly more hidden outside of these log files: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/119066934-chromeos-test/chromeos6-row2-rack23-host8/debug/
05/22 14:04:56.760 INFO | autoserv:0687| Results placed in /usr/local/autotest/results/119066934-chromeos-test/chromeos6-row2-rack23-host8
[...]
05/22 14:05:34.924 DEBUG| logging_manager:0627| Logging subprocess finished
,
May 23 2017
There are only a very small number of security_* tests that test Chrome. And there are a lot of security_* tests that are not hardware-specific. Can we create a suite that only gets run in the PreCQ (and maybe also perbuild)? If I can have PreCQ+perbuild, I think we could be OK with not doing most of the work in the CQ proper.
,
May 23 2017
Agreed, we should preserve tests in VMTest, since they can be (and already are) run in the pre-cq. That means we should invent a separate suite for things that should run in VMTest.
,
May 23 2017
In principle, moving security tests into the pre-cq should work (say, suite:pre-cq) and should help with making the inner loop faster. Of course we will get a combinatorial explosion of new suites, all with a slightly different scope and meaning (VMTest/suite:smoke gets run in other places with a different meaning). I am scared of this. Also, tests in the VM do not have visibility on wmatrix and sponge, hence all tests in the VM need to run on physical hardware as well. For the security tests that means, at a minimum, they should also run in suite:bvt-perbuild for FYI. Overall there are so many stages to our build system and phases where tests can run (and components that we test now: OS, browser, Android) that it needs good documentation to stay consistent over time and not regress in a few weeks (silently losing coverage). The alternative of running everything everywhere FYI (bvt-perbuild) is way less fragile with regards to infra changes, but requires teams to continuously monitor the situation.
,
May 23 2017
I would enforce that, for our tests, everything in PreCQ is also in -perbuild. I think the actual feature request from c#52 is a suite that runs only in PreCQ, or alternatively a suite that runs only on VMs but that is different from smoke (smoke being akin to "is this image not irreparably broken").
,
May 25 2017
Re 55: I agree. Now for graphics tests, they can't run in VM/pre-cq, so all we can do is refactor a little to avoid the 40s penalty and cull (from 23 tests to 5): https://chromium-review.googlesource.com/#/c/510112/
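For context, "moving a test to bvt-perbuild" is essentially a control-file edit: each autotest control file declares its suite membership in an ATTRIBUTES line, so the change looks roughly like the before/after below (the convention is as I understand it; the specific values are illustrative).

# Before: the test gates the CQ.
ATTRIBUTES = "suite:bvt-cq"

# After: the test still runs on every build, but failures are watched by the
# owning team (e.g. via a rotation) instead of blocking the CQ.
ATTRIBUTES = "suite:bvt-perbuild"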
,
May 29 2017
Re 51:
> I am still waiting for the day that portage becomes fast.
Do we have any numbers backing this statement? As a Gentoo user, someone who happens to do some work with portage and ebuild internals, and someone who builds Chrome OS on my workstation quite often, I notice significant overhead mainly in the initial stage of the build (when all the dependencies are being calculated); all the rest is just the build systems of particular packages doing their work. On the other hand, I can see where the model we use with portage fails to work; for example, things could be done more incrementally instead of starting the build from scratch every time (at the cost of additional complexity and a possible point of failure), or the configure stage of autotools being awfully slow (cmake seems to be doing much better). Do we know what makes the Chrome build system fast? Maybe we could adopt some of the ideas for the Chrome OS build system as well?
,
May 29 2017
We should discuss this, but not on this issue. (Basically portage is a highly flexible recipe collection, but one that does not allow for incremental builds. Tarballing sources, symbols and binaries is the main/costly operation.)
,
Jun 12 2017
A bunch of coverage was moved around. Things got faster. There was much rejoicing.
,
Jun 13 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/292ee4862c1c2da93bf1a4ff8e87019d2ef9eb58 commit 292ee4862c1c2da93bf1a4ff8e87019d2ef9eb58 Author: Ilja H. Friedel <ihf@chromium.org> Date: Tue Jun 13 07:19:42 2017 Run fewer graphics tests in blocking part of CQ/BVT. TLDR: This reduces the number of graphics tests in BVT from 20 to 8 As autotest currently has a server overhead per client test of 90+ DUT holding seconds (crbug.com/726481), this saves more than 90s * (20-8) = 18 minutes in the locking suite (usually distributed over 4-7 DUTs; or still 3-4 minutes CQ cycle time). -- To make the Chrome OS commit queue work better we want to a) reduce builder cycle time b) reduce the number of possible flakes causing extra cycles We also want to make the Chrome gardener's life easier and impose less graphics tests to watch on the gardening rotation. Notice all tests in this change are very mature and well behaved, but the law of large numbers is against CQ runs. So the less valuable tests must go out of the critical loop and into a not-blocking suite. To not reduce the coverage frequency we move remaining tests to bvt-perbuild. For this to work the graphics team will need to watch regressions in a rotation and revert changes after failures are detected. We move several very important tests to bvt-inline, which right now finishes faster than bvt-cq and has greater hardware coverage. The currently important/blocking graphics tests are - graphics_GLBench.bvt - Chrome-like functionality, pixel accurate. - graphics_GLMark2.bvt - Exercises game like paths. - graphics_dEQP.bvt - gles2, gles3, gles31 and vulkan. - graphics_Drm.bvt - Functional/integration tests (not in this change will follow later). - graphics_Gbm - Verifies Mesa graphics buffer management. - graphics_Idle - Verifies graphics reaches low power state. - graphics_Idle.arc - Moved by Richard to bvt-arc (good place). - graphics_Sanity - Checks screenshot for corruption. Sum of remaining pure test times (without autotest overhead) on veyron_minnie R60-9574 are on the order of 3-4 minutes: bin/autotest tests/graphics_GLBench/control.bvt - real 0m45.225s bin/autotest tests/graphics_GLMark2/control.bvt - real 0m18.270s bin/autotest tests/graphics_dEQP/control.bvt - real 0m11.459s bin/autotest tests/graphics_Drm/control.bvt - real 0m55.893s bin/autotest tests/graphics_Gbm/control - real 0m12.223s While we are at it also get rid of JOB_RETRIES in bvt-perbuild. BUG= chromium:723898 , chromium:722474 TEST=CQ will test. CQ-DEPEND=CL:*390814 Change-Id: I41e33557e74757c579ff0e9c5b0d65c0e60d0e6d Reviewed-on: https://chromium-review.googlesource.com/510112 Commit-Ready: Ilja H. Friedel <ihf@chromium.org> Tested-by: Ilja H. 
Friedel <ihf@chromium.org> Reviewed-by: Stéphane Marchesin <marcheu@chromium.org> [modify] https://crrev.com/292ee4862c1c2da93bf1a4ff8e87019d2ef9eb58/client/site_tests/graphics_GLAPICheck/control [modify] https://crrev.com/292ee4862c1c2da93bf1a4ff8e87019d2ef9eb58/client/site_tests/graphics_dEQP/control.gles3.performance [modify] https://crrev.com/292ee4862c1c2da93bf1a4ff8e87019d2ef9eb58/client/site_tests/graphics_dEQP/control.vk-master.hasty [modify] https://crrev.com/292ee4862c1c2da93bf1a4ff8e87019d2ef9eb58/client/site_tests/graphics_Idle/control [modify] https://crrev.com/292ee4862c1c2da93bf1a4ff8e87019d2ef9eb58/client/site_tests/graphics_dEQP/control.gles3.accuracy [modify] https://crrev.com/292ee4862c1c2da93bf1a4ff8e87019d2ef9eb58/client/site_tests/graphics_dEQP/control.gles31.stress [modify] https://crrev.com/292ee4862c1c2da93bf1a4ff8e87019d2ef9eb58/client/site_tests/graphics_GpuReset/control [rename] https://crrev.com/292ee4862c1c2da93bf1a4ff8e87019d2ef9eb58/client/site_tests/graphics_SanAngeles/control.bvt [delete] https://crrev.com/536f23869b1b94e59916e806bcbee15c55a1f190/client/site_tests/graphics_dEQP/control.vk.info [delete] https://crrev.com/536f23869b1b94e59916e806bcbee15c55a1f190/client/site_tests/graphics_dEQP/control.vk.binding_model.hasty [delete] https://crrev.com/536f23869b1b94e59916e806bcbee15c55a1f190/client/site_tests/graphics_dEQP/control.vk.spirv_assembly.hasty [delete] https://crrev.com/536f23869b1b94e59916e806bcbee15c55a1f190/client/site_tests/graphics_dEQP/control.gles2.performance [rename] https://crrev.com/292ee4862c1c2da93bf1a4ff8e87019d2ef9eb58/client/site_tests/graphics_GLBench/control.bvt [modify] https://crrev.com/292ee4862c1c2da93bf1a4ff8e87019d2ef9eb58/client/site_tests/graphics_dEQP/control.vk-master.hasty.5 [delete] https://crrev.com/536f23869b1b94e59916e806bcbee15c55a1f190/client/site_tests/graphics_dEQP/control.gles31.info [delete] https://crrev.com/536f23869b1b94e59916e806bcbee15c55a1f190/client/site_tests/graphics_dEQP/control.gles3.info [delete] https://crrev.com/536f23869b1b94e59916e806bcbee15c55a1f190/client/site_tests/graphics_dEQP/control.gles2.accuracy [delete] https://crrev.com/536f23869b1b94e59916e806bcbee15c55a1f190/client/site_tests/graphics_dEQP/control.gles2.info [modify] https://crrev.com/292ee4862c1c2da93bf1a4ff8e87019d2ef9eb58/client/site_tests/graphics_PerfControl/control [delete] https://crrev.com/536f23869b1b94e59916e806bcbee15c55a1f190/client/site_tests/graphics_dEQP/control.gles2.capability [modify] https://crrev.com/292ee4862c1c2da93bf1a4ff8e87019d2ef9eb58/client/site_tests/graphics_dEQP/control.vk-master.hasty.8 [modify] https://crrev.com/292ee4862c1c2da93bf1a4ff8e87019d2ef9eb58/client/site_tests/graphics_dEQP/control.vk-master.hasty.9 [modify] https://crrev.com/292ee4862c1c2da93bf1a4ff8e87019d2ef9eb58/client/site_tests/graphics_dEQP/generate_controlfiles.py [modify] https://crrev.com/292ee4862c1c2da93bf1a4ff8e87019d2ef9eb58/client/site_tests/graphics_dEQP/control.vk-master.hasty.2 [modify] https://crrev.com/292ee4862c1c2da93bf1a4ff8e87019d2ef9eb58/client/site_tests/graphics_dEQP/control.vk-master.hasty.3 [modify] https://crrev.com/292ee4862c1c2da93bf1a4ff8e87019d2ef9eb58/client/site_tests/graphics_dEQP/control.vk-master.hasty.0 [modify] https://crrev.com/292ee4862c1c2da93bf1a4ff8e87019d2ef9eb58/client/site_tests/graphics_dEQP/control.vk-master.hasty.1 [modify] https://crrev.com/292ee4862c1c2da93bf1a4ff8e87019d2ef9eb58/client/site_tests/graphics_dEQP/control.vk-master.hasty.6 [modify] 
https://crrev.com/292ee4862c1c2da93bf1a4ff8e87019d2ef9eb58/client/site_tests/graphics_dEQP/control.vk-master.hasty.7 [modify] https://crrev.com/292ee4862c1c2da93bf1a4ff8e87019d2ef9eb58/client/site_tests/graphics_dEQP/control.vk-master.hasty.4 [modify] https://crrev.com/292ee4862c1c2da93bf1a4ff8e87019d2ef9eb58/client/site_tests/graphics_LibDRM/control [modify] https://crrev.com/292ee4862c1c2da93bf1a4ff8e87019d2ef9eb58/client/site_tests/graphics_dEQP/control.bvt [modify] https://crrev.com/292ee4862c1c2da93bf1a4ff8e87019d2ef9eb58/client/site_tests/graphics_Gbm/control [modify] https://crrev.com/292ee4862c1c2da93bf1a4ff8e87019d2ef9eb58/client/site_tests/graphics_dEQP/control.gles3.stress [delete] https://crrev.com/536f23869b1b94e59916e806bcbee15c55a1f190/client/site_tests/graphics_dEQP/control.vk.glsl.hasty [delete] https://crrev.com/536f23869b1b94e59916e806bcbee15c55a1f190/client/site_tests/graphics_dEQP/control.vk.api.hasty [modify] https://crrev.com/292ee4862c1c2da93bf1a4ff8e87019d2ef9eb58/client/site_tests/graphics_KernelMemory/control [modify] https://crrev.com/292ee4862c1c2da93bf1a4ff8e87019d2ef9eb58/client/site_tests/graphics_dEQP/control.gles2.stress [delete] https://crrev.com/536f23869b1b94e59916e806bcbee15c55a1f190/client/site_tests/graphics_dEQP/control.vk.api.smoke [rename] https://crrev.com/292ee4862c1c2da93bf1a4ff8e87019d2ef9eb58/client/site_tests/graphics_GLMark2/control.bvt [delete] https://crrev.com/536f23869b1b94e59916e806bcbee15c55a1f190/client/site_tests/graphics_dEQP/control.vk.pipeline.hasty
,
Jun 13 2017
The graphics team did its share to cut CQ runtime. I think other teams might still be interested in contributing to reducing CQ cycle time.