Cyan has an exceptionally high frequency of status "NOT_RUN"
Issue description
Here's a screenshot of suite crosbolt_perf_perbuild: https://screenshot.googleplex.com/zRHyEWQ0u5v
Corresponding link: https://stainless.corp.google.com/search?view=matrix&row=model&col=build&first_date=2018-01-26&last_date=2018-02-01&suite=crosbolt_perf_perbuild&exclude_cts=false&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=true
I checked the cautotest host list for cyan boards, and there doesn't seem to be a huge shortage of DUTs (around 60 instances).
,
Feb 1 2018
That seems sensible to me. If there isn't a better option or a reason not to, I'll do it tomorrow afternoon. +some lab-focused people who might know better
,
Feb 1 2018
I checked and it seems like there is only 1 instance of most boards in the performance pool, but Cyan is disproportionately absent in test runs.
,
Feb 1 2018
Currently, CrOS Infra doesn't manage the content of pool:performance, so a shortage or problem in the pool isn't an infra problem, per se. Also, last I knew, the performance pool typically only has/needs one DUT.
,
Feb 1 2018
Looking at the history of the performance DUT, you see this:
chromeos4-row12-rack11-host1
2018-02-01 08:46:13 OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/68136-reset/
2018-02-01 08:21:59 OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/67931-provision/
2018-02-01 08:15:37 -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/173868744-chromeos-test/
2018-02-01 08:14:57 OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/67866-reset/
2018-02-01 08:08:29 -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/173868742-chromeos-test/
2018-02-01 08:07:49 OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/67791-reset/
2018-02-01 08:04:31 -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/173868762-chromeos-test/
2018-02-01 08:03:55 OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/67747-reset/
2018-02-01 08:01:11 -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/173868766-chromeos-test/
2018-02-01 07:45:42 OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/67588-provision/
2018-02-01 07:42:36 -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/173767886-chromeos-test/
2018-02-01 07:26:03 OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/67414-provision/
2018-02-01 07:19:51 -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/173767818-chromeos-test/
2018-02-01 07:18:28 OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/67357-reset/
2018-02-01 07:12:02 -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/173767806-chromeos-test/
2018-02-01 06:52:49 OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/67191-provision/
2018-02-01 06:49:40 -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/173868751-chromeos-test/
2018-02-01 06:33:45 OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/67038-provision/
2018-02-01 06:31:24 -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/173739110-chromeos-test/
2018-02-01 06:18:35 OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/66948-provision/
2018-02-01 06:12:22 -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/173868743-chromeos-test/
2018-02-01 06:11:25 OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/66905-reset/
2018-02-01 06:08:43 -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/173868765-chromeos-test/
2018-02-01 05:51:01 OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/66705-provision/
2018-02-01 05:44:33 -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/173767938-chromeos-test/
2018-02-01 05:43:53 OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/66643-reset/
2018-02-01 05:37:28 -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/173767936-chromeos-test/
2018-02-01 05:21:03 OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/66491-provision/
[ ... ]
Note the repetition of provision tasks: That's not
normal. A well-scheduled pool runs many tests in between
provisioning.
Looking at two specific jobs:
http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=173767818
http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=173767886
Those jobs ran against cyan-release/R64-10176.65.0 and cyan-release/R65-10323.10.0,
respectively. So, it looks like runs against both R64 and R65 were scheduled at
about the same time. That caused provision thrashing, which made everything
run slow. When things run slow, they can time out, causing suite aborts.
The "NOT_RUN" status is just another word for "abort".
So, to explain this, we need to understand why cyan has this extra load.
Note: Checking the supply of the performance pool, every board has only
a single DUT. So, whatever is happening to cyan, it isn't because of the
DUT count.
Another note: Although the thrashing can be seen above, that only reflects
the last few hours of history. If you look over a longer time period, the
thrashing isn't terribly pronounced. So, although provision thrashing is a
likely explanation here, it may not be the only problem.
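The provision-to-test ratio noted above can be checked mechanically rather than by eye. This is a minimal sketch, assuming the host history is available as plain text in the three-column form shown (timestamp, status, log URL). The helper names `classify` and `count_tasks` are hypothetical and not part of any autotest tool; the sample lines are copied from the history excerpt.

```python
import re
from collections import Counter

# Assumed line layout: "<date> <time> <status> <log URL>", as in the excerpt.
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\S+) (\S+)$")


def classify(url):
    """Guess the task type from the tail of the log URL (hypothetical helper)."""
    if re.search(r"/\d+-provision/$", url):
        return "provision"
    if re.search(r"/\d+-reset/$", url):
        return "reset"
    if re.search(r"/results/\d+-chromeos-test/$", url):
        return "test"
    return "other"


def count_tasks(history_text):
    """Count history entries per task type, skipping lines that do not parse."""
    counts = Counter()
    for line in history_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            counts[classify(match.group(3))] += 1
    return counts


# Four consecutive entries copied from the history above.
SAMPLE = """\
2018-02-01 07:45:42 OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/67588-provision/
2018-02-01 07:42:36 -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/173767886-chromeos-test/
2018-02-01 07:26:03 OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/67414-provision/
2018-02-01 07:19:51 -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/173767818-chromeos-test/
"""

if __name__ == "__main__":
    counts = count_tasks(SAMPLE)
    print(dict(counts))
```

A count of provisions close to the count of tests over a window is the thrashing pattern; a healthy pool would show many tests per provision.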
,
Feb 5 2018
,
Feb 5 2018
I'm pretty sure that if you want to run these batches of cyan tests without any aborts, you need to add DUTs to cyan's pool:performance. I checked the scheduling logs: these tests never get scheduled, and after 16 hours they're aborted. Even for the cyan tests that did run successfully, the start time is usually 12 hours later than the queued time, which is not safe. On top of that, branches >= tot-2 are all scheduled for suite crosbolt_perf_perbuild, which means lots of builds are scheduled for crosbolt_perf_perbuild. That makes this SINGLE DUT super busy. Assigning back to @vsuley to decide how many DUTs he wants for this pool.
,
Feb 5 2018
I'm seeing that some bvt-cq, bvt-perbuild and bvt-inline tests are also getting scheduled on this DUT:
https://stainless.corp.google.com/search?exclude_retried=true&first_date=2018-01-30&master_builder_name=&builder_name_number=&shard=&exclude_acts=true&builder_name=&master_builder_name_number=&owner=&retry=&exclude_cts=false&exclude_non_production=true&hostname=chromeos4-row12-rack11-host1&board=&test=&exclude_not_run=false&build=%5ER66%5C-10373%5C.0%5C.0%24&status=GOOD&reason=&waterfall=&suite=&last_date=2018-02-05&exclude_non_release=false&exclude_au=true&model=%5Ecyan%24&view=list
However, the DUT does not seem to have the label "pool:bvt". I'm guessing this is the only DUT in the perf pool that is getting clobbered, because BVTs are getting scheduled on it when they shouldn't be. Assigning back to investigate why this DUT in pool:performance is having BVTs scheduled on it.
,
Feb 5 2018
Is this stainless query accurate? I randomly picked 2 tests from the list:
AFE, logs 174603768 2018-02-04 15:18:47 2018-02-04 16:39:08 1m9s bvt-perbuild cyan cyan R66-10373.0.0 desktopui_MusLogin
http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=174603768
It ran on chromeos4-row12-rack11-host17.
Another:
AFE, logs 174603464 2018-02-04 15:17:53 2018-02-04 15:59:50 56s bvt-cq cyan cyan R66-10373.0.0 hardware_Memtester.quick
http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=174603464
It also ran on chromeos4-row12-rack11-host17.
,
Feb 5 2018
,
Feb 5 2018
Re #9: Yep, I didn't add an end-of-line anchor to the hostname in the query, so it picked up hosts ending with '1' and '17'. Thanks for pointing that out. AFAIK, the performance pool only has one of each DUT to keep measurements constant, so I'm not sure adding more DUTs is the best option.
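To illustrate the hostname mix-up: a minimal sketch, assuming the Stainless hostname filter behaves like an unanchored regular-expression search (the `^...$` anchors in the query's `build` and `model` fields suggest this, but it is an assumption). The `matching` helper is hypothetical and uses Python's `re` module only to show the effect of the missing anchor.

```python
import re

# The two hosts that appeared in the query results.
HOSTS = [
    "chromeos4-row12-rack11-host1",
    "chromeos4-row12-rack11-host17",
]

# Pattern as written in the original query: no anchors.
UNANCHORED = re.compile(r"chromeos4-row12-rack11-host1")
# Pattern with start and end anchors, matching only the exact hostname.
ANCHORED = re.compile(r"^chromeos4-row12-rack11-host1$")


def matching(pattern, names):
    """Return the names that the pattern matches anywhere (search semantics)."""
    return [name for name in names if pattern.search(name)]


if __name__ == "__main__":
    print(matching(UNANCHORED, HOSTS))
    print(matching(ANCHORED, HOSTS))
```

Without the trailing `$`, "host1" is a prefix of "host17", so the BVT jobs from host17 were attributed to the performance DUT.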
,
Feb 22 2018
Comment 1 by vsu...@chromium.org
, Feb 1 2018