
Issue 807819

Starred by 1 user

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




Cyan has an exceptionally high frequency of status "NOT_RUN"

Project Member Reported by vsu...@chromium.org, Feb 1 2018

Issue description

My bad, forgot to narrow down to pool:performance. There is exactly one Cyan in pool:performance among a total of 81 DUTs. Do we need to add more? Is it expected that one DUT wouldn't be able to handle the suite's workload? 

Comment 2 by jkop@chromium.org, Feb 1 2018

Cc: jrbarnette@chromium.org shuqianz@chromium.org
Status: Assigned (was: Untriaged)
That seems sensible to me. If there isn't a better option or a reason why not, I'll do it tomorrow afternoon.

+some lab-focused people who might know better
I checked, and it seems like there is only 1 instance of most boards in the performance pool, yet Cyan is disproportionately absent from test runs.
Currently, CrOS Infra doesn't manage the content of pool:performance,
so a shortage or problem in the pool isn't an infra problem, per se.

Also, last I knew, the performance pool typically only has/needs one DUT.

Looking at the history of the performance DUT, you see this:

chromeos4-row12-rack11-host1
    2018-02-01 08:46:13  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/68136-reset/
    2018-02-01 08:21:59  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/67931-provision/
    2018-02-01 08:15:37  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/173868744-chromeos-test/
    2018-02-01 08:14:57  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/67866-reset/
    2018-02-01 08:08:29  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/173868742-chromeos-test/
    2018-02-01 08:07:49  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/67791-reset/
    2018-02-01 08:04:31  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/173868762-chromeos-test/
    2018-02-01 08:03:55  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/67747-reset/
    2018-02-01 08:01:11  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/173868766-chromeos-test/
    2018-02-01 07:45:42  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/67588-provision/
    2018-02-01 07:42:36  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/173767886-chromeos-test/
    2018-02-01 07:26:03  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/67414-provision/
    2018-02-01 07:19:51  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/173767818-chromeos-test/
    2018-02-01 07:18:28  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/67357-reset/
    2018-02-01 07:12:02  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/173767806-chromeos-test/
    2018-02-01 06:52:49  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/67191-provision/
    2018-02-01 06:49:40  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/173868751-chromeos-test/
    2018-02-01 06:33:45  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/67038-provision/
    2018-02-01 06:31:24  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/173739110-chromeos-test/
    2018-02-01 06:18:35  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/66948-provision/
    2018-02-01 06:12:22  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/173868743-chromeos-test/
    2018-02-01 06:11:25  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/66905-reset/
    2018-02-01 06:08:43  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/173868765-chromeos-test/
    2018-02-01 05:51:01  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/66705-provision/
    2018-02-01 05:44:33  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/173767938-chromeos-test/
    2018-02-01 05:43:53  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/66643-reset/
    2018-02-01 05:37:28  -- http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/173767936-chromeos-test/
    2018-02-01 05:21:03  OK http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row12-rack11-host1/66491-provision/
[ ... ]

Note the repetition of provision tasks: that's not
normal.  A well-scheduled pool runs many tests between
provisions.

Looking at two specific jobs:
    http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=173767818
    http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=173767886

Those jobs ran against cyan-release/R64-10176.65.0 and cyan-release/R65-10323.10.0,
respectively.  So, it looks like runs against both R64 and R65 were scheduled at
about the same time.  That caused provision thrashing, which made everything
run slow.  When things run slow, they can time out, causing suite aborts.
The "NOT_RUN" status is just another word for "abort".
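The thrashing in the history excerpt above can be quantified directly. This is a minimal sketch (not a real Autotest API); the task sequence below is transcribed from the log excerpt, newest first, keeping only each entry's task type:

```python
# Quantify provision thrashing from the DUT's task history above.
# A healthy pool runs many tests between provisions; a ratio near
# 1 test per provision indicates thrashing.
history = [
    "reset", "provision", "test", "reset", "test", "reset", "test",
    "reset", "test", "provision", "test", "provision", "test", "reset",
    "test", "provision", "test", "provision", "test", "provision",
    "test", "reset", "test", "provision", "test", "reset", "test",
    "provision",
]

provisions = history.count("provision")
tests = history.count("test")
print(f"{tests} tests / {provisions} provisions = "
      f"{tests / provisions:.1f} tests per provision")
# → 13 tests / 8 provisions = 1.6 tests per provision
```

A ratio this close to 1:1, with provisions alternating between two release branches, is what makes everything run slow and time out.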

So, to explain this, we need to understand why cyan has this extra load.

Note:  Checking the supply of the performance pool, every board has only
a single DUT.  So, whatever is happening to cyan, it isn't because of the
DUT count.

Another note:  Although the thrashing can be seen above, that only reflects
the last few hours of history.  If you look over a longer time period, the
thrashing isn't terribly pronounced.  So, although provision thrashing is a
likely explanation here, it may not be the only problem.

Comment 6 by jkop@chromium.org, Feb 5 2018

Cc: -shuqianz@chromium.org jkop@chromium.org
Owner: xixuan@chromium.org
Owner: vsu...@chromium.org
I'm pretty sure that if you want to execute these batches of cyan tests without any aborts, you need to extend pool:performance for cyan.

I checked the scheduling logs: these tests never get scheduled, and after 16 hours they're aborted.

Even for the cyan tests that do execute successfully, the start time is usually 12 hours after the queued time, which is not safe.

Also, all branches >= ToT-2 are scheduled for suite crosbolt_perf_perbuild, which means there are lots of builds scheduled for that suite. That makes this SINGLE DUT super busy.

Assigning back to @vsuley to decide how many DUTs he wants for this pool.
Owner: xixuan@chromium.org
I'm seeing that some bvt-cq, bvt-perbuild and bvt-inline tests are also getting scheduled on this DUT: https://stainless.corp.google.com/search?exclude_retried=true&first_date=2018-01-30&master_builder_name=&builder_name_number=&shard=&exclude_acts=true&builder_name=&master_builder_name_number=&owner=&retry=&exclude_cts=false&exclude_non_production=true&hostname=chromeos4-row12-rack11-host1&board=&test=&exclude_not_run=false&build=%5ER66%5C-10373%5C.0%5C.0%24&status=GOOD&reason=&waterfall=&suite=&last_date=2018-02-05&exclude_non_release=false&exclude_au=true&model=%5Ecyan%24&view=list

However, the DUT does not seem to have the label: "pool:bvt". I'm guessing this is the only DUT in perf pool that is getting clobbered because BVTs are getting scheduled on it when they shouldn't. 

Assigning back to investigate why this DUT in pool:performance is having BVTs scheduled on it.
Owner: vsu...@chromium.org
Is this stainless query accurate?

I randomly picked 2 tests from the list:
AFE, logs	174603768	2018-02-04 15:18:47	2018-02-04 16:39:08	1m9s	bvt-perbuild	cyan	cyan	R66-10373.0.0	desktopui_MusLogin

http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=174603768
It's run on chromeos4-row12-rack11-host17.

Another: AFE, logs	174603464	2018-02-04 15:17:53	2018-02-04 15:59:50	56s	bvt-cq	cyan	cyan	R66-10373.0.0	hardware_Memtester.quick

http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=174603464
It's also run on chromeos4-row12-rack11-host17.



Comment 10 by jkop@chromium.org, Feb 5 2018

Cc: -jkop@chromium.org
Re #9; Yep, I didn't add an end-of-line anchor to the query, so it picked up hosts ending with '1' and '17'. Thanks for pointing that out.
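The query mistake described here is easy to reproduce. A minimal sketch, assuming the stainless hostname field is regex-matched the way the model field is (the query above uses %5Ecyan%24, i.e. ^cyan$, for the model but leaves the hostname unanchored):

```python
import re

hosts = ["chromeos4-row12-rack11-host1", "chromeos4-row12-rack11-host17"]

# Without a $ anchor, "host1" is a substring of "host17" too.
unanchored = [h for h in hosts if re.search(r"host1", h)]
# With the anchor, only the intended DUT matches.
anchored = [h for h in hosts if re.search(r"host1$", h)]

print(unanchored)  # both hosts
print(anchored)    # ['chromeos4-row12-rack11-host1']
```

This is why the BVT results attributed to the performance DUT actually came from host17.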

The performance pool only has one of each DUT to keep measurements consistent, AFAIK, so I'm not sure adding more DUTs is the best option.
Cc: hiroh@chromium.org
