
Issue 759096

Starred by 1 user

Issue metadata

Status: Assigned
Owner:
Last visit > 30 days ago
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug




PFQ informational builders should have better HWTest coverage

Project Member Reported by steve...@chromium.org, Aug 25 2017

Issue description

Neither of the ARC++-enabled builders in the pfq-informational tree appears to be running HWTest:

https://uberchromegw.corp.google.com/i/chromeos.chrome/waterfall
* veyron_minnie-tot-chrome-pfq-informational
* cyan-tot-chrome-pfq-informational

This means that arc-bvt tests do not get run until the master-chromium-pfq, which is bad.

 
Cc: xiaoyinh@chromium.org
I think https://chromium-review.googlesource.com/#/c/chromiumos/chromite/+/636216/ should fix this. However, before it lands, it looks like we need to allocate a 'continuous' pool for these builders. Unfortunately, both of these boards are low on free DUTs in the suites pool, so we may have to take from some other pools.

bhthompson@ragnarok:/usr/local/ssd960gb/chromiumos/src/third_party/autotest/files/contrib$ atest host list -b board:cyan | ./count_labels -p
      2 audio_board
     11 bvt
      2 chameleon_audio_stable
      6 cq
      1 crosperf
     18 cts
      1 faft-test
      6 faft-test-au
      1 performance
      1 stress-wifi
      4 suites
      1 tablet_mode
      1 wificell
      1 wificell-pre-cq
bhthompson@ragnarok:/usr/local/ssd960gb/chromiumos/src/third_party/autotest/files/contrib$ atest host list -b board:veyron_minnie | ./count_labels -p
      1 av-flexible
     11 bvt
      1 chameleon
      7 cq
      2 crosperf
     18 cts
      4 faft-test-au
      1 performance
      6 suites
      1 wificasey_BRCM
      1 wificell
      1 wifichaos

Not sure how many we could take from the cts pool and still get enough coverage there?

Are there different boards we could select in order to spread
the load?

Looking at the existing continuous pool allocations, some boards
get by with just 6 DUTs:
    $ atest host list -b pool:continuous | count_labels -b
          6 caroline
         12 peach_pit
          6 reks
         12 tricky

I've got no objection to taking from the cts pool, but it's not
my call to make (and I can guess that the owner will object).

Comment 4 by pwang@chromium.org, Aug 25 2017

Cc: pwang@chromium.org

Comment 5 by ihf@chromium.org, Aug 25 2017

Cc: rohi...@chromium.org kenobi@chromium.org kinaba@chromium.org
I think ideally we should find new DUTs or new boards for this task. cyan is a no-go, minnie a maybe if we can't find something better (in general I expect fast ARM boards to be candidates for idle DUTs).

Details:
The cts workload is set up to scale from a few DUTs to many. But we need a few configs that actually run everything to cover regressions well.

cyan is continuously pegged and drops a little bit of work (not too much though):
https://viceroy.corp.google.com/chromeos/dut_utilization?board=cyan&pool=managed%3Acts&status=Running&topstreams=40&duration=8d&mdb_role=chrome-infra&refresh=-1

minnie is not pegged. It fulfills all the work with a few gaps for pauses
https://viceroy.corp.google.com/chromeos/dut_utilization?board=veyron_minnie&pool=managed%3Acts&status=Running&topstreams=40&duration=694792&mdb_role=chrome-infra&refresh=-1

We also have veyron_mighty as a backup
https://viceroy.corp.google.com/chromeos/dut_utilization?board=veyron_mighty&pool=managed%3Acts&status=Running&topstreams=40&duration=8d&mdb_role=chrome-infra&refresh=-1

So I think it wouldn't be impossible to free up 6 minnies, but I would expect it to degrade veyron CTS coverage a bit.
Owner: bhthompson@google.com
Status: Assigned (was: Untriaged)
If possible, it would be nice to use a board that matches one on the PFQ. I understand that mixing coverage has benefits, but it also causes confusion. It is much clearer to gardeners (most of whom have only partial or passing familiarity with the myriad board configurations) if the pfq-informational waterfall is a subset of the PFQ.

So, my vote is for minnie, at least for now.

I will respin the CL for just minnie for now, and we can go with that in the meantime.

Longer term if we want another builder on this waterfall I am not sure how exactly these get allocated (it does not appear to be done directly from chromite, so I am guessing something on the Chrome infra side instantiates these). 

Maybe we could put up a cave builder?

bhthompson@ragnarok:/usr/local/ssd960gb/chromiumos/src/third_party/autotest/files/contrib$ atest host list -b board:cave | ./count_labels -p
      6 bvt
      8 cq
      1 crosperf
     20 cts
      1 performance
     10 suites
      1 wificell


> [ ... ] I am not sure how exactly these get allocated [ ... ]

You need to file a bug to CrOS Infra, asking for the pool to be
created/resized.  Typically, the bug should be assigned to the
secondary deputy.

Instructions for how to do the resizing are here:
    https://sites.google.com/a/google.com/chromeos/for-team-members/infrastructure/chromeos-admin/creating-pools

Preference #1:
A board with many devices in pool:suites.

Preference #2:
A faster ARC++ ARM board. The rationale is that an ARM board needs to run less CTS (one ABI), and if it's faster than minnie, we may not experience the gap in CTS testing.

The next ARC++ version will have to run double the CTS tests so I would look for suites devices.
I know how to allocate the pools; what I meant is that I am not sure how the waterfall itself at https://uberchromegw.corp.google.com/i/chromeos.chrome/waterfall gets instantiated.

The configs in config_dump.json and chromeos_config.py don't appear to indicate which boards actually show up for the tot-chrome-pfq-informational configs here. I suspect it might work like the firmware/factory builders, but I have not dug into it.

Comment 11 by ihf@chromium.org, Aug 25 2017

Re 7: the cave pool is fully pegged. Let's start with minnie only. As for Intel coverage for continuous Chrome, we should look at the larger context of how we can spare DUTs. That said, we do have a lot of terra DUTs, and we could put some of these in PFQ and informational.
If there is an option to pick a board which is not in the PFQ, why not pick some other, faster ARM device, or one with more free DUTs in pool:suites?
I pulled 4 DUTs from cts and 2 DUTs from suites to make a 6 unit pool of continuous.

bhthompson@ragnarok:/usr/local/ssd960gb/chromiumos/src/third_party/autotest/files/contrib$ atest host list -b board:veyron_minnie | ./count_labels -p
      1 av-flexible
     11 bvt
      1 chameleon
      6 continuous
      7 cq
      2 crosperf
     14 cts
      4 faft-test-au
      1 performance
      4 suites
      1 wificasey_BRCM
      1 wificell
      1 wifichaos
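
For anyone double-checking the arithmetic of the move above, here is a minimal sketch. It is not an autotest tool; `move_duts` is a hypothetical helper, and the counts are copied from the veyron_minnie label tables in this thread (18 cts and 6 suites before; 14 cts, 4 suites, and 6 continuous after).

```python
# Pool sizes for veyron_minnie before the move, copied from the
# count_labels output earlier in this thread (only the relevant pools).
before = {"bvt": 11, "cq": 7, "cts": 18, "suites": 6}


def move_duts(pools, moves, target):
    """Return a new mapping with DUTs moved from source pools into `target`.

    `moves` maps a source pool name to the number of DUTs taken from it.
    This helper is hypothetical and only models the bookkeeping.
    """
    result = dict(pools)
    for source, count in moves.items():
        if result.get(source, 0) < count:
            raise ValueError(f"pool {source!r} has fewer than {count} DUTs")
        result[source] -= count
        result[target] = result.get(target, 0) + count
    return result


# 4 DUTs from cts plus 2 from suites make up the new continuous pool.
after = move_duts(before, {"cts": 4, "suites": 2}, "continuous")
print(after)
```

The total number of DUTs is unchanged by the move; only the pool labels differ.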

The CL to turn on veyron_minnie is now in the CQ.
Project Member

Comment 14 by bugdroid1@chromium.org, Aug 28 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/19e33cff15662e4a1cee89d16eacc79f339acd90

commit 19e33cff15662e4a1cee89d16eacc79f339acd90
Author: Bernie Thompson <bhthompson@google.com>
Date: Mon Aug 28 22:24:14 2017

Give HWTests to the minnie tot-chrome-pfq-informational builder

BUG=chromium:759096
TEST=None

Change-Id: I20be97896d837b7d383ebbef6c3ac572cdb56c34
Reviewed-on: https://chromium-review.googlesource.com/636216
Commit-Ready: Bernie Thompson <bhthompson@chromium.org>
Tested-by: Bernie Thompson <bhthompson@chromium.org>
Reviewed-by: Po-Hsien Wang <pwang@chromium.org>
Reviewed-by: Ilja H. Friedel <ihf@chromium.org>

[modify] https://crrev.com/19e33cff15662e4a1cee89d16eacc79f339acd90/cbuildbot/config_dump.json
[modify] https://crrev.com/19e33cff15662e4a1cee89d16eacc79f339acd90/cbuildbot/chromeos_config.py

And even catching failures! Cheers!

Comment 17 by ihf@chromium.org, Aug 30 2017

For the record, it looks like the continuous pool only needs about 3 ARM DUTs at current test load.
https://viceroy.corp.google.com/chromeos/dut_utilization?board=veyron_minnie&pool=managed%3Acontinuous&status=Running&topstreams=40&duration=8d&mdb_role=chrome-infra&refresh=-1

For Intel probably 4 DUTs should be enough. Somebody else already dissolved the caroline and reks pools yesterday (I guess as they were not used so far). I reduced the tricky size to 6 for now as it was grossly oversized.

That said I don't agree with dissolving caroline. In the past I was hoping to have at some point caroline coverage everywhere. Caroline is on the PFQ and it would be great to add it to the continuous/informational coverage as well. I think we should re-add it.
I would support adding caroline to chrome-pfq-informational since it is in chrome-pfq, assuming it makes sense to keep it in chrome-pfq.

Comment 19 by ihf@chromium.org, Aug 30 2017

Mhh, something is broken right now. The caroline continuous pool is still there, but the lab thinks the DUTs are no good. But I can ssh into each of them just fine:

./dut_status.py -b caroline -p continuous
hostname                       S   last checked         URL
chromeos6-row2-rack23-host11   ??  2017-08-28 12:40:21  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack23-host11/1477644-verify/
chromeos6-row2-rack23-host9    ??  2017-08-28 12:40:21  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack23-host9/1477645-verify/
chromeos6-row2-rack23-host15   ??  2017-08-28 12:40:21  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack23-host15/1477646-verify/
chromeos6-row2-rack23-host13   ??  2017-08-28 12:40:21  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack23-host13/1477647-verify/
chromeos6-row2-rack23-host17   ??  2017-08-28 12:40:21  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack23-host17/1477648-verify/
chromeos6-row2-rack21-host11   ??  2017-08-28 12:40:21  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack21-host11/1477649-verify

Comment 20 by ihf@chromium.org, Aug 31 2017

I manually reimaged and reverified the DUTs and caroline continuous pool is back. Bernie, could you please add caroline to informational coverage?

(cr) ((449289bf3...)) ihf@ql ~/trunk/src/third_party/autotest/files/site_utils $ ./dut_status.py -b caroline -p continuous
hostname                       S   last checked         URL
chromeos6-row2-rack23-host11   OK  2017-08-30 20:09:44  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack23-host11/1508704-repair/
chromeos6-row2-rack23-host9    OK  2017-08-30 20:08:13  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack23-host9/1508691-verify/
chromeos6-row2-rack23-host15   OK  2017-08-30 22:01:04  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack23-host15/1508925-verify/
chromeos6-row2-rack23-host13   OK  2017-08-30 23:16:02  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack23-host13/1508924-verify/
chromeos6-row2-rack23-host17   OK  2017-08-30 22:01:43  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack23-host17/1508922-verify/
chromeos6-row2-rack21-host11   OK  2017-08-30 22:01:54  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack21-host11/1508917-verify/

> I manually reimaged [ ... ]

Manually re-imaged?  That was overkill...

Regarding this:
chromeos6-row2-rack23-host11   ??  2017-08-28 12:40:21  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack23-host11/1477644-verify/

The '??' status means that the DUT was good at the cited
time, but the status is old, meaning the DUT hasn't run tests
for 24 hours or more.  Normally, the lab re-verifies DUTs before
that happens, but the system isn't foolproof.

In general, if you see the '??' status, the 'repair_hosts'
command will clean it up for you.  For the case above, this
would do the trick:
    $ dut-status -b caroline -p continuous -n | xargs repair_hosts

(N.B.  For a working DUT, "repair" and "verify" are equivalent; for
a broken DUT, "repair" will get results sooner.  IOW, "verify" is
worthless, and someday, when some brave soul volunteers to do the
work, we'll delete "verify").
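
To illustrate how the '??' status can be picked out of the table, here is a rough sketch that parses output shaped like the dut_status listings pasted above and prints the hostnames that would be candidates for repair_hosts. The column layout (hostname, status, date, time, URL) is assumed from those listings, and `parse_dut_status` and `stale_hosts` are hypothetical names, not part of the lab tooling.

```python
from datetime import datetime

# Two rows copied from the listings in this thread: one stale, one good.
SAMPLE = """\
hostname                       S   last checked         URL
chromeos6-row2-rack23-host11   ??  2017-08-28 12:40:21  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack23-host11/1477644-verify/
chromeos6-row2-rack23-host9    OK  2017-08-30 20:08:13  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row2-rack23-host9/1508691-verify/
"""


def parse_dut_status(text):
    """Parse the whitespace-separated status table into a list of dicts."""
    rows = []
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or malformed lines
        host, status, date, time = fields[:4]
        rows.append({
            "hostname": host,
            "status": status,
            "last_checked": datetime.strptime(
                f"{date} {time}", "%Y-%m-%d %H:%M:%S"),
            "url": fields[4] if len(fields) > 4 else "",
        })
    return rows


def stale_hosts(rows):
    """Return hostnames whose status is '??' (good, but not checked recently)."""
    return [row["hostname"] for row in rows if row["status"] == "??"]


# One hostname per line, in a form that could be piped to xargs.
for name in stale_hosts(parse_dut_status(SAMPLE)):
    print(name)
```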
Re:21 is there a site page for such useful info? ;-)
> Re:21 is there a site page for such useful info? ;-)

The lab team has an index of a variety of procedures used
in regular maintenance.  That's at go/chromeos-lab-admin.

On that page you'll find a table with an entry labeled
"Reverify DUTs", which points here:
    https://sites.google.com/a/google.com/chromeos/for-team-members/infrastructure/chromeos-admin/reverify

Thanks, that's helpful!
I think https://chrome-internal-review.googlesource.com/446195 might do what we want, I am not super familiar with this informational waterfall.

The CL replaces cyan with caroline, and caroline is already configured to have tests.

If we want cyan and caroline at the same time we need someone on the Chrome infra side to take a look, I am not sure how we are supposed to allocate new build slaves on this waterfall (prior CLs just seem to use them, but I am not sure how they are chosen).
Project Member

Comment 26 by bugdroid1@chromium.org, Sep 21 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chrome/tools/build/+/bfb3899f7bc97c732b68f7518d28b47db44c462d

commit bfb3899f7bc97c732b68f7518d28b47db44c462d
Author: Bernie Thompson <bhthompson@google.com>
Date: Thu Sep 21 17:22:28 2017

So, caroline-tot-chrome-pfq-informational is currently running bvt-inline, bvt-cq, bvt-arc, and chrome-informational.

However, caroline-chrome-pfq is not running any HWTests, which is pretty confusing and potentially problematic.

Is this due to resource constraints? Are there any plans to change this?

Comment 28 by ihf@chromium.org, Nov 10 2017

Machine availability is pretty tight and will not get better. While it is desirable to squeeze all coverage combinations in, machines are just not available for every combo.

atest host list -b board:caroline | ./count_labels -p
      8 bvt
      6 continuous
      8 cq
      1 crosperf
     18 cts
      1 groamer
      1 performance
      1 stress
      8 suites
      1 wificell

One could add 3 DUTs to bvt and then add hwtest to caroline in ApplyCustomOverrides()

      'caroline-chrome-pfq': {
          'hw_tests': hw_test_list.SharedPoolPFQ(),
      },
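
As a rough illustration of what an override entry like the one above does, here is a self-contained sketch that merges per-builder overrides into a config mapping. This is not chromite's actual code: `apply_custom_overrides`, `shared_pool_pfq`, and the plain-dict config are hypothetical stand-ins, and the real value returned by hw_test_list.SharedPoolPFQ() is not shown in this thread.

```python
def shared_pool_pfq():
    """Hypothetical stand-in for hw_test_list.SharedPoolPFQ()."""
    return ["<shared-pool PFQ HWTest suites>"]


def apply_custom_overrides(site_config, overrides):
    """Return a copy of site_config with per-builder overrides merged in."""
    merged = {name: dict(cfg) for name, cfg in site_config.items()}
    for name, override in overrides.items():
        if name not in merged:
            raise KeyError(f"unknown builder config: {name}")
        merged[name].update(override)
    return merged


# Before the override, the builder has no HWTests configured.
site_config = {"caroline-chrome-pfq": {"hw_tests": []}}
overrides = {"caroline-chrome-pfq": {"hw_tests": shared_pool_pfq()}}

updated = apply_custom_overrides(site_config, overrides)
print(updated["caroline-chrome-pfq"]["hw_tests"])
```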

--

sentry and cave are in the same SoC family as caroline, and they have a few more DUTs in the suites pool. So if you want caroline HWTest coverage, you may want to run HWTest on either sentry or cave instead. (I know, one extra builder.) cave in particular should have enough DUTs to provide Chrome PFQ hardware coverage.

atest host list -b board:cave | ./count_labels -p
      5 bvt
[...]
     12 suites
The tot-chrome-informational builders run hwtest against their own dedicated pool "continuous". This is by design so that they don't overwhelm bvt pool (these builders run much more frequently).

So, we could enlarge the bvt pool for caroline as we do for other chrome-pfq hwtest-builders as #27 suggests. Or pick a different board in the family to share with.
Ideally it would be much less confusing to gardeners (and anyone not directly involved in lab allocation) if all pfq-chrome (informational or primary) builders ran the same hwtest suite(s). Arguably there isn't much value in having board-specific PFQ builders that do not run HWTests.

Some of them might be required for producing prebuilts.

I agree it is confusing. jrbarnette@ has a background project to invent a tool to make this easier to understand.

Comment 32 by ihf@chromium.org, Nov 10 2017

Yes, the PFQ builders that don't run hardware tests generate the Chrome prebuilts. That is their actual purpose in life.

That said, the way to get more hardware coverage for the Chrome PFQ is to search for large suites pools in the lab. DUTs are hard to get these days, but if gardeners move fast and make some claims, they can get allocations on new deployments.

As for the confusion of gardeners, soon we will have unified builds, and then caroline and cave will be a wash, as they will run the same image.

That said there are boatloads of older boards (pre-CTS, especially Baytrail) that have huge idle suite pools that can be taken for expanding continuous and PFQ HWTesting.



"gardeners make some claims"??? We're doing our best to learn more about the infrastructure that supports us, but DUT allocation seems way beyond our purview :P

WRT prebuilts, hopefully documenting the reason for each is part of what jrbarnette@ is working on? e.g. why we have both 'caroline' and 'chell' on the PFQ?

I just noticed that _paladin_hwtest_assignments has been formatted as a nicely documented table (thanks Richard + Bernie!), maybe we can do something similar for chrome_pfq_important_boards (and chrome_informational_hwtest_boards).



Comment 34 by ihf@chromium.org, Nov 11 2017

"claims" as in gold rush, basically... Lab resource allocation is discrete and exclusive.

Here is a list of boards with 15 or more devices in the suites pool. Start picking your extended PFQ coverage...

atest host list -b pool:suites | ./count_labels -b
     17 auron_paine
     29 banjo
     21 buddy
     31 candy
     19 clapper
     32 coral
     41 gnawty
     28 guado
     19 heli
     20 kip
     33 leon
     22 monroe
     15 ninja
     18 nyan_big
     27 nyan_blaze
     17 nyan_kitty
     26 orco
     21 panther
     20 peach_pi
     17 peppy
     27 quawks
     25 rikku
     25 sand
     21 squawks
     25 swanky
     16 tidus
     20 tricky
     20 ultima
     17 veyron_mickey
     17 winky
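
For narrowing a list like this down, a small sketch that parses count_labels-style output ("count board" per line) and keeps boards at or above a chosen pool size. The sample is a subset of the rows above; `parse_counts` and `boards_with_at_least` are hypothetical helpers, and the threshold of 30 is only an example.

```python
# A subset of the pool:suites counts listed above.
SAMPLE = """\
     17 auron_paine
     29 banjo
     31 candy
     32 coral
     41 gnawty
     33 leon
     15 ninja
"""


def parse_counts(text):
    """Parse 'count label' lines into a {label: count} mapping."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit():
            counts[parts[1]] = int(parts[0])
    return counts


def boards_with_at_least(counts, minimum):
    """Return boards with at least `minimum` DUTs, largest pools first."""
    eligible = [board for board, n in counts.items() if n >= minimum]
    return sorted(eligible, key=lambda board: (-counts[board], board))


counts = parse_counts(SAMPLE)
print(boards_with_at_least(counts, 30))
```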
Components: Infra>Client>ChromeOS>CI
Components: -Infra>Client>ChromeOS
Labels: -Pri-1 Pri-3
No activity in 6 months; must not be a P1.
Owner: bhthompson@chromium.org