New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 757815 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Jan 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug



Sign in to add a comment

all lab inventory jobs failing because of KeyError 'chameleon'

Project Member Reported by pprabhu@chromium.org, Aug 22 2017

Issue description

From the weekly email about stable updates

Traceback (most recent call last):
   File "site_utils/stable_images/assign_stable_images.py", line 616, in
<module>
     main(sys.argv)
   File "site_utils/stable_images/assign_stable_images.py", line 598, in main
     lab_inventory.get_managed_boards(afe))
   File "/usr/local/autotest/site_utils/lab_inventory.py", line 1164,
in get_managed_boards
     return get_inventory(afe).get_managed_boards()
   File "/usr/local/autotest/site_utils/lab_inventory.py", line 1160,
in get_inventory
     return _LabInventory.create_inventory(afe, start_time, end_time)
   File "/usr/local/autotest/site_utils/lab_inventory.py", line 467, in
create_inventory
     return cls([create(host) for host in afehosts])
   File "/usr/local/autotest/site_utils/lab_inventory.py", line 485, in
__init__
     self[h.host_board].record_host(h)
   File "/usr/local/autotest/site_utils/lab_inventory.py", line 273, in
record_host
     self._pools[pool].record_host(host_history)
KeyError: 'chameleon'
 
Status: Started (was: Untriaged)
Summary: all lab inventory jobs failing because of KeyError 'chameleon' (was: stable update failing because of KeyError 'chameleon')
No board / dut inventory jobs ran for the last two days either.
So, I'm blind w.r.t. dut-inventories, and englab-sys-cros doesn't have the usual "please do this emails".

From logs on cautotest/, it is the same problem there.
We are entirely blind wrt to metrics / alerts for these somewhat critical jobs.
OK, so there's a bunch of DUTs claiming to be in pool:chameleon

pprabhu@pprabhu:/work/chromiumos/src/third_party/autotest/files$ dut-status -p chameleon |wc
     62     308    9450


This is not a managed pool, which makes lab_inventory to blow up. I don't see any recent CL suggesting why this is now a problem.
A quick look at the relevant modules gives me no clues why this ever worked so far.
This broke roughly at the time of the last push though: https://groups.google.com/a/google.com/forum/#!searchin/chromeos-infra-discuss/push$20to$20prod%7Csort:date/chromeos-infra-discuss/x_zdy_GO58Q/UIBlEV_lBwAJ

so, will try to bisect.
This repros even before the last push.
Something changed in the lab inventory itself.
Think I got it.
The problem is that https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/597508 added pool:performance to managed pools.

And someone ended up adding a DUT to pool:performance and pool:chameleon at the same time.
lab_inventory chocked on this.

chromeos4-row5-rack1-host1     Running        chromeos-server5.hot.corp.google.com    False                None       link           bluetooth, video_glitch_detection, lightsensor, chameleon:dp, ec:cros, storage:ssd, board:link, chameleon, cts_abi_x86, cts_abi_arm, hw_jpeg_acc_dec, hw_video_acc_enc_h264, internal_display, os:cros, power:battery, hw_video_acc_h264, webcam, servo, pool:chameleon, touchpad, sku:link_intel_i5_3427u_1800_4_core_2Gb, variant:link, usb_nic:asix, pool:performance, cros-version:link-release/R62-9864.0.0


Quick fix, remove that DUT from pool:chameleon
Real fix, make lab_inventory not choke on this and/or figure out if managed pool DUTs can also be in unmanaged pools at all. I _think_ the answer is no. If that is the case, we still don't want lab_inventory to choke on it.
Quick fix:

pprabhu@pprabhu:files$ atest label remove -m chromeos4-row5-rack1-host1 pool:chameleon
Removed from label 'pool:chameleon' host: 
        chromeos4-row5-rack1-host1


My local run got beyond the sticky point. Will kick off a the lab inventory job on cautotest manually now.
Owner: jrbarnette@chromium.org
Status: Assigned (was: Started)
Posted https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/627162 to improve error messaging.
Real fix is a little more complicated because server/lib/status_history.py arbitrarily promotes the first pool label it sees as the DUT pool.

It should be migrated to use labellib.py instead of doing custom stuff. But I ain't devoting time to this as the deputy.
Labels: -Pri-1 Pri-2
For the long run, the rule is that a DUT in a managed pool
isn't allowed to be in more than one pool.  We've been
allowing one exception, which is that a DUT in pool:chameleon
may also be in pool:suites (but not any other pool).

I'm prepared to say that Chameleon attached DUTs can't be
any managed pool, if that will help move things forward.

OK.  I've done a survey looking for DUTs that break the "only one
managed pool" rule.  Apparently all guado_moblab DUTs break the
rule.  Outside of chameleon DUTs, there are 7 additional DUTs
breaking the rule.

Long-term, I think we need to do two things:
  * Add some sort of sanity enforcement that refuses to change
    pool labels if it would break the rules.  I'm uncertain of
    the feasibility/viability of doing this, but I think we have
    to give it the ol' college try.
  * Change the lab inventory script to gather a list of any
    rulebreakers, and post the list in the e-mail we send to the
    deputies.

Project Member

Comment 11 by bugdroid1@chromium.org, Aug 24 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/754403d1182481d60d4e93d05fec75ffd50bd3ae

commit 754403d1182481d60d4e93d05fec75ffd50bd3ae
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Aug 24 00:01:22 2017

autotest: Improve error message when choking on unmanaged pool

If a DUT has two pool: labels, one managed and one unmanaged,
lab_inventory may choke on the DUT (depending on which pool
status_history blesses). In this case, we got no information whatsoever
about where the failure was.
This CL is a small improvement in the error message so that the next
person doesn't need to spend an hour chasing the error message.

BUG= chromium:757815 
TEST=None

Change-Id: Ia84eaa3d3a1dfaa9e7996d2adbfccfcb4e5f1e61
Reviewed-on: https://chromium-review.googlesource.com/627162
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/754403d1182481d60d4e93d05fec75ffd50bd3ae/site_utils/lab_inventory.py

Regarding chameleon: It's time to declare that chameleon attached DUTs
can't be in a managed pool.  Currently, only three hosts break the rule.

Cc: haddowk@chromium.org
OK.  Current plan to deal with this problem long-term:
 1) Change the inventory script to identify and report any managed DUT
    that's in more than one pool.
 2) Remove managed pool labels from the three chameleon DUTs that
    have them.
 3) Declare that "managed DUT" excludes moblab instances, since our
    moblab maintenance process is currently different.

I've made the following changes:

: jrbarnette ~; atest label remove -m chromeos4-row5-rack1-host5 pool:performance
Removed from label 'pool:performance' host: 
	chromeos4-row5-rack1-host5
: jrbarnette ~; atest label remove -m chromeos2-row10-rack8-host13 pool:suites pool:crosperf
Removed from label 'pool:suites', 'pool:crosperf' host: 
	chromeos2-row10-rack8-host13
: jrbarnette ~; atest label remove -m chromeos2-row10-rack8-host15 pool:suites
Removed from label 'pool:suites' host: 
	chromeos2-row10-rack8-host15
: jrbarnette ~; atest label remove -m chromeos2-row4-rack2-host10,chromeos2-row3-rack6-host19 pool:suites 
Removed from label 'pool:suites' hosts: 
	chromeos2-row3-rack6-host19, chromeos2-row4-rack2-host10
: jrbarnette ~; atest label remove -m chromeos2-row8-rack3-host2 pool:crosperf
Removed from label 'pool:crosperf' host: 
	chromeos2-row8-rack3-host2

At this point, except for moblab instances no managed DUT is in more
than one pool.  The issue with moblab instances will be fixed by
redefining "managed DUT" to explicitly exclude any guado_moblab instance.

Within minutes of the action above, a snappy DUT acquired a new
pool label; it's now in both 'suites' and 'performance'.

Additionally, the change to chromeos4-row5-rack1-host5 didn't stick,
and had to be repeated.

This could be from known problems with master/shard synchronization,
or there could be some other issue relating to the 'performance' pool.


When I get a second I will look at the performance pool balancer to see if I can find a bug.
Logging shows that the performance pool balance script has not removed or added any labels since Aug 3rd so I do not believe it can be responsible.
OK.  All of last night's inventory jobs ran successfully, so the
database is clean for the moment.

Two tasks remain:
  * Adjust the definition of "managed DUT" to exclude guado_moblab.
  * Add checking to the lab_inventory script to report managed DUTs
    with more than one pool label.

Cc: cywang@chromium.org
Grrr...  Something re-added pool:crosperf to chromeos2-row10-rack8-host13.
That caused this morning's inventory run to fail.

$ atest label remove -m chromeos2-row10-rack8-host13 pool:crosperf
Removed from label 'pool:crosperf' host: 
	chromeos2-row10-rack8-host13

If this sort of thing keeps happening, the next step will be to
remove crosperf from the managed pools until we can figure out why.

Cc: bccheng@chromium.org
Its no me - all my performance hacks were switched off last night and the server will be decommissioned soon.


Comment 22 Deleted

@#19, it is my script to select a 'Ready' DUT with the target sku from suites to crosperf if it detects there is no DUT with the sku in crosperf. Lucky or unlucky, it picks the same DUT chromeos2-row10-rack8-host13 again.
Another example, after you remove the chromeos2-row10-rack8-host13 again 6 hours ago, and this time it swapped in chromeos6-row2-rack12-host6 into the crosperf. I will manage to make sure the selected DUT will be labelled in one pool only. In the near future, I would like to work with you to move the monitoring script to inventory jobs. Sorry about that.
> I will manage to make sure the selected DUT will be labelled in one pool only

Wait.  No.  There needs to be more than that.

The DUTs in pool:chameleon are OFF LIMITS.  So are a variety of
other DUTs.  Those DUTs can't have their pool labels changed.

My script tries to select DUT from pool:suites only? I did not touch any DUT in pool:chameleon. I am not sure what you mean. Or are they both in pool:suites and pool:chameleon?
Cc: waihong@chromium.org
Cc: ka...@chromium.org rjahagir@chromium.org
+kalin@ and rjahagir@ who keep track of the DUTs with chameleons, to ensure the devices in the chameleon pools should not contain any non-chameleon pool labels
> [ ... ] Or are they both in pool:suites and pool:chameleon?

There are no longer DUTs in both pool:suites and pool:chameleon,
although, until recently, there were.

In any event, the rule to use is the rule from 'balance_pool':
"In exactly one pool."

Re #23, #24, #25.... I think it high time we simply disallow external teams to toggle pool labels via automated jobs.
We (chromeos-infra) team should own these automated management jobs. The impact of mistakes here is too big, and outside of the infra team, pretty much impossible to anticipate or respond to.

I've already contacted the folks for performance pool balancing to try to rewrite their scripts as part of our usual inventory management jobs.

Note that all these automated jobs will break when we move to skylab inventory, unless they're accounted for and migrated by the infra team (some time mid Q4)
I have some DUTs in wificells which use both pool:chameleon and pool:wificell (see chromeos9-row4-rack[7-8]-host[1-4]) for examples.
> I have some DUTs in wificells which use both pool:chameleon and pool:wificell [ ... ]

Neither chameleon nor wificell are managed pools, so those DUTs aren't
affected or involved.

These are the managed pools:
    bvt
    cq
    continuous
    cts
    arc-presubmit
    crosperf
    performance
    suites

Only DUTs in one of those pools are affected.

Another chameleon DUT has acquired performance pool label:
    chromeos2-row10-rack8-host13 pool:chameleon, pool:performance, pool:chameleon_audio_stable, pool:chameleon_hdmi_stable, pool:audio_board, pool:manual_run_chameleon_suites,

I'm going to go remove both 'performance' and 'crosperf' from the
managed pool labels until this gets sorted out.  If it were possible
to block external tools from manipulating pool assignments, I'd do
that instead.

Project Member

Comment 33 by bugdroid1@chromium.org, Sep 5 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/0a18de00b807e8b5f2b9111835fe1913d0b6ac8a

commit 0a18de00b807e8b5f2b9111835fe1913d0b6ac8a
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Tue Sep 05 21:15:07 2017

[autotest] Make the performance pools unmanaged.

Something external to Autotest is adding chameleon DUTs to the
'crosperf' and 'performance' pools.  This is causing lab inventory
scripts to fail, because that creates DUTs that are in both
"managed" and "unmanaged" pools, which the tools can't handle.

This removes the two performance pools from the managed pool list
until we can stop the improper relabeling.

BUG= chromium:757815 
TEST=None

Change-Id: I1fdcef327dad0dd15ebd546271e9a8457eb7d221
Reviewed-on: https://chromium-review.googlesource.com/650534
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/0a18de00b807e8b5f2b9111835fe1913d0b6ac8a/site_utils/suite_scheduler/constants.py

Project Member

Comment 34 by bugdroid1@chromium.org, Sep 6 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/eabcf39bd25c46239e8e9d12d058c48d59f2c26d

commit eabcf39bd25c46239e8e9d12d058c48d59f2c26d
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Wed Sep 06 23:54:29 2017

[autotest] Exclude moblab from lab inventory management.

The moblab instances in Stierlin Ct are handled by a separate
process that's separate from the standard lab inventory cron job
e-mail.  So, remove the board from consideration.

BUG= chromium:757815 
TEST=unit test for sanity

Change-Id: Ie327752c2b2980f541e091867e75ac135f870e83
Reviewed-on: https://chromium-review.googlesource.com/647316
Commit-Ready: Richard Barnette <jrbarnette@chromium.org>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/eabcf39bd25c46239e8e9d12d058c48d59f2c26d/site_utils/lab_inventory.py

I'm concerend about #34.
moblab is in the CQ. Lack of DUTs will cause CQ failures. We should exclude it from board_inventory but not DUT inventory. cros-infra-deputy needs to be aware of moblab health.
For #33, I would also add the check in my script to make sure we won't pull DUT with chameleon board into performance pools as well. Please let us know once we are clear to move forward as we might do balance_pool between performance pools and inventory one(pool:suite)? Thanks~
BTW, FYI, there is no DUT in performance pools(pool: performance and crosperf) with label of chameleon:

google3$ atest host list -b pool:performance | grep chameleon
google3$ atest host list -b pool:crosperf | grep chameleon


> BTW, FYI, there is no DUT in performance pools(pool: performance and crosperf) with label of chameleon:

That's true now because I removed the pool:crosperf label from the offending
DUT cited in c#32.

Another DUT has acquired a performance pool label inexplicably:
    chromeos6-row2-rack20-host7 pool:suites, pool:performance,

I'm going to remove 'pool:suites' label from the DUT for now.

All external services that change DUT labels (especially pool
labels) need to be turned off ASAP.

I found the RPC history showing what happened to chromeos6-row2-rack20-host7:

09/06 01:49:16.669 INFO | 172.30.210.13| modify_host:cywang [{'locked': True, 'id': 'chromeos6-row2-rack20-host7', 'lock_reason': 'migrate_to_performance'}]
09/06 01:49:27.532 INFO | 172.30.210.13| label_add_hosts:cywang [{'hosts': ['chromeos6-row2-rack20-host7'], 'id': 'pool:performance'}]
09/06 01:49:38.300 INFO | 172.30.210.13| label_remove_hosts:cywang [{'hosts': ['chromeos6-row2-rack20-host7'], 'id': 'pool:suites'}]
09/06 01:49:50.207 INFO | 172.30.210.13| modify_host:cywang [{'locked': False, 'id': 'chromeos6-row2-rack20-host7', 'lock_reason': ''}]
09/07 01:07:29.335 INFO | 172.30.210.13| label_add_hosts:cywang [{'hosts': ['chromeos6-row2-rack20-host7'], 'id': 'pool:suites'}]
09/07 01:07:30.109 INFO | 172.30.210.13| label_remove_hosts:cywang [{'hosts': ['chromeos6-row2-rack20-host7'], 'id': 'pool:crosperf'}]

Initially, the host was removed from suites, and added to performance.
Then, the host was added back to suites, and removed from crosperf.
Removal from crosperf would have failed, since the label wasn't
present, leaving the DUT in both suites and performance.

Ah, that's my bug which hardcode the pool name(crosperf) when it tried to return a 'Repair Failed' DUT back to pool:performance. Apologize for causing this problem.


The DUT originally added into pool:performance, then the script detected it was 'Repair Failed' @Thu Sep  7 16:07:24 CST 2017, so it tried to return it to pool:suites and removed itself from pool:performance, however, the script incorrectly tried to remove it from 'pool:crosperf'. That's why the DUT ends up in two pools.

Here is my log for the DUT - chromeos6-row2-rack20-host7:
===
[Move one DUT(wizpig - wizpig_intel_celeron_n3060_2Gb) from suites to pool:performance]
===> Migrate chromeos6-row2-rack20-host7 from suites to pool:performance...
Locked host: 
	chromeos6-row2-rack20-host7
Added to label 'pool:performance' host: 
	chromeos6-row2-rack20-host7
Removed from label 'pool:suites' host: 
	chromeos6-row2-rack20-host7
Unlocked host: 
	chromeos6-row2-rack20-host7
=================================================
========= Thu Sep  7 16:07:24 CST 2017 ==========
====== Check pool:performance ==========
[Check failed DUT(s) in pool:performance]
chromeos6-row2-rack20-host7   Repair Failed  chromeos-server104.mtv.corp.google.com  False                None       wizpig         board:wizpig, accel:cros-ec, arc, os:cros, power:battery, ec:cros, servo, cts_abi_x86, cts_abi_arm, storage:mmc, webcam, internal_display, audio_loopback_dongle, usb_nic:r8152, sku:wizpig_intel_celeron_n3060_2Gb, variant:wizpig, touchpad, phase:PVT, pool:performance, cros-version:wizpig-release/R63-9916.0.0
Retire  from pool:crosperf
Added to label 'pool:suites' host: 
	chromeos6-row2-rack20-host7
Removed from label 'pool:crosperf' host: 
	chromeos6-row2-rack20-host7

=== 
Project Member

Comment 42 by bugdroid1@chromium.org, Oct 25 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/99473f6963e30282bed8ef4538af7d23be3fba6c

commit 99473f6963e30282bed8ef4538af7d23be3fba6c
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Wed Oct 25 05:47:53 2017

[autotest] Exclude hosts in more than one pool from lab inventory.

The lab inventory script fails if a DUT is assigned to a non-managed
pool.  The query that constructs the inventory selects only hosts
that are in managed pools, but failure may still occur if a DUT is
in both a managed and non-managed pool.

The lab inventory script isn't meant to allow DUTs to be in more
than one pool, so this excludes those DUTs from inventory.

BUG= chromium:757815 
TEST=run the script locally

Change-Id: Ia5f5ed762b58e9bd6ce3014c07ee7cc8fd39887c
Reviewed-on: https://chromium-review.googlesource.com/724216
Commit-Ready: Richard Barnette <jrbarnette@chromium.org>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/99473f6963e30282bed8ef4538af7d23be3fba6c/site_utils/lab_inventory.py

Status: Fixed (was: Assigned)
OK.  We still may have problems with outside code that breaks the
inventory by incorrectly adding/removing pool labels.  But, that
ain't this bug.  _This_ bug is fixed.

Sign in to add a comment