all lab inventory jobs failing because of KeyError 'chameleon'
Issue description
From the weekly email about stable updates
Traceback (most recent call last):
  File "site_utils/stable_images/assign_stable_images.py", line 616, in <module>
    main(sys.argv)
  File "site_utils/stable_images/assign_stable_images.py", line 598, in main
    lab_inventory.get_managed_boards(afe))
  File "/usr/local/autotest/site_utils/lab_inventory.py", line 1164, in get_managed_boards
    return get_inventory(afe).get_managed_boards()
  File "/usr/local/autotest/site_utils/lab_inventory.py", line 1160, in get_inventory
    return _LabInventory.create_inventory(afe, start_time, end_time)
  File "/usr/local/autotest/site_utils/lab_inventory.py", line 467, in create_inventory
    return cls([create(host) for host in afehosts])
  File "/usr/local/autotest/site_utils/lab_inventory.py", line 485, in __init__
    self[h.host_board].record_host(h)
  File "/usr/local/autotest/site_utils/lab_inventory.py", line 273, in record_host
    self._pools[pool].record_host(host_history)
KeyError: 'chameleon'
,
Aug 22 2017
OK, so there's a bunch of DUTs claiming to be in pool:chameleon
pprabhu@pprabhu:/work/chromiumos/src/third_party/autotest/files$ dut-status -p chameleon |wc
62 308 9450
This is not a managed pool, which makes lab_inventory blow up. I don't see any recent CL suggesting why this is suddenly a problem.
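For reference, the shape of the failure is easy to reproduce in isolation: the inventory keeps per-board buckets keyed only by the managed pools, so indexing with an unmanaged pool name raises KeyError. A minimal sketch, with the class and names made up rather than taken from lab_inventory:

    # Illustrative stand-in for the lab_inventory bookkeeping, not the real class.
    MANAGED_POOLS = ('bvt', 'cq', 'continuous', 'cts', 'arc-presubmit',
                     'crosperf', 'performance', 'suites')

    class ManagedPoolCounts(object):
        def __init__(self):
            # One bucket per *managed* pool, and nothing else.
            self._pools = {pool: [] for pool in MANAGED_POOLS}

        def record_host(self, pool, hostname):
            # Raises KeyError if `pool` is unmanaged, e.g. 'chameleon'.
            self._pools[pool].append(hostname)

    counts = ManagedPoolCounts()
    counts.record_host('suites', 'example-host-1')                  # fine
    counts.record_host('chameleon', 'chromeos4-row5-rack1-host1')   # KeyError: 'chameleon'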
,
Aug 22 2017
A quick look at the relevant modules gives me no clue why this ever worked before. It broke roughly at the time of the last push, though: https://groups.google.com/a/google.com/forum/#!searchin/chromeos-infra-discuss/push$20to$20prod%7Csort:date/chromeos-infra-discuss/x_zdy_GO58Q/UIBlEV_lBwAJ so I'll try to bisect.
,
Aug 22 2017
This repros even before the last push. Something changed in the lab inventory itself.
,
Aug 22 2017
Think I got it. The problem is that https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/597508 added pool:performance to the managed pools, and someone ended up adding a DUT to both pool:performance and pool:chameleon. lab_inventory choked on this:

chromeos4-row5-rack1-host1 Running chromeos-server5.hot.corp.google.com False None link
bluetooth, video_glitch_detection, lightsensor, chameleon:dp, ec:cros, storage:ssd, board:link, chameleon, cts_abi_x86, cts_abi_arm, hw_jpeg_acc_dec, hw_video_acc_enc_h264, internal_display, os:cros, power:battery, hw_video_acc_h264, webcam, servo, pool:chameleon, touchpad, sku:link_intel_i5_3427u_1800_4_core_2Gb, variant:link, usb_nic:asix, pool:performance, cros-version:link-release/R62-9864.0.0

Quick fix: remove that DUT from pool:chameleon.
Real fix: make lab_inventory not choke on this, and/or figure out whether managed-pool DUTs can be in unmanaged pools at all. I _think_ the answer is no. If that is the case, we still don't want lab_inventory to choke on it.
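Pulling the pool: labels out of that host's label list makes the conflict obvious (throwaway snippet, abbreviated label list):

    labels = ['bluetooth', 'board:link', 'chameleon:dp', 'chameleon',
              'pool:chameleon', 'pool:performance',
              'cros-version:link-release/R62-9864.0.0']
    # Keep only the pool: labels and strip the prefix.
    pools = [l.split(':', 1)[1] for l in labels if l.startswith('pool:')]
    print(pools)  # ['chameleon', 'performance'], i.e. one unmanaged, one managed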
,
Aug 22 2017
Quick fix:
pprabhu@pprabhu:files$ atest label remove -m chromeos4-row5-rack1-host1 pool:chameleon
Removed from label 'pool:chameleon' host:
chromeos4-row5-rack1-host1
My local run got past the sticking point. I'll kick off the lab inventory job on cautotest manually now.
,
Aug 22 2017
Posted https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/627162 to improve the error message. The real fix is a little more complicated, because server/lib/status_history.py arbitrarily promotes the first pool label it sees to be the DUT's pool. It should be migrated to use labellib.py instead of doing its own custom parsing. But I ain't devoting time to this as the deputy.
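For context, the status_history behavior is effectively "first pool: label wins" (paraphrased below, not the actual code), which is how an unmanaged pool can get blessed as the DUT's pool. A labellib-style approach would parse all pool labels and let the caller decide. Both helpers here are hypothetical sketches:

    def host_pool_first_match(labels):
        # Paraphrase of the problematic behavior: whichever pool: label
        # happens to come first becomes the DUT's pool.
        for label in labels:
            if label.startswith('pool:'):
                return label[len('pool:'):]
        return None

    def host_pools(labels):
        # A structured alternative: return every pool the host is in, so the
        # caller can prefer a managed pool or flag hosts with more than one.
        return [l.split(':', 1)[1] for l in labels if l.startswith('pool:')]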
,
Aug 23 2017
For the long run, the rule is that a DUT in a managed pool isn't allowed to be in more than one pool. We've been allowing one exception, which is that a DUT in pool:chameleon may also be in pool:suites (but not any other pool). I'm prepared to say that chameleon-attached DUTs can't be in any managed pool, if that will help move things forward.
,
Aug 23 2017
OK. I've done a survey looking for DUTs that break the "only one
managed pool" rule. Apparently all guado_moblab DUTs break the
rule. Outside of chameleon DUTs, there are 7 additional DUTs
breaking the rule.
Long-term, I think we need to do two things:
* Add some sort of sanity enforcement that refuses to change
pool labels if it would break the rules. I'm uncertain of
the feasibility/viability of doing this, but I think we have
to give it the ol' college try.
* Change the lab inventory script to gather a list of any
rulebreakers, and post the list in the e-mail we send to the
deputies (a sketch of such a check follows below).
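A rough sketch of the rulebreaker check mentioned in the second bullet, assuming hosts are available as (hostname, labels) pairs; the constant and function names are illustrative, not lab_inventory code:

    MANAGED_POOLS = frozenset(['bvt', 'cq', 'continuous', 'cts', 'arc-presubmit',
                               'crosperf', 'performance', 'suites'])

    def find_pool_rulebreakers(hosts):
        """Return (hostname, pools) for hosts violating the one-pool rule.

        A host breaks the rule if it carries a managed pool label together
        with any other pool label.
        """
        rulebreakers = []
        for hostname, labels in hosts:
            pools = set(l.split(':', 1)[1] for l in labels
                        if l.startswith('pool:'))
            if pools & MANAGED_POOLS and len(pools) > 1:
                rulebreakers.append((hostname, sorted(pools)))
        return rulebreakers

    # The inventory e-mail could then list one offender per line, e.g.
    #   chromeos4-row5-rack1-host1: chameleon, performance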
,
Aug 24 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/754403d1182481d60d4e93d05fec75ffd50bd3ae

commit 754403d1182481d60d4e93d05fec75ffd50bd3ae
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Aug 24 00:01:22 2017

autotest: Improve error message when choking on unmanaged pool

If a DUT has two pool: labels, one managed and one unmanaged,
lab_inventory may choke on the DUT (depending on which pool
status_history blesses). In this case, we got no information
whatsoever about where the failure was.

This CL is a small improvement in the error message so that the next
person doesn't need to spend an hour chasing the error message.

BUG=chromium:757815
TEST=None

Change-Id: Ia84eaa3d3a1dfaa9e7996d2adbfccfcb4e5f1e61
Reviewed-on: https://chromium-review.googlesource.com/627162
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/754403d1182481d60d4e93d05fec75ffd50bd3ae/site_utils/lab_inventory.py
,
Aug 25 2017
Regarding chameleon: it's time to declare that chameleon-attached DUTs can't be in a managed pool. Currently, only three hosts break the rule.
,
Aug 25 2017
OK. Current plan to deal with this problem long-term:
1) Change the inventory script to identify and report any managed DUT
that's in more than one pool.
2) Remove managed pool labels from the three chameleon DUTs that
have them.
3) Declare that "managed DUT" excludes moblab instances, since our
moblab maintenance process is currently different.
,
Aug 28 2017
I've made the following changes:

: jrbarnette ~; atest label remove -m chromeos4-row5-rack1-host5 pool:performance
Removed from label 'pool:performance' host:
chromeos4-row5-rack1-host5

: jrbarnette ~; atest label remove -m chromeos2-row10-rack8-host13 pool:suites pool:crosperf
Removed from label 'pool:suites', 'pool:crosperf' host:
chromeos2-row10-rack8-host13

: jrbarnette ~; atest label remove -m chromeos2-row10-rack8-host15 pool:suites
Removed from label 'pool:suites' host:
chromeos2-row10-rack8-host15

: jrbarnette ~; atest label remove -m chromeos2-row4-rack2-host10,chromeos2-row3-rack6-host19 pool:suites
Removed from label 'pool:suites' hosts:
chromeos2-row3-rack6-host19, chromeos2-row4-rack2-host10

: jrbarnette ~; atest label remove -m chromeos2-row8-rack3-host2 pool:crosperf
Removed from label 'pool:crosperf' host:
chromeos2-row8-rack3-host2

At this point, except for moblab instances, no managed DUT is in more than one pool. The issue with moblab instances will be fixed by redefining "managed DUT" to explicitly exclude any guado_moblab instance.
,
Aug 28 2017
Within minutes of the action above, a snappy DUT acquired a new pool label; it's now in both 'suites' and 'performance'. Additionally, the change to chromeos4-row5-rack1-host5 didn't stick, and had to be repeated. This could be from known problems with master/shard synchronization, or there could be some other issue relating to the 'performance' pool.
,
Aug 28 2017
When I get a second I will look at the performance pool balancer to see if I can find a bug.
,
Aug 28 2017
Logging shows that the performance pool balance script has not removed or added any labels since Aug 3rd so I do not believe it can be responsible.
,
Aug 29 2017
OK. All of last night's inventory jobs ran successfully, so the
database is clean for the moment.
Two tasks remain:
* Adjust the definition of "managed DUT" to exclude guado_moblab (a sketch of this exclusion follows below).
* Add checking to the lab_inventory script to report managed DUTs
with more than one pool label.
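A minimal sketch of what excluding guado_moblab from the "managed DUT" definition might look like when building the inventory; the names and the filtering point are assumptions, not the actual lab_inventory code:

    EXCLUDED_BOARDS = frozenset(['guado_moblab'])

    def host_board(labels):
        """Return the value of the host's board: label, or None."""
        for label in labels:
            if label.startswith('board:'):
                return label[len('board:'):]
        return None

    def managed_hosts(hosts):
        """Drop hosts whose board is excluded from inventory management."""
        return [(hostname, labels) for hostname, labels in hosts
                if host_board(labels) not in EXCLUDED_BOARDS]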
,
Aug 30 2017
Grrr... Something re-added pool:crosperf to chromeos2-row10-rack8-host13. That caused this morning's inventory run to fail.
,
Aug 30 2017
$ atest label remove -m chromeos2-row10-rack8-host13 pool:crosperf
Removed from label 'pool:crosperf' host:
chromeos2-row10-rack8-host13

If this sort of thing keeps happening, the next step will be to remove crosperf from the managed pools until we can figure out why.
,
Aug 30 2017
It's not me: all my performance hacks were switched off last night, and the server will be decommissioned soon.
,
Aug 31 2017
@#19: it's my script, which moves a 'Ready' DUT with the target sku from pool:suites to pool:crosperf when it detects that there is no DUT with that sku in crosperf. Luckily or unluckily, it picked the same DUT, chromeos2-row10-rack8-host13, again. As another example, after you removed chromeos2-row10-rack8-host13 again 6 hours ago, it swapped chromeos6-row2-rack12-host6 into crosperf. I will make sure the selected DUT is labelled in one pool only. In the near future, I would like to work with you to move the monitoring script into the inventory jobs. Sorry about that.
,
Aug 31 2017
> I will make sure the selected DUT is labelled in one pool only

Wait. No. There needs to be more than that. The DUTs in pool:chameleon are OFF LIMITS. So are a variety of other DUTs. Those DUTs can't have their pool labels changed.
,
Aug 31 2017
My script only selects DUTs from pool:suites; I did not touch any DUT in pool:chameleon, so I am not sure what you mean. Or are some DUTs in both pool:suites and pool:chameleon?
,
Aug 31 2017
+kalin@ and rjahagir@, who keep track of the DUTs with chameleons, to ensure that devices in the chameleon pools do not carry any non-chameleon pool labels.
,
Aug 31 2017
> [ ... ] Or are some DUTs in both pool:suites and pool:chameleon?

There are no longer DUTs in both pool:suites and pool:chameleon, although until recently there were. In any event, the rule to use is the rule from 'balance_pool': "In exactly one pool."
,
Aug 31 2017
Re #23, #24, #25: I think it's high time we simply disallow external teams from toggling pool labels via automated jobs. The chromeos-infra team should own these automated management jobs. The impact of mistakes here is too big, and outside of the infra team they are pretty much impossible to anticipate or respond to. I've already contacted the folks behind performance pool balancing about rewriting their scripts as part of our usual inventory management jobs. Note that all these automated jobs will break when we move to skylab inventory unless they are accounted for and migrated by the infra team (some time mid Q4).
,
Sep 1 2017
I have some DUTs in wificells which use both pool:chameleon and pool:wificell (see chromeos9-row4-rack[7-8]-host[1-4] for examples).
,
Sep 1 2017
> I have some DUTs in wificells which use both pool:chameleon and pool:wificell [ ... ]
Neither chameleon nor wificell is a managed pool, so those DUTs aren't
affected or involved.
These are the managed pools:
bvt
cq
continuous
cts
arc-presubmit
crosperf
performance
suites
Only DUTs in one of those pools are affected.
,
Sep 5 2017
Another chameleon DUT has acquired a performance pool label:
chromeos2-row10-rack8-host13 pool:chameleon, pool:performance, pool:chameleon_audio_stable, pool:chameleon_hdmi_stable, pool:audio_board, pool:manual_run_chameleon_suites,
I'm going to go remove both 'performance' and 'crosperf' from the
managed pool labels until this gets sorted out. If it were possible
to block external tools from manipulating pool assignments, I'd do
that instead.
,
Sep 5 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/0a18de00b807e8b5f2b9111835fe1913d0b6ac8a

commit 0a18de00b807e8b5f2b9111835fe1913d0b6ac8a
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Tue Sep 05 21:15:07 2017

[autotest] Make the performance pools unmanaged.

Something external to Autotest is adding chameleon DUTs to the
'crosperf' and 'performance' pools. This is causing lab inventory
scripts to fail, because that creates DUTs that are in both "managed"
and "unmanaged" pools, which the tools can't handle.

This removes the two performance pools from the managed pool list
until we can stop the improper relabeling.

BUG=chromium:757815
TEST=None

Change-Id: I1fdcef327dad0dd15ebd546271e9a8457eb7d221
Reviewed-on: https://chromium-review.googlesource.com/650534
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/0a18de00b807e8b5f2b9111835fe1913d0b6ac8a/site_utils/suite_scheduler/constants.py
,
Sep 6 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/eabcf39bd25c46239e8e9d12d058c48d59f2c26d

commit eabcf39bd25c46239e8e9d12d058c48d59f2c26d
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Wed Sep 06 23:54:29 2017

[autotest] Exclude moblab from lab inventory management.

The moblab instances in Stierlin Ct are handled by a separate process
that's separate from the standard lab inventory cron job e-mail. So,
remove the board from consideration.

BUG=chromium:757815
TEST=unit test for sanity

Change-Id: Ie327752c2b2980f541e091867e75ac135f870e83
Reviewed-on: https://chromium-review.googlesource.com/647316
Commit-Ready: Richard Barnette <jrbarnette@chromium.org>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/eabcf39bd25c46239e8e9d12d058c48d59f2c26d/site_utils/lab_inventory.py
,
Sep 7 2017
I'm concerned about #34. moblab is in the CQ, and a lack of DUTs will cause CQ failures. We should exclude it from board_inventory but not from the DUT inventory; cros-infra-deputy needs to be aware of moblab health.
,
Sep 7 2017
For #33, I will also add a check in my script to make sure we won't pull a DUT with a chameleon board into the performance pools. Please let us know once we are clear to move forward, since we may need to run balance_pool between the performance pools and the inventory pool (pool:suites). Thanks~
,
Sep 7 2017
BTW, FYI: there is no DUT in the performance pools (pool:performance and pool:crosperf) with a chameleon label:

google3$ atest host list -b pool:performance | grep chameleon
google3$ atest host list -b pool:crosperf | grep chameleon
,
Sep 7 2017
> BTW, FYI: there is no DUT in the performance pools (pool:performance and pool:crosperf) with a chameleon label

That's true now because I removed the pool:crosperf label from the offending DUT cited in c#32.
,
Sep 7 2017
Another DUT has acquired a performance pool label inexplicably:
chromeos6-row2-rack20-host7 pool:suites, pool:performance,
I'm going to remove the 'pool:suites' label from the DUT for now.
All external services that change DUT labels (especially pool
labels) need to be turned off ASAP.
,
Sep 7 2017
I found the RPC history showing what happened to chromeos6-row2-rack20-host7:
09/06 01:49:16.669 INFO | 172.30.210.13| modify_host:cywang [{'locked': True, 'id': 'chromeos6-row2-rack20-host7', 'lock_reason': 'migrate_to_performance'}]
09/06 01:49:27.532 INFO | 172.30.210.13| label_add_hosts:cywang [{'hosts': ['chromeos6-row2-rack20-host7'], 'id': 'pool:performance'}]
09/06 01:49:38.300 INFO | 172.30.210.13| label_remove_hosts:cywang [{'hosts': ['chromeos6-row2-rack20-host7'], 'id': 'pool:suites'}]
09/06 01:49:50.207 INFO | 172.30.210.13| modify_host:cywang [{'locked': False, 'id': 'chromeos6-row2-rack20-host7', 'lock_reason': ''}]
09/07 01:07:29.335 INFO | 172.30.210.13| label_add_hosts:cywang [{'hosts': ['chromeos6-row2-rack20-host7'], 'id': 'pool:suites'}]
09/07 01:07:30.109 INFO | 172.30.210.13| label_remove_hosts:cywang [{'hosts': ['chromeos6-row2-rack20-host7'], 'id': 'pool:crosperf'}]
Initially, the host was removed from suites, and added to performance.
Then, the host was added back to suites, and removed from crosperf.
Removal from crosperf would have failed, since the label wasn't
present, leaving the DUT in both suites and performance.
,
Sep 8 2017
Ah, that's my bug: the script hardcodes the pool name (crosperf) when it tries to return a 'Repair Failed' DUT from pool:performance. Apologies for causing this problem. The DUT was originally added to pool:performance; then the script detected it was 'Repair Failed' at Thu Sep 7 16:07:24 CST 2017, so it tried to return it to pool:suites and remove it from pool:performance. However, the script incorrectly tried to remove it from 'pool:crosperf', which is why the DUT ended up in two pools. Here is my log for the DUT chromeos6-row2-rack20-host7:

===
[Move one DUT(wizpig - wizpig_intel_celeron_n3060_2Gb) from suites to pool:performance]
===> Migrate chromeos6-row2-rack20-host7 from suites to pool:performance...
Locked host: chromeos6-row2-rack20-host7
Added to label 'pool:performance' host: chromeos6-row2-rack20-host7
Removed from label 'pool:suites' host: chromeos6-row2-rack20-host7
Unlocked host: chromeos6-row2-rack20-host7
=================================================
========= Thu Sep 7 16:07:24 CST 2017 ==========
====== Check pool:performance ==========
[Check failed DUT(s) in pool:performance]
chromeos6-row2-rack20-host7 Repair Failed chromeos-server104.mtv.corp.google.com False None wizpig board:wizpig, accel:cros-ec, arc, os:cros, power:battery, ec:cros, servo, cts_abi_x86, cts_abi_arm, storage:mmc, webcam, internal_display, audio_loopback_dongle, usb_nic:r8152, sku:wizpig_intel_celeron_n3060_2Gb, variant:wizpig, touchpad, phase:PVT, pool:performance, cros-version:wizpig-release/R63-9916.0.0
Retire from pool:crosperf
Added to label 'pool:suites' host: chromeos6-row2-rack20-host7
Removed from label 'pool:crosperf' host: chromeos6-row2-rack20-host7
===
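For what it's worth, the corrected return path only needs to remove whichever performance-pool label the DUT actually carries instead of a hardcoded 'crosperf'. A rough sketch of that logic (the helper name is invented):

    PERFORMANCE_POOLS = frozenset(['performance', 'crosperf'])

    def pool_labels_to_remove(labels):
        # Return the performance-pool labels actually present on the DUT,
        # so the script removes what is there rather than a hardcoded pool.
        return [l for l in labels
                if l.startswith('pool:') and l[len('pool:'):] in PERFORMANCE_POOLS]

    labels = ['board:wizpig', 'pool:performance',
              'cros-version:wizpig-release/R63-9916.0.0']
    print(pool_labels_to_remove(labels))  # ['pool:performance'], not 'pool:crosperf'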
,
Oct 25 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/99473f6963e30282bed8ef4538af7d23be3fba6c

commit 99473f6963e30282bed8ef4538af7d23be3fba6c
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Wed Oct 25 05:47:53 2017

[autotest] Exclude hosts in more than one pool from lab inventory.

The lab inventory script fails if a DUT is assigned to a non-managed
pool. The query that constructs the inventory selects only hosts that
are in managed pools, but failure may still occur if a DUT is in both
a managed and non-managed pool.

The lab inventory script isn't meant to allow DUTs to be in more than
one pool, so this excludes those DUTs from inventory.

BUG=chromium:757815
TEST=run the script locally

Change-Id: Ia5f5ed762b58e9bd6ce3014c07ee7cc8fd39887c
Reviewed-on: https://chromium-review.googlesource.com/724216
Commit-Ready: Richard Barnette <jrbarnette@chromium.org>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/99473f6963e30282bed8ef4538af7d23be3fba6c/site_utils/lab_inventory.py
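Not the actual CL, but the shape of that change is roughly this: when building the inventory, skip any host carrying more than one pool: label. The names below are assumptions:

    def single_pool_hosts(hosts):
        """Keep only hosts with exactly one pool: label.

        Multi-pool hosts are excluded from inventory instead of crashing
        the script with a KeyError.
        """
        kept = []
        for hostname, labels in hosts:
            pools = [l for l in labels if l.startswith('pool:')]
            if len(pools) == 1:
                kept.append((hostname, labels))
        return kept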
,
Jan 26 2018
OK. We still may have problems with outside code that breaks the inventory by incorrectly adding/removing pool labels. But, that ain't this bug. _This_ bug is fixed. |
Comment 1 by pprabhu@chromium.org, Aug 22 2017
Summary: all lab inventory jobs failing because of KeyError 'chameleon' (was: stable update failing because of KeyError 'chameleon')