DUT's locked status may fail to propagate to the shard
Issue description:

I locked chromeos2-row8-rack1-host20 at 12:37 for issue 732434. It is still locked as of 15:39, yet it started running tests for Bernie at 14:07, which is unfortunate.

http://cautotest/afe/#tab_id=view_host
Locked: Yes, by ihf on 2017-06-13 12:37:31, reason: 732434

~/trunk/src/third_party/autotest/files/site_utils $ ./dut_status.py -f chromeos2-row8-rack1-host20
chromeos2-row8-rack1-host20
    2017-06-13 15:31:57  OK  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack1-host20/730489-reset/
    2017-06-13 15:27:12  --  http://cautotest/tko/retrieve_logs.cgi?job=/results/122960091-bhthompson/
    2017-06-13 15:26:36  OK  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack1-host20/730452-reset/
    2017-06-13 15:21:49  --  http://cautotest/tko/retrieve_logs.cgi?job=/results/122960088-bhthompson/
    2017-06-13 15:21:19  OK  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack1-host20/730419-reset/
    2017-06-13 15:16:24  --  http://cautotest/tko/retrieve_logs.cgi?job=/results/122960080-bhthompson/
    2017-06-13 15:15:53  OK  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack1-host20/730387-reset/
    2017-06-13 15:09:57  --  http://cautotest/tko/retrieve_logs.cgi?job=/results/122960075-bhthompson/
    2017-06-13 15:09:22  OK  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack1-host20/730350-reset/
    2017-06-13 15:04:25  --  http://cautotest/tko/retrieve_logs.cgi?job=/results/122960069-bhthompson/
    2017-06-13 15:03:51  OK  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack1-host20/730304-reset/
    2017-06-13 14:58:54  --  http://cautotest/tko/retrieve_logs.cgi?job=/results/122960068-bhthompson/
    2017-06-13 14:58:24  OK  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack1-host20/730230-reset/
    2017-06-13 14:53:49  --  http://cautotest/tko/retrieve_logs.cgi?job=/results/122960061-bhthompson/
    2017-06-13 14:53:19  OK  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack1-host20/730152-reset/
    2017-06-13 14:48:36  --  http://cautotest/tko/retrieve_logs.cgi?job=/results/122960055-bhthompson/
    2017-06-13 14:39:29  OK  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack1-host20/729941-provision/
    2017-06-13 14:17:18  OK  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack1-host20/729651-repair/
    2017-06-13 14:07:26  --  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack1-host20/729521-provision/
    2017-06-13 11:45:36  --  http://cautotest/tko/retrieve_logs.cgi?job=/results/123140961-chromeos-test/
    2017-06-13 11:36:16  OK  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack1-host20/728894-provision/
    2017-06-13 09:22:48  --  http://cautotest/tko/retrieve_logs.cgi?job=/results/123135364-chromeos-test/
Jun 13 2017
Unlocking and then re-locking the DUT fixed the problem. My leading theory is that the lock got lost when I balanced pools. I'll do some work to try and confirm that...
Jun 13 2017
I tried the following:
* Lock a randomly chosen suites DUT from the master
* Confirm it was locked in its shard
* Remove the pool:suites label in the master
* See the shard had the change
* Restore the pool:suites label in the master
* See that the shard had the change.
The updates had no impact on the lock status. At this point, I'm not sure how to reproduce the problem.
ihf@ - when you locked the original DUT, did you do it from the shard, or from the master?
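For reference, the check amounts to roughly the sketch below. The four helpers are placeholders for whatever atest/AFE calls you use; only the ordering of the steps matters here, none of this is real autotest code:

    import time

    def lock_survives_label_churn(hostname, lock_host, remove_pool_label,
                                  add_pool_label, shard_shows_locked):
        """Lock a DUT on the master, churn its pool label on the master, and
        check after every step that the shard still reports it locked."""
        lock_host(hostname)
        time.sleep(5)  # give the shard a moment to pick up the change
        assert shard_shows_locked(hostname), 'lock never reached the shard'

        remove_pool_label(hostname)
        time.sleep(5)
        assert shard_shows_locked(hostname), 'label removal clobbered the lock'

        add_pool_label(hostname)
        time.sleep(5)
        assert shard_shows_locked(hostname), 'label re-add clobbered the lock'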
Jun 13 2017
I locked from http://cautotest/afe/#tab_id=view_host
Jun 14 2017
I don't see a clear AI on this bug at this point. The immediate problem has been resolved, and we don't (yet) have a way to reproduce the problem. I'm going to close this bug as there's nothing more to do. It'll still serve as history for the generic problem. I'll note that I've seen something similar in the past... jrbarnette@: If you have ideas about how to follow through, please re-open and assign it to yourself, dumping your thoughts on the bug.
Jun 15 2017
Someone should do a basic survey of the source code, to see
how locked status is transferred from the master to the shard,
with special attention to how it might get corrupted by
subsequent status updates. Specifically, the history in this
case is this:
1) Host was locked at the master.
2) The locked status was (should have been) transferred to
the shard.
3) balance_pool ran, removing one "pool:" label, and adding
another.
Somewhere, there's a window either between 1) and 2) where the
status transfer can get dropped, or there's a bug between 2) and
3) where locked status gets reset.
I'll note my experiments showed that normally, after locking a
DUT, the shard sees the status almost instantaneously. So, there's
no plausible case that the problem occurred because 1) and 3)
occurred before 2).
Jun 16 2017
One additional action: If we can't identify a possible cause from a focused study of the code, we should look for someplace to add a little extra logging that might be able to catch this sort of problem in action, so that the _next_ time we see this, we'll have more data. Finally, aside from anecdotes about other cases of lost locks, I note that we also have periodic cases of lost labels, especially 'pool:' labels. So, there could be some sort of bug with updating host status from the master to the shard. Again, better logging would help.
Jun 16 2017
Following up on the available theories, here's the history
from the DUT:
[ ... ]
2017-06-13 14:07:26 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack1-host20/729521-provision/
2017-06-13 11:45:36 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/123140961-chromeos-test/
[ ... ]
ihf@ locked the DUT at 12:37. I ran 'balance_pool' shortly before
14:08. So, it looks like the DUT scheduled nothing until the moment
I ran 'balance_pool'. I'm disinclined to consider that a coincidence.
We need to look at the possibility that 'balance_pool' can cause DUT
information to get lost.
Jun 16 2017
Although adding logs would be good, the question is where? The RPCs involved in balance_pool run over multiple servers, and the only place they can log anything is the Apache log. balance_pool itself already dumps a whole bunch of information that we don't save anywhere. So, adding more logging to the tool is unlikely to help in practice. Lowering priority because this sounds like a minor degradation of a Class 1 Service, and I do not expect to work hard to close this within 6 weeks.
Jun 19 2017
Happened again to me:

pprabhu@pprabhu:chromiumos$ atest host list chromeos2-row8-rack1-host3 -w cautotest
Host  Status  Shard  Locked  Lock Reason  Locked by  Platform  Labels
chromeos2-row8-rack1-host3  Provisioning  chromeos-server98.mtv.corp.google.com  True  crbug.com/734701 crbug.com/734690  pprabhu  caroline  board:caroline, bluetooth, accel:cros-ec, arc, hw_video_acc_enc_h264, os:cros, hw_jpeg_acc_dec, power:battery, ec:cros, hw_video_acc_h264, servo, cts_abi_x86, cts_abi_arm, storage:mmc, webcam, internal_display, audio_loopback_dongle, fwrw-version:caroline-firmware/R49-7820.263.0, fwro-version:caroline-firmware/R49-7820.263.0, sku:caroline_intel_skylake_core_m3_4Gb, phase:DVT, touchpad, touchscreen, variant:caroline, stylus, pool:cq, cros-version:caroline-paladin/R61-9663.0.0-rc2

pprabhu@pprabhu:chromiumos$ atest host list chromeos2-row8-rack1-host3 -w chromeos-server98.mtv
Host  Status  Shard  Locked  Lock Reason  Locked by  Platform  Labels
chromeos2-row8-rack1-host3  Provisioning  chromeos-server98.mtv.corp.google.com  False  None  caroline  board:caroline, bluetooth, accel:cros-ec, arc, hw_video_acc_enc_h264, os:cros, hw_jpeg_acc_dec, power:battery, ec:cros, hw_video_acc_h264, servo, cts_abi_x86, cts_abi_arm, storage:mmc, webcam, internal_display, audio_loopback_dongle, fwrw-version:caroline-firmware/R49-7820.263.0, fwro-version:caroline-firmware/R49-7820.263.0, sku:caroline_intel_skylake_core_m3_4Gb, phase:DVT, touchpad, touchscreen, variant:caroline, stylus, pool:cq, cros-version:caroline-paladin/R61-9663.0.0-rc2

Exact same steps to get there. Note that the DUT in question had gotten picked up for a job (and its provision task) before I got around to locking it. Maybe balance_pool gets confused if the DUT in question is not "Ready"?
Jun 19 2017
> Although adding logs would be good, the question is where?
Presumably, somewhere in the heartbeat exchange that updates the afe_hosts table. Part of the ask in this bug is for someone to go study that code looking for theories that could explain this. One expected outcome of that exercise would be to identify knowledge gaps that should be filled in with better debug logging.
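For example, the shard side could log every change to the lock-related columns as it applies a host update from the heartbeat. Purely illustrative; apply_host_update below is a placeholder, not the real heartbeat code:

    import logging

    LOCK_FIELDS = ('locked', 'locked_by_id', 'lock_time', 'lock_reason')

    def apply_host_update(current_row, incoming_row):
        """Apply a master->shard host update, logging lock-field transitions."""
        for field in LOCK_FIELDS:
            old, new = current_row.get(field), incoming_row.get(field)
            if old != new:
                logging.info('heartbeat: %s.%s changed %r -> %r',
                             current_row.get('hostname'), field, old, new)
        current_row.update(incoming_row)
        return current_row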
Jun 19 2017
> Note that the DUT in question had gotten picked up for a
> job (and its provision task) before I got around to locking
> it. Maybe balance_pool gets confused if the DUT in question is not "Ready"?
This sounds like a race condition updating host status (sketched below):
* The shard is updating host information based on the host status change.
* At the same time, the master wants to update the host information based on locking or a "balance_pool" operation.
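Purely illustrative sketch of that lost-update pattern (nothing here is real autotest code): two writers each read the whole host record, modify one field, and write the whole record back; whichever writes last silently discards the other's change.

    import copy
    import threading
    import time

    host_row = {'hostname': 'chromeos2-row8-rack1-host3',
                'status': 'Ready', 'locked': 0}

    def read_modify_write(update, delay):
        row = copy.deepcopy(host_row)  # read the full row
        time.sleep(delay)              # window in which the other writer runs
        row.update(update)             # change one field
        host_row.update(row)           # write the whole (now stale) row back

    status_sync = threading.Thread(target=read_modify_write,
                                   args=({'status': 'Provisioning'}, 0.2))
    lock_rpc = threading.Thread(target=read_modify_write, args=({'locked': 1}, 0))
    status_sync.start()
    lock_rpc.start()
    status_sync.join()
    lock_rpc.join()
    # Typically prints locked=0: the status sync wrote back a copy taken
    # before the lock landed, so the lock update is lost.
    print(host_row)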
Jun 29 2017
Issue 737819 has been merged into this issue.
Jul 25 2017
Upgrading priority. This is causing incorrect and perplexing behavior on a frequent basis, potentially contributing to downtime (or lengthened diagnosis time) of class 1 services... Will find an owner.
Jul 28 2017
Issue 748677 has been merged into this issue.
Jul 31 2017
Step 1: the sentinel should at least tell us about these inconsistencies.
Aug 3 2017
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/63634d11dd3952f2035a26c495f562f2f7f1f6fd

commit 63634d11dd3952f2035a26c495f562f2f7f1f6fd
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Aug 03 19:18:47 2017
Aug 7 2017
CL for metrics landed; awaiting prod push, which may inform a further fix.
Aug 9 2017
Here's a graph showing specifically the afe_host locks being fixed by the sentinel: http://shortn/_KJrMqEI7Sg (counts in the last day). So, there's a non-trivial number of these being fixed (32 in the last day right now). This makes me think this is not as hard to reproduce as we thought at first.
Aug 9 2017
Looking at two instances of the sentinel auto-fixing the locked state, I see issues both ways:
In these logs, the first record is from the shard, the second from the master. The sentinel always updates the shard to match the master.
2017-08-09 12:48:55,711 WARN| Inconsistent Host: {'lock_time': datetime.datetime(2017, 7, 24, 13, 30, 56), 'locked': 1, 'lock_reason': 'Unable to image due to bad blocks', 'hostname': 'chromeos4-row1-rack12-host3', 'invalid': 0, 'locked_by_id': 912L, 'shard_id': 72L, 'id': 2898L} != {'lock_time': None, 'locked': 0, 'lock_reason': '', 'hostname': 'chromeos4-row1-rack12-host3', 'invalid': 0, 'locked_by_id': None, 'shard_id': 72L, 'id': 2898L}
In this case the host was not locked on the master, but was on the shard. I've pinged rsabanga@, who had it locked. But, looking at host history, the host successfully recovered after a repair, whereas it had been locked for a few days (it was in a verify/repair loop before that).
This makes me think that rsabanga@ actually unlocked this host and that the update didn't propagate from master to shard on its own.
2017-08-09 12:54:12,155 WARN| Inconsistent Host: {'lock_time': None, 'locked': 0, 'lock_reason': '', 'hostname': 'chromeos1-row4-rack5-host4', 'invalid': 0, 'locked_by_id': None, 'shard_id': 76L, 'id': 4909L} != {'lock_time': datetime.datetime(2017, 8, 9, 11, 2, 47), 'locked': 1, 'lock_reason': 'Device is under test', 'hostname': 'chromeos1-row4-rack5-host4', 'invalid': 0, 'locked_by_id': 474L, 'shard_id': 76L, 'id': 4909L}
In this case Harpreet locked the host. The lock bit was only on the master, and was propagated to the shard by the sentinel.
So, the conclusion: somehow the lock/unlock update doesn't make it to the shard from the master when the RPC is made.
Later, the sentinel fixes it.
But this wouldn't have caused some of the bad behaviour noted above -- there would be a delay in the lock state sticking, but the system would catch up when the sentinel comes around.
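For concreteness, the fix-up the sentinel applies amounts to roughly this (a sketch; write_shard_row is a made-up helper standing in for however the sentinel persists the corrected row, none of this is the actual sentinel code):

    import logging

    LOCK_FIELDS = ('locked', 'locked_by_id', 'lock_time', 'lock_reason')

    def reconcile_lock_state(master_row, shard_row, write_shard_row):
        """Overwrite the shard's lock fields with the master's if they differ.

        Returns True when a fix was applied.
        """
        if all(shard_row.get(f) == master_row.get(f) for f in LOCK_FIELDS):
            return False
        # Same shape as the log lines above: shard row first, master row second.
        logging.warning('Inconsistent Host: %s != %s', shard_row, master_row)
        shard_row.update({f: master_row.get(f) for f in LOCK_FIELDS})
        write_shard_row(shard_row)  # the shard is always made to match the master
        return True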
Aug 9 2017
Re #25: rsabanga@ has verified my theory. That DUT was unlocked from the cautotest UI this morning.
Aug 9 2017
Doh. My "conclusion" is there in c#1. But now I have quantitative proof. And I can prove when this is fixed, because that metric will go to 0 :D
Aug 9 2017
I found a broken pattern of RPC forwarding: https://chromium-review.googlesource.com/c/608580
I'll see if it helps this issue, and if it does, go update other RPCs that fan out as well.
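The shape of the fix, as a sketch (the helper names are made up, this is not the real rpc_interface code): forward the update to the host's shard first, and only touch the master's state once that call has succeeded, so a partial failure can never leave the master ahead of the shard.

    def modify_host_with_fanout(host, update, run_rpc_on_shard, apply_on_master):
        """Apply a host update shard-first, master-second."""
        if host.shard:
            # May raise. In that case the master row is never modified, so any
            # later corrective action (which trusts the master) stays correct.
            run_rpc_on_shard(host.shard, 'modify_host', id=host.id, **update)
        apply_on_master(host, update)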
Aug 10 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/8f01aa669d04798a28f5bea1df700327bc7d3901

commit 8f01aa669d04798a28f5bea1df700327bc7d3901
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Aug 10 19:26:40 2017

[autotest] Update shard state before master for modify_host

This is a testing-the-waters CL for the following idea: Any time there is an rpc_fanout from the master AFE to the shards, the shard state must be complete before the master state is modified. This way, in case of a partial update, master state is left untouched. Any corrective action assumes that the master state is correct in case of master-shard desync, so that shard's partial updates will be discarded.

Currently, we update master state before the shard in a few places. If an update fails midway, this causes us to assume that the update was successful, but the shards lag behind.

BUG=chromium:732999
TEST=modify hosts on master-shard on the staging servers.

Change-Id: Ia8920550f75e10e51f1c86eef347346c88e08e5e
Reviewed-on: https://chromium-review.googlesource.com/608580
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/8f01aa669d04798a28f5bea1df700327bc7d3901/frontend/afe/rpc_interface.py
Aug 14 2017
Incorrect lock state metrics have improved, but are still non-zero: http://shortn/_N7io9sooPD
Aug 21 2017
AI: Make sure sentinel is being as aggressive as possible in fixing the lock state.
Sep 5 2017
No progress. Chase bugs should not be left untouched. Either make reasonable progress this week, or justify removing from list (yes, I'm berating myself).
Sep 11 2017
This has magically disappeared. Both the metric and the sentinel logs show that there are no longer any locking inconsistencies detected or fixed. Metrics: http://shortn/_ISdrKC2rh4
Sep 11 2017
But the metrics' data is completely missing for '# hosts inconsistencies recovered', going back 90 days. That just can't be true.
Sep 11 2017
Note that the toolchain team had some hand-rolled script to lock DUTs that stopped working (for unknown reasons) ~Aug 30th. Dunno if they were causing havoc here. See issue 762589. If so, they might fix their script and start causing havoc again.
Sep 11 2017
Due diligence done. I'm trusting the logs from the sentinel and marking this fixed, unless proved otherwise.
Comment 1 by jrbarnette@chromium.org, Jun 13 2017

The DUT is locked in the master AFE, but not in the shard:

$ atest host list chromeos2-row8-rack1-host20
Host  Status  Shard  Locked  ...
chromeos2-row8-rack1-host20  Running  chromeos-server98.mtv.corp.google.com  True  732434  ...

$ atest host list -w chromeos-server98.mtv.corp.google.com chromeos2-row8-rack1-host20
Host  Status  Shard  Locked  ...
chromeos2-row8-rack1-host20  Running  chromeos-server98.mtv.corp.google.com  False  ...

I don't know how it happened (yet).
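For the record, this comparison is easy to script. A sketch only: it assumes atest is on $PATH, that 'cautotest' is the right -w target for the master, and that Locked is the fourth whitespace-separated field, as in the output above.

    import subprocess

    HOST = 'chromeos2-row8-rack1-host20'
    MASTER = 'cautotest'
    SHARD = 'chromeos-server98.mtv.corp.google.com'

    def locked_column(server):
        """Return the 'Locked' field atest reports for HOST on `server`."""
        out = subprocess.check_output(
            ['atest', 'host', 'list', HOST, '-w', server],
            universal_newlines=True)
        # Skip the header row; fields are Host, Status, Shard, Locked, ...
        return out.splitlines()[1].split()[3]

    if locked_column(MASTER) != locked_column(SHARD):
        print('%s: master and shard disagree on the lock bit' % HOST)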