Issue 732999

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Sep 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug
OKR




DUT's locked status may fail to propagate to the shard

Project Member Reported by ihf@chromium.org, Jun 13 2017

Issue description

I locked chromeos2-row8-rack1-host20 at 12:37 for issue 732434. It is still locked as of 15:39, yet it started running tests for Bernie at 14:07, which is unfortunate.

http://cautotest/afe/#tab_id=view_host
Locked:	Yes, by ihf on 2017-06-13 12:37:31, reason: 732434 

~/trunk/src/third_party/autotest/files/site_utils $ ./dut_status.py -f chromeos2-row8-rack1-host20
chromeos2-row8-rack1-host20
    2017-06-13 15:31:57  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack1-host20/730489-reset/
    2017-06-13 15:27:12  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/122960091-bhthompson/
    2017-06-13 15:26:36  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack1-host20/730452-reset/
    2017-06-13 15:21:49  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/122960088-bhthompson/
    2017-06-13 15:21:19  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack1-host20/730419-reset/
    2017-06-13 15:16:24  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/122960080-bhthompson/
    2017-06-13 15:15:53  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack1-host20/730387-reset/
    2017-06-13 15:09:57  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/122960075-bhthompson/
    2017-06-13 15:09:22  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack1-host20/730350-reset/
    2017-06-13 15:04:25  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/122960069-bhthompson/
    2017-06-13 15:03:51  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack1-host20/730304-reset/
    2017-06-13 14:58:54  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/122960068-bhthompson/
    2017-06-13 14:58:24  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack1-host20/730230-reset/
    2017-06-13 14:53:49  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/122960061-bhthompson/
    2017-06-13 14:53:19  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack1-host20/730152-reset/
    2017-06-13 14:48:36  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/122960055-bhthompson/
    2017-06-13 14:39:29  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack1-host20/729941-provision/
    2017-06-13 14:17:18  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack1-host20/729651-repair/
    2017-06-13 14:07:26  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack1-host20/729521-provision/
    2017-06-13 11:45:36  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/123140961-chromeos-test/
    2017-06-13 11:36:16  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack1-host20/728894-provision/
    2017-06-13 09:22:48  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/123135364-chromeos-test/


 
Cc: pprabhu@chromium.org
The DUT is locked in the master AFE, but not in the shard:
    $ atest host list chromeos2-row8-rack1-host20
    Host                         Status   Shard                                  Locked  ...
    chromeos2-row8-rack1-host20  Running  chromeos-server98.mtv.corp.google.com  True    732434  ...
    $ atest host list -w chromeos-server98.mtv.corp.google.com chromeos2-row8-rack1-host20
    Host                         Status   Shard                                  Locked  ...
    chromeos2-row8-rack1-host20  Running  chromeos-server98.mtv.corp.google.com  False   ...

I don't know how it happened (yet).

Unlocking and then re-locking the DUT fixed the problem.

My leading theory is that the lock got lost when I balanced
pools.

I'll do some work to try and confirm that...

I tried the following:
  * Lock a randomly chosen pool:suites DUT from the master
  * Confirm it was locked in its shard
  * Remove the pool:suites label in the master
  * See that the shard had the change
  * Restore the pool:suites label in the master
  * See that the shard had the change.

The updates had no impact on the lock status.

At this point, I'm not sure how to reproduce the problem.

ihf@ - when you locked the original DUT, did you do it from the
shard, or from the master?

Comment 4 by ihf@chromium.org, Jun 13 2017

I locked from http://cautotest/afe/#tab_id=view_host
Owner: jrbarnette@chromium.org
Status: Fixed (was: Untriaged)
I don't see a clear AI on this bug at this point.

The immediate problem has been resolved, and we don't (yet) have a way to reproduce it.
I'm going to close this bug as there's nothing more to do. It'll still serve as history for the generic problem.

I'll note that I've seen something similar in the past...

jrbarnette@: If you have ideas about how to follow through, please re-open, assign it to yourself, and dump your thoughts on the bug.
Owner: ----
Status: Available (was: Fixed)
Someone should do a basic survey of the source code to see
how locked status is transferred from the master to the shard,
with special attention to how it might get corrupted by
subsequent status updates.  Specifically, the history in this
case is:
 1) Host was locked at the master.
 2) The locked status was (should have been) transferred to
    the shard.
 3) balance_pool ran, removing one "pool:" label, and adding
    another.

Somewhere, there's either a window between 1) and 2) where the
status transfer can get dropped, or a bug between 2) and 3) where
locked status gets reset.

I'll note my experiments showed that normally, after locking a
DUT, the shard sees the status almost instantaneously.  So, there's
no plausible case that the problem occurred because 1) and 3)
occurred before 2).
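
For illustration, here is a minimal sketch of the first failure mode, where the master's row is updated but the fan-out between 1) and 2) is silently dropped. The dict layout and function name are hypothetical; the field names only mirror the afe_hosts columns that show up in the sentinel logs further down this bug.

    # Hypothetical illustration of the "dropped fan-out" window between 1) and 2):
    # the master's copy of the host is updated, but the forwarded update never
    # reaches the shard's copy, so the shard keeps scheduling jobs on the DUT.

    master = {'hostname': 'chromeos2-row8-rack1-host20', 'locked': 0, 'lock_reason': ''}
    shard = dict(master)  # the shard starts out consistent with the master

    def lock_on_master(reason, forward_ok=True):
        """Lock the host on the master, then (try to) fan the update out."""
        master.update(locked=1, lock_reason=reason)
        if forward_ok:
            shard.update(locked=1, lock_reason=reason)  # step 2) succeeds
        # else: step 2) is silently dropped -- the window described above

    lock_on_master('732434', forward_ok=False)
    print(master['locked'], shard['locked'])  # 1 0 -> master thinks it's locked, shard does not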

One additional action:  If we can't identify a possible cause from
a focused study of the code, we should look for someplace to add
a little extra logging that might be able to catch this sort of
problem in action, so that the _next_ time we see this, we'll have
more data.

Finally, aside from anecdotes about other cases of lost locks, I
note that we also have periodic cases of lost labels, especially
'pool:' labels.  So, there could be some sort of bug with updating
host status from the master to the shard.  Again, better logging
would help.

Following up on the available theories, here's the history
from the DUT:

[ ... ]
    2017-06-13 14:07:26  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos2-row8-rack1-host20/729521-provision/
    2017-06-13 11:45:36  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/123140961-chromeos-test/
[ ... ]

ihf@ locked the DUT at 12:37.  I ran 'balance_pool' shortly before
14:08.  So, it looks like the DUT scheduled nothing until the moment
I ran 'balance_pool'.  I'm disinclined to consider that coincidence.
We need to look at the possibility that 'balance_pool' can cause DUT
information to get lost.

Labels: -Pri-1 Pri-2
Although adding logs would be good, the question is where?

The RPCs involved in balance_pool run over multiple servers, and the only place they can log anything is the Apache log.

balance_pool itself already dumps a whole bunch of information that we don't save anywhere. So, adding more logging to the tool is unlikely to help in practice.


Lowering priority because this sounds like a minor degradation to a Class 1 Service, and I do not expect to work hard to close this within 6 weeks.
Happened again to me:
pprabhu@pprabhu:chromiumos$ atest host list chromeos2-row8-rack1-host3 -w cautotest
Host                        Status        Shard                                  Locked  Lock Reason                        Locked by  Platform  Labels
chromeos2-row8-rack1-host3  Provisioning  chromeos-server98.mtv.corp.google.com  True     crbug.com/734701   crbug.com/734690   pprabhu    caroline  board:caroline, bluetooth, accel:cros-ec, arc, hw_video_acc_enc_h264, os:cros, hw_jpeg_acc_dec, power:battery, ec:cros, hw_video_acc_h264, servo, cts_abi_x86, cts_abi_arm, storage:mmc, webcam, internal_display, audio_loopback_dongle, fwrw-version:caroline-firmware/R49-7820.263.0, fwro-version:caroline-firmware/R49-7820.263.0, sku:caroline_intel_skylake_core_m3_4Gb, phase:DVT, touchpad, touchscreen, variant:caroline, stylus, pool:cq, cros-version:caroline-paladin/R61-9663.0.0-rc2
pprabhu@pprabhu:chromiumos$ atest host list chromeos2-row8-rack1-host3 -w chromeos-server98.mtv
Host                        Status        Shard                                  Locked  Lock Reason  Locked by  Platform  Labels
chromeos2-row8-rack1-host3  Provisioning  chromeos-server98.mtv.corp.google.com  False                None       caroline  board:caroline, bluetooth, accel:cros-ec, arc, hw_video_acc_enc_h264, os:cros, hw_jpeg_acc_dec, power:battery, ec:cros, hw_video_acc_h264, servo, cts_abi_x86, cts_abi_arm, storage:mmc, webcam, internal_display, audio_loopback_dongle, fwrw-version:caroline-firmware/R49-7820.263.0, fwro-version:caroline-firmware/R49-7820.263.0, sku:caroline_intel_skylake_core_m3_4Gb, phase:DVT, touchpad, touchscreen, variant:caroline, stylus, pool:cq, cros-version:caroline-paladin/R61-9663.0.0-rc2


Exact same steps to get there.

Note that the DUT in question had gotten picked up for a job (and its provision task) before I got around to locking it. Maybe balance_pool gets confused if the DUT in question is not "Ready"?
> Although adding logs would be good. Questions is where?

Presumably, somewhere in the heartbeat exchange that updates
the afe_hosts table.

Part of the ask in this bug is for someone to go study that code
looking for theories that could explain this.  One expected outcome
of that exercise would be to identify knowledge gaps that should be
filled in with better debug logging.
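
For concreteness, the kind of debug logging meant here might look like the sketch below. The function and argument names are hypothetical placeholders, not the actual heartbeat code; the point is just to leave a trace whenever the lock bit on a shard's afe_hosts row changes.

    import logging

    def apply_host_update_from_master(host_row, update):
        """Hypothetical shard-side heartbeat hook: apply a host update received
        from the master, logging any change to the lock bit so a lost or
        overwritten lock leaves something we can correlate after the fact."""
        new_locked = update.get('locked', host_row['locked'])
        if host_row['locked'] != new_locked:
            logging.info('heartbeat: %s locked %s -> %s (reason=%r)',
                         host_row['hostname'], host_row['locked'],
                         new_locked, update.get('lock_reason'))
        host_row.update(update)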

> Note that the DUT in question had gotten picked up for a
> job (and its provision task) before I got around to locking
> it. Maybe balance_pool gets confused if the DUT in question is not "Ready"?

This sounds like a race condition updating host status:
  * The shard is updating host information based on the host status
    change.
  * At the same time, the master wants to update the host information
    based on locking or a "balance_pool" operation.
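
A minimal sketch of that race, with hypothetical names: both sides do a read-modify-write of the same afe_hosts row, and whichever full-row write lands last silently discards the other side's change.

    import copy

    # One afe_hosts row, as both sides see it just before the race.
    db_row = {'hostname': 'chromeos2-row8-rack1-host3',
              'status': 'Ready', 'locked': 0}

    # Shard side: reacts to the host picking up a job.
    shard_view = copy.deepcopy(db_row)        # read
    shard_view['status'] = 'Provisioning'     # modify

    # Master side: a user locks the host (or balance_pool rewrites it).
    master_view = copy.deepcopy(db_row)       # read
    master_view['locked'] = 1                 # modify

    # Writes land in this order; the later write clobbers the lock bit.
    db_row.update(master_view)                # locked=1 is written...
    db_row.update(shard_view)                 # ...then overwritten back to locked=0
    print(db_row)  # status is 'Provisioning', but locked is 0 again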

Labels: OKR
Issue 737819 has been merged into this issue.
Summary: DUT's locked status may fail to propagate to the shard (was: Locked DUT after some idle time does a provision and starts running many jobs)
Labels: -Pri-2 Pri-1
Upgrading priority. This is causing incorrect and perplexing behavior on a frequent basis, potentially contributing to downtime (or lengthened diagnosis time) of class 1 services...

Will find an owner.
Labels: Chase-Pending
Issue 748677 has been merged into this issue.
Owner: pprabhu@chromium.org
Status: Assigned (was: Available)
Step 1: the sentinel should at least tell us about these inconsistencies.
Labels: -Chase-Pending Chase
Project Member

Comment 21 by bugdroid1@chromium.org, Aug 3 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/63634d11dd3952f2035a26c495f562f2f7f1f6fd

commit 63634d11dd3952f2035a26c495f562f2f7f1f6fd
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Aug 03 19:18:47 2017

CL for metrics landed; awaiting prod push, which may inform a further fix.
Here's a graph showing specifically the afe_host locks being fixed by the sentinel: http://shortn/_KJrMqEI7Sg (counts in the last day)

So, there's a non-trivial number of these being fixed (32 in the last day right now). This makes me think this is not as hard to reproduce as we thought at first.
Status: Started (was: Assigned)
Looking at two instances of the sentinel auto-fixing the locked state, I see issues both ways:

In these logs, the first host record is from the shard and the second from the master. The sentinel always updates the shard to match the master.


2017-08-09 12:48:55,711 WARN| Inconsistent Host: {'lock_time': datetime.datetime(2017, 7, 24, 13, 30, 56), 'locked': 1, 'lock_reason': 'Unable to image due to bad blocks', 'hostname': 'chromeos4-row1-rack12-host3', 'invalid': 0, 'locked_by_id': 912L, 'shard_id': 72L, 'id': 2898L} != {'lock_time': None, 'locked': 0, 'lock_reason': '', 'hostname': 'chromeos4-row1-rack12-host3', 'invalid': 0, 'locked_by_id': None, 'shard_id': 72L, 'id': 2898L}
In this case the host was not locked on the master but was locked on the shard. I've pinged rsabanga@, who had it locked. But looking at host history, the host successfully recovered after a repair, whereas it had been locked for a few days (it was in a verify/repair loop before that).
This makes me think that rsabanga@ actually unlocked this host and that the update didn't propagate from the master to the shard on its own.

2017-08-09 12:54:12,155 WARN| Inconsistent Host: {'lock_time': None, 'locked': 0, 'lock_reason': '', 'hostname': 'chromeos1-row4-rack5-host4', 'invalid': 0, 'locked_by_id': None, 'shard_id': 76L, 'id': 4909L} != {'lock_time': datetime.datetime(2017, 8, 9, 11, 2, 47), 'locked': 1, 'lock_reason': 'Device is under test', 'hostname': 'chromeos1-row4-rack5-host4', 'invalid': 0, 'locked_by_id': 474L, 'shard_id': 76L, 'id': 4909L}
In this case Harpreet locked the host. The lock bit was only on the master, and was propagated to the shard by the sentinel.


So, the conclusion: somehow the lock/unlock update doesn't make it to the shard from the master when the RPC is made.
Later, the sentinel fixes it.

But this wouldn't have caused some of the bad behaviour noted above -- there would be a delay in the lock state sticking, but the system would catch up when the sentinel comes around.
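
For reference, a rough sketch of the kind of reconciliation the sentinel is described as doing (the shard is forced to match the master). The helper callables are hypothetical; only the compared fields come from the warnings above.

    # Hypothetical sketch of the sentinel's lock reconciliation: compare the
    # lock-related afe_hosts fields on master and shard, force the shard to
    # match the master, and count each fix for the monitoring metric.

    LOCK_FIELDS = ('locked', 'lock_time', 'lock_reason', 'locked_by_id')

    def reconcile_lock_state(master_row, shard_row, update_shard, bump_metric):
        """Return True if the shard row had to be corrected."""
        master_lock = {f: master_row[f] for f in LOCK_FIELDS}
        shard_lock = {f: shard_row[f] for f in LOCK_FIELDS}
        if master_lock == shard_lock:
            return False
        # The master is treated as the source of truth for lock state.
        update_shard(shard_row['hostname'], master_lock)
        bump_metric('hosts_inconsistencies_recovered')
        return True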



Re #25: rsabanga@ has verified my theory. That DUT was unlocked from the cautotest UI this morning.
Doh. My "conclusion" is already there in c#1. But now I have quantitative proof.
And I can prove when this is fixed, because that metric will go to 0 :D
I found a broken pattern of RPC forwarding: https://chromium-review.googlesource.com/c/608580

I'll see if it helps this issue, and if it does, go update other RPCs that fan out as well.
Project Member

Comment 29 by bugdroid1@chromium.org, Aug 10 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/8f01aa669d04798a28f5bea1df700327bc7d3901

commit 8f01aa669d04798a28f5bea1df700327bc7d3901
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Aug 10 19:26:40 2017

[autotest] Update shard state before master for modify_host

This is a testing-the-waters CL for the following idea: Any time there
is an rpc_fanout from the master AFE to the shards, the shard state must
be complete before the master state is modified.
This way, in case of a partial update, master state is left untouched.
Any corrective action assumes that the master state is correct in case
of master-shard desync, so that shard's partial updates will be
discarded.

Currently, we update master state before the shard in a few places. If
an update fails midway, this causes us to assume that the update was
successful, but the shards lag behind.

BUG= chromium:732999 
TEST=modify hosts on master-shard on the staging servers.

Change-Id: Ia8920550f75e10e51f1c86eef347346c88e08e5e
Reviewed-on: https://chromium-review.googlesource.com/608580
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/8f01aa669d04798a28f5bea1df700327bc7d3901/frontend/afe/rpc_interface.py
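
The ordering principle in the CL, sketched with hypothetical helpers (the real code lives in frontend/afe/rpc_interface.py): forward the update to the shard first, and only touch the master row once the shard write has succeeded, so a partial failure leaves the authoritative master state unchanged.

    # Sketch of shard-first ordering for a fanned-out modify_host-style RPC.
    # rpc_to_shard and write_master_row are hypothetical stand-ins.

    def modify_host_shard_first(host, updates, rpc_to_shard, write_master_row):
        if host.get('shard'):
            # If this raises, nothing has been written yet: master and shard
            # remain consistent and the caller can simply retry.
            rpc_to_shard(host['shard'], 'modify_host', host['hostname'], updates)
        # Only after the shard accepted the update do we modify the master,
        # which any later reconciliation treats as the source of truth.
        write_master_row(host['hostname'], updates)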

Status: Fixed (was: Started)
Status: Started (was: Fixed)
Incorrect lock state metrics have improved, but are still non-zero: http://shortn/_N7io9sooPD
AI: Make sure sentinel is being as aggressive as possible in fixing the lock state.
No progress. Chase bugs should not be left untouched. Either make reasonable progress this week, or justify removing from list (yes, I'm berating myself).
This has magically disappeared.
Both the metric and sentinel logs show that there are no longer any locking inconsistencies detected / fixed.

metrics: http://shortn/_ISdrKC2rh4
But the metrics' data is completely missing for '# hosts inconsistencies recovered', going back 90 days. That just can't be true.
Note that the toolchain team had some hand-rolled script to lock DUTs that stopped working (for unknown reasons) ~Aug 30th. Dunno if they were causing havoc here. See issue 762589.
If so, they might fix their script and start causing havoc again.
Status: Fixed (was: Started)
Due diligence done.
I'm trusting the logs from the sentinel and marking this fixed, unless proved otherwise.

Comment 38 by dchan@chromium.org, Jan 22 2018

Status: Archived (was: Fixed)

Comment 39 by dchan@chromium.org, Jan 23 2018

Status: Fixed (was: Archived)
