removing board from shard caused leftover afe_jobs timebomb
Issue description

Outage described on crbug.com/711838. My understanding of the root cause:

1) On or around Apr 10, we removed cyan from shard server102 using the new board-remove feature.
2) The shard correctly invalidated all hosts locally and aborted any HQEs associated with those hosts.
3) However, the afe_jobs table on the shard still contained entries for queued-up jobs for cyan.
4) For ~2 days, none of these jobs could make any progress on the shard (because there were no hosts with matching labels). This is a good thing. However, at some point the scheduler decided to time them out and abort them, setting their afe_jobs status to Aborted.
5) The shard client then saw jobs with shard=None (this is how the shard knows a job is "in need of upload to master") that had a final status, so it tried to send these to the master as finished jobs in its heartbeat.
6) The master had already assigned these jobs to a different shard, so it returned an UnallowedRecordsSentToMaster exception.
7) shard_client saw this exception and crashed.

Thoughts about resolutions to this to follow...
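To make steps 5-7 concrete, here is a minimal Python sketch of the pre-fix shard-side upload path. All names in it (collect_finished_jobs, master_rpc.shard_heartbeat, the job dict fields, the set of final statuses) are assumptions for illustration, not the actual shard_client code.

# Illustrative sketch only; not the real autotest shard_client implementation.

FINAL_STATUSES = ('Completed', 'Aborted', 'Failed')  # assumed set of final states


def collect_finished_jobs(local_jobs):
    """Step 5: jobs with shard=None and a final status are due for upload."""
    return [job for job in local_jobs
            if job['shard'] is None and job['status'] in FINAL_STATUSES]


def send_heartbeat(master_rpc, local_jobs):
    """Steps 6-7: if the master has assigned any of these jobs to another
    shard, it rejects the upload (UnallowedRecordsSentToMaster); pre-fix,
    the shard client let that exception crash the process."""
    finished = collect_finished_jobs(local_jobs)
    return master_rpc.shard_heartbeat(jobs=finished)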
Apr 15 2017
Regarding c#1 (Comment 1, below), the advice is based on SRE best practices:
https://docs.google.com/a/google.com/document/d/18YYLPkKnYsB90MJdodZXEOh3Vqw3b_v0x86s25AIL8Q/pub
In that document is this recommendation:
Fail safe: in case of failure [ ... ] fail in a way which preserves function, possibly at the expense of being overly permissive or overly simplistic, rather than overly restrictive.
Apr 17 2017
Improvements we might consider:

1) In the heartbeat handler on the master, if a shard tries to update results on a job that it doesn't own, silently drop that update rather than returning a failure (we want the rest of the heartbeat to succeed so we can return invalid hosts / invalid jobs).

2) Add an incorrect_job_ids field to the shard heartbeat response. The master can compute it on each heartbeat, based on the jobs that shard X thinks it owns but that are actually owned by shard Y according to the master db. When shard_client on shard X receives incorrect_job_ids, it locally updates all of those jobs to status=Aborted, shard=X, to permanently silence them from future updates (note: we set shard=X locally on that shard to silence future updates on the job, even though the job may authoritatively belong to shard Y).
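A rough Python sketch of both options, for illustration only. The function names, the lookup_job_owner callback, the local_db interface, and the response field are assumptions, not the actual frontend/afe/rpc_utils.py or shard_client API.

# Illustrative sketch only; names and signatures are assumed.
import logging


def filter_wrong_shard_jobs(shard, uploaded_jobs, lookup_job_owner):
    """Option 1 (master side): keep only the uploaded jobs that `shard`
    actually owns, silently dropping the rest so the remainder of the
    heartbeat still succeeds."""
    accepted, incorrect_job_ids = [], []
    for job in uploaded_jobs:
        owner = lookup_job_owner(job['id'])  # what the master db says
        if owner == shard:
            accepted.append(job)
        else:
            logging.warning('Ignoring update for job %s from %s; master '
                            'assigns it to %s', job['id'], shard, owner)
            incorrect_job_ids.append(job['id'])
    return accepted, incorrect_job_ids


def apply_incorrect_job_ids(local_db, incorrect_job_ids, this_shard):
    """Option 2 (shard side): permanently silence jobs the master says we
    don't own by marking them Aborted and claiming shard=this_shard locally."""
    for job_id in incorrect_job_ids:
        local_db.update_job(job_id, status='Aborted', shard=this_shard)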
Apr 17 2017
We could do #1 and #2 independently. #1 is probably the simplest. If we do that, we probably don't really need #2; it will just mean that a bunch of orphaned afe_jobs rows hang around on that shard for a few days until the shard scheduler aborts them (but the job update attempts will be silently ignored by the master).
Apr 18 2017
Fix #1 is in the CQ.
Apr 21 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/6dec0e1ddc726a0e5e2c0a8b2d8ebefd00ca32f0

commit 6dec0e1ddc726a0e5e2c0a8b2d8ebefd00ca32f0
Author: Aviv Keshet <akeshet@chromium.org>
Date: Fri Apr 21 00:13:47 2017

autotest: make master silently ignore wrong-shard job updates

BUG=chromium:711852
TEST=rpc_interface_unittest; rpc_utils_unittest; overhaul of a unittest to correctly test the wrong-job-in-heartbeat behavior

Change-Id: I5ebaa659da89f1dbcf484e1ef1f0c7ffef177544
Reviewed-on: https://chromium-review.googlesource.com/479550
Reviewed-by: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Commit-Queue: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/6dec0e1ddc726a0e5e2c0a8b2d8ebefd00ca32f0/frontend/afe/rpc_utils.py
[modify] https://crrev.com/6dec0e1ddc726a0e5e2c0a8b2d8ebefd00ca32f0/frontend/afe/frontend_test_utils.py
[modify] https://crrev.com/6dec0e1ddc726a0e5e2c0a8b2d8ebefd00ca32f0/frontend/afe/rpc_utils_unittest.py
[modify] https://crrev.com/6dec0e1ddc726a0e5e2c0a8b2d8ebefd00ca32f0/client/common_lib/error.py
[modify] https://crrev.com/6dec0e1ddc726a0e5e2c0a8b2d8ebefd00ca32f0/frontend/afe/rpc_interface_unittest.py
[modify] https://crrev.com/6dec0e1ddc726a0e5e2c0a8b2d8ebefd00ca32f0/frontend/afe/models.py
Comment 1 by jrbarnette@chromium.org, Apr 15 2017

My thoughts on the resolution:
* Find some way to avoid this kind of inconsistency. I'm agnostic as to how.
* Change the shard client code to fail open: once it's told that the data was rejected, it should update its local database to match what the master says is correct.
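A minimal sketch of the fail-open behavior suggested here, assuming the shard client can obtain the master's view of each rejected job. The local_db interface and the shape of master_record are hypothetical, not the real shard_client code.

# Illustrative sketch only; the local_db API and master_record fields are assumed.

def reconcile_rejected_job(local_db, job_id, master_record):
    """Fail open: overwrite the local afe_jobs row with whatever the master
    reports as correct, instead of crashing on the rejected upload."""
    local_db.update_job(job_id,
                        status=master_record['status'],
                        shard=master_record['shard'])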