Cleanup master-shard RPCs
Issue description

One of the original CLs that added shard support forwarded host modification RPCs to the shard directly. (This CL is gone from gerrit.)

  Author: Jakob Juelich <jakobjuelich@chromium.org>
  Date:   Wed Oct 1 12:43:23 2014 -0700

    [autotest] Forward calls to modify_host(s) to shards.

    If a host gets locked after it was assigned to a shard, right now only the
    master will know about it. With this commit, calls to modify_host and
    modify_hosts will be forwarded to the relevant shards, so locking of hosts
    will propagate.

    BUG=None
    DEPLOY=apache
    TEST=Ran suites and called rpcs manually, also with failing shard rpcs.

    Change-Id: I90ca34a4cefbdf55acd47ee6e8df872527b27285
    Reviewed-on: https://chromium-review.googlesource.com/220850
    Tested-by: Jakob Jülich <jakobjuelich@chromium.org>
    Reviewed-by: Fang Deng <fdeng@chromium.org>
    Commit-Queue: Jakob Jülich <jakobjuelich@chromium.org>

Later, MK came and changed modify_hosts to be forwarded to the master instead:
https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/297595

This is the right way to go. The master should modify the host; the shard then gets the update via a future shard_client heartbeat.

We need to reverse the delete_host RPC as well. Currently, if you delete a host (that belongs to a shard), it'll likely remain undeleted...
Dec 11 2017
To test this theory, I deleted a DUT known to be in a wonky state (so no one will miss it):

pprabhu@pprabhu:chromite$ atest host list --web cautotest chromeos3-row1-rack2-host11
Host                         Status         Shard                                  Locked  Lock Reason                                                                                        Locked by  Platform  Labels
chromeos3-row1-rack2-host11  Repair Failed  chromeos-server12.cbf.corp.google.com  True    This DUT is still wonky: board claims stumpy, platform+shard claims storm. Who cares about this?  pprabhu    storm     bluetooth, sku:stumpy_intel_i5_2450m_2500mhz_4Gb, power:AC_only, board:stumpy, storage:ssd, gpu_family:sandybridge, board_freq_mem:stumpy_2.5GHz_4GB, graphics:gl, chaos_chamber, variant:stumpy, pool:wifichaos, variant:whirlwind, packet_capture, phase:EVT, sku:whirlwind_cpu_qualcomm_1Gb

pprabhu@pprabhu:chromite$ atest host list --web chromeos-server12.cbf chromeos3-row1-rack2-host11
Host                         Status         Shard                                  Locked  Lock Reason                                                                                        Locked by  Platform  Labels
chromeos3-row1-rack2-host11  Repair Failed  chromeos-server12.cbf.corp.google.com  True    This DUT is still wonky: board claims stumpy, platform+shard claims storm. Who cares about this?  pprabhu    storm     bluetooth, sku:stumpy_intel_i5_2450m_2500mhz_4Gb, power:AC_only, board:stumpy, storage:ssd, gpu_family:sandybridge, board_freq_mem:stumpy_2.5GHz_4GB, graphics:gl, chaos_chamber, variant:stumpy, pool:wifichaos, variant:whirlwind, packet_capture, phase:EVT, sku:whirlwind_cpu_qualcomm_1Gb

pprabhu@pprabhu:chromite$ atest shard list | grep server12
53  chromeos-server12.cbf.corp.google.com  board:stumpy, board:falco

DB state before delete.

Shard:

mysql> select * from afe_hosts where hostname='chromeos3-row1-rack2-host11' \G
*************************** 1. row ***************************
          id: 788
    hostname: chromeos3-row1-rack2-host11
      locked: 1
      status: Repair Failed
     invalid: 0
  protection: 0
locked_by_id: 377
   lock_time: 2017-03-24 13:01:04
       dirty: 1
      leased: 1
    shard_id: 53
 lock_reason: This DUT is still wonky: board claims stumpy, platform+shard claims storm. Who cares about this?
    synch_id: NULL
1 row in set (0.06 sec)

Master:

mysql> select * from afe_hosts where hostname='chromeos3-row1-rack2-host11' \G
*************************** 1. row ***************************
          id: 788
    hostname: chromeos3-row1-rack2-host11
      locked: 1
    synch_id: NULL
      status: Repair Failed
     invalid: 0
  protection: 4
locked_by_id: 377
   lock_time: 2017-03-24 13:01:04
       dirty: 1
      leased: 0
    shard_id: 53
 lock_reason: This DUT is still wonky: board claims stumpy, platform+shard claims storm. Who cares about this?
1 row in set (0.02 sec)

.... and it did the right thing!
Dec 11 2017
Ah, rpc_utils.forward_single_host_rpc_to_shard is misnamed. It runs the RPC on the master first, and only then on the shard: http://shortn/_CXCbGxTnR0

It also doesn't check the result from the master before forwarding. So you could end up with the host deleted on the master and then an error returned from the shard call, leading to master-shard desync in the wrong direction. (All RPCs should run on the slave first, then the master. If they fail mid-stream, we consider the master state to be the correct one; in this case, since the RPC failed, the "correct" state is the old state.)
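For reference, a minimal sketch of the ordering we want instead. The function and parameter names here are made up for illustration; the real autotest plumbing passes RPC names and shard hostnames around differently:

def forward_host_rpc(run_on_shard, run_on_master, shard_hostname, **rpc_args):
    """Run a host-modification RPC on the host's shard first, then the master.

    If the shard call raises, the master has not been touched yet, so the
    master (the source of truth) still holds the old, consistent state.
    """
    if shard_hostname is not None:
        # Any failure here aborts the RPC before the master is modified.
        run_on_shard(shard_hostname, **rpc_args)
    # Only update the master once the shard call has succeeded.
    return run_on_master(**rpc_args)


# Toy usage with stand-in transports, just to show the ordering:
if __name__ == '__main__':
    log = []
    forward_host_rpc(
        run_on_shard=lambda shard, **kw: log.append(('shard', shard, kw)),
        run_on_master=lambda **kw: log.append(('master', kw)),
        shard_hostname='chromeos-server12.cbf.corp.google.com',
        hostname='chromeos3-row1-rack2-host11', locked=True)
    print(log)  # the shard entry is recorded before the master entry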
Dec 11 2017
Using this bug for some cleanup before I forget all I've learnt.

For any host, label, or attribute modification RPC, one of two approaches must be used (see the sketch at the end of this comment):

- Host creation and deletion:
  - only happens on the master. shard_heartbeat is the sole way a host is added to or removed from a shard.
- All other modifications:
  - first fan out to the shards for the given host;
  - if that succeeds, update the master.

The only information flow in the opposite direction is host.status: shards update host.status as a result of running a job, and shard_heartbeat syncs that back up to the master.

There is an inherent race in this design that I'm not tackling:

- A label modification on a host requires a shard (re-)assignment; the next shard_heartbeat for the (old and) new shard will do the reassignment.
- A concurrent host / label / attribute modification (before that shard_heartbeat from the new shard) is fanned out to the wrong shards, succeeds, and so the master is updated. Now the new shard has incorrect information.
- Sentinel will come along and fix this.

This race is most likely to occur when creating a new host: we create a new host and then do various label modifications on it.
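A minimal sketch of those two dispatch rules, just to pin down the intent. The names here (route_host_rpc, MASTER_ONLY_RPCS, the apply_* callables) are illustrative, not the real autotest rpc_interface:

MASTER_ONLY_RPCS = frozenset(['add_host', 'delete_host'])


def route_host_rpc(rpc_name, host_shard, apply_on_master, apply_on_shard):
    """Apply a host/label/attribute modification per the two rules above."""
    if rpc_name in MASTER_ONLY_RPCS:
        # Rule 1: creation/deletion happens on the master only. The shard
        # learns about it via its next shard_heartbeat, never via forwarding.
        return apply_on_master()
    # Rule 2: everything else fans out to the host's shard first, then the
    # master, so a mid-stream failure leaves the master (the source of truth)
    # untouched. The only data flowing the other way, shard -> master, is
    # host.status, and that rides on shard_heartbeat rather than on an RPC.
    if host_shard is not None:
        apply_on_shard(host_shard)
    return apply_on_master()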
Dec 11 2017
I learnt that we're relying on sentinel for basic shard functionality, not just to fix raciness / partial-update issues.

When adding a label to a host (sketched below):

- we create the label on the master;
- we create the label on the shard that has the host;
- we add the label to the host on the master and on the host's shard.

This means that the other shards don't have that label at all. Now, if you migrate the hosts for a label to another shard, that shard won't have all the relevant labels; it only has them because sentinel adds them asynchronously. This is even true of a brand-new shard: the only reason a shard has the labels when it is assigned hosts is that sentinel added them.
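A small, in-memory sketch of that flow. The data model (plain dicts standing in for the afe_labels / hosts_labels tables) and the helper name are made up for illustration:

def add_label_to_host(label_name, host, master, shards):
    """Create/apply a label on the master and on the host's shard ONLY.

    Note what this never does: it never creates the label on any other
    shard, so migrating the host later relies on sentinel to backfill the
    label on the destination shard.
    """
    host_shard = shards.get(host['shard_id'])
    for afe in filter(None, [master, host_shard]):
        afe['labels'].add(label_name)                              # create label
        afe['host_labels'].setdefault(host['id'], set()).add(label_name)


if __name__ == '__main__':
    new_afe = lambda: {'labels': set(), 'host_labels': {}}
    master, shards = new_afe(), {53: new_afe(), 54: new_afe()}
    add_label_to_host('pool:wifichaos', {'id': 788, 'shard_id': 53},
                      master, shards)
    print(sorted(shards[54]['labels']))  # [] -- shard 54 never saw the label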
Dec 11 2017
Here's the stack of cleanup CLs I'll land and then call this bug fixed: https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/820681

But the main takeaway is: sentinel is critical to the functioning of shards, and especially to our ability to migrate hosts between shards. I haven't been thinking of sentinel as a first-class entity so far -- I thought of it as a backup, fixing things that fall off the road. That is wrong. We depend on sentinel updating the shards' afe_labels table to be able to add hosts that have those labels.
Jan 29 2018
status?
Jan 31 2018
Pushed to the backburner. Writing the CLs was the easy part; actually testing them is harder. I was considering this a Fixit. Are you interested in this beyond just a what-the-heck, i.e. do you have instances of this causing trouble recently? (I agree this _is_ a correctness problem currently.)
Jan 31 2018
This was just a what-the-heck question.
Jul 10
Kill shards with fire instead. Really.
Comment 1 by pprabhu@chromium.org, Dec 9 2017
Labels: -Pri-3 Pri-2