Issue 793512


Issue metadata

Status: Archived
Owner:
Closed: Jul 10
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug




Cleanup master-shard RPCs

Project Member Reported by pprabhu@chromium.org, Dec 9 2017

Issue description

One of the original CLs that added shard support forwarded host modification RPCs to the shard directly.

(This CL is gone from gerrit)

Author: Jakob Juelich <jakobjuelich@chromium.org>
Date:   Wed Oct 1 12:43:23 2014 -0700

    [autotest] Forward calls to modify_host(s) to shards.
    
    If a host gets locked after it was assigned to a shard, right now
    only the master will know about it.
    
    With this commit, calls to modify_host and modify_hosts will be
    forwarded to the relevant shards, so locking of hosts will propagate.
    
    BUG=None
    DEPLOY=apache
    TEST=Ran suites and called rpcs manually, also with failing shard rpcs.
    
    Change-Id: I90ca34a4cefbdf55acd47ee6e8df872527b27285
    Reviewed-on: https://chromium-review.googlesource.com/220850
    Tested-by: Jakob Jülich <jakobjuelich@chromium.org>
    Reviewed-by: Fang Deng <fdeng@chromium.org>
    Commit-Queue: Jakob Jülich <jakobjuelich@chromium.org>


Later, MK came and changed modify_hosts to be forwarded to master:
https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/297595

This is the right way to go. The master should modify the host. Shard gets the update via a future shard_client heartbeat.

We need to reverse the direction of the delete_host RPC.
Currently, if you delete a host that belongs to a shard, it'll likely remain undeleted...
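
Roughly, the direction we want delete_host to take is to touch only the master's database, with the shard dropping the host once a later heartbeat no longer assigns it. A minimal sketch of that shape (illustrative only, not the actual autotest RPC body; models.Host.smart_get is the usual AFE lookup helper):

    def delete_host(id):
        # No direct call to the shard here: the shard learns that the host is
        # gone via a later shard_client heartbeat, the same channel that
        # carries every other master-side change down to it.
        models.Host.smart_get(id).delete()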
 
Cc: ayatane@chromium.org
Labels: -Pri-3 Pri-2
To test this theory, I deleted a DUT known to be in a wonky state (so no one will miss it):
pprabhu@pprabhu:chromite$ atest host list --web cautotest chromeos3-row1-rack2-host11
Host                         Status         Shard                                  Locked  Lock Reason                                                                                       Locked by  Platform  Labels
chromeos3-row1-rack2-host11  Repair Failed  chromeos-server12.cbf.corp.google.com  True    This DUT is still wonky: board claims stumpy, platform+shard claims storm. Who cares about this?  pprabhu    storm     bluetooth, sku:stumpy_intel_i5_2450m_2500mhz_4Gb, power:AC_only, board:stumpy, storage:ssd, gpu_family:sandybridge, board_freq_mem:stumpy_2.5GHz_4GB, graphics:gl, chaos_chamber, variant:stumpy, pool:wifichaos, variant:whirlwind, packet_capture, phase:EVT, sku:whirlwind_cpu_qualcomm_1Gb
pprabhu@pprabhu:chromite$ atest host list --web chromeos-server12.cbf chromeos3-row1-rack2-host11                                                                                              
Host                         Status         Shard                                  Locked  Lock Reason                                                                                       Locked by  Platform  Labels
chromeos3-row1-rack2-host11  Repair Failed  chromeos-server12.cbf.corp.google.com  True    This DUT is still wonky: board claims stumpy, platform+shard claims storm. Who cares about this?  pprabhu    storm     bluetooth, sku:stumpy_intel_i5_2450m_2500mhz_4Gb, power:AC_only, board:stumpy, storage:ssd, gpu_family:sandybridge, board_freq_mem:stumpy_2.5GHz_4GB, graphics:gl, chaos_chamber, variant:stumpy, pool:wifichaos, variant:whirlwind, packet_capture, phase:EVT, sku:whirlwind_cpu_qualcomm_1Gb
pprabhu@pprabhu:chromite$ atest shard list | grep server12
53   chromeos-server12.cbf.corp.google.com   board:stumpy, board:falco


DB state before delete.

Shard:

mysql> select * from afe_hosts where hostname='chromeos3-row1-rack2-host11' \G
*************************** 1. row ***************************
          id: 788
    hostname: chromeos3-row1-rack2-host11
      locked: 1
      status: Repair Failed
     invalid: 0
  protection: 0
locked_by_id: 377
   lock_time: 2017-03-24 13:01:04
       dirty: 1
      leased: 1
    shard_id: 53
 lock_reason: This DUT is still wonky: board claims stumpy, platform+shard claims storm. Who cares about this?
    synch_id: NULL
1 row in set (0.06 sec)


Master:
mysql> select * from afe_hosts where hostname='chromeos3-row1-rack2-host11' \G
*************************** 1. row ***************************
          id: 788
    hostname: chromeos3-row1-rack2-host11
      locked: 1
    synch_id: NULL
      status: Repair Failed
     invalid: 0
  protection: 4
locked_by_id: 377
   lock_time: 2017-03-24 13:01:04
       dirty: 1
      leased: 0
    shard_id: 53
 lock_reason: This DUT is still wonky: board claims stumpy, platform+shard claims storm. Who cares about this?
1 row in set (0.02 sec)


.... and it did the right thing!

Status: WontFix (was: Untriaged)
Ah, the rpc_utils.forward_single_host_rpc_to_shard is misnamed.
It first runs the RPC on the master, then runs it on shard: http://shortn/_CXCbGxTnR0

Of course, it runs the RPC on the master first and doesn't check the master's result. So you could end up with the host deleted on the master and then an error returned, leaving the master and shard out of sync in the wrong direction.

(All RPCs should run on slave first, then master. If they fail mid-stream, we consider the master state to be the correct one. In this case, since the RPC failed, the "correct" state is the old state).
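
For concreteness, the ordering argued for here would look roughly like this (a sketch only; every helper name below is hypothetical, not the existing rpc_utils API):

    def forward_host_rpc_shard_first(rpc_name, host, **kwargs):
        # Run the RPC on the host's shard first. If this raises, we stop here
        # and the master keeps its old state, which we treat as authoritative.
        if host.shard:
            run_rpc_on_shard(host.shard.hostname, rpc_name, **kwargs)  # hypothetical helper
        # Only once the shard call has succeeded do we apply the change on the
        # master, so a mid-stream failure never leaves the master ahead of the
        # shard.
        run_rpc_locally(rpc_name, host, **kwargs)  # hypothetical helper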
Labels: Hotlist-Fixit
Owner: pprabhu@chromium.org
Status: Started (was: WontFix)
Summary: Cleanup master-shard RPCs (was: delete_host RPC is forwarded to shard, should be forwarded to master)
Using this bug for some cleanup before I forget all I've learnt.

For any host, label, or attribute modification RPC, one of two approaches must be used:

- Host creation and deletion:
  - Only happens on the master. shard_heartbeat is the sole way a host is added to or removed from a shard.

- All other modifications:
  - First fan out to the shards for the given host.
  - If that succeeds, update the master.


The only information flow in the opposite direction is host.status: shards update host.status as a result of running a job, and shard_heartbeat syncs that back up to the master.
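
To illustrate that reverse flow (hypothetical names, not the real shard_heartbeat implementation), the master-side handling boils down to copying status from the host records the shard reports:

    def apply_heartbeat_host_updates(reported_hosts):
        # reported_hosts: host records the shard sent up in its heartbeat.
        for reported in reported_hosts:
            master_host = models.Host.objects.get(hostname=reported['hostname'])
            # Only status flows shard -> master; locks, labels and attributes
            # are owned by the master and flow in the other direction.
            master_host.status = reported['status']
            master_host.save()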

There is an inherent race in this design that I'm not tackling:
- a label modification on a host requires a shard (re-)assignment
- the next shard_heartbeat for the (old and) new shard will do the reassignment.
- a host / label / attribute modification that is concurrent with (i.e. happens before) the shard_heartbeat from the new shard is fanned out to the wrong shards, succeeds, and so the master is updated.

Now the new shard has incorrect information.
- sentinel will come and fix this.

This race is likely to occur when creating a new host. We create a new host and then do various label modifications on it.
I learnt that we're relying on sentinel for basic shard functionality, not just to fix raciness / partial update issues.

When adding a label to a host:
 - we create the label on the master
 - we create the label on the shard that has the host
 - we add the label to the host on the master and on the host's shard.

This means that the other shards don't have that label at all.
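
In outline (illustrative names only, not the actual RPC bodies), the current add-label path looks like:

    def add_label_to_host(label_name, host):
        # 1. Create the label on the master (if it doesn't exist yet).
        label = get_or_create_label_on_master(label_name)            # hypothetical helper
        if host.shard:
            # 2. Create the label on the one shard that has this host, and
            #    attach it to the host there.
            create_label_on_shard(host.shard, label_name)             # hypothetical helper
            add_label_to_host_on_shard(host.shard, host, label_name)  # hypothetical helper
        # 3. Attach the label to the host on the master.
        host.labels.add(label)
        # Every other shard's afe_labels table still has no row for this
        # label; only sentinel adds it there, asynchronously.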

Now, if you migrate the hosts for a label to another shard, that shard won't have all the relevant labels on its own.
It only has them because sentinel adds them asynchronously.


This is even true of a brand new shard: the only reason it has labels when it is assigned hosts is that sentinel added them.
Cc: akes...@chromium.org
Here's the stack of cleanup CLs I'll land and then call this bug fixed: https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/820681

But the main takeaway is: sentinel is critical to the functioning of shards, and especially to our ability to migrate hosts between shards.

I haven't been thinking of sentinel as a first-class entity so far -- thinking it's a backup, fixing things that fall off the road. That is wrong. We depend on sentinel updating the shards' afe_labels table to be able to add hosts that have those labels.
status?
Pushed to the backburner. Writing the CLs was the easy part; actually testing them is harder. I was considering this a Fixit. Are you interested in this beyond just a what-the-heck, i.e. do you have instances of this causing trouble recently? (I agree this _is_ a correctness problem currently.)
This was just a what-the-heck question.
Status: Archived (was: Started)
Kill shards with fire instead. Really.
