Automatically quarantine locked Mac devices
Issue description
Mac devices on swarming are supposed to remain unlocked, but can get into a state with the screen locked if they go into hibernation and later wake up. https://chromium-review.googlesource.com/c/chromium/src/+/1229366 adds logic to the GPU team's test runner to detect the lock screen and immediately fail if the device is locked. We would like similar logic to be added to swarming so that locked devices are automatically quarantined, since locked devices cannot properly run tests.
,
Sep 18
Issue 885381 has been merged into this issue.
,
Sep 18
Ken, you mentioned in the other issue that some devices like VMs might be functioning properly with the lock screen up. Do you think the devices in the Chrome-GPU pool are good test candidates, since they are all (at least AFAIK) physical devices and shouldn't be able to run tests with the lock screen up?
,
Sep 18
And a related question for maruel@: is there a good way to record when is_locked() returns true so we can test if there are any instances of tests running successfully on locked devices before rolling this out to all Mac devices on chromium-swarm?
,
Sep 18
Yes you can add it to the state: add the value in get_state() in bot_config. Sadly it will not be snapshotted in the task's result, that's issue 850560.
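A minimal sketch of what recording the value in the bot state could look like. The helper name `record_lockscreen` is hypothetical; the `lockscreen` key matches the one queried later in this thread, and the surrounding `get_state()` hook is assumed to return a plain dict.

```python
def record_lockscreen(state, locked):
    """Return a copy of the bot state dict with the lock-screen reading attached.

    `locked` is True, False, or None when detection could not determine the state.
    The input dict is left untouched so the caller can compare before and after.
    """
    updated = dict(state)
    updated["lockscreen"] = locked
    return updated
```

Inside a `get_state()` implementation this would be called with the result of the lock-screen check, so the value shows up in the bot's reported state without affecting scheduling.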
,
Sep 19
The following revision refers to this bug:
https://chromium.googlesource.com/infra/luci/luci-py.git/+/317abcef7dd2425699c50270caa0fbbf9f207d57

commit 317abcef7dd2425699c50270caa0fbbf9f207d57
Author: bsheedy <bsheedy@chromium.org>
Date: Wed Sep 19 00:00:55 2018

Add OSX lockscreen detection

Adds is_locked to the OSX-specific swarming API, which returns whether the lock screen is detected or not.

Bug: 885337
Change-Id: I904c6ee1cea1a0be66ba438497406f78ef961d15
Reviewed-on: https://chromium-review.googlesource.com/1232396
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>
Commit-Queue: Brian Sheedy <bsheedy@chromium.org>

[modify] https://crrev.com/317abcef7dd2425699c50270caa0fbbf9f207d57/appengine/swarming/swarming_bot/api/platforms/osx.py
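For readers without the change open, here is a rough sketch of how lock-screen detection on macOS can be done. It is not copied from the commit above. It assumes the pyobjc `Quartz` module is installed and that the session dictionary exposes a `CGSSessionScreenIsLocked` entry when the screen is locked. The helper `is_locked_from_session` is a hypothetical name, split out so the logic can be exercised without a Mac.

```python
def is_locked_from_session(session):
    """Interpret a session dictionary.

    Returns None if there is no session information, otherwise a bool saying
    whether the (assumed) CGSSessionScreenIsLocked entry is set.
    """
    if not session:
        return None
    return bool(session.get("CGSSessionScreenIsLocked", False))


def is_locked():
    """Return True/False for the lock screen, or None if it cannot be determined."""
    try:
        import Quartz  # provided by pyobjc; only available on macOS
    except ImportError:
        return None
    return is_locked_from_session(Quartz.CGSessionCopyCurrentDictionary())
```

Returning None rather than False when detection fails keeps "unknown" distinct from "unlocked", which matters if the result later drives quarantining.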
,
Sep 19
The following revision refers to this bug:
https://chrome-internal.googlesource.com/infradata/config/+/4a29d85be8308b84ed5f41268b07935516ef58be

commit 4a29d85be8308b84ed5f41268b07935516ef58be
Author: Brian Sheedy <bsheedy@google.com>
Date: Wed Sep 19 19:47:52 2018
,
Sep 19
Brian: yes, the machines in the Chrome-GPU pool are good candidates. Those are basically all of the Mac laptops which have been occasionally problematic. At some point we would like to run this detection code on the Mac Minis, too, and those are in the Chrome pool.
,
Sep 19
,
Sep 21
,
Sep 21
The following revision refers to this bug:
https://chrome-internal.googlesource.com/infradata/config/+/086e43efd1bb7fc5a6e8e5cbf6b010c71def85b2

commit 086e43efd1bb7fc5a6e8e5cbf6b010c71def85b2
Author: Brian Sheedy <bsheedy@google.com>
Date: Fri Sep 21 19:48:34 2018
,
Sep 21
Thanks for implementing the auto-quarantining, Brian! Please watch the dashboards at: https://chromium.googlesource.com/chromium/src/+/master/docs/gpu/pixel_wrangling.md#Fleet-Status and let's make sure that things are stable -- i.e. that we don't see a bunch of bots mysteriously auto-quarantining themselves.
,
Sep 21
I don't think I have access to the dashboards? When I load any of them, I get a mostly empty page with "Executors by status: Query completed successfully, but had no results.". In the meantime, I'll just keep an eye out for any quarantined Mac bots on Chrome-GPU https://chromium-swarm.appspot.com/botlist?c=id&c=os&c=task&c=status&f=pool%3AChrome-GPU&f=os%3AMac&f=status%3Aquarantined&l=1000&s=id%3Aasc
,
Sep 21
,
Sep 21
Argh, thanks for pointing that out. I should have tested the graphs myself. This is a recent regression and I've just filed a blocking bug about it.
,
Sep 21
Per the other bug this was my mistake for forgetting to update a couple of the Mac dashboards after a recent OS upgrade. Fix is incoming.
,
Sep 21
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/cdf07c8ab33218ffad49cab92ce49bc86a2f8f8e

commit cdf07c8ab33218ffad49cab92ce49bc86a2f8f8e
Author: Kenneth Russell <kbr@chromium.org>
Date: Fri Sep 21 21:12:54 2018

Update links to Mac GPU bot dashboards.

The bots have been upgraded to 10.13.6 and the graphs had to be updated as well.

Bug: 885337
Tbr: zmo@chromium.org
No-Try: True
Change-Id: I676817f3595fee7667c521437bdb907299c08bab
Reviewed-on: https://chromium-review.googlesource.com/1239413
Commit-Queue: Kenneth Russell <kbr@chromium.org>
Reviewed-by: Kenneth Russell <kbr@chromium.org>
Cr-Commit-Position: refs/heads/master@{#593332}

[modify] https://crrev.com/cdf07c8ab33218ffad49cab92ce49bc86a2f8f8e/docs/gpu/pixel_wrangling.md
,
Sep 24
I haven't seen any mass-quarantining yet on the Chrome-GPU pool, and from 888081, it looks like the quarantining is working properly when a machine gets into this state. Next step is to figure out when this logic needs to be applied to the Chrome pool since there are a few devices in this state that aren't failing tests. I think that can be a lower priority though since AFAIK the problematic machines were in Chrome-GPU.
,
Oct 3
Fantastic! Thank you Brian for getting things to this state! It would be fantastic if we could figure out the situations where the other machines have their lock screens up. If it's safe to require in Chrome's infra that the lock screens be disabled on Macs for tests to run, and we can auto-quarantine if not, then we can have a nice uniform configuration and close this out.
,
Oct 3
I can take a look sometime within the next few days when I have some free time. Since it looks like only the GPU tests have issues with the lockscreen and (AFAIK) the only machines that GPU tests run on in the Chrome pool are the Mac Minis, we might be able to just apply the quarantining to the Mac Minis.
,
Oct 3
Alright, so I took a quick look at the current bots with locked screens using the following (slightly modified from maruel@'s original command):

tools/swarming_client/swarming.py query -S chromium-swarm.appspot.com --json a.json --limit 0 'bots/list?dimensions=os:Mac'
python -c 'import json;d=json.load(open("a.json"))["items"];e={b["bot_id"]:json.loads(b["state"])for b in d}; print "\n".join("%s: %s" % (id, state.get("lockscreen")) for id,state in sorted(e.iteritems()) if state.get("lockscreen") == True)'

That only found 4 bots with locked screens, three of which were the ones originally found when lock screen detection was added, and the last one being one in the Chrome-GPU pool that's been quarantined:

build27-b4: MacBook Pro 11.2, unassigned pool
build5-b1: MacBook Pro 11.2, unassigned pool
build768-m4: MacMini 7.1, Chrome pool
build397-m4: MacBook Pro 11.5, Chrome-GPU pool

Given that, I think it's fairly safe to assume that stuff in the Chrome pool doesn't get into a locked state often. I think it'd be safe to enable quarantining for Mac devices in the Chrome pool, or if we want to be really safe, just Mac Minis.
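The one-liner in the comment above uses Python 2 syntax (the print statement and `iteritems`). A rough Python 3 equivalent of the filtering step is sketched below. It assumes the same `a.json` layout the one-liner relies on: an `items` list whose entries carry a `bot_id` and a JSON-encoded `state` string. The function name `locked_bots` is made up for illustration.

```python
import json


def locked_bots(query_result):
    """Return the sorted IDs of bots whose reported state has lockscreen == True."""
    locked = []
    for bot in query_result.get("items", []):
        # Each bot's state is stored as a JSON string inside the query result.
        state = json.loads(bot["state"])
        if state.get("lockscreen") is True:
            locked.append(bot["bot_id"])
    return sorted(locked)
```

With the query output saved as `a.json`, something like `locked_bots(json.load(open("a.json")))` would give the list of bot IDs to inspect.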
,
Oct 3
Great work identifying these machines Brian. https://chromium-swarm.appspot.com/bot?id=build5-b1&sort_stats=total%3Adesc https://chromium-swarm.appspot.com/bot?id=build27-b4&sort_stats=total%3Adesc Looks like these have run Perf and Pinpoint jobs in the past, but have been idle since February and March 2018, respectively. If these plus build768-m4 are the only un-quarantined Macs with a lock screen up, then I think proceeding with auto-quarantining for all Macs with a lock screen up sounds great!
,
Oct 3
https://chrome-internal-review.googlesource.com/c/infradata/config/+/690950 quarantines any locked devices in Chrome or Chrome-GPU - I can change that to be any locked device regardless of pool if that's desirable. Considering non-GPU tests don't seem to have issues with locked devices, not sure if there's much benefit to applying the logic to the entire swarming server.
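Since the change linked above lives in an internal repository, the following is only a sketch of what pool-scoped quarantining might look like, not the actual config. It assumes that setting a `quarantined` entry in the bot state is what marks a bot as quarantined, and that dimensions map `pool` to a list of pool names. `QUARANTINE_POOLS` and `maybe_quarantine` are hypothetical names.

```python
# Pools named in the comment above; everything else is left alone.
QUARANTINE_POOLS = {"Chrome", "Chrome-GPU"}


def maybe_quarantine(state, dimensions, locked):
    """Return a copy of `state`, quarantined if the bot is locked and in a covered pool.

    `locked` is True, False, or None (unknown); only a definite True quarantines.
    """
    updated = dict(state)
    pools = set(dimensions.get("pool", []))
    if locked is True and pools & QUARANTINE_POOLS:
        # Assumption: a string value doubles as the human-readable reason.
        updated["quarantined"] = "Lock screen is up"
    return updated
```

Extending this to every Mac regardless of pool, as discussed in the thread, would amount to dropping the pool check.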
,
Oct 3
For the sake of reducing the number of configurations we support, I think it would be better in the long run to say that in order to successfully deploy a Mac (including a VM) into Swarming, the lock screen has to be disabled.
,
Oct 4
The following revision refers to this bug:
https://chrome-internal.googlesource.com/infradata/config/+/f18910c956967c1117c52a51c48c33e3a3046ca4

commit f18910c956967c1117c52a51c48c33e3a3046ca4
Author: Brian Sheedy <bsheedy@google.com>
Date: Thu Oct 04 01:05:14 2018
,
Oct 24
Can this be considered fixed at this point?
,
Oct 24
I believe so, although do we want to remove the fail-on-lockscreen logic we initially added from the GPU test runner script before closing since it shouldn't be necessary any longer?
,
Oct 25
I think it's fine to leave it in. It should never be triggered, but could be a useful canary if something goes wrong.
,
Oct 25
,
Oct 25
Thank you Brian for adding this monitoring! It's categorically eliminated a significant cause of flakiness on our team's physical hardware.
,
Jan 15
Comment 1 by mar...@chromium.org
, Sep 18