
Issue 885337

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Oct 25
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Mac
Pri: 2
Type: Feature

Blocked on:
issue 888119

Blocking:
issue 884913
issue 888081
issue 922237




Automatically quarantine locked Mac devices

Project Member Reported by bsheedy@chromium.org, Sep 18

Issue description

Mac devices on swarming are supposed to remain unlocked, but can get into a state with the screen locked if they go into hibernation and later wake up.

https://chromium-review.googlesource.com/c/chromium/src/+/1229366 adds logic to the GPU team's test runner to detect the lock screen and fail immediately if the device is locked. We would like similar logic added to swarming so that locked devices are automatically quarantined, since they cannot run tests properly.
 
Let's add is_locked() to https://cs.chromium.org/chromium/infra/luci/appengine/swarming/swarming_bot/api/platforms/osx.py and then call it from the relevant bot_config.py.
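
For illustration, a minimal sketch of what such a check might look like. It assumes the pyobjc Quartz bindings are installed on the bot and that the session dictionary reports the lock state under the key CGSSessionScreenIsLocked; the helper name screen_locked_from_session is hypothetical, and this is not necessarily the code that landed in osx.py.

```python
def screen_locked_from_session(session):
    """Interprets a macOS session dictionary.

    Returns None when the state cannot be determined, otherwise a bool.
    Assumption: the lock state is exposed as 'CGSSessionScreenIsLocked'.
    """
    if not session:
        return None
    return bool(session.get("CGSSessionScreenIsLocked", False))


def is_locked():
    """Returns True or False for the lock screen state, or None if unknown."""
    try:
        import Quartz  # Provided by pyobjc; only importable on macOS.
    except ImportError:
        return None
    return screen_locked_from_session(Quartz.CGSessionCopyCurrentDictionary())
```

Returning None rather than False when Quartz is unavailable keeps "unknown" distinct from "unlocked", so a caller can decide separately how to treat bots where detection is not possible.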
Issue 885381 has been merged into this issue.
Cc: kbr@chromium.org
Ken, you mentioned in the other issue that some devices like VMs might be functioning properly with the lock screen up. Do you think the devices in the Chrome-GPU pool are good test candidates, since they are all (at least AFAIK) physical devices and shouldn't be able to run tests with the lock screen up?
And a related question for maruel@: is there a good way to record when is_locked() returns true so we can test if there are any instances of tests running successfully on locked devices before rolling this out to all Mac devices on chromium-swarm?
Yes, you can add it to the state: add the value in get_state() in bot_config. Sadly, it will not be snapshotted in the task's result; that's issue 850560.
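
A rough sketch of recording the value in the bot state. The helper add_lockscreen_state is hypothetical and would be called from get_state() in bot_config.py; the "lockscreen" key matches the one queried later in this thread, and is_locked is passed in as a callable so the sketch does not depend on the real osx.py module.

```python
def add_lockscreen_state(state, is_locked):
    """Returns a copy of `state` with the lock screen status recorded.

    `is_locked` is a callable returning True, False, or None (unknown).
    """
    updated = dict(state)
    updated["lockscreen"] = is_locked()
    return updated
```

Recording the value without acting on it makes it possible to check whether any locked bot still runs tests successfully before quarantining is enabled.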
Project Member

Comment 6 by bugdroid1@chromium.org, Sep 19

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/317abcef7dd2425699c50270caa0fbbf9f207d57

commit 317abcef7dd2425699c50270caa0fbbf9f207d57
Author: bsheedy <bsheedy@chromium.org>
Date: Wed Sep 19 00:00:55 2018

Add OSX lockscreen detection

Adds is_locked to the OSX-specific swarming API, which returns whether
the lock screen is detected or not.

Bug:  885337 
Change-Id: I904c6ee1cea1a0be66ba438497406f78ef961d15
Reviewed-on: https://chromium-review.googlesource.com/1232396
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>
Commit-Queue: Brian Sheedy <bsheedy@chromium.org>

[modify] https://crrev.com/317abcef7dd2425699c50270caa0fbbf9f207d57/appengine/swarming/swarming_bot/api/platforms/osx.py

Project Member

Comment 7 by bugdroid1@chromium.org, Sep 19

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/4a29d85be8308b84ed5f41268b07935516ef58be

commit 4a29d85be8308b84ed5f41268b07935516ef58be
Author: Brian Sheedy <bsheedy@google.com>
Date: Wed Sep 19 19:47:52 2018

Brian: yes, the machines in the Chrome-GPU pool are good candidates. Those are basically all of the Mac laptops which have been occasionally problematic. At some point we would like to run this detection code on the Mac Minis, too, and those are in the Chrome pool.

Owner: bsheedy@chromium.org
Status: Started (was: Untriaged)
Blocking: 888081
Project Member

Comment 11 by bugdroid1@chromium.org, Sep 21

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/086e43efd1bb7fc5a6e8e5cbf6b010c71def85b2

commit 086e43efd1bb7fc5a6e8e5cbf6b010c71def85b2
Author: Brian Sheedy <bsheedy@google.com>
Date: Fri Sep 21 19:48:34 2018

Thanks for implementing the auto-quarantining, Brian!

Please watch the dashboards at:
https://chromium.googlesource.com/chromium/src/+/master/docs/gpu/pixel_wrangling.md#Fleet-Status

and let's make sure that things are stable -- i.e. that we don't see a bunch of bots mysteriously auto-quarantining themselves.

I don't think I have access to the dashboards? When I load any of them, I get a mostly empty page with "Executors by status: Query completed successfully, but had no results.".

In the meantime, I'll just keep an eye out for any quarantined Mac bots on Chrome-GPU https://chromium-swarm.appspot.com/botlist?c=id&c=os&c=task&c=status&f=pool%3AChrome-GPU&f=os%3AMac&f=status%3Aquarantined&l=1000&s=id%3Aasc
Blockedon: 888119
Argh, thanks for pointing that out. I should have tested the graphs myself. This is a recent regression and I've just filed a blocking bug about it.

Per the other bug this was my mistake for forgetting to update a couple of the Mac dashboards after a recent OS upgrade. Fix is incoming.

Project Member

Comment 17 by bugdroid1@chromium.org, Sep 21

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/cdf07c8ab33218ffad49cab92ce49bc86a2f8f8e

commit cdf07c8ab33218ffad49cab92ce49bc86a2f8f8e
Author: Kenneth Russell <kbr@chromium.org>
Date: Fri Sep 21 21:12:54 2018

Update links to Mac GPU bot dashboards.

The bots have been upgraded to 10.13.6 and the graphs had to be
updated as well.

Bug:  885337 
Tbr: zmo@chromium.org
No-Try: True
Change-Id: I676817f3595fee7667c521437bdb907299c08bab
Reviewed-on: https://chromium-review.googlesource.com/1239413
Commit-Queue: Kenneth Russell <kbr@chromium.org>
Reviewed-by: Kenneth Russell <kbr@chromium.org>
Cr-Commit-Position: refs/heads/master@{#593332}
[modify] https://crrev.com/cdf07c8ab33218ffad49cab92ce49bc86a2f8f8e/docs/gpu/pixel_wrangling.md

Labels: -Pri-1 Pri-2
I haven't seen any mass-quarantining yet on the Chrome-GPU pool, and from 888081, it looks like the quarantining is working properly when a machine gets into this state. Next step is to figure out when this logic needs to be applied to the Chrome pool since there are a few devices in this state that aren't failing tests. I think that can be a lower priority though since AFAIK the problematic machines were in Chrome-GPU.
Fantastic! Thank you Brian for getting things to this state!

It would be fantastic if we could figure out the situations where the other machines have their lock screens up. If it's safe to require in Chrome's infra that the lock screens be disabled on Macs for tests to run, and we can auto-quarantine if not, then we can have a nice uniform configuration and close this out.

I can take a look sometime within the next few days when I have some free time. Since it looks like only the GPU tests have issues with the lockscreen and (AFAIK) the only machines that GPU tests run on in the Chrome pool are the Mac Minis, we might be able to just apply the quarantining to the Mac Minis.
Alright, so I took a quick look at the current bots with locked screens using the following (slightly modified from maruel@'s original command):

tools/swarming_client/swarming.py query -S chromium-swarm.appspot.com --json a.json --limit 0 'bots/list?dimensions=os:Mac'

python -c 'import json;d=json.load(open("a.json"))["items"];e={b["bot_id"]:json.loads(b["state"])for b in d}; print "\n".join("%s: %s" % (id, state.get("lockscreen")) for id,state in sorted(e.iteritems()) if state.get("lockscreen") == True)'
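
For anyone rerunning this under Python 3, a more readable sketch of the same filter. It assumes a.json has the shape used by the one-liner above: an "items" list whose entries carry "bot_id" and a JSON-encoded "state" string. The function name locked_bots is hypothetical.

```python
import json


def locked_bots(items):
    """Returns sorted bot IDs whose reported state has lockscreen set to true."""
    locked = []
    for bot in items:
        state = json.loads(bot["state"])  # "state" is a JSON-encoded string.
        if state.get("lockscreen") is True:
            locked.append(bot["bot_id"])
    return sorted(locked)


# Usage, given the a.json written by the swarming.py query above:
#   with open("a.json") as f:
#       print("\n".join(locked_bots(json.load(f)["items"])))
```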

That only found 4 bots with locked screens, three of which were the ones originally found when lock screen detection was added, and the last one being one in the Chrome-GPU pool that's been quarantined:

build27-b4: MacBook Pro 11.2, unassigned pool
build5-b1: MacBook Pro 11.2, unassigned pool
build768-m4: MacMini 7.1, Chrome pool
build397-m4: MacBook Pro 11.5, Chrome-GPU pool

Given that, I think it's fairly safe to assume that stuff in the Chrome pool doesn't get into a locked state often. I think it'd be safe to enable quarantining for Mac devices in the Chrome pool, or if we want to be really safe, just Mac Minis.
Great work identifying these machines Brian. 

https://chromium-swarm.appspot.com/bot?id=build5-b1&sort_stats=total%3Adesc
https://chromium-swarm.appspot.com/bot?id=build27-b4&sort_stats=total%3Adesc

Looks like these have run Perf and Pinpoint jobs in the past, but have been idle since February and March 2018, respectively.

If these plus build768-m4 are the only un-quarantined Macs with a lock screen up, then I think proceeding with auto-quarantining for all Macs with a lock screen up sounds great!

https://chrome-internal-review.googlesource.com/c/infradata/config/+/690950 quarantines any locked devices in Chrome or Chrome-GPU - I can change that to be any locked device regardless of pool if that's desirable. Considering non-GPU tests don't seem to have issues with locked devices, not sure if there's much benefit to applying the logic to the entire swarming server.
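
As a rough illustration of the pool-restricted rule (not the contents of the internal CL), a sketch that decides whether a bot should be quarantined. The names QUARANTINED_POOLS and quarantine_reason and the message text are hypothetical; it assumes the caller would place a non-None result under a "quarantined" key in the bot state.

```python
# Pools in which a locked screen should take the bot out of service.
QUARANTINED_POOLS = frozenset(["Chrome", "Chrome-GPU"])


def quarantine_reason(pools, locked):
    """Returns a quarantine message, or None if the bot should stay in service.

    `pools` is the bot's list of pool names; `locked` is True, False, or None.
    """
    if locked and QUARANTINED_POOLS.intersection(pools):
        return "Lock screen detected"
    return None
```

Applying the rule to every pool would only require dropping the pool check, so the restriction is easy to lift later.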
For the sake of simplifying the number of configurations we support, I think it would be better in the long run to say that in order to successfully deploy a Mac (including a VM) into Swarming, the lock screen has to be disabled.

Project Member

Comment 25 by bugdroid1@chromium.org, Oct 4

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/f18910c956967c1117c52a51c48c33e3a3046ca4

commit f18910c956967c1117c52a51c48c33e3a3046ca4
Author: Brian Sheedy <bsheedy@google.com>
Date: Thu Oct 04 01:05:14 2018

Comment 26 Deleted

Can this be considered fixed at this point?

I believe so, although do we want to remove the fail-on-lockscreen logic we initially added from the GPU test runner script before closing since it shouldn't be necessary any longer?
I think it's fine to leave it in. It should never be triggered, but could be a useful canary if something goes wrong.

Status: Fixed (was: Started)
Thank you Brian for adding this monitoring! It's categorically eliminated a significant cause of flakiness on our team's physical hardware.

Blocking: 922237
