Windows Scheduler jobs run with "Below normal" priority by default. This is problematic for the luci_machine_tokend job, which wakes up every 5 minutes and refreshes the machine's authentication token if necessary.
If there's heavy CPU/IO activity in the foreground (such as compilation), luci_machine_tokend can be stalled for a long time, so that the authentication token eventually expires and the Swarming bot dies.
This has been observed in https://bugs.chromium.org/p/chromium/issues/detail?id=856894
Manually bumping the job's priority to "Normal" fixed the issue.
Unfortunately, it looks like the only way to change the priority in Windows Scheduler is to import the task definition from an XML file, so we'll need to change how we deploy luci_machine_tokend jobs :(
(The exact same issue also affects Puppet and everything else we run through Windows Scheduler, but I believe luci_machine_tokend is the only time-sensitive job.)
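A rough sketch of what the XML-based approach could look like, not the actual deployment code. It takes a task definition that was already exported as XML and rewrites its <Priority> setting. The helper name set_task_priority is hypothetical. The value 4 relies on Task Scheduler mapping priorities 4-6 to the "Normal" class, with 7 being the "Below normal" default.

```python
import xml.etree.ElementTree as ET

# Namespace used by Task Scheduler task definition XML.
TASK_NS = "http://schemas.microsoft.com/windows/2004/02/mit/task"

# Task Scheduler priorities 4-6 map to the "Normal" priority class;
# 7 (the default) maps to "Below normal".
NORMAL_PRIORITY = 4


def set_task_priority(task_xml: str, priority: int = NORMAL_PRIORITY) -> str:
    """Return task_xml with Settings/Priority set to the given value.

    Hypothetical helper for illustration only.
    """
    # Keep the task namespace as the default one when serializing.
    ET.register_namespace("", TASK_NS)
    root = ET.fromstring(task_xml)

    settings = root.find(f"{{{TASK_NS}}}Settings")
    if settings is None:
        raise ValueError("task definition has no <Settings> element")

    prio = settings.find(f"{{{TASK_NS}}}Priority")
    if prio is None:
        # Assumption: exported tasks normally already contain <Priority>;
        # if not, it is appended here without validating element order.
        prio = ET.SubElement(settings, f"{{{TASK_NS}}}Priority")
    prio.text = str(priority)

    return ET.tostring(root, encoding="unicode")
```

The input XML could come from "schtasks /Query /TN <task name> /XML", and the modified file could then be re-imported with "schtasks /Create /TN <task name> /XML <file> /F". Whether this fits into how luci_machine_tokend jobs are currently deployed is an open question.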
Comment 1 by s...@google.com, Jul 10
Status: Available (was: Untriaged)