
Issue 843548

Starred by 3 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug




ChOps Mutex does not work well with multibots

Project Member Reported by serg...@chromium.org, May 16 2018

Issue description

https://chromium-swarm.appspot.com/botlist?c=id&c=os&c=task&c=status&f=beefy_baremetal%3A0&f=multibot%3A1&f=os%3AUbuntu-14.04&f=pool%3Aluci.v8.try&l=100&s=id%3Aasc (screenshot attached)

Most of the virtual bots (82 out of 90) are unable to schedule tasks: they are in maintenance state, waiting on a small number of tasks (6) that are blocking puppet from running. This happens every 30 minutes and significantly increases pending times for the V8 try pool.

I'm not sure what the best solution for this is, but I am going to disable mmutex on our multibots for now.
 
Attachment: GG7SzfJswOB.png (337 KB)
Components: -Infra>Platform Infra>Platform>Swarming
CL disabling mmutex on V8's multibots landed: https://crrev.com/i/626967 (I specified the wrong bug number on the CL).

Comment 3 by mar...@chromium.org, May 21 2018

Cc: -charliea@chromium.org
Labels: Type-Bug
Owner: charliea@chromium.org
Status: Assigned (was: Untriaged)
Charlie, what do you think about never running _acquire_maintenance_mutex() in on_before_poll() if in_docker() is True?
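A minimal sketch of that guard in plain Python (the mutex acquisition is injected as a callable and `is_docker` stands in for the real `in_docker()` check, purely so the example is self-contained):

```python
def on_before_poll(acquire_mutex, is_docker):
    """Sketch of the proposed guard: never take the ChOps maintenance
    mutex when the bot itself runs inside a Docker container, since
    puppet runs on the host and a containerized task cannot interfere
    with it. Returns True iff the mutex was taken."""
    if is_docker:
        # Skip the mutex: the bot keeps accepting tasks while puppet
        # syncs on the host.
        return False
    acquire_mutex()
    return True
```

The point is that only the host-level bot would ever contend with puppet for the mutex; containerized bots would poll straight through.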
Cc: bpastene@chromium.org
I defer to bpastene@ here, who knows more than I do about how the maintenance mutex should work on this specific configuration due to his work on bug 808060.
So long as we have puppet running on these hosts, puppet & swarming should honor the mutex (especially if puppet is what deploys docker). That all the bots on a single host were drained during a puppet run is WAI.

BTW, unless you manually remove the mmutex binary, the change in #2 won't have any effect. Puppet isn't smart enough to remove packages, so it'll just stick around on the host.
> So long as we have puppet running on these hosts, puppet & swarming should honor the mutex (especially if puppet is what deploys docker). That all the bots on a single host were drained during a puppet run is WAI.

It is WAI, but it's suboptimal because we get too many bots offline for too long. Given that puppet tries to acquire the mutex every 30 minutes and our LUCI builds run for up to 40 minutes, the bots are in maintenance mode most of the time. Reducing the frequency with which puppet runs on the multibots will mitigate this issue to an extent. A more advanced solution is to allow tasks to run in maintenance mode if their expected completion time is earlier than the latest expected completion time of the already-running tasks. Estimating the expected completion time may be non-trivial, though, and would require analyzing the history of similar tasks.
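The completion-time heuristic could look roughly like this (a sketch with made-up names; real end-time estimates would have to come from the history of similar tasks):

```python
def can_run_during_maintenance(candidate_est_end, running_est_ends):
    """Accept a new task while draining for maintenance only if it is
    expected to finish no later than the latest already-running task,
    so it cannot push back the moment the host is fully drained.
    All times are epoch seconds (hypothetical representation)."""
    if not running_est_ends:
        # Nothing is running: the host can drain immediately, so do
        # not start anything new.
        return False
    return candidate_est_end <= max(running_est_ends)
```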

Another issue is that all 3 machines that we have in the pool can be in maintenance mode at the same time, taking us from 90 virtual swarm bots down to 0 and causing a significant increase in CQ latency. Again, reducing the frequency with which puppet runs will lessen the impact of this issue, as the likelihood of all machines being in maintenance mode at once will drop. An alternative (and harder to implement) solution is to coordinate between bots to ensure that at most X machines (expressed as a number or a percentage of the machines in the pool) are in maintenance mode at the same time.
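The coordination idea could be sketched as a pool-level admission check (hypothetical policy and names; a real scheme would need a shared service or lock to make the decision atomic across hosts):

```python
def may_enter_maintenance(host, draining_hosts, pool_size, max_down=1):
    """Allow a host to start draining only while fewer than `max_down`
    hosts in the pool are already in maintenance, so the pool never
    drops to zero usable bots. `max_down` could also be derived from a
    percentage of `pool_size`."""
    if host in draining_hosts:
        # Already draining: never block a host mid-drain.
        return True
    return len(draining_hosts) < max_down and pool_size > max_down
```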

So it seems to me the easy solution here is to reduce the puppet run frequency to something like every 6-12 hours and see if this helps. A more advanced solution might allow running puppet more frequently, but would also require a lot of engineering effort.

> BTW, unless you manually remove the mmutex binary, the change in #2 won't have any effect. Puppet isn't smart enough to remove packages, so it'll just stick around on the host.

Yes, I learned that the hard way when alerts fired after landing it :-). I simply re-spawned the instances via the ccompute script, but I guess removing the binaries manually would also have worked.
Owner: serg...@chromium.org
Assigning this to Sergiy to drive the conversation with bpastene@ and reach a reasonable solution
Owner: iannucci@chromium.org
IIUC, the main reason the ChOps team wanted mmutex everywhere is to prevent tasks from rebooting the host or locking files while puppet is running. The V8 team only needed mmutex on perf bots to reduce measurement noise, so from our PoV not running mmutex on multibots is fine: we do not automatically reboot them and mostly work with files in the start dir. Given that Robbie started this project, I'll assign it to him to decide whether this is a viable long-term solution.
Yes, I agree; we should run puppet much less often on these machines (especially since we're isolating the potentially system-altering tasks into containers). 1/12h or 1/8h seems reasonable to me (or maybe even 1/24h). If there's something urgent, we can always trigger puppet runs manually.

The reason we'd still want to USE the mutex is because puppet can do stuff like 'upgrade docker' (which could e.g. immediately kill all the containers). It's cleaner to not use the machine for chops stuff and user tasks simultaneously (but obviously not if it takes everything offline).
Components: -Infra>Platform>Swarming Infra>Platform>Swarming>Admin
Cc: friedman@chromium.org
So I am looking into reducing the frequency of puppet runs on the android_docker and swarm_docker machines.

What is not clear to me is how we can change the frequency. Currently the rule running puppet on a schedule looks like this:

  cron { "puppet":
    environment => "MAILTO=chrome-puppet-alerts@google.com",
    command     => "/bin/bash /usr/local/bin/run_puppet.sh 2>&1",
    user        => "root",
    hour        => "*",
    #minute      => [$cron_minute1, $cron_minute2],
    minute      => $cron_minute,
    require     => File["${::puppet_conf_dir}/puppet.conf"],
  }

Elliott, do you think it's possible to apply an additional rule to a subset of machines that overrides just the minute and hour parameters of the cron job but keeps the rest intact? In other words, would the following work?

  nodes.yaml
  ==========
  - nodes: '*'
    classes:
      ...
      chrome_infra::puppet: {}  # includes the rule above
      ...

  - nodes:
      - swarm-docker-*-c*.*.chromecompute.google.com.internal
    classes:
      chrome_infra::puppet_daily: {}

  puppetm/etc/puppet/modules/chrome_infra/manifests/puppet_daily.pp
  =================================================================
  class chrome_infra::puppet_daily {
    # Resource collector override: re-declaring cron { "puppet": ... }
    # here would be a duplicate declaration, so amend the existing one.
    Cron <| title == 'puppet' |> {
      hour   => fqdn_mod_by(24),
      minute => fqdn_mod_by(60),
    }
  }
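For reference, a helper like `fqdn_mod_by()` (assumed here; the real puppet function may compute this differently) would just hash the host's FQDN to a stable slot, so daily runs are staggered across machines instead of draining every multibot host at the same time. In Python terms:

```python
import hashlib

def fqdn_mod_by(fqdn, modulus):
    """Map a hostname to a stable value in [0, modulus), so that e.g.
    hour => fqdn_mod_by(24) spreads daily puppet runs across the day.
    Hypothetical sketch of the helper, not the real implementation."""
    digest = hashlib.sha256(fqdn.encode("utf-8")).hexdigest()
    return int(digest, 16) % modulus
```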
If puppet didn't manage docker on these hosts, would this still be an issue?
Yes, it would still be an issue, because it's not about managing docker, but about ChOps Mutex that prevents tasks from running each time puppet needs to sync. Completely disabling puppet would help, but that's bad for many other reasons. We just need to make sure that puppet does not run as frequently.
Owner: iannu...@google.com
