
Issue 854352


Issue metadata

Status: Fixed
Owner:
Closed: Jun 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




Two bot processes running for the same bot_id

Project Member Reported by xixuan@chromium.org, Jun 19 2018

Issue description

Owner: pprabhu@chromium.org
Hypothesis: there are two processes of the same bot id running with different working directories. Pprabhu to add working directory and pid to bot dimensions to allow this to be determined from completed tasks.
Cc: mar...@chromium.org
 Issue 851681  has been merged into this issue.
 Issue 854311  has been merged into this issue.
Summary: Two bot processes running for the same bot_id (was: A bot runs different tasks at the same time)
bot_manager determines the running bots using `lsof -p $BOT_CWD/swarming.lck`

Turns out that it can race with the bot's self-update mechanism. During self-update, the bot:
- downloads the new bot package
- releases swarming.lck
- forks, to run from the new bot package
- acquires swarming.lck

Swarming bot manager checks swarming.lck in the interim and decides that the bot is dead.
As a result, it ends up finding no bot for the relevant dut id, and starts a new bot (from a new cwd).
There are a few approaches we can follow here:

# Hack:  time.sleep()
- whenever the bot manager finds a dut_id for which no bot exists
  - it starts an internal timeout for (say) 10 seconds
  - if the dut_id still doesn't have a bot after 10 seconds, a new bot is started.

This ensures that any glitches in swarming.lck acquisition are smoothed over.
The big drawback of this approach is that it is a time.sleep() hack. If the server is overloaded for any reason and the bot restart takes long, we're back in the bad case. We won't realize automatically that two bot processes are running for the same bot_id, and there is no automatic recovery.
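For reference, the grace-period idea above could be sketched roughly as follows. This is only an illustration, not the bot manager's actual code: `MissingBotTracker`, `should_start_bot` and the injectable `clock` are hypothetical names, and the 10-second value is the "(say) 10 seconds" from the description.

```python
import time

GRACE_SECONDS = 10.0  # assumed value, taken from the "(say) 10 seconds" above


class MissingBotTracker:
    """Hypothetical helper: remembers when each dut_id was first seen without a bot."""

    def __init__(self, grace_seconds=GRACE_SECONDS, clock=time.monotonic):
        self._grace = grace_seconds
        self._clock = clock
        self._first_missing = {}  # dut_id -> time the bot was first seen missing

    def should_start_bot(self, dut_id, bot_is_running):
        """Return True only once the bot has been missing for the whole grace period."""
        if bot_is_running:
            # The bot is back (e.g. self-update finished); forget the earlier miss.
            self._first_missing.pop(dut_id, None)
            return False
        now = self._clock()
        first = self._first_missing.setdefault(dut_id, now)
        return now - first >= self._grace
```

The bot manager would call `should_start_bot()` on every polling pass instead of sleeping, so one slow dut_id does not block the others. It has the same drawback as described above: a restart that takes longer than the grace period still ends in a duplicate bot.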
Another approach is to use `lsof -p` on the swarming_bot.1.zip and swarming_bot.2.zip files together, instead of swarming.lck. One of these is guaranteed to be opened by the bot process, even during self-update.

If doing it this way, we must do a back and forth dance:
- check swarming_bot.1.zip
- check swarming_bot.2.zip
- check swarming_bot.1.zip again.

This ensures that we do not race against the bot moving from using swarming_bot.2.zip to swarming_bot.1.zip.
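A minimal sketch of the back-and-forth check, for illustration only. `is_open` is a hypothetical callable that answers "does some process currently have this file open?" (for example by wrapping lsof); the zip file names are the ones that show up in the process listing in this thread.

```python
import os


def bot_is_running(is_open, bot_dir):
    """Check zip 1, then zip 2, then zip 1 again.

    `is_open(path) -> bool` is a hypothetical callable supplied by the caller.
    The final re-check of zip 1 covers a bot that moved from zip 2 to zip 1
    between the first two checks.
    """
    zip1 = os.path.join(bot_dir, "swarming_bot.1.zip")
    zip2 = os.path.join(bot_dir, "swarming_bot.2.zip")
    # `or` short-circuits left to right, so the checks run in the order 1, 2, 1.
    return is_open(zip1) or is_open(zip2) or is_open(zip1)
```

This covers a single move between the two files during one pass. It does not cover a bot that switches twice within the same pass.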
Cc: bpastene@chromium.org
+bpastene because I think he's also using swarming.lck to list bots in a different context.
Owner: xixuan@chromium.org
Actually, I'd like xixuan@ to poke around in the swarming_bot_manager code so that I'm not the only one with context.
Xixuan, can you implement the idea in #8 for this bug?
Status: Started (was: Assigned)
Re #5, why does the bot code do that? It should be possible to hold a file descriptor/lock open across forks/execs (if you're careful). Can we get a reliable mechanism for determining if a bot is running? If I understand correctly, #8 is still theoretically vulnerable to races: with extremely bad timing (system under load, or a wonky process scheduler), we could catch the bot moving from 1 to 2, then back from 2 to 1.
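To illustrate the point about holding a lock across fork/exec, here is a rough sketch assuming a POSIX system and an flock()-style lock. It is not the swarming bot's code, and this thread does not say how swarming.lck is actually locked. `acquire_lock` and `spawn_holding_lock` are made-up names. An flock() lock belongs to the open file description, so a child that inherits the descriptor keeps the lock alive after the parent closes its own copy.

```python
import fcntl
import os
import subprocess
import sys


def acquire_lock(path):
    """Open `path` and take an exclusive, non-blocking flock() on it.

    Returns the file descriptor. Raises BlockingIOError if the lock is
    already held through another open file description.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        raise
    return fd


def spawn_holding_lock(argv, lock_fd):
    """Start `argv` with `lock_fd` inherited, then close the parent's copy.

    The child inherits the descriptor before the parent closes it, so there
    is no moment at which the lock is free.
    """
    child = subprocess.Popen(argv, pass_fds=(lock_fd,))
    os.close(lock_fd)
    return child
```

With something along these lines, an observer would never see the lock file unowned during the handoff.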
There's something I can't confirm.

At 6.20, when I checked, I could see processes running like "/usr/bin/python /run/skylab_swarming/d8676/swarming_bot.2.zip start_bot"

But at 6.21, when I checked, they were gone. Instead, the command had become '/usr/bin/python swarming_bot.1.zip start_bot', without a specified working directory.


The process list looks like:

chromeo+  10990  0.0  0.0   9552  2636 ?        S    17:41   0:00 /bin/bash /usr/local/google/home/chromeos-test/chromiumos/chromeos-admin/venv/skylab_swarming_bot/start_swarming_bot.sh -u http://chrome-swarming.appspot.com -b chromeos-skylab-bot-af6e8833-d65c-4ecc-b033-861684d882b5 -w /var/run/skylab_swarming/d8662
chromeo+  10992  0.0  0.0   9552  2648 ?        S    17:41   0:00 /bin/bash /usr/local/google/home/chromeos-test/chromiumos/chromeos-admin/venv/skylab_swarming_bot/start_swarming_bot.sh -u http://chrome-swarming.appspot.com -b chromeos-skylab-bot-8632ec67-a236-4e5e-a79f-dfd3acd59d0b -w /var/run/skylab_swarming/d9829
chromeo+  10996  0.0  0.0   9552  2532 ?        S    17:41   0:00 /bin/bash /usr/local/google/home/chromeos-test/chromiumos/chromeos-admin/venv/skylab_swarming_bot/start_swarming_bot.sh -u http://chrome-swarming.appspot.com -b chromeos-skylab-bot-43282944-7d98-490b-8700-b9db2b4438e9 -w /var/run/skylab_swarming/d9970
chromeo+  11005  0.0  0.0   9552  2636 ?        S    17:41   0:00 /bin/bash /usr/local/google/home/chromeos-test/chromiumos/chromeos-admin/venv/skylab_swarming_bot/start_swarming_bot.sh -u http://chrome-swarming.appspot.com -b chromeos-skylab-bot-9818f78a-15d3-4c64-9123-b7b4a3fdb3b0 -w /var/run/skylab_swarming/d4353
chromeo+  11017  0.0  0.0   9552  2628 ?        S    17:41   0:00 /bin/bash /usr/local/google/home/chromeos-test/chromiumos/chromeos-admin/venv/skylab_swarming_bot/start_swarming_bot.sh -u http://chrome-swarming.appspot.com -b chromeos-skylab-bot-46c6c402-c6cb-4b7c-a96d-40f548f364ba -w /var/run/skylab_swarming/d9162
chromeo+  11021  0.0  0.0   9552  2684 ?        S    17:41   0:00 /bin/bash /usr/local/google/home/chromeos-test/chromiumos/chromeos-admin/venv/skylab_swarming_bot/start_swarming_bot.sh -u http://chrome-swarming.appspot.com -b chromeos-skylab-bot-64be8a4a-f695-45a8-b0d0-52efa76df665 -w /var/run/skylab_swarming/d6140
chromeo+  11916  0.2  0.1 182040 35772 ?        Sl   17:41   0:05 /usr/bin/python swarming_bot.1.zip start_bot
chromeo+  12002  0.2  0.1 182104 35748 ?        Sl   17:41   0:05 /usr/bin/python swarming_bot.1.zip start_bot
chromeo+  12010  0.2  0.1 182032 35716 ?        Sl   17:41   0:05 /usr/bin/python swarming_bot.1.zip start_bot
chromeo+  12046  0.2  0.1 182044 35804 ?        Sl   17:41   0:05 /usr/bin/python swarming_bot.1.zip start_bot
chromeo+  12087  0.2  0.1 182040 35808 ?        Sl   17:41   0:05 /usr/bin/python swarming_bot.1.zip start_bot
chromeo+  12147  0.2  0.1 182044 35680 ?        Sl   17:41   0:05 /usr/bin/python swarming_bot.1.zip start_bot
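One way an observer could cope with the relative form, sketched under the assumption that it can also find the bot process's working directory (on Linux, for instance, by reading the `/proc/<pid>/cwd` symlink). `absolute_cmd_path` is a hypothetical helper, not part of any existing tool.

```python
import os


def absolute_cmd_path(arg, cwd):
    """Resolve a path from a process command line against that process's cwd.

    Absolute arguments are only normalized; relative ones are joined to `cwd`.
    """
    if os.path.isabs(arg):
        return os.path.normpath(arg)
    return os.path.normpath(os.path.join(cwd, arg))
```

For example, `absolute_cmd_path("swarming_bot.1.zip", "/var/run/skylab_swarming/d8662")` gives `/var/run/skylab_swarming/d8662/swarming_bot.1.zip`, which can then be matched against the expected bot directory.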
https://chromium.googlesource.com/infra/luci/luci-py/+/master/appengine/swarming/doc/Detailed-Design.md#self-updating might be of some help for reference.

I wasn't aware of the race-condition in #5, that's an interesting find. I don't think I've ever come across it. In our case, I think we'd end up shutting down the container if we can't figure out the pid, and a few minutes later start it back up:
https://chromium.googlesource.com/infra/infra/+/master/infra/services/swarm_docker/containers.py#315
That would be pretty much invisible to us, since we wouldn't be interrupting any task, just delaying the update by a few minutes.

I'll be on the lookout for it though.
Labels: Type-Bug
The more I think about it, the more I think it'd be simpler to just fix issue 855022 instead of working around it.
Project Member

Comment 17 by bugdroid1@chromium.org, Jun 22 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/a1606fe5bbd441860c7c5e9e9bf2380e04c9b768

commit a1606fe5bbd441860c7c5e9e9bf2380e04c9b768
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Fri Jun 22 17:26:27 2018

[swarming] make bot always starts with absolute path

In the initial self-replication, the bot would start itself with a relative
path. But in self-update, the bot would start itself with an absolute path. Some
team are using observers to look at open file path, and the relative vs absolute
confused their tools.

This incidentally replaces a few copy pasted constants with named variables.

R=qyearsley@chromium.org

Bug:  854352 
Change-Id: Ib24fee8ecce6717af1e1dab2d43f7452f7ce8239
Reviewed-on: https://chromium-review.googlesource.com/1112039
Reviewed-by: Xixuan Wu <xixuan@chromium.org>
Reviewed-by: Quinten Yearsley <qyearsley@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/a1606fe5bbd441860c7c5e9e9bf2380e04c9b768/appengine/swarming/swarming_bot/__main__.py

Project Member

Comment 18 by bugdroid1@chromium.org, Jun 25 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/14a587914e658a7a22fb56fe29347d70ee4153fc

commit 14a587914e658a7a22fb56fe29347d70ee4153fc
Author: Xixuan Wu <xixuan@chromium.org>
Date: Mon Jun 25 23:00:27 2018

Project Member

Comment 19 by bugdroid1@chromium.org, Jun 29 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/4afbe6a4ef43efcbf2e4518ff3c388b06c82800d

commit 4afbe6a4ef43efcbf2e4518ff3c388b06c82800d
Author: Xixuan Wu <xixuan@chromium.org>
Date: Fri Jun 29 00:00:01 2018

Status: Fixed (was: Started)
