Two bot processes running for the same bot_id
Issue description
Not sure whether this is the cause of the 'Client job got aborted.' failure.
Failed tasks:
https://chrome-swarming.appspot.com/task?id=3e322a3cd05c1a10&refresh=10
https://chrome-swarming.appspot.com/task?id=3e322a7a79630510&refresh=10
https://chrome-swarming.appspot.com/task?id=3e322a7dc2d71310&refresh=10
It might be related to crbug.com/851681.
,
Jun 20 2018
Hypothesis: there are two processes of the same bot id running with different working directories. Pprabhu to add working directory and pid to bot dimensions to allow this to be determined from completed tasks.
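A minimal sketch of what reporting the working directory and pid could look like. The dimension names `bot_cwd` and `bot_pid` are hypothetical, and the sketch assumes dimension values are lists of strings; how the result gets merged into the bot's real dimensions is not shown here.

```python
import os


def process_identity_dimensions():
    """Returns extra dimensions that identify this particular bot process.

    The keys are made-up names for illustration. Values are lists of strings,
    which is an assumption about the expected dimension format.
    """
    return {
        "bot_cwd": [os.getcwd()],
        "bot_pid": [str(os.getpid())],
    }
```

With these two values attached to every completed task, two tasks claiming the same bot_id but showing different cwd or pid values would confirm the hypothesis.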
,
Jun 20 2018
,
Jun 20 2018
Issue 854311 has been merged into this issue.
,
Jun 20 2018
bot_manager determines the running bots using `lsof -p $BOT_CWD/swarming.lck`. It turns out that this can race with the bot's self-update mechanism. During self-update, the bot:
- downloads the new bot package
- releases swarming.lck
- forks, to run from the new bot package
- acquires swarming.lck
If the swarming bot manager checks swarming.lck in the interim, it decides that the bot is dead. As a result, it finds no bot for the relevant dut id and starts a new bot (from a new cwd).
,
Jun 20 2018
Relevant self-update bit in bot code: https://cs.chromium.org/chromium/infra/luci/appengine/swarming/swarming_bot/bot_code/bot_main.py?l=1267
,
Jun 20 2018
There are a few approaches we can follow here:
# Hack: time.sleep()
- whenever the bot manager finds a dut_id for which no bot exists, it starts an internal timeout of (say) 10 seconds
- if the dut_id still doesn't have a bot after 10 seconds, a new bot is started.
This ensures that any glitches in swarming.lck acquisition are smoothed over. The big drawback of this approach is that it is a time.sleep() hack. If the server is overloaded for any reason and the bot restart takes long, we're back in the bad case. We won't realize automatically that two bot processes are running for the same bot_id, and there is no automatic recovery.
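A rough sketch of the grace-period idea, written as a small tracker the manager could consult on each polling pass rather than a literal time.sleep(). The class name, the injectable clock, and the 10-second default are all assumptions for illustration, not the actual bot manager code.

```python
import time

GRACE_SECONDS = 10.0  # the "(say) 10 seconds" from the proposal; not a tuned value


class MissingBotTracker:
    """Remembers when each dut_id was first seen without a bot (hypothetical helper)."""

    def __init__(self, grace_seconds=GRACE_SECONDS, clock=time.monotonic):
        self._grace = grace_seconds
        self._clock = clock
        self._first_missing = {}  # dut_id -> time the bot was first found missing

    def should_start(self, dut_id, bot_found):
        """Returns True only once the bot has been missing for the full grace period."""
        if bot_found:
            # The bot (re)appeared, e.g. it finished self-update: forget the glitch.
            self._first_missing.pop(dut_id, None)
            return False
        now = self._clock()
        first = self._first_missing.setdefault(dut_id, now)
        return now - first >= self._grace
```

The caller would start a new bot only when should_start() returns True. As noted above, this only narrows the window: a self-update slower than the grace period still produces a duplicate bot.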
,
Jun 20 2018
Another approach is to use `lsof -p` on the swarming.zip.1 and swarming.zip.2 files together, instead of swarming.lck. One of these is guaranteed to be opened by the bot process, even during self-update. If doing it this way, we must do a back and forth dance:
- check swarming.1
- check swarming.2
- check swarming.1 again.
This ensures that we do not race against the bot moving from using swarming.2 to swarming.1.
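The back and forth check could be sketched as below. The file names `swarming_bot.1.zip` and `swarming_bot.2.zip` are taken from the process listings later in this thread and are an assumption about what the shorthand above refers to. The `lsof -t <file>` call in the default probe is also an assumption about how the manager would query open files; the probe is injectable so the ordering logic can be exercised without lsof.

```python
import os
import subprocess


def is_open_by_any_process(path):
    """Returns True if `lsof -t path` reports at least one PID holding path open."""
    result = subprocess.run(
        ["lsof", "-t", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0 and bool(result.stdout.strip())


def bot_is_running(bot_cwd, is_open=is_open_by_any_process):
    """Checks zip 1, then zip 2, then zip 1 again, stopping at the first hit."""
    one = os.path.join(bot_cwd, "swarming_bot.1.zip")
    two = os.path.join(bot_cwd, "swarming_bot.2.zip")
    # The third probe covers a bot that moved from zip 2 to zip 1 between
    # the first two probes.
    return any(is_open(path) for path in (one, two, one))
```

This covers a single switch between the two checks. It does not protect against two switches in quick succession.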
,
Jun 20 2018
+bpastene because I think he's also using swarming.lck to list bots in a different context.
,
Jun 20 2018
Actually, I'd like xixuan@ to poke around in the swarming_bot_manager code so that I'm not the only one with context. Xixuan, can you implement the idea in #8 for this bug?
,
Jun 20 2018
,
Jun 20 2018
Re #5, why does the bot code do that? It should be possible to hold a file descriptor/lock open across forks/execs (if you're careful). Can we get a reliable mechanism for determining if a bot is running? If I understand correctly, #8 is still theoretically vulnerable to races, if we catch the bot moving from 1 to 2, then back from 2 to 1 with extremely bad timing if the system is under load/process scheduler is wonky.
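For reference, a minimal sketch of keeping a lock held across a fork/exec, assuming an flock()-style advisory lock (the thread does not say what kind of lock swarming.lck actually is). An flock lock belongs to the open file description, so a child that inherits the descriptor keeps the lock alive after the parent closes its own copy. All function names here are hypothetical.

```python
import fcntl
import os
import subprocess
import sys
import tempfile


def acquire_lock(path):
    """Opens path and takes an exclusive, non-blocking flock on it."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        raise
    return fd


def is_locked(path):
    """Returns True if another open file description currently holds the lock."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        return True
    else:
        return False  # closing fd below releases the probe's own lock
    finally:
        os.close(fd)


def spawn_holding_lock(fd, argv):
    """Starts argv with the locked descriptor inherited, then drops our copy."""
    child = subprocess.Popen(argv, pass_fds=(fd,))
    os.close(fd)  # the child's inherited descriptor keeps the lock held
    return child


def demo():
    """Returns (locked while the child runs, locked after the child exits)."""
    lock_path = os.path.join(tempfile.mkdtemp(), "swarming.lck")
    fd = acquire_lock(lock_path)
    child = spawn_holding_lock(
        fd, [sys.executable, "-c", "import time; time.sleep(30)"])
    held_during_handoff = is_locked(lock_path)
    child.kill()
    child.wait()
    held_after_exit = is_locked(lock_path)
    return held_during_handoff, held_after_exit
```

With a handoff like this there is no moment at which an observer sees the lock file unheld, which is the property the manager's liveness check needs.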
,
Jun 22 2018
There's something I can't confirm. On Jun 20, when I checked, I could see processes running like "/usr/bin/python /run/skylab_swarming/d8676/swarming_bot.2.zip start_bot". But on Jun 21, when I checked, they were gone; instead, the command had become '/usr/bin/python swarming_bot.1.zip start_bot', without a specified working directory.
,
Jun 22 2018
The process list looks like:
chromeo+ 10990 0.0 0.0 9552 2636 ? S 17:41 0:00 /bin/bash /usr/local/google/home/chromeos-test/chromiumos/chromeos-admin/venv/skylab_swarming_bot/start_swarming_bot.sh -u http://chrome-swarming.appspot.com -b chromeos-skylab-bot-af6e8833-d65c-4ecc-b033-861684d882b5 -w /var/run/skylab_swarming/d8662
chromeo+ 10992 0.0 0.0 9552 2648 ? S 17:41 0:00 /bin/bash /usr/local/google/home/chromeos-test/chromiumos/chromeos-admin/venv/skylab_swarming_bot/start_swarming_bot.sh -u http://chrome-swarming.appspot.com -b chromeos-skylab-bot-8632ec67-a236-4e5e-a79f-dfd3acd59d0b -w /var/run/skylab_swarming/d9829
chromeo+ 10996 0.0 0.0 9552 2532 ? S 17:41 0:00 /bin/bash /usr/local/google/home/chromeos-test/chromiumos/chromeos-admin/venv/skylab_swarming_bot/start_swarming_bot.sh -u http://chrome-swarming.appspot.com -b chromeos-skylab-bot-43282944-7d98-490b-8700-b9db2b4438e9 -w /var/run/skylab_swarming/d9970
chromeo+ 11005 0.0 0.0 9552 2636 ? S 17:41 0:00 /bin/bash /usr/local/google/home/chromeos-test/chromiumos/chromeos-admin/venv/skylab_swarming_bot/start_swarming_bot.sh -u http://chrome-swarming.appspot.com -b chromeos-skylab-bot-9818f78a-15d3-4c64-9123-b7b4a3fdb3b0 -w /var/run/skylab_swarming/d4353
chromeo+ 11017 0.0 0.0 9552 2628 ? S 17:41 0:00 /bin/bash /usr/local/google/home/chromeos-test/chromiumos/chromeos-admin/venv/skylab_swarming_bot/start_swarming_bot.sh -u http://chrome-swarming.appspot.com -b chromeos-skylab-bot-46c6c402-c6cb-4b7c-a96d-40f548f364ba -w /var/run/skylab_swarming/d9162
chromeo+ 11021 0.0 0.0 9552 2684 ? S 17:41 0:00 /bin/bash /usr/local/google/home/chromeos-test/chromiumos/chromeos-admin/venv/skylab_swarming_bot/start_swarming_bot.sh -u http://chrome-swarming.appspot.com -b chromeos-skylab-bot-64be8a4a-f695-45a8-b0d0-52efa76df665 -w /var/run/skylab_swarming/d6140
chromeo+ 11916 0.2 0.1 182040 35772 ? Sl 17:41 0:05 /usr/bin/python swarming_bot.1.zip start_bot
chromeo+ 12002 0.2 0.1 182104 35748 ? Sl 17:41 0:05 /usr/bin/python swarming_bot.1.zip start_bot
chromeo+ 12010 0.2 0.1 182032 35716 ? Sl 17:41 0:05 /usr/bin/python swarming_bot.1.zip start_bot
chromeo+ 12046 0.2 0.1 182044 35804 ? Sl 17:41 0:05 /usr/bin/python swarming_bot.1.zip start_bot
chromeo+ 12087 0.2 0.1 182040 35808 ? Sl 17:41 0:05 /usr/bin/python swarming_bot.1.zip start_bot
chromeo+ 12147 0.2 0.1 182044 35680 ? Sl 17:41 0:05 /usr/bin/python swarming_bot.1.zip start_bot
,
Jun 22 2018
https://chromium.googlesource.com/infra/luci/luci-py/+/master/appengine/swarming/doc/Detailed-Design.md#self-updating might be of some help for reference. I wasn't aware of the race-condition in #5, that's an interesting find. I don't think I've ever come across it. In our case, I think we'd end up shutting down the container if we can't figure out the pid, and a few minutes later start it back up: https://chromium.googlesource.com/infra/infra/+/master/infra/services/swarm_docker/containers.py#315 Would be pretty much invisible to us since we wouldn't be interrupting any task, just delay the updating by a few minutes. I'll be on the look out for it though
,
Jun 22 2018
The more I think of it, the more I think it'd be simpler to just fix issue 855022 instead of working around.
,
Jun 22 2018
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/luci-py.git/+/a1606fe5bbd441860c7c5e9e9bf2380e04c9b768
commit a1606fe5bbd441860c7c5e9e9bf2380e04c9b768
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Fri Jun 22 17:26:27 2018

[swarming] make bot always starts with absolute path

In the initial self-replication, the bot would start itself with a relative path. But in self-update, the bot would start itself with an absolute path. Some team are using observers to look at open file path, and the relative vs absolute confused their tools. This incidentally replaces a few copy pasted constants with named variables.

R=qyearsley@chromium.org
Bug: 854352
Change-Id: Ib24fee8ecce6717af1e1dab2d43f7452f7ce8239
Reviewed-on: https://chromium-review.googlesource.com/1112039
Reviewed-by: Xixuan Wu <xixuan@chromium.org>
Reviewed-by: Quinten Yearsley <qyearsley@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/a1606fe5bbd441860c7c5e9e9bf2380e04c9b768/appengine/swarming/swarming_bot/__main__.py
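To illustrate the idea behind that change (this is not the code from the commit), a launcher can normalize the zip path before building the command line, so that observers always see the same absolute path whether the bot was started fresh or by self-update. The function name is hypothetical.

```python
import os
import sys


def bot_command(zip_path, python=sys.executable):
    """Builds the start_bot command with an absolute path to the bot zip."""
    # os.path.abspath resolves a relative path against the current working directory.
    return [python, os.path.abspath(zip_path), "start_bot"]
```

For example, calling bot_command("swarming_bot.1.zip") from /var/run/skylab_swarming/d8662 would put /var/run/skylab_swarming/d8662/swarming_bot.1.zip on the command line instead of the bare file name seen in the process list above.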
,
Jun 25 2018
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/14a587914e658a7a22fb56fe29347d70ee4153fc commit 14a587914e658a7a22fb56fe29347d70ee4153fc Author: Xixuan Wu <xixuan@chromium.org> Date: Mon Jun 25 23:00:27 2018
,
Jun 29 2018
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/4afbe6a4ef43efcbf2e4518ff3c388b06c82800d commit 4afbe6a4ef43efcbf2e4518ff3c388b06c82800d Author: Xixuan Wu <xixuan@chromium.org> Date: Fri Jun 29 00:00:01 2018
,
Jun 29 2018
Comment 1 by akes...@chromium.org
, Jun 20 2018