build97-a7 in linux-perf is assigned to two different shards of performance_test_suite
Issue description
I was wondering why "shard #10" of https://ci.chromium.org/buildbot/chromium.perf/linux-perf/255 always has a long pending time. It turns out that it was using "build97-a7", which is also used by shard #0 (https://logs.chromium.org/v/?s=chrome%2Fbb%2Fchromium.perf%2Flinux-perf%2F254%2F%2B%2Frecipes%2Fsteps%2Ftest_pre_run%2F0%2Fsteps%2Fs__trigger__performance_test_suite_on_NVIDIA_GPU_on_Linux%2F0%2Fstdout).
Jun 22 2018
OK, so here is what happened, but I am racking my brain to figure out why, and I think this is hard to debug without knowing the state of all the bots at that point in time.

In build 69, right before device affinity was turned on (https://uberchromegw.corp.google.com/i/chromium.perf/builders/linux-perf/builds/69), shard 0 was on build67-a7 (https://chrome-swarming.appspot.com/task?id=3d82a80de0455710&refresh=10&show_raw=1) and shard 2 was on build69-a7 (https://chrome-swarming.appspot.com/task?id=3d82a818548e7110&refresh=10&show_raw=1).

When device affinity was turned on in build 70 (https://uberchromegw.corp.google.com/i/chromium.perf/builders/linux-perf/builds/70), shard 0 jumped to build69-a7 (https://chrome-swarming.appspot.com/task?id=3d84f48deb9cc710&refresh=10&request_detail=true&show_raw=1) along with shard 2 (https://chrome-swarming.appspot.com/task?id=3d84f49242961810&refresh=10&request_detail=true&show_raw=1), and they both started running on that bot. The logs show that build77-a7 was down (https://logs.chromium.org/v/?s=chrome%2Fbb%2Fchromium.perf%2Flinux-perf%2F70%2F%2B%2Frecipes%2Fsteps%2Ftest_pre_run%2F0%2Fsteps%2Fs__trigger__performance_test_suite_on_NVIDIA_GPU_on_Linux%2F0%2Fstdout), but they don't indicate anything about build67-a7, so we have to assume it was still healthy. So there has to be a bug that made shard 0 jump.
Now fast forward to build 173, where build69-a7 goes down. You can see that in the logs: https://logs.chromium.org/v/?s=chrome%2Fbb%2Fchromium.perf%2Flinux-perf%2F173%2F%2B%2Frecipes%2Fsteps%2Ftest_pre_run%2F0%2Fsteps%2Fs__trigger__performance_test_suite_on_NVIDIA_GPU_on_Linux%2F0%2Fstdout

Shard 2 correctly gets allocated to build84-a7, which is what the logs said would happen (https://chrome-swarming.appspot.com/task?id=3deaa39424123210&refresh=10&show_raw=1). But then shard 0 gets allocated to build97-a7 (https://chrome-swarming.appspot.com/task?id=3deaa38342fbf210&refresh=10&show_raw=1), which is already running shard 10 (https://chrome-swarming.appspot.com/task?id=3deaa3b388ec7f10&refresh=10&show_raw=1).

I am going to add some more logging to device affinity so we have a better picture of the state of the world when debugging this.
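
To make the build 173 symptom concrete, here is a minimal sketch of the kind of check that extra logging could include: given a shard-to-bot mapping, report any bot that ends up with more than one shard. The function names and the dictionary shape are assumptions made for illustration; this is not the code in perf_device_trigger.py.

```python
import collections
import logging


def find_doubly_assigned_bots(shard_to_bot):
    """Return {bot_id: [shard, ...]} for every bot assigned to 2+ shards.

    shard_to_bot is a hypothetical {shard_index: bot_id} mapping.
    """
    bot_to_shards = collections.defaultdict(list)
    for shard, bot_id in sorted(shard_to_bot.items()):
        bot_to_shards[bot_id].append(shard)
    return {
        bot_id: shards
        for bot_id, shards in bot_to_shards.items()
        if len(shards) > 1
    }


def log_assignment(shard_to_bot):
    """Log the full mapping, then warn about any bot shared by several shards."""
    for shard, bot_id in sorted(shard_to_bot.items()):
        logging.info('shard %d -> %s', shard, bot_id)
    for bot_id, shards in find_doubly_assigned_bots(shard_to_bot).items():
        logging.warning('%s is assigned to shards %s', bot_id, shards)
```

With the three assignments quoted above for build 173, `find_doubly_assigned_bots({0: 'build97-a7', 2: 'build84-a7', 10: 'build97-a7'})` reports build97-a7 as shared by shards 0 and 10.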
Jun 22 2018
One thing to note is that shard 0 of build 69 failed: https://chrome-swarming.appspot.com/task?id=3d82a80de0455710&refresh=10&show_raw=1 The task failed; the bot didn't die. I am wondering if somehow that ID didn't get returned for shard 0 because the task was a failure. Maybe something with the swarming query? That doesn't explain why the shard was assigned to a bot that was already in use by another shard, but it could explain why it didn't get the right info back.
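
For reference, the behavior one would expect can be sketched as follows, assuming the trigger script knows each shard's previous bot and whether the previous task failed: a shard stays on its previous bot as long as the bot is healthy, even if the task failed, and any shard that needs a new bot only receives one that no other shard has claimed. The function name, input shapes, and error handling are all hypothetical; this is not what perf_device_trigger.py actually does.

```python
def assign_shards(previous_tasks, healthy_bots):
    """Pick a bot for each shard with soft device affinity.

    previous_tasks: hypothetical {shard: (bot_id, task_failed)} mapping.
    healthy_bots: iterable of bot ids that are currently alive.
    Returns {shard: bot_id} with no bot used twice.
    """
    healthy = set(healthy_bots)
    assignment = {}
    needs_bot = []

    # Pass 1: keep the previous bot if it is alive and not already claimed.
    # A failed task does not break affinity; only a dead bot does.
    for shard in sorted(previous_tasks):
        bot_id, _task_failed = previous_tasks[shard]
        if bot_id in healthy and bot_id not in assignment.values():
            assignment[shard] = bot_id
        else:
            needs_bot.append(shard)

    # Pass 2: hand out only bots that no shard claimed in pass 1.
    free = sorted(healthy - set(assignment.values()))
    for shard in needs_bot:
        if not free:
            raise RuntimeError('no unclaimed healthy bot for shard %d' % shard)
        assignment[shard] = free.pop(0)
    return assignment
```

Under these assumptions, the build 69 state (shard 0 failed on build67-a7, shard 2 on build69-a7, both bots still healthy) would leave both shards where they were, rather than moving shard 0 onto build69-a7.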
Jun 25 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/1256946df8cec50719f16b25ffe115f55d5d7241

commit 1256946df8cec50719f16b25ffe115f55d5d7241
Author: Emily Hanley <eyaich@google.com>
Date: Mon Jun 25 14:13:32 2018

Adding logging to soft device affinity for debugging purposes.

Bug: 855302
Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel
Change-Id: I2696b04d1fd0838d4c497cfd60e43e7f15d6b7a7
Reviewed-on: https://chromium-review.googlesource.com/1112348
Commit-Queue: Emily Hanley <eyaich@chromium.org>
Reviewed-by: Ned Nguyen <nednguyen@google.com>
Cr-Commit-Position: refs/heads/master@{#570030}

[modify] https://crrev.com/1256946df8cec50719f16b25ffe115f55d5d7241/testing/trigger_scripts/perf_device_trigger.py
Aug 31
Comment 1 by bugdroid1@chromium.org, Jun 22 2018