
Issue 855302

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Aug 31
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug




build97-a7 in linux-perf is assigned to two different shards of performance_test_suite

Project Member Reported by nednguyen@chromium.org, Jun 22 2018

Issue description

Project Member

Comment 1 by bugdroid1@chromium.org, Jun 22 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/3d1205eeea4bd6db5649488edfdf32e72c7c178a

commit 3d1205eeea4bd6db5649488edfdf32e72c7c178a
Author: Ned Nguyen <nednguyen@google.com>
Date: Fri Jun 22 12:43:56 2018

Make sure that perf_device_trigger always try to assign different shards on same bot

Bug:  855302 
Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel
Change-Id: Ifedb4069548c3c1706ae5b95e3ad922522b89178
Reviewed-on: https://chromium-review.googlesource.com/1111461
Reviewed-by: Emily Hanley <eyaich@chromium.org>
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Cr-Commit-Position: refs/heads/master@{#569586}
[modify] https://crrev.com/3d1205eeea4bd6db5649488edfdf32e72c7c178a/testing/trigger_scripts/perf_device_trigger.py
[modify] https://crrev.com/3d1205eeea4bd6db5649488edfdf32e72c7c178a/testing/trigger_scripts/perf_device_trigger_unittest.py

Comment 2 by eyaich@chromium.org, Jun 22 2018

OK, so here is what happened. I am still racking my brain to figure out why, and I think this is hard to debug without knowing the state of all the bots at that point in time.

In build 69, right before device affinity was turned on:
https://uberchromegw.corp.google.com/i/chromium.perf/builders/linux-perf/builds/69

Shard 0 was on build67-a7 (https://chrome-swarming.appspot.com/task?id=3d82a80de0455710&refresh=10&show_raw=1) and shard 2 was on build69-a7 (https://chrome-swarming.appspot.com/task?id=3d82a818548e7110&refresh=10&show_raw=1)


When device affinity was turned on in build 70:
https://uberchromegw.corp.google.com/i/chromium.perf/builders/linux-perf/builds/70

Shard 0 jumped to build69-a7 (https://chrome-swarming.appspot.com/task?id=3d84f48deb9cc710&refresh=10&request_detail=true&show_raw=1) along with shard 2 (https://chrome-swarming.appspot.com/task?id=3d84f49242961810&refresh=10&request_detail=true&show_raw=1), and they both started running on that bot. The logs show that build77-a7 was down (https://logs.chromium.org/v/?s=chrome%2Fbb%2Fchromium.perf%2Flinux-perf%2F70%2F%2B%2Frecipes%2Fsteps%2Ftest_pre_run%2F0%2Fsteps%2Fs__trigger__performance_test_suite_on_NVIDIA_GPU_on_Linux%2F0%2Fstdout), but they don't indicate anything about build67-a7, so we have to assume it was still healthy. There has to be a bug that made shard 0 jump.

Now fast forward to build 173, where build69-a7 goes down. You can see that in the logs: https://logs.chromium.org/v/?s=chrome%2Fbb%2Fchromium.perf%2Flinux-perf%2F173%2F%2B%2Frecipes%2Fsteps%2Ftest_pre_run%2F0%2Fsteps%2Fs__trigger__performance_test_suite_on_NVIDIA_GPU_on_Linux%2F0%2Fstdout

Shard 2 correctly gets allocated to build84-a7, which is what the logs said it would (https://chrome-swarming.appspot.com/task?id=3deaa39424123210&refresh=10&show_raw=1). But then shard 0 gets allocated to build97-a7 (https://chrome-swarming.appspot.com/task?id=3deaa38342fbf210&refresh=10&show_raw=1), which is already running shard 10 (https://chrome-swarming.appspot.com/task?id=3deaa3b388ec7f10&refresh=10&show_raw=1).

I am going to add some more logging to device affinity so we have a better picture of the state of the world when debugging this.
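
For reference, the invariant being violated (one shard per bot, with each shard preferring its previous bot) can be sketched as below. This is only an illustration under stated assumptions, not the code in perf_device_trigger.py: the function name `assign_shards` and its arguments are hypothetical, and the inputs are plain dicts and lists rather than real swarming query results.

```python
def assign_shards(num_shards, previous_bots, healthy_bots):
    """Hypothetical sketch: map each shard index to a distinct healthy bot.

    previous_bots: {shard_index: bot_id} from the last run (entries may be missing).
    healthy_bots: iterable of bot ids that are currently alive.
    """
    # De-duplicate while keeping order so the result is deterministic.
    healthy = list(dict.fromkeys(healthy_bots))
    if num_shards > len(healthy):
        raise ValueError("not enough healthy bots for one shard per bot")

    assignment = {}
    used = set()

    # Pass 1: keep a shard on its previous bot if that bot is healthy and
    # has not already been claimed by another shard.
    for shard in range(num_shards):
        bot = previous_bots.get(shard)
        if bot in healthy and bot not in used:
            assignment[shard] = bot
            used.add(bot)

    # Pass 2: give the remaining shards bots that no shard has claimed.
    # Doing this after pass 1 avoids handing out a bot that a later shard
    # still has affinity to.
    free = [bot for bot in healthy if bot not in used]
    for shard in range(num_shards):
        if shard not in assignment:
            assignment[shard] = free.pop(0)

    return assignment
```

With three shards whose previous bots were build69-a7, build84-a7 and build97-a7, and build69-a7 down, the displaced shard gets the one unclaimed bot instead of landing on a bot another shard already holds.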


Comment 3 by eyaich@chromium.org, Jun 22 2018

One thing to note is that shard 0 of build 69 failed: https://chrome-swarming.appspot.com/task?id=3d82a80de0455710&refresh=10&show_raw=1

The task failed, but the bot didn't die. I am wondering if somehow that bot id didn't get returned for shard 0 because it was a failure. Maybe something with the swarming query? That doesn't explain why shard 0 was assigned to a bot that was already in use by another shard, but it could explain why we didn't get the right info back.
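
To make that hypothesis concrete, here is a small sketch of how a "last bot for this shard" lookup could lose the affinity information if failed tasks were filtered out. Everything here is an assumption for illustration: the function `last_bot_for_shard`, the `include_failures` flag, and the task dict keys (`shard`, `bot_id`, `failure`) are hypothetical and are not taken from the swarming API or from perf_device_trigger.py.

```python
def last_bot_for_shard(tasks, shard_index, include_failures=True):
    """Hypothetical sketch: return the bot that most recently ran a shard.

    tasks: list of dicts, newest first, each with the assumed keys
           'shard' (int), 'bot_id' (str) and 'failure' (bool).
    Returns None when no matching task is found.
    """
    for task in tasks:
        if task["shard"] != shard_index:
            continue
        # If failed tasks are skipped, a shard whose last run failed
        # appears to have no previous bot at all.
        if task["failure"] and not include_failures:
            continue
        return task["bot_id"]
    return None
```

If the lookup behaved like `include_failures=False`, a failed shard 0 run would come back with no previous bot, and the trigger script would fall through to whatever it does for shards without affinity.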
Project Member

Comment 4 by bugdroid1@chromium.org, Jun 25 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/1256946df8cec50719f16b25ffe115f55d5d7241

commit 1256946df8cec50719f16b25ffe115f55d5d7241
Author: Emily Hanley <eyaich@google.com>
Date: Mon Jun 25 14:13:32 2018

Adding logging to soft device affinity for debugging purposes.

Bug:  855302 
Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel
Change-Id: I2696b04d1fd0838d4c497cfd60e43e7f15d6b7a7
Reviewed-on: https://chromium-review.googlesource.com/1112348
Commit-Queue: Emily Hanley <eyaich@chromium.org>
Reviewed-by: Ned Nguyen <nednguyen@google.com>
Cr-Commit-Position: refs/heads/master@{#570030}
[modify] https://crrev.com/1256946df8cec50719f16b25ffe115f55d5d7241/testing/trigger_scripts/perf_device_trigger.py

Status: Fixed (was: Assigned)
