
Issue 886985

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
OOO until 2019-01-24
Closed: Sep 20
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux , Android , Windows , Mac
Pri: 1
Type: Bug

Blocking:
issue 898684
issue 920665
issue 871453
issue 838970




Improve trigger_multiple_dimensions.py logic

Project Member Reported by ynovikov@chromium.org, Sep 19

Issue description

After https://chromium-review.googlesource.com/1225609 landed, shards started to time out on the win_angle_rel_ng bot:
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_angle_rel_ng/3087
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_angle_rel_ng/3093
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_angle_rel_ng/3094
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_angle_rel_ng/3095
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_angle_rel_ng/3096

Also
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_angle_deqp_rel_ng/2707

There were 4 bots with the 24.20.100.6286 driver and 130 bots with the 23.21.13.8816 driver at this time.
Somehow the 4 bots became oversubscribed.

Looking at https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_angle_rel_ng/3093, at least 9 out of around 100 shards were scheduled on the 4 new driver bots.
This is highly disproportionate to the 4/130 ratio.
Even more so in https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_angle_deqp_rel_ng/2707, where 1/4 shards were scheduled on new bots.

It looks like choose_random_int() in trigger_multiple_dimensions.py deviates unacceptably from the expected distribution.
Maybe we should fix this with a threshold: don't schedule on any configuration that makes up less than 10% of the bot population.
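
For illustration, the threshold idea could be sketched roughly as follows. This is not the code in trigger_multiple_dimensions.py: `prune_small_configs`, `pick_config`, the dictionary of bot counts keyed by driver version, and the "keep the largest configuration" fallback are all hypothetical names and assumptions made for this example.

```python
import random


def prune_small_configs(bot_counts, min_fraction=0.10):
    """Drop configurations holding less than min_fraction of all bots.

    bot_counts maps a (hypothetical) configuration label to its bot count.
    """
    total = sum(bot_counts.values())
    if total == 0:
        return {}
    kept = {name: n for name, n in bot_counts.items()
            if n / total >= min_fraction}
    # Assumed fallback: if every configuration is below the threshold,
    # keep the largest one so there is always somewhere to schedule.
    if not kept:
        largest = max(bot_counts, key=bot_counts.get)
        kept = {largest: bot_counts[largest]}
    return kept


def pick_config(bot_counts, rng=random):
    """Pick one configuration with probability proportional to bot count."""
    names = list(bot_counts)
    weights = [bot_counts[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


# Intel bot counts quoted in this issue: 59 old-driver, 4 new-driver.
counts = {"23.20.16.4877": 59, "24.20.100.6286": 4}
eligible = prune_small_configs(counts)
chosen = pick_config(eligible)
```

With these counts the 4 new-driver bots are about 6% of the pool, so they fall under the 10% threshold and only the old-driver configuration remains eligible.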
 
Blocking: 838970
Cc: zmo@chromium.org
Status: Assigned (was: Untriaged)
Thanks Yuly for your analysis. I'll implement that check.

I'm sorry, it looks like I was looking at the NVIDIA bots when I reported "130 bots with 23.21.13.8816". For Intel bots we have 59 bots with 23.20.16.4877.

Also, https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_angle_rel_ng/3093 had 137 shards due to retries.
 
9/128 is actually pretty close to 4/59, so the deviation seems reasonable.

Now it looks to me that the load might have been spread proportionately between the old and new driver bots.
Unfortunately, I don't know how to query how many jobs were scheduled on which configuration to confirm that.

My current hypothesis is that the 4 bots got a little more load due to rounding error, which a threshold could also solve.
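
As a rough sanity check on the numbers above, the sketch below assumes each shard independently lands on the new-driver configuration with probability equal to its share of the bots (a simple binomial model). That model is an assumption made for this example, not a description of how the trigger script actually chooses; the variable names are likewise illustrative.

```python
import math

new_bots = 4          # bots with the 24.20.100.6286 driver
old_bots = 59         # Intel bots with the 23.20.16.4877 driver
shards = 137          # shard count for build 3093, including retries
observed_on_new = 9   # "at least 9" shards seen on the new-driver bots

# Assumed model: each shard picks the new configuration with
# probability proportional to that configuration's bot count.
p_new = new_bots / (new_bots + old_bots)
expected = shards * p_new
std_dev = math.sqrt(shards * p_new * (1 - p_new))
z_score = (observed_on_new - expected) / std_dev

# Under proportional scheduling, per-bot load is the same on both pools.
shards_per_new_bot = expected / new_bots
shards_per_old_bot = shards * (1 - p_new) / old_bots

print("expected shards on new bots: %.2f (std dev %.2f)" % (expected, std_dev))
print("observed: %d, z-score: %.2f" % (observed_on_new, z_score))
print("per-bot load, new vs old: %.2f vs %.2f"
      % (shards_per_new_bot, shards_per_old_bot))
```

Under this model the observed count sits well within one standard deviation of the expected value, which is consistent with the load having been spread roughly proportionately.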
Project Member

Comment 4 by bugdroid1@chromium.org, Sep 19

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/3ddf78750fd145440399b6e5f5ae8a3d2a447f93

commit 3ddf78750fd145440399b6e5f5ae8a3d2a447f93
Author: Kenneth Russell <kbr@chromium.org>
Date: Wed Sep 19 23:39:20 2018

Prune bot configurations with less than 10% of total capacity.

We're finding that these configurations are frequently getting
over-subscribed, so without more information, institute a threshold so
the multi-dimension trigger script doesn't schedule jobs on these
configurations *at all*.

This will likely have side effects and require refinement, but will
hopefully alleviate the immediate problem.

Bug:  886985 
Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel
Change-Id: I5ea4823920d7e7c0d8fc9509bb853a7de13c21ff
Reviewed-on: https://chromium-review.googlesource.com/1234858
Reviewed-by: John Budorick <jbudorick@chromium.org>
Commit-Queue: Kenneth Russell <kbr@chromium.org>
Cr-Commit-Position: refs/heads/master@{#592607}
[modify] https://crrev.com/3ddf78750fd145440399b6e5f5ae8a3d2a447f93/testing/trigger_scripts/trigger_multiple_dimensions.py
[modify] https://crrev.com/3ddf78750fd145440399b6e5f5ae8a3d2a447f93/testing/trigger_scripts/trigger_multiple_dimensions_unittest.py

Cc: mar...@chromium.org
Labels: OS-Android OS-Linux OS-Mac OS-Windows
Status: Fixed (was: Assigned)
I'm not sure whether this actually fixed the issue. https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_angle_rel_ng was mostly green before my CL above landed. Please help verify this change.

Also, the logic can surely still be improved.

Yes, win_angle_rel_ng was green because we didn't have many ANGLE CLs in flight.
Unfortunately (or luckily), 10 more bots had their drivers upgraded before the CL in #4 landed, so we'll have to wait until the next driver upgrade to see if it helped.
Blocking: 871453
Blocking: 898684
Blocking: 920665
