Improve trigger_multiple_dimensions.py logic
Issue description

After https://chromium-review.googlesource.com/1225609, shards started to time out on the win_angle_rel_ng bot:
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_angle_rel_ng/3087
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_angle_rel_ng/3093
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_angle_rel_ng/3094
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_angle_rel_ng/3095
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_angle_rel_ng/3096
Also https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_angle_deqp_rel_ng/2707

There were 4 bots with the 24.20.100.6286 driver and 130 bots with the 23.21.13.8816 driver at this time. Somehow the 4 bots became oversubscribed.

Looking at https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_angle_rel_ng/3093, at least 9 out of around 100 shards were scheduled on the 4 new-driver bots. This is highly disproportionate to the 4/130 ratio. Even more so in https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_angle_deqp_rel_ng/2707, where 1/4 of the shards were scheduled on the new bots.

It looks like choose_random_int() in trigger_multiple_dimensions.py has unacceptable deviation. Maybe we should fix this with a threshold: don't schedule on a configuration that makes up less than 10% of the population.
Sep 19
Thanks Yuly for your analysis. I'll implement that check.
Sep 19
I'm sorry, it looks like I was looking at the NVIDIA bots when I reported "130 bots with 23.21.13.8816". For the Intel bots we have 59 bots with 23.20.16.4877. Also, https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_angle_rel_ng/3093 had 137 shards due to retries. 9/128 is actually pretty close to 4/59, so the deviation seems reasonable.

It now looks to me as though the load might have been spread proportionately between the old and new driver bots. Unfortunately, I don't know how to query how many jobs were scheduled on which configuration to confirm that. My current hypothesis is that the 4 bots got a little more load due to rounding error, which a threshold could also solve.
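The proportions quoted in this comment can be checked with a few lines of Python. This is only a rough sanity check: the variable names are illustrative, and the expected count and spread assume each shard is assigned independently with probability equal to the new-driver bots' share of the pool, which the thread does not confirm.

```python
import math

# Figures quoted in the comment above; the names are illustrative only.
new_driver_bots = 4
old_driver_bots = 59
shards_on_new_bots = 9
shards_counted = 128

# Share of shards that actually landed on the new-driver bots.
observed_share = shards_on_new_bots / shards_counted

# The 4/59 ratio used in the comment, and the new bots' share of the whole pool.
ratio_new_to_old = new_driver_bots / old_driver_bots
share_of_pool = new_driver_bots / (new_driver_bots + old_driver_bots)

# ASSUMPTION: shards are assigned independently, each landing on a new-driver
# bot with probability share_of_pool (a simple binomial model).
expected_on_new = shards_counted * share_of_pool
std_dev_on_new = math.sqrt(shards_counted * share_of_pool * (1 - share_of_pool))

print(f"observed share of shards:   {observed_share:.1%}")
print(f"new/old bot ratio (4/59):   {ratio_new_to_old:.1%}")
print(f"new bots' share of pool:    {share_of_pool:.1%}")
print(f"expected shards on new bots: {expected_on_new:.1f} +/- {std_dev_on_new:.1f}")
```

Under that simple model, 9 shards on the new-driver bots falls within one standard deviation of the expected count, which is consistent with the comment's reading that the deviation is reasonable.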
Sep 19
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/3ddf78750fd145440399b6e5f5ae8a3d2a447f93

commit 3ddf78750fd145440399b6e5f5ae8a3d2a447f93
Author: Kenneth Russell <kbr@chromium.org>
Date: Wed Sep 19 23:39:20 2018

Prune bot configurations with less than 10% of total capacity.

We're finding that these configurations are frequently getting over-subscribed, so without more information, institute a threshold so the multi-dimension trigger script doesn't schedule jobs on these configurations *at all*. This will likely have side effects and require refinement, but will hopefully alleviate the immediate problem.

Bug: 886985
Cq-Include-Trybots: luci.chromium.try:android_optional_gpu_tests_rel;luci.chromium.try:linux_optional_gpu_tests_rel;luci.chromium.try:mac_optional_gpu_tests_rel;luci.chromium.try:win_optional_gpu_tests_rel
Change-Id: I5ea4823920d7e7c0d8fc9509bb853a7de13c21ff
Reviewed-on: https://chromium-review.googlesource.com/1234858
Reviewed-by: John Budorick <jbudorick@chromium.org>
Commit-Queue: Kenneth Russell <kbr@chromium.org>
Cr-Commit-Position: refs/heads/master@{#592607}

[modify] https://crrev.com/3ddf78750fd145440399b6e5f5ae8a3d2a447f93/testing/trigger_scripts/trigger_multiple_dimensions.py
[modify] https://crrev.com/3ddf78750fd145440399b6e5f5ae8a3d2a447f93/testing/trigger_scripts/trigger_multiple_dimensions_unittest.py
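The pruning idea described in this commit message could look roughly like the sketch below. This is not the code that landed in trigger_multiple_dimensions.py: the function names `prune_small_configs` and `pick_config`, the fallback when every configuration is below the threshold, and the use of capacity-weighted random selection are all assumptions made for illustration.

```python
import random


def prune_small_configs(bot_counts, min_fraction=0.10):
    """Zero out configurations holding less than min_fraction of all bots.

    Hypothetical helper: bot_counts[i] is the number of available bots for
    configuration i. Returns a new list of weights of the same length.
    """
    total = sum(bot_counts)
    if total == 0:
        return list(bot_counts)
    pruned = [c if c >= min_fraction * total else 0 for c in bot_counts]
    # ASSUMPTION: if every configuration is below the threshold (for example,
    # many equally small pools), fall back to the unpruned counts.
    if not any(pruned):
        return list(bot_counts)
    return pruned


def pick_config(bot_counts, rng=random):
    """Pick a configuration index, weighted by its (pruned) bot count."""
    weights = prune_small_configs(bot_counts)
    if sum(weights) == 0:
        raise ValueError("no bots available in any configuration")
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


# With the Intel numbers from this thread (4 new-driver bots, 59 old-driver
# bots), the new-driver configuration is about 6.3% of the pool and is pruned.
print(prune_small_configs([4, 59]))
print(pick_config([4, 59]))
```

Note that a hard cutoff like this leaves a small pool completely idle until it grows past the threshold, which matches the commit message's warning that the change will likely have side effects and need refinement.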
Sep 20
I'm not sure whether this actually fixed the issue. https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_angle_rel_ng was mostly green before my CL above landed. Please help verify this change. Also, the logic can surely still be improved.
Sep 20
Yes, win_angle_rel_ng was green because we didn't have many ANGLE CLs in flight. Unfortunately (or luckily), 10 more bots had their drivers upgraded before the CL in #4 landed, so we'll have to wait until the next driver upgrade to see if it helped.
Oct 2
Oct 31
Jan 10
Comment 1 by kbr@chromium.org, Sep 19
Cc: zmo@chromium.org