Issue 908551

Starred by 14 users

Issue metadata

Status: Duplicate
Merged: issue 908929
Owner:
Closed: Nov 28
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Mac
Pri: 1
Type: Bug




mac_chromium_rel_ng long pending times

Project Member Reported by martiniss@chromium.org, Nov 26

Issue description

Not sure what's going on, but pending times are very long.
 
Labels: -OS-Mac
Looked at a random log, saw this at the bottom. Looks suspicious:


Total duration: 36000.9s
Results from some shards are missing: 9
WARNING:root:collect_cmd had non-zero return code: 1
WARNING:root:Expected output.json file missing: set(['/b/s/w/ir/tmp/t/tmpdWo5xz/1/output.json', '/b/s/w/ir/tmp/t/tmpdWo5xz/5/output.json', '/b/s/w/ir/tmp/t/tmpdWo5xz/3/output.json', '/b/s/w/ir/tmp/t/tmpdWo5xz/2/output.json', '/b/s/w/ir/tmp/t/tmpdWo5xz/9/output.json', '/b/s/w/ir/tmp/t/tmpdWo5xz/7/output.json', '/b/s/w/ir/tmp/t/tmpdWo5xz/6/output.json', '/b/s/w/ir/tmp/t/tmpdWo5xz/11/output.json', '/b/s/w/ir/tmp/t/tmpdWo5xz/8/output.json', '/b/s/w/ir/tmp/t/tmpdWo5xz/10/output.json', '/b/s/w/ir/tmp/t/tmpdWo5xz/4/output.json'])
Found: []
Expected: ['/b/s/w/ir/tmp/t/tmpdWo5xz/1/output.json', '/b/s/w/ir/tmp/t/tmpdWo5xz/10/output.json', '/b/s/w/ir/tmp/t/tmpdWo5xz/11/output.json', '/b/s/w/ir/tmp/t/tmpdWo5xz/2/output.json', '/b/s/w/ir/tmp/t/tmpdWo5xz/3/output.json', '/b/s/w/ir/tmp/t/tmpdWo5xz/4/output.json', '/b/s/w/ir/tmp/t/tmpdWo5xz/5/output.json', '/b/s/w/ir/tmp/t/tmpdWo5xz/6/output.json', '/b/s/w/ir/tmp/t/tmpdWo5xz/7/output.json', '/b/s/w/ir/tmp/t/tmpdWo5xz/8/output.json', '/b/s/w/ir/tmp/t/tmpdWo5xz/9/output.json']

WARNING:root:No shard json files found in task_output_dir: '/b/s/w/ir/tmp/t/tmpdWo5xz'
Found ['/b/s/w/ir/tmp/t/tmpdWo5xz/1', '/b/s/w/ir/tmp/t/tmpdWo5xz/10', '/b/s/w/ir/tmp/t/tmpdWo5xz/11', '/b/s/w/ir/tmp/t/tmpdWo5xz/2', '/b/s/w/ir/tmp/t/tmpdWo5xz/3', '/b/s/w/ir/tmp/t/tmpdWo5xz/4', '/b/s/w/ir/tmp/t/tmpdWo5xz/5', '/b/s/w/ir/tmp/t/tmpdWo5xz/6', '/b/s/w/ir/tmp/t/tmpdWo5xz/7', '/b/s/w/ir/tmp/t/tmpdWo5xz/8', '/b/s/w/ir/tmp/t/tmpdWo5xz/9', '/b/s/w/ir/tmp/t/tmpdWo5xz/summary.json']
Running ['/b/s/w/ir/cache/vpython/5b0713/bin/python', '/b/s/w/ir/cache/builder/src/third_party/blink/tools/merge_web_test_results.py', '--build-properties', '{"attempt_start_ts": 1543254523000000, "blamelist": ["xidachen@chromium.org"], "bot_id": "vm49-m1", "buildbucket": {"build": {"bucket": "luci.chromium.try", "created_by": "user:5071639625-1lppvbtck1morgivc6sq4dul7klu27sd@developer.gserviceaccount.com", "created_ts": 1543254539868255, "id": "8928774043562122880", "project": "chromium", "tags": ["builder:mac_chromium_rel_ng", "buildset:patch/gerrit/chromium-review.googlesource.com/1350331/1", "cq_experimental:false", "user_agent:cq"]}, "hostname": "cr-buildbucket.appspot.com"}, "buildername": "mac_chromium_rel_ng", "buildnumber": 193057, "category": "cq", "got_angle_revision": "15992bef28d84b59c1a815483519347896f185c8", "got_buildtools_revision": "04161ec8d7c781e4498c699254c69ba0dd959fde", "got_dawn_revision": "63997221d7d880d8d1783abe326b90cd95cd92d2", "got_nacl_revision": "f701a90597fc85979319447c0cd44c3b52201c78", "got_revision": "4671e20b1c076d3674a6d8bdc3a510b1c30578ba", "got_revision_cp": "refs/heads/master@{#610876}", "got_swarming_client_revision": "b6e9e23e4e79249bd4f95735205ffb7c3f9f0912", "got_v8_revision": "89124cf99ef9a852bdf0681599cfcc29193a4c79", "got_v8_revision_cp": "refs/heads/7.2.470@{#1}", "got_webrtc_revision": "f1c194decd51a63ba923349da96fcd9cb6dae35a", "got_webrtc_revision_cp": "refs/heads/master@{#25777}", "mastername": "tryserver.chromium.mac", "patch_gerrit_url": "https://chromium-review.googlesource.com", "patch_issue": 1350331, "patch_project": "chromium/src", "patch_ref": "refs/changes/31/1350331/1", "patch_repository_url": "https://chromium.googlesource.com/chromium/src", "patch_set": 1, "patch_storage": "gerrit", "path_config": "generic", "reason": "CQ", "recipe": "chromium_trybot", "repository": "https://chromium.googlesource.com/chromium/src", "revision": "HEAD"}', '--summary-json', '/b/s/w/ir/tmp/t/tmpdWo5xz/summary.json', 
'--task-output-dir', '/b/s/w/ir/tmp/t/tmpdWo5xz', u'--verbose', '-o', '/b/s/w/ir/tmp/t/tmpo7bHYu.json'] in None (env: None)
2018-11-26 11:51:09,006 - blinkpy.common.system.log_utils: [DEBUG] Debug logging enabled.
2018-11-26 11:51:09,006 - root: [INFO] Running with isolated arguments
Traceback (most recent call last):
  File "/b/s/w/ir/cache/builder/src/third_party/blink/tools/merge_web_test_results.py", line 12, in <module>
    main(sys.argv[1:])
  File "/b/s/w/ir/cache/builder/src/third_party/blink/tools/blinkpy/web_tests/merge_results.py", line 775, in main
    assert args.positional
AssertionError
Command ['/b/s/w/ir/cache/vpython/5b0713/bin/python', '/b/s/w/ir/cache/builder/src/third_party/blink/tools/merge_web_test_results.py', '--build-properties', '{"attempt_start_ts": 1543254523000000, "blamelist": ["xidachen@chromium.org"], "bot_id": "vm49-m1", "buildbucket": {"build": {"bucket": "luci.chromium.try", "created_by": "user:5071639625-1lppvbtck1morgivc6sq4dul7klu27sd@developer.gserviceaccount.com", "created_ts": 1543254539868255, "id": "8928774043562122880", "project": "chromium", "tags": ["builder:mac_chromium_rel_ng", "buildset:patch/gerrit/chromium-review.googlesource.com/1350331/1", "cq_experimental:false", "user_agent:cq"]}, "hostname": "cr-buildbucket.appspot.com"}, "buildername": "mac_chromium_rel_ng", "buildnumber": 193057, "category": "cq", "got_angle_revision": "15992bef28d84b59c1a815483519347896f185c8", "got_buildtools_revision": "04161ec8d7c781e4498c699254c69ba0dd959fde", "got_dawn_revision": "63997221d7d880d8d1783abe326b90cd95cd92d2", "got_nacl_revision": "f701a90597fc85979319447c0cd44c3b52201c78", "got_revision": "4671e20b1c076d3674a6d8bdc3a510b1c30578ba", "got_revision_cp": "refs/heads/master@{#610876}", "got_swarming_client_revision": "b6e9e23e4e79249bd4f95735205ffb7c3f9f0912", "got_v8_revision": "89124cf99ef9a852bdf0681599cfcc29193a4c79", "got_v8_revision_cp": "refs/heads/7.2.470@{#1}", "got_webrtc_revision": "f1c194decd51a63ba923349da96fcd9cb6dae35a", "got_webrtc_revision_cp": "refs/heads/master@{#25777}", "mastername": "tryserver.chromium.mac", "patch_gerrit_url": "https://chromium-review.googlesource.com", "patch_issue": 1350331, "patch_project": "chromium/src", "patch_ref": "refs/changes/31/1350331/1", "patch_repository_url": "https://chromium.googlesource.com/chromium/src", "patch_set": 1, "patch_storage": "gerrit", "path_config": "generic", "reason": "CQ", "recipe": "chromium_trybot", "repository": "https://chromium.googlesource.com/chromium/src", "revision": "HEAD"}', '--summary-json', '/b/s/w/ir/tmp/t/tmpdWo5xz/summary.json', 
'--task-output-dir', '/b/s/w/ir/tmp/t/tmpdWo5xz', u'--verbose', '-o', '/b/s/w/ir/tmp/t/tmpo7bHYu.json'] returned exit code 1
WARNING:root:merge_cmd had non-zero return code: 1
step returned non-zero exit code: 1


OK, that seems wrong, but it looks like a false failure.
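
For reference, the "Expected output.json file missing" warning in the log boils down to checking for one output.json per shard directory under the task output dir. A minimal sketch of that check, assuming the `<task_output_dir>/<shard>/output.json` layout seen in the log; the helper name `missing_shard_outputs` is hypothetical and this is not the actual collect script:

```python
import os
import tempfile


def missing_shard_outputs(task_output_dir, shard_names):
    """Return the expected output.json paths that do not exist on disk."""
    expected = [
        os.path.join(task_output_dir, name, "output.json") for name in shard_names
    ]
    return [path for path in expected if not os.path.isfile(path)]


# Demo with a throwaway directory: shard "1" wrote its output, shard "2" did not.
with tempfile.TemporaryDirectory() as task_output_dir:
    for name in ("1", "2"):
        os.makedirs(os.path.join(task_output_dir, name))
    with open(os.path.join(task_output_dir, "1", "output.json"), "w") as f:
        f.write("{}")
    missing = missing_shard_outputs(task_output_dir, ["1", "2"])
    # Report just the shard directory names that lack an output.json.
    missing_names = [os.path.basename(os.path.dirname(p)) for p in missing]
    print(missing_names)
```

In the log above every shard directory exists but none contains output.json, which is what you would expect if the shards timed out before writing results.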

Looks like layout test tasks are taking forever. I think it's because some shards are trying to run every test.

For example, look at build https://ci.chromium.org/p/chromium/builders/luci.chromium.try/mac_chromium_rel_ng/193077. In that build, a few shards time out and some are successful. The successful ones (a random example is https://chromium-swarm.appspot.com/user/task/416a1117f7eba410) run about 6,000 tests (the sample runs 6552).

Another task in that build, https://chromium-swarm.appspot.com/task?id=416a113187be3310&refresh=10&show_raw=1, tries to run basically every test. This is in the log:

10:53:28.671 27362 Found 89747 tests; running 78981, skipping 10766.

Not sure why, but that seems very likely to cause these issues.
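
A quick back-of-the-envelope check of those two numbers, as a sketch. It assumes both tasks belong to the same suite and that healthy shards get roughly equal slices, neither of which is guaranteed:

```python
# Numbers taken from the two tasks mentioned above.
tests_in_healthy_shard = 6552  # the successful sample task
tests_in_bad_shard = 78981     # "running 78981" in the timed-out task's log

# If sharding were working, each shard would run roughly 1/N of the suite,
# so the ratio estimates how many shards' worth of work the bad task took on.
ratio = tests_in_bad_shard / tests_in_healthy_shard
implied_shard_count = round(ratio)

print(f"bad task ran {ratio:.1f}x the tests of a healthy shard")
print(f"consistent with one task running a whole ~{implied_shard_count}-shard suite")
```

So the slow task is doing about a dozen shards' worth of work on its own, which would explain the timeouts.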

Cc: tkent@chromium.org dpranke@chromium.org robertma@chromium.org
cc-ing some blink people. I glanced through the git log of https://cs.chromium.org/chromium/src/third_party/blink/tools/blinkpy/?q=run_webkit_tests.py&sq=package:chromium&dr, but didn't see anything suspicious.
Found the bug. https://cs.chromium.org/chromium/infra/luci/client/swarming.py?type=cs&q=GTEST_TOTAL_SHARDS+file:%5Einfra/luci/+package:%5Echromium$&g=0&l=245

Looks like gtest sharding environment variables aren't being set for all task slices.
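
For context, a minimal sketch of how gtest-style sharding variables are typically used to split a test list, and why a task that never receives them ends up running everything. The round-robin rule and the `select_tests_for_shard` helper are assumptions for illustration; the layout test runner's real partitioning may differ.

```python
def select_tests_for_shard(tests, env):
    """Pick this shard's slice of tests from gtest-style sharding variables.

    Assumption: unset variables mean "not sharded", i.e. one shard, index 0.
    """
    total_shards = int(env.get("GTEST_TOTAL_SHARDS", "1"))
    shard_index = int(env.get("GTEST_SHARD_INDEX", "0"))
    if not 0 <= shard_index < total_shards:
        raise ValueError("shard index out of range")
    # Round-robin assignment: test i goes to shard i % total_shards.
    return [t for i, t in enumerate(tests) if i % total_shards == shard_index]


tests = [f"test_{i}" for i in range(24)]  # hypothetical test names

# Variables present: this task only runs its own slice.
sharded = select_tests_for_shard(
    tests, {"GTEST_TOTAL_SHARDS": "12", "GTEST_SHARD_INDEX": "3"}
)
# Variables missing: the task silently falls back to running every test.
unsharded = select_tests_for_shard(tests, {})

print(len(sharded), len(unsharded))
```

The failure mode is silent: nothing errors out when the variables are absent, the task just takes far longer.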
https://crrev.com/c/1351503 is a revert which should have fixed this.

I'll monitor the bot to make sure future test runs are good.
Why would that CL affect the env vars?
Project Member

Comment 8 by bugdroid1@chromium.org, Nov 26

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/d07efe857282aaa2ccd332eb974308be66c8393a

commit d07efe857282aaa2ccd332eb974308be66c8393a
Author: Brad Hall <bradhall@google.com>
Date: Mon Nov 26 22:15:38 2018

Make sure to setup_googletest on all slices

If we don't do this then GTEST_SHARD_* env variables won't be set

Bug: 871453,  908551 
Change-Id: Id11b140294906cadd2c9ca7c39e0aab2fda0c0af
Reviewed-on: https://chromium-review.googlesource.com/c/1351358
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>
Commit-Queue: Brad Hall <bradhall@google.com>

[modify] https://crrev.com/d07efe857282aaa2ccd332eb974308be66c8393a/client/swarming.py

> Why would that CL affect the env vars?

That CL enables the task slice code in swarming.py which mistakenly only sets the env vars for task slice 0.  
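
A rough illustration of that shape of bug, under stated assumptions: the dict layout for a task slice and the helpers `with_shard_env`, `buggy_apply` and `fixed_apply` are hypothetical and are not the code in swarming.py.

```python
import copy


def with_shard_env(task_slice, shards, index):
    """Return a copy of a slice with gtest sharding variables added."""
    updated = copy.deepcopy(task_slice)
    updated["env"]["GTEST_TOTAL_SHARDS"] = str(shards)
    updated["env"]["GTEST_SHARD_INDEX"] = str(index)
    return updated


def buggy_apply(slices, shards, index):
    # Only the first slice gets the variables; fallback slices are untouched.
    return [with_shard_env(slices[0], shards, index)] + copy.deepcopy(slices[1:])


def fixed_apply(slices, shards, index):
    # Every slice gets the variables, so a fallback slice is sharded too.
    return [with_shard_env(s, shards, index) for s in slices]


# Hypothetical task with a preferred slice and one fallback slice.
slices = [{"name": "slice0", "env": {}}, {"name": "slice1", "env": {}}]

buggy = buggy_apply(slices, shards=12, index=3)
fixed = fixed_apply(slices, shards=12, index=3)

print("GTEST_SHARD_INDEX" in buggy[1]["env"])  # fallback slice, buggy version
print("GTEST_SHARD_INDEX" in fixed[1]["env"])  # fallback slice, fixed version
```

With the buggy version, any task that ends up running on a slice other than slice 0 sees no sharding variables and behaves as if it were the only shard.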
Ah, got it. Thanks!
Just in case it's helpful, seeing the same issue here: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/mac_chromium_rel_ng/193583
And now the tests are somewhat running, but the bot still looks like it's extra sad: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/mac_chromium_rel_ng/193583
Labels: OS-Mac
Cc: erikc...@chromium.org jbudorick@google.com
Ah, if any test in the webkit_layout_tests step fails, the failing tests are run again in the 'without patch' step. But in the 'without patch' step, the failing tests are run many times (--gtest_repeat=10) to detect flakiness.

That causes long webkit_layout_tests running times in the 'without patch' step. If swarming capacity is insufficient, timeouts happen easily and all of the tests are run in the 'without patch' step, which consumes even more capacity.

erikchen@, can we stop running tests multiple times when the failure apparently comes from an infra failure?
Issue 908729 has been merged into this issue.
> erikchen@, can we stop running tests multiple times when the failure apparently comes from an infra failure?

Agreed. I thought we already did that. https://cs.chromium.org/chromium/build/scripts/slave/recipe_modules/chromium_tests/steps.py?type=cs&q=_test_options_for_running&sq=package:chromium&g=0&l=121

Can you link to a build where you're seeing this behavior? When I look at long-running builds:

https://ci.chromium.org/p/chromium/builders/luci.chromium.try/mac_chromium_rel_ng/194045
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/mac_chromium_rel_ng/194080
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/mac_chromium_rel_ng/194083

It looks like the initial set of webkit layout tests are timing out due to insufficient capacity. The shards that do run are completing in ~15 minutes. This looks like an insufficient capacity issue due to the backlog from the problems earlier, but maybe I'm missing something?
agree w/ #17, this still appears to be digging out of the backlog.
Project Member

Comment 19 by bugdroid1@chromium.org, Nov 27

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/5831aa2859f46c29109f6ec3591270e25c6d6527

commit 5831aa2859f46c29109f6ec3591270e25c6d6527
Author: John Budorick <jbudorick@chromium.org>
Date: Tue Nov 27 14:53:41 2018

Temporarily remove webkit_layout_tests from mac_chromium_rel_ng.

Tbr: sergeyberezin@chromium.org,bradhall@chromium.org
No-Try: true
Bug:  908551 
Change-Id: I5f003793bc94930fa685fff74ae1217e208b0dc6
Reviewed-on: https://chromium-review.googlesource.com/c/1351936
Reviewed-by: John Budorick <jbudorick@chromium.org>
Commit-Queue: John Budorick <jbudorick@chromium.org>
Cr-Commit-Position: refs/heads/master@{#611103}
[modify] https://crrev.com/5831aa2859f46c29109f6ec3591270e25c6d6527/testing/buildbot/chromium.mac.json
[modify] https://crrev.com/5831aa2859f46c29109f6ec3591270e25c6d6527/testing/buildbot/test_suite_exceptions.pyl

Going to help it dig out of the backlog a bit:
 - removing layout tests temporarily
 - cancelling pending layout test tasks from current jobs
Cc: -jbudorick@google.com jbudorick@chromium.org
#17,

I think we don't want to repeat 10 times in the 'without patch' step when some shards hit insufficient capacity and no shard has a test failure.
e.g.
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/mac_chromium_rel_ng/194018

#22: Can you clarify why you think that we're trying to repeat 10 times in the link you sent? I don't see any indication that we're trying to do so.

As linked in c#17, we shouldn't be trying to do the 10X repeat.

Although, if we are timing out due to insufficient capacity, there's really no point in running 'without patch' steps altogether. This seems like a small optimization that will make us fail more gracefully when there's insufficient capacity. jbudorick, wdyt?
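
One reading of that suggestion, sketched as a decision function. The `ShardResult` type, its fields and `should_run_without_patch` are hypothetical names for illustration, not anything in the recipe code:

```python
from dataclasses import dataclass


@dataclass
class ShardResult:
    # Hypothetical summary of one shard of the "with patch" step.
    expired_for_capacity: bool  # never got a bot before the deadline
    failed_tests: tuple = ()    # names of tests that actually failed


def should_run_without_patch(shards):
    """Retry without the patch only when it can tell us something useful."""
    # Capacity is already short: a retry would only add load to the pool.
    if any(s.expired_for_capacity for s in shards):
        return False
    # Otherwise retry only if there are real test failures to triage.
    return any(s.failed_tests for s in shards)


capacity_starved = [ShardResult(True), ShardResult(False)]
real_failure = [ShardResult(False, ("fast/dom/example.html",)), ShardResult(False)]
all_green = [ShardResult(False), ShardResult(False)]

print(should_run_without_patch(capacity_starved))
print(should_run_without_patch(real_failure))
print(should_run_without_patch(all_green))
```

A softer variant would still retry the tests that genuinely failed even when other shards expired; which policy is right depends on how much signal the retry gives while the pool is starved.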
Project Member

Comment 24 by bugdroid1@chromium.org, Nov 27

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/503000c4b8e0745a6c4b2ce19857e125fab77efa

commit 503000c4b8e0745a6c4b2ce19857e125fab77efa
Author: John Budorick <jbudorick@chromium.org>
Date: Tue Nov 27 16:51:02 2018

Temporarily remove webkit_layout_tests from mac_chromium_rel_ng, part 2.

crrev.com/c/1351936 removed the suite from Mac10.12 Tests.
mac_chromium_rel_ng mirrors Mac10.13 Tests despite running the layout
tests on 10.12.6. X(

Tbr: sergeyberezin@chromium.org,bradhall@chromium.org
No-Try: true
Bug:  908551 
Change-Id: Iad0c9bea9be0302d38cdf642951a6b9bcb731469
Reviewed-on: https://chromium-review.googlesource.com/c/1351745
Reviewed-by: John Budorick <jbudorick@chromium.org>
Commit-Queue: John Budorick <jbudorick@chromium.org>
Cr-Commit-Position: refs/heads/master@{#611148}
[modify] https://crrev.com/503000c4b8e0745a6c4b2ce19857e125fab77efa/testing/buildbot/chromium.mac.json
[modify] https://crrev.com/503000c4b8e0745a6c4b2ce19857e125fab77efa/testing/buildbot/test_suite_exceptions.pyl

Issue 908847 has been merged into this issue.
Project Member

Comment 26 by bugdroid1@chromium.org, Nov 27

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/af56597b2e83e94d3f1cfcc34f8a4b8d5f648e8a

commit af56597b2e83e94d3f1cfcc34f8a4b8d5f648e8a
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Tue Nov 27 18:13:42 2018

Roll src/tools/swarming_client/ b6e9e23e4..157bec8a2 (4 commits)

https://chromium.googlesource.com/infra/luci/client-py.git/+log/b6e9e23e4e79..157bec8a25cc

$ git log b6e9e23e4..157bec8a2 --date=short --no-merges --format='%ad %ae %s'
2018-11-26 bradhall Make sure to setup_googletest on all slices
2018-11-26 maruel [client] Add warning about variable flags
2018-11-21 maruel [client] Stop leaking dir 'cache' when running isolateserver_test.py
2018-11-20 maruel [client] internal refactoring adding ServerRef

Created with:
  roll-dep src/tools/swarming_client

R=bradhall@google.com

Bug: 871453,  908551 
Change-Id: Ia0d51cc0584d6df90b71722fbbc9a17bbbceb563
Reviewed-on: https://chromium-review.googlesource.com/c/1351743
Reviewed-by: John Budorick <jbudorick@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Cr-Commit-Position: refs/heads/master@{#611194}
[modify] https://crrev.com/af56597b2e83e94d3f1cfcc34f8a4b8d5f648e8a/DEPS

Owner: jbudorick@chromium.org
jbudorick@ is trooper today.
Cc: bradhall@chromium.org
Owner: sergeybe...@chromium.org
nope, sergeyberezin is primary and bradhall is secondary.
Labels: -Pri-0 Pri-1
Seems the issue is resolved. But we want to bring webkit_layout_tests back again.
#29: yes, we do. We were discussing keeping it off of mac_chromium_rel_ng until after tomorrow's branch, though.
Mergedinto: 908929
Status: Duplicate (was: Assigned)
I missed this bug since it wasn't in the trooper queue... Merging to another public bug I filed for this outage - will track the remaining progress there.
