New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 856712 link

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Jul 5
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 1
Type: Bug-Regression

Blocking:
issue 862069



Sign in to add a comment

All windows testers failing on chromium.perf

Project Member Reported by eyaich@chromium.org, Jun 26 2018

Issue description

currently all four windows configurations are failing: 

https://uberchromegw.corp.google.com/i/chromium.perf/waterfall
Win 10 High-DPI perf, Win 10 Perf, Win 7 Perf and Win 7 Nvidia GPU Perf

They are all failing for the same reason and have been since June 20th.  

There are two problems here: 

1) Why didn't these show up in SOM?  They are failing purple on the waterfall

2) More importantly, what is going on.  If you look at the last build of Win 10 Perf 
https://uberchromegw.corp.google.com/i/chromium.perf/builders/Win%2010%20Perf/builds/2726

there isn't much to go on since the run fails with the first test, but there is an error in the preamble: 

https://uberchromegw.corp.google.com/i/chromium.perf/builders/Win%2010%20Perf/builds/2726/steps/steps/logs/preamble

@@@STEP_CLOSED@@@
INFO:monitoring_utils:Executing command: ['C:\\infra-python\\ENV\\Scripts\\python.exe', 'C:\\infra-python\\run.py', 'infra.tools.send_monitoring_event', '--event-mon-output-file', 'E:\\b\\rr\\tmpdgb2am\\log_request_proto', '--event-mon-run-type', 'file', '--event-mon-service-name', u'buildbot/master/master.chromium.perf', '--build-event-build-name', u'Win 10 Perf', '--build-event-build-number', '2726', '--build-event-build-scheduling-time', '1530028877000', '--build-event-type', 'BUILD', '--event-mon-timestamp-kind', 'POINT', '--build-event-hostname', 'vm3-m1']
Traceback (most recent call last):
  File "C:\infra-python\infra\tools\send_monitoring_event\__main__.py", line 55, in main
    success_metric.set(common.send_build_event(opts))
  File "C:\infra-python\infra\tools\send_monitoring_event\common.py", line 283, in send_build_event
    event_timestamp=args.event_mon_event_timestamp))
TypeError: send_build_event() got an unexpected keyword argument 'fail_type'
INFO:slave.cipd_bootstrap_v2:bootstrapping CIPD
INFO:slave.cipd_bootstrap_v2:ensuring cipd client 

 

Comment 1 by eyaich@chromium.org, Jun 26 2018

Cc: eyaich@chromium.org
Cc: oysteine@google.com yukishiino@chromium.org haraken@chromium.org adithyas@chromium.org jbroman@chromium.org
 Issue 856864  has been merged into this issue.
Pick a build, click the "stdio" link under step 1, and scroll to the bottom of the page:
https://uberchromegw.corp.google.com/i/chromium.perf/builders/Win%2010%20Perf/builds/2708/steps/steps/logs/stdio

You'll see:
"command timed out: 2400 seconds without output, attempting to kill"

That's a no IO timeout, ie: the recipe was running and didn't print anything to stdout/stderr for 40min and so was killed. During the step that gets timedout, the recipe is waiting to collect the results from a swarming task. The particular task for the most recent build is:
https://chrome-swarming.appspot.com/task?id=3e64fe55cef23f10

It's been pending for 50+ min. That's because the bot it's trying to run on (build118-a7) has 300+ pending tasks.
https://chrome-swarming.appspot.com/tasklist?c=name&c=state&c=created_ts&c=duration&c=pending_time&c=pool&c=bot&et=1530289980000&f=id-tag%3Abuild118-a7&f=state%3APENDING&l=50&n=true&q=pendi&s=created_ts%3Adesc&st=1530203580000

We should either:
a) kill all pending tasks for that bot so it can start picking up recent tasks again
or
b) add an option to collect_task.py to "still waiting" log messages to stdout every once and a while to avoid getting IO killed
Owner: bpastene@chromium.org
Status: Assigned (was: Untriaged)
Ben: since we don't have permission to kill all pending tasks, can you kill all the pending tassks for all perf windows builders?

https://ci.chromium.org/buildbot/chromium.perf/Win%2010%20High-DPI%20Perf/
https://ci.chromium.org/buildbot/chromium.perf/Win%2010%20Perf/
https://ci.chromium.org/buildbot/chromium.perf/Win%207%20Perf/
https://ci.chromium.org/buildbot/chromium.perf/Win%207%20Nvidia%20GPU%20Perf/

Emily: let's also convert windows builders to OBBS
Cc: u...@chromium.org hablich@chromium.org mythria@chromium.org
 Issue 859501  has been merged into this issue.
Are we saying that the stack trace is a red herring? It seems odd that all of the sudden this just started happening on all of our windows bots simultaneously.  Is there investigation into why it was pending on swarming for so long?

Yes we are converting to OBBS this week and next but I am not sure if that is going to fix this problem. 
Cancelled all pending tasks for those four builders.
#6: yes. I'd guess that something dramatically increased the runtime of the perf tests on windows.
... and my guess would've been wrong. Instead, it looks like loading.desktop has always taken a while; starting on 2018-06-21, though (https://ci.chromium.org/buildbot/chromium.perf/Win%2010%20Perf/2614 on Win 10 Perf), we stopped outputting "Waiting for results from the following shards: 0" while trying to collect its results. Not sure why.

That lack of output causes us to think the build is stuck, resulting in us killing it while tasks are still running. The next build's tasks are then stuck behind those tasks; they sit pending for long enough to cause the next build to be considered stuck.
Cc: iannucci@chromium.org
Suspect this is due to https://chromium.googlesource.com/infra/luci/recipes-py/+/0e71eb80de4e88898496b6dcc752ca9554c0433c, though I haven't yet identified the mechanism.
The bootstrap change only affects starting the recipe. If the recipe started it means it already made it past all the bootstrap stuff. Were you seeing something in this build that indicated otherwise?
I'm looking at changes to the environment variables between the 2613 & 2614 and am trying to see what changed and what would cause one of the wrappers to stop propagating swarming.py's updates. That's the most likely candidate thus far...
What environment variables are different?
Owner: jbudorick@chromium.org
I believe jbud and Robbie and actively looking into this.
Project Member

Comment 15 by bugdroid1@chromium.org, Jul 2

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build/+/495e7a1e89ac4d60caec0f4511ba0153f5daef8a

commit 495e7a1e89ac4d60caec0f4511ba0153f5daef8a
Author: Robert Iannucci <iannucci@chromium.org>
Date: Mon Jul 02 17:18:10 2018

Set PYTHONUNBUFFERED in *_run.py scripts to ensure full unbuffered-ness.

We suspect that some intermediate processes between logdog and the
recipe engine may be stalling output. Not sure if this is the root cause,
but it should be safe and hopefully let us reconcile the buildbot logs
with the processes we see on the bot.

R=jbudorick@chromium.org

Bug:  856712 
Change-Id: I5c75295ae046067b4b713f3d87b7f1b0562c70bf
Reviewed-on: https://chromium-review.googlesource.com/1122968
Commit-Queue: John Budorick <jbudorick@chromium.org>
Reviewed-by: John Budorick <jbudorick@chromium.org>

[modify] https://crrev.com/495e7a1e89ac4d60caec0f4511ba0153f5daef8a/scripts/slave/annotated_run.py
[modify] https://crrev.com/495e7a1e89ac4d60caec0f4511ba0153f5daef8a/scripts/slave/remote_run.py

I *think* the change in #15 fixed this, but OBBS landing at the same time makes it a bit hard to tell, as the bots' steps are now completely different. Will continue watching.
Even with the OBBS land, all the build are still failed with some step timed out.
I truly think the only way to probably move forward with this is either adding a lot more hardwares for these builders or reduce number of tests of these until we have more :(

For number, each of these windows tester configs have about 5 machines, whereas the number of test machine time are 28 hours of test. 5.6 hours for each shard seem to be too much for the infrastructure to handle right now.
Project Member

Comment 18 by bugdroid1@chromium.org, Jul 5

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/7b30f20fb6d53d4818b38ff1b9faa465cab8afb4

commit 7b30f20fb6d53d4818b38ff1b9faa465cab8afb4
Author: Ned Nguyen <nednguyen@google.com>
Date: Thu Jul 05 01:53:45 2018

Stop performance_test_suite on 'Win 7 Nvidia GPU Perf' for debugging

Suspecting that merge script step doesn't work on Window.

NOTRY=true  # test by CQ
TBR=eyaich@chromium.org

Bug:  856712 
Cq-Include-Trybots: master.tryserver.chromium.perf:obbs_fyi
Change-Id: I70e82e82ce2c1be60eaeab9780dfc44eed3fc929
Reviewed-on: https://chromium-review.googlesource.com/1126561
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Reviewed-by: Ned Nguyen <nednguyen@google.com>
Cr-Commit-Position: refs/heads/master@{#572690}
[modify] https://crrev.com/7b30f20fb6d53d4818b38ff1b9faa465cab8afb4/testing/buildbot/chromium.perf.json
[modify] https://crrev.com/7b30f20fb6d53d4818b38ff1b9faa465cab8afb4/tools/perf/core/perf_data_generator.py

Project Member

Comment 19 by bugdroid1@chromium.org, Jul 5

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build/+/236219928e431570b9fd7e9d889f8d2c167b8bcf

commit 236219928e431570b9fd7e9d889f8d2c167b8bcf
Author: Nghia Nguyen <nednguyen@google.com>
Date: Thu Jul 05 03:57:18 2018

Make merge script subprocess execution in collect_task streams its stdout/stderr to both the file and stdout

I suspect that the merge script step of "angle_perftest" (step 27) in https://ci.chromium.org/buildbot/chromium.perf/Win%207%20Nvidia%20GPU%20Perf/2788 is being stuck. So
add this feature to get logging data about where the script is stuck.

Bug:856712
Change-Id: I01156fff2614a2ed5a711cfebe04f2caa6c69f2e

TBR=jbudorick@chromium.org

Change-Id: I01156fff2614a2ed5a711cfebe04f2caa6c69f2e
Reviewed-on: https://chromium-review.googlesource.com/1126567
Reviewed-by: Ned Nguyen <nednguyen@google.com>
Commit-Queue: Ned Nguyen <nednguyen@google.com>

[modify] https://crrev.com/236219928e431570b9fd7e9d889f8d2c167b8bcf/scripts/slave/recipe_modules/swarming/unittests/collect_task_test.py
[modify] https://crrev.com/236219928e431570b9fd7e9d889f8d2c167b8bcf/scripts/slave/recipe_modules/swarming/resources/collect_task.py

Project Member

Comment 20 by bugdroid1@chromium.org, Jul 5

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/19f10f94dfbe923a3f0f4a79020656335d8a6895

commit 19f10f94dfbe923a3f0f4a79020656335d8a6895
Author: Ned Nguyen <nednguyen@google.com>
Date: Thu Jul 05 11:32:24 2018

Set timeout limit for uploading data to perf dashboard in parallel

TBR=eyaich@chromium.org

Bug:856712
Change-Id: I3db6fce100e3a4b92b449de50a8aa5b09c89171d
Cq-Include-Trybots: master.tryserver.chromium.perf:obbs_fyi

NOTRY=true

Change-Id: I3db6fce100e3a4b92b449de50a8aa5b09c89171d
Reviewed-on: https://chromium-review.googlesource.com/1126962
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Reviewed-by: Ned Nguyen <nednguyen@google.com>
Cr-Commit-Position: refs/heads/master@{#572757}
[modify] https://crrev.com/19f10f94dfbe923a3f0f4a79020656335d8a6895/tools/perf/process_perf_results.py

Ok, with the extra logs, we now have more clues about why the collect step timed out:

Duration of Generating perf log streams: 0 seconds
Duration of Analyzing perf json test results: 0 seconds
Uploading perf results from angle_perftests benchmark
[E2018-07-05T06:29:10.956308-07:00 4612 0 service.go:91] Failed to load private key JSON - open /creds/service_accounts/service-account-chromium-perf-histograms.json: The system cannot find the path specified.
[W2018-07-05T06:29:10.967312-07:00 4612 0 auth.go:1032] Failed to read auth token from cache: open /creds/service_accounts/service-account-chromium-perf-histograms.json: The system cannot find the path specified.
[E2018-07-05T06:29:10.967312-07:00 4612 0 service.go:91] Failed to load private key JSON - open /creds/service_accounts/service-account-chromium-perf-histograms.json: The system cannot find the path specified.
open /creds/service_accounts/service-account-chromium-perf-histograms.json: The system cannot find the path specified.
Exception in thread Thread-3:
Traceback (most recent call last):
  File "E:\b\depot_tools\win_tools-2_7_6_bin\python\bin\lib\threading.py", line 810, in __bootstrap_inner
    self.run()
  File "E:\b\depot_tools\win_tools-2_7_6_bin\python\bin\lib\threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "E:\b\depot_tools\win_tools-2_7_6_bin\python\bin\lib\multiprocessing\pool.py", line 376, in _handle_results
    task = get()
TypeError: ('__init__() takes at least 3 arguments (1 given)', <class 'subprocess.CalledProcessError'>, ())
Failed uploading benchmarks to perf dashboard in parallel
Terminate the pool

(https://logs.chromium.org/v/?s=chrome%2Fbb%2Fchromium.perf%2FWin_7_Nvidia_GPU_Perf%2F2791%2F%2B%2Frecipes%2Fsteps%2Fangle_perftests_on_NVIDIA_GPU_on_Windows_on_Windows-2008ServerR2-SP1%2F0%2Flogs%2FMerge_script_log%2F0)

So something is off with the service account on these Windows builder bots
Project Member

Comment 22 by bugdroid1@chromium.org, Jul 5

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/d6182f70dd052d5a3d5be457f3eab73937865d89

commit d6182f70dd052d5a3d5be457f3eab73937865d89
Author: Ned Nguyen <nednguyen@google.com>
Date: Thu Jul 05 14:53:51 2018

Fix histogram service account path on windows platform

NOTRY=true
TBR=eyaich@chromium.org

Bug:  chromium:856712 
Cq-Include-Trybots: master.tryserver.chromium.perf:obbs_fyi
Change-Id: Ib9f44a7fb7c6454bb8dcd3a96e1b4872b4e84ae4
Reviewed-on: https://chromium-review.googlesource.com/1127142
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Reviewed-by: Ned Nguyen <nednguyen@google.com>
Cr-Commit-Position: refs/heads/master@{#572791}
[modify] https://crrev.com/d6182f70dd052d5a3d5be457f3eab73937865d89/testing/buildbot/chromium.perf.json
[modify] https://crrev.com/d6182f70dd052d5a3d5be457f3eab73937865d89/tools/perf/core/perf_data_generator.py

Status: Fixed (was: Assigned)
After the histogram service account fixed, the tests are running fine again.

https://ci.chromium.org/buildbot/chromium.perf/Win%207%20Nvidia%20GPU%20Perf/2792
Owner: nednguyen@chromium.org
Project Member

Comment 25 by bugdroid1@chromium.org, Jul 5

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/19439259b8d576ec1e5c3831331fc924ebc39c6a

commit 19439259b8d576ec1e5c3831331fc924ebc39c6a
Author: Ned Nguyen <nednguyen@google.com>
Date: Thu Jul 05 16:45:29 2018

Renable performance_test_suite on 'Win 7 Nvidia GPU Perf'

NOTRY=true
TBR=eyaich@chromium.org

Bug:  856712 
Cq-Include-Trybots: master.tryserver.chromium.perf:obbs_fyi
Change-Id: Ie238a9e63f413d7995ff3d14858e5c6f04b3555a
Reviewed-on: https://chromium-review.googlesource.com/1127155
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Reviewed-by: Ned Nguyen <nednguyen@google.com>
Cr-Commit-Position: refs/heads/master@{#572815}
[modify] https://crrev.com/19439259b8d576ec1e5c3831331fc924ebc39c6a/testing/buildbot/chromium.perf.json
[modify] https://crrev.com/19439259b8d576ec1e5c3831331fc924ebc39c6a/tools/perf/core/perf_data_generator.py

Sorry for many TBR'ed CLs last week. Folks were being away, and all the windows builders were failing. Here is an explanation of what I did:

1) To improve the iteration debug cycle and to diagnose whether performance_test_suite is the cause of the I/O timeout, I disable the performance_test_suite on  'Win 7 Nvidia GPU Perf' (#18). After issue is fixed, I reenabled in #25.

2) Due to the I/O timeout, collec_test_step was failing completely, thus no log file was produced for the merge_script, hence I added the ability to stream the log of merge_script both to stdout & both to the log file (#19)

3) After (2), there is still no log upon the I/O timeout, so I set an internal timeout for process_perf_results (#20). Now the internal timeout is fired, and shows that we have problem opening the credential file stored in /creds/service_accounts/service-account-chromium-perf-histograms.json

4) I land #22 which fix the credential file path on Windows.
Blocking: 862069

Sign in to add a comment