All windows testers failing on chromium.perf
Issue description

Currently all four Windows configurations are failing: https://uberchromegw.corp.google.com/i/chromium.perf/waterfall

Win 10 High-DPI Perf, Win 10 Perf, Win 7 Perf and Win 7 Nvidia GPU Perf

They are all failing for the same reason and have been since June 20th. There are two problems here:
1) Why didn't these show up in SOM? They are failing purple on the waterfall.
2) More importantly, what is going on?

If you look at the last build of Win 10 Perf, https://uberchromegw.corp.google.com/i/chromium.perf/builders/Win%2010%20Perf/builds/2726, there isn't much to go on since the run fails with the first test, but there is an error in the preamble: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Win%2010%20Perf/builds/2726/steps/steps/logs/preamble

@@@STEP_CLOSED@@@
INFO:monitoring_utils:Executing command: ['C:\\infra-python\\ENV\\Scripts\\python.exe', 'C:\\infra-python\\run.py', 'infra.tools.send_monitoring_event', '--event-mon-output-file', 'E:\\b\\rr\\tmpdgb2am\\log_request_proto', '--event-mon-run-type', 'file', '--event-mon-service-name', u'buildbot/master/master.chromium.perf', '--build-event-build-name', u'Win 10 Perf', '--build-event-build-number', '2726', '--build-event-build-scheduling-time', '1530028877000', '--build-event-type', 'BUILD', '--event-mon-timestamp-kind', 'POINT', '--build-event-hostname', 'vm3-m1']
Traceback (most recent call last):
  File "C:\infra-python\infra\tools\send_monitoring_event\__main__.py", line 55, in main
    success_metric.set(common.send_build_event(opts))
  File "C:\infra-python\infra\tools\send_monitoring_event\common.py", line 283, in send_build_event
    event_timestamp=args.event_mon_event_timestamp))
TypeError: send_build_event() got an unexpected keyword argument 'fail_type'
INFO:slave.cipd_bootstrap_v2:bootstrapping CIPD
INFO:slave.cipd_bootstrap_v2:ensuring cipd client
Jun 29 2018
Issue 856864 has been merged into this issue.
Jun 29 2018
Pick a build, click the "stdio" link under step 1, and scroll to the bottom of the page: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Win%2010%20Perf/builds/2708/steps/steps/logs/stdio

You'll see: "command timed out: 2400 seconds without output, attempting to kill"

That's a no-IO timeout, i.e. the recipe was running but didn't print anything to stdout/stderr for 40 minutes and so was killed. During the step that gets timed out, the recipe is waiting to collect the results from a swarming task. The particular task for the most recent build is https://chrome-swarming.appspot.com/task?id=3e64fe55cef23f10; it's been pending for 50+ minutes. That's because the bot it's trying to run on (build118-a7) has 300+ pending tasks: https://chrome-swarming.appspot.com/tasklist?c=name&c=state&c=created_ts&c=duration&c=pending_time&c=pool&c=bot&et=1530289980000&f=id-tag%3Abuild118-a7&f=state%3APENDING&l=50&n=true&q=pendi&s=created_ts%3Adesc&st=1530203580000

We should either:
a) kill all pending tasks for that bot so it can start picking up recent tasks again, or
b) add an option to collect_task.py to print "still waiting" log messages to stdout every once in a while to avoid being killed on the no-IO timeout.
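A minimal sketch of what option (b) could look like, assuming the collect step boils down to waiting on a blocking subprocess; run_with_heartbeat and HEARTBEAT_INTERVAL are hypothetical names, not anything that exists in collect_task.py today:

import subprocess
import sys
import time

HEARTBEAT_INTERVAL = 5 * 60  # seconds between keep-alive lines (hypothetical value)


def run_with_heartbeat(collect_cmd, task_id):
  """Runs a blocking collect command, printing a keep-alive line while it is
  still running, so the 2400s no-output watchdog never fires during a
  legitimately long wait for swarming results."""
  proc = subprocess.Popen(collect_cmd)
  waited = 0
  while proc.poll() is None:
    time.sleep(HEARTBEAT_INTERVAL)
    waited += HEARTBEAT_INTERVAL
    # The periodic output itself is what keeps the build alive.
    print('Still waiting for results of task %s (%d seconds elapsed)...'
          % (task_id, waited))
    sys.stdout.flush()
  return proc.returncode

Any wrapper that emits a line more often than every 2400 seconds would be enough to keep the watchdog quiet.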
Jul 1
Ben: since we don't have permission to kill all pending tasks, can you kill all the pending tasks for all perf Windows builders?
https://ci.chromium.org/buildbot/chromium.perf/Win%2010%20High-DPI%20Perf/
https://ci.chromium.org/buildbot/chromium.perf/Win%2010%20Perf/
https://ci.chromium.org/buildbot/chromium.perf/Win%207%20Perf/
https://ci.chromium.org/buildbot/chromium.perf/Win%207%20Nvidia%20GPU%20Perf/

Emily: let's also convert the Windows builders to OBBS.
Jul 2
Issue 859501 has been merged into this issue.
Jul 2
Are we saying that the stack trace is a red herring? It seems odd that all of a sudden this just started happening on all of our Windows bots simultaneously. Is there an investigation into why it was pending on swarming for so long? Yes, we are converting to OBBS this week and next, but I am not sure that is going to fix this problem.
Jul 2
Cancelled all pending tasks for those four builders.
Jul 2
#6: yes. I'd guess that something dramatically increased the runtime of the perf tests on windows.
Jul 2
... and my guess would've been wrong. Instead, it looks like loading.desktop has always taken a while; starting on 2018-06-21, though (https://ci.chromium.org/buildbot/chromium.perf/Win%2010%20Perf/2614 on Win 10 Perf), we stopped outputting "Waiting for results from the following shards: 0" while trying to collect its results. Not sure why. That lack of output causes us to think the build is stuck, resulting in us killing it while tasks are still running. The next build's tasks are then stuck behind those tasks; they sit pending for long enough to cause the next build to be considered stuck.
Jul 2
Suspect this is due to https://chromium.googlesource.com/infra/luci/recipes-py/+/0e71eb80de4e88898496b6dcc752ca9554c0433c, though I haven't yet identified the mechanism.
Jul 2
The bootstrap change only affects starting the recipe. If the recipe started it means it already made it past all the bootstrap stuff. Were you seeing something in this build that indicated otherwise?
Jul 2
I'm looking at changes to the environment variables between builds 2613 & 2614, trying to see what changed and what would cause one of the wrappers to stop propagating swarming.py's updates. That's the most likely candidate thus far...
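(Aside: a throwaway helper along these lines is one way to compare the two builds' environment dumps, assuming each build's "env" output has been saved to a text file; this is purely illustrative and not part of any recipe code.)

import sys


def parse_env(path):
  """Parses 'KEY=VALUE' lines from a saved environment dump into a dict."""
  env = {}
  with open(path) as f:
    for line in f:
      line = line.rstrip('\n')
      if '=' in line:
        key, value = line.split('=', 1)
        env[key] = value
  return env


def diff_env(old_path, new_path):
  old, new = parse_env(old_path), parse_env(new_path)
  for key in sorted(set(old) | set(new)):
    if old.get(key) != new.get(key):
      print('%s:\n  old: %s\n  new: %s' % (key, old.get(key), new.get(key)))


if __name__ == '__main__':
  diff_env(sys.argv[1], sys.argv[2])  # e.g. env_2613.txt env_2614.txt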
Jul 2
What environment variables are different?
Jul 2
I believe jbud and Robbie are actively looking into this.
Jul 2
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build/+/495e7a1e89ac4d60caec0f4511ba0153f5daef8a

commit 495e7a1e89ac4d60caec0f4511ba0153f5daef8a
Author: Robert Iannucci <iannucci@chromium.org>
Date: Mon Jul 02 17:18:10 2018

Set PYTHONUNBUFFERED in *_run.py scripts to ensure full unbuffered-ness.

We suspect that some intermediate processes between logdog and the recipe engine may be stalling output. Not sure if this is the root cause, but it should be safe and hopefully let us reconcile the buildbot logs with the processes we see on the bot.

R=jbudorick@chromium.org
Bug: 856712
Change-Id: I5c75295ae046067b4b713f3d87b7f1b0562c70bf
Reviewed-on: https://chromium-review.googlesource.com/1122968
Commit-Queue: John Budorick <jbudorick@chromium.org>
Reviewed-by: John Budorick <jbudorick@chromium.org>

[modify] https://crrev.com/495e7a1e89ac4d60caec0f4511ba0153f5daef8a/scripts/slave/annotated_run.py
[modify] https://crrev.com/495e7a1e89ac4d60caec0f4511ba0153f5daef8a/scripts/slave/remote_run.py
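For context, the effect of that CL amounts to exporting PYTHONUNBUFFERED before launching the recipe engine, so child Python processes flush every line immediately instead of buffering it. A minimal sketch of the idea (launch_recipe_engine is a made-up name, not the actual code in annotated_run.py/remote_run.py):

import os
import subprocess


def launch_recipe_engine(cmd):
  """Launches the recipe engine with unbuffered Python I/O so that every line
  it prints reaches logdog/buildbot right away rather than sitting in a stdio
  buffer until the process exits."""
  env = os.environ.copy()
  # Equivalent to passing -u to every python in the child process tree.
  env['PYTHONUNBUFFERED'] = '1'
  return subprocess.call(cmd, env=env)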
Jul 2
I *think* the change in #15 fixed this, but OBBS landing at the same time makes it a bit hard to tell, as the bots' steps are now completely different. Will continue watching.
Jul 4
Even with the OBBS landing, all the builds are still failing with some step timing out. I think the only way to move forward with this is either adding a lot more hardware for these builders or reducing the number of tests they run until we have more :( For the numbers: each of these Windows tester configs has about 5 machines, while the total test load is about 28 hours of machine time, i.e. roughly 5.6 hours per shard. That seems to be too much for the infrastructure to handle right now.
Jul 5
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/7b30f20fb6d53d4818b38ff1b9faa465cab8afb4

commit 7b30f20fb6d53d4818b38ff1b9faa465cab8afb4
Author: Ned Nguyen <nednguyen@google.com>
Date: Thu Jul 05 01:53:45 2018

Stop performance_test_suite on 'Win 7 Nvidia GPU Perf' for debugging

Suspecting that merge script step doesn't work on Window.

NOTRY=true # test by CQ
TBR=eyaich@chromium.org
Bug: 856712
Cq-Include-Trybots: master.tryserver.chromium.perf:obbs_fyi
Change-Id: I70e82e82ce2c1be60eaeab9780dfc44eed3fc929
Reviewed-on: https://chromium-review.googlesource.com/1126561
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Reviewed-by: Ned Nguyen <nednguyen@google.com>
Cr-Commit-Position: refs/heads/master@{#572690}

[modify] https://crrev.com/7b30f20fb6d53d4818b38ff1b9faa465cab8afb4/testing/buildbot/chromium.perf.json
[modify] https://crrev.com/7b30f20fb6d53d4818b38ff1b9faa465cab8afb4/tools/perf/core/perf_data_generator.py
Jul 5
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build/+/236219928e431570b9fd7e9d889f8d2c167b8bcf

commit 236219928e431570b9fd7e9d889f8d2c167b8bcf
Author: Nghia Nguyen <nednguyen@google.com>
Date: Thu Jul 05 03:57:18 2018

Make merge script subprocess execution in collect_task streams its stdout/stderr to both the file and stdout

I suspect that the merge script step of "angle_perftest" (step 27) in https://ci.chromium.org/buildbot/chromium.perf/Win%207%20Nvidia%20GPU%20Perf/2788 is being stuck. So add this feature to get logging data about where the script is stuck.

Bug: 856712
TBR=jbudorick@chromium.org
Change-Id: I01156fff2614a2ed5a711cfebe04f2caa6c69f2e
Reviewed-on: https://chromium-review.googlesource.com/1126567
Reviewed-by: Ned Nguyen <nednguyen@google.com>
Commit-Queue: Ned Nguyen <nednguyen@google.com>

[modify] https://crrev.com/236219928e431570b9fd7e9d889f8d2c167b8bcf/scripts/slave/recipe_modules/swarming/unittests/collect_task_test.py
[modify] https://crrev.com/236219928e431570b9fd7e9d889f8d2c167b8bcf/scripts/slave/recipe_modules/swarming/resources/collect_task.py
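The streaming behavior described in that CL is essentially a tee: read the merge script's output line by line and write it to both the log file and our own stdout, so partial output survives even if the step hangs or gets killed. A rough sketch, with made-up function and argument names rather than the real collect_task.py interface:

import subprocess
import sys


def run_merge_script(merge_cmd, log_path):
  """Runs the merge script, copying its combined stdout/stderr to both a log
  file and our own stdout as it is produced."""
  with open(log_path, 'w') as log_file:
    proc = subprocess.Popen(
        merge_cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
        universal_newlines=True)
    for line in iter(proc.stdout.readline, ''):
      log_file.write(line)
      sys.stdout.write(line)
      sys.stdout.flush()
    proc.stdout.close()
  return proc.wait()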
Jul 5
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/19f10f94dfbe923a3f0f4a79020656335d8a6895

commit 19f10f94dfbe923a3f0f4a79020656335d8a6895
Author: Ned Nguyen <nednguyen@google.com>
Date: Thu Jul 05 11:32:24 2018

Set timeout limit for uploading data to perf dashboard in parallel

TBR=eyaich@chromium.org
Bug: 856712
Change-Id: I3db6fce100e3a4b92b449de50a8aa5b09c89171d
Cq-Include-Trybots: master.tryserver.chromium.perf:obbs_fyi
NOTRY=true
Reviewed-on: https://chromium-review.googlesource.com/1126962
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Reviewed-by: Ned Nguyen <nednguyen@google.com>
Cr-Commit-Position: refs/heads/master@{#572757}

[modify] https://crrev.com/19f10f94dfbe923a3f0f4a79020656335d8a6895/tools/perf/process_perf_results.py
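The timeout approach is roughly: run the per-benchmark uploads through a multiprocessing pool and bound the wait with get(timeout), so a hung upload raises instead of blocking forever. This is a sketch under assumed names (upload_benchmark, UPLOAD_TIMEOUT_SECONDS); it mirrors the "Failed uploading ... / Terminate the pool" messages in the log in the next comment, but is not the actual process_perf_results.py code:

import multiprocessing

UPLOAD_TIMEOUT_SECONDS = 30 * 60  # hypothetical limit for the whole upload step


def upload_benchmark(benchmark_name):
  # Stand-in for the real per-benchmark upload to the perf dashboard.
  print('Uploading perf results from %s benchmark' % benchmark_name)
  return benchmark_name


def upload_all(benchmark_names):
  pool = multiprocessing.Pool(processes=4)
  try:
    async_result = pool.map_async(upload_benchmark, benchmark_names)
    # get() with a timeout raises multiprocessing.TimeoutError instead of
    # blocking forever if an upload hangs, so the step can fail loudly.
    return async_result.get(UPLOAD_TIMEOUT_SECONDS)
  except multiprocessing.TimeoutError:
    print('Failed uploading benchmarks to perf dashboard in parallel')
    print('Terminate the pool')
    pool.terminate()
    raise
  finally:
    pool.close()
    pool.join()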
Jul 5
Ok, with the extra logs, we now have more clues about why the collect step timed out:
Duration of Generating perf log streams: 0 seconds
Duration of Analyzing perf json test results: 0 seconds
Uploading perf results from angle_perftests benchmark
[E2018-07-05T06:29:10.956308-07:00 4612 0 service.go:91] Failed to load private key JSON - open /creds/service_accounts/service-account-chromium-perf-histograms.json: The system cannot find the path specified.
[W2018-07-05T06:29:10.967312-07:00 4612 0 auth.go:1032] Failed to read auth token from cache: open /creds/service_accounts/service-account-chromium-perf-histograms.json: The system cannot find the path specified.
[E2018-07-05T06:29:10.967312-07:00 4612 0 service.go:91] Failed to load private key JSON - open /creds/service_accounts/service-account-chromium-perf-histograms.json: The system cannot find the path specified.
open /creds/service_accounts/service-account-chromium-perf-histograms.json: The system cannot find the path specified.
Exception in thread Thread-3:
Traceback (most recent call last):
File "E:\b\depot_tools\win_tools-2_7_6_bin\python\bin\lib\threading.py", line 810, in __bootstrap_inner
self.run()
File "E:\b\depot_tools\win_tools-2_7_6_bin\python\bin\lib\threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "E:\b\depot_tools\win_tools-2_7_6_bin\python\bin\lib\multiprocessing\pool.py", line 376, in _handle_results
task = get()
TypeError: ('__init__() takes at least 3 arguments (1 given)', <class 'subprocess.CalledProcessError'>, ())
Failed uploading benchmarks to perf dashboard in parallel
Terminate the pool
(https://logs.chromium.org/v/?s=chrome%2Fbb%2Fchromium.perf%2FWin_7_Nvidia_GPU_Perf%2F2791%2F%2B%2Frecipes%2Fsteps%2Fangle_perftests_on_NVIDIA_GPU_on_Windows_on_Windows-2008ServerR2-SP1%2F0%2Flogs%2FMerge_script_log%2F0)
So something is off with the service account on these Windows builder bots.
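For what it's worth, the TypeError in that traceback is a Python 2 pickling quirk rather than the root failure: the upload subprocess presumably raised subprocess.CalledProcessError (because of the missing credentials file), and that exception cannot survive the pickle round trip that multiprocessing's result-handler thread performs. A small repro of just that quirk, not tied to our scripts:

import pickle
import subprocess

# Under Python 2.7, CalledProcessError does not populate Exception.args, so
# BaseException.__reduce__ returns an empty argument tuple.  Re-creating the
# exception on the other side of a pickle round trip (exactly what the pool's
# _handle_results thread does) calls __init__ with no arguments and fails,
# which is the "__init__() takes at least 3 arguments (1 given)" seen above.
err = subprocess.CalledProcessError(1, ['upload'])
try:
  pickle.loads(pickle.dumps(err))
except TypeError as e:
  print('unpickling failed: %s' % e)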
Jul 5
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/d6182f70dd052d5a3d5be457f3eab73937865d89

commit d6182f70dd052d5a3d5be457f3eab73937865d89
Author: Ned Nguyen <nednguyen@google.com>
Date: Thu Jul 05 14:53:51 2018

Fix histogram service account path on windows platform

NOTRY=true
TBR=eyaich@chromium.org
Bug: chromium:856712
Cq-Include-Trybots: master.tryserver.chromium.perf:obbs_fyi
Change-Id: Ib9f44a7fb7c6454bb8dcd3a96e1b4872b4e84ae4
Reviewed-on: https://chromium-review.googlesource.com/1127142
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Reviewed-by: Ned Nguyen <nednguyen@google.com>
Cr-Commit-Position: refs/heads/master@{#572791}

[modify] https://crrev.com/d6182f70dd052d5a3d5be457f3eab73937865d89/testing/buildbot/chromium.perf.json
[modify] https://crrev.com/d6182f70dd052d5a3d5be457f3eab73937865d89/tools/perf/core/perf_data_generator.py
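The shape of that fix is just picking the credentials path per platform instead of hard-coding the POSIX one. A sketch, where the Windows path is a hypothetical placeholder (the real value comes from perf_data_generator.py and the bot config):

import sys

# The POSIX path the upload step was hard-coded to use; it does not exist on
# the Windows bots, which is why the token load failed above.
_POSIX_CREDS = '/creds/service_accounts/service-account-chromium-perf-histograms.json'
# Hypothetical Windows location for illustration only.
_WINDOWS_CREDS = 'c:\\creds\\service_accounts\\service-account-chromium-perf-histograms.json'


def histogram_service_account_path():
  """Returns the per-platform path of the histogram upload service account."""
  if sys.platform == 'win32':
    return _WINDOWS_CREDS
  return _POSIX_CREDS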
Jul 5
After the histogram service account path was fixed, the tests are running fine again: https://ci.chromium.org/buildbot/chromium.perf/Win%207%20Nvidia%20GPU%20Perf/2792
Jul 5
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/19439259b8d576ec1e5c3831331fc924ebc39c6a

commit 19439259b8d576ec1e5c3831331fc924ebc39c6a
Author: Ned Nguyen <nednguyen@google.com>
Date: Thu Jul 05 16:45:29 2018

Renable performance_test_suite on 'Win 7 Nvidia GPU Perf'

NOTRY=true
TBR=eyaich@chromium.org
Bug: 856712
Cq-Include-Trybots: master.tryserver.chromium.perf:obbs_fyi
Change-Id: Ie238a9e63f413d7995ff3d14858e5c6f04b3555a
Reviewed-on: https://chromium-review.googlesource.com/1127155
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Reviewed-by: Ned Nguyen <nednguyen@google.com>
Cr-Commit-Position: refs/heads/master@{#572815}

[modify] https://crrev.com/19439259b8d576ec1e5c3831331fc924ebc39c6a/testing/buildbot/chromium.perf.json
[modify] https://crrev.com/19439259b8d576ec1e5c3831331fc924ebc39c6a/tools/perf/core/perf_data_generator.py
Jul 9
Sorry for the many TBR'ed CLs last week. Folks were away, and all the Windows builders were failing. Here is an explanation of what I did:
1) To speed up the debug iteration cycle and to diagnose whether performance_test_suite was the cause of the I/O timeout, I disabled performance_test_suite on 'Win 7 Nvidia GPU Perf' (#18). After the issue was fixed, I re-enabled it in #25.
2) Due to the I/O timeout, the collect task step was failing completely, so no log file was produced for the merge script; hence I added the ability to stream the merge script's log to both stdout and the log file (#19).
3) After (2), there was still no log upon the I/O timeout, so I set an internal timeout for process_perf_results (#20). The internal timeout then fired and showed that we have a problem opening the credential file stored at /creds/service_accounts/service-account-chromium-perf-histograms.json.
4) I landed #22, which fixes the credential file path on Windows.