system_health.common_desktop failing on many desktop builders due to trace size explosion
Issue description

system_health.common_desktop failing on 2 builders.

Builders failed on:
- Win 10 Perf: https://build.chromium.org/p/chromium.perf/builders/Win%2010%20Perf
- Win 8 Perf: https://build.chromium.org/p/chromium.perf/builders/Win%208%20Perf

Looks like the page is:
- browse:social:facebook_infinite_scroll@{'case': 'browse', 'group': 'social'}
Aug 16 2017
From the logs:

Currently: https://chromium-swarm.appspot.com/task?id=38023685ac652210&refresh=10&show_raw=1

(INFO) 2017-08-16 01:30:05,161 trace_data.Serialize:191 Trace sizes in bytes: {'cpuSnapshots': 695685L, 'traceEvents': 1522544015L, 'telemetry': 395223L, 'tabIds': 40L}

Chrome trace size is: 1522544015L bytes = 1.52 GB

Before: https://chromium-swarm.appspot.com/task?id=37f0ece2a8308410&refresh=10&show_raw=1

(INFO) 2017-08-12 16:50:46,436 trace_data.Serialize:191 Trace sizes in bytes: {'cpuSnapshots': 689485L, 'traceEvents': 50111338L, 'telemetry': 392807L, 'tabIds': 40L}

Chrome trace size is: 50111338L bytes = 50 MB

So something is going really wrong here.
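For reference, a quick standalone sketch (not part of telemetry/catapult; the dicts are just copied from the two log lines above) that turns the 'Trace sizes in bytes' output into human-readable sizes and the per-category blowup ratio:

# Compare the 'Trace sizes in bytes' dicts copied from the two swarming logs above.
before = {'cpuSnapshots': 689485, 'traceEvents': 50111338, 'telemetry': 392807, 'tabIds': 40}
current = {'cpuSnapshots': 695685, 'traceEvents': 1522544015, 'telemetry': 395223, 'tabIds': 40}

def human(n):
    # Convert a byte count to a human-readable string (1024-based units).
    for unit in ('B', 'KB', 'MB', 'GB'):
        if n < 1024.0:
            return '%.2f %s' % (n, unit)
        n /= 1024.0
    return '%.2f TB' % n

for key in sorted(current):
    ratio = float(current[key]) / before[key]
    print('%-13s %10s -> %10s (%.1fx)' % (key, human(before[key]), human(current[key]), ratio))
# traceEvents goes from ~47.8 MB to ~1.42 GB, a ~30x blowup; every other category is flat.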
Aug 16 2017
Started bisect job https://chromeperf.appspot.com/buildbucket_job_status/8971091656371313536
Aug 16 2017
It looks like the bisect was launched on benchmark_duration, which is going to take forever. The benchmark takes ~2h to run, and we run it 10 times at each revision during a bisect, so testing each revision will take ~20h. There are about 103 revisions in the range where it started failing, so it'll take log2(103) ≈ 7 bisect steps to find the culprit, or about 6 days, which is far longer than the overall bisect deadline of 24h.

Compare that with doing a return code bisect on browse:social:facebook_infinite_scroll, which currently takes ~15m to reliably fail. Because it's a return code bisect, each revision only needs to be run once, so assuming 15m of overhead per revision plus 15m of testing, it'll find the culprit in 7 * 30m ≈ 3.5h.

I'm going to cancel the already-launched bisect and kick off a return code bisect instead.
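Just to sanity-check the arithmetic above (nothing the bisect service actually runs; the 2h/15m/10-repeat figures are the estimates from this comment):

import math

# Back-of-the-envelope comparison of the two bisect strategies described above.
revisions_in_range = 103
steps = math.ceil(math.log(revisions_in_range, 2))  # ~7 bisect steps

# Metric bisect: ~2h per benchmark run, 10 repeats per revision.
metric_hours = steps * 2 * 10                  # ~140h, i.e. ~6 days
# Return code bisect: one ~15m run per revision plus ~15m of overhead.
return_code_hours = steps * (15 + 15) / 60.0   # ~3.5h

print('bisect steps:       %d' % steps)
print('metric bisect:      ~%d hours (~%.1f days)' % (metric_hours, metric_hours / 24.0))
print('return code bisect: ~%.1f hours' % return_code_hours)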
Aug 16 2017
NM. Canceling bisects isn't implemented, so I'm just going to kick off the next one.
Aug 16 2017
Bah, just noticed that it's a *return code* bisect on the entire benchmark, not a metric bisect. I'm still going to kick off a new bisect with the story filter limited to the facebook infinite scroll case, which should significantly speed it up.
Aug 16 2017
Started bisect job https://chromeperf.appspot.com/buildbucket_job_status/8971084986099977056
Aug 16 2017
Thanks for rekicking with a better filter. Cancelling can be done via uberchromegw if needed. And yeah, the return_code bisect ignores the metric; the dashboard just happens to send it anyway.
Aug 16 2017
Note: it looks like this is happening on all Windows desktop bots. Changing the name to reflect that.
Aug 16 2017
I also suspect that this might be responsible for the new system_health.common_desktop failures that we're seeing on Mac 10.12 Perf too, where we've all of a sudden started hitting the swarming timeout for a job (~2h). I think this is somewhat masked because system_health.common_desktop was already flaky on Mac (due to BattOr failures? I'm still looking into these).
Aug 16 2017
Issue 756013 has been merged into this issue.
Aug 16 2017
I duped bug 756013 into this one; it contained some Mac and other Windows failures. (The Linux failures were also grouped in with the Mac failures, but are unrelated. Unfortunately, there's no way to ungroup alerts in sheriff-o-matic.)

I think what's happening is that the same trace size bug is manifesting in different ways on different platforms. The Mac bots, which were already near the swarming timeout before the trace size exploded, are either:

1) failing to process the traces on the larger stories, like system_health.common_desktop/browse:social:twitter_infinite_scroll, system_health.common_desktop/browse:social:facebook_infinite_scroll, or system_health.common_desktop/multitab:misc:typical24, or
2) hitting the swarming timeout due to the increased time required to process the larger traces.
Aug 16 2017
=== Auto-CCing suspected CL author hubbe@google.com ===

Hi hubbe@google.com, the bisect results pointed to your CL, please take a look at the results.

=== BISECT JOB RESULTS ===
Test failure found with culprit

Suspected Commit
  Author : Fredrik Hubinette
  Commit : d2506f13c3fba1e6f3280c5ccb0a6802360aad8b
  Date   : Mon Aug 14 22:22:34 2017
  Subject: Improve suspending logic.

Bisect Details
  Configuration: winx64_10_perf_bisect
  Benchmark    : system_health.common_desktop
  Metric       : cpuTimeToFirstMeaningfulPaint_avg/browse_social/browse_social_facebook_infinite_scroll

Revision          Exit Code  N
chromium@494119   0 +- N/A   2   good
chromium@494172   0 +- N/A   2   good
chromium@494199   0 +- N/A   2   good
chromium@494206   0 +- N/A   2   good
chromium@494209   0 +- N/A   2   good
chromium@494211   0 +- N/A   2   good
chromium@494212   1 +- N/A   2   bad  <--
chromium@494225   1 +- N/A   2   bad

To Run This Test
  src/tools/perf/run_benchmark -v --browser=release_x64 --output-format=chartjson --upload-results --pageset-repeat=1 --also-run-disabled-tests --story-filter=browse.social.facebook.infinite.scroll system_health.common_desktop

More information on addressing performance regressions: http://g.co/ChromePerformanceRegressions
Debug information about this bisect: https://chromeperf.appspot.com/buildbucket_job_status/8971084986099977056
For feedback, file a bug with component Speed>Bisection
Aug 16 2017
It seems unlikely that my change is causing this problem. Is there a way to see what is filling up the trace logs?
Aug 16 2017
I tried to get traces, but I could only get one from before your CL: https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/trace-file-id_0-2017-08-16_12-07-30-30784.html

Here's the process to get traces from bisect bots (working on making this easier):

1) Click the "Debug information about this bisect" link (https://chromeperf.appspot.com/buildbucket_job_status/8971084986099977056)
2) Click the "buildbot link" from there (https://build.chromium.org/p/tryserver.chromium.perf/builders/winx64_10_perf_bisect/builds/1407)
3) Ctrl+F for "494211" (the revision before your CL), find the step "Bisecting revision (6)"
4) Expand "Nested step(s) for: Bisecting revision (6)"
5) Click the "json.output" link for "Bisecting revision.Reading chartjson results (11)"
6) A '"cloud_url": "https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/trace-file-id_0-2017-08-16_12-07-30-30784.html"' link is in the output under "trace"
7) Repeat for r494212
8) The test fails before trace files are uploaded -- https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Ftryserver.chromium.perf%2Fwinx64_10_perf_bisect%2F1407%2F%2B%2Frecipes%2Fsteps%2FBisecting_revision__3_%2F0%2Fsteps%2FBisecting_revision.Performance_Test_1_of_2__3_%2F0%2Fstdout

I will try running another bisect to confirm whether it is this CL.
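If anyone wants to avoid clicking through the nested steps every time, here's a rough sketch for pulling the trace links out of a saved json.output file. The exact JSON layout (a "cloud_url" key somewhere under "trace") is an assumption based on the steps above, so this just walks the whole document; treat it as illustrative rather than a supported tool:

import json
import sys

# Rough sketch: scan a saved "json.output" file from the bisect recipe step and
# print any "cloud_url" values found anywhere in it.
def find_cloud_urls(node, found=None):
    if found is None:
        found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == 'cloud_url':
                found.append(value)
            else:
                find_cloud_urls(value, found)
    elif isinstance(node, list):
        for item in node:
            find_cloud_urls(item, found)
    return found

if __name__ == '__main__':
    # Usage: python find_traces.py json.output
    with open(sys.argv[1]) as f:
        data = json.load(f)
    for url in find_cloud_urls(data):
        print(url)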
Aug 16 2017
Started bisect job https://chromeperf.appspot.com/buildbucket_job_status/8971075202851467408
Aug 17 2017
Yep, the follow-up bisect also seemed to find your CL responsible, hubbe@. Could you revert? I'm happy to look into this later to try to figure out what trace event (or events) in particular are blowing up the trace size.
Aug 17 2017
(Filed bug at https://github.com/catapult-project/catapult/issues/3781 about the bisect completing but not posting a comment to the bug.)
Aug 17 2017
I'm going to revert this for now, but I'm also spending time investigating why it seems to have blown up the trace size. Given that the CL didn't add any trace events, it's pretty confusing why it did.
Aug 17 2017
Sigh. Looks like the regression didn't happen on Linux. Going to try again on my MBP.
Aug 18 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/7fab27426a0c1ac0925b9b3f25cbbfaf515439a9

commit 7fab27426a0c1ac0925b9b3f25cbbfaf515439a9
Author: Charlie Andrews <charliea@chromium.org>
Date: Fri Aug 18 02:56:08 2017

Revert "Improve suspending logic."

This reverts commit d2506f13c3fba1e6f3280c5ccb0a6802360aad8b.

Reason for revert: Caused trace size explosion on Windows

Bug: 756003

Original change's description:
> Improve suspending logic.
>
> When loading takes too long, we can suspend the codec before
> we have a chance to load the first frame. This is all normal,
> but should only happen when we are waiting on the network.
> This CL makes sure that we have additional progress callbacks
> expected before suspending.
>
> Bug: 728468
> Change-Id: I815520e0ba832617b04948f7fbefb88498cf79b7
> Reviewed-on: https://chromium-review.googlesource.com/614360
> Reviewed-by: Dan Sanders <sandersd@chromium.org>
> Commit-Queue: Fredrik Hubinette <hubbe@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#494212}

TBR=hubbe@chromium.org,sandersd@chromium.org

# Not skipping CQ checks because original CL landed > 1 day ago.

Bug: 728468
Change-Id: I3752dc7c2beefad52ad4137797ebfea63832178e
Reviewed-on: https://chromium-review.googlesource.com/619066
Commit-Queue: Charlie Andrews <charliea@chromium.org>
Reviewed-by: Charlie Andrews <charliea@chromium.org>
Reviewed-by: Fredrik Hubinette <hubbe@chromium.org>
Cr-Commit-Position: refs/heads/master@{#495442}

[modify] https://crrev.com/7fab27426a0c1ac0925b9b3f25cbbfaf515439a9/media/blink/buffered_data_source_host_impl.cc
[modify] https://crrev.com/7fab27426a0c1ac0925b9b3f25cbbfaf515439a9/media/blink/webmediaplayer_impl.cc
[modify] https://crrev.com/7fab27426a0c1ac0925b9b3f25cbbfaf515439a9/media/blink/webmediaplayer_impl_unittest.cc
Aug 18 2017
Comment 1 by nedngu...@google.com, Aug 16 2017

Hmhh, looks like this is because the trace size is too big:

Traceback (most recent call last):
  File "c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\platform\tracing_controller_backend.py", line 289, in CollectAgentTraceData
    os.remove(self._trace_log)
WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'c:\\b\\s\\w\\itjlflj3\\tmp2loyky'

(INFO) 2017-08-14 21:35:22,592 trace_data.Serialize:191 Trace sizes in bytes: {'traceEvents': 1938027798L, 'telemetry': 382117L, 'tabIds': 40L}

Traceback (most recent call last):
  File "c:\b\s\w\ir\third_party\catapult\tracing\tracing\trace_data\..\..\bin\trace2html", line 14, in <module>
    sys.exit(trace2html.Main(sys.argv))
  File "c:\b\s\w\ir\third_party\catapult\tracing\tracing_build\trace2html.py", line 51, in Main
    config_name=args.config_name)
  File "c:\b\s\w\ir\third_party\catapult\tracing\tracing_build\trace2html.py", line 116, in WriteHTMLForTracesToFile
    trace_data = f.read()
MemoryError

So we are getting a MemoryError just trying to transform the trace from a .json file to an HTML file :-/
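For what it's worth, the failure mode is the whole ~1.9 GB trace being slurped into memory by a single f.read(). A minimal standalone illustration of the kind of size guard that would turn this into a clearer failure (not the actual trace2html code; the 512 MB threshold is made up for illustration):

import os

# Illustration only: refuse to load a trace file into memory if it is
# implausibly large, since f.read() on a ~1.9 GB file is what raises the
# MemoryError in the log above.
MAX_TRACE_BYTES = 512 * 1024 * 1024  # assumed threshold, not a real limit

def read_trace_for_conversion(path):
    size = os.path.getsize(path)
    if size > MAX_TRACE_BYTES:
        raise RuntimeError(
            'Trace %s is %d bytes (> %d); likely trace size explosion, '
            'refusing to load it into memory.' % (path, size, MAX_TRACE_BYTES))
    with open(path) as f:
        return f.read()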