
Issue 756003

Issue metadata

Status: Fixed
Owner:
Closed: Aug 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: ----
Type: ----




system_health.common_desktop failing on many desktop builders due to trace size explosion

Project Member Reported by simonhatch@chromium.org, Aug 16 2017

Issue description

system_health.common_desktop failing on 2 builders

Builders failed on: 
- Win 10 Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Win%2010%20Perf
- Win 8 Perf: 
  https://build.chromium.org/p/chromium.perf/builders/Win%208%20Perf


Looks like the failing page is:
- browse:social:facebook_infinite_scroll@{'case': 'browse', 'group': 'social'}
 
Hmm, looks like this is because the trace size is too big:

Traceback (most recent call last):
  File "c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\platform\tracing_controller_backend.py", line 289, in CollectAgentTraceData
    os.remove(self._trace_log)
WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'c:\\b\\s\\w\\itjlflj3\\tmp2loyky'
(INFO) 2017-08-14 21:35:22,592 trace_data.Serialize:191  Trace sizes in bytes: {'traceEvents': 1938027798L, 'telemetry': 382117L, 'tabIds': 40L}
Traceback (most recent call last):
  File "c:\b\s\w\ir\third_party\catapult\tracing\tracing\trace_data\..\..\bin\trace2html", line 14, in <module>
    sys.exit(trace2html.Main(sys.argv))
  File "c:\b\s\w\ir\third_party\catapult\tracing\tracing_build\trace2html.py", line 51, in Main
    config_name=args.config_name)
  File "c:\b\s\w\ir\third_party\catapult\tracing\tracing_build\trace2html.py", line 116, in WriteHTMLForTracesToFile
    trace_data = f.read()
MemoryError


So we are getting a MemoryError just trying to transform the trace from a .json file to an HTML file :-/
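For what it's worth, the MemoryError comes from `trace_data = f.read()` pulling the entire ~1.9 GB trace into memory at once. A chunked copy keeps peak memory bounded regardless of trace size. This is just a minimal sketch of the idea (trace2html ultimately needs to embed the data in HTML, so the real fix would have to stream through its templating too), not the actual trace2html code:

```python
import shutil

CHUNK = 64 * 1024 * 1024  # 64 MB chunks bound peak memory, vs. f.read()'s whole-file buffer


def copy_trace_in_chunks(src_path, dst_path):
    """Stream a large trace file to its destination instead of reading it whole."""
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        shutil.copyfileobj(src, dst, CHUNK)
```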
Cc: primiano@chromium.org erikc...@chromium.org oysteine@chromium.org fmea...@chromium.org
Components: Speed>Tracing
From the logs: 

Currently:
https://chromium-swarm.appspot.com/task?id=38023685ac652210&refresh=10&show_raw=1
(INFO) 2017-08-16 01:30:05,161 trace_data.Serialize:191  Trace sizes in bytes: {'cpuSnapshots': 695685L, 'traceEvents': 1522544015L, 'telemetry': 395223L, 'tabIds': 40L}

Chrome trace size is: 1522544015 bytes = 1.52 GB

Before:
https://chromium-swarm.appspot.com/task?id=37f0ece2a8308410&refresh=10&show_raw=1
(INFO) 2017-08-12 16:50:46,436 trace_data.Serialize:191  Trace sizes in bytes: {'cpuSnapshots': 689485L, 'traceEvents': 50111338L, 'telemetry': 392807L, 'tabIds': 40L}

Chrome trace size is: 50111338 bytes = 50 MB
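Putting the two traceEvents numbers from the logs above side by side, the jump works out to roughly a factor of 30:

```python
# traceEvents sizes taken from the two Serialize log lines above.
before = 50111338      # bytes, 2017-08-12 run (passing)
after = 1522544015     # bytes, 2017-08-16 run (failing)

ratio = after / before
print(round(after / 1e9, 2), "GB vs", round(before / 1e6, 1), "MB")
print("growth factor: ~%dx" % round(ratio))
```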


So something is going really wrong here.

It looks like the bisect was launched on benchmark_duration, which is going to take forever. The benchmark takes ~2h to run, and we run it 10 times at each revision during a bisect, so testing each revision takes ~20h. There are about 103 revisions in the range where it started failing, so the bisect needs log2(103) ~~ 7 steps to find the culprit, or about 6 days, which is far longer than the overall bisect deadline of 24h.

Compare that with doing a return code bisect on browse:social:facebook_infinite_scroll, which currently takes ~15m to reliably fail. Each revision only needs to run once because it's a return code bisect, so, assuming 15m of overhead plus 15m of testing per revision, it'll find the culprit in 7 * 30m ~~ 3.5h.
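The back-of-the-envelope comparison above, written out:

```python
import math

revisions_in_range = 103
steps = math.ceil(math.log2(revisions_in_range))  # bisect steps to isolate one commit

# Option A: benchmark_duration bisect -- 10 repeats of a ~2h benchmark per revision.
duration_hours_per_rev = 10 * 2
duration_total_days = steps * duration_hours_per_rev / 24

# Option B: return-code bisect -- one ~15m run plus ~15m overhead per revision.
return_code_hours_per_rev = 0.5
return_code_total_hours = steps * return_code_hours_per_rev

print(steps)                    # 7
print(duration_total_days)      # ~5.8 days, well past the 24h bisect deadline
print(return_code_total_hours)  # 3.5 hours
```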

I'm going to cancel the already-launched bisect and kick off a return code.
NM. Canceling bisects isn't implemented, so I'm just going to kick off the next one.
Bah, just noticed that it's a *return code* bisect on the entire benchmark, not a metric bisect. I'm still going to kick off a new bisect with the story filter limited to the facebook infinite scroll case, which should significantly speed it up.
Thanks for rekicking with a better filter.

Cancelling can be done via uberchromegw if needed. And yeah, return_code bisect ignores the metric, the dashboard just happens to send it anyway.
Summary: system_health.common_desktop failing on Windows builders (was: system_health.common_desktop failing on 2 builders)
Note: it looks like this is happening on all Windows desktop bots. Changing the name to reflect that. 
Summary: system_health.common_desktop failing on Windows builders due to too-large trace (was: system_health.common_desktop failing on Windows builders)
I also suspect that this might be responsible for the new system_health.common_desktop failures that we're seeing on Mac 10.12 Perf too, where we're suddenly starting to hit the swarming timeout for a job (~2h). I think this is somewhat masked because system_health.common_desktop was already flaky on Mac (due to BattOr failures? I'm still looking into these).
Issue 756013 has been merged into this issue.
Summary: system_health.common_desktop failing on many desktop builders due to too-large trace (was: system_health.common_desktop failing on Windows builders due to too-large trace)
I duped bug 756013 into this, which contained some Mac and other Windows failures. (The Linux failures were also grouped in with the Mac failures, but are unrelated. Unfortunately, there's no way to ungroup alerts in sheriff-o-matic.)

I think what's happening is that the same trace size bug is manifesting in different ways on different platforms. The Mac bots, which were already near the swarming timeout before the trace size exploded, are either

1) failing to process the traces on the larger stories, like system_health.common_desktop/browse:social:twitter_infinite_scroll, system_health.common_desktop/browse:social:facebook_infinite_scroll, or system_health.common_desktop/multitab:misc:typical24, or
2) hitting the swarming timeout due to the increased time required to process the larger traces.

Summary: system_health.common_desktop failing on many desktop builders due to trace size explosion (was: system_health.common_desktop failing on many desktop builders due to too-large trace)
Project Member

Comment 15 by 42576172...@developer.gserviceaccount.com, Aug 16 2017

Cc: hubbe@google.com
Owner: hubbe@google.com
Status: Assigned (was: Available)

=== Auto-CCing suspected CL author hubbe@google.com ===

Hi hubbe@google.com, the bisect results pointed to your CL, please take a look at the
results.


=== BISECT JOB RESULTS ===
Test failure found with culprit

Suspected Commit
  Author : Fredrik Hubinette
  Commit : d2506f13c3fba1e6f3280c5ccb0a6802360aad8b
  Date   : Mon Aug 14 22:22:34 2017
  Subject: Improve suspending logic.

Bisect Details
  Configuration: winx64_10_perf_bisect
  Benchmark    : system_health.common_desktop
  Metric       : cpuTimeToFirstMeaningfulPaint_avg/browse_social/browse_social_facebook_infinite_scroll

Revision             Exit Code      N
chromium@494119      0 +- N/A       2      good
chromium@494172      0 +- N/A       2      good
chromium@494199      0 +- N/A       2      good
chromium@494206      0 +- N/A       2      good
chromium@494209      0 +- N/A       2      good
chromium@494211      0 +- N/A       2      good
chromium@494212      1 +- N/A       2      bad       <--
chromium@494225      1 +- N/A       2      bad

To Run This Test
  src/tools/perf/run_benchmark -v --browser=release_x64 --output-format=chartjson --upload-results --pageset-repeat=1 --also-run-disabled-tests --story-filter=browse.social.facebook.infinite.scroll system_health.common_desktop

More information on addressing performance regressions:
  http://g.co/ChromePerformanceRegressions

Debug information about this bisect:
  https://chromeperf.appspot.com/buildbucket_job_status/8971084986099977056


For feedback, file a bug with component Speed>Bisection

Comment 16 by hubbe@chromium.org, Aug 16 2017

It seems unlikely that my change is causing this problem.
Is there a way to see what is filling up the trace logs?

I tried to get traces, but I could only get one from before your CL: https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/trace-file-id_0-2017-08-16_12-07-30-30784.html

Here's the process for getting traces from bisect bots (working on making this easier):

1) Click the "Debug information about this bisect" link (https://chromeperf.appspot.com/buildbucket_job_status/8971084986099977056)
2) Click the "buildbot link" from there (https://build.chromium.org/p/tryserver.chromium.perf/builders/winx64_10_perf_bisect/builds/1407)
3) Ctrl+F for "494211" (revision before your CL), find the step "Bisecting revision (6)"
4) Expand "Nested step(s) for: Bisecting revision (6)"
5) Click the "json.output" link for "Bisecting revision.Reading chartjson results (11)"
6) A '"cloud_url": "https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/trace-file-id_0-2017-08-16_12-07-30-30784.html"' link is in the output under "trace"
7) repeat for r494212
8) The test fails before trace files are uploaded  -- https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Ftryserver.chromium.perf%2Fwinx64_10_perf_bisect%2F1407%2F%2B%2Frecipes%2Fsteps%2FBisecting_revision__3_%2F0%2Fsteps%2FBisecting_revision.Performance_Test_1_of_2__3_%2F0%2Fstdout
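Step 6 above boils down to finding "cloud_url" values buried in the json.output blob. A small helper can do that for any nesting. This is a hypothetical convenience script, not a tool that exists in the repo, and the example blob's structure is assumed from the step 6 description:

```python
import json


def find_cloud_urls(node):
    """Recursively collect every 'cloud_url' string from a parsed json.output blob."""
    urls = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == 'cloud_url' and isinstance(value, str):
                urls.append(value)
            urls.extend(find_cloud_urls(value))
    elif isinstance(node, list):
        for item in node:
            urls.extend(find_cloud_urls(item))
    return urls


# Example with the (assumed) shape described in step 6:
blob = json.loads('{"trace": {"cloud_url": "https://console.developers.google.com/..."}}')
print(find_cloud_urls(blob))
```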

I will try running another bisect to confirm whether it is this CL.
Yep, the follow-up bisect also seemed to find your CL responsible, hubbe@. Could you revert? I'm happy to look into this later to try to figure out what trace event (or events) in particular are blowing up the trace size.
(Filed bug at https://github.com/catapult-project/catapult/issues/3781 about the bisect completing but not posting a comment to the bug.)
I'm going to revert this for now, but I'm currently investigating why it seems to have blown up the trace size. Given that the CL didn't add any trace events, it's pretty confusing why it did.
Sigh. Looks like the regression didn't happen on Linux. Going to try again on my MBP.
Project Member

Comment 23 by bugdroid1@chromium.org, Aug 18 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/7fab27426a0c1ac0925b9b3f25cbbfaf515439a9

commit 7fab27426a0c1ac0925b9b3f25cbbfaf515439a9
Author: Charlie Andrews <charliea@chromium.org>
Date: Fri Aug 18 02:56:08 2017

Revert "Improve suspending logic."

This reverts commit d2506f13c3fba1e6f3280c5ccb0a6802360aad8b.

Reason for revert: Caused trace size explosion on Windows

Bug:  756003 

Original change's description:
> Improve suspending logic.
> 
> When loading takes too long, we can suspend the codec before
> we have a chance to load the first frame. This is all normal,
> but should only happen when we are waiting on the network.
> This CL makes sure that we have additional progress callbacks
> expected before suspending.
> 
> Bug: 728468
> Change-Id: I815520e0ba832617b04948f7fbefb88498cf79b7
> Reviewed-on: https://chromium-review.googlesource.com/614360
> Reviewed-by: Dan Sanders <sandersd@chromium.org>
> Commit-Queue: Fredrik Hubinette <hubbe@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#494212}

TBR=hubbe@chromium.org,sandersd@chromium.org

# Not skipping CQ checks because original CL landed > 1 day ago.

Bug: 728468
Change-Id: I3752dc7c2beefad52ad4137797ebfea63832178e
Reviewed-on: https://chromium-review.googlesource.com/619066
Commit-Queue: Charlie Andrews <charliea@chromium.org>
Reviewed-by: Charlie Andrews <charliea@chromium.org>
Reviewed-by: Fredrik Hubinette <hubbe@chromium.org>
Cr-Commit-Position: refs/heads/master@{#495442}
[modify] https://crrev.com/7fab27426a0c1ac0925b9b3f25cbbfaf515439a9/media/blink/buffered_data_source_host_impl.cc
[modify] https://crrev.com/7fab27426a0c1ac0925b9b3f25cbbfaf515439a9/media/blink/webmediaplayer_impl.cc
[modify] https://crrev.com/7fab27426a0c1ac0925b9b3f25cbbfaf515439a9/media/blink/webmediaplayer_impl_unittest.cc

Status: Fixed (was: Assigned)
