
Issue 801596


Issue metadata

Status: Fixed
Owner:
Closed: Jan 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 1
Type: Bug




system_health.common_desktop failing on Win 10 High-DPI Perf

Project Member Reported by eyaich@chromium.org, Jan 12 2018

Issue description

https://uberchromegw.corp.google.com/i/chromium.perf/builders/Win%2010%20High-DPI%20Perf?numbuilds=200

So system_health.common_desktop is still failing after the fix for crbug.com/799106. It looks like it has been timing out since at least November at tracing_controller_backend.py line 291, while trying to delete a trace log file (https://cs.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/internal/platform/tracing_controller_backend.py?q=tracing_controller_backend&sq=package:chromium&l=291).

(INFO) 2018-01-12 02:42:15,437 cache_temperature.EnsurePageCacheTemperature:181  PageCacheTemperature: any
(INFO) 2018-01-12 02:44:10,438 chrome_tracing_agent.RecordClockSyncMarker:183  Chrome version: 3320
(INFO) 2018-01-12 02:45:30,969 battor_wrapper.CollectTraceData:271  CollectTraceData serial log:
(INFO) 2018-01-12 02:45:30,969 cloud_storage.Insert:378  Uploading c:\b\s\w\it2ypqoi\tmpwpyjyr to gs://chrome-telemetry-output/battor-serial-log-2018-01-12_02-45.txt-65225.txt
View BattOr serial log at https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/battor-serial-log-2018-01-12_02-45.txt-65225.txt
(ERROR) 2018-01-12 02:45:37,596 tracing_controller_backend.CollectAgentTraceData:295  Error when deleting c:\b\s\w\it2ypqoi\tmpvs6vtp, will try again at exit.
Traceback (most recent call last):
  File "c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\platform\tracing_controller_backend.py", line 291, in CollectAgentTraceData
    os.remove(self._trace_log)
WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'c:\\b\\s\\w\\it2ypqoi\\tmpvs6vtp'
(INFO) 2018-01-12 02:45:37,878 trace_data.Serialize:191  Trace sizes in bytes: {'cpuSnapshots': 1595363L, 'traceEvents': 368114071L, 'telemetry': 98040L, 'tabIds': 143L, 'powerTraceAsString': 25980869L}
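
For context on the "will try again at exit" message above, the pattern is roughly the following sketch. This is illustrative only, not the actual tracing_controller_backend.py code, and the function names are placeholders:

    import atexit
    import logging
    import os

    def _DeleteAtExit(path):
        # Retry once the process exits, when Chrome has (hopefully) released
        # its handle on the file.
        try:
            os.remove(path)
        except OSError:
            logging.warning('Still unable to delete %s at exit.', path)

    def _RemoveTraceLog(trace_log):
        try:
            # On Windows this raises WindowsError (a subclass of OSError) with
            # error 32 if another process still has the file open.
            os.remove(trace_log)
        except OSError:
            logging.exception('Error when deleting %s, will try again at exit.',
                              trace_log)
            atexit.register(_DeleteAtExit, trace_log)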

From the logs, I can't tell whether the failure is due to a particular story.

Ned and Charlie, you had a few CLs in early November around the return value of TraceValue. They could be completely unrelated, given that I don't know when the timeouts started, but do either of you have insight into which file might be locked?
 
Cc: -nednguyen@chromium.org nedngu...@google.com
Status: Assigned (was: Untriaged)
I think this error has been showing up in the Telemetry logs for as long as I can remember and, though it's certainly not good, it's a little hard to tell whether it's related to the actual failure.

The benchmark is failing because it's timing out. The run I'm looking at only ran for 1h3m (https://chromium-swarm.appspot.com/task?id=3b1468f2bb9ac810&refresh=10&show_raw=1), while the swarming timeout for this benchmark is 3 hours (https://cs.chromium.org/chromium/src/tools/perf/core/perf_data_generator.py?type=cs&q=perf_data_generator&sq=package:chromium&l=622), so we must be hitting the I/O timeout rather than the hard swarming timeout. The I/O timeout kills the swarming job when it hasn't emitted any output for too long - in this case 10 minutes.
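
To illustrate the difference between the two timeouts, here is a rough sketch of the concept (not swarming's actual code): the hard timeout counts from the start of the task and never resets, while the I/O timeout resets every time the task prints a line.

    import subprocess
    import sys
    import threading

    IO_TIMEOUT_SECONDS = 10 * 60        # no-output timeout (10 minutes)
    HARD_TIMEOUT_SECONDS = 3 * 60 * 60  # absolute limit (3 hours)

    def run_with_timeouts(cmd):
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)
        # The hard timeout never resets.
        hard_timer = threading.Timer(HARD_TIMEOUT_SECONDS, proc.kill)
        hard_timer.start()

        io_timer = [None]
        def reset_io_timer():
            # Every line of output restarts the no-output window.
            if io_timer[0]:
                io_timer[0].cancel()
            io_timer[0] = threading.Timer(IO_TIMEOUT_SECONDS, proc.kill)
            io_timer[0].start()

        reset_io_timer()
        for line in proc.stdout:
            sys.stdout.write(line.decode('utf-8', 'replace'))
            reset_io_timer()
        io_timer[0].cancel()
        hard_timer.cancel()
        return proc.wait()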

On that swarming task page, the last line of output before the I/O timeout is:

(INFO) 2018-01-15 17:28:49,075 trace_data.Serialize:191  Trace sizes in bytes: {'cpuSnapshots': 1621616L, 'traceEvents': 376059248L, 'telemetry': 98422L, 'tabIds': 144L, 'powerTraceAsString': 26052027L}

Searching for "trace sizes in bytes" in code search yields this: https://cs.chromium.org/chromium/src/third_party/catapult/tracing/tracing/trace_data/trace_data.py?type=cs&q=%22trace+sizes+in+bytes%22&sq=package:chromium&l=191

Basically, that line is emitted *right* before we call trace2html, which packages up all of the traces that we get into a single HTML file that you can open in the browser.

Looking at the actual numbers in that log line, this trace is very, very big. traceEvents is 376MB and the power trace is another 26MB, which is definitely pushing the limits of what trace viewer can handle quickly.

For reference, the previous benchmark run had a traceEvents size of 51359144 bytes (51MB) and produced these two log lines:

(INFO) 2018-01-15 17:24:47,119 trace_data.Serialize:191  Trace sizes in bytes: {'cpuSnapshots': 1636895L, 'traceEvents': 51359144L, 'telemetry': 100140L, 'tabIds': 178L, 'powerTraceAsString': 25841331L}

(WARNING) 2018-01-15 17:25:21,618 timeline_based_measurement._ComputeTimelineBasedMetrics:310  Processing resulting traces took 23.196 seconds

In other words, it took 23s to process the resulting traces for a 51MB trace. Assuming that the trace processing time scales linearly (it may be worse than that), we're probably looking at around 3 minutes just to compute the metrics. If running trace2html takes a similar amount of time, that's 6 minutes without any logs emitted.
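
For the record, the back-of-the-envelope arithmetic behind that estimate (assuming linear scaling, which is optimistic):

    # Previous story: 51,359,144-byte traceEvents processed in ~23.2 s.
    # Failing story:  376,059,248-byte traceEvents.
    prev_bytes, prev_seconds = 51359144, 23.196
    big_bytes = 376059248
    estimated_seconds = big_bytes * (prev_seconds / prev_bytes)
    print('%.0f s (~%.1f min)' % (estimated_seconds, estimated_seconds / 60.0))
    # -> ~170 s, i.e. roughly 3 minutes for metric computation alone.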

As a short-term fix, I'm going to emit a log line after trace2html finishes. This may fix the problem on its own. If it doesn't, we may want to emit a heartbeat while we're waiting for the result in _ComputeTimelineBasedMetrics.

I have a potential fix for this in the CQ. My plan is to try each of these, in order, stopping when the problem is fixed:

1) Add logging after trace2html completes
2) If the problem persists and it's apparent that trace2html isn't completing, add a heartbeat during the trace2html command. If it's apparent that the metric processing is the bottleneck, add a heartbeat during the metric computation instead (a sketch of such a heartbeat is below).

I'll check back tomorrow to see if that first fix works.
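
If it comes to option 2, the heartbeat can be as simple as a background thread that logs while the long-running call is in flight. A rough sketch (illustrative only; the class name and the place it would hook into _ComputeTimelineBasedMetrics are assumptions, not actual Telemetry code):

    import logging
    import threading

    class Heartbeat(object):
        """Logs a progress message every interval_seconds until stopped."""

        def __init__(self, message, interval_seconds=60):
            self._message = message
            self._interval = interval_seconds
            self._stop = threading.Event()
            self._thread = threading.Thread(target=self._Run)
            self._thread.daemon = True

        def _Run(self):
            # Event.wait() returns False on timeout, so this logs once per
            # interval until the stop event is set.
            while not self._stop.wait(self._interval):
                logging.info(self._message)

        def __enter__(self):
            self._thread.start()
            return self

        def __exit__(self, *args):
            self._stop.set()
            self._thread.join()

    # Usage around a long-running step, e.g. metric computation:
    #   with Heartbeat('Still computing timeline-based metrics...'):
    #       results = ComputeMetrics(trace)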

Project Member Comment 3 by bugdroid1@chromium.org, Jan 16 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/catapult/+/da6be7b0dc364fb96c688e504b2d9c70073f83d7

commit da6be7b0dc364fb96c688e504b2d9c70073f83d7
Author: Charlie Andrews <charliea@chromium.org>
Date: Tue Jan 16 15:43:18 2018

Emit output when trace2html is complete

We should be emitting output after all long-running subtasks are
complete in order to give hints to swarming that we're making forward
progress and therefore shouldn't run up against the I/O timeout. In this
case, it's not unusual for this command to take upwards of a minute and
it's immediately followed by another long-running task (calculating
metrics from the trace).

Bug:  chromium:801596 
Change-Id: Ia393ca6d5120cc3d7f5e46c82911fb82bca26204
Reviewed-on: https://chromium-review.googlesource.com/867935
Reviewed-by: Ned Nguyen <nednguyen@google.com>
Commit-Queue: Charlie Andrews <charliea@chromium.org>

[modify] https://crrev.com/da6be7b0dc364fb96c688e504b2d9c70073f83d7/telemetry/telemetry/timeline/trace_data.py
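
Conceptually, the change boils down to logging right after the long-running subprocess returns. A rough sketch of the pattern (not the actual diff to trace_data.py; the wrapper name and arguments are illustrative):

    import logging
    import subprocess

    def SerializeTraceToHtml(trace2html_cmd, html_output_path):
        # Hypothetical wrapper; the real logic lives in trace_data.Serialize().
        subprocess.check_call(trace2html_cmd)  # long-running step
        # Emitting a line here signals to swarming that we are still making
        # progress before the next long-running step (metric computation).
        logging.info('Finished running trace2html; output at %s',
                     html_output_path)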

Status: Fixed (was: Assigned)
I've confirmed that this seems to have fixed the problem for now. If this timeout happens in the future, on this platform or any other, we may have to consider either adding heartbeat messages as the tracing metrics are computed or trying to optimize the tracing metric computation script.
