
Issue 755981

Starred by 1 user

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: 2017-08-17
OS: All
Pri: 1
Type: Bug-Regression




Some Telemetry tests don't exit after running all stories and reporting failures

Project Member Reported by charliea@chromium.org, Aug 16 2017

Issue description

Revision range first seen: ???? (but a long time ago)

Example: https://luci-milo.appspot.com/buildbot/chromium.perf/Mac%20Air%2010.11%20Perf/1160, see battor.trivial_pages and battor.trivial_pages.reference

Both of these benchmarks failed very quickly because the attached BattOrs need to be reset. 

Digging into battor.trivial_pages (http://bit.ly/2vDn1Qc), the first log line looks like:

-----------------------------------------------------------------
(WARNING) 2017-08-15 20:00:20,200 desktop_browser_finder.FindAllAvailableBrowsers:171  Chrome build location for mac_x86_64 not found. Browser will be run without Flash.
-----------------------------------------------------------------

The end of the log looks like:

-----------------------------------------------------------------
(INFO) 2017-08-15 20:01:47,038 cloud_storage.Insert:377  Uploading /b/s/w/itT1LWMF/tmp7uruIS.png to gs://chrome-telemetry-output/profiler-file-id_0-2017-08-15_20-01-4792731.png
View generated profiler files online at https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/profiler-file-id_0-2017-08-15_20-01-4792731.png for page TrivialScrollingPageSharedPageState
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ]  TrivialScrollingPageSharedPageState

1 FAILED TEST

View result at file:///b/s/w/itT1LWMF/tmph6VKfctelemetry/results-chart.json
View result at file:///b/s/w/itT1LWMF/tmph6VKfctelemetry/test-results.json
-----------------------------------------------------------------

Based on the logging, you can reasonably infer that the test took about 1m47s to fail. However, swarming doesn't corroborate this. It reports:

Started:   8/15/2017, 11:00:06 PM (EDT)
Completed: 8/16/2017, 12:01:49 AM (EDT)

In other words, swarming says that the task takes a full hour to fail (!), and is eventually killed by a swarming shard timeout. Looking at other benchmark runs on the same bot (build125-b1) seems to corroborate swarming's story: the next benchmark to run on the bot is blink_perf.layout, which has a "pending" time of 2h4m. This is suspiciously close to the 2h we would expect to see if both battor.trivial_pages and battor.trivial_pages.reference each took an hour to time out.
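
For reference, the mismatch can be checked directly from the timestamps quoted above. This is only a rough sketch: the variable names are made up for illustration, and the log and swarming clocks are in different time zones, so only differences within each source are compared. The log span is measured between the first and last quoted log lines, so it is a lower bound on how long the test itself ran.

```python
from datetime import datetime

# Timestamp formats as they appear in the test log and in the swarming UI.
LOG_FMT = "%Y-%m-%d %H:%M:%S,%f"
SWARMING_FMT = "%m/%d/%Y, %I:%M:%S %p"

# First and last timestamped lines quoted from the battor.trivial_pages log.
first_log = datetime.strptime("2017-08-15 20:00:20,200", LOG_FMT)
last_log = datetime.strptime("2017-08-15 20:01:47,038", LOG_FMT)

# Task start and completion as reported by swarming (both EDT).
started = datetime.strptime("8/15/2017, 11:00:06 PM", SWARMING_FMT)
completed = datetime.strptime("8/16/2017, 12:01:49 AM", SWARMING_FMT)

log_span = last_log - first_log    # time covered by the log output
task_span = completed - started    # wall time of the swarming task

print("log span:    ", log_span)               # 0:01:26.838000
print("swarming:    ", task_span)              # 1:01:43
print("unaccounted: ", task_span - log_span)   # 1:00:16.162000
```

Roughly an hour of the task's wall time falls outside the quoted log lines, which lines up with a one-hour timeout firing after the test had already reported its results.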


This doesn't appear to be a Telemetry-wide problem: on the same bot, smoothness.top_25 is failing in 8m14s.

Ned suggested that we add a timestamp before the "View result at file..." log lines at the end of the trace in order to determine how long each of those uploads is taking. My suspicion, though, is that they aren't the problem, and that instead there's some problem with some atexit handler that the BattOr code registers.
 
NextAction: 2017-08-17
Going to wait a day to look at this once we have the improved atexit logging in.
Project Member

Comment 2 by bugdroid1@chromium.org, Aug 16 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/91c6b1bce0497e1d95aaf950e82e901ddbeb8219

commit 91c6b1bce0497e1d95aaf950e82e901ddbeb8219
Author: Charlie Andrews <charliea@chromium.org>
Date: Wed Aug 16 16:55:39 2017

Decrease the swarming I/O timeout for perf tests from 1h to 10m

We are having a problem where the atexit handler on some BattOr-related
code is hanging indefinitely, which in our case, means for an hour.

In general, tests should probably never hang for an hour without I/O. If
they do, we should probably special-case that test's I/O timeout rather
than having a default I/O timeout of 1 hour.

Bug: 755981
Change-Id: I29142379bc80009a7012397390e9dc2571f8db5f
Reviewed-on: https://chromium-review.googlesource.com/617102
Reviewed-by: Ned Nguyen <nednguyen@google.com>
Reviewed-by: Charlie Andrews <charliea@chromium.org>
Commit-Queue: Charlie Andrews <charliea@chromium.org>
Cr-Commit-Position: refs/heads/master@{#494828}
[modify] https://crrev.com/91c6b1bce0497e1d95aaf950e82e901ddbeb8219/testing/buildbot/chromium.perf.fyi.json
[modify] https://crrev.com/91c6b1bce0497e1d95aaf950e82e901ddbeb8219/testing/buildbot/chromium.perf.json
[modify] https://crrev.com/91c6b1bce0497e1d95aaf950e82e901ddbeb8219/tools/perf/core/perf_data_generator.py
[modify] https://crrev.com/91c6b1bce0497e1d95aaf950e82e901ddbeb8219/tools/perf/core/perf_data_generator_unittest.py
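
As an illustration of what an I/O timeout does in a case like this one, here is a small sketch of a runner that kills a child process once it has been silent for too long. It is not swarming's implementation. `run_with_io_timeout` and `HANGING_CHILD` are hypothetical names, and the child program only imitates a test that prints its results and then hangs on exit.

```python
import queue
import subprocess
import sys
import threading

# Hypothetical stand-in for a test that reports its results and then hangs.
HANGING_CHILD = "print('1 FAILED TEST', flush=True); import time; time.sleep(60)"


def run_with_io_timeout(cmd, io_timeout_s):
    """Runs cmd and kills it if it prints nothing for io_timeout_s seconds.

    Returns (returncode, lines); returncode is None if the timeout fired.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    lines = queue.Queue()

    def pump():
        # Forward each output line to the queue, then a sentinel at EOF.
        for line in proc.stdout:
            lines.put(line)
        lines.put(None)

    threading.Thread(target=pump, daemon=True).start()

    output = []
    while True:
        try:
            # The timer restarts every time the child produces a line.
            line = lines.get(timeout=io_timeout_s)
        except queue.Empty:
            proc.kill()
            proc.wait()
            return None, output
        if line is None:
            return proc.wait(), output
        output.append(line)
```

Calling `run_with_io_timeout([sys.executable, "-c", HANGING_CHILD], 2)` returns after about two seconds of silence instead of waiting for the full sleep, which is the same effect the shorter timeout has on a test stuck in an exit handler.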

Project Member

Comment 3 by bugdroid1@chromium.org, Aug 16 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/5f27e44dcf49e79d287d0068bab57e3824cdb9fe

commit 5f27e44dcf49e79d287d0068bab57e3824cdb9fe
Author: catapult-deps-roller@chromium.org <catapult-deps-roller@chromium.org>
Date: Wed Aug 16 19:51:00 2017

Roll src/third_party/catapult/ 818332ed8..18998c1fd (2 commits)

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/818332ed8043..18998c1fd0cc

$ git log 818332ed8..18998c1fd --date=short --no-merges --format='%ad %ae %s'
2017-08-16 simonhatch Dashboard - Bump alert limits for group_report.
2017-08-16 charliea Move atexit_with_log into py_utils and make BattOrWrapper use it

Created with:
  roll-dep src/third_party/catapult
BUG= 755661 ,755981


Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls


CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=sullivan@chromium.org

Change-Id: If9400d60d42f1e6dc8c413231386c43170a7f4c4
Reviewed-on: https://chromium-review.googlesource.com/617269
Reviewed-by: <catapult-deps-roller@chromium.org>
Commit-Queue: <catapult-deps-roller@chromium.org>
Cr-Commit-Position: refs/heads/master@{#494906}
[modify] https://crrev.com/5f27e44dcf49e79d287d0068bab57e3824cdb9fe/DEPS

The NextAction date has arrived: 2017-08-17
Summary: Some Telemetry tests don't exit after running all stories and reporting failures (was: BattOr test failures never take less than 1h, even when they immediately fail)
Based on what's happening in https://bugs.chromium.org/p/chromium/issues/detail?id=795060#c12, I'm going to say that this can happen with other Telemetry tests besides just the power ones.
Components: Speed>Telemetry
Owner: ----
Status: Available (was: Assigned)
Marking this as "available" given my recent pivot away from core Telemetry work
Components: Test>Telemetry
Components: -Speed>Telemetry
