New issue
Advanced search Search tips

Issue 830834 link

Starred by 1 user

Issue metadata

Status: WontFix
Owner: ----
Closed: Jul 27
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug-Regression

Blocked on:
issue 832207



Sign in to add a comment

media.mobile reference builds fail but error points to telemetry startup code.

Project Member Reported by chcunningham@chromium.org, Apr 9 2018

Issue description

Ref graph busted. Time to audio play regression.
 
All graphs for this bug:
  https://chromeperf.appspot.com/group_report?bug_id=830834

(For debugging:) Original alerts at time of bug-filing:
  https://chromeperf.appspot.com/group_report?sid=4b93ada58e4c74d300216ca461b6fad406bd8e490603f97f31de41e57a898d50


Bot(s) for this bug's original alert(s):

android-nexus7v2
📍 Couldn't reproduce a difference.
https://pinpoint-dot-chromeperf.appspot.com/job/12c0108cc40000
Project Member

Comment 5 by 42576172...@developer.gserviceaccount.com, Apr 10 2018

📍 Couldn't reproduce a difference.
https://pinpoint-dot-chromeperf.appspot.com/job/148a5a6cc40000
Cc: sullivan@chromium.org crouleau@chromium.org
The ref graph hasn't produced a value in days - its build is failing.

https://ci.chromium.org/buildbot/chromium.perf/Android%20Nexus7v2%20Perf/
https://chromeperf.appspot.com/group_report?bug_id=830834

I'm tempted to close the bug (2 pinpoints, no repro), but I want to make sure we have eyes on this ref graph issue. Who should I follow up with for that bot?
Chris, if you can find the error message from the failing build and post it here or file a new bug, then we can figure out what is wrong.
Sure. Of the last 25 builds (going back to April 5), 22 failed. 

One thing I don't follow is, why didn't the passing builds update the ref graph? We have one as recently as April 6?

The failures have a varied failure reasons, but one thing that is common is: 
"failed loading.mobile on Android"

When I pop open the last 3 failed builds, I see "Unexpected Failures: loading.mobile/FlipKart"

Looking at the log, we see the renderer is crashing, but the stack is not symbolized. I'm building android chrome now...
I'm not able to get a crash directing my dev build to flipkart.com. I'm using a nexus7 and the exact build from the swarming bot. 

I've attempted to run the benchmark using my binary, but can't seem to get that to work... please help.

$ tools/perf/run_benchmark media.mobile --device="android" --browser=exact --browser-executable=/ssd/blink3/src/out/Android/apks/ChromePublic.apk
(...lots of debug output...)
Cannot find browser of type exact. To list out all available browsers, rerun your command with --browser=list


$ tools/perf/run_benchmark loading.mobile --device="android" --browser=list
(... lots of debug output ...)
Available browsers:
   Android device 079030cf

$ tools/perf/run_benchmark loading.mobile --device="android" --browser="Android device 079030cf"
Cannot find browser of type Android device 079030cf. To list out all available browsers, rerun your command with --browser=list
Owner: nedngu...@google.com
Status: Assigned (was: Untriaged)
Summary: media.mobile reference builds fail but error points to telemetry startup code. (was: 84.1% regression in media.mobile at 548640:548708)
flipkart tests are independent of media.mobile. Their failure should have no effect on whether the media.mobile.reference benchmark is run.

I got to https://ci.chromium.org/buildbot/chromium.perf/Android%20Nexus7v2%20Perf/

are search for "media.mobile.reference"

I see this error:

Unexpected Failures:
* media.mobile/video.html?src=crowd.ogg&type=audio
 
so I click the "stdout" link under that error. That takes me to a log file that has all the logs from running media.mobile.reference.

in those logs I see a couple errors:

"
Traceback (most recent call last):
  _RunStoryAndProcessErrorIfNeeded at /b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/story_runner.py:147
    test.DidRunStory(state.platform, results)
  DidRunStory at /b/swarming/w/ir/third_party/catapult/telemetry/telemetry/web_perf/timeline_based_measurement.py:289
    trace_result, _ = platform.tracing_controller.StopTracing()
  StopTracing at /b/swarming/w/ir/third_party/catapult/telemetry/telemetry/core/tracing_controller.py:53
    return self._tracing_controller_backend.StopTracing()
  StopTracing at /b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/tracing_controller_backend.py:130
    self._IssueClockSyncMarker()
  _IssueClockSyncMarker at /b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/tracing_controller_backend.py:234
    self._RecordIssuerClockSyncMarker)
  RecordClockSyncMarker at /b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/tracing_agent/chrome_tracing_agent.py:178
    raise ChromeClockSyncError('Cannot issue clock sync. No devtools clients')
"

"
Traceback (most recent call last):
  File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/story_runner.py", line 116, in _RunStoryAndProcessErrorIfNeeded
    state.WillRunStory(story)
  File "/b/swarming/w/ir/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py", line 75, in traced_function
    return func(*args, **kwargs)
  File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/page/shared_page_state.py", line 234, in WillRunStory
    self._StartBrowser(page)
  File "/b/swarming/w/ir/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py", line 75, in traced_function
    return func(*args, **kwargs)
  File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/page/shared_page_state.py", line 198, in _StartBrowser
    self._browser = self._possible_browser.Create()
  File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/backends/chrome/android_browser_finder.py", line 165, in Create
    return self._GetBrowserInstance(existing=False)
  File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/backends/chrome/android_browser_finder.py", line 187, in _GetBrowserInstance
    find_existing=existing)
  File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/browser/browser.py", line 49, in __init__
    self._browser_backend.Start(startup_args, startup_url=startup_url)
  File "/b/swarming/w/ir/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py", line 75, in traced_function
    return func(*args, **kwargs)
  File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/backends/chrome/android_browser_backend.py", line 73, in Start
    blocking=True)
  File "/b/swarming/w/ir/third_party/catapult/devil/devil/android/decorators.py", line 57, in timeout_retry_wrapper
    retry_if_func=retry_if_func)
  File "/b/swarming/w/ir/third_party/catapult/devil/devil/utils/timeout_retry.py", line 158, in Run
    error_log_func=error_log_func)
  File "/b/swarming/w/ir/third_party/catapult/devil/devil/utils/reraiser_thread.py", line 189, in JoinAll
    self._JoinAll(watcher, timeout)
  File "/b/swarming/w/ir/third_party/catapult/devil/devil/utils/reraiser_thread.py", line 154, in _JoinAll
    (len(alive_threads), len(self._threads)))
CommandTimeoutError: Timed out waiting for 1 of 1 threads.
"

None of the errors have any media-specific code in their traces. This is another example of Telemetry failures at startup being incorrectly summarized as a failure of a specific story. This type of thing makes it really difficult for people who don't know the details of the whole system to understand what is going on, and I think it needs to be fixed by the Telemetry team once they are done putting out fires.
This is definitely a browser crash: https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/ba8e63d8-3e70-11e8-a2d6-0242ac110005

(Tip: search for "Uploading screenshot".. in the log to find these. Note that we are also working with SOM team to surface these links in SOM directly -  issue 772212 )

Also apologize for the confusing log. Making the log easier to understand for people definitely on our radar once he fire are putting out.
Cc: perezju@chromium.org
Components: Speed>Telemetry
I think the Telemetry bug here is why we don't see Android logcat upon the browser crash.. Hmhh
+Juan: I think we may lost the ability to get system log upon browser startup crash with the refactoring to move browser startup to browser_finder*
Owner: nednguyen@chromium.org
Blockedon: 832207
Owner: ----
Status: Available (was: Assigned)
I don't have expertise to debug Clank crash, so leave this bug available. However, filing issue 832207 as blocker for this
Thanks a ton for looking. New questions:


0.
> I got to https://ci.chromium.org/buildbot/chromium.perf/Android%20Nexus7v2%20Perf/
> are search for "media.mobile.reference"

When I search (cntrl f) for media.mobile.reference on that page, I don't find anything?

1. 
From comment #11, sounds like the browser (not renderer?) really did crash. I'm not able to open this link (Should I be able? I get 403): https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/ba8e63d8-3e70-11e8-a2d6-0242ac110005

So still not clear whether media/ code is possibly at fault.

2. Re comment #10 - do we also have some telemetry errors in addition to the browser crash?

3. Anyone know how to fix my tools/perf/run_benchmark commands from Comment #9 to run the test on my own android build?
0. 

after you go to https://ci.chromium.org/buildbot/chromium.perf/Android%20Nexus7v2%20Perf/ and click on #1603 to get to https://ci.chromium.org/buildbot/chromium.perf/Android%20Nexus7v2%20Perf/1603

and then you will be able to find "media.mobile.reference".

https://ci.chromium.org/buildbot/chromium.perf/Android%20Nexus7v2%20Perf/  is the main page that links to each of the runs on that bot. That bot is running tests continuously.


1. It works for me, so maybe there is some permission issue. Are you logged into your @google.com account on pantheon.corp.google.com ?

2. Ned, please look into these. it's difficult to find the real error source when there are a bunch of stack traces all over the place.

3. 

I run 

$tools/perf/run_benchmark loading.mobile --device="android" --browser=list
(WARNING) 2018-04-12 13:14:00,068 binary_manager._WrapDependencyManagerFunction:59  Cannot find path for chrome_stable on platform android_n_armeabi-v7a. Falling back to android_armeabi-v7a.
Available browsers:
   Android device ZX1G22L8XD
     android-chrome-dev
     android-system-chrome

so then I run
$ tools/perf/run_benchmark loading.mobile --device="android" --browser=android-chrome-dev

Perhaps your device doesn't have Chrome on it? Go to the play store and install Chrome Dev.


0. thanks
1. fixed now
2. I'd love to see a stack trace for the *chrome* crash - not finding one?

3. 
My device has my dev build of chromium on it. Is there some way to tell it to use that?

It also has dev channel chrome... but my --browser=list command just yields this cryptic result:

$ tools/perf/run_benchmark loading.mobile --device="android" --browser=list
(... tons of debug output... )

Available browsers:
   Android device 079030cf

Finally, a new set of questions...
A. how often does the ref get updated?
B. how do I determine the current revision is for the ref?

What I'd like to do in these situations where the ref graph has been busted for a long time is just check in on A and B above and make sure things are still working as expected. 

Also, getting the call stack for the crash would help me be able to query crbug for that stack and see that the issue was addressed.
#18: simpifying stack trace in Telemetry in general a hard problem. The reason is because is the life cycle of a benchmark is complex, there are many actors involved in it, e.g: telemetry, the browser, the OS, the web page replay service, battor...

So when a crash happens, we need to make sure that all the available logging is there to make sure that we the right data that capture the error state. The problem with this is the log then becomes difficult to consume.

So we have two choices: simple log with the worry that we won't have the right data to debug; complex log which is hard for people who are new to the framework to understand. So far we have picked former (as difficult to debug is better than impossible to debug).

Having said that, a few things we are trying now is to split the log to artifacts which will be stored separately ( issue 772212 ).

I am also open to other suggestions in how we make the log simpler yet still contain all the relevant info for debugging.
> complex log which is hard for people who are new to the framework to understand. So far we have picked former (as difficult to debug is better than impossible to debug)

I agree, this is the safer bet. What I'm confused about is that the log doesn't appear to have a chrome stack trace? 

Also, bump to my other questions in comment 19.
#21: In general, if browser crashes at startup, we won't get chrome stack trace. The best we have is system log which can shows stuff relevant to Chrome (issue 832207)
#22: Is this true of debugging android in general, or some limitation of the test infra? My thinking is I can catch chrome stack traces at startup easily if I just start the logcat before starting chrome.
#23: what I meant is we won't get chrome stack trace from the tombstone. But the logcat usually contain Chrome stack as you mentioned.
Thanks. Is the logcat uploaded somewhere?
#25: not yet (until we fix issue 832207)
Ok, thanks. With that I think the bug is ready to be closed. Regression could not be reproduced, and I assume the ref crash is fixed in later revisions (as usual).

I just have these 2 lingering questions for next time I go looking ref-debugging.

A. how often does the ref get updated?
B. how do I determine the current revision is for the ref?
Cc: -crouleau@chromium.org nedngu...@google.com
Project Member

Comment 29 by bugdroid1@chromium.org, Apr 16 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/catapult/+/5d0c16d725ca28dcadddd44d2309ea2257e8b92b

commit 5d0c16d725ca28dcadddd44d2309ea2257e8b92b
Author: Caleb Rouleau <crouleau@chromium.org>
Date: Mon Apr 16 17:26:32 2018

[Telemetry] Explain when no browsers available for a device

Bug:  chromium:830834 
Change-Id: Idd8f25d2863f2235951783665ae0a6972c76616e
Reviewed-on: https://chromium-review.googlesource.com/1011107
Commit-Queue: Caleb Rouleau <crouleau@chromium.org>
Reviewed-by: Ned Nguyen <nednguyen@google.com>

[modify] https://crrev.com/5d0c16d725ca28dcadddd44d2309ea2257e8b92b/telemetry/telemetry/internal/browser/browser_options.py

Ned, re comment 20: do any of the non-chromium stacks point to some other telemetry issue we should be tracking in this bug (i.e. is the current title accurate?)

> A. how often does the ref get updated?

I have mostly answered this by just mousing over the graph and checking the revs. Looks like refs are usually updated often - anywhere from every 2-10 revisions.


#30:
> do any of the non-chromium stacks point to some other telemetry issue we should be tracking in this bug

Not that I am aware of. This is purely a Chrome Android startup problem

> how often does the ref get updated?

Ideally once per quarter. Right now it's a script that someone needs to run. I will try to make a update now.
Project Member

Comment 32 by bugdroid1@chromium.org, Apr 16 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/4c4a55dce59dae529d82f7f4371a459e12e6373c

commit 4c4a55dce59dae529d82f7f4371a459e12e6373c
Author: catapult-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com <catapult-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Date: Mon Apr 16 22:03:58 2018

Roll src/third_party/catapult/ 037198303..5d0c16d72 (1 commit)

https://chromium.googlesource.com/catapult.git/+log/0371983030f2..5d0c16d725ca

$ git log 037198303..5d0c16d72 --date=short --no-merges --format='%ad %ae %s'
2018-04-12 crouleau [Telemetry] Explain when no browsers available for a device

Created with:
  roll-dep src/third_party/catapult
BUG= chromium:830834 


The AutoRoll server is located here: https://catapult-roll.skia.org

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, please contact the current sheriff, who should
be CC'd on the roll, and stop the roller if necessary.


TBR=sullivan@chromium.org

Change-Id: I89bb2ef6dda43a6c723fc7686e01f9278876c1d3
Reviewed-on: https://chromium-review.googlesource.com/1014276
Commit-Queue: catapult-chromium-autoroll <catapult-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Reviewed-by: catapult-chromium-autoroll <catapult-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Cr-Commit-Position: refs/heads/master@{#551139}
[modify] https://crrev.com/4c4a55dce59dae529d82f7f4371a459e12e6373c/DEPS

>> do any of the non-chromium stacks point to some other telemetry issue we should be tracking in this bug

> Not that I am aware of. This is purely a Chrome Android startup problem

Ack. In that case we can close and the follow up work of getting the adb log with chrome stack traces is already tracked in the separate bug you filed.


>> how often does the ref get updated?

> Ideally once per quarter. Right now it's a script that someone needs to run. I will try to make a update now.


Interesting! So, when I click around in ref graph in the UI and observe that the "Chromium commit position range" is regularly increasing - is that just a bad info? Or am I misinterpreting? 

Also, that makes me wonder how it is that ref graph's ever get to be persistently busted. If it worked at the time of the last roll, I would expect it to just continue working (save for the occasional flake)?
Project Member

Comment 34 by bugdroid1@chromium.org, Apr 17 2018

Labels: merge-merged-testbranch
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/4c4a55dce59dae529d82f7f4371a459e12e6373c

commit 4c4a55dce59dae529d82f7f4371a459e12e6373c
Author: catapult-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com <catapult-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Date: Mon Apr 16 22:03:58 2018

Roll src/third_party/catapult/ 037198303..5d0c16d72 (1 commit)

https://chromium.googlesource.com/catapult.git/+log/0371983030f2..5d0c16d725ca

$ git log 037198303..5d0c16d72 --date=short --no-merges --format='%ad %ae %s'
2018-04-12 crouleau [Telemetry] Explain when no browsers available for a device

Created with:
  roll-dep src/third_party/catapult
BUG= chromium:830834 


The AutoRoll server is located here: https://catapult-roll.skia.org

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, please contact the current sheriff, who should
be CC'd on the roll, and stop the roller if necessary.


TBR=sullivan@chromium.org

Change-Id: I89bb2ef6dda43a6c723fc7686e01f9278876c1d3
Reviewed-on: https://chromium-review.googlesource.com/1014276
Commit-Queue: catapult-chromium-autoroll <catapult-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Reviewed-by: catapult-chromium-autoroll <catapult-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Cr-Commit-Position: refs/heads/master@{#551139}
[modify] https://crrev.com/4c4a55dce59dae529d82f7f4371a459e12e6373c/DEPS

Status: WontFix (was: Available)
Closing per comment 33. 

Comment 36 by benhenry@google.com, Jan 16 (6 days ago)

Components: Test>Telemetry

Comment 37 by benhenry@google.com, Jan 16 (6 days ago)

Components: -Speed>Telemetry

Sign in to add a comment