New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 635255 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Aug 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



Sign in to add a comment

"telemetry_perf_unittests (with patch)" is flaky

Project Member Reported by chromium...@appspot.gserviceaccount.com, Aug 6 2016

Issue description

"telemetry_perf_unittests (with patch)" is flaky.

This issue was created automatically by the chromium-try-flakes app. Please find the right owner to fix the respective test/step and assign this issue to them. If the step/test is infrastructure-related, please add Infra-Troopers label and change issue status to Untriaged. When done, please remove the issue from Sheriff Bug Queue by removing the Sheriff-Chromium label.

We have detected 6 recent flakes. List of all flakes can be found at https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyMAsSBUZsYWtlIiV0ZWxlbWV0cnlfcGVyZl91bml0dGVzdHMgKHdpdGggcGF0Y2gpDA.



This flaky test/step was previously tracked in  issue 633579 .
 
Project Member

Comment 1 by chromium...@appspot.gserviceaccount.com, Aug 8 2016

Detected 3 new flakes for test/step "telemetry_perf_unittests (with patch)". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyMAsSBUZsYWtlIiV0ZWxlbWV0cnlfcGVyZl91bml0dGVzdHMgKHdpdGggcGF0Y2gpDA. This message was posted automatically by the chromium-try-flakes app.
Browser seems to be crashing.
Error output for one of the tests:

[9/23] benchmarks.system_health_smoke_test.SystemHealthBenchmarkSmokeTest.system_health.memory_desktop.load:tools:dropbox failed unexpectedly 39.1720s:
  [ RUN      ] load:tools:dropbox
  Traceback (most recent call last):
    File "e:\b\swarm_slave\w\irbajsx2\third_party\catapult\telemetry\telemetry\internal\story_runner.py", line 78, in _RunStoryAndProcessErrorIfNeeded
      state.WillRunStory(story)
    File "e:\b\swarm_slave\w\irbajsx2\third_party\catapult\telemetry\telemetry\page\shared_page_state.py", line 218, in WillRunStory
      self._StartBrowser(page)
    File "e:\b\swarm_slave\w\irbajsx2\third_party\catapult\telemetry\telemetry\page\shared_page_state.py", line 178, in _StartBrowser
      self._browser = self._possible_browser.Create(self._finder_options)
    File "e:\b\swarm_slave\w\irbajsx2\third_party\catapult\telemetry\telemetry\internal\backends\chrome\desktop_browser_finder.py", line 68, in Create
      browser_backend, self._platform_backend, self._credentials_path)
    File "e:\b\swarm_slave\w\irbajsx2\third_party\catapult\telemetry\telemetry\internal\browser\browser.py", line 55, in __init__
      self._browser_backend.Start()
    File "e:\b\swarm_slave\w\irbajsx2\third_party\catapult\telemetry\telemetry\internal\backends\chrome\desktop_browser_backend.py", line 294, in Start
      self._WaitForBrowserToComeUp()
    File "e:\b\swarm_slave\w\irbajsx2\third_party\catapult\telemetry\telemetry\internal\backends\chrome\chrome_browser_backend.py", line 173, in _WaitForBrowserToComeUp
      raise exceptions.BrowserGoneException(self.browser, e)
  BrowserGoneException: Return code: -2147483645
  Found Minidump: True
For most of the recent failures I've inspected on Windows, the failure appears to be caused by a failing DCHECK in metrics::MetricsStateManager::LoadClientInfoAndMaybeMigrate, just like  issue 635770 .
Components: Internals>Metrics
Labels: -Sheriff-Chromium
Owner: gab@chromium.org
gab@ landed this assert that looks like it's causing instability all over the tree.
https://chromium.googlesource.com/chromium/src/+/8e885de1c9cd3bec58fbeff156bbc31e15fac053

gab@ Can you please route this appropriately?

From https://chromium-swarm.appspot.com/user/task/30898b8ed2fbbd10

[24/24] measurements.v8_detached_context_age_in_gc_unittest.V8DetachedContextAgeInGCTests.testWithSimplePage failed unexpectedly 4.2470s:

...

chrome_7feef280000!base::debug::BreakDebugger+0xd
chrome_7feef280000!logging::InitChromeLogging+0x406
chrome_7feef280000!logging::LogMessage::~LogMessage+0x20e
chrome_7feef280000!metrics::MetricsStateManager::LoadClientInfoAndMaybeMigrate+0x257
chrome_7feef280000!metrics::MetricsStateManager::ForceClientIdCreation+0x117
chrome_7feef280000!metrics::MetricsService::EnableRecording+0x8c
chrome_7feef280000!metrics::MetricsService::StartRecordingForTests+0x15

chrome_7feef280000!ChromeBrowserMainParts::StartMetricsRecording+0xba
chrome_7feef280000!ChromeBrowserMainParts::PreMainMessageLoopRunImpl+0x39f
chrome_7feef280000!ChromeBrowserMainParts::PreMainMessageLoopRun+0xf8
chrome_7feef280000!content::BrowserMainLoop::PreMainMessageLoopRun+0x9b
chrome_7feef280000!content::StartupTaskRunner::RunAllTasksNow+0x29
chrome_7feef280000!content::BrowserMainLoop::CreateStartupTasks+0x259
chrome_7feef280000!content::BrowserMainRunnerImpl::Initialize+0x32b
chrome_7feef280000!content::BrowserMain+0xfc
chrome_7feef280000!content::RunNamedProcessTypeMain+0x1c0
chrome_7feef280000!content::ContentMainRunnerImpl::Run+0x166
chrome_7feef280000!content::ContentMain+0x30
chrome_7feef280000!ChromeMain+0xc6
chrome!MainDllLoader::Launch+0x346
chrome!wWinMain+0x1ec
chrome!invoke_main+0x21
chrome!__scrt_common_main_seh+0x11e
kernel32!BaseThreadInitThunk+0xd
ntdll!RtlUserThreadStart+0x21

chrome_elf!crashpad::CrashpadClient::StartHandler+0xd3c
kernel32!UnhandledExceptionFilter+0x160
ntdll!MD5Final+0x1e50
ntdll!_C_specific_handler+0x9c
ntdll!RtlDecodePointer+0xad
ntdll!RtlUnwindEx+0xbbf
ntdll!KiUserExceptionDispatcher+0x2e
chrome_7feef280000!base::debug::BreakDebugger+0xd
chrome_7feef280000!logging::InitChromeLogging+0x406
chrome_7feef280000!logging::LogMessage::~LogMessage+0x20e
chrome_7feef280000!metrics::MetricsStateManager::LoadClientInfoAndMaybeMigrate+0x257
chrome_7feef280000!metrics::MetricsStateManager::ForceClientIdCreation+0x117
chrome_7feef280000!metrics::MetricsService::EnableRecording+0x8c
chrome_7feef280000!metrics::MetricsService::StartRecordingForTests+0x15

chrome_7feef280000!ChromeBrowserMainParts::StartMetricsRecording+0xba
chrome_7feef280000!ChromeBrowserMainParts::PreMainMessageLoopRunImpl+0x39f
chrome_7feef280000!ChromeBrowserMainParts::PreMainMessageLoopRun+0xf8
chrome_7feef280000!content::BrowserMainLoop::PreMainMessageLoopRun+0x9b
chrome_7feef280000!content::StartupTaskRunner::RunAllTasksNow+0x29
chrome_7feef280000!content::BrowserMainLoop::CreateStartupTasks+0x259
chrome_7feef280000!content::BrowserMainRunnerImpl::Initialize+0x32b
chrome_7feef280000!content::BrowserMain+0xfc
chrome_7feef280000!content::RunNamedProcessTypeMain+0x1c0
chrome_7feef280000!content::ContentMainRunnerImpl::Run+0x166
chrome_7feef280000!content::ContentMain+0x30
chrome_7feef280000!ChromeMain+0xc6
chrome!MainDllLoader::Launch+0x346

Comment 5 by gab@chromium.org, Aug 9 2016

Cc: gab@chromium.org esprehn@chromium.org
Owner: asvitk...@chromium.org
LoadClientInfoAndMaybeMigrate gets |client_info| from [1] which in practice in chrome is calling into [2] (or [3] on POSIX).

The DCHECK failing means it got *a* |client_info| but not one with a valid GUID as |client_info->client_id|.

First the fact that it even needs to call this code is weird (it means it doesn't have a valid client ID in prefs). But then getting something which isn't a valid GUID is even weirder.

Log in to the bots and check the value in the registry under
HKEY_CURRENT_USER\Software\Google\Update\ClientState\{$CHROMEGUID}[metricsid]
where $CHROMEGUID is {4EA16AC7-FD5A-47C3-875B-DBF4A2008C20} for Canary and {8A69D345-D564-463c-AFF1-A69D9E530F96} for Google Chrome (or for Chromium I think this is just under HKEY_CURRENT_USER\Software\Chromium[metricsid])

It should look something like this: 12345678-1234-1234-1234-123456789abc

@asvitkine to triage metrics issue on telemetry bots

[1] https://cs.chromium.org/chromium/src/components/metrics/metrics_state_manager.cc?rcl=0&l=233
[2] https://cs.chromium.org/chromium/src/chrome/installer/util/google_update_settings.cc?rcl=0&l=353
[3] https://cs.chromium.org/chromium/src/chrome/browser/google/google_update_settings_posix.cc?rcl=0&l=98

Comment 6 by gab@chromium.org, Aug 9 2016

My guess is it's flaky because some VMs have bad state in there. The code should probably be made resilient to that but I don't think it is right now.

Comment 7 by foolip@chromium.org, Aug 10 2016

Cc: asvitk...@chromium.org
Labels: Sheriff-Chromium
Adding Sheriff-Chromium label back since it is causing failures still:
https://build.chromium.org/p/chromium.mac/builders/Mac10.10%20Tests/builds/7772

 Issue 635770  and  issue 636052  are also caused by the same failing DCHECK.

Comment 8 by foolip@chromium.org, Aug 10 2016

Cc: -asvitk...@chromium.org
Cc: sullivan@chromium.org
I'm in training today and tomorrow but might be able to take a look on Friday. Is this something that started happening recently? AFAIK, that code has not changed recently.

Who can help dumping the registry value from the bots per gab@'s instructions in comment 5?
I don't know how to tell when this assert started failing, "telemetry_perf_unittests (with patch)" has failed way back but for lots of different reasons (presumably) and since the logs are gone I can't see when this started happening.
Cc: benhenry@chromium.org
Components: Infra>Client>Perf
+benhenry: who is the right person to route a request like #5 to log into the bots to? Should telemetry team, the trooper, or someone else handle it?
Labels: Infra-Troopers
Owner: ----
Unassigning and adding Infra-Troopers label.

Can the trooper check the value on the bot as suggested in #5, report back and reassign to asvitkine?
Trooper here, which bot? It looks like the flake is all over the fleet (linux bots included).  Will a registry dump from any windows bot suffice?

Comment 14 by gab@chromium.org, Aug 10 2016

Yes, the values mentioned in #5 (or just a dump of HKCU\Software\Chromium\* and HKCU\Software\Google\Update\*) on any Windows bot experiencing this failure will do.
Labels: -Infra-Troopers
Here you go, they're from vm129-m4, the failing swarming bot.
HSC.reg
3.7 KB Download
HSGU.reg
1.1 KB Download

Comment 16 by gab@chromium.org, Aug 10 2016

Thanks. That's it.

Assuming these bots are running Chromium, the values under HKCU\Software\Chromium are borked:

"metricsid"="123"
"metricsid_installdate"="123"
"metricsid_enableddate"="345"

In comparison the Google Chrome bits seem fine (but they don't matter if Chromium is under test):
"metricsid"="23EDB276-BA2E-4093-AA38-36E035FF647E"
"metricsid_installdate"="1434603654"
"metricsid_enableddate"="1434603654"


Solution: teach LoadClientInfoAndMaybeMigrate() or GoogleUpdateSettings::LoadMetricsClientInfo() to self-heal (or at least discard) bad values? Currently it DCHECKs that the values are valid, I guess we assumed we'd be the only ones ever touching that (and wanted to know if things regularly went corrupted) but not recuperating from an occasional bad value doesn't sound great either.
Sigh. This is my fault. I have a CL in the works that was setting those dummy values in a test that I've been running through some try bots - and I guess they weren't being restored.

On Friday, when I'm back from training I can look at what we can do here.

My first thought is perhaps unit tests should use a fake registry or clean up the registry entirely between runs, but I'm not sure if we have any precedent for best practice in this area already. Gab, do you know?

For this specific case, I guess we can remove the DCHECK and make the code robust to dealing with bad values - mainly since it's definitely possible for something like that to happen in the real world if someone decides to modify those settings.

Comment 19 by samli@chromium.org, Aug 10 2016

Owner: asvitk...@chromium.org
Status: Assigned (was: Untriaged)
Can we at least disable the tests until this is fixed if we can't get it fixed very soon? It's causing constant tree flake:
https://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_ng/builds/271606
Disabling SGTM if someone has cycles to do this. I don't have any cycles until Friday.
Cc: petrcermak@chromium.org
#20: Which tests should be disabled? telemetry_perf_unittests?
Cc: nednguyen@chromium.org
+nednguyen

Is there some subset of the telemetry_perf_unittests we can disable instead of the entire suite? These catch a lot of crashes on the CQ that break the perf waterfall.

Comment 24 by gab@chromium.org, Aug 11 2016

Uploaded https://codereview.chromium.org/2236383002/ to hack around this until asvitkine@ has cycles to do the right thing here.
Project Member

Comment 25 by bugdroid1@chromium.org, Aug 11 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/18e3366f1863eb3a9f7a19211e184acc2f501e06

commit 18e3366f1863eb3a9f7a19211e184acc2f501e06
Author: gab <gab@chromium.org>
Date: Thu Aug 11 15:32:34 2016

Temporary hack to fix telemetry_perf_unittests flakes.

While asvitkine@ works on proper fix.

BUG= 635255 

Review-Url: https://codereview.chromium.org/2236383002
Cr-Commit-Position: refs/heads/master@{#411336}

[modify] https://crrev.com/18e3366f1863eb3a9f7a19211e184acc2f501e06/components/metrics/metrics_state_manager.cc

Comment 26 by gab@chromium.org, Aug 11 2016

@sullivan, not sure how to verify, but I think it should be fixed now.
If we know longer see the telemetry_perf_unittest failing, I think the bug is fixed? Though should we file another bug for asvitkine@ to land the correct fix (comment in 18e3366f1863eb3a9f7a19211e184acc2f501e06)

Comment 28 by gab@chromium.org, Aug 11 2016

Cc: -nednguyen@chromium.org
Status: Assigneded (was: Assigned)
@27, yes please do.
Status: Fixed (was: Assigneded)
Marking this as Fixed per comment 27.

I filed a bug for making a proper fix here:  http://crbug.com/637322 

I also filed another bug on whether we could make all tests always run with a mock registry:  http://crbug.com/637324 

Sign in to add a comment