"telemetry_perf_unittests (with patch)" is flaky |
||||||||||||||
Issue description"telemetry_perf_unittests (with patch)" is flaky. This issue was created automatically by the chromium-try-flakes app. Please find the right owner to fix the respective test/step and assign this issue to them. If the step/test is infrastructure-related, please add Infra-Troopers label and change issue status to Untriaged. When done, please remove the issue from Sheriff Bug Queue by removing the Sheriff-Chromium label. We have detected 6 recent flakes. List of all flakes can be found at https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyMAsSBUZsYWtlIiV0ZWxlbWV0cnlfcGVyZl91bml0dGVzdHMgKHdpdGggcGF0Y2gpDA. This flaky test/step was previously tracked in issue 633579 .
,
Aug 9 2016
Browser seems to be crashing.
Error output for one of the tests:
[9/23] benchmarks.system_health_smoke_test.SystemHealthBenchmarkSmokeTest.system_health.memory_desktop.load:tools:dropbox failed unexpectedly 39.1720s:
[ RUN ] load:tools:dropbox
Traceback (most recent call last):
File "e:\b\swarm_slave\w\irbajsx2\third_party\catapult\telemetry\telemetry\internal\story_runner.py", line 78, in _RunStoryAndProcessErrorIfNeeded
state.WillRunStory(story)
File "e:\b\swarm_slave\w\irbajsx2\third_party\catapult\telemetry\telemetry\page\shared_page_state.py", line 218, in WillRunStory
self._StartBrowser(page)
File "e:\b\swarm_slave\w\irbajsx2\third_party\catapult\telemetry\telemetry\page\shared_page_state.py", line 178, in _StartBrowser
self._browser = self._possible_browser.Create(self._finder_options)
File "e:\b\swarm_slave\w\irbajsx2\third_party\catapult\telemetry\telemetry\internal\backends\chrome\desktop_browser_finder.py", line 68, in Create
browser_backend, self._platform_backend, self._credentials_path)
File "e:\b\swarm_slave\w\irbajsx2\third_party\catapult\telemetry\telemetry\internal\browser\browser.py", line 55, in __init__
self._browser_backend.Start()
File "e:\b\swarm_slave\w\irbajsx2\third_party\catapult\telemetry\telemetry\internal\backends\chrome\desktop_browser_backend.py", line 294, in Start
self._WaitForBrowserToComeUp()
File "e:\b\swarm_slave\w\irbajsx2\third_party\catapult\telemetry\telemetry\internal\backends\chrome\chrome_browser_backend.py", line 173, in _WaitForBrowserToComeUp
raise exceptions.BrowserGoneException(self.browser, e)
BrowserGoneException: Return code: -2147483645
Found Minidump: True
,
Aug 9 2016
For most of the recent failures I've inspected on Windows, the failure appears to be caused by a failing DCHECK in metrics::MetricsStateManager::LoadClientInfoAndMaybeMigrate, just like issue 635770 .
,
Aug 9 2016
gab@ landed this assert that looks like it's causing instability all over the tree. https://chromium.googlesource.com/chromium/src/+/8e885de1c9cd3bec58fbeff156bbc31e15fac053 gab@ Can you please route this appropriately? From https://chromium-swarm.appspot.com/user/task/30898b8ed2fbbd10 [24/24] measurements.v8_detached_context_age_in_gc_unittest.V8DetachedContextAgeInGCTests.testWithSimplePage failed unexpectedly 4.2470s: ... chrome_7feef280000!base::debug::BreakDebugger+0xd chrome_7feef280000!logging::InitChromeLogging+0x406 chrome_7feef280000!logging::LogMessage::~LogMessage+0x20e chrome_7feef280000!metrics::MetricsStateManager::LoadClientInfoAndMaybeMigrate+0x257 chrome_7feef280000!metrics::MetricsStateManager::ForceClientIdCreation+0x117 chrome_7feef280000!metrics::MetricsService::EnableRecording+0x8c chrome_7feef280000!metrics::MetricsService::StartRecordingForTests+0x15 chrome_7feef280000!ChromeBrowserMainParts::StartMetricsRecording+0xba chrome_7feef280000!ChromeBrowserMainParts::PreMainMessageLoopRunImpl+0x39f chrome_7feef280000!ChromeBrowserMainParts::PreMainMessageLoopRun+0xf8 chrome_7feef280000!content::BrowserMainLoop::PreMainMessageLoopRun+0x9b chrome_7feef280000!content::StartupTaskRunner::RunAllTasksNow+0x29 chrome_7feef280000!content::BrowserMainLoop::CreateStartupTasks+0x259 chrome_7feef280000!content::BrowserMainRunnerImpl::Initialize+0x32b chrome_7feef280000!content::BrowserMain+0xfc chrome_7feef280000!content::RunNamedProcessTypeMain+0x1c0 chrome_7feef280000!content::ContentMainRunnerImpl::Run+0x166 chrome_7feef280000!content::ContentMain+0x30 chrome_7feef280000!ChromeMain+0xc6 chrome!MainDllLoader::Launch+0x346 chrome!wWinMain+0x1ec chrome!invoke_main+0x21 chrome!__scrt_common_main_seh+0x11e kernel32!BaseThreadInitThunk+0xd ntdll!RtlUserThreadStart+0x21 chrome_elf!crashpad::CrashpadClient::StartHandler+0xd3c kernel32!UnhandledExceptionFilter+0x160 ntdll!MD5Final+0x1e50 ntdll!_C_specific_handler+0x9c ntdll!RtlDecodePointer+0xad ntdll!RtlUnwindEx+0xbbf ntdll!KiUserExceptionDispatcher+0x2e chrome_7feef280000!base::debug::BreakDebugger+0xd chrome_7feef280000!logging::InitChromeLogging+0x406 chrome_7feef280000!logging::LogMessage::~LogMessage+0x20e chrome_7feef280000!metrics::MetricsStateManager::LoadClientInfoAndMaybeMigrate+0x257 chrome_7feef280000!metrics::MetricsStateManager::ForceClientIdCreation+0x117 chrome_7feef280000!metrics::MetricsService::EnableRecording+0x8c chrome_7feef280000!metrics::MetricsService::StartRecordingForTests+0x15 chrome_7feef280000!ChromeBrowserMainParts::StartMetricsRecording+0xba chrome_7feef280000!ChromeBrowserMainParts::PreMainMessageLoopRunImpl+0x39f chrome_7feef280000!ChromeBrowserMainParts::PreMainMessageLoopRun+0xf8 chrome_7feef280000!content::BrowserMainLoop::PreMainMessageLoopRun+0x9b chrome_7feef280000!content::StartupTaskRunner::RunAllTasksNow+0x29 chrome_7feef280000!content::BrowserMainLoop::CreateStartupTasks+0x259 chrome_7feef280000!content::BrowserMainRunnerImpl::Initialize+0x32b chrome_7feef280000!content::BrowserMain+0xfc chrome_7feef280000!content::RunNamedProcessTypeMain+0x1c0 chrome_7feef280000!content::ContentMainRunnerImpl::Run+0x166 chrome_7feef280000!content::ContentMain+0x30 chrome_7feef280000!ChromeMain+0xc6 chrome!MainDllLoader::Launch+0x346
,
Aug 9 2016
LoadClientInfoAndMaybeMigrate gets |client_info| from [1] which in practice in chrome is calling into [2] (or [3] on POSIX).
The DCHECK failing means it got *a* |client_info| but not one with a valid GUID as |client_info->client_id|.
First the fact that it even needs to call this code is weird (it means it doesn't have a valid client ID in prefs). But then getting something which isn't a valid GUID is even weirder.
Log in to the bots and check the value in the registry under
HKEY_CURRENT_USER\Software\Google\Update\ClientState\{$CHROMEGUID}[metricsid]
where $CHROMEGUID is {4EA16AC7-FD5A-47C3-875B-DBF4A2008C20} for Canary and {8A69D345-D564-463c-AFF1-A69D9E530F96} for Google Chrome (or for Chromium I think this is just under HKEY_CURRENT_USER\Software\Chromium[metricsid])
It should look something like this: 12345678-1234-1234-1234-123456789abc
@asvitkine to triage metrics issue on telemetry bots
[1] https://cs.chromium.org/chromium/src/components/metrics/metrics_state_manager.cc?rcl=0&l=233
[2] https://cs.chromium.org/chromium/src/chrome/installer/util/google_update_settings.cc?rcl=0&l=353
[3] https://cs.chromium.org/chromium/src/chrome/browser/google/google_update_settings_posix.cc?rcl=0&l=98
,
Aug 9 2016
My guess is it's flaky because some VMs have bad state in there. The code should probably be made resilient to that but I don't think it is right now.
,
Aug 10 2016
Adding Sheriff-Chromium label back since it is causing failures still: https://build.chromium.org/p/chromium.mac/builders/Mac10.10%20Tests/builds/7772 Issue 635770 and issue 636052 are also caused by the same failing DCHECK.
,
Aug 10 2016
,
Aug 10 2016
I'm in training today and tomorrow but might be able to take a look on Friday. Is this something that started happening recently? AFAIK, that code has not changed recently. Who can help dumping the registry value from the bots per gab@'s instructions in comment 5?
,
Aug 10 2016
I don't know how to tell when this assert started failing, "telemetry_perf_unittests (with patch)" has failed way back but for lots of different reasons (presumably) and since the logs are gone I can't see when this started happening.
,
Aug 10 2016
+benhenry: who is the right person to route a request like #5 to log into the bots to? Should telemetry team, the trooper, or someone else handle it?
,
Aug 10 2016
Unassigning and adding Infra-Troopers label. Can the trooper check the value on the bot as suggested in #5, report back and reassign to asvitkine?
,
Aug 10 2016
Trooper here, which bot? It looks like the flake is all over the fleet (linux bots included). Will a registry dump from any windows bot suffice?
,
Aug 10 2016
Yes, the values mentioned in #5 (or just a dump of HKCU\Software\Chromium\* and HKCU\Software\Google\Update\*) on any Windows bot experiencing this failure will do.
,
Aug 10 2016
Here you go, they're from vm129-m4, the failing swarming bot.
,
Aug 10 2016
Thanks. That's it. Assuming these bots are running Chromium, the values under HKCU\Software\Chromium are borked: "metricsid"="123" "metricsid_installdate"="123" "metricsid_enableddate"="345" In comparison the Google Chrome bits seem fine (but they don't matter if Chromium is under test): "metricsid"="23EDB276-BA2E-4093-AA38-36E035FF647E" "metricsid_installdate"="1434603654" "metricsid_enableddate"="1434603654" Solution: teach LoadClientInfoAndMaybeMigrate() or GoogleUpdateSettings::LoadMetricsClientInfo() to self-heal (or at least discard) bad values? Currently it DCHECKs that the values are valid, I guess we assumed we'd be the only ones ever touching that (and wanted to know if things regularly went corrupted) but not recuperating from an occasional bad value doesn't sound great either.
,
Aug 10 2016
Sigh. This is my fault. I have a CL in the works that was setting those dummy values in a test that I've been running through some try bots - and I guess they weren't being restored. On Friday, when I'm back from training I can look at what we can do here. My first thought is perhaps unit tests should use a fake registry or clean up the registry entirely between runs, but I'm not sure if we have any precedent for best practice in this area already. Gab, do you know? For this specific case, I guess we can remove the DCHECK and make the code robust to dealing with bad values - mainly since it's definitely possible for something like that to happen in the real world if someone decides to modify those settings.
,
Aug 10 2016
See RegistryOverrideManager :-) : https://cs.chromium.org/chromium/src/base/test/test_reg_util_win.h?dr=CSs&q=RegistryOverrideManager+file:test_reg_util_win.h&sq=package:chromium&l=33 A 2 liner in a unittest will mock HKCU :-) : https://cs.chromium.org/chromium/src/chrome/installer/util/beacons_unittest.cc?q=RegistryOverrideManager+file:.*unittest.cc&sq=package:chromium&l=57&dr=C
,
Aug 10 2016
,
Aug 11 2016
Can we at least disable the tests until this is fixed if we can't get it fixed very soon? It's causing constant tree flake: https://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_ng/builds/271606
,
Aug 11 2016
Disabling SGTM if someone has cycles to do this. I don't have any cycles until Friday.
,
Aug 11 2016
#20: Which tests should be disabled? telemetry_perf_unittests?
,
Aug 11 2016
+nednguyen Is there some subset of the telemetry_perf_unittests we can disable instead of the entire suite? These catch a lot of crashes on the CQ that break the perf waterfall.
,
Aug 11 2016
Uploaded https://codereview.chromium.org/2236383002/ to hack around this until asvitkine@ has cycles to do the right thing here.
,
Aug 11 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/18e3366f1863eb3a9f7a19211e184acc2f501e06 commit 18e3366f1863eb3a9f7a19211e184acc2f501e06 Author: gab <gab@chromium.org> Date: Thu Aug 11 15:32:34 2016 Temporary hack to fix telemetry_perf_unittests flakes. While asvitkine@ works on proper fix. BUG= 635255 Review-Url: https://codereview.chromium.org/2236383002 Cr-Commit-Position: refs/heads/master@{#411336} [modify] https://crrev.com/18e3366f1863eb3a9f7a19211e184acc2f501e06/components/metrics/metrics_state_manager.cc
,
Aug 11 2016
@sullivan, not sure how to verify, but I think it should be fixed now.
,
Aug 11 2016
If we know longer see the telemetry_perf_unittest failing, I think the bug is fixed? Though should we file another bug for asvitkine@ to land the correct fix (comment in 18e3366f1863eb3a9f7a19211e184acc2f501e06)
,
Aug 11 2016
@27, yes please do.
,
Aug 12 2016
Marking this as Fixed per comment 27. I filed a bug for making a proper fix here: http://crbug.com/637322 I also filed another bug on whether we could make all tests always run with a mock registry: http://crbug.com/637324 |
||||||||||||||
►
Sign in to add a comment |
||||||||||||||
Comment 1 by chromium...@appspot.gserviceaccount.com
, Aug 8 2016