Telemetry test spins up slowly on ChromeOS |
||||||||||||||||
Issue descriptionVersion: CrOS 8708 OS: CrOS Oak 8708 What steps will reproduce the problem? Boot with CrOS test image, ssh into the machine, in /usr/local/telemetry/src/tools/perf, run "./run_benchmark octane -v" What is the expected output? The browser should spin up and load the page in a few second. What do you see instead? The screen shows a blank white box for about 20 seconds. The info log shows it takes about 20 seconds to see the Chrome browser: (INFO) 2016-08-16 01:35:28,366 oobe._ExecuteOobeApi:39 Invoking Oobe.loginForTesting (INFO) 2016-08-16 01:35:48,524 cros_browser_backend.Start:154 Browser is up! (detailed info log attached) Please use labels and text to provide additional information.
,
Aug 16 2016
Observed on Elm: smoothness.top_25_smooth: (INFO) 2016-08-16 17:20:42,459 oobe._ExecuteOobeApi:39 Invoking Oobe.loginForTesting (INFO) 2016-08-16 17:20:45,984 cros_browser_backend.Start:154 Browser is up! octane: (INFO) 2016-08-16 17:18:39,346 oobe._ExecuteOobeApi:39 Invoking Oobe.loginForTesting (INFO) 2016-08-16 17:18:59,149 cros_browser_backend.Start:154 Browser is up!
,
Aug 16 2016
I found that the "starting chrome" command args differ (specifically, the array:string part). smooth (the faster benchmark) adds the following args: ['--running-performance-benchmark', '--touch-events=enabled', '--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/537.36 (KHTML\\, like Gecko) Chrome/40.0.2194.2 Safari/537.36']
,
Aug 17 2016
After some experiment, the command args is not the root cause. I tried octane with --use-live-sites flag, it doesn't stuck on login screen for more than 10 seconds. I suspected it is related to WPR. Here I attached run logs for both smoothness and octane, and also WPR access log. The weird thing is that though smoothness passed login screen relatively fast (around 5-7 seconds) I don't see anything special between smoothness and octane WPR.
,
Aug 17 2016
I found the root cause. The octane benchmark's page doesn't specify user_agent, where smoothness.top_25_smooth benchmark does. Smoothness benchmark's stories have specify shared_page_state_class in constructor with value shared_page_state.SharedDesktopPageState . It makes browser starts with --user-agent argument. After I specify shared_page_state_class for octane story instance, octane benchmark's start up time becomes as fast as smoothness.
,
Aug 17 2016
Achuith/Annie, do you have any comments on the proper user agent setting and why it plays such a big role?
,
Aug 17 2016
,
Aug 17 2016
The user agent string is there to make sure that the page serves the expected content no matter which platform we run it. Back in the old day, we don't split between desktop vs mobile case, hence sometimes pages that are meant for mobiles is run on desktop and vice versa by us manipulating the user agent. The new direction we are heading to is to split the benchmarks into mobile vs desktop cases, and probably make it no longer important to add the user agent string. Any how, I think the core bug here should be trying to understand why ChromeOs takes so long to start the octane test without the user agent. I would expect starting Chrome with/without user agent string takes roughly the same time. It's possible that the octane page does something weird with Cros's user agent?
,
Aug 18 2016
Ned, The slow start is unrelated to Octane page. It happens before browser is up (i.e. it waits OOBE (the login box) disappear for more than 10 seconds.) I'll investigate why the user-agent cause this. In the mean time, shall we specify SharedDesktopPageState as default if we find the DUT is CrOS? I'm looking for adding a check in Page base class and would like to hear your advice.
,
Aug 18 2016
Why is it urgent to do the workaround fix? I would prefer not doing any workaround until we identify the root cause.
,
Sep 22 2016
Dean has narrowed down the probable cause. Dean, please share your findings and assign to the proper owner.
,
Sep 22 2016
After a in-depth trace, the root cause is found. It is a story that a unhandled exception accidentally bring faster CrOS guest login. I raised some issues at the end. Achuith, please help us comment on this as login/OOBE is your expertise. Another key change was made by ygorshenin@chromium.org (https://codereview.chromium.org/755203002), which makes screen.js raises unhandled exception when cr.isChromeOS is false. The background story is that I found running Telemetry on CrOS, some benchmark take a long time starting/logging Chrome and some are not. The difference is that if user-agent is specified (e.g. desktop Chrome's user-agent), the login speed is normal. Otherwise, it takes more than 20 seconds to login to test. For specified user-agent case, a Chrome's internal js field, cr.isChromeOS, is set to false. And it prevents screenContext being initialized: https://cs.chromium.org/chromium/src/ui/login/screen.js?q=createScreen+initializeImpl_&sq=package:chromium&l=171&dr=C which cause login.HIDDetectionScreen().register() (in https://cs.chromium.org/chromium/src/chrome/browser/resources/chromeos/login/oobe.js?q=chromeos/login/oobe.js+initialize&sq=package:chromium&l=89) raises an unhandled TypeError exception, which prevents chrome.send('screenStateInitialize'); being called, which links to C++'s CoreOobeHandler::HandleInitialized(). So it skips SigninScreenHandler's initialization, which sets page_is_ready() to false, and in SigninScreenHandler::ShowImpl ( https://cs.chromium.org/chromium/src/chrome/browser/ui/webui/chromeos/login/signin_screen_handler.cc?q=signin_screen_handler.cc+showimpl&sq=package:chromium&l=600&dr=CSs ), it shortcuts and skips UpdateUIState(UI_STATE_ACCOUNT_PICKER, ¶ms). For default case, the above goes another branch so UpdateUIState is called, which calls BaseScreenHandler::ShowScreenWithData, which calls JS cr.ui.Oobe.showScreen(8) // SCREEN_ACCOUNT_PICKER, which eventually changes web_view src attribute, which triggers view.createGuest(), which eventually triggers CreateGuest(), which sets signin_partition, thus is required for chromeos::login::GetSigninContext() return something. The place where the excessive long waiting login time is found here: in UserCloudPolicyManagerChromeOS::FetchPolicyOAuthToken (https://cs.chromium.org/chromium/src/chrome/browser/chromeos/policy/user_cloud_policy_manager_chromeos.cc?q=FetchPolicyOAuthToken&sq=package:chromium&dr=CSs&l=327): scoped_refptr<net::URLRequestContextGetter> signin_context = chromeos::login::GetSigninContext(); if (!signin_context.get()) { LOG(ERROR) << "No signin context for policy oauth token fetch!"; OnOAuth2PolicyTokenFetched( std::string(), GoogleServiceAuthError(GoogleServiceAuthError::NONE)); return; } token_fetcher_.reset(PolicyOAuth2TokenFetcher::CreateInstance()); token_fetcher_->StartWithSigninContext( signin_context.get(), g_browser_process->system_request_context(), base::Bind(&UserCloudPolicyManagerChromeOS::OnOAuth2PolicyTokenFetched, base::Unretained(this))); For specified user-agent case, signin_context.get() is null, so it calls OnOAuth2PolicyTokenFetched(…) which is quite fast. For default case, it calls token_fetcher_->StartWithSigninContext(…), which eventually retry five times, each time waits for three seconds. After 15 seconds retry, it emits “Unrecoverable error or retry count max reached” and continuing logging. After the trace, I’d like to raises some issues: 1. What is the appropriate way to handle createScreen in screen.js throwing an exception when cr.isChromeOS is false? 2. Why in user-agent case where signin_context is unset, it can still proceed login? 3. What’s the function of PolicyOAuth2TokenFetcher? After max retry is reached, why it can still proceed login? Is it expected behavior?
,
Oct 7 2016
+Sonny, who is running Telemetry in an emulator and also noticed the 15 second delay. The emulator speed can be 100 times slower than real world clock, so the 15 sec delay here can cost hours. Achuith, could you provide some insight on what the best strategy is to address the questions Dean raised at the end of comment #12? We can implement it afterwards but we need some suggestions. Thanks!
,
Oct 21 2016
,
Oct 22 2016
,
Nov 17 2016
The 15 second delay for Telemetry login has existed in ChromeOS probably since the beginning. For example here is the login timestamps on R44 6978.0.0 BootTime.LoginNewUser: 16.61 0.00 +0.0000 LoginStarted 0.00 +0.0001 AuthStarted 0.00 +0.0033 CryptohomeMount-Start 0.02 +0.0134 CryptohomeMount-End 0.02 +0.0001 CryptohomeMount-Start 0.52 +0.5078 TabLoad-Start: 0.96 +0.4332 CryptohomeMount-End 0.96 +0.0001 BootTime.Authenticate 0.96 +0.0007 StartSession-Start 0.96 +0.0001 StartSession-End 0.96 +0.0000 UserLoggedIn-Start 1.04 +0.0849 UserLoggedIn-End 1.38 +0.3387 TabLoad-End: 1.71 +0.3258 TabLoad-Start: 2.14 +0.4361 TabLoad-End: 2.23 +0.0873 TabLoad-Start: 2.37 +0.1424 TabLoad-End: 16.31 +13.9313 IMEStarted 16.33 +0.0238 UserProfileGotten 16.36 +0.0351 TabLoad-Start: 16.43 +0.0684 TabLoad-Start: 16.44 +0.0100 TPMOwn-Start And in /home/chronos/user/log/chrome, it has the same oauth2 failures: [17173:17173:1116/235510:ERROR:policy_oauth2_token_fetcher.cc(135)] Unrecoverable error or retry count max reached. It will retry 5 times, and delay 3 seconds between retries. I will see if I can fix it.
,
Nov 22 2016
* If the --user-agent string is not explicitly specified in the test, Chrome will build one and it will be "Mozilla/5.0 (X11; CrOS x86_64 8996.0.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2921.0 Safari/537.36" for ChromeOS.
* in screen.js it has the following code, and "cr.isChromeOS" checks the user-agent string where if it contains "CrOS" it will be true.
initializeImpl_: function() {
if (cr.isChromeOS)
this.screenContext_ = new login.ScreenContext();
this.decorate();
* In smoothness.top_25_smooth the user agent string is specified as Mac OS so it is not following the ChromeOS signin flow. If I supplied a bogus user agent string simply as --user-agent='CrOS' the slow login path will be triggered.
So the problem is related to the authentication process for the test@test.test account for the CrOS signin flow.
,
Nov 22 2016
Thank you for this excellent investigation. I'm also looking at this bug, but haven't made much progress so far: https://bugs.chromium.org/p/chromium/issues/detail?id=663509
,
Jan 12 2017
Issue 657698 has been merged into this issue.
,
Jan 12 2017
I think we have had Telemetry login authentication problems all along. In policy_oauth2_token_fetcher.cc, the logic is to retry authentication on transient failures up to 5 times with 3 sec delay in between: https://cs.chromium.org/chromium/src/chrome/browser/chromeos/policy/policy_oauth2_token_fetcher.cc?q=PolicyOAuth2TokenFetcherImpl::RetryOnError&sq=package:chromium&dr=CSs&l=228 I added some code to Chrome to print the actual error type. Prior to the TsProxy change which landed in R55-8810, the error is considered permanent: [27205:27205:0110/210757:ERROR:policy_oauth2_token_fetcher.cc(235)] permanent error: Unexpected service response (ClientLogin response cookies didn't contain an auth code) After the TsProxy change, the error is considered transient so it retried 5 times hence the full 15 seconds delay: [11738:11738:0110/190216:ERROR:policy_oauth2_token_fetcher.cc(233)] transient error:Connection failed (-100). So I see two things that need to be fixed here: 1) The connection problem in the login phase introduced by the TsProxy change. 2) The ClientLogin response cookie problem.
,
Jan 13 2017
For the connection problem after the TsProxy change, if I comment out the following line in catapult/telemetry/telemetry/internal/backends/chrome/chrome_browser_backend.py:
replay_args.append('--proxy-server=socks://localhost:%s' % proxy_port)
the 5 transient "Service unavailable; try again later." errors are gone.
Ned/Achuith, do you have any suggestion to properly set up the proxy with TsProxy in the OOBE stage?
,
Jan 13 2017
Thanks for the fantastic investigation Ben. I can take a look.
,
Jan 13 2017
This CL gets rid of the retries, reducing login time from 22 sec to 8 sec. https://codereview.chromium.org/2630023002/
,
Jan 14 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/db3849840dea3dae2de134d875423ad05aa27f78 commit db3849840dea3dae2de134d875423ad05aa27f78 Author: achuith <achuith@chromium.org> Date: Sat Jan 14 01:40:41 2017 Do not fetch tokens for telemetry tests. Time reduced for each telemetry login from 22 sec to 8 sec. BUG= chromium:638131 TEST=manual. Review-Url: https://codereview.chromium.org/2630023002 Cr-Commit-Position: refs/heads/master@{#443763} [modify] https://crrev.com/db3849840dea3dae2de134d875423ad05aa27f78/chrome/browser/chromeos/policy/policy_oauth2_token_fetcher.cc
,
Jan 16 2017
Cool thanks for the fix! I cherry-picked this CL and the login time (including both startup.cold.blank_page and octane) significantly reduced. Mark as fixed for now and will wait for it to show up on crosbolt.
,
Jan 17 2017
Thanks again for your investigation, Ben!
,
Jan 17 2017
The time to run the telemetry/perf unit tests on cros dropped from ~80 min to <40 min.
,
Jan 20 2017
Issue 624316 has been merged into this issue.
,
Mar 4 2017
,
Apr 17 2017
,
May 30 2017
,
Aug 1 2017
,
Oct 14 2017
|
||||||||||||||||
►
Sign in to add a comment |
||||||||||||||||
Comment 1 by bccheng@chromium.org
, Aug 16 2016