New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 638131 link

Starred by 6 users

Issue metadata

Status: Archived
Owner:
Last visit > 30 days ago
Closed: Jan 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



Sign in to add a comment

Telemetry test spins up slowly on ChromeOS

Project Member Reported by deanliao@chromium.org, Aug 16 2016

Issue description

Version: CrOS 8708
OS: CrOS Oak 8708 

What steps will reproduce the problem?
Boot with CrOS test image, ssh into the machine, 
in /usr/local/telemetry/src/tools/perf, run "./run_benchmark octane -v"

What is the expected output?
The browser should spin up and load the page in a few second.

What do you see instead?
The screen shows a blank white box for about 20 seconds. The info log shows it takes about 20 seconds to see the Chrome browser:
(INFO) 2016-08-16 01:35:28,366 oobe._ExecuteOobeApi:39  Invoking Oobe.loginForTesting
(INFO) 2016-08-16 01:35:48,524 cros_browser_backend.Start:154  Browser is up!

(detailed info log attached)


Please use labels and text to provide additional information.

 
slow_browser_up.txt
2.6 KB View Download
Cc: bccheng@chromium.org shunhsingou@chromium.org
Observed on Elm:
smoothness.top_25_smooth:
(INFO) 2016-08-16 17:20:42,459 oobe._ExecuteOobeApi:39  Invoking Oobe.loginForTesting
(INFO) 2016-08-16 17:20:45,984 cros_browser_backend.Start:154  Browser is up!

octane:
(INFO) 2016-08-16 17:18:39,346 oobe._ExecuteOobeApi:39  Invoking Oobe.loginForTesting
(INFO) 2016-08-16 17:18:59,149 cros_browser_backend.Start:154  Browser is up!

I found that the "starting chrome" command args differ (specifically, the array:string part). smooth (the faster benchmark) adds the following args:
['--running-performance-benchmark', '--touch-events=enabled', '--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/537.36 (KHTML\\, like Gecko) Chrome/40.0.2194.2 Safari/537.36']

After some experiment, the command args is not the root cause.

I tried octane with --use-live-sites flag, it doesn't stuck on login screen for more than 10 seconds. I suspected it is related to WPR. Here I attached run logs for both smoothness and octane, and also WPR access log. The weird thing is that though smoothness passed login screen relatively fast (around 5-7 seconds) I don't see anything special between smoothness and octane WPR.

octane.runlog
1.4 MB Download
smoothness.runlog
816 KB Download
octane.wprlog
2.8 KB Download
smoothness.wprlog
3.2 KB Download
I found the root cause. The octane benchmark's page doesn't specify user_agent, where smoothness.top_25_smooth benchmark does. Smoothness benchmark's stories have specify shared_page_state_class in constructor with value shared_page_state.SharedDesktopPageState . It makes browser starts with --user-agent argument.

After I specify shared_page_state_class for octane story instance, octane benchmark's start up time becomes as fast as smoothness.
Cc: sullivan@chromium.org achuith@chromium.org
Achuith/Annie, do you have any comments on the proper user agent setting and why it plays such a big role?
Cc: nednguyen@chromium.org
The user agent string is there to make sure that the page serves the expected content no matter which platform we run it.

Back in the old day, we don't split between desktop vs mobile case, hence sometimes pages that are meant for mobiles is run on desktop and vice versa by us manipulating the user agent.

The new direction we are heading to is to split the benchmarks into mobile vs desktop cases, and probably make it no longer important to add the user agent string.

Any how, I think the core bug here should be trying to understand why ChromeOs takes so long to start the octane test without the user agent. I would expect starting Chrome with/without user agent string takes roughly the same time.  It's possible that the octane page does something weird with Cros's user agent?
Ned, 

The slow start is unrelated to Octane page. It happens before browser is up (i.e. it waits OOBE (the login box) disappear for more than 10 seconds.)

I'll investigate why the user-agent cause this. In the mean time, shall we specify SharedDesktopPageState as default if we find the DUT is CrOS? I'm looking for adding a check in Page base class and would like to hear your advice.
Why is it urgent to do the workaround fix? I would prefer not doing any workaround until we identify the root cause.
Owner: deanliao@chromium.org
Status: Started (was: Untriaged)
Dean has narrowed down the probable cause. Dean, please share your findings and assign to the proper owner.
Cc: ygorshenin@chromium.org
After a in-depth trace, the root cause is found. It is a story that a unhandled exception accidentally bring faster CrOS guest login. I raised some issues at the end. Achuith, please help us comment on this as login/OOBE is your expertise. Another key change was made by ygorshenin@chromium.org (https://codereview.chromium.org/755203002), which makes screen.js raises unhandled exception when cr.isChromeOS is false. 

The background story is that I found running Telemetry on CrOS, some benchmark take a long time starting/logging Chrome and some are not. The difference is that if user-agent is specified (e.g. desktop Chrome's user-agent), the login speed is normal. Otherwise, it takes more than 20 seconds to login to test.

For specified user-agent case, a Chrome's internal js field, cr.isChromeOS, is set to false. And it prevents screenContext being initialized:
https://cs.chromium.org/chromium/src/ui/login/screen.js?q=createScreen+initializeImpl_&sq=package:chromium&l=171&dr=C
which cause login.HIDDetectionScreen().register() (in https://cs.chromium.org/chromium/src/chrome/browser/resources/chromeos/login/oobe.js?q=chromeos/login/oobe.js+initialize&sq=package:chromium&l=89) raises an unhandled TypeError exception, which prevents chrome.send('screenStateInitialize'); being called, which links to C++'s CoreOobeHandler::HandleInitialized(). So it skips SigninScreenHandler's initialization, which sets page_is_ready() to false, and in SigninScreenHandler::ShowImpl ( https://cs.chromium.org/chromium/src/chrome/browser/ui/webui/chromeos/login/signin_screen_handler.cc?q=signin_screen_handler.cc+showimpl&sq=package:chromium&l=600&dr=CSs ), it shortcuts and skips UpdateUIState(UI_STATE_ACCOUNT_PICKER, &params).

For default case, the above goes another branch so UpdateUIState is called, which calls BaseScreenHandler::ShowScreenWithData, which calls JS cr.ui.Oobe.showScreen(8) // SCREEN_ACCOUNT_PICKER, which eventually changes web_view src attribute, which triggers view.createGuest(), which eventually triggers CreateGuest(), which sets signin_partition, thus is required for chromeos::login::GetSigninContext() return something.

The place where the excessive long waiting login time is found here: in UserCloudPolicyManagerChromeOS::FetchPolicyOAuthToken (https://cs.chromium.org/chromium/src/chrome/browser/chromeos/policy/user_cloud_policy_manager_chromeos.cc?q=FetchPolicyOAuthToken&sq=package:chromium&dr=CSs&l=327):
  scoped_refptr<net::URLRequestContextGetter> signin_context =
      chromeos::login::GetSigninContext();
  if (!signin_context.get()) {
    LOG(ERROR) << "No signin context for policy oauth token fetch!";
    OnOAuth2PolicyTokenFetched(
        std::string(), GoogleServiceAuthError(GoogleServiceAuthError::NONE));
    return;
  }

  token_fetcher_.reset(PolicyOAuth2TokenFetcher::CreateInstance());
  token_fetcher_->StartWithSigninContext(
      signin_context.get(), g_browser_process->system_request_context(),
      base::Bind(&UserCloudPolicyManagerChromeOS::OnOAuth2PolicyTokenFetched,
                 base::Unretained(this)));

For specified user-agent case, signin_context.get() is null, so it calls OnOAuth2PolicyTokenFetched(…) which is quite fast.
For default case, it calls token_fetcher_->StartWithSigninContext(…), which eventually retry five times, each time waits for three seconds. After 15 seconds retry, it emits “Unrecoverable error or retry count max reached” and continuing logging.


After the trace, I’d like to raises some issues:
1. What is the appropriate way to handle createScreen in screen.js throwing an exception when cr.isChromeOS is false?
2. Why in user-agent case where signin_context is unset, it can still proceed login?
3. What’s the function of PolicyOAuth2TokenFetcher? After max retry is reached, why it can still proceed login? Is it expected behavior?


Cc: deanliao@chromium.org sonnyrao@chromium.org
Owner: achuith@chromium.org
+Sonny, who is running Telemetry in an emulator and also noticed the 15 second delay. The emulator speed can be 100 times slower than real world clock, so the 15 sec delay here can cost hours.

Achuith, could you provide some insight on what the best strategy is to address the questions Dean raised at the end of comment #12? We can implement it afterwards but we need some suggestions. Thanks!
Cc: llozano@chromium.org laszio@chromium.org
Cc: -llozano@chromium.org yunlian@chromium.org
Owner: bccheng@chromium.org
The 15 second delay for Telemetry login has existed in ChromeOS probably since the beginning. For example here is the login timestamps on R44 6978.0.0

BootTime.LoginNewUser: 16.61
0.00 +0.0000 LoginStarted
0.00 +0.0001 AuthStarted
0.00 +0.0033 CryptohomeMount-Start
0.02 +0.0134 CryptohomeMount-End
0.02 +0.0001 CryptohomeMount-Start
0.52 +0.5078 TabLoad-Start:
0.96 +0.4332 CryptohomeMount-End
0.96 +0.0001 BootTime.Authenticate
0.96 +0.0007 StartSession-Start
0.96 +0.0001 StartSession-End
0.96 +0.0000 UserLoggedIn-Start
1.04 +0.0849 UserLoggedIn-End
1.38 +0.3387 TabLoad-End:
1.71 +0.3258 TabLoad-Start:
2.14 +0.4361 TabLoad-End:
2.23 +0.0873 TabLoad-Start:
2.37 +0.1424 TabLoad-End:
16.31 +13.9313 IMEStarted
16.33 +0.0238 UserProfileGotten
16.36 +0.0351 TabLoad-Start:
16.43 +0.0684 TabLoad-Start:
16.44 +0.0100 TPMOwn-Start

And in /home/chronos/user/log/chrome, it has the same oauth2 failures:

[17173:17173:1116/235510:ERROR:policy_oauth2_token_fetcher.cc(135)] Unrecoverable error or retry count max reached.

It will retry 5 times, and delay 3 seconds between retries. I will see if I can fix it.
* If the --user-agent string is not explicitly specified in the test, Chrome will build one and it will be "Mozilla/5.0 (X11; CrOS x86_64 8996.0.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2921.0 Safari/537.36" for ChromeOS.

* in screen.js it has the following code, and "cr.isChromeOS" checks the user-agent string where if it contains "CrOS" it will be true.


    initializeImpl_: function() {
      if (cr.isChromeOS)
        this.screenContext_ = new login.ScreenContext();

      this.decorate();

* In smoothness.top_25_smooth the user agent string is specified as Mac OS so it is not following the ChromeOS signin flow. If I supplied a bogus user agent string simply as --user-agent='CrOS' the slow login path will be triggered.

So the problem is related to the authentication process for the test@test.test account for the CrOS signin flow.

Thank you for this excellent investigation.

I'm also looking at this bug, but haven't made much progress so far:
https://bugs.chromium.org/p/chromium/issues/detail?id=663509


Cc: durga.behera@chromium.org hctsai@chromium.org pucchakayala@chromium.org songsuk@chromium.org ajha@chromium.org kavvaru@chromium.org
 Issue 657698  has been merged into this issue.
I think we have had Telemetry login authentication problems all along. In policy_oauth2_token_fetcher.cc, the logic is to retry authentication on transient failures up to 5 times with 3 sec delay in between: 
https://cs.chromium.org/chromium/src/chrome/browser/chromeos/policy/policy_oauth2_token_fetcher.cc?q=PolicyOAuth2TokenFetcherImpl::RetryOnError&sq=package:chromium&dr=CSs&l=228

I added some code to Chrome to print the actual error type. Prior to the TsProxy change which landed in R55-8810, the error is considered permanent: 

[27205:27205:0110/210757:ERROR:policy_oauth2_token_fetcher.cc(235)] permanent error: Unexpected service response (ClientLogin response cookies didn't contain an auth code)

After the TsProxy change, the error is considered transient so it retried 5 times hence the full 15 seconds delay:

[11738:11738:0110/190216:ERROR:policy_oauth2_token_fetcher.cc(233)] transient error:Connection failed (-100).

So I see two things that need to be fixed here:
1) The connection problem in the login phase introduced by the TsProxy change.
2) The ClientLogin response cookie problem.


For the connection problem after the TsProxy change, if I comment out the following line in catapult/telemetry/telemetry/internal/backends/chrome/chrome_browser_backend.py:

 replay_args.append('--proxy-server=socks://localhost:%s' % proxy_port)

the 5 transient "Service unavailable; try again later." errors are gone.

Ned/Achuith, do you have any suggestion to properly set up the proxy with TsProxy in the OOBE stage?
Thanks for the fantastic investigation Ben. I can take a look.
This CL gets rid of the retries, reducing login time from 22 sec to 8 sec. 
https://codereview.chromium.org/2630023002/
Project Member

Comment 24 by bugdroid1@chromium.org, Jan 14 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/db3849840dea3dae2de134d875423ad05aa27f78

commit db3849840dea3dae2de134d875423ad05aa27f78
Author: achuith <achuith@chromium.org>
Date: Sat Jan 14 01:40:41 2017

Do not fetch tokens for telemetry tests.

Time reduced for each telemetry login from 22 sec to 8 sec.

BUG= chromium:638131 
TEST=manual.

Review-Url: https://codereview.chromium.org/2630023002
Cr-Commit-Position: refs/heads/master@{#443763}

[modify] https://crrev.com/db3849840dea3dae2de134d875423ad05aa27f78/chrome/browser/chromeos/policy/policy_oauth2_token_fetcher.cc

Status: Fixed (was: Started)
Cool thanks for the fix! I cherry-picked this CL and the login time (including both startup.cold.blank_page and octane) significantly reduced. Mark as fixed for now and will wait for it to show up on crosbolt.
Thanks again for your investigation, Ben!
The time to run the telemetry/perf unit tests on cros dropped from ~80 min to <40 min. 
 Issue 624316  has been merged into this issue.

Comment 29 by dchan@google.com, Mar 4 2017

Labels: VerifyIn-58

Comment 30 by dchan@google.com, Apr 17 2017

Labels: VerifyIn-59

Comment 31 by dchan@google.com, May 30 2017

Labels: VerifyIn-60
Labels: VerifyIn-61

Comment 33 by dchan@chromium.org, Oct 14 2017

Status: Archived (was: Fixed)

Sign in to add a comment