
Issue 653357

Starred by 2 users

Issue metadata

Status: Started
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




calling login_LoginSuccess in autoupdate_EndToEndTest instead of just running it in bvt-inline hides failures from wmatrix and GoldenEye

Project Member Reported by ihf@chromium.org, Oct 6 2016

Issue description

Login is tested everywhere, so one would think a failure to log in would show up in one of these suites:
login_LoginSuccess/control:ATTRIBUTES = "suite:bvt-inline, suite:push_to_prod, suite:smoke"

But the first caller of LoginSuccess is not one of the listed suites, but
autoupdate_EndToEndTest.py:            client_at.run_test('login_LoginSuccess', arc_mode='enabled')

But autoupdate_EndToEndTest is not run in any standard suite (only the dynamic suite paygen_au_stable). One needs to search for these results explicitly, which then shows Intel was dead and ARM limping along:
https://wmatrix.googleplex.com/unfiltered?hide_missing=True&tests=autoupdate_EndToEndTest.%2A&releases=53&days_back=30

Now that we have introduced strong survivor bias, the only way login_LoginSuccess can fail is by flaking! Indeed, looking at the results of login_LoginSuccess for the .86 build, they all pass on ARM devices but show no (red, blue, yellow, green) result boxes at all on Intel devices. This is because on Intel the test was never scheduled inside of a suite like bvt-inline (as it is known to be failing), and the call from autoupdate_EndToEndTest was not recorded in the tko result DB (I think it should at least have been recorded as part of paygen_au_stable?):
https://wmatrix.googleplex.com/unfiltered?hide_missing=True&tests=login_LoginSuccess&releases=53&days_back=30
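For readers less familiar with autotest scheduling: a test lands in a suite only via the ATTRIBUTES line of its control file, whereas a direct run_test() call from another test bypasses suite scheduling (and, apparently, normal result reporting). A rough sketch of what such a control file looks like — field values here are illustrative, not the real login_LoginSuccess control file:

```python
# Hypothetical autotest control file (a sketch, not the real one).
# Suite membership is declared here, via ATTRIBUTES; `job` is the
# object autotest injects when it executes the control file.
NAME = 'login_LoginSuccess'
AUTHOR = 'chromium-os-dev'
ATTRIBUTES = 'suite:bvt-inline, suite:push_to_prod, suite:smoke'
TIME = 'SHORT'
TEST_TYPE = 'client'
DOC = 'Sign in and verify the login flow completes.'

job.run_test('login_LoginSuccess')
```

A call like `client_at.run_test('login_LoginSuccess', ...)` from another test never consults ATTRIBUTES, so the suite machinery has no record of it.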

Now, seemingly the easiest fix would have been to remove login_LoginSuccess from autoupdate_EndToEndTest and let later stages of testing have plenty of opportunity to discover the failure.

 
> But the first caller of LoginSuccess is not one of the listed suites, but
> autoupdate_EndToEndTest.py:            client_at.run_test('login_LoginSuccess', arc_mode='enabled')

I'm not sure of the basis for this assertion.  As a general
rule, there's no single test guaranteed to be first.  "First"
doesn't really even have meaning in a world where different DUTs
have a different "first test", and there's no fixed set of tests
run on any given set of DUTs.

In practical terms, the 'bvt-inline' test suite usually runs to
completion prior to the Paygen tests, so on a typical BVT run,
the login_LoginSuccess test will run on some DUT prior to the
Paygen test running on any DUT.  However, there's no guarantee
that the tests will run on the same DUT.

This type of failure is not completely hidden: it is shown as missing when we create a GoldenEye release for a build (which is when test/TPMs check it out in the morning).

Compare the Installer Test column when you go here and click on .86 stable:
https://cros-goldeneye.corp.google.com/chromeos/console/listBuild?milestone=53#/details

with the AU Tests column when you create a GoldenEye release for the same build:
https://cros-goldeneye.corp.google.com/chromeos/console/qaRelease?releaseName=M53-STABLE-4
I would imagine a failing test and a missing test both show up as red in GoldenEye under the AU Tests column, so if it was ignored when missing, it probably would have been ignored when failing.
Owner: ----
It's not clear to me what change is actually being requested
here.  Can someone describe in more exact detail what's wrong?

In general, tests are allowed to call other tests.  In particular,
it's _not_ a bug that the AU tests exercise login through
LoginSuccess.  If the problem is that wmatrix or GE are reporting
unwanted results, it would help to have examples to motivate
what needs to change.

Comment 5 by autumn@chromium.org, Oct 12 2016

Owner: ihf@chromium.org
Back to ihf@: can you please clarify based on the question above?

Comment 6 by ihf@chromium.org, Oct 13 2016

Status: Started (was: Untriaged)
I thought about it, and nobody has brought compelling reasons why login should be tested in the autoupdate test. It does not belong there. (Notice nobody suggests cramming all tests from all suites in there either.) So my proposal is to remove login from the autoupdate test and rerun the build on the failing R53 build, as discussed here:
https://chromium-review.googlesource.com/#/c/397867/
I don't get what you're trying to achieve with this change, Ilja.

The Paygen phase failing doesn't stop bvt-inline from running. 
bvt-inline didn't get kicked off on SOME boards because the sanity (provision) phase failed. 

https://uberchromegw.corp.google.com/i/chromeos_release/builders/buddy-release%20release-R53-8530.B/builds/74

https://uberchromegw.corp.google.com/i/chromeos_release/builders/gandof-release%20release-R53-8530.B/builds/73

https://uberchromegw.corp.google.com/i/chromeos_release/builders/mccloud-release%20release-R53-8530.B/builds/73

In these cases, removing login_LoginSuccess from autoupdate_EndToEndTest would have no effect because it still wouldn't have been run in bvt-inline. 

On other Intel boards (e.g. link), the sanity phase did pass and bvt-inline was kicked off, BUT it didn't run all of the bvt-inline tests, including login_LoginSuccess.

https://uberchromeos-server38.corp.google.com/new_tko/#tab_id=spreadsheet_view&row=job_name%252Ctest_name&column=status&show_incomplete=true&show_only_latest=false&show_invalid=true&condition=job_name%2520LIKE%2520%27link-release/R53-8530.86.0/bvt-inline/%2525%27%2520AND%2520test_name%2520%253C%253E%2520%27SERVER_JOB%27%2520AND%2520test_name%2520NOT%2520LIKE%2520%27CLIENT%255C_JOB%2525%27

Comment 8 by ihf@chromium.org, Oct 14 2016

1) Good point. The provision phase failed due to another "Can Chrome login?" check. This needs to be removed as well.

https://bugs.chromium.org/p/chromium/issues/detail?id=597499#c31
https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/client/common_lib/cros/autoupdater.py?q=%22Chrome+failed+to+reach+login+screen%22+package:%5Echromeos_public$&l=667

2) You do understand that this whole postmortem is about tko/wmatrix/goldeneye not showing missing tests? Hence giving me a link to tko misses the point. If you look at the link for the link board, you will see that all tests that it shows are "green". Yet if you find the parent suite, you will see that some tests failed (provision error due to lab problems!), while others were aborted due to the 1800-second timeout:
http://cautotest/afe/#tab_id=view_job&object_id=76996110

Yes it did not run login_LoginSuccess because it was a victim of the timeout. I installed the R53-8530.86.0 image just fine. The screen is black.
The reason for tests being very slow (and hence the suite timeout) is that Chrome continuously crashes a few times a second, slowing everything to a crawl, writing dumps, etc.

Interestingly login_LoginSuccess is very slow, but it passes, just like several other tests that I would have expected to catch the problem (even though Chrome crashed a few hundred times during the test):

localhost autotest # bin/autotest tests/login_LoginSuccess/control
DEBUG:root:Failed to import elasticsearch. Mock classes will be used and calls to Elasticsearch server will be no-op. Test run is not affected by the missing elasticsearch module.
23:07:05 INFO | Writing results to /usr/local/autotest/results/default
23:07:05 ERROR| Symlink init scripts failed with [Errno 30] Read-only file system
23:09:17 INFO | START	----	----	timestamp=1476425357	localtime=Oct 13 23:09:17	
23:09:18 INFO | File /var/lib/cleanup_logs_paused was already present
23:09:18 INFO | 	START	login_LoginSuccess	login_LoginSuccess	timestamp=1476425358	localtime=Oct 13 23:09:18	
23:09:24 INFO | ChromeOS BOARD = link_1.8GHz_4GB
23:09:25 INFO | Chose browser: PossibleCrOSBrowser(browser_type=system)
23:09:25 INFO | (Re)starting the ui (logs the user out)
23:09:28 INFO | Restarting Chrome with flags and login
23:09:30 INFO | Invoking Oobe.loginForTesting
23:09:41 INFO | Browser is up!
23:09:41 INFO | OS: chromeos 

[...] (hundreds if not thousands of Chrome crashes happening here)

23:09:42 INFO | (Re)starting the ui (logs the user out)
23:11:10 INFO | 		GOOD	login_LoginSuccess	login_LoginSuccess	timestamp=1476425470	localtime=Oct 13 23:11:10	completed successfully
23:11:10 INFO | 	END GOOD	login_LoginSuccess	login_LoginSuccess	timestamp=1476425470	localtime=Oct 13 23:11:10	
23:11:10 INFO | Leaving existing /var/lib/cleanup_logs_paused file
23:11:11 INFO | END GOOD	----	----	timestamp=1476425470	localtime=Oct 13 23:11:10	

3) Here is another interesting observation: all login_LoginSuccess control files have the same NAME. This can't possibly be good. Do we run login_LoginSuccess sometimes 50 times and sometimes once?

ihf@ql:~/cros_tot/src/third_party/autotest/files/client/site_tests/login_LoginSuccess$ grep NAME control*
control:NAME = "login_LoginSuccess"
control.stress:NAME = "login_LoginSuccess"
control.stress2:NAME = "login_LoginSuccess"
control.stress3:NAME = "login_LoginSuccess"

4) Now graphics_WebGLAquarium, which is part of bvt-cq which did not run, does detect problems:
01:00:22 INFO | autoserv| STATUS: FAIL	graphics_WebGLAquarium	graphics_WebGLAquarium	timestamp=1476424649	localtime=Oct 14 00:57:29	Could not get idle CPU.
It also reports a gazillion Chrome crashes and GPU hangs.
[ 4269.655314] chrome[6148]: segfault at 101b2d145000 ip 00007f3bf6de29a4 sp 00007ffc9b846450 error 4 in libEGL.so.1.0.0[7f3bf6dd6000+1d000]

5) If I remove the idle CPU etc. checks from WebGLAquarium it still correctly detects the problem:
TestFail: Could not get FPS count.

6) A philosophical question: if you still think that calling login_LoginSuccess (plus starting ARC!) is a good idea in update/provision tests, then why not add the whole bvt suite there? Do you agree that a test should focus on testing minimal functionality, so that one can reason about its outcome?

7) Side note: I also want bvt-cq to run whenever possible. There is no reason for early aborts, and bvt-cq is what GoldenEye reports. Also notice that WebGLAquarium in bvt-cq would have detected the problem.

Comment 9 by ihf@chromium.org, Oct 14 2016

Cc: jrbarnette@chromium.org dhadd...@chromium.org dcasta...@chromium.org
Blerg, yes, for no. 2 I actually had the AFE page open in one tab but ended up pasting the TKO link (too many cautotest tabs open while I was replying ;)


I get what you are trying to do now but as mentioned it wouldn't have had the desired effect since provision/sanity failing blocked bvt-inline anyway. 

Richard can hopefully comment about the chrome check during provision. 
> 1) Good point. provision phase failed due to another
>   "Can Chrome login?" check. This needs to be removed as well.

No, this isn't a check for "can Chrome login".  This is a check for
"did Chrome start".  Moreover, the particular check isn't even an
error check; it's diagnosis.  The real error is that the image wasn't
marked good, and if the DUT is rebooted, it could roll back to the
previous build.  That's a critical failure, and it can only be
reliably detected during provisioning:  We _must_ make the check
during provisioning, and we _must_ fail the build when that check fails.

IIUC, the key problem here is simply this:  The Chrome bug in
question left the system unstable, but able to limp along enough
to pass some tests sometimes.  The result was that the symptom of
the bug was not "tests failed" but rather "tests didn't run".
And the interactions among dynamic suites, TKO, and GoldenEye are
such that our reporting system DOESN'T REPORT THE PROBLEM.

So, the question we ought to be discussing is "how do we plumb
the fact that tests didn't run from dynamic suites (which know
about it) to GoldenEye (which doesn't get told)".
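One minimal shape for that plumbing — a hypothetical sketch, not an existing TKO or GoldenEye API — is to merge the suite's scheduled test list with the recorded results, so that "never ran" becomes an explicit status instead of a silent absence:

```python
def suite_report(scheduled, results):
    """Merge scheduler intent with recorded results so 'never ran' is visible.

    scheduled: iterable of test names the suite intended to run.
    results:   dict mapping test name -> status string from the results DB.
    Returns a dict mapping every scheduled test to its status, with tests
    that produced no result at all marked 'MISSING'.
    """
    return {test: results.get(test, 'MISSING') for test in scheduled}
```

With something like this, a dashboard would render MISSING as loudly as FAIL, which is exactly the gap described above.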

Comment 13 by ihf@chromium.org, Oct 15 2016

I disagree that "did Chrome start" is a meaningful question, as we are interested in the correct completion of a small but meaningful task. (Merely starting is not meaningful, as I can point the CPU at any random bits.) Worse, the call is

    client_at.run_test('login_LoginSuccess', arc_mode='enabled')

Clearly the intention is to test ARC in some way, which has absolutely nothing to do with autoupdate. Even if arc_mode were not enabled, the rollback to the previous image would happen only due to problems *before* the login screen - no reason to log in. What bugs me about requiring Chrome is that about half of all tests (graphics and kernel) don't care about Chrome and even blow it away ("stop ui") to do their testing. We could make it the default on test images to "stop ui" and only "start ui" for tests that instantiate chrome.Chrome(). Then a crashing Chrome will stop causing fear for everyone else.
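The "stop ui by default" idea could look roughly like this — a hypothetical sketch using the standard ChromeOS `stop ui` / `start ui` upstart commands, with the command runner injectable so the sketch can be exercised off-device:

```python
import contextlib
import subprocess

@contextlib.contextmanager
def ui_stopped(runner=subprocess.check_call):
    """Run a block with the Chrome UI upstart job stopped.

    'stop ui' / 'start ui' are the standard ChromeOS upstart commands;
    `runner` defaults to subprocess.check_call, which raises if either
    command fails.
    """
    runner(['stop', 'ui'])  # tear Chrome down so it cannot interfere
    try:
        yield
    finally:
        runner(['start', 'ui'])  # always bring the UI back, even on failure
```

A graphics or kernel test body wrapped in `with ui_stopped():` would then be insulated from a continuously crashing Chrome, while Chrome-based tests keep starting the UI themselves.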

Anyhow, I do agree with the "the problem is not reporting" part. There are multiple levels to "not reporting". I will make sure this gets tracked.
Labels: akeshet-pending-downgrade
ChromeOS Infra P1 Bugscrub.

P1 Bugs in this component should be important enough to get weekly status updates.

Is this already fixed?  -> Fixed
Is this no longer relevant? -> Archived or WontFix
Is this not a P1, based on go/chromeos-infra-bug-slo rubric? -> lower priority.
Is this a Feature Request rather than a bug? Type -> Feature
Is this missing important information or scope needed to decide how to proceed? -> Ask question on bug, possibly reassign.
Does this bug have the wrong owner? -> reassign.

Bugs that remain in this state next week will be downgraded to P2.
Labels: -akeshet-pending-downgrade Pri-2
ChromeOS Infra P1 Bugscrub.

Issue untouched in a week after previous message. Downgrading to P2.
Whatever happened here?

Comment 17 by ihf@chromium.org, Sep 28 2017

The problem still exists. The current call sequence circumvents reporting of problems to goldeneye.
> The current call sequence circumvents reporting of problems to goldeneye.

What does "circumvents reporting" mean here?  And what's the mechanism?
I've reviewed the bug history, and I've yet to see a clear explanation
of what problem is meant to be solved here.  The subject implies that
login_LoginSuccess isn't run from bvt-inline, but that's never been
true.

Regarding, specifically, the invocation of login_LoginSuccess from AU
tests, the history is this:
  * There was an AU bug where the presence of a user profile broke
    AU.  I forget the exact symptom, but the problem could only
    occur if a user had already logged in at least once.
  * The bug slipped through AU testing because there was no
    coverage of "update from a system with an existing profile".
    Indeed, since AU testing always starts from a fresh system,
    the only test case was "update from a system at OOBE."

This bug left us with a dilemma:  AU tests are expensive, so we
can't test every system state that might affect update.  However,
the bug was important, and we'd missed it merely for lack of coverage.
We decided that testing with a user profile present was more important
than testing with no profile, so we changed the test to make sure
there would be a profile present during update.

I'm still happy with the decision to change the test, and I don't
think we should change the coverage without a compelling reason.

Hi, this bug has not been updated recently. Please acknowledge the bug and provide status within two weeks (6/22/2018), or the bug will be archived. Thank you.
