rialto canary channel not updating reliably due to lab dependency on "login" screen
Issue description (migrated from https://code.google.com/p/chrome-os-partner/issues/detail?id=54224)
The veyron_rialto canary build has only pushed once in the last two months: 8530.6.0
https://cros-goldeneye.corp.google.com/rialto/console/listBuild?milestone=53#/
See this thread for a long discussion of the cause: https://groups.google.com/a/google.com/forum/#!topic/chromeos-infra-discuss/N9Vkww8X-zs
Summary is the lab is under resourced for release builders and this can cause paygen step to fail.
,
Jul 13 2016
> Summary is the lab is under resourced for release builders
> and this can cause paygen step to fail
Although it's true that in general, a number of boards fail
a lot because the master scheduler is slow, and can't always
complete all the jobs all the time, that's not what's happening
to veyron_rialto.
I did a quick check on DUT history during the last canary.
Here's a snippet for one DUT:
chromeos4-row4-rack9-host22
2016-07-13 07:55:03 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row4-rack9-host22/57331952-repair/
2016-07-13 07:51:25 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row4-rack9-host22/57331783-reset/
2016-07-13 07:47:39 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row4-rack9-host22/57331623-repair/
2016-07-13 07:44:07 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row4-rack9-host22/57331467-reset/
2016-07-13 07:40:56 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row4-rack9-host22/57331342-repair/
There were no tests run in that interval. Looking at the individual
DUTs, there were basically no tests run at all because of the behavior
seen above.
Moving to debug, I note first that the sequence above means that each
Reset job fails, followed by a successful repair. I checked status.log for
one of the repair jobs, and the short summary is that at the time of repair,
there was nothing reported wrong, therefore nothing to fix, and therefore
success.
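To make the pattern easier to spot across many DUTs, here is a minimal sketch that parses history lines in the format shown above and counts reset-then-repair cycles. It assumes, following the reading given here, that `--` on a reset line means the reset failed; the names `Event`, `parse_history`, and `count_reset_repair_cycles` are hypothetical and not part of Autotest.

```python
import re
from typing import List, NamedTuple


class Event(NamedTuple):
    """One line of DUT history (hypothetical helper type)."""
    timestamp: str
    status: str   # "OK" or "--" in the snippet above
    task: str     # "reset" or "repair", taken from the log URL


# Matches lines such as:
#   2016-07-13 07:55:03 OK http://.../hosts/<host>/57331952-repair/
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(\S+)\s+\S*/\d+-(\w+)/?$"
)


def parse_history(text: str) -> List[Event]:
    """Parse history text (newest first) into chronological events."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            events.append(Event(*match.groups()))
    # The history is printed newest first; reverse it.
    return list(reversed(events))


def count_reset_repair_cycles(events: List[Event]) -> int:
    """Count non-OK resets immediately followed by an OK repair."""
    cycles = 0
    for prev, cur in zip(events, events[1:]):
        if (prev.task == "reset" and prev.status != "OK"
                and cur.task == "repair" and cur.status == "OK"):
            cycles += 1
    return cycles


BASE = ("http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/"
        "chromeos4-row4-rack9-host22/")
SAMPLE = "\n".join([
    "2016-07-13 07:55:03 OK " + BASE + "57331952-repair/",
    "2016-07-13 07:51:25 -- " + BASE + "57331783-reset/",
    "2016-07-13 07:47:39 OK " + BASE + "57331623-repair/",
    "2016-07-13 07:44:07 -- " + BASE + "57331467-reset/",
    "2016-07-13 07:40:56 OK " + BASE + "57331342-repair/",
])

history = parse_history(SAMPLE)
cycles = count_reset_repair_cycles(history)
```

A DUT whose history consists almost entirely of such cycles is spending its time in reset and repair rather than running tests.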
The reset jobs are a bit more of a nuisance; status.log has nothing, so
you have to read the autoserv.DEBUG log. A sample log shows this:
07/13 05:46:27.246 DEBUG| base_utils:0278| [stdout] ui stop/waiting
07/13 05:46:27.280 DEBUG| base_utils:0278| [stdout] ui start/running, process 10720
07/13 05:46:27.283 DEBUG| ssh_host:0180| Running (ssh) 'bootstat_get_last login-prompt-visible'
[ ... ]
07/13 05:47:56.678 DEBUG| ssh_host:0180| Running (ssh) 'bootstat_get_last login-prompt-visible'
07/13 05:47:56.911 ERROR| site_utils:0869| Timed out waiting for login prompt
07/13 05:47:56.913 ERROR| reset:0032| Timed out waiting for login prompt
What this means is that after stopping the 'ui' job, chrome
failed to report going back to login prompt.
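The wait that times out in this log can be sketched roughly as follows. This is a simplified illustration, not the Autotest implementation: the function name, the `run` callable, the default timeout, and the `FakeDut` class are all assumptions made for the example. Only the two command strings come from this bug.

```python
import time
from typing import Callable


class LoginPromptTimeout(Exception):
    """Raised when the login-prompt-visible event does not change."""


def wait_for_new_login_prompt(
    run: Callable[[str], str],
    timeout: float = 60.0,          # assumed value, for illustration only
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Restart the 'ui' job and poll until bootstat reports a new event."""
    old = run("bootstat_get_last login-prompt-visible")
    run("stop ui ; start ui")
    deadline = clock() + timeout
    while clock() < deadline:
        new = run("bootstat_get_last login-prompt-visible")
        if new != old:
            return new
        sleep(poll_interval)
    raise LoginPromptTimeout("Timed out waiting for login prompt")


class FakeDut:
    """Stand-in for a DUT; a real caller would run commands over ssh."""

    def __init__(self, emits_event: bool):
        self.emits_event = emits_event
        self.stamp = 100

    def run(self, command: str) -> str:
        if command.startswith("stop ui"):
            # A device running Chrome records a fresh event after restart.
            if self.emits_event:
                self.stamp += 1
            return ""
        return str(self.stamp)


# A device that emits the event returns promptly with the new value.
result = wait_for_new_login_prompt(FakeDut(emits_event=True).run)
```

With `FakeDut(emits_event=False)`, which models a device where nothing records the event, the same call polls until the deadline and raises `LoginPromptTimeout`.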
,
Jul 13 2016
Thanks for investigating.
Rialto does not have a login prompt (it has no UI) so it would be expected for any test depending on it to fail. What test / step is that?
Can you explain why 8530.6.0 completed successfully? There's been no code change so I don't understand why that build did pass when all others fail.
Anyway, if it is a test failure then it brings me full circle back to my original request on https://bugs.chromium.org/p/chromium/issues/detail?id=621750 that we disable all the HW tests for rialto until the kiosk mode migration has completed. AIUI the way to do that is to deallocate all the DUTs from the pool in the lab (bring rialto into line with Jetstream).
,
Jul 13 2016
> Rialto does not have a login prompt (it has no UI) so it
> would be expected for any test depending on it to fail.
Strictly speaking, it's not looking for the login prompt as
such. It's looking for an event that Chrome emits when it
reaches login prompt. If there's no Chrome, though, it's
likely a distinction without a difference...
> What test / step is that?
The code in question happens in between tests during suite
execution. The code stops the 'ui' job (which is considered
equivalent to stopping Chrome), and then waits for the
event that says Chrome is back at login prompt.
> Can you explain why 8530.6.0 completed successfully?
No. From the data at my disposal, it looks like it failed:
https://uberchromegw.corp.google.com/i/chromeos_release/builders/veyron_rialto-release%20release-R53-8530.B/builds/5
> There's been no code change so I don't understand why that
> build did pass when all others fail.
,
Jul 13 2016
> > What test / step is that?
>
> The code in question happens in between tests during suite
> execution. The code stops the 'ui' job (which is considered
> equivalent to stopping Chrome), and then waits for the
> event that says Chrome is back at login prompt.
Following up, the thing that runs in between tests is the
'Reset' task. Boiled down to its essence it runs the
`cleanup()` and `verify()` methods on a host passed in to
the task. For a rialto DUT, the host class will be `CrosHost`.
The code that fails is the cleanup method; it's in this
file:
server/hosts/cros_host.py
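Structurally, the task described here boils down to something like the sketch below. It is an illustration under stated assumptions, not the Autotest source: `run_reset` and `FakeHost` are hypothetical names, and the real `CrosHost` class is not reproduced. The only details taken from this bug are that the task calls `cleanup()` and `verify()` on the host and that the cleanup step is the one that fails.

```python
class FakeHost:
    """Hypothetical stand-in for a host object with cleanup() and verify()."""

    def __init__(self, ui_reports_login_prompt: bool):
        self.ui_reports_login_prompt = ui_reports_login_prompt
        self.calls = []

    def cleanup(self) -> None:
        self.calls.append("cleanup")
        # Models the failure seen in the log: the event never arrives.
        if not self.ui_reports_login_prompt:
            raise TimeoutError("Timed out waiting for login prompt")

    def verify(self) -> None:
        self.calls.append("verify")


def run_reset(host) -> bool:
    """Run cleanup() then verify(); report whether both succeeded."""
    try:
        host.cleanup()
        host.verify()
    except Exception as error:  # broad on purpose for this sketch
        print("reset failed: %s" % error)
        return False
    return True


normal_host = FakeHost(ui_reports_login_prompt=True)
ui_less_host = FakeHost(ui_reports_login_prompt=False)
normal_ok = run_reset(normal_host)
ui_less_ok = run_reset(ui_less_host)
```

In this model the UI-less host never reaches `verify()`, so every reset fails before any check of the device's actual health runs.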
,
Jul 15 2016
Thanks Richard.
So, it looks like deploy_chrome.py has some smarts to detect if app_shell was installed instead of chrome, and so set startui = False. We would have lost that behavior when switching away from the app_shell binary.
https://cs.corp.google.com/chromeos_public/chromite/scripts/deploy_chrome.py?q=app_shell&sq=package:(%5Echromeos_public$%7C%5Echromeos_internal$)&l=288&dr=C
What would be the correct way for Rialto to configure (or the script detect) that startui should be false on Rialto too?
,
Jul 15 2016
One other question - when you found "Timed out waiting for login prompt", which build step would that have occurred in? Paygen, AUTest, or something else?
https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_rialto-release/builds/209
In https://bugs.chromium.org/p/chromium/issues/detail?id=621750#c9 it was stated that the issue blocking canary is all around the paygen step failure, which is to do with stateful.tgz not existing, so I want to clarify which symptom this error actually connects with.
,
Jul 15 2016
> One other question - when you found
> "Timed out waiting for login prompt" which build step
> would that have occurred in? Paygen, AUTest, or something else?
The complaint itself comes from the test lab, not from a build
step as such. Tests against the lab happen with the steps named
HWTest, AUTest, ASyncHWTest, and as part of Paygen.
The source code reference in Autotest is mentioned in c#5.
,
Jul 15 2016
> So, it looks like deploy_chrome.py has some smarts to detect
> if app_shell was installed instead of chrome, and so set
> startui = False. We would have lost that behavior when
> switching away from app_shell binary
deploy_chrome.py isn't used in the test lab; this isn't the source
of your problem.
We might benefit from adapting some of that code to Autotest, but
I don't fully understand what it's doing. Plus, if that code would
no longer work for rialto, adapting it to Autotest can't fix Rialto,
either. :-(
,
Jul 20 2016
- Do you have any idea how other UI-less devices would bypass the dependency on having chrome emit the login prompt event? ... When rialto was using app_shell (up until M50) it never had this problem. I can't work out when exactly it was introduced though (as the logs are not visible in build reports) to know exactly what changed to introduce this dependency.
- Is there a way for the DUT to detect it is running in the lab? ... It seems the simplest hack to work around this would be to guard the rialto special-casing in chrome_setup.cc to not trigger in the lab. This would cause rialto to boot just as if it is any other chromebox-like device, right to the login screen, and hence be far more test friendly.
https://cs.corp.google.com/chromeos_internal/src/platform2/login_manager/chrome_setup.cc?type=cs&q=veyron_rialto&l=214
If we can't do that, I think disabling/removing the DUTs from the lab, until kiosk migration is complete, is the only sensible path.
,
Aug 3 2016
,
Aug 3 2016
,
Aug 4 2016
OK. Pausing to think, and recollect all the code pieces...
The Autotest code calls this command:
stop ui ; start ui
Then Autotest waits for this command to return a new
value:
bootstat_get_last login-prompt-visible
So, to get the result you want, it's sufficient to create an
upstart job that includes the following:
start on started ui
exec bootstat login-prompt-visible
No actual login screen is harmed in the execution of this
command.
,
Aug 4 2016
@jrbarnette, this sounds a great suggestion. Are there any more general docs on how bootstat works? (It looks a useful mechanism for us to collect device stats, but this bug was the first I heard of it)
Anyhow, I'll see if I have a chance to give this a go in the next 2 days; if not, @drustsmith - this might be a great starter CL for niranjan; either Alex or a CrOS mentor should be able to guide the steps.
,
Aug 4 2016
> Are there any more general docs on how bootstat works?
> (It looks a useful mechanism for us to collect device
> stats, but this bug was the first I heard of it)
You've seen virtually all there is to see.
The sources are in src/platform2/bootstat. There's a
README in the source which should be a good starting
point for understanding. Everything (README, .c/.h,
Makefile, scripts, everything) comes to 894 lines, so
it's not a heavy lift.
For samples of how to use the output, take a look at this
test in the Autotest sources:
client/site_tests/platform_BootPerf/platform_BootPerf.py
Or, for a simpler introduction, log in to your favorite test
device, and run this command:
bootstat_summary
,
Aug 4 2016
https://chrome-internal-review.googlesource.com/272898
Re any docs on it, my other curiosity was about how/when the lab infrastructure came to depend on it. Up until R50, we used app_shell instead of chrome which also had no concept of displaying a login prompt, yet we never saw this issue. I wonder how app_shell avoided the problem or if it was some change that came about more recently than R50. (Since rialto moved off app_shell I believe no device is using it, hence it might also have regressed in this way and we wouldn't know).
Thanks
,
Aug 5 2016
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/overlays/overlay-variant-veyron-rialto-private/+/70c689ba9fc44316a2af26b06ec780cdb1ee5982
commit 70c689ba9fc44316a2af26b06ec780cdb1ee5982
Author: Jonathan Dixon <joth@google.com>
Date: Thu Aug 04 04:01:56 2016
,
Aug 6 2016
This has fixed lab testing a treat: https://screenshot.googleplex.com/vmY4sbkG5fo
Request merge to M53 to get better test coverage there too.
,
Aug 6 2016
Your change meets the bar and is auto-approved for M53 (branch: 2785)
,
Aug 8 2016
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/overlays/overlay-variant-veyron-rialto-private/+/4c6dfeea241c1345deea7e52a935cce911a14d6d
commit 4c6dfeea241c1345deea7e52a935cce911a14d6d
Author: Jonathan Dixon <joth@google.com>
Date: Thu Aug 04 04:01:56 2016
,
Aug 10 2016
This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible! If all merges have been completed, please remove any remaining Merge-Approved labels from this issue. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Aug 12 2016
,
Aug 13 2016
This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible! If all merges have been completed, please remove any remaining Merge-Approved labels from this issue. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Aug 22 2016
,
Dec 20 2016
Comment 1 by shuqianz@chromium.org
, Jul 13 2016