New issue
Advanced search Search tips

Issue 860519 link

Starred by 2 users

Issue metadata

Status: Duplicate
Merged: issue 891594
Owner:
Closed: Oct 12
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug


Participants' hotlists:
Hotlist-Tast


Sign in to add a comment

UI tests fail while restarting 'ui' job

Project Member Reported by derat@chromium.org, Jul 5

Issue description

The ui.ChromeLogin Tast test failed on daisy-release at http://stainless/browse/chromeos-autotest-results/214182083-chromeos-test/ :

2018/07/04 23:53:51 Started test ui.ChromeLogin
2018/07/04 23:53:51 [23:53:51.311] Restarting ui job
2018/07/04 23:53:51 [23:53:51.328] Error at chrome_login.go:69: Chrome login failed: restarting job "ui" failed: exit status 1: initctl: Job is already running: ui

ui.ChromeCrashNotLoggedIn ran right before ui.ChromeLogin, and it looks like the 'ui' Upstart job was restarting around when ui.ChromeLogin started:

...
2018-07-04T23:52:48.382329-07:00 INFO session_manager[10610]: [INFO:browser_job.cc(165)] Terminating process: exiting cleanly
2018-07-04T23:52:48.382799-07:00 INFO session_manager[10610]: [INFO:system_utils_impl.cc(94)] Sending 15 to 11051 as 1000
2018-07-04T23:52:48.386656-07:00 INFO session_manager[10610]: [INFO:session_manager_service.cc(480)] SessionManagerService quitting run loop
2018-07-04T23:52:48.387364-07:00 INFO session_manager[10610]: [INFO:session_manager_service.cc(201)] SessionManagerService exiting
2018-07-04T23:53:51.506780-07:00 INFO sshd[11183]: Accepted publickey for root from 100.109.25.147 port 38512 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU
...

I knew that upstart.RestartJob was racy when I wrote it; it checks a job's current status and then runs either 'initctl restart' or 'initctl start' depending on whether it's already running or not.

Unfortunately, I don't think that Upstart makes it easy to do this in a reliable way; see e.g. https://stackoverflow.com/questions/9479668/upstart-initctl-startrestart-ubuntu.

Maybe I should just make Chrome.restartChromeForTesting explicitly call StopJob before calling RestartJob; that might work here.
 
Cc: mnissler@chromium.org
I think that there are more examples of this in the daisy-release run at http://stainless/browse/chromeos-autotest-results/219439613-chromeos-test/ in some video tests that require login.

Looking more closely, though, I think that this is a symptom of the "ui" job hanging indefinitely after an earlier test tried to stop it:

2018/07/21 15:22:09 Started test video.PlayH264
2018/07/21 15:22:09 [15:22:08.902] Restarting ui job
2018/07/21 15:23:12 [15:23:11.902] Error at bundle.go:210: Test timed out
...
2018/07/21 15:23:12 Completed test video.PlayH264 in 1m3.001s with 1 error(s)
2018/07/21 15:23:12 Started test video.PlayVP8
2018/07/21 15:23:12 [15:23:11.904] Restarting ui job
2018/07/21 15:23:12 [15:23:11.935] Error at play.go:32: Failed to connect to Chrome: restarting job "ui" failed: exit status 1: initctl: Job is already running: ui
...
2018/07/21 15:23:12 Completed test video.PlayVP8 in 33ms with 1 error(s)
2018/07/21 15:23:12 Started test video.PlayVP9
2018/07/21 15:23:12 [15:23:11.936] Restarting ui job
2018/07/21 15:23:12 [15:23:11.955] Error at play.go:32: Failed to connect to Chrome: restarting job "ui" failed: exit status 1: initctl: Job is already running: ui
...
2018/07/21 15:23:12 Completed test video.PlayVP9 in 20ms with 1 error(s)

daisy doesn't support ARC, so this is probably unrelated to issue 865813.

Strangely, it looks like there were no updates to /var/log/messages after session_manager started exiting at 15:22:11:

...
2018-07-21T15:22:07.260514-07:00 INFO session_manager[24853]: [INFO:session_manager_impl.cc(567)] Starting user session
2018-07-21T15:22:09.075596-07:00 INFO session_manager[24853]: [INFO:browser_job.cc(167)] Terminating browser process 24884 with signal 15: exiting cleanly
2018-07-21T15:22:09.075701-07:00 INFO session_manager[24853]: [INFO:system_utils_impl.cc(94)] Sending 15 to 24884 as 1000
2018-07-21T15:22:09.075858-07:00 INFO session_manager[24853]: [INFO:system_utils_impl.cc(94)] Sending 15 to 25156 as 1000
2018-07-21T15:22:09.547911-07:00 INFO bluetoothd[1802]: adapter /org/bluez/hci0 set power to 1
2018-07-21T15:22:11.223997-07:00 INFO session_manager[24853]: [INFO:session_manager_service.cc(481)] SessionManagerService quitting run loop
2018-07-21T15:22:11.224187-07:00 INFO session_manager[24853]: [INFO:session_manager_service.cc(201)] SessionManagerService exiting
2018-07-21T15:23:12.233939-07:00 INFO sshd[25523]: Accepted publickey for root from 100.108.189.5 port 35664 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU
2018-07-21T15:23:12.521259-07:00 INFO sshd[25526]: Accepted publickey for root from 100.108.189.5 port 35674 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU
<EOF>

powerd.LATEST shows Chrome disconnecting from D-Bus:

[0721/152210:INFO:suspend_delay_controller.cc(121)] Unregistering suspend delay 61997086 (chrome) due to D-Bus client :1.538 going away
[0721/152210:INFO:suspend_delay_controller.cc(121)] Unregistering dark suspend delay 62029852 (chrome) due to D-Bus client :1.538 going away
[0721/152211:INFO:suspend_delay_controller.cc(121)] Unregistering suspend delay 61997087 (chrome) due to D-Bus client :1.539 going away
[0721/152211:INFO:suspend_delay_controller.cc(121)] Unregistering dark suspend delay 62029853 (chrome) due to D-Bus client :1.539 going away

I'm not sure how to debug further to figure out why the ui job was hanging. :-/
I've grabbed a daisy DUT (chromeos2-row4-rack8-host19) and running ui.ChromeLogin forever. I'll update if I find anything.

I tried ~900 iterations of Chrome login on daisy, and found one iteration takes much longer than expected.

50%ile: 57s
90%ile: 85s
95%ile: 98s
max: 144s

OTOH I did not observe any indefinite hang. I will try seeing where it is choked.
Cc: derat@chromium.org gwendal@chromium.org hashimoto@chromium.org
Owner: apronin@chromium.org
It turned out this is another TPM issue.

I set 30 seconds timeout to "restart ui" and dumped several states on timeout. "ps auxwwf" output is:

...
root     16638  0.0  0.0   1416   468 ?        Ss   22:37   0:00 /bin/sh /usr/share/cros/init/ui-post-stop
root     16641  0.1  0.2  14956  5104 ?        S    22:37   0:00  \_ cryptohome --action=pkcs11_terminate

So "ui" job is choked at "cryptohome --action=pkcs11_terminate". I've attached call stacks of various processes (cryptohomed, chapsd, tcsd) at the moment.

apronin@, could you please take a look?

procs.txt
16.0 KB View Download
chapsd.991.dmp.txt
25.1 KB View Download
cryptohome.23540.dmp.txt
5.9 KB View Download
cryptohomed.1122.dmp.txt
37.7 KB View Download
tcsd.963.dmp.txt
8.4 KB View Download
Project Member

Comment 5 by bugdroid1@chromium.org, Jul 27

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/8d555fb2bf6f670385e506e625b7f80d9c86c699

commit 8d555fb2bf6f670385e506e625b7f80d9c86c699
Author: Shuhei Takahashi <nya@chromium.org>
Date: Fri Jul 27 15:50:49 2018

chrome: Save minidumps when "restart ui" times out.

It is known that "restart ui" is sometimes slow by TPM issues.

Also sets 30-second timeout to "restart ui".

BUG= chromium:860519 
TEST=tast run DUT ui.ChromeLogin
CQ-DEPEND=CL:1152639

Change-Id: I258fd520f0d2cf3f50d28901d026aa90b648b8a8
Reviewed-on: https://chromium-review.googlesource.com/1152635
Commit-Ready: Shuhei Takahashi <nya@chromium.org>
Tested-by: Shuhei Takahashi <nya@chromium.org>
Reviewed-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/8d555fb2bf6f670385e506e625b7f80d9c86c699/src/chromiumos/tast/local/chrome/chrome.go

Project Member

Comment 6 by bugdroid1@chromium.org, Jul 27

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/f2bf0a7aa6422c41b2690b774c09264675784fb1

commit f2bf0a7aa6422c41b2690b774c09264675784fb1
Author: Shuhei Takahashi <nya@chromium.org>
Date: Fri Jul 27 15:50:57 2018

chrome: Add ui.ChromeLoginForever test.

BUG= chromium:860519 
TEST=tast run DUT ui.ChromeLoginForever

Change-Id: I48dbe8b28ef6da7853ebd08a4bd645fd505c91dd
Reviewed-on: https://chromium-review.googlesource.com/1152636
Commit-Ready: Shuhei Takahashi <nya@chromium.org>
Tested-by: Shuhei Takahashi <nya@chromium.org>
Reviewed-by: Shuhei Takahashi <nya@chromium.org>

[add] https://crrev.com/f2bf0a7aa6422c41b2690b774c09264675784fb1/src/chromiumos/tast/local/bundles/cros/ui/chrome_login_forever.go

Project Member

Comment 7 by bugdroid1@chromium.org, Jul 27

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/tast/+/d956c8f04ec2166d3f382b2d2cadaf0e1021646f

commit d956c8f04ec2166d3f382b2d2cadaf0e1021646f
Author: Shuhei Takahashi <nya@chromium.org>
Date: Fri Jul 27 15:50:49 2018

tast: Add testing.ContextOutDir.

We often want to save files in utility functions for debugging,
but passing around testing.State or outDir is a pain.

BUG= chromium:860519 
TEST=tast run DUT example.Pass

Change-Id: I4124ec2334377fa7527bdbd38f1bba59d033a721
Reviewed-on: https://chromium-review.googlesource.com/1152639
Commit-Ready: Shuhei Takahashi <nya@chromium.org>
Tested-by: Shuhei Takahashi <nya@chromium.org>
Reviewed-by: Shuhei Takahashi <nya@chromium.org>

[modify] https://crrev.com/d956c8f04ec2166d3f382b2d2cadaf0e1021646f/src/chromiumos/tast/testing/state.go

Summary: UI tests fail while restarting 'ui' job (was: ui.ChromeLogin fails when restarting already-running ui job)
I'm seeing daily test failures that look similar to this now. Most of them are on daisy, but they occasionally occur on other boards too. Here are a few examples:

daisy-release R70-10944.0.0
tast.ui.ChromeCrashNotLoggedIn, tast.ui.ChromeLogin
http://stainless/browse/chromeos-autotest-results/224532521-chromeos-test/

daisy-release R70-10945.0.0
tast.video.PlayH264, tast.video.PlayVP8, tast.video.PlayVP9
http://stainless/browse/chromeos-autotest-results/224613310-chromeos-test/

daisy-release R70-10941.0.0
tast.ui.ChromeCrashNotLoggedIn, tast.ui.ChromeLogin
http://stainless/browse/chromeos-autotest-results/224215499-chromeos-test/

poppy-release R70-10941.0.0
tast.ui.ChromeCrashNotLoggedIn, tast.ui.ChromeLogin
http://stainless/browse/chromeos-autotest-results/224236932-chromeos-test/

falco-release R70-10942.0.0
tast.ui.ChromeCrashNotLoggedIn, tast.ui.ChromeLogin
http://stainless/browse/chromeos-autotest-results/224307756-chromeos-test/

falco-release R70-10943.0.0
tast.ui.ChromeCrashNotLoggedIn, tast.ui.ChromeLogin
http://stainless/browse/chromeos-autotest-results/224396501-chromeos-test/

---

In the poppy-release example above, ui-post-stop was again blocked on "cryptohome --action=pkcs11_terminate":

root      5315  0.0  0.0   4328   848 ?        Ss   08:15   0:00 /bin/sh /usr/share/cros/init/ui-post-stop
root      5318  0.1  0.3  35084 12496 ?        S    08:15   0:00  \_ cryptohome --action=pkcs11_terminate

Here are the cryptohomed messages from syslog:

2018-08-05T15:15:53.401566+00:00 ERR cryptohomed[2213]: TpmIsReady: is not owned.
2018-08-05T15:15:53.401586+00:00 ERR cryptohomed[2213]: HasAuthorization: TPM not ready.
2018-08-05T15:15:53.401597+00:00 ERR cryptohomed[2213]: Destroy() called with insufficient authorization.
2018-08-05T15:15:53.404018+00:00 WARNING cryptohomed[2213]: Could not load the device policy file.
2018-08-05T15:15:53.560073+00:00 WARNING cryptohomed[2213]: Could not load the device policy file.
2018-08-05T15:15:53.652295+00:00 WARNING cryptohomed[2213]: Could not load the device policy file.
2018-08-05T15:15:53.689588+00:00 WARNING cryptohomed[2213]: message repeated 2 times: [ Could not load the device policy file.]
2018-08-05T15:15:53.894728+00:00 WARNING cryptohomed[2213]: Canceled creating cryptohome key - TPM is not ready.
2018-08-05T15:15:54.806765+00:00 WARNING cryptohomed[2213]: Could not load the device policy file.
2018-08-05T15:15:55.765616+00:00 WARNING cryptohomed[2213]: Canceled creating cryptohome key - TPM is not ready.
2018-08-05T15:15:55.961599+00:00 WARNING cryptohomed[2213]: Canceled creating cryptohome key - TPM is not ready.
2018-08-05T15:15:56.746344+00:00 INFO cryptohomed[2213]: Migrated (or created) user directory: /home/.shadow/c30861d54b050440dba9854eb78845cd5e393fda/mount
2018-08-05T15:15:56.747193+00:00 INFO cryptohomed[2213]: Creating pass-through directories /home/.shadow/c30861d54b050440dba9854eb78845cd5e393fda/mount/user/Cache
2018-08-05T15:15:56.747758+00:00 INFO cryptohomed[2213]: Creating pass-through directories /home/.shadow/c30861d54b050440dba9854eb78845cd5e393fda/mount/user/Downloads
2018-08-05T15:15:56.747939+00:00 INFO cryptohomed[2213]: Creating pass-through directories /home/.shadow/c30861d54b050440dba9854eb78845cd5e393fda/mount/user/GCache
2018-08-05T15:15:56.748092+00:00 INFO cryptohomed[2213]: Creating pass-through directories /home/.shadow/c30861d54b050440dba9854eb78845cd5e393fda/mount/user/GCache/v1
2018-08-05T15:15:56.748243+00:00 INFO cryptohomed[2213]: Creating pass-through directories /home/.shadow/c30861d54b050440dba9854eb78845cd5e393fda/mount/user/GCache/v2
2018-08-05T15:15:56.748390+00:00 INFO cryptohomed[2213]: Creating pass-through directories /home/.shadow/c30861d54b050440dba9854eb78845cd5e393fda/mount/user/GCache/v1/blobs
2018-08-05T15:15:56.748583+00:00 INFO cryptohomed[2213]: Creating pass-through directories /home/.shadow/c30861d54b050440dba9854eb78845cd5e393fda/mount/user/GCache/v1/tmp
2018-08-05T15:15:56.754011+00:00 INFO cryptohomed[2213]: RecursiveCopy: /home/.shadow/c30861d54b050440dba9854eb78845cd5e393fda/mount/user/log
2018-08-05T15:15:56.754360+00:00 INFO cryptohomed[2213]: RecursiveCopy: /home/.shadow/c30861d54b050440dba9854eb78845cd5e393fda/mount/user/.pki
2018-08-05T15:15:56.754642+00:00 INFO cryptohomed[2213]: RecursiveCopy: /home/.shadow/c30861d54b050440dba9854eb78845cd5e393fda/mount/user/.pki/nssdb
2018-08-05T15:15:56.756466+00:00 INFO cryptohomed[2213]: RecursiveCopy: /home/.shadow/c30861d54b050440dba9854eb78845cd5e393fda/mount/user/.ssh
2018-08-05T15:15:56.775097+00:00 WARNING cryptohomed[2213]: ManageCCDPwd is not supported
2018-08-05T15:15:56.775415+00:00 WARNING cryptohomed[2213]: Could not load the device policy file.
2018-08-05T15:15:57.707826+00:00 WARNING cryptohomed[2213]: TPM was not owned. TPM initialization call back will handle PKCS#11 initialization.
2018-08-05T15:15:57.934519+00:00 WARNING cryptohomed[2213]: GetTpmTokenSlotForPath: Path not found.
2018-08-05T15:15:58.226955+00:00 ERR cryptohomed[2213]: IsEndorsementKeyAvailable: Not Implemented.
2018-08-05T15:15:58.226974+00:00 ERR cryptohomed[2213]: IsEndorsementKeyAvailable: Not Implemented.
2018-08-05T15:15:58.529552+00:00 WARNING cryptohomed[2213]: GetTpmTokenSlotForPath: Path not found.
2018-08-05T15:15:58.923880+00:00 WARNING cryptohomed[2213]: RemoveKeyset: key to remove not found
2018-08-05T15:15:59.963940+00:00 ERR cryptohomed[2213]: SetIsOwned: Not Implemented.
2018-08-05T15:15:59.963996+00:00 ERR cryptohomed[2213]: TestTpmAuth: Not Implemented.
2018-08-05T15:15:59.967784+00:00 ERR cryptohomed[2213]: Taking TPM ownership took 1882ms
2018-08-05T15:16:00.996235+00:00 INFO cryptohomed[2213]: Created new cryptohome key.
2018-08-05T15:16:01.808020+00:00 INFO cryptohomed[2213]: Configuring TPM, ownership taken: 1.
2018-08-05T15:16:01.808052+00:00 INFO cryptohomed[2213]: Putting a Pkcs11_Initialize on the mount thread.
2018-08-05T15:16:11.145640+00:00 INFO cryptohomed[2213]: Lockbox created.
2018-08-05T15:16:11.924268+00:00 INFO cryptohomed[2213]: Skipping random salt generation.
2018-08-05T15:16:15.521761+00:00 INFO cryptohomed[2213]: Encrypted partition finalized.
2018-08-05T15:16:15.534025+00:00 INFO cryptohomed[2213]: InstallAttributes have been finalized.
2018-08-05T15:16:30.961656+00:00 INFO cryptohomed[2213]: A Pkcs11_Init event got finished.
2018-08-05T15:16:30.961723+00:00 INFO cryptohomed[2213]: PKCS#11 initialization succeeded.

Here are the tests that were run:

2018/08/05 08:15:50 Started test ui.ChromeCrashLoggedIn
2018/08/05 08:15:50 [08:15:49.681] Restarting ui job
...
2018/08/05 08:15:54 [08:15:53.259] Waiting for cryptohome /home/user/c30861d54b050440dba9854eb78845cd5e393fda
...
2018/08/05 08:15:59 Completed test ui.ChromeCrashLoggedIn in 9.367s with 0 error(s)
2018/08/05 08:15:59 Started test ui.ChromeCrashNotLoggedIn
2018/08/05 08:15:59 [08:15:59.036] Restarting ui job
...
2018/08/05 08:16:31 [08:16:30.961] Error at chrome_crash_not_logged_in.go:27: Chrome startup failed: restarting job "ui" failed: context deadline exceeded
...
2018/08/05 08:16:32 Completed test ui.ChromeCrashNotLoggedIn in 32.331s with 1 error(s)
2018/08/05 08:16:32 Started test ui.ChromeLogin
2018/08/05 08:16:32 [08:16:31.368] Restarting ui job
...
2018/08/05 08:16:34 [08:16:33.154] Error at chrome_login.go:72: Chrome login failed: restarting job "ui" failed: exit status 1
...
2018/08/05 08:16:34 Completed test ui.ChromeLogin in 1.983s with 1 error(s)

---

Andrey, do you know what could be making "pkcs11_terminate" block like this?
Project Member

Comment 9 by bugdroid1@chromium.org, Aug 9

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/cf64f1556958d7a28193ad18e045846604f8cc09

commit cf64f1556958d7a28193ad18e045846604f8cc09
Author: Daniel Erat <derat@chromium.org>
Date: Thu Aug 09 09:21:30 2018

tast-tests: Increase timeout for restarting ui job.

Increase the chrome package's timeout for restarting the
"ui" Upstart job from 30 seconds to 90 seconds to try to
work around ui-post-stop hangs due to slow-running
"cryptohome --action=pkcs11_terminate" commands.

BUG= chromium:860519 , chromium:872468 
TEST=ui.ChromeLogin still passes

Change-Id: I923de582411c64611cecebcb678c2fbd7d933167
Reviewed-on: https://chromium-review.googlesource.com/1168186
Commit-Ready: Dan Erat <derat@chromium.org>
Tested-by: Dan Erat <derat@chromium.org>
Reviewed-by: Shuhei Takahashi <nya@chromium.org>

[modify] https://crrev.com/cf64f1556958d7a28193ad18e045846604f8cc09/src/chromiumos/tast/local/chrome/chrome.go

Project Member

Comment 10 by bugdroid1@chromium.org, Aug 9

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/tast/+/a977eb6931b89c52554406e2b4cdb913247293c3

commit a977eb6931b89c52554406e2b4cdb913247293c3
Author: Daniel Erat <derat@chromium.org>
Date: Thu Aug 09 09:21:39 2018

tast: Increase default local test timeout to two minutes.

A long-running "cryptohome --action=pkcs11_terminate"
process can sometimes block the ui-post-stop script for an
extended period of time when tests try to restart the "ui"
job. Increase the default timeout for local tests from one
minute to two to try to work around this.

BUG= chromium:860519 , chromium:872468 
TEST=unit tests still pass

Change-Id: I6c6c7578bde586f4154353f0fe25c34896dd0b33
Reviewed-on: https://chromium-review.googlesource.com/1168184
Commit-Ready: Dan Erat <derat@chromium.org>
Tested-by: Dan Erat <derat@chromium.org>
Reviewed-by: Shuhei Takahashi <nya@chromium.org>

[modify] https://crrev.com/a977eb6931b89c52554406e2b4cdb913247293c3/src/chromiumos/tast/bundle/local.go

Fixing my embarrassing cc mistake. Sorry. :-(
Cc: -hashimoto@chromium.org hidehiko@chromium.org
Cc: -derat@chromium.org apronin@chromium.org
Owner: derat@chromium.org
Ah, where are we on this. I'm listed as the owner but did nothing.
Dan, did your CLs fix the issue?
Mergedinto: 891594
Status: Duplicate (was: Assigned)
I think that we still have random timeouts on daisy, but I suspect that that's a more general issue with the slowness of the hardware. See also  issue 894767  and issue 893957, for instance.

But I think/hope that I've finally worked around all races caused by the ui-respawn job, as tracked by  issue 891594 .

Sign in to add a comment