New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Chrome login fails in tests due to failed policy validation

Project Member Reported by derat@chromium.org, Oct 19

Issue description

sky@ pointed me at http://cautotest/afe/#tab_id=view_job&object_id=249807199, a squawks-release run where the ui.SingleProcessMashLogin Tast test failed with a timeout during login.

Logs are at http://stainless/browse/chromeos-autotest-results/249807199-chromeos-test/ :

2018/10/18 14:34:02 Started test ui.SingleProcessMashLogin
2018/10/18 14:34:02 [14:34:01.999] Restarting ui job
2018/10/18 14:34:03 [14:34:03.531] Waiting for org.chromium.SessionManager D-Bus service
2018/10/18 14:34:03 [14:34:03.532] Asking session_manager to enable Chrome testing
2018/10/18 14:34:03 [14:34:03.537] Waiting for Chrome to write its debugging port to /home/chronos/DevToolsActivePort
2018/10/18 14:34:04 [14:34:04.393] Checking cryptohomed service
2018/10/18 14:34:04 [14:34:04.393] Removing cryptohome for testuser@gmail.com
2018/10/18 14:34:04 [14:34:04.519] Finding OOBE DevTools target
2018/10/18 14:34:04 [14:34:04.620] Connecting to Chrome at ws://127.0.0.1:36895/devtools/page/6E324387C05242B6B6DB5F292D29FB0E
2018/10/18 14:34:05 [14:34:05.113] Waiting for OOBE
2018/10/18 14:34:10 [14:34:10.554] Logging in as user "testuser@gmail.com"
2018/10/18 14:34:10 [14:34:10.823] Waiting for cryptohome for user "testuser@gmail.com"
...
2018/10/18 14:35:59 [14:35:59.709] Error at single_process_mash_login.go:26: Chrome login failed: not mounted for testuser@gmail.com: context deadline exceeded; last error follows: /home/user/8a0d412dce3f1efad00146f2882f89ca22b2a162 not found

Previous tests that exercised Chrome login passed in the same run:

2018/10/18 14:33:36 Started test ui.ChromeLogin
2018/10/18 14:33:36 [14:33:36.206] Waiting for org.chromium.SessionManager D-Bus service
2018/10/18 14:33:36 [14:33:36.209] Restarting ui job
2018/10/18 14:33:37 [14:33:36.949] Waiting for org.chromium.SessionManager D-Bus service
2018/10/18 14:33:37 [14:33:36.971] Asking session_manager to enable Chrome testing
2018/10/18 14:33:37 [14:33:36.973] Waiting for Chrome to write its debugging port to /home/chronos/DevToolsActivePort
2018/10/18 14:33:39 [14:33:38.972] Checking cryptohomed service
2018/10/18 14:33:39 [14:33:38.973] Removing cryptohome for testuser@gmail.com
2018/10/18 14:33:39 [14:33:39.023] Finding OOBE DevTools target
2018/10/18 14:33:39 [14:33:39.166] Connecting to Chrome at ws://127.0.0.1:46859/devtools/page/FDFF2500861B4EB67C5C35A67641EE16
2018/10/18 14:33:39 [14:33:39.596] Waiting for OOBE
2018/10/18 14:33:44 [14:33:44.644] Logging in as user "testuser@gmail.com"
2018/10/18 14:33:44 [14:33:44.798] Waiting for cryptohome for user "testuser@gmail.com"
2018/10/18 14:33:46 [14:33:46.346] Waiting for OOBE to be dismissed
2018/10/18 14:33:48 [14:33:48.737] Waiting for SessionStateChanged "started" D-Bus signal from session_manager
2018/10/18 14:33:48 [14:33:48.737] Got SessionStateChanged signal
2018/10/18 14:33:48 [14:33:48.737] Creating new blank page
2018/10/18 14:33:49 [14:33:49.063] Connecting to Chrome at ws://127.0.0.1:46859/devtools/page/348BA8078D718871A21470739452216E
2018/10/18 14:33:50 [14:33:50.566] Navigating to http://127.0.0.1:37996
2018/10/18 14:33:51 [14:33:51.478] Got content "Hooray, it worked!"
2018/10/18 14:33:51 Completed test ui.ChromeLogin in 15.291s with 0 error(s)

The chrome_20181018-143403 log from the failed test contains the following:

[7618:7753:1018/143403.996796:WARNING:accelerometer_reader.cc(246)] Accelerometer device directory is empty at /dev/cros-ec-accel
[7618:7618:1018/143403.997211:VERBOSE1:drm_device_handle.cc(90)] Succeeded authenticating /dev/dri/card0 in 0 ms with 1 attempt(s)
[7765:7770:1018/143404.114243:VERBOSE1:drm_device_manager.cc(58)] Primary DRM device added: /sys/devices/pci0000:00/0000:00:02.0/drm/card0
[7618:7618:1018/143404.224082:ERROR:cloud_policy_validator.cc(441)] Policy signature validation failed
[7618:7618:1018/143404.224176:ERROR:session_manager_operation.cc(214)] Policy validation failed: 2 (BAD_SIGNATURE)
[7618:7618:1018/143404.224225:ERROR:device_settings_service.cc(333)] Session manager operation failed: 5 (VALIDATION_ERROR)
[7618:7618:1018/143404.224283:ERROR:device_settings_provider.cc(1016)] Failed to retrieve cros policies. Reason: 5 (VALIDATION_ERROR)
...
[7618:7618:1018/143404.321289:VERBOSE1:chrome_session_manager.cc(234)] Starting Chrome with login/oobe screen.
[7618:7618:1018/143404.321423:VERBOSE1:login_display_host_webui.cc(1191)] Showing OOBE screen: unknown
[7618:7618:1018/143404.337617:VERBOSE1:login_display_host_webui.cc(1263)] Current locale: en-US
...
[7618:7753:1018/143404.661019:ERROR:cloud_policy_validator.cc(441)] Policy signature validation failed
[7618:7618:1018/143404.683650:INFO:secure_channel_initializer.cc(119)] SecureChannelInitializer::OnBluetoothAdapterReceived(): Bluetooth adapter has been fetched. Passing all queued requests to the service.
[7618:7618:1018/143404.732225:ERROR:session_manager_operation.cc(214)] Policy validation failed: 2 (BAD_SIGNATURE)
[7618:7618:1018/143404.732308:ERROR:device_settings_service.cc(333)] Session manager operation failed: 5 (VALIDATION_ERROR)
[7618:7618:1018/143404.732365:ERROR:device_settings_provider.cc(1016)] Failed to retrieve cros policies. Reason: 5 (VALIDATION_ERROR)
...
[7618:7752:1018/143404.943622:ERROR:cloud_policy_validator.cc(441)] Policy signature validation failed
...
[7618:7618:1018/143405.081812:VERBOSE1:login_display_mojo.cc(63)] Emitting login-prompt-visible
...
[7618:7618:1018/143405.108212:ERROR:session_manager_operation.cc(214)] Policy validation failed: 2 (BAD_SIGNATURE)
[7618:7618:1018/143405.108301:ERROR:device_settings_service.cc(333)] Session manager operation failed: 5 (VALIDATION_ERROR)
[7618:7618:1018/143405.108362:ERROR:device_settings_provider.cc(1016)] Failed to retrieve cros policies. Reason: 5 (VALIDATION_ERROR)
...
[7618:7618:1018/143410.209011:VERBOSE1:gaia_screen_handler.cc(1170)] LoadAuthExtension, force: 1, offline: 0
...
[7618:7618:1018/143410.253620:INFO:signin_screen_handler.cc(1472)] Login WebUI >> active: 1, source: gaia-signin
[7618:7618:1018/143410.386246:VERBOSE1:signin_screen_handler.cc(1426)] Login WebUI >> loginVisible, src: oobe, webui_visible_: 0
[7618:7618:1018/143410.394924:VERBOSE1:signin_screen_handler.cc(1426)] Login WebUI >> loginVisible, src: gaia-loading, webui_visible_: 1
...
[7618:7618:1018/143410.548761:VERBOSE1:gaia_screen_handler.cc(620)] Auth extension finished loading
[7618:7618:1018/143410.556481:ERROR:CONSOLE(217)] "Unexpected condition on chrome://oobe/gaia-signin: Could not find value for languageList", source: chrome://resources/js/load_time_data.js (217)
[7618:7618:1018/143410.796369:VERBOSE1:gaia_screen_handler.cc(779)] HandleCompleteLogin
[7618:7618:1018/143410.796503:VERBOSE1:login_display_mojo.cc(106)] Show error, error_id: 1864, attempts:1, help_topic_id: 188036
[7618:7618:1018/143410.796543:ERROR:core_oobe_handler.cc(238)] CoreOobeHandler::ShowSignInError: error_text=Sign-in failed, please connect to the internet and try again.
...
[7618:7618:1018/143414.824144:VERBOSE1:gaia_screen_handler.cc(802)] Gaia is loaded
[7618:7618:1018/143414.824245:VERBOSE1:signin_screen_handler.cc(1426)] Login WebUI >> loginVisible, src: gaia-signin, webui_visible_: 1
[7618:7618:1018/143414.824350:WARNING:CONSOLE(17510)] "Unrecognized message from GAIA: showConfirmCancel", source: chrome://oobe/login.js (17510)
...

I don't see any of these policy errors in the Chrome logs from earlier tests, e.g. chrome_20181018-143337.

What could cause policy data to get messed up in the middle of a test run? The test that ran before ui.SingleProcessMashLogin kills session_manager over and over:

2018/10/18 14:33:51 Started test ui.SessionManagerRespawn
2018/10/18 14:33:51 [14:33:51.553] Getting initial session_manager process
2018/10/18 14:33:51 [14:33:51.560] Initial session_manager process is 5216
2018/10/18 14:33:51 [14:33:51.560] Killing 5216
2018/10/18 14:33:51 [14:33:51.562] Waiting for session_manager to respawn
2018/10/18 14:33:52 [14:33:52.597] Respawned session_manager process is 5926
2018/10/18 14:33:52 [14:33:52.597] Repeatedly killing session_manager to check that ui-respawn stops restarting it eventually
2018/10/18 14:33:52 [14:33:52.597] Killing 5926 and watching for respawn
2018/10/18 14:33:53 [14:33:53.282] Killing 6079 and watching for respawn
2018/10/18 14:33:54 [14:33:54.579] Killing 6360 and watching for respawn
2018/10/18 14:33:55 [14:33:55.388] Killing 6640 and watching for respawn
2018/10/18 14:33:56 [14:33:56.056] Killing 6900 and watching for respawn
2018/10/18 14:33:56 [14:33:56.739] Killing 7047 and watching for respawn
2018/10/18 14:34:01 [14:34:01.746] session_manager (correctly) not respawned
2018/10/18 14:34:02 Completed test ui.SessionManagerRespawn in 10.459s with 0 error(s)

Is session_manager corrupting the policy file?
 
Labels: -Pri-2 Hotlist-TreeCloser Hotlist-CrOS-Gardener Pri-1
#1: Yes, I reached the same conclusion from the logs (at http://stainless/browse/chromeos-autotest-results/251033913-chromeos-test/).

[8900:8900:1022/192640.243285:ERROR:session_manager_operation.cc(214)] Policy validation failed: 2 (BAD_SIGNATURE)
[8900:8900:1022/192640.243341:ERROR:device_settings_service.cc(333)] Session manager operation failed: 5 (VALIDATION_ERROR)
[8900:8900:1022/192640.243377:ERROR:device_settings_provider.cc(1016)] Failed to retrieve cros policies. Reason: 5 (VALIDATION_ERROR
...
[8900:8900:1022/192642.435079:VERBOSE1:gaia_screen_handler.cc(850)] HandleCompleteLogin
[8900:8900:1022/192642.435238:VERBOSE1:login_display_mojo.cc(106)] Show error, error_id: 1865, attempts:1, help_topic_id: 188036
[8900:8900:1022/192642.435284:ERROR:core_oobe_handler.cc(238)] CoreOobeHandler::ShowSignInError: error_text=Sign-in failed, please connect to the internet and try again.

Mattias, any thoughts? If there's something that we can/should do at the beginning of tests to wipe corrupt policy files, I'm happy to add that, but I need some help figuring out what that is (and when it's safe to do).
From xiyuan@ at  https://crbug.com/880823#c10 :

---

Looks like ui.ChromeLogin is the first test that completes sign-in successfully. So we will generate an owner key on the device.

ui.20181009-192447
====
[1009/192453:INFO:keygen_worker.cc(53)] Writing Owner key to /home/user/9ee774749aa421f6912b763e19572304f256ccc8/key.pub
[1009/192453:INFO:policy_key.cc(120)] wrote 294 bytes to /home/user/9ee774749aa421f6912b763e19572304f256ccc8/key.pub

/var/log/messages:
====
2018-10-10T02:24:53.992222+00:00 INFO session_manager[14168]: [INFO:policy_key.cc(120)] wrote 294 bytes to /var/lib/whitelist/owner.key
2018-10-10T02:24:53.992242+00:00 INFO session_manager[14168]: [INFO:policy_service.cc(264)] Persisted policy key to disk.

On the chrome side, it should generate a policy blob after the owner key generation finishes. However, the ownkey-gen -> policy-gen is asynchronous and does not blocks sign-in. I suspect we exit before it finishes and we end up with an owner key present but no policy blob (or having a stale one signed by a different key).
This failed again recently for peppy-chrome-pfq
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8931817354285908368

The successive build passed however.
Cc: amoylan@chromium.org malaykeshav@chromium.org
Mattias said that he'll take a look at this tomorrow.

We chatted a bit over IM about the problem. He agreed with Xiyuan's analysis that there's a policy on disk, but it's inconsistent with the key. It sounds like we can safely make Tast tests (probably via chrome.New) wipe /var/lib/whitelist to clear existing policy files. We should stop the ui job before doing this.

Mattias also pointed at http://cs/chromeos_public/src/third_party/autotest/files/client/cros/ownership.py?l=95&rcl=83c018c571f37a322417f4096e9984d680ddd3a2 (the Autotest code for clearing policy files) and noted that it seems like it's incorrect now that we're using numbered policy.N files instead of a single file.

So it seems like these tests can probably wipe policy files to work around cases where they run on DUTs that are already in a bad state for whatever reason, but there's still a mystery around how DUTs are getting into that state in the first place. The example at the top here, where some Tast login tests passed and then another login test in the same run failed due to bad policy files, still suggests to me that there may be a race.
Here's some analysis:

2018-10-18T21:33:36.964463+00:00 INFO session_manager[5216]: [INFO:session_manager_service.cc(152)] SessionManagerService starting

^ Note that session manager is starting here without saying "No policy key on disk at /var/lib/whitelist/owner.key", so there is a policy key installed at this point.

2018-10-18T21:33:36.973582+00:00 INFO session_manager[5216]: [INFO:browser_job.cc(156)] Running browser /opt/google/chrome/chrome --ppapi-flash-path=/opt/google/chrome/pepper/libpepflashplayer.so --ppapi-flash-version=31.0.0.122 --ui-prioritize-in-gpu-process --use-gl=egl --enable-native-gpu-memory-buffers --gpu-sandbox-failures-fatal=yes --enable-logging --log-level=1 --use-cras --enable-wayland-server --user-data-dir=/home/chronos --system-developer-mode --login-profile=user --has-chromeos-keyboard --guest-wallpaper-large=/usr/share/chromeos-assets/wallpaper/guest_large.jpg --guest-wallpaper-small=/usr/share/chromeos-assets/wallpaper/guest_small.jpg --child-wallpaper-large=/usr/share/chromeos-assets/wallpaper/child_large.jpg --child-wallpaper-small=/usr/share/chromeos-assets/wallpaper/child_small.jpg --default-wallpaper-large=/usr/share/chromeos-assets/wallpaper/oem_large.jpg --default-wallpaper-small=/usr/share/chromeos-assets/wallpaper/oem_small.jpg --default-wallpaper-is-oem --enable-consumer-kiosk --enterprise-enrollment-initial-modulus=15 --enterprise-enrollment-modulus-limit=19 --login-manager --remote-debugging-port=0 --disable-logging-redirect --ash-disable-system-sounds --oobe-skip-postlogin --disable-gaia-services --autoplay-policy=no-user-gesture-required --enable-experimental-extension-apis --whitelisted-extension-id=behllobkkfkfnphdnhnkndlbkcpglgmj --load-extension=/tmp/tast_chrome_extensions.991905158/test_api_extension --arc-availability=none --testing-channel=NamedTestingInterface:/tmp/.org.chromium.Chromium.2K1DWe/.org.chromium.Chromium.QrJxvd --vmodule=nss_cert_database_chromeos=1,*/assistant/*=1,*chromeos/login/*=1,auto_enrollment_controller=1,*/ui/ozone/*=1,*/ui/display/manager/chromeos/*=1,*night_light*=1,update_engine=1,component_updater_service=1,power_button_observer=2,webui_login_view=2,lock_state_controller=2,webui_screen_locker=2,screen_locker=2
2018-10-18T21:33:37.102574+00:00 INFO debugd[5289]: libminijail[5289]: mount '/' -> '/' type '' flags 0x1001
2018-10-18T21:33:37.103047+00:00 INFO debugd[5289]: libminijail[5289]: mount 'none' -> '/proc' type 'proc' flags 0xe
2018-10-18T21:33:37.103194+00:00 INFO debugd[5289]: libminijail[5289]: mount '/var' -> '/var' type '' flags 0x1000
2018-10-18T21:33:37.103279+00:00 INFO debugd[5289]: libminijail[5289]: mount 'tmpfs' -> '/mnt' type 'tmpfs' flags 0xe
2018-10-18T21:33:37.103362+00:00 INFO debugd[5289]: libminijail[5289]: mount '/mnt/stateful_partition/unencrypted/cache/vpd' -> '/mnt/stateful_partition/unencrypted/cache/vpd' type '' flags 0x1000
2018-10-18T21:33:37.103446+00:00 INFO debugd[5289]: libminijail[5289]: mount 'tmpfs' -> '/run' type 'tmpfs' flags 0xe
2018-10-18T21:33:37.103527+00:00 INFO debugd[5289]: libminijail[5289]: mount '/run/dbus' -> '/run/dbus' type '' flags 0x1001
2018-10-18T21:33:37.103609+00:00 INFO debugd[5289]: libminijail[5289]: mount '/tmp' -> '/tmp' type '' flags 0x1000
2018-10-18T21:33:37.103701+00:00 INFO debugd[5289]: libminijail[5289]: mount '/run/cups' -> '/run/cups' type '' flags 0x1001
2018-10-18T21:33:37.103807+00:00 INFO debugd[5289]: libminijail[5289]: mount '/run/ippusb' -> '/run/ippusb' type '' flags 0x1001
2018-10-18T21:33:37.103894+00:00 INFO debugd[5289]: libminijail[5289]: mount '/run/shill' -> '/run/shill' type '' flags 0x1001
2018-10-18T21:33:37.103985+00:00 INFO debugd[5289]: libminijail[5289]: mount '/run/arc/bugreport' -> '/run/arc/bugreport' type '' flags 0x1001
2018-10-18T21:33:37.104072+00:00 INFO debugd[5289]: libminijail[5289]: mount '/run/containers' -> '/run/containers' type '' flags 0x1001
2018-10-18T21:33:37.104155+00:00 INFO debugd[5289]: libminijail[5289]: mount '/dev' -> '/dev' type 'bind' flags 0x5000
2018-10-18T21:33:37.104237+00:00 INFO debugd[5289]: libminijail[5289]: mount '/sys' -> '/sys' type 'bind' flags 0x5000
2018-10-18T21:33:37.104324+00:00 INFO debugd[5289]: libminijail[5289]: mount '/run/tcsd' -> '/run/tcsd' type '' flags 0x1001
2018-10-18T21:33:39.021378+00:00 WARNING cryptohomed[1186]: No valid keysets on disk for 8a0d412dce3f1efad00146f2882f89ca22b2a162
2018-10-18T21:33:39.509930+00:00 INFO bluetoothd[2088]: adapter /org/bluez/hci0 set power to 1
2018-10-18T21:33:39.744990+00:00 INFO session_manager[5216]: [INFO:server_backed_state_key_generator.cc(126)] Stable device secret missing!
2018-10-18T21:33:45.192476+00:00 INFO cryptohomed[1186]: Encrypted partition finalized.
2018-10-18T21:33:45.206705+00:00 INFO cryptohomed[1186]: InstallAttributes have been finalized.
2018-10-18T21:33:46.326313+00:00 INFO cryptohomed[1186]: Migrated (or created) user directory: /home/.shadow/8a0d412dce3f1efad00146f2882f89ca22b2a162/mount
2018-10-18T21:33:46.330569+00:00 INFO cryptohomed[1186]: Creating pass-through directories /home/.shadow/8a0d412dce3f1efad00146f2882f89ca22b2a162/mount/user/Cache
2018-10-18T21:33:46.332353+00:00 INFO cryptohomed[1186]: Creating pass-through directories /home/.shadow/8a0d412dce3f1efad00146f2882f89ca22b2a162/mount/user/Downloads
2018-10-18T21:33:46.332876+00:00 INFO cryptohomed[1186]: Creating pass-through directories /home/.shadow/8a0d412dce3f1efad00146f2882f89ca22b2a162/mount/user/GCache
2018-10-18T21:33:46.333303+00:00 INFO cryptohomed[1186]: Creating pass-through directories /home/.shadow/8a0d412dce3f1efad00146f2882f89ca22b2a162/mount/user/GCache/v1
2018-10-18T21:33:46.333680+00:00 INFO cryptohomed[1186]: Creating pass-through directories /home/.shadow/8a0d412dce3f1efad00146f2882f89ca22b2a162/mount/user/GCache/v2
2018-10-18T21:33:46.334081+00:00 INFO cryptohomed[1186]: Creating pass-through directories /home/.shadow/8a0d412dce3f1efad00146f2882f89ca22b2a162/mount/user/GCache/v1/blobs
2018-10-18T21:33:46.334489+00:00 INFO cryptohomed[1186]: Creating pass-through directories /home/.shadow/8a0d412dce3f1efad00146f2882f89ca22b2a162/mount/user/GCache/v1/tmp
2018-10-18T21:33:46.339023+00:00 INFO cryptohomed[1186]: RecursiveCopy: /home/.shadow/8a0d412dce3f1efad00146f2882f89ca22b2a162/mount/user/.pki
2018-10-18T21:33:46.339549+00:00 INFO cryptohomed[1186]: RecursiveCopy: /home/.shadow/8a0d412dce3f1efad00146f2882f89ca22b2a162/mount/user/.pki/nssdb
2018-10-18T21:33:46.341344+00:00 INFO cryptohomed[1186]: RecursiveCopy: /home/.shadow/8a0d412dce3f1efad00146f2882f89ca22b2a162/mount/user/.ssh
2018-10-18T21:33:46.341862+00:00 INFO cryptohomed[1186]: RecursiveCopy: /home/.shadow/8a0d412dce3f1efad00146f2882f89ca22b2a162/mount/user/log
2018-10-18T21:33:47.154454+00:00 INFO session_manager[5216]: [INFO:policy_key.cc(51)] No policy key on disk at /home/root/8a0d412dce3f1efad00146f2882f89ca22b2a162/session_manager/policy/key
2018-10-18T21:33:47.154820+00:00 INFO cryptohomed[1186]: Putting a Pkcs11_Initialize on the mount thread.
2018-10-18T21:33:47.168917+00:00 INFO chapsd[1143]: Opening database in: /home/root/8a0d412dce3f1efad00146f2882f89ca22b2a162/chaps
2018-10-18T21:33:47.169010+00:00 INFO chapsd[1143]: leveldb /home/root/8a0d412dce3f1efad00146f2882f89ca22b2a162/chaps/database directory did not exist.
2018-10-18T21:33:47.202647+00:00 INFO chapsd[1143]: leveldb.stats:#012                               Compactions#012Level  Files Size(MB) Time(sec) Read(MB) Write(MB)#012--------------------------------------------------#012
2018-10-18T21:33:47.202689+00:00 INFO chapsd[1143]: leveldb contains 0 keys.
2018-10-18T21:33:47.218983+00:00 INFO chapsd[1143]: Importing opencryptoki objects.
2018-10-18T21:33:47.219051+00:00 WARNING chapsd[1143]: Did not find any opencryptoki objects to import.
2018-10-18T21:33:47.248495+00:00 INFO chapsd[1143]: Slot 1 ready for token at /home/root/8a0d412dce3f1efad00146f2882f89ca22b2a162/chaps
2018-10-18T21:33:47.250060+00:00 INFO cryptohomed[1186]: A Pkcs11_Init event got finished.
2018-10-18T21:33:47.250362+00:00 INFO cryptohomed[1186]: PKCS#11 initialization succeeded.
2018-10-18T21:33:47.251996+00:00 WARNING session_manager[5216]: [WARNING:device_policy_service.cc(506)] Could not verify that owner key belongs to this user.

^ At this point, session_manager checks whether the user who logs in has the private half of the owner key. That's not the case here. Assuming you're using the same account ID for all testing, the policy will state the user is supposed to be the owner though. This triggers a process referred to in the code as "key loss mitigation", i.e. we re-generate a key for use by the user. That happens asynchronously and takes some time.

2018-10-18T21:33:47.252210+00:00 ERR session_manager[5216]: [ERROR:dbus_util.cc(13)] CreateError(...): Domain=dbus, Code=org.chromium.SessionManagerInterface.PubkeySetIllegal, Message=Could not verify that owner key belongs to this user.
2018-10-18T21:33:47.254192+00:00 INFO chapsd[1143]: Created lock file: /run/lock/power_override/chapsd_token_init_slot_1.lock
2018-10-18T21:33:47.254359+00:00 INFO chapsd[1143]: Initializing key hierarchy for token at /home/root/8a0d412dce3f1efad00146f2882f89ca22b2a162/chaps
2018-10-18T21:33:47.293678+00:00 INFO session_manager[5216]: [INFO:session_manager_impl.cc(561)] Starting user session
2018-10-18T21:33:47.908335+00:00 INFO bluetoothd[2088]: adapter /org/bluez/hci0 set power to 1
2018-10-18T21:33:48.767076+00:00 INFO chapsd[1143]: Finished importing 0 pending objects.
2018-10-18T21:33:48.773645+00:00 INFO chapsd[1143]: Master key is ready for token at /home/root/8a0d412dce3f1efad00146f2882f89ca22b2a162/chaps
2018-10-18T21:33:48.773922+00:00 INFO chapsd[1143]: Deleted lock file: /run/lock/power_override/chapsd_token_init_slot_1.lock
2018-10-18T21:33:49.572427+00:00 WARNING cryptohomed[1186]: RemoveKeyset: key to remove not found
2018-10-18T21:33:49.854785+00:00 NOTICE dbus[536]: [system] Activating service name='org.chromium.EasyUnlock' (using servicehelper)
2018-10-18T21:33:49.906825+00:00 INFO easy_unlock[5709]: Starting EasyUnlock dbus service.
2018-10-18T21:33:49.909267+00:00 NOTICE dbus[536]: [system] Successfully activated service 'org.chromium.EasyUnlock'
2018-10-18T21:33:51.499331+00:00 INFO session_manager[5216]: [INFO:policy_service.cc(264)] Persisted policy key to disk.

^ A new policy key gets written to disk. This is consistent with the theory that a new key has been generated (session_manager generally only writes the key when it gets installed / rotated). When installing a new key, we also write out an initial policy blob signed by the new key, generating a log line that reads "Persisted policy to disk, path: /var/lib/whitelist/policy.1". This doesn't happen here, probably because session_manager gets killed before it manages to do so (see next log statement).

  So we're now in a situation where we have a policy key on disk that doesn't sign the policy file that's left around. Note that the messages complaining about policy signature check failures start exactly after this point, which corroborates the hypothesis.

  I've successfully recreated the situation by copying out /var/lib/whitelist/policy.1 after logging in, stopping ui, removing the cryptohome, logging in again and wait until a new owner.key and policy.1 file have been written, stopping ui, copy back the old policy.1 file, starting ui. At this point, login attempts fail silently with the ominous "CoreOobeHandler::ShowSignInError: error_text=Sign-in failed, please connect to the internet and try again." error message.





So the theory is confirmed, and the proposed workaround (clearing ownership state by wiping /var/lib/whitelist) should resolve this. The factor that led to this never getting reported as bug from the field but your tests still ran into it is that your device reset logic is incomplete: Removing the cryptohome for the owner account (which I assume you do) should arguably also reset device ownership (i.e. clearing /var/lib/whitelist). I think it's a good idea to do the latter unconditionally between tests to avoid device ownership state leaks across tests.

There's the larger issue that devices may end up in a situation where login is blocked with no way to recover. I've filed issue 898854 for that, but given there's no indication actual users are hitting this, it'll be pretty low on my person list of priorities.
Thanks for digging into this, Mattias!

As a less-risky workaround based on what you described above, would it be reasonable for the test code to actually check if policy.1 is signed by owner.key?

Or is it completely safe to unconditionally delete /var/lib/whitelist between tests when running on test DUTs (after stopping the "ui" job)? It seems to work on the DUT where I tried it; I'm mostly concerned about situations where someone's DUT is owned by a real GAIA account that they used during signin but then they run tests that use a different account. Could the loss of a valid key and policy file cause any issues in that case?

The main reason for my hesitation here is that the clear_ownership_files_no_restart Autotest function that you mentioned (http://cs/chromeos_public/src/third_party/autotest/files/client/cros/ownership.py?l=95&rcl=83c018c571f37a322417f4096e9984d680ddd3a2) appears to be called only from the login_OwnershipApi test, so I'm wondering if we'd be blazing new ground here by unconditionally deleting the dir.
I don't think we ever made any promises about which state a DUT would be left in after a test (with respect to the stateful file system, I think it is fair to expect the rootfs to be intact). As you note, there already is a function   ownership.restart_ui_to_clear_ownership_files which blows away ownership and by my count it is called by 10 different tests in autotest. I don't think there's any reason for caution here, in fact anything that makes the test environment more predictable is probably a win ;-)
Cc: mnissler@chromium.org nya@chromium.org hidehiko@chromium.org
Owner: derat@chromium.org
Status: Started (was: Assigned)
Thanks again! I've uploaded https://crrev.com/c/1309160 to make Tast's Chrome code recursively delete /var/lib/whitelist (since it gets recreated by session_manager).
Project Member

Comment 12 by bugdroid1@chromium.org, Nov 1

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/bdfc15efa8a161af65a4fc4a97d6a1d11d432826

commit bdfc15efa8a161af65a4fc4a97d6a1d11d432826
Author: Daniel Erat <derat@chromium.org>
Date: Thu Nov 01 23:24:52 2018

tast-tests: Clear policy files from /var/lib/whitelist.

Update chrome.New to delete the /var/lib/whitelist directory
when restarting the ui job. This is apparently needed in
some cases to get the DUT into a usable state if an earlier
test lets session_manager regenerate the owner.key file and
then kills it before it can also regenerate policy.1 and
sign it using the new owner key.

This also makes us clear the user's cryptohome while the ui
job is stopped rather than while it's running. I don't think
it's necessary, but it feels cleaner to delete policy files
and the cryptohome at around the same time.

Also update the "ready" function to wait for cryptohomed to
be running, as the chrome package no longer does this.

BUG= chromium:897278 
TEST=ran ui.ChromeLoginForever for 30 iterations without
     seeing failures; also started tryjobs

Change-Id: I9cf5226b7d82bad9f6fb7845815964cc1c9e0b6f
Reviewed-on: https://chromium-review.googlesource.com/1309160
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Dan Erat <derat@chromium.org>
Reviewed-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/bdfc15efa8a161af65a4fc4a97d6a1d11d432826/src/chromiumos/tast/local/ready/ready.go
[modify] https://crrev.com/bdfc15efa8a161af65a4fc4a97d6a1d11d432826/src/chromiumos/tast/local/chrome/chrome.go

Project Member

Comment 13 by bugdroid1@chromium.org, Nov 2

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/074805191d2d20eb5313e6ebcde3a6197ad8926c

commit 074805191d2d20eb5313e6ebcde3a6197ad8926c
Author: Daniel Erat <derat@chromium.org>
Date: Fri Nov 02 04:47:25 2018

tast-tests: Wait for cryptohome D-Bus service.

Make the local "cros" bundle wait for cryptohomed's D-Bus
service to be available before treating the DUT as ready. It
was previously just waiting for the process to start, but it
looks like there's still the potential for a race when
running on freshly-booted VMs: if the D-Bus service is
available, the cryptohome command (used when restarting
Chrome) crashes instead of waiting.

BUG= chromium:897278 
TEST=ran ui.ChromeLoginForever

Change-Id: If80e2efd1a3fa6e4297285a60740720b71b9375b
Reviewed-on: https://chromium-review.googlesource.com/c/1314050
Reviewed-by: Shuhei Takahashi <nya@chromium.org>
Commit-Queue: Shuhei Takahashi <nya@chromium.org>
Tested-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/074805191d2d20eb5313e6ebcde3a6197ad8926c/src/chromiumos/tast/local/ready/ready.go

Comment 14 Deleted

https://crrev.com/c/1309160 introduced a race that can result in cryptohome failures in arc.Boot (the first test that does login) on VM builders. I've chumped https://crrev.com/c/1314050 as a fix.
I'm hopeful that this is working now. I don't see any recent "not mounted for" test failures, apart from ones on guado_moblab that I think are unrelated.
Status: Fixed (was: Started)
Here's a good summary of recent results: http://stainless/search?view=list&first_date=2018-10-12&last_date=2018-11-08&test=%5Etast%5C.&reason=not+mounted+for&exclude_cts=false&exclude_not_run=true&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=true

There was a "not mounted for" error on nyan_kitty on Oct 30, and one on ninja on Oct 29, and a steady stream of multiple failures per day before then. But the only failures in November have been on guado_moblab, and I think they're unrelated (see  https://crbug.com/902318#c2 ).
Cc: emaxx@chromium.org pmarko@chromium.org alemate@chromium.org shend@chromium.org achuith@chromium.org olsen@chromium.org jrbarnette@chromium.org pastarmovj@chromium.org
 Issue 880823  has been merged into this issue.

Sign in to add a comment