Report better errors in Tast tests when cryptohomed isn't running |
|||||||
Issue descriptionIn http://stainless/search?view=list&first_date=2018-11-16&last_date=2018-11-29&branch=%5Emaster%24&test=%5Etast%5C..*%5C.&status=FAIL&reason=failed+to+remove+cryptohome&exclude_cts=false&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=false, I see a bunch of tests failing with "failed to remove cryptohome: exit status 1". In the cases that I looked at, this is usually happening because cryptohomed isn't actually running -- I typically see messages like this in the test log: 2018/11/29 01:16:14 [01:16:14.104] Removing cryptohome for testuser@gmail.com 2018/11/29 01:16:14 [01:16:14.124] External command failed: exit status 1 2018/11/29 01:16:14 [01:16:14.124] Command: cryptohome --action=remove --force --user=testuser@gmail.com 2018/11/29 01:16:14 [01:16:14.124] Uncaptured output: Remove call failed: The name org.chromium.Cryptohome was not provided by any .service files. 2018/11/29 01:16:14 [01:16:14.124] Error at apptest.go:23: Failed to connect to Chrome: failed to remove cryptohome: exit status 1 I'd like to update chrome.New to test that cryptohomed is available before trying to log in. We can then return a better error message, e.g. "cryptohomed not running". Going further, it'd be nice to try to diagnose why cryptohomed isn't running. In the cases that I've looked at in the past (with much help from apronin@), the underlying problem is typically related to attestationd, tpm_managerd, chapsd, etc. not running. Maybe we could additionally check the status of some or all of those processes.
,
Nov 30
Thanks for reporting!
,
Nov 30
,
Nov 30
..wow I am failing at this.
,
Nov 30
In known cases, e.g. https://stainless.corp.google.com/browse/chromeos-autotest-results/262444554-chromeos-test/, the error is caused by cryptohomed crashing, which in turn is caused by "The name org.chromium.Attestation was not provided by any .service files" dbus error. That likely means that attestationd is not running. But, unfortunately, in all cases the logfiles are trimmed, so don't contain the point when something happens to attestationd or other daemons. Even walking through other jobs on the same host doesn't help. The jobs are either from before the problem starts, or contain already heavily trimmed logs (due to cryptohomed constantly restarting). E.g. for https://stainless.corp.google.com/browse/chromeos-autotest-results/261807539-chromeos-test/. Judging by cryptohomed crashdumps, the problems start before 2018/11/28 00:00:53 -08:00. Looking at the jobs on chromeos6-row3-rack6-host15 (http://cautotest.corp.google.com/afe/#tab_id=view_host&object_id=7729). Before pyro-release/R72-11303.0.0/bvt-perbuild/tast.informational (started 2018-11-28 00:20:11) we have 1) cheets_StartAndroid.stress.1 - cryptohomed already crashing, logs trimmed Job 261807419 chromeos-test pyro-release/R72-11303.0.0/bvt-perbuild/cheets_StartAndroid.stress.1 2018-11-27 23:14:11 Aborted Reset 261807419 chromeos-test pyro-release/R72-11303.0.0/bvt-perbuild/cheets_StartAndroid.stress.1 2018-11-27 23:13:35 Completed 2) generic_RebootTest (http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=261807418) - cryptohomed already crashing, logs trimmed Job 261807418 chromeos-test pyro-release/R72-11303.0.0/bvt-cq/generic_RebootTest 2018-11-27 23:07:31 Completed Reset 261807418 chromeos-test pyro-release/R72-11303.0.0/bvt-cq/generic_RebootTest 2018-11-27 23:06:55 Completed In logs: 2018-11-28T06:08:54.396284+00:00 WARNING kernel: [ 0.129904] wrmsrl_safe (MSR_IA32_FEATURE_CONTROL, 00000007) failed error -5 2018-11-28T06:08:54.396284+00:00 WARNING kernel: [ 0.145884] #3<6>[ 0.146158] Enabling VMX ======================================================================== < 3972179 > characters are trimmed here. ======================================================================== 3.405986-08:00 INFO sshd[3017]: Accepted publickey for root from 127.0.0.1 port 34244 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU 2018-11-27T23:09:43.418852-08:00 INFO cryptohomed[3003]: Protocol version: 0 2018-11-27T23:09:43.418933-08:00 INFO cryptohomed[3003]: Pinweaver IsSupported: 0 (Reset job doesn't capture any logs: https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row3-rack6-host15/1804512-reset/) 3) hardware_Memtester.quick (http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=261807416) - everything is still fine. Job 261807416 chromeos-test pyro-release/R72-11303.0.0/bvt-cq/hardware_Memtester.quick 2018-11-27 23:05:43 Completed Reset 261807416 chromeos-test pyro-release/R72-11303.0.0/bvt-cq/hardware_Memtester.quick 2018-11-27 23:05:07 Completed Last words from attestationd: 2018-11-28T06:08:56.042983+00:00 INFO attestationd[1228]: Attestation Daemon Started. 2018-11-28T06:08:56.043580+00:00 INFO trunksd[1173]: CheckState: TPM State verified. 2018-11-28T06:08:56.043985+00:00 WARNING attestationd[1228]: libminijail[1228]: non-debug build: ignoring request to enable seccomp logging 2018-11-28T06:08:56.047127+00:00 INFO attestationd[1228]: Attestation service started. 2018-11-28T06:08:56.059054+00:00 INFO attestationd[1228]: Loading attestation database. cryptohomed is still running here: 2018-11-28T07:05:18.893436+00:00 INFO cryptohomed[1304]: Protocol version: 0 2018-11-28T07:05:18.893504+00:00 INFO cryptohomed[1304]: Pinweaver IsSupported: 0 Last timestamp captured in syslog for this test: 2018-11-28T07:06:20.117690+00:00 NOTICE ag[29999]: autotest server[stack::execute_section|_execute_daemon|run] -> ssh_run(/usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-AVdYI8 0 0) So, something has happened on boot between 23:06:20 and 23:09:43 (-08:00), but we have no logs for it. TZ has changed. Similar picture for other test runs with this issue. E.g. for failed nami run (http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=262099801), the issues started between cheets_StartAndroid and autoupdate_ForcedOBEUpdate 1) autoupdate_ForcedOBEUpdate - cryptohomed crashing Job 262099461 chromeos-test nami-release/R72-11312.0.0/bvt-perbuild/autoupdate_ForcedOOBEUpdate.delta 2018-11-28 23:18:51 Completed Reset 262099461 chromeos-test nami-release/R72-11312.0.0/bvt-perbuild/autoupdate_ForcedOOBEUpdate.delta 2018-11-28 23:18:13 Completed 2018-11-29T07:20:33.145621+00:00 WARNING kernel: [ 0.079892] can not enable VMX on CPU1 (already locked) 2018-11-29T07:20:33.145625+00:00 INFO kernel: [ 0.084401] NMI watchdog: enabled on all CPUs ======================================================================== < 464957 > characters are trimmed here. ======================================================================== ho 'no boot_id available'; fi) 2018-11-28T23:23:57.016640-08:00 ERR chapsd[1257]: TPM2Utility cannot be ready if the TPM is not owned. 2018-11-28T23:23:57.022884-08:00 ERR cryptohomed[3909]: AddDBusError(...): Domain=dbus, Code=org.freedesktop.DBus.Error.ServiceUnknown, Message=The name org.chromium.Attestation was not provided by any .service files 2018-11-28T23:23:57.023382-08:00 CRIT cryptohomed[3909]: SetSystemSalt error: 2#012/usr/lib64/libbase-core-456626.so(_ZN4base5debug10StackTraceC1Em+0x1c) [0x7e03a7f9937c]#012 2) cheets_StartAndroid.stress Job 262099368 chromeos-test nami-release/R72-11312.0.0/bvt-perbuild/cheets_StartAndroid.stress.3 2018-11-28 23:10:47 Completed Reset 262099368 chromeos-test nami-release/R72-11312.0.0/bvt-perbuild/cheets_StartAndroid.stress.3 2018-11-28 23:10:06 Completed Last words: 2018-11-29T07:11:16.895595+00:00 NOTICE ag[11135]: autotest server[stack::execute_section|_execute_daemon|run] -> ssh_run(nohup /usr/local/autotest/bin/autotestd /tmp/autoserv-NyNDe7 -H autoserv --verbose --hostname=chromeos2-row7-rack5-host11 --user=chromeos-test /usr/local/autotest/control.autoserv >/dev/null 2>/dev/null &) Something happened between 23:11:16 and 23:23:57. No logs. TZ changed. It'd be interesting indeed to know if attestationd is considered 'started'. And what's with other daemons: attestationd may be not running because trunksd is not running, for example. Interestingly, there are no signs of attestationd being constantly respawned.
,
Nov 30
Another case: something happened between 1) desktopui_ExitOnSupervisedUserCrash (https://stainless.corp.google.com/browse/chromeos-autotest-results/261724365-chromeos-test/chromeos2-row7-rack5-host5/) Everything still working fine at 14:53:50: 2018-11-27T14:52:57.400549-08:00 INFO cryptohomed[1507]: Protocol version: 0 2018-11-27T14:52:57.400584-08:00 INFO cryptohomed[1507]: Pinweaver IsSupported: 0 ... 2018-11-27T14:53:50.075270-08:00 NOTICE ag[9214]: autotest server[stack::execute_section|_execute_daemon|run] -> ssh_run(nohup /usr/local/autotest/bin/autotestd /tmp/autoserv-2YKbiD -H autoserv --verbose --hostname=chromeos2-row7-rack5-host5 --user=chromeos-test /usr/local/autotest/control.autoserv >/dev/null 2>/dev/null &) and 2) autoupdate_OmahaResponse.bad_metadata_size (https://stainless.corp.google.com/browse/chromeos-autotest-results/261724463-chromeos-test/chromeos2-row7-rack5-host5/) cryptohomed in crashloop at 14:57:58: 2018-11-27T21:47:06.171893+00:00 INFO kernel: [ 0.080387] Enabling VMX on cpu 1 2018-11-27T21:4 ======================================================================== < 3703908 > characters are trimmed here. ======================================================================== h notification for cryptohomed[3321] sig 6, user 0 group 0 (ignoring - crash origin is not ARC) 2018-11-27T14:57:58.211052-08:00 WARNING kernel: [ 13.242984] init: cryptohomed main process (3321) killed by ABRT signal 2018-11-27T14:57:58.211059-08:00 WARNING kernel: [ 13.242996] init: cryptohomed main process ended, respawning This time also reboot, but no TZ change. In all cases, the issues start after reboot. But the immediate tests before & after seem unrelated to the problem. So, just a randomly popping issue with attestationd or trunksd on boot? Can be another syscall not whitelisted in seccomp filters after libc update, which happens occasionally.
,
Dec 1
Finally, thanks to mruthven@, got a repro from the faft test runner (eve, R72-11310.0.0). 'messages' is attached. messages don't contain anything suspicious re attestationd, but attestationd process is zombie (with 4 threads), init still considers it as 'running'. # status attestationd attestationd start/running, process 1307 # ps -mLF --pid 1307 UID PID PPID LWP C NLWP SZ RSS PSR STIME TTY TIME CMD attesta+ 1307 1 - 0 4 0 0 - 16:34 ? 00:00:01 [attestationd] <defunct> attesta+ - - 1307 0 - - - 2 16:34 - 00:00:00 - attesta+ - - 1310 0 - - - 1 16:34 - 00:00:00 - attesta+ - - 1312 0 - - - 1 16:34 - 00:00:01 - attesta+ - - 1327 0 - - - 3 16:34 - 00:00:00 - # cat /proc/1307/task/**/syscall 24 0x781af16ab068 0x2b 0x7ffd397db220 0x58b101bd6010 0x0 0x0 0x7ffd397db028 0x781af0c93347 202 0x781aefd9c430 0x80 0x0 0x0 0x0 0x59d73a 0x781aefd9c2b0 0x781af126fa47 232 0x10 0x781ae0002000 0x20 0x950 0x0 0x0 0x781aef59baf0 0x781af0cacf3e 23 0x17 0x781aeed9a788 0x0 0x0 0x0 0x0 0x781aeed9a650 0x781af0ca3ea6 # cat /proc/1307/task/1307/stack [<ffffffffbad5a29d>] do_exit+0x411/0x80f [<ffffffffbad7cf10>] seccomp_phase1+0x345/0x34a [<ffffffffbaca47a1>] syscall_trace_enter_phase1+0xa7/0x123 [<ffffffffbb3ecbb0>] tracesys+0xd/0x44 [<ffffffffffffffff>] 0xffffffffffffffff # cat /proc/1307/task/1310/stack [<ffffffffbad7b3d5>] futex_wait_queue_me+0x89/0xdf [<ffffffffbad79f96>] futex_wait+0x2a9/0x7f1 [<ffffffffbad7aa85>] SyS_futex+0x5a7/0x951 [<ffffffffbb3ecb23>] entry_SYSCALL_64_fastpath+0x31/0xab [<ffffffffffffffff>] 0xffffffffffffffff # cat /proc/1307/task/1312/stack [<ffffffffbae97b83>] SYSC_epoll_wait+0x2fa/0x6dd [<ffffffffbae97887>] SyS_epoll_wait+0xe/0x10 [<ffffffffbb3ecb23>] entry_SYSCALL_64_fastpath+0x31/0xab [<ffffffffffffffff>] 0xffffffffffffffff # cat /proc/1307/task/1327/stack [<ffffffffbae8f48b>] poll_schedule_timeout+0x51/0x78 [<ffffffffbae8fcc5>] do_select+0x813/0x8ce [<ffffffffbae8ff4c>] core_sys_select+0x1cc/0x29d [<ffffffffbaddd1de>] SyS_select+0x65/0xd0 [<ffffffffbb3ecb23>] entry_SYSCALL_64_fastpath+0x31/0xab [<ffffffffffffffff>] 0xffffffffffffffff I wonder if syscall_trace_enter_phase1 -> seccomp_phase1 -> do_exit indicates that the process was killed by seccomp filter. Of the 3 current syscalls, - futex and epoll_wait are present in platform2/attestation/server/attestationd-seccomp-x86.policy - select is not in attestationd-seccomp-x86.policy, though, it is in attestationd-seccomp-amd64.policy, and _newselect is in attestationd-seccomp-arm.policy and also in x86 (where it has no meaning iiuc, it is an arm-only syscall). So, looks like we are again hit by the results of libc uprev (or something similar), and select is now used on x86. Adding it there.
,
Dec 1
Again, unf, we don't hit all branches on CQ tests (and I doubt we ever could), and thus such seccomp filter bugs slip into production :/
,
Dec 1
Side notes: 1) Aren't we supposed to get some indication in logs from seccomp logic on why the daemon was killed? 90% of the time we deal with logs collected from systems that occasionally has issues. And even if we formed a theory that it is the seccomp filter issue (we actually suspected it), we won't be able to learn which exact syscall is not whitelisted w/o access to the live system. 2) Why was attestationd left as zombie as the result, as opposed to dying and respawning? And can this be the reason that there was nothing in the logs? Are we waiting on the threads that never receive the signal (is it how it works?) and thus keep running?
,
Dec 1
Spawned a separate issue 910811 for fixing attestationd seccomp filter. This bug can remain focused on improving tast reporting.
,
Dec 1
,
Dec 5
Andrey this (improving tast reporting) is not 'ready'?
,
Dec 5
#12: I think that label was assigned accidentally. Andrey was just deleting it.
,
Dec 5
> Andrey this (improving tast reporting) is not 'ready'? Yes, I set Cros-Hwsec-Ready here to address the actual underlying issue, which then was extracted to a separate issue 910811 . Removed after that. Was not considering the 'report better errors' part for our cros-hwsec 'ready' queue, since derat@ already owned it. But, of course, I'm also fine with cros-hwsec implementing it instead, if Daniel is not working on it.
,
Dec 5
I'm planning to do it. In-progress change is at https://crrev.com/c/1357680.
,
Jan 7
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/88a36afed15c1f2d17a3947afc6381d1a942feb0 commit 88a36afed15c1f2d17a3947afc6381d1a942feb0 Author: Daniel Erat <derat@chromium.org> Date: Mon Jan 07 20:07:57 2019 tast-tests: Make chrome.New check that cryptohomed works. Add a new cryptohome.CheckService function that chrome.New calls early on to check that cryptohomed is running and that its D-Bus service is registered. cryptohome.CheckDeps can be called to check other daemons that can cause issues (attestationd, tpm_managerd, trunksd, and chapsd). This is intended to improve the error messages that are reported in situations where login would fail later anyway due to pre-existing issues on the DUT. BUG= chromium:910407 TEST=manual: hacked the code to always report failure and then ran ui.ChromeLogin against betty (fake TPM), caroline (TPM 1.2), and eve (TPM 2.0) DUTs and verified that errors were reasonable Change-Id: I415151d07303d87990c80ddd8bc9ca9fa3bb88fd Reviewed-on: https://chromium-review.googlesource.com/1357680 Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com> Tested-by: Dan Erat <derat@chromium.org> Reviewed-by: Shuhei Takahashi <nya@chromium.org> [modify] https://crrev.com/88a36afed15c1f2d17a3947afc6381d1a942feb0/src/chromiumos/tast/local/cryptohome/cryptohome.go [modify] https://crrev.com/88a36afed15c1f2d17a3947afc6381d1a942feb0/src/chromiumos/tast/local/chrome/chrome.go
,
Jan 8
|
|||||||
►
Sign in to add a comment |
|||||||
Comment 1 by apronin@google.com
, Nov 30Components: -OS>Kernel>TPM