New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 910407 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Jan 8
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Feature



Sign in to add a comment

Report better errors in Tast tests when cryptohomed isn't running

Project Member Reported by derat@chromium.org, Nov 29

Issue description

In http://stainless/search?view=list&first_date=2018-11-16&last_date=2018-11-29&branch=%5Emaster%24&test=%5Etast%5C..*%5C.&status=FAIL&reason=failed+to+remove+cryptohome&exclude_cts=false&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=false, I see a bunch of tests failing with "failed to remove cryptohome: exit status 1".

In the cases that I looked at, this is usually happening because cryptohomed isn't actually running -- I typically see messages like this in the test log:

2018/11/29 01:16:14 [01:16:14.104] Removing cryptohome for testuser@gmail.com
2018/11/29 01:16:14 [01:16:14.124] External command failed: exit status 1
2018/11/29 01:16:14 [01:16:14.124] Command: cryptohome --action=remove --force --user=testuser@gmail.com
2018/11/29 01:16:14 [01:16:14.124] Uncaptured output:
Remove call failed: The name org.chromium.Cryptohome was not provided by any .service files.
2018/11/29 01:16:14 [01:16:14.124] Error at apptest.go:23: Failed to connect to Chrome: failed to remove cryptohome: exit status 1

I'd like to update chrome.New to test that cryptohomed is available before trying to log in. We can then return a better error message, e.g. "cryptohomed not running".

Going further, it'd be nice to try to diagnose why cryptohomed isn't running. In the cases that I've looked at in the past (with much help from apronin@), the underlying problem is typically related to attestationd, tpm_managerd, chapsd, etc. not running. Maybe we could additionally check the status of some or all of those processes.
 
Cc: louiscollard@chromium.org
Components: -OS>Kernel>TPM
Cc: -apronin@chromium.org garryxiao@chromium.org
Labels: Cros-Hwsec-Ready
Thanks for reporting!
Cc: -kerrnel@chromium.org apronin@chromium.org
Cc: kerrnel@chromium.org
..wow I am failing at this. 
In known cases, e.g. https://stainless.corp.google.com/browse/chromeos-autotest-results/262444554-chromeos-test/, the error is caused by cryptohomed crashing, which in turn is caused by "The name org.chromium.Attestation was not provided by any .service files" dbus error.
That likely means that attestationd is not running. But, unfortunately, in all cases the logfiles are trimmed, so don't contain the point when something happens to attestationd or other daemons. Even walking through other jobs on the same host doesn't help. The jobs are either from before the problem starts, or contain already heavily trimmed logs (due to cryptohomed constantly restarting).

E.g. for https://stainless.corp.google.com/browse/chromeos-autotest-results/261807539-chromeos-test/. Judging by cryptohomed crashdumps, the problems start before 2018/11/28 00:00:53 -08:00.

Looking at the jobs on chromeos6-row3-rack6-host15 (http://cautotest.corp.google.com/afe/#tab_id=view_host&object_id=7729).

Before pyro-release/R72-11303.0.0/bvt-perbuild/tast.informational (started 2018-11-28 00:20:11) we have


1) cheets_StartAndroid.stress.1 - cryptohomed already crashing, logs trimmed
Job	261807419	chromeos-test	pyro-release/R72-11303.0.0/bvt-perbuild/cheets_StartAndroid.stress.1	2018-11-27 23:14:11	Aborted
Reset	261807419	chromeos-test	pyro-release/R72-11303.0.0/bvt-perbuild/cheets_StartAndroid.stress.1	2018-11-27 23:13:35	Completed

2) generic_RebootTest (http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=261807418) - cryptohomed already crashing, logs trimmed
Job	261807418	chromeos-test	pyro-release/R72-11303.0.0/bvt-cq/generic_RebootTest	2018-11-27 23:07:31	Completed
Reset	261807418	chromeos-test	pyro-release/R72-11303.0.0/bvt-cq/generic_RebootTest	2018-11-27 23:06:55	Completed

In logs:
2018-11-28T06:08:54.396284+00:00 WARNING kernel: [    0.129904] wrmsrl_safe (MSR_IA32_FEATURE_CONTROL, 00000007) failed error -5
2018-11-28T06:08:54.396284+00:00 WARNING kernel: [    0.145884]  #3<6>[    0.146158] Enabling VMX

========================================================================
  < 3972179 > characters are trimmed here.
========================================================================

3.405986-08:00 INFO sshd[3017]: Accepted publickey for root from 127.0.0.1 port 34244 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU
2018-11-27T23:09:43.418852-08:00 INFO cryptohomed[3003]: Protocol version: 0
2018-11-27T23:09:43.418933-08:00 INFO cryptohomed[3003]: Pinweaver IsSupported: 0

(Reset job doesn't capture any logs: https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row3-rack6-host15/1804512-reset/)


3) hardware_Memtester.quick (http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=261807416) - everything is still fine.
Job	261807416	chromeos-test	pyro-release/R72-11303.0.0/bvt-cq/hardware_Memtester.quick	2018-11-27 23:05:43	Completed
Reset	261807416	chromeos-test	pyro-release/R72-11303.0.0/bvt-cq/hardware_Memtester.quick	2018-11-27 23:05:07	Completed

Last words from attestationd:
2018-11-28T06:08:56.042983+00:00 INFO attestationd[1228]: Attestation Daemon Started.
2018-11-28T06:08:56.043580+00:00 INFO trunksd[1173]: CheckState: TPM State verified.
2018-11-28T06:08:56.043985+00:00 WARNING attestationd[1228]: libminijail[1228]: non-debug build: ignoring request to enable seccomp logging
2018-11-28T06:08:56.047127+00:00 INFO attestationd[1228]: Attestation service started.
2018-11-28T06:08:56.059054+00:00 INFO attestationd[1228]: Loading attestation database.

cryptohomed is still running here:
2018-11-28T07:05:18.893436+00:00 INFO cryptohomed[1304]: Protocol version: 0
2018-11-28T07:05:18.893504+00:00 INFO cryptohomed[1304]: Pinweaver IsSupported: 0

Last timestamp captured in syslog for this test:
2018-11-28T07:06:20.117690+00:00 NOTICE ag[29999]: autotest server[stack::execute_section|_execute_daemon|run] -> ssh_run(/usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-AVdYI8 0 0)


So, something has happened on boot between 23:06:20 and 23:09:43 (-08:00), but we have no logs for it. TZ has changed.
Similar picture for other test runs with this issue.
E.g. for failed nami run (http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=262099801), the issues started between cheets_StartAndroid and autoupdate_ForcedOBEUpdate

1) autoupdate_ForcedOBEUpdate - cryptohomed crashing
Job	262099461	chromeos-test	nami-release/R72-11312.0.0/bvt-perbuild/autoupdate_ForcedOOBEUpdate.delta	2018-11-28 23:18:51	Completed
Reset	262099461	chromeos-test	nami-release/R72-11312.0.0/bvt-perbuild/autoupdate_ForcedOOBEUpdate.delta	2018-11-28 23:18:13	Completed

2018-11-29T07:20:33.145621+00:00 WARNING kernel: [    0.079892] can not enable VMX on CPU1 (already locked)
2018-11-29T07:20:33.145625+00:00 INFO kernel: [    0.084401] NMI watchdog: enabled on all CPUs

========================================================================
  < 464957 > characters are trimmed here.
========================================================================

ho 'no boot_id available'; fi)
2018-11-28T23:23:57.016640-08:00 ERR chapsd[1257]: TPM2Utility cannot be ready if the TPM is not owned.
2018-11-28T23:23:57.022884-08:00 ERR cryptohomed[3909]: AddDBusError(...): Domain=dbus, Code=org.freedesktop.DBus.Error.ServiceUnknown, Message=The name org.chromium.Attestation was not provided by any .service files
2018-11-28T23:23:57.023382-08:00 CRIT cryptohomed[3909]: SetSystemSalt error: 2#012/usr/lib64/libbase-core-456626.so(_ZN4base5debug10StackTraceC1Em+0x1c) [0x7e03a7f9937c]#012

2) cheets_StartAndroid.stress
Job	262099368	chromeos-test	nami-release/R72-11312.0.0/bvt-perbuild/cheets_StartAndroid.stress.3	2018-11-28 23:10:47	Completed
Reset	262099368	chromeos-test	nami-release/R72-11312.0.0/bvt-perbuild/cheets_StartAndroid.stress.3	2018-11-28 23:10:06	Completed

Last words:
2018-11-29T07:11:16.895595+00:00 NOTICE ag[11135]: autotest server[stack::execute_section|_execute_daemon|run] -> ssh_run(nohup /usr/local/autotest/bin/autotestd /tmp/autoserv-NyNDe7 -H autoserv --verbose --hostname=chromeos2-row7-rack5-host11 --user=chromeos-test /usr/local/autotest/control.autoserv >/dev/null 2>/dev/null &)

Something happened between 23:11:16 and 23:23:57. No logs. TZ changed.

It'd be interesting indeed to know if attestationd is considered 'started'. And what's with other daemons: attestationd may be not running because trunksd is not running, for example. Interestingly, there are no signs of attestationd being constantly respawned.
Another case: something happened between 
1) desktopui_ExitOnSupervisedUserCrash (https://stainless.corp.google.com/browse/chromeos-autotest-results/261724365-chromeos-test/chromeos2-row7-rack5-host5/)
Everything still working fine at 14:53:50:
2018-11-27T14:52:57.400549-08:00 INFO cryptohomed[1507]: Protocol version: 0
2018-11-27T14:52:57.400584-08:00 INFO cryptohomed[1507]: Pinweaver IsSupported: 0
...
2018-11-27T14:53:50.075270-08:00 NOTICE ag[9214]: autotest server[stack::execute_section|_execute_daemon|run] -> ssh_run(nohup /usr/local/autotest/bin/autotestd /tmp/autoserv-2YKbiD -H autoserv --verbose --hostname=chromeos2-row7-rack5-host5 --user=chromeos-test /usr/local/autotest/control.autoserv >/dev/null 2>/dev/null &)

and
2) autoupdate_OmahaResponse.bad_metadata_size (https://stainless.corp.google.com/browse/chromeos-autotest-results/261724463-chromeos-test/chromeos2-row7-rack5-host5/)
cryptohomed in crashloop at 14:57:58:
2018-11-27T21:47:06.171893+00:00 INFO kernel: [    0.080387] Enabling VMX on cpu 1
2018-11-27T21:4

========================================================================
  < 3703908 > characters are trimmed here.
========================================================================

h notification for cryptohomed[3321] sig 6, user 0 group 0 (ignoring - crash origin is not ARC)
2018-11-27T14:57:58.211052-08:00 WARNING kernel: [   13.242984] init: cryptohomed main process (3321) killed by ABRT signal
2018-11-27T14:57:58.211059-08:00 WARNING kernel: [   13.242996] init: cryptohomed main process ended, respawning

This time also reboot, but no TZ change.
In all cases, the issues start after reboot. But the immediate tests before & after seem unrelated to the problem.
So, just a randomly popping issue with attestationd or trunksd on boot? Can be another syscall not whitelisted in seccomp filters after libc update, which happens occasionally.
Finally, thanks to mruthven@, got a repro from the faft test runner (eve, R72-11310.0.0). 'messages' is attached.

messages don't contain anything suspicious re attestationd, but attestationd process is zombie (with 4 threads), init still considers it as 'running'.

# status attestationd
attestationd start/running, process 1307

# ps -mLF --pid 1307
UID        PID  PPID   LWP  C NLWP    SZ   RSS PSR STIME TTY          TIME CMD
attesta+  1307     1     -  0    4     0     0   - 16:34 ?        00:00:01 [attestationd] <defunct>
attesta+     -     -  1307  0    -     -     -   2 16:34 -        00:00:00 -
attesta+     -     -  1310  0    -     -     -   1 16:34 -        00:00:00 -
attesta+     -     -  1312  0    -     -     -   1 16:34 -        00:00:01 -
attesta+     -     -  1327  0    -     -     -   3 16:34 -        00:00:00 -

# cat /proc/1307/task/**/syscall 
24 0x781af16ab068 0x2b 0x7ffd397db220 0x58b101bd6010 0x0 0x0 0x7ffd397db028 0x781af0c93347
202 0x781aefd9c430 0x80 0x0 0x0 0x0 0x59d73a 0x781aefd9c2b0 0x781af126fa47
232 0x10 0x781ae0002000 0x20 0x950 0x0 0x0 0x781aef59baf0 0x781af0cacf3e
23 0x17 0x781aeed9a788 0x0 0x0 0x0 0x0 0x781aeed9a650 0x781af0ca3ea6

# cat /proc/1307/task/1307/stack 
[<ffffffffbad5a29d>] do_exit+0x411/0x80f
[<ffffffffbad7cf10>] seccomp_phase1+0x345/0x34a
[<ffffffffbaca47a1>] syscall_trace_enter_phase1+0xa7/0x123
[<ffffffffbb3ecbb0>] tracesys+0xd/0x44
[<ffffffffffffffff>] 0xffffffffffffffff
# cat /proc/1307/task/1310/stack 
[<ffffffffbad7b3d5>] futex_wait_queue_me+0x89/0xdf
[<ffffffffbad79f96>] futex_wait+0x2a9/0x7f1
[<ffffffffbad7aa85>] SyS_futex+0x5a7/0x951
[<ffffffffbb3ecb23>] entry_SYSCALL_64_fastpath+0x31/0xab
[<ffffffffffffffff>] 0xffffffffffffffff
# cat /proc/1307/task/1312/stack 
[<ffffffffbae97b83>] SYSC_epoll_wait+0x2fa/0x6dd
[<ffffffffbae97887>] SyS_epoll_wait+0xe/0x10
[<ffffffffbb3ecb23>] entry_SYSCALL_64_fastpath+0x31/0xab
[<ffffffffffffffff>] 0xffffffffffffffff
# cat /proc/1307/task/1327/stack 
[<ffffffffbae8f48b>] poll_schedule_timeout+0x51/0x78
[<ffffffffbae8fcc5>] do_select+0x813/0x8ce
[<ffffffffbae8ff4c>] core_sys_select+0x1cc/0x29d
[<ffffffffbaddd1de>] SyS_select+0x65/0xd0
[<ffffffffbb3ecb23>] entry_SYSCALL_64_fastpath+0x31/0xab
[<ffffffffffffffff>] 0xffffffffffffffff

I wonder if syscall_trace_enter_phase1 -> seccomp_phase1 -> do_exit indicates that the process was killed by seccomp filter.
Of the 3 current syscalls, 
 - futex and epoll_wait are present in platform2/attestation/server/attestationd-seccomp-x86.policy
 - select is not in attestationd-seccomp-x86.policy, though, it is in attestationd-seccomp-amd64.policy, and _newselect is in attestationd-seccomp-arm.policy and also in x86 (where it has no meaning iiuc, it is an arm-only syscall).

So, looks like we are again hit by the results of libc uprev (or something similar), and select is now used on x86. Adding it there.

messages
6.2 MB View Download
Again, unf, we don't hit all branches on CQ tests (and I doubt we ever could), and thus such seccomp filter bugs slip into production :/
Side notes:

1) Aren't we supposed to get some indication in logs from seccomp logic on why the daemon was killed? 
90% of the time we deal with logs collected from systems that occasionally has issues. And even if we formed a theory that it is the seccomp filter issue (we actually suspected it), we won't be able to learn which exact syscall is not whitelisted w/o access to the live system.

2) Why was attestationd left as zombie as the result, as opposed to dying and respawning? And can this be the reason that there was nothing in the logs? Are we waiting on the threads that never receive the signal (is it how it works?) and thus keep running? 
Spawned a separate  issue 910811  for fixing attestationd seccomp filter.
This bug can remain focused on improving tast reporting.
Labels: -Cros-Hwsec-Ready
Andrey this (improving tast reporting) is not 'ready'?
Status: Started (was: Assigned)
#12: I think that label was assigned accidentally. Andrey was just deleting it.
> Andrey this (improving tast reporting) is not 'ready'?

Yes, I set Cros-Hwsec-Ready here to address the actual underlying issue, which then was extracted to a separate  issue 910811 . Removed after that.

Was not considering the 'report better errors' part for our cros-hwsec 'ready' queue, since derat@ already owned it. But, of course, I'm also fine with cros-hwsec implementing it instead, if Daniel is not working on it.
I'm planning to do it. In-progress change is at https://crrev.com/c/1357680.
Project Member

Comment 16 by bugdroid1@chromium.org, Jan 7

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/88a36afed15c1f2d17a3947afc6381d1a942feb0

commit 88a36afed15c1f2d17a3947afc6381d1a942feb0
Author: Daniel Erat <derat@chromium.org>
Date: Mon Jan 07 20:07:57 2019

tast-tests: Make chrome.New check that cryptohomed works.

Add a new cryptohome.CheckService function that chrome.New
calls early on to check that cryptohomed is running and that
its D-Bus service is registered.

cryptohome.CheckDeps can be called to check other daemons
that can cause issues (attestationd, tpm_managerd, trunksd,
and chapsd).

This is intended to improve the error messages that are
reported in situations where login would fail later anyway
due to pre-existing issues on the DUT.

BUG= chromium:910407 
TEST=manual: hacked the code to always report failure and
     then ran ui.ChromeLogin against betty (fake TPM),
     caroline (TPM 1.2), and eve (TPM 2.0) DUTs and verified
     that errors were reasonable

Change-Id: I415151d07303d87990c80ddd8bc9ca9fa3bb88fd
Reviewed-on: https://chromium-review.googlesource.com/1357680
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Dan Erat <derat@chromium.org>
Reviewed-by: Shuhei Takahashi <nya@chromium.org>

[modify] https://crrev.com/88a36afed15c1f2d17a3947afc6381d1a942feb0/src/chromiumos/tast/local/cryptohome/cryptohome.go
[modify] https://crrev.com/88a36afed15c1f2d17a3947afc6381d1a942feb0/src/chromiumos/tast/local/chrome/chrome.go

Status: Fixed (was: Started)

Sign in to add a comment