Android fails to start (9th attempt) in cheets_StartAndroid.stress on cave
Reported by
jrbarnette@chromium.org,
Jun 15 2017
|
|||
Issue description
This paladin run failed because of cheets_StartAndroid.stress:
https://luci-milo.appspot.com/buildbot/chromeos/cave-paladin/589
Failure logs are here:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/123265055-chromeos-test/chromeos2-row8-rack6-host11/
Looking at debug/client.0.DEBUG, this is the last thing
reported:
====
06/14 04:44:34.078 INFO | arc_common:0043| Android has booted completely.
06/14 04:44:36.081 DEBUG| arc_util:0041| ARC is enabled in mode enabled
06/14 04:44:36.081 INFO | arc_util:0105| Saving Android dumpstate.
06/14 04:44:56.094 INFO | arc_util:0125| Android dumpstate successfully saved.
06/14 04:44:56.123 DEBUG| cros_interface:0363| ListProcesses(<predicate>)->[276 processes]
06/14 04:44:56.125 INFO | cros_interface:0546| (Re)starting the ui (logs the user out)
06/14 04:44:56.140 DEBUG| cros_interface:0439| IsServiceRunning(ui)->True
06/14 04:44:56.141 DEBUG| cros_interface:0058| sh -c restart ui
====
After that, all activity stopped, until the job was aborted due to
timeout, a little bit more than 1 hour later.
This problem has thus far been seen only once.
It's unknown whether this failure is related to bug 732434.
,
Jun 15 2017
If you look at /var/log/messages in https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/123265055-chromeos-test/chromeos2-row8-rack6-host11/crashinfo.chromeos2-row8-rack6-host11/var/log/ This isn't a reboot like the flaky caroline did. It just failed to start Android the 9th time (out of 10). The system stayed up until requested to reboot. 2017-06-14T11:35:52.815400+00:00 NOTICE root[5232]: autotest runtest cheets_StartAndroid [...] 2017-06-14T04:44:56.648477-07:00 INFO session_manager[584]: [INFO:session_manager_impl.cc(1511)] Android Container with pid 1236 stopped 2017-06-14T04:44:56.648523-07:00 INFO session_manager[584]: [INFO:upstart_signal_emitter.cc(38)] Emitting stop-arc-instance Upstart signal 2017-06-14T04:44:56.659123-07:00 INFO session_manager[584]: [INFO:upstart_signal_emitter.cc(38)] Emitting stop-arc-network Upstart signal 2017-06-14T04:44:56.660384-07:00 INFO session_manager[584]: [INFO:session_manager_impl.cc(417)] emitting D-Bus signal SessionStateChanged:stopped 2017-06-14T04:44:56.660499-07:00 INFO session_manager[584]: [INFO:session_manager_service.cc(465)] SessionManagerService quitting run loop 2017-06-14T04:44:56.660808-07:00 ERR session_manager[584]: [ERROR:session_manager_impl.cc(955)] Error getting Android container pid. 2017-06-14T04:44:56.660838-07:00 ERR session_manager[584]: [ERROR:dbus_util.cc(14)] CreateError(...): Domain=dbus, Code=org.chromium.SessionManagerInterface.ContainerShutdownFail, Message=Error getting Android container pid. 2017-06-14T04:44:56.661002-07:00 INFO session_manager[584]: [INFO:session_manager_service.cc(207)] SessionManagerService exiting 2017-06-14T04:44:56.688003-07:00 INFO session_manager[584]: [INFO:policy_service.cc(154)] Persisted policy to disk. 2017-06-14T04:44:56.696009-07:00 WARNING kernel: [ 468.990667] init: arc-camera main process ended, respawning 2017-06-14T04:44:56.690134-07:00 INFO session_manager[584]: [INFO:policy_service.cc(154)] Persisted policy to disk. 2017-06-14T04:44:56.721087-07:00 INFO minijail0[3420]: libminijail[3420]: not running init loop, exiting immediately 2017-06-14T04:44:56.738436-07:00 INFO arc_camera_service[3433]: Starting ARC camera service provider 2017-06-14T04:44:56.910001-07:00 INFO kernel: [ 469.204627] br0: port 1(veth_android) entered disabled state 2017-06-14T04:44:56.915126-07:00 INFO kernel: [ 469.209938] device veth_android left promiscuous mode 2017-06-14T04:44:56.915134-07:00 INFO kernel: [ 469.210053] br0: port 1(veth_android) entered disabled state [One hour wait here but DUT remains responsive.] 2017-06-14T05:39:13.623943-07:00 INFO periodic_scheduler[6873]: crash_sender: running /sbin/crash_sender 2017-06-14T05:39:13.624333-07:00 INFO periodic_scheduler[6874]: cros-machine-id-regen: running /usr/sbin/cros-machine-id-regen -r periodic -t 21600 2017-06-14T05:39:13.720538-07:00 NOTICE crash_sender[6899]: Exiting early due to test image. 2017-06-14T05:39:13.725846-07:00 NOTICE cros-machine-id-regen[6902]: Not regenerating since we did so 3722 seconds ago. 2017-06-14T05:39:13.726218-07:00 INFO periodic_scheduler[6903]: crash_sender: job completed 2017-06-14T05:39:13.728185-07:00 INFO periodic_scheduler[6905]: cros-machine-id-regen: job completed 2017-06-14T05:51:04.325355-07:00 INFO sshd[7616]: Accepted publickey for root from 127.0.0.1 port 55823 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU 2017-06-14T05:51:04.880870-07:00 NOTICE ag[7622]: autotest server[stack::_detect_host|check_host|wrapper] -> ssh_run(grep -q CHROMEOS /etc/lsb-release && ! test -f /mnt/stateful_partition/.android_tester && ! grep -q moblab /etc/lsb-release) 2017-06-14T05:51:04.892317-07:00 INFO sshd[7616]: Received disconnect from 127.0.0.1 port 55823:11: disconnected by user 2017-06-14T05:51:04.892383-07:00 INFO sshd[7616]: Disconnected from 127.0.0.1 port 55823 2017-06-14T05:51:07.469818-07:00 INFO sshd[7628]: Accepted publickey for root from 127.0.0.1 port 55824 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU 2017-06-14T05:51:07.936602-07:00 NOTICE ag[7632]: autotest server[stack::create_host|job_start|wrapper] -> ssh_run(test ! -e /var/log/messages || cp -f /var/log/messages /var/tmp/messages.autotest_start) 2017-06-14T05:51:08.533919-07:00 NOTICE ag[7635]: autotest server[stack::parallel_simple|get_network_stats|wrapper] -> ssh_run(route; echo SEPARATOR; cat /proc/net/dev) 2017-06-14T05:51:10.004878-07:00 INFO sshd[7640]: Accepted publickey for root from 127.0.0.1 port 55825 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU 2017-06-14T05:51:10.421072-07:00 NOTICE ag[7644]: autotest server[stack::_detect_host|check_host|wrapper] -> ssh_run(grep -q CHROMEOS /etc/lsb-release && ! test -f /mnt/stateful_partition/.android_tester && ! grep -q moblab /etc/lsb-release) 2017-06-14T05:51:10.432702-07:00 INFO sshd[7640]: Received disconnect from 127.0.0.1 port 55825:11: disconnected by user 2017-06-14T05:51:10.432751-07:00 INFO sshd[7640]: Disconnected from 127.0.0.1 port 55825 2017-06-14T05:51:12.977753-07:00 INFO sshd[7650]: Accepted publickey for root from 127.0.0.1 port 55826 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU 2017-06-14T05:51:13.481205-07:00 NOTICE ag[7654]: autotest server[stack::_find_orphaned_crashdumps|list_files_glob|wrapper] -> ssh_run(python -c 'import cPickle, glob, sys;cPickle.dump(glob.glob(sys.argv[1]), sys.stdout, 0)') 2017-06-14T05:51:13.999981-07:00 NOTICE ag[7657]: autotest server[stack::is_up|ssh_ping|wrapper] -> ssh_run(true) 2017-06-14T05:51:14.388241-07:00 NOTICE ag[7660]: autotest server[stack::collect_log_file|_get_file_stats|wrapper] -> ssh_run(ls -ld /var/tmp/messages.autotest_start | cut -d" " -f5) 2017-06-14T05:51:14.838826-07:00 NOTICE ag[7664]: autotest server[stack::use_rsync|check_rsync|wrapper] -> ssh_run(rsync --version) 2017-06-14T05:51:15.755956-07:00 NOTICE ag[7669]: autotest server[stack::collect_log_file|_get_file_stats|wrapper] -> ssh_run(ls -ld /var/log/messages | cut -d" " -f5) 2017-06-14T05:51:16.624447-07:00 NOTICE ag[7675]: autotest server[stack::get_crashinfo|collect_command|wrapper] -> ssh_run(dmesg) 2017-06-14T05:51:17.085297-07:00 NOTICE ag[7678]: autotest server[stack::collect_log_file|_get_file_stats|wrapper] -> ssh_run(ls -ld /usr/local/autotest/results/default/ | cut -d" " -f5) 2017-06-14T05:51:18.085914-07:00 NOTICE ag[7684]: autotest server[stack::collect_log_file|_get_file_stats|wrapper] -> ssh_run(ls -ld /var/log | cut -d" " -f5) I think these kind of failures are usually tracked on b/?
,
Jun 16 2017
Hmmm... ==== 2017-06-14T05:39:13.623943-07:00 INFO periodic_scheduler[6873]: crash_sender: running /sbin/crash_sender 2017-06-14T05:39:13.624333-07:00 INFO periodic_scheduler[6874]: cros-machine-id-regen: running /usr/sbin/cros-machine-id-regen -r periodic -t 21600 2017-06-14T05:39:13.720538-07:00 NOTICE crash_sender[6899]: Exiting early due to test image. ==== Isn't that saying there was a crash? Is there anything else in the logs?
,
Jun 16 2017
Looking at the logs, it seems there was a crash, but I can't find that any dump file was collected in the results. Key point of interest is the timing 04:44:56 - last message logged during chrome shutdown [silence] 05:39:13 - crash_sender logs its event [silence] 05:51:04 - logging resumes when the job abort triggers activity
,
Jun 16 2017
Also, looking closely at the logs, I think the failure is more in the vein of "failed to shut down successfully", rather than "failed to start". A successful restart includes these lines: ==== 2017-06-14T04:40:53.219177-07:00 INFO kernel: [ 225.514304] br0: port 1(veth_android) entered disabled state 2017-06-14T04:40:53.928990-07:00 INFO kernel: [ 226.224267] cryptohomed (1593): drop_caches: 3 2017-06-14T04:40:53.936036-07:00 INFO kernel: [ 226.230998] cryptohomed (1593): drop_caches: 3 2017-06-14T04:40:53.949484-07:00 WARNING chapsd[1193]: Unload Token event received for unknown path: /home/root/ee0012a3280e40b088968333deec442b8cb66a9d/chaps 2017-06-14T04:40:54.194824-07:00 ERR imageloader[12544]: Failed to read latest-version file. 2017-06-14T04:40:54.194894-07:00 ERR imageloader[12544]: Failed to verify and mount component. 2017-06-14T04:40:54.206134-07:00 WARNING kernel: [ 226.500981] init: imageloader main process (12544) terminated with status 1 2017-06-14T04:40:54.495442-07:00 INFO session_manager[12674]: [INFO:session_manager_main.cc(195)] Will wait 3s for graceful browser exit. 2017-06-14T04:40:54.518523-07:00 INFO session_manager[12674]: [INFO:session_manager_service.cc(139)] SessionManagerService starting ==== The failure stops after the "br0: port 1(veth_android) entered disabled state", and never reports the remaining terminations, or the messages about waiting for the browser. I assume those messages would be part of the shutdown flow, which makes it seems like the failure happened during shutdown. The log offers some clue as to the timing: 2017-06-14T04:40:53.219177-07:00 INFO kernel: [ 225.514304] br0: port 1(veth_android) entered disabled state 2017-06-14T04:40:53.928990-07:00 INFO kernel: [ 226.224267] cryptohomed (1593): drop_caches: 3 In the failure, the "cryptohomed" message is never logged, so the failure is after shutting down Android's network, and before whatever triggers the cryptohomed message.
,
Jun 16 2017
Pinged on b/38097492 to see if related.
,
Dec 14 2017
|
|||
►
Sign in to add a comment |
|||
Comment 1 by jrbarnette@chromium.org
, Jun 15 2017