New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 733738 link

Starred by 1 user

Issue metadata

Status: Archived
Owner: ----
Closed: Dec 2017
Cc:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



Sign in to add a comment

Android fails to start (9th attempt) in cheets_StartAndroid.stress on cave

Reported by jrbarnette@chromium.org, Jun 15 2017

Issue description

This paladin run failed because of cheets_StartAndroid.stress:
    https://luci-milo.appspot.com/buildbot/chromeos/cave-paladin/589

Failure logs are here:
    https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/123265055-chromeos-test/chromeos2-row8-rack6-host11/

Looking at debug/client.0.DEBUG, this is the last thing
reported:
====
06/14 04:44:34.078 INFO |        arc_common:0043| Android has booted completely.
06/14 04:44:36.081 DEBUG|          arc_util:0041| ARC is enabled in mode enabled
06/14 04:44:36.081 INFO |          arc_util:0105| Saving Android dumpstate.
06/14 04:44:56.094 INFO |          arc_util:0125| Android dumpstate successfully saved.
06/14 04:44:56.123 DEBUG|    cros_interface:0363| ListProcesses(<predicate>)->[276 processes]
06/14 04:44:56.125 INFO |    cros_interface:0546| (Re)starting the ui (logs the user out)
06/14 04:44:56.140 DEBUG|    cros_interface:0439| IsServiceRunning(ui)->True
06/14 04:44:56.141 DEBUG|    cros_interface:0058| sh -c restart ui 
====

After that, all activity stopped, until the job was aborted due to
timeout, a little bit more than 1 hour later.

This problem has thus far been seen only once.

It's unknown whether this failure is related to bug 732434.

 
One additional piece of information:  This was a paladin run, but
the paladin blame list didn't include any CLs that would run on
the DUT during testing, so whatever problem caused the failure, it
wasn't a bad CL.

Comment 2 by ihf@chromium.org, Jun 15 2017

Cc: elijahtaylor@chromium.org dshi@chromium.org ihf@chromium.org kinaba@chromium.org
Summary: Android fails to start (9th attempt) in cheets_StartAndroid.stress on cave (was: Hang in cheets_StartAndroid.stress on cave)
If you look at /var/log/messages in
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/123265055-chromeos-test/chromeos2-row8-rack6-host11/crashinfo.chromeos2-row8-rack6-host11/var/log/

This isn't a reboot like the flaky caroline did. It just failed to start Android the 9th time (out of 10). The system stayed up until requested to reboot.
2017-06-14T11:35:52.815400+00:00 NOTICE root[5232]: autotest runtest cheets_StartAndroid
[...]
2017-06-14T04:44:56.648477-07:00 INFO session_manager[584]: [INFO:session_manager_impl.cc(1511)] Android Container with pid 1236 stopped
2017-06-14T04:44:56.648523-07:00 INFO session_manager[584]: [INFO:upstart_signal_emitter.cc(38)] Emitting stop-arc-instance Upstart signal
2017-06-14T04:44:56.659123-07:00 INFO session_manager[584]: [INFO:upstart_signal_emitter.cc(38)] Emitting stop-arc-network Upstart signal
2017-06-14T04:44:56.660384-07:00 INFO session_manager[584]: [INFO:session_manager_impl.cc(417)] emitting D-Bus signal SessionStateChanged:stopped
2017-06-14T04:44:56.660499-07:00 INFO session_manager[584]: [INFO:session_manager_service.cc(465)] SessionManagerService quitting run loop
2017-06-14T04:44:56.660808-07:00 ERR session_manager[584]: [ERROR:session_manager_impl.cc(955)] Error getting Android container pid.
2017-06-14T04:44:56.660838-07:00 ERR session_manager[584]: [ERROR:dbus_util.cc(14)] CreateError(...): Domain=dbus, Code=org.chromium.SessionManagerInterface.ContainerShutdownFail, Message=Error getting Android container pid.
2017-06-14T04:44:56.661002-07:00 INFO session_manager[584]: [INFO:session_manager_service.cc(207)] SessionManagerService exiting
2017-06-14T04:44:56.688003-07:00 INFO session_manager[584]: [INFO:policy_service.cc(154)] Persisted policy to disk.
2017-06-14T04:44:56.696009-07:00 WARNING kernel: [  468.990667] init: arc-camera main process ended, respawning
2017-06-14T04:44:56.690134-07:00 INFO session_manager[584]: [INFO:policy_service.cc(154)] Persisted policy to disk.
2017-06-14T04:44:56.721087-07:00 INFO minijail0[3420]: libminijail[3420]: not running init loop, exiting immediately
2017-06-14T04:44:56.738436-07:00 INFO arc_camera_service[3433]: Starting ARC camera service provider
2017-06-14T04:44:56.910001-07:00 INFO kernel: [  469.204627] br0: port 1(veth_android) entered disabled state
2017-06-14T04:44:56.915126-07:00 INFO kernel: [  469.209938] device veth_android left promiscuous mode
2017-06-14T04:44:56.915134-07:00 INFO kernel: [  469.210053] br0: port 1(veth_android) entered disabled state

[One hour wait here but DUT remains responsive.]

2017-06-14T05:39:13.623943-07:00 INFO periodic_scheduler[6873]: crash_sender: running /sbin/crash_sender
2017-06-14T05:39:13.624333-07:00 INFO periodic_scheduler[6874]: cros-machine-id-regen: running /usr/sbin/cros-machine-id-regen -r periodic -t 21600
2017-06-14T05:39:13.720538-07:00 NOTICE crash_sender[6899]: Exiting early due to test image.
2017-06-14T05:39:13.725846-07:00 NOTICE cros-machine-id-regen[6902]: Not regenerating since we did so 3722 seconds ago.
2017-06-14T05:39:13.726218-07:00 INFO periodic_scheduler[6903]: crash_sender: job completed
2017-06-14T05:39:13.728185-07:00 INFO periodic_scheduler[6905]: cros-machine-id-regen: job completed
2017-06-14T05:51:04.325355-07:00 INFO sshd[7616]: Accepted publickey for root from 127.0.0.1 port 55823 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU
2017-06-14T05:51:04.880870-07:00 NOTICE ag[7622]: autotest server[stack::_detect_host|check_host|wrapper] -> ssh_run(grep -q CHROMEOS /etc/lsb-release && ! test -f /mnt/stateful_partition/.android_tester && ! grep -q moblab /etc/lsb-release)
2017-06-14T05:51:04.892317-07:00 INFO sshd[7616]: Received disconnect from 127.0.0.1 port 55823:11: disconnected by user
2017-06-14T05:51:04.892383-07:00 INFO sshd[7616]: Disconnected from 127.0.0.1 port 55823
2017-06-14T05:51:07.469818-07:00 INFO sshd[7628]: Accepted publickey for root from 127.0.0.1 port 55824 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU
2017-06-14T05:51:07.936602-07:00 NOTICE ag[7632]: autotest server[stack::create_host|job_start|wrapper] -> ssh_run(test ! -e /var/log/messages || cp -f /var/log/messages /var/tmp/messages.autotest_start)
2017-06-14T05:51:08.533919-07:00 NOTICE ag[7635]: autotest server[stack::parallel_simple|get_network_stats|wrapper] -> ssh_run(route; echo SEPARATOR; cat /proc/net/dev)
2017-06-14T05:51:10.004878-07:00 INFO sshd[7640]: Accepted publickey for root from 127.0.0.1 port 55825 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU
2017-06-14T05:51:10.421072-07:00 NOTICE ag[7644]: autotest server[stack::_detect_host|check_host|wrapper] -> ssh_run(grep -q CHROMEOS /etc/lsb-release && ! test -f /mnt/stateful_partition/.android_tester && ! grep -q moblab /etc/lsb-release)
2017-06-14T05:51:10.432702-07:00 INFO sshd[7640]: Received disconnect from 127.0.0.1 port 55825:11: disconnected by user
2017-06-14T05:51:10.432751-07:00 INFO sshd[7640]: Disconnected from 127.0.0.1 port 55825
2017-06-14T05:51:12.977753-07:00 INFO sshd[7650]: Accepted publickey for root from 127.0.0.1 port 55826 ssh2: RSA SHA256:Fp1qWjFLyK1cTpiI5rdk7iEJwoK9lcnYAgbQtGC3jzU
2017-06-14T05:51:13.481205-07:00 NOTICE ag[7654]: autotest server[stack::_find_orphaned_crashdumps|list_files_glob|wrapper] -> ssh_run(python -c 'import cPickle, glob, sys;cPickle.dump(glob.glob(sys.argv[1]), sys.stdout, 0)')
2017-06-14T05:51:13.999981-07:00 NOTICE ag[7657]: autotest server[stack::is_up|ssh_ping|wrapper] -> ssh_run(true)
2017-06-14T05:51:14.388241-07:00 NOTICE ag[7660]: autotest server[stack::collect_log_file|_get_file_stats|wrapper] -> ssh_run(ls -ld /var/tmp/messages.autotest_start | cut -d" " -f5)
2017-06-14T05:51:14.838826-07:00 NOTICE ag[7664]: autotest server[stack::use_rsync|check_rsync|wrapper] -> ssh_run(rsync --version)
2017-06-14T05:51:15.755956-07:00 NOTICE ag[7669]: autotest server[stack::collect_log_file|_get_file_stats|wrapper] -> ssh_run(ls -ld /var/log/messages | cut -d" " -f5)
2017-06-14T05:51:16.624447-07:00 NOTICE ag[7675]: autotest server[stack::get_crashinfo|collect_command|wrapper] -> ssh_run(dmesg)
2017-06-14T05:51:17.085297-07:00 NOTICE ag[7678]: autotest server[stack::collect_log_file|_get_file_stats|wrapper] -> ssh_run(ls -ld /usr/local/autotest/results/default/ | cut -d" " -f5)
2017-06-14T05:51:18.085914-07:00 NOTICE ag[7684]: autotest server[stack::collect_log_file|_get_file_stats|wrapper] -> ssh_run(ls -ld /var/log | cut -d" " -f5)

I think these kind of failures are usually tracked on b/?
Hmmm...
====
2017-06-14T05:39:13.623943-07:00 INFO periodic_scheduler[6873]: crash_sender: running /sbin/crash_sender
2017-06-14T05:39:13.624333-07:00 INFO periodic_scheduler[6874]: cros-machine-id-regen: running /usr/sbin/cros-machine-id-regen -r periodic -t 21600
2017-06-14T05:39:13.720538-07:00 NOTICE crash_sender[6899]: Exiting early due to test image.
====

Isn't that saying there was a crash?  Is there anything else in the
logs?

Looking at the logs, it seems there was a crash, but I can't find
that any dump file was collected in the results.  Key point of interest
is the timing

04:44:56 - last message logged during chrome shutdown
[silence]
05:39:13 - crash_sender logs its event
[silence]
05:51:04 - logging resumes when the job abort triggers activity

Also, looking closely at the logs, I think the failure is more
in the vein of "failed to shut down successfully", rather than
"failed to start".  A successful restart includes these lines:

====
2017-06-14T04:40:53.219177-07:00 INFO kernel: [  225.514304] br0: port 1(veth_android) entered disabled state
2017-06-14T04:40:53.928990-07:00 INFO kernel: [  226.224267] cryptohomed (1593): drop_caches: 3
2017-06-14T04:40:53.936036-07:00 INFO kernel: [  226.230998] cryptohomed (1593): drop_caches: 3
2017-06-14T04:40:53.949484-07:00 WARNING chapsd[1193]: Unload Token event received for unknown path: /home/root/ee0012a3280e40b088968333deec442b8cb66a9d/chaps
2017-06-14T04:40:54.194824-07:00 ERR imageloader[12544]: Failed to read latest-version file.
2017-06-14T04:40:54.194894-07:00 ERR imageloader[12544]: Failed to verify and mount component.
2017-06-14T04:40:54.206134-07:00 WARNING kernel: [  226.500981] init: imageloader main process (12544) terminated with status 1
2017-06-14T04:40:54.495442-07:00 INFO session_manager[12674]: [INFO:session_manager_main.cc(195)] Will wait 3s for graceful browser exit.
2017-06-14T04:40:54.518523-07:00 INFO session_manager[12674]: [INFO:session_manager_service.cc(139)] SessionManagerService starting
====

The failure stops after the "br0: port 1(veth_android) entered disabled state",
and never reports the remaining terminations, or the messages about
waiting for the browser.  I assume those messages would be part of the
shutdown flow, which makes it seems like the failure happened during
shutdown.

The log offers some clue as to the timing:
    2017-06-14T04:40:53.219177-07:00 INFO kernel: [  225.514304] br0: port 1(veth_android) entered disabled state
    2017-06-14T04:40:53.928990-07:00 INFO kernel: [  226.224267] cryptohomed (1593): drop_caches: 3

In the failure, the "cryptohomed" message is never logged, so the
failure is after shutting down Android's network, and before whatever
triggers the cryptohomed message.

Comment 6 by ihf@chromium.org, Jun 16 2017

Pinged on b/38097492 to see if related.
Status: Archived (was: Untriaged)

Sign in to add a comment