PFQ - Android did not boot because of arc-networkd error
Reported by
jrbarnette@chromium.org,
May 30 2018
|
|||||||
Issue description
This CQ slave failed:
https://luci-milo.appspot.com/buildbot/chromeos/quawks-paladin/3090
The failure was this test:
cheets_StartAndroid.stress: FAIL: Android did not boot!
The blamelist for the paladin was several dozen CLs; I've no idea
how to zoom in on a bad CL culprit. It's not even clear that there
_is_ a culprit: This could be a bug in the tree.
So for now, here's what's happening:
* This bug is a placeholder for any debug necessary.
* The CQ will be annotated to say "Bug in ToT", listing this
bug, and then finalized.
* All of the potential culprit CLs will get another trip to the
well.
* If one of those CLs caused the problem, that CQ run will fail,
too.
Assigning this to the Asia ARC++ constable. Here's what he should
do with it:
* If the bug doesn't occur again, it's a bug in the tree. Since
it will have occurred only once, he can mark it as low priority.
* If the bug does occur again, he'll have to figure out whether
the bug is in the tree, or a bad CL.
,
May 30 2018
quawks-paladin/3092 corresponding to master-paladin/18737 is green. https://luci-milo.appspot.com/buildbot/chromeos/quawks-paladin/3092 In 3090, arc-networkd is crashing due to read(2) error. So, I guess there is network infra issue? Assigned back to jrbarnette@ to infra check.
,
May 30 2018
> In 3090, arc-networkd is crashing due to read(2) error. > So, I guess there is network infra issue? There were no reported network issues in the lab. In any event, failures in the external network environment shouldn't cause Android not to boot. That said, this problem has only happened the one time, so it looks like it could fall into the "low priority due to infrequency" bucket. However, since the failure, quawks has only had two test runs, so the track record is merely "failed 1 time in 3". That's not enough for a definitive diagnosis, so we should hold this open.
,
May 30 2018
,
May 31 2018
Re-title and bump up priority since the arc-networkd error now happening in PFQ for boards (and first failed build): - betty-chrome-pfq https://luci-milo.appspot.com/buildbot/chromeos/betty-chrome-pfq/1805 - caroline-arcnext-chrome-pfq https://luci-milo.appspot.com/buildbot/chromeos/caroline-arcnext-chrome-pfq/439 - cyan-chrome-pfq https://luci-milo.appspot.com/buildbot/chromeos/cyan-chrome-pfq/3292 - kevin-arcnext-chrome-pfq https://luci-milo.appspot.com/buildbot/chromeos/kevin-arcnext-chrome-pfq/440 - reef-chrome-pfq https://luci-milo.appspot.com/buildbot/chromeos/reef-chrome-pfq/2598 - veyron_minnie-chrome-pfq https://luci-milo.appspot.com/buildbot/chromeos/veyron_minnie-chrome-pfq/3294 Various cheets_CTS test fails because of it. Some of them have arc-networkd symbols and the crash stack looks like this: ==== Operating system: Linux 0.0.0 Linux 3.18.0-17847-ga95d64571fba #1 SMP PREEMPT Wed May 30 10:11:38 PDT 2018 x86_64 CPU: amd64 family 6 model 76 stepping 3 4 CPUs GPU: UNKNOWN Crash reason: SIGABRT Crash address: 0x0 Process uptime: not available Thread 0 (crashed) 0 libc-2.23.so!raise [raise.c : 54 + 0x10] rax = 0x0000000000000000 rdx = 0x0000000000000006 rcx = 0xffffffffffffffff rbx = 0x0000000000000000 rsi = 0x0000000000001adb rdi = 0x0000000000001adb rbp = 0x00007ffd8028c7d0 rsp = 0x00007ffd8028c6a8 r8 = 0x00007ffd8028c6f1 r9 = 0x00007ffd8028c6e7 r10 = 0x0000000000000008 r11 = 0x0000000000000206 r12 = 0x00007ffd8028cc68 r13 = 0x00007ffd8028c808 r14 = 0x00007ffd8028cc78 r15 = 0x00007ffd8028cc70 rip = 0x0000748e8ba21dd2 Found by: given as instruction pointer in context 1 libc-2.23.so!abort [abort.c : 89 + 0xa] rbx = 0x0000000000000000 rbp = 0x00007ffd8028c7d0 rsp = 0x00007ffd8028c6b0 r12 = 0x00007ffd8028cc68 r13 = 0x00007ffd8028c808 r14 = 0x00007ffd8028cc78 r15 = 0x00007ffd8028cc70 rip = 0x0000748e8ba23bf6 Found by: call frame info 2 libbase-core-395517.so!base::debug::BreakDebugger() [debugger_posix.cc : 219 + 0x5] rbx = 0x0000000000000000 rbp = 0x00007ffd8028c7e0 rsp = 0x00007ffd8028c7e0 r12 = 0x00007ffd8028cc68 r13 = 0x00007ffd8028c808 r14 = 0x00007ffd8028cc78 r15 = 0x00007ffd8028cc70 rip = 0x0000748e8c23be41 Found by: call frame info 3 libbase-core-395517.so!logging::LogMessage::~LogMessage() [logging.cc : 755 + 0x5] rbx = 0x0000000000000000 rbp = 0x00007ffd8028cc50 rsp = 0x00007ffd8028c7f0 r12 = 0x00007ffd8028cc68 r13 = 0x00007ffd8028c808 r14 = 0x00007ffd8028cc78 r15 = 0x00007ffd8028cc70 rip = 0x0000748e8c25f754 Found by: call frame info 4 arc-networkd!arc_networkd::IpHelper::OnFileCanReadWithoutBlocking(int) [ip_helper.cc : 88 + 0x26] rbx = 0x00007ffd8028cc68 rbp = 0x00007ffd8028d1b0 rsp = 0x00007ffd8028cc60 r12 = 0x00007ffd8028dae0 r13 = 0x0000630b8e1b2270 r14 = 0x00007ffd8028d8a0 r15 = 0x0000000000000002 rip = 0x0000630b8d96f3c6 Found by: call frame info 5 libbase-core-395517.so!base::MessagePumpLibevent::OnLibeventNotification(int, short, void*) [message_pump_libevent.cc : 102 + 0x8] rbx = 0x0000748e8c325240 rbp = 0x00007ffd8028d350 rsp = 0x00007ffd8028d1c0 r12 = 0x00007ffd8028dae0 r13 = 0x0000630b8e1b2270 r14 = 0x0000000000000004 r15 = 0x0000000000000002 rip = 0x0000748e8c26e55f Found by: call frame info 6 libevent_core-2.1.so.6.0.2!event_process_active_single_queue [event.c : 1580 + 0x9] rbx = 0x0000000000000082 rbp = 0x00007ffd8028d3e0 rsp = 0x00007ffd8028d360 r12 = 0x0000630b8e1b23c8 r13 = 0x0000630b8e1b2270 r14 = 0x0000630b8e1b2270 r15 = 0x0000630b8e1bd2b0 rip = 0x0000748e8b3b7dc9 Found by: call frame info 7 libevent_core-2.1.so.6.0.2!event_base_loop [event.c : 0 + 0x5] rbx = 0x0000000000000000 rbp = 0x00007ffd8028d480 rsp = 0x00007ffd8028d3f0 r12 = 0x0000630b8e1b2380 r13 = 0x0000000000000000 r14 = 0x0000630b8e1b2270 r15 = 0x0000000000000000 rip = 0x0000748e8b3b559e Found by: call frame info 8 libbase-core-395517.so!base::MessagePumpLibevent::Run(base::MessagePump::Delegate*) [message_pump_libevent.cc : 228 + 0xa] rbx = 0x0000630b8e1b21c0 rbp = 0x00007ffd8028d4f0 rsp = 0x00007ffd8028d490 r12 = 0x0000630b8e1b2208 r13 = 0x0000000000000000 r14 = 0x0000630b8e1b1f40 r15 = 0x0000000000000000 rip = 0x0000748e8c26e76e Found by: call frame info 9 libbase-core-395517.so!base::MessageLoop::RunHandler() [message_loop.cc : 442 + 0x9] rbx = 0x0000630b8e1b1f40 rbp = 0x00007ffd8028d650 rsp = 0x00007ffd8028d500 r12 = 0x00007ffd8028d728 r13 = 0x00007ffd8028dcd0 r14 = 0x00007ffd8028d668 r15 = 0x00007ffd8028d718 rip = 0x0000748e8c2684a1 Found by: call frame info 10 libbase-core-395517.so!base::RunLoop::Run() [run_loop.cc : 35 + 0x5] rbx = 0x00007ffd8028d6c0 rbp = 0x00007ffd8028d6b0 rsp = 0x00007ffd8028d660 r12 = 0x00007ffd8028d728 r13 = 0x00007ffd8028dcd0 r14 = 0x00007ffd8028d668 r15 = 0x00007ffd8028d718 rip = 0x0000748e8c29a3e0 Found by: call frame info 11 libbrillo-core-395517.so!brillo::BaseMessageLoop::Run() [base_message_loop.cc : 212 + 0x8] rbx = 0x00007ffd8028d918 rbp = 0x00007ffd8028d700 rsp = 0x00007ffd8028d6c0 r12 = 0x00007ffd8028d728 r13 = 0x00007ffd8028dcd0 r14 = 0x00007ffd8028d6c0 r15 = 0x00007ffd8028d718 rip = 0x0000748e8c3514c3 Found by: call frame info 12 libbrillo-core-395517.so!brillo::Daemon::Run() [daemon.cc : 31 + 0x8] rbx = 0x00007ffd8028d918 rbp = 0x00007ffd8028d770 rsp = 0x00007ffd8028d710 r12 = 0x00007ffd8028d728 r13 = 0x00007ffd8028dcd0 r14 = 0x00007ffd8028d8a0 r15 = 0x00007ffd8028d718 rip = 0x0000748e8c3473f4 Found by: call frame info 13 arc-networkd!main [main.cc : 49 + 0x5] rbx = 0x00007ffd8028d880 rbp = 0x00007ffd8028dbe0 rsp = 0x00007ffd8028d780 r12 = 0x0000630b8e1ace50 r13 = 0x00007ffd8028dcd0 r14 = 0x00007ffd8028dcd8 r15 = 0x0000000000000004 rip = 0x0000630b8d970ba3 Found by: call frame info 14 libc-2.23.so!__libc_start_main [libc-start.c : 289 + 0x1a] rbx = 0x0000000000000000 rbp = 0x00007ffd8028dcb0 rsp = 0x00007ffd8028dbf0 r12 = 0x0000630b8d996040 r13 = 0x00007ffd8028dcd0 r14 = 0x0000000000000000 r15 = 0x0000000000000000 rip = 0x0000748e8ba0e736 Found by: call frame info 15 arc-networkd!_start + 0x29 rbx = 0x0000000000000000 rbp = 0x0000000000000000 rsp = 0x00007ffd8028dcc0 r12 = 0x0000630b8d9627e0 r13 = 0x00007ffd8028dcd0 r14 = 0x0000000000000000 r15 = 0x0000000000000000 rip = 0x0000630b8d962809 Found by: call frame info 16 0x7ffd8028dcc8 rbx = 0x0000000000000000 rbp = 0x0000000000000000 rsp = 0x00007ffd8028dcc8 r12 = 0x0000630b8d9627e0 r13 = 0x00007ffd8028dcd0 r14 = 0x0000000000000000 r15 = 0x0000000000000000 rip = 0x00007ffd8028dcc8 Found by: call frame info 17 arc-networkd!_init + 0xd98 rsp = 0x00007ffd8028ddf8 rip = 0x0000630b8d9627e0 Found by: stack scanning ==== This is a CHECK in ip_helper.cc. https://cs.corp.google.com/chromeos_public/src/platform2/arc/network/ip_helper.cc?rcl=0bfdcd5675ee83f8b929226730d7f1c35bcd9bdb&l=88 The code is about 2-year old (+cernekee@). Not sure why it starts to fail. Anyone could see whether the CLs in those first failed builds suspicious?
,
May 31 2018
,
May 31 2018
The arc-networkd crash might be a side effect of android stop running. Looked at the cyan logs: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/204194385-chromeos-test/chromeos4-row11-rack11-host1/cheets_GTS.GtsPlacementTestCases/sysinfo/var/log I saw that Android reboots due to arcbridgeservice crash. [ 482.449293] <3>init: critical process 'arcbridgeservice' exited 4 times in 4 minutes; rebooting into recovery mode 2018-05-30 23:49:26.889208698-07:00: Post-stop arc-kmsg-logger And chrome side log: [5996:5996:0530/234754.863267:VERBOSE1:arc_session_runner.cc(151)] Session start requested: FULL_INSTANCE [5996:5996:0530/234754.863303:VERBOSE1:arc_session_runner.cc(224)] Starting ARC instance [5996:6015:0530/234755.358887:ERROR:service_manager_context.cc(247)] Attempting to run unsupported native service: /opt/google/chrome/chrome_renderer.service [5996:6015:0530/234755.359419:ERROR:service_manager_context.cc(247)] Attempting to run unsupported native service: /opt/google/chrome/chrome_renderer.service [5996:6015:0530/234755.375156:ERROR:service_manager_context.cc(247)] Attempting to run unsupported native service: /opt/google/chrome/chrome_renderer.service [5996:6015:0530/234755.375782:ERROR:service_manager_context.cc(247)] Attempting to run unsupported native service: /opt/google/chrome/chrome_renderer.service [5996:5996:0530/234755.567235:WARNING:memory_kills_monitor.cc(174)] Monitoring has been started [5996:5996:0530/234756.341093:INFO:arc_session_impl.cc(417)] ARC ready. [5996:5996:0530/234756.342425:VERBOSE1:arc_bridge_host_impl.cc(264)] Mojo connection lost [5996:5996:0530/234816.509476:VERBOSE1:arc_session_impl.cc(489)] Notified that ARC instance is stopped 0 [5996:5996:0530/234816.509593:VERBOSE1:arc_bridge_host_impl.cc(264)] Mojo connection lost [5996:5996:0530/234816.509634:INFO:arc_session_runner.cc(254)] ARC stopped: CRASH [5996:5996:0530/234816.509677:INFO:arc_session_runner.cc(281)] ARC restarting [5996:5996:0530/234816.509716:VERBOSE1:arc_session_manager.cc(195)] ARC session is stopped, but being restarted: CRASH [5996:5996:0530/234821.510705:INFO:arc_session_runner.cc(240)] Restarting ARC instance [5996:5996:0530/234821.510824:VERBOSE1:arc_session_runner.cc(224)] Starting ARC instance [5996:5996:0530/234824.246657:INFO:arc_session_impl.cc(417)] ARC ready. [5996:5996:0530/234824.247488:VERBOSE1:arc_bridge_host_impl.cc(264)] Mojo connection lost [5996:5996:0530/234840.346583:VERBOSE1:arc_session_impl.cc(489)] Notified that ARC instance is stopped 0 [5996:5996:0530/234840.346724:VERBOSE1:arc_bridge_host_impl.cc(264)] Mojo connection lost [5996:5996:0530/234840.346779:INFO:arc_session_runner.cc(254)] ARC stopped: CRASH [5996:5996:0530/234840.346833:INFO:arc_session_runner.cc(281)] ARC restarting [5996:5996:0530/234840.346890:VERBOSE1:arc_session_manager.cc(195)] ARC session is stopped, but being restarted: CRASH [5996:5996:0530/234845.347996:INFO:arc_session_runner.cc(240)] Restarting ARC instance [5996:5996:0530/234845.348251:VERBOSE1:arc_session_runner.cc(224)] Starting ARC instance [5996:5996:0530/234848.231426:INFO:arc_session_impl.cc(417)] ARC ready. [5996:5996:0530/234848.236535:VERBOSE1:arc_bridge_host_impl.cc(264)] Mojo connection lost [5996:5996:0530/234904.170941:VERBOSE1:arc_session_impl.cc(489)] Notified that ARC instance is stopped 0 [5996:5996:0530/234904.171055:VERBOSE1:arc_bridge_host_impl.cc(264)] Mojo connection lost [5996:5996:0530/234904.171093:INFO:arc_session_runner.cc(254)] ARC stopped: CRASH [5996:5996:0530/234904.171133:INFO:arc_session_runner.cc(281)] ARC restarting [5996:5996:0530/234904.171174:VERBOSE1:arc_session_manager.cc(195)] ARC session is stopped, but being restarted: CRASH [5996:5996:0530/234909.172366:INFO:arc_session_runner.cc(240)] Restarting ARC instance [5996:5996:0530/234909.172708:VERBOSE1:arc_session_runner.cc(224)] Starting ARC instance [5996:5996:0530/234912.060216:INFO:arc_session_impl.cc(417)] ARC ready. [5996:5996:0530/234912.060346:VERBOSE1:arc_bridge_host_impl.cc(264)] Mojo connection lost [5996:5996:0530/234927.165445:VERBOSE1:arc_session_impl.cc(489)] Notified that ARC instance is stopped 0 [5996:5996:0530/234927.165543:VERBOSE1:arc_bridge_host_impl.cc(264)] Mojo connection lost [5996:5996:0530/234927.165610:INFO:arc_session_runner.cc(254)] ARC stopped: CRASH [5996:5996:0530/234927.165651:INFO:arc_session_runner.cc(281)] ARC restarting [5996:5996:0530/234927.165694:VERBOSE1:arc_session_manager.cc(195)] ARC session is stopped, but being restarted: CRASH But I could not find logcat logs so did not know why arcbridgeservice is crashing.
,
May 31 2018
There is a few tests that did have logcat. e.g. In cyan-chrome-pfq 3292, cheets_CameraOrientation https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/204194501-chromeos-test/chromeos2-row6-rack2-host1/cheets_CameraOrientation/sysinfo/var/log/ Sample lines about arcbridgeservice failure: 05-31 07:04:21.916 45 45 I ArcBridgeService: Starting ARC bridge service ... 05-31 07:04:21.979 45 71 E ArcBridgeService: Invalid initial handshake. ... 05-31 07:04:27.236 191 191 I ArcBridgeService: Starting ARC bridge service 05-31 07:04:27.238 191 193 E mojoprocesssupport: [0531/070427.238065:ERROR:MojoProcessSupport.cc(104)] connect /var/run/chrome/arc_bridge.sock: Connection refused 05-31 07:04:27.238 191 193 E ArcMojoProcessSupport: Failed to connect to /var/run/chrome/arc_bridge.sock ==== Similar error found in https://luci-milo.appspot.com/buildbot/chromeos/caroline-arcnext-chrome-pfq/439 cheets_ContainerSmokeTest_P ==== 05-31 06:47:45.788 1332 1332 I ArcBridgeService: Starting ARC bridge service 05-31 06:47:45.789 791 1333 W AsyncOpDispatcher: GCoreClientInfo is not available, or the caller is GMS Core. 05-31 06:47:45.791 1332 1335 E mojoprocesssupport: [0531/064745.791272:ERROR:MojoProcessSupport.cc(104)] connect /var/run/chrome/arc_bridge.sock: Connection refused 05-31 06:47:45.791 1332 1335 E ArcMojoProcessSupport: Failed to connect to /var/run/chrome/arc_bridge.sock ==== Somehow, ArcBridgeService failed to make mojo connection back to chrome.
,
May 31 2018
cmtm@, could you help to find an owner for this? Otherwise, we might not have PFQ uprev tonight.
,
May 31 2018
I'm looking into it now. Will update when I find something.
,
Jun 1 2018
This should be fixed with https://chromium-review.googlesource.com/c/chromium/src/+/1080006
,
Jun 2 2018
|
|||||||
►
Sign in to add a comment |
|||||||
Comment 1 by jrbarnette@chromium.org
, May 30 2018It is done. More information about what to do/expect. First, this CQ run shouldn't have any CLs in common with the original failure: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/18736 If that run fails, the bug is in the tree. Second, the build that follows 18736 (nominally, 18737) should have most/all of the potential culprit CLs. If that run passes, the bug is in the tree. Finally, if the 18737 run fails, the problem is probably a bad CL. But, it could still be a bug in the tree. So, stay sharp!