New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 847695 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2018
Cc:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

PFQ - Android did not boot because of arc-networkd error

Reported by jrbarnette@chromium.org, May 30 2018

Issue description

This CQ slave failed:
    https://luci-milo.appspot.com/buildbot/chromeos/quawks-paladin/3090

The failure was this test:
    cheets_StartAndroid.stress: FAIL: Android did not boot!

The blamelist for the paladin was several dozen CLs; I've no idea
how to zoom in on a bad CL culprit.  It's not even clear that there
_is_ a culprit:  This could be a bug in the tree.

So for now, here's what's happening:
  * This bug is a placeholder for any debug necessary.
  * The CQ will be annotated to say "Bug in ToT", listing this
    bug, and then finalized.
  * All of the potential culprit CLs will get another trip to the
    well.
  * If one of those CLs caused the problem, that CQ run will fail,
    too.

Assigning this to the Asia ARC++ constable.  Here's what he should
do with it:
  * If the bug doesn't occur again, it's a bug in the tree.  Since
    it will have occurred only once, he can mark it as low priority.
  * If the bug does occur again, he'll have to figure out whether
    the bug is in the tree, or a bad CL.
 
It is done.

More information about what to do/expect.

First, this CQ run shouldn't have any CLs in common with the original
failure:
    https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/18736

If that run fails, the bug is in the tree.

Second, the build that follows 18736 (nominally, 18737) should have
most/all of the potential culprit CLs.  If that run passes, the bug
is in the tree.

Finally, if the 18737 run fails, the problem is probably a bad CL.
But, it could still be a bug in the tree.  So, stay sharp!

Cc: hidehiko@chromium.org phshah@chromium.org cmtm@chromium.org hirono@chromium.org
Owner: jrbarnette@chromium.org
quawks-paladin/3092 corresponding to master-paladin/18737 is green.
https://luci-milo.appspot.com/buildbot/chromeos/quawks-paladin/3092

In 3090, arc-networkd is crashing due to read(2) error.
So, I guess there is network infra issue?
Assigned back to jrbarnette@ to infra check.

Owner: cmtm@chromium.org
Status: Assigned (was: Untriaged)
Summary: quawks-paladin - Android did not boot because of arc-networkd error (was: quawks-paladin failed - Android did not boot!)
> In 3090, arc-networkd is crashing due to read(2) error.
> So, I guess there is network infra issue?

There were no reported network issues in the lab.

In any event, failures in the external network environment shouldn't
cause Android not to boot.

That said, this problem has only happened the one time, so it looks like
it could fall into the "low priority due to infrequency" bucket.  However,
since the failure, quawks has only had two test runs, so the track record
is merely "failed 1 time in 3".  That's not enough for a definitive diagnosis,
so we should hold this open.

Comment 4 by uekawa@chromium.org, May 30 2018

Labels: ArcConstable

Comment 5 by xiy...@chromium.org, May 31 2018

Cc: cernekee@chromium.org
Labels: -Pri-3 Pri-1
Summary: PFQ - Android did not boot because of arc-networkd error (was: quawks-paladin - Android did not boot because of arc-networkd error)
Re-title and bump up priority since the arc-networkd error now happening in PFQ for boards (and first failed build):
- betty-chrome-pfq
  https://luci-milo.appspot.com/buildbot/chromeos/betty-chrome-pfq/1805
- caroline-arcnext-chrome-pfq
  https://luci-milo.appspot.com/buildbot/chromeos/caroline-arcnext-chrome-pfq/439
- cyan-chrome-pfq
  https://luci-milo.appspot.com/buildbot/chromeos/cyan-chrome-pfq/3292
- kevin-arcnext-chrome-pfq
  https://luci-milo.appspot.com/buildbot/chromeos/kevin-arcnext-chrome-pfq/440
- reef-chrome-pfq
  https://luci-milo.appspot.com/buildbot/chromeos/reef-chrome-pfq/2598
- veyron_minnie-chrome-pfq
  https://luci-milo.appspot.com/buildbot/chromeos/veyron_minnie-chrome-pfq/3294

Various cheets_CTS test fails because of it.
Some of them have arc-networkd symbols and the crash stack looks like this:
====
Operating system: Linux
                  0.0.0 Linux 3.18.0-17847-ga95d64571fba #1 SMP PREEMPT Wed May 30 10:11:38 PDT 2018 x86_64
CPU: amd64
     family 6 model 76 stepping 3
     4 CPUs

GPU: UNKNOWN

Crash reason:  SIGABRT
Crash address: 0x0
Process uptime: not available

Thread 0 (crashed)
 0  libc-2.23.so!raise [raise.c : 54 + 0x10]
    rax = 0x0000000000000000   rdx = 0x0000000000000006
    rcx = 0xffffffffffffffff   rbx = 0x0000000000000000
    rsi = 0x0000000000001adb   rdi = 0x0000000000001adb
    rbp = 0x00007ffd8028c7d0   rsp = 0x00007ffd8028c6a8
     r8 = 0x00007ffd8028c6f1    r9 = 0x00007ffd8028c6e7
    r10 = 0x0000000000000008   r11 = 0x0000000000000206
    r12 = 0x00007ffd8028cc68   r13 = 0x00007ffd8028c808
    r14 = 0x00007ffd8028cc78   r15 = 0x00007ffd8028cc70
    rip = 0x0000748e8ba21dd2
    Found by: given as instruction pointer in context
 1  libc-2.23.so!abort [abort.c : 89 + 0xa]
    rbx = 0x0000000000000000   rbp = 0x00007ffd8028c7d0
    rsp = 0x00007ffd8028c6b0   r12 = 0x00007ffd8028cc68
    r13 = 0x00007ffd8028c808   r14 = 0x00007ffd8028cc78
    r15 = 0x00007ffd8028cc70   rip = 0x0000748e8ba23bf6
    Found by: call frame info
 2  libbase-core-395517.so!base::debug::BreakDebugger() [debugger_posix.cc : 219 + 0x5]
    rbx = 0x0000000000000000   rbp = 0x00007ffd8028c7e0
    rsp = 0x00007ffd8028c7e0   r12 = 0x00007ffd8028cc68
    r13 = 0x00007ffd8028c808   r14 = 0x00007ffd8028cc78
    r15 = 0x00007ffd8028cc70   rip = 0x0000748e8c23be41
    Found by: call frame info
 3  libbase-core-395517.so!logging::LogMessage::~LogMessage() [logging.cc : 755 + 0x5]
    rbx = 0x0000000000000000   rbp = 0x00007ffd8028cc50
    rsp = 0x00007ffd8028c7f0   r12 = 0x00007ffd8028cc68
    r13 = 0x00007ffd8028c808   r14 = 0x00007ffd8028cc78
    r15 = 0x00007ffd8028cc70   rip = 0x0000748e8c25f754
    Found by: call frame info
 4  arc-networkd!arc_networkd::IpHelper::OnFileCanReadWithoutBlocking(int) [ip_helper.cc : 88 + 0x26]
    rbx = 0x00007ffd8028cc68   rbp = 0x00007ffd8028d1b0
    rsp = 0x00007ffd8028cc60   r12 = 0x00007ffd8028dae0
    r13 = 0x0000630b8e1b2270   r14 = 0x00007ffd8028d8a0
    r15 = 0x0000000000000002   rip = 0x0000630b8d96f3c6
    Found by: call frame info
 5  libbase-core-395517.so!base::MessagePumpLibevent::OnLibeventNotification(int, short, void*) [message_pump_libevent.cc : 102 + 0x8]
    rbx = 0x0000748e8c325240   rbp = 0x00007ffd8028d350
    rsp = 0x00007ffd8028d1c0   r12 = 0x00007ffd8028dae0
    r13 = 0x0000630b8e1b2270   r14 = 0x0000000000000004
    r15 = 0x0000000000000002   rip = 0x0000748e8c26e55f
    Found by: call frame info
 6  libevent_core-2.1.so.6.0.2!event_process_active_single_queue [event.c : 1580 + 0x9]
    rbx = 0x0000000000000082   rbp = 0x00007ffd8028d3e0
    rsp = 0x00007ffd8028d360   r12 = 0x0000630b8e1b23c8
    r13 = 0x0000630b8e1b2270   r14 = 0x0000630b8e1b2270
    r15 = 0x0000630b8e1bd2b0   rip = 0x0000748e8b3b7dc9
    Found by: call frame info
 7  libevent_core-2.1.so.6.0.2!event_base_loop [event.c : 0 + 0x5]
    rbx = 0x0000000000000000   rbp = 0x00007ffd8028d480
    rsp = 0x00007ffd8028d3f0   r12 = 0x0000630b8e1b2380
    r13 = 0x0000000000000000   r14 = 0x0000630b8e1b2270
    r15 = 0x0000000000000000   rip = 0x0000748e8b3b559e
    Found by: call frame info
 8  libbase-core-395517.so!base::MessagePumpLibevent::Run(base::MessagePump::Delegate*) [message_pump_libevent.cc : 228 + 0xa]
    rbx = 0x0000630b8e1b21c0   rbp = 0x00007ffd8028d4f0
    rsp = 0x00007ffd8028d490   r12 = 0x0000630b8e1b2208
    r13 = 0x0000000000000000   r14 = 0x0000630b8e1b1f40
    r15 = 0x0000000000000000   rip = 0x0000748e8c26e76e
    Found by: call frame info
 9  libbase-core-395517.so!base::MessageLoop::RunHandler() [message_loop.cc : 442 + 0x9]
    rbx = 0x0000630b8e1b1f40   rbp = 0x00007ffd8028d650
    rsp = 0x00007ffd8028d500   r12 = 0x00007ffd8028d728
    r13 = 0x00007ffd8028dcd0   r14 = 0x00007ffd8028d668
    r15 = 0x00007ffd8028d718   rip = 0x0000748e8c2684a1
    Found by: call frame info
10  libbase-core-395517.so!base::RunLoop::Run() [run_loop.cc : 35 + 0x5]
    rbx = 0x00007ffd8028d6c0   rbp = 0x00007ffd8028d6b0
    rsp = 0x00007ffd8028d660   r12 = 0x00007ffd8028d728
    r13 = 0x00007ffd8028dcd0   r14 = 0x00007ffd8028d668
    r15 = 0x00007ffd8028d718   rip = 0x0000748e8c29a3e0
    Found by: call frame info
11  libbrillo-core-395517.so!brillo::BaseMessageLoop::Run() [base_message_loop.cc : 212 + 0x8]
    rbx = 0x00007ffd8028d918   rbp = 0x00007ffd8028d700
    rsp = 0x00007ffd8028d6c0   r12 = 0x00007ffd8028d728
    r13 = 0x00007ffd8028dcd0   r14 = 0x00007ffd8028d6c0
    r15 = 0x00007ffd8028d718   rip = 0x0000748e8c3514c3
    Found by: call frame info
12  libbrillo-core-395517.so!brillo::Daemon::Run() [daemon.cc : 31 + 0x8]
    rbx = 0x00007ffd8028d918   rbp = 0x00007ffd8028d770
    rsp = 0x00007ffd8028d710   r12 = 0x00007ffd8028d728
    r13 = 0x00007ffd8028dcd0   r14 = 0x00007ffd8028d8a0
    r15 = 0x00007ffd8028d718   rip = 0x0000748e8c3473f4
    Found by: call frame info
13  arc-networkd!main [main.cc : 49 + 0x5]
    rbx = 0x00007ffd8028d880   rbp = 0x00007ffd8028dbe0
    rsp = 0x00007ffd8028d780   r12 = 0x0000630b8e1ace50
    r13 = 0x00007ffd8028dcd0   r14 = 0x00007ffd8028dcd8
    r15 = 0x0000000000000004   rip = 0x0000630b8d970ba3
    Found by: call frame info
14  libc-2.23.so!__libc_start_main [libc-start.c : 289 + 0x1a]
    rbx = 0x0000000000000000   rbp = 0x00007ffd8028dcb0
    rsp = 0x00007ffd8028dbf0   r12 = 0x0000630b8d996040
    r13 = 0x00007ffd8028dcd0   r14 = 0x0000000000000000
    r15 = 0x0000000000000000   rip = 0x0000748e8ba0e736
    Found by: call frame info
15  arc-networkd!_start + 0x29
    rbx = 0x0000000000000000   rbp = 0x0000000000000000
    rsp = 0x00007ffd8028dcc0   r12 = 0x0000630b8d9627e0
    r13 = 0x00007ffd8028dcd0   r14 = 0x0000000000000000
    r15 = 0x0000000000000000   rip = 0x0000630b8d962809
    Found by: call frame info
16  0x7ffd8028dcc8
    rbx = 0x0000000000000000   rbp = 0x0000000000000000
    rsp = 0x00007ffd8028dcc8   r12 = 0x0000630b8d9627e0
    r13 = 0x00007ffd8028dcd0   r14 = 0x0000000000000000
    r15 = 0x0000000000000000   rip = 0x00007ffd8028dcc8
    Found by: call frame info
17  arc-networkd!_init + 0xd98
    rsp = 0x00007ffd8028ddf8   rip = 0x0000630b8d9627e0
    Found by: stack scanning
====

This is a CHECK in ip_helper.cc.
https://cs.corp.google.com/chromeos_public/src/platform2/arc/network/ip_helper.cc?rcl=0bfdcd5675ee83f8b929226730d7f1c35bcd9bdb&l=88

The code is about 2-year old (+cernekee@). Not sure why it starts to fail.

Anyone could see whether the CLs in those first failed builds suspicious?

Comment 6 by uekawa@chromium.org, May 31 2018

Cc: abhishekbh@chromium.org hugobenichi@chromium.org

Comment 7 by xiy...@chromium.org, May 31 2018

The arc-networkd crash might be a side effect of android stop running. Looked at the cyan logs:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/204194385-chromeos-test/chromeos4-row11-rack11-host1/cheets_GTS.GtsPlacementTestCases/sysinfo/var/log

I saw that Android reboots due to arcbridgeservice crash.
[  482.449293] <3>init: critical process 'arcbridgeservice' exited 4 times in 4 minutes; rebooting into recovery mode
2018-05-30 23:49:26.889208698-07:00: Post-stop arc-kmsg-logger

And chrome side log:
[5996:5996:0530/234754.863267:VERBOSE1:arc_session_runner.cc(151)] Session start requested: FULL_INSTANCE
[5996:5996:0530/234754.863303:VERBOSE1:arc_session_runner.cc(224)] Starting ARC instance
[5996:6015:0530/234755.358887:ERROR:service_manager_context.cc(247)] Attempting to run unsupported native service: /opt/google/chrome/chrome_renderer.service
[5996:6015:0530/234755.359419:ERROR:service_manager_context.cc(247)] Attempting to run unsupported native service: /opt/google/chrome/chrome_renderer.service
[5996:6015:0530/234755.375156:ERROR:service_manager_context.cc(247)] Attempting to run unsupported native service: /opt/google/chrome/chrome_renderer.service
[5996:6015:0530/234755.375782:ERROR:service_manager_context.cc(247)] Attempting to run unsupported native service: /opt/google/chrome/chrome_renderer.service
[5996:5996:0530/234755.567235:WARNING:memory_kills_monitor.cc(174)] Monitoring has been started
[5996:5996:0530/234756.341093:INFO:arc_session_impl.cc(417)] ARC ready.
[5996:5996:0530/234756.342425:VERBOSE1:arc_bridge_host_impl.cc(264)] Mojo connection lost
[5996:5996:0530/234816.509476:VERBOSE1:arc_session_impl.cc(489)] Notified that ARC instance is stopped 0
[5996:5996:0530/234816.509593:VERBOSE1:arc_bridge_host_impl.cc(264)] Mojo connection lost
[5996:5996:0530/234816.509634:INFO:arc_session_runner.cc(254)] ARC stopped: CRASH
[5996:5996:0530/234816.509677:INFO:arc_session_runner.cc(281)] ARC restarting
[5996:5996:0530/234816.509716:VERBOSE1:arc_session_manager.cc(195)] ARC session is stopped, but being restarted: CRASH
[5996:5996:0530/234821.510705:INFO:arc_session_runner.cc(240)] Restarting ARC instance
[5996:5996:0530/234821.510824:VERBOSE1:arc_session_runner.cc(224)] Starting ARC instance
[5996:5996:0530/234824.246657:INFO:arc_session_impl.cc(417)] ARC ready.
[5996:5996:0530/234824.247488:VERBOSE1:arc_bridge_host_impl.cc(264)] Mojo connection lost
[5996:5996:0530/234840.346583:VERBOSE1:arc_session_impl.cc(489)] Notified that ARC instance is stopped 0
[5996:5996:0530/234840.346724:VERBOSE1:arc_bridge_host_impl.cc(264)] Mojo connection lost
[5996:5996:0530/234840.346779:INFO:arc_session_runner.cc(254)] ARC stopped: CRASH
[5996:5996:0530/234840.346833:INFO:arc_session_runner.cc(281)] ARC restarting
[5996:5996:0530/234840.346890:VERBOSE1:arc_session_manager.cc(195)] ARC session is stopped, but being restarted: CRASH
[5996:5996:0530/234845.347996:INFO:arc_session_runner.cc(240)] Restarting ARC instance
[5996:5996:0530/234845.348251:VERBOSE1:arc_session_runner.cc(224)] Starting ARC instance
[5996:5996:0530/234848.231426:INFO:arc_session_impl.cc(417)] ARC ready.
[5996:5996:0530/234848.236535:VERBOSE1:arc_bridge_host_impl.cc(264)] Mojo connection lost
[5996:5996:0530/234904.170941:VERBOSE1:arc_session_impl.cc(489)] Notified that ARC instance is stopped 0
[5996:5996:0530/234904.171055:VERBOSE1:arc_bridge_host_impl.cc(264)] Mojo connection lost
[5996:5996:0530/234904.171093:INFO:arc_session_runner.cc(254)] ARC stopped: CRASH
[5996:5996:0530/234904.171133:INFO:arc_session_runner.cc(281)] ARC restarting
[5996:5996:0530/234904.171174:VERBOSE1:arc_session_manager.cc(195)] ARC session is stopped, but being restarted: CRASH
[5996:5996:0530/234909.172366:INFO:arc_session_runner.cc(240)] Restarting ARC instance
[5996:5996:0530/234909.172708:VERBOSE1:arc_session_runner.cc(224)] Starting ARC instance
[5996:5996:0530/234912.060216:INFO:arc_session_impl.cc(417)] ARC ready.
[5996:5996:0530/234912.060346:VERBOSE1:arc_bridge_host_impl.cc(264)] Mojo connection lost
[5996:5996:0530/234927.165445:VERBOSE1:arc_session_impl.cc(489)] Notified that ARC instance is stopped 0
[5996:5996:0530/234927.165543:VERBOSE1:arc_bridge_host_impl.cc(264)] Mojo connection lost
[5996:5996:0530/234927.165610:INFO:arc_session_runner.cc(254)] ARC stopped: CRASH
[5996:5996:0530/234927.165651:INFO:arc_session_runner.cc(281)] ARC restarting
[5996:5996:0530/234927.165694:VERBOSE1:arc_session_manager.cc(195)] ARC session is stopped, but being restarted: CRASH

But I could not find logcat logs so did not know why arcbridgeservice is crashing.

Comment 8 by xiy...@chromium.org, May 31 2018

There is a few tests that did have logcat. 
e.g.
In cyan-chrome-pfq 3292, cheets_CameraOrientation
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/204194501-chromeos-test/chromeos2-row6-rack2-host1/cheets_CameraOrientation/sysinfo/var/log/

Sample lines about arcbridgeservice failure:

05-31 07:04:21.916    45    45 I ArcBridgeService: Starting ARC bridge service
...
05-31 07:04:21.979    45    71 E ArcBridgeService: Invalid initial handshake.
...
05-31 07:04:27.236   191   191 I ArcBridgeService: Starting ARC bridge service
05-31 07:04:27.238   191   193 E mojoprocesssupport: [0531/070427.238065:ERROR:MojoProcessSupport.cc(104)] connect /var/run/chrome/arc_bridge.sock: Connection refused
05-31 07:04:27.238   191   193 E ArcMojoProcessSupport: Failed to connect to /var/run/chrome/arc_bridge.sock
====

Similar error found in
https://luci-milo.appspot.com/buildbot/chromeos/caroline-arcnext-chrome-pfq/439
cheets_ContainerSmokeTest_P

====
05-31 06:47:45.788  1332  1332 I ArcBridgeService: Starting ARC bridge service
05-31 06:47:45.789   791  1333 W AsyncOpDispatcher: GCoreClientInfo is not available, or the caller is GMS Core.
05-31 06:47:45.791  1332  1335 E mojoprocesssupport: [0531/064745.791272:ERROR:MojoProcessSupport.cc(104)] connect /var/run/chrome/arc_bridge.sock: Connection refused
05-31 06:47:45.791  1332  1335 E ArcMojoProcessSupport: Failed to connect to /var/run/chrome/arc_bridge.sock
====

Somehow, ArcBridgeService failed to make mojo connection back to chrome.

Comment 9 by xiy...@chromium.org, May 31 2018

cmtm@, could you help to find an owner for this? Otherwise, we might not have PFQ uprev tonight.

Comment 10 by cmtm@chromium.org, May 31 2018

I'm looking into it now. Will update when I find something.

Comment 12 by cmtm@chromium.org, Jun 2 2018

Status: Fixed (was: Assigned)

Sign in to add a comment