Issue 801625

Issue metadata

Status: Duplicate
Owner:
Closed: Jan 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




kevin CQ: crashed in graphics_Idle

Project Member Reported by diand...@chromium.org, Jan 12 2018

Issue description

Last night a kevin CQ ran:
  https://luci-milo.appspot.com/buildbot/chromeos/kevin-paladin/3548

It failed with this message:
  graphics_Idle: FAIL: Autotest client terminated unexpectedly: DUT rebooted during the test run.
  http://cautotest-prod/tko/retrieve_logs.cgi?job=/results/169496184-chromeos-test/

There aren't many logs there, but I went to cautotest and found the next thing to run (a provision).  That led me here:

  https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos6-row2-rack24-host5/3135715-provision/20181201001633/sysinfo/pstore/sys/fs/pstore/

...where we have a ramoops that looks very relevant.  Specifically this bit:

[  993.272152] mali ff9a0000.gpu: error detected from slot 0, job status 0x00000004 (TERMINATED)
[  993.272259] Unable to handle kernel paging request at virtual address dead0000000001c0
[  993.272267] pgd = ffffffc0ec821000
[  993.272271] [dead0000000001c0] *pgd=00000000eca22003, *pud=00000000eca22003, *pmd=0000000000000000
[  993.272283] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[  993.272288] Modules linked in: cmac rfcomm btusb btrtl btbcm btintel bluetooth uinput mwifiex_pcie mwifiex uvcvideo zram bridge stp llc ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_mark fuse snd_seq_dummy snd_seq snd_seq_device cfg80211 ip6table_filter cdc_ether usbnet r8152 mii joydev
[  993.272354] CPU: 4 PID: 6045 Comm: TaskSchedulerCo Not tainted 4.4.110-12483-gbb83cd9af3a7-dirty #1
[  993.272358] Hardware name: Google Kevin (DT)
[  993.272362] task: ffffffc0b7b88000 ti: ffffffc0b1948000 task.ti: ffffffc0b1948000
[  993.272372] PC is at kbase_jd_zap_context+0x80/0xe0
[  993.272376] LR is at kbase_jd_zap_context+0xa0/0xe0
[  993.272380] pc : [<ffffffc0005c4564>] lr : [<ffffffc0005c4584>] pstate: 80000145
[  993.272383] sp : ffffffc0b194bac0
[  993.272386] x29: ffffffc0b194baf0 x28: ffffffc0dcc31100 
[  993.272393] x27: 0000000000000008 x26: ffffffc0ede661d8 
[  993.272399] x25: dead000000000100 x24: dead000000000100 
[  993.272405] x23: 0000000000004002 x22: ffffff8006594620 
[  993.272411] x21: ffffff8006580068 x20: ffffff8006570368 
[  993.272417] x19: ffffff8006570000 x18: 0000000000000001 
[  993.272424] x17: dead000000000100 x16: 0000000000000000 
[  993.272430] x15: ffffffffffffffff x14: ffffff80065800c0 
[  993.272437] x13: 0000000000000000 x12: 0000000000000018 
[  993.272443] x11: ffffffc0b7b88000 x10: 0000000000000000 
[  993.272449] x9 : cb88537fdc8ba66e x8 : 0000000000000000 
[  993.272455] x7 : 0000000000000003 x6 : 0000000000000000 
[  993.272462] x5 : 0000000000000000 x4 : 0000000000000000 
[  993.272467] x3 : 0000000000000000 x2 : 0000000000000001 
[  993.272474] x1 : 0000000000000140 x0 : 0000000000000000 
... snip ...
[  993.273722] 
[  993.273727] Process TaskSchedulerCo (pid: 6045, stack limit = 0xffffffc0b1948040)
[  993.273730] Stack: (0xffffffc0b194bac0 to 0xffffffc0b194c000)
[  993.273735] bac0: 0000000000000001 ffffffc0ede661d8 ffffffc0ede661e8 ffffff80065950f8
[  993.273739] bae0: ffffffc0ede64000 ffffff8006570000 ffffffc0b194bb40 ffffffc0005c910c
[  993.273743] bb00: ffffffc0ede661d8 dead000000000100 0000000000000001 ffffffc0ede661d8
[  993.273746] bb20: ffffffc0ede661e8 ffffff80065950f8 ffffffc0ede64000 ffffff8006570000
[  993.273750] bb40: ffffffc0b194bbc0 ffffffc0005d25a8 ffffffc0b194bba0 0000000000000000
[  993.273754] bb60: 0000000000000000 cb88537fdc8ba66e ffffffc0dcc31100 cb88537fdc8ba66e
[  993.273757] bb80: ffffffc0b1948000 0000000000000008 ffffffc0ee5c1240 ffffffc0ee5c1240
[  993.273761] bba0: ffffffc0bb680a10 ffffffc0eea09a20 ffffffc0bb680a00 ffffffc0ecc561e0
[  993.273765] bbc0: ffffffc0b194bc10 ffffffc000412304 0000000000000009 0000000000000000
[  993.273769] bbe0: 0000000000000000 0000000000000001 ffffffc0bb681400 ffffffc001154810
[  993.273773] bc00: ffffffc0b7b88000 ffffffc0bb680a00 ffffffc0b194bc30 ffffffc00036ad80
[  993.273777] bc20: ffffffc000c15b4a ffffffc000c15b4a ffffffc0b194bc60 ffffffc0002dbf0c
[  993.273781] bc40: ffffffc0ddcd1c00 0000000000000001 0000000000000009 ffffffc0b7b88000
[  993.273785] bc60: ffffffc0b194bcd0 ffffffc0002201d0 ffffffc0b194bcd4 ffffffc0b7b886a0
[  993.273788] bc80: ffffffc0dcc31100 0000000000000008 ffffffc0dcc31208 0000000000000009
[  993.273792] bca0: ffffffc0b1948000 0000000000000001 ffffffc0ddcd1c00 ffffffc0b1948000
[  993.273796] bcc0: 0000000000000100 0000000000000009 ffffffc0b194bd00 ffffffc0002209c4
[  993.273800] bce0: 0000000000040004 ffffffc0b194bdd0 0000000000000100 ffffffc0b194bdb0
[  993.273804] bd00: ffffffc0b194bd80 ffffffc00022b128 00000000ffffffff ffffffc0ddcd1c00
[  993.273808] bd20: ffffffc0dcc31908 ffffffc00108ca50 ffffffc0b1948000 ffffffc000a04000
[  993.273812] bd40: 00000000000000f0 0000000000000186 00000000ea4d9514 00000000000000f0
[  993.273816] bd60: 00000000fffffe00 00000000ea4d9516 ffffffc0b194bec0 0000000000400809
[  993.273820] bd80: ffffffc0b194beb0 ffffffc000208978 ffffffc0b1948000 ffffffc000a04000
[  993.273824] bda0: 00000000000000f0 0000000000000080 0000000000000000 00000000ffffffff
[  993.273827] bdc0: 0000000000000000 0000000011dd2e18 0000000000000009 0000000000000000
[  993.273831] bde0: 0000000000000000 ffffffc00021c1a0 ffffffff00000000 cb88537fdc8ba66e
[  993.273835] be00: 0000000000000000 ffffffc0002050a4 0000000000000000 0000000011dd2e1c
[  993.273839] be20: 0000000000000080 0000000000000001 0000000011dd2e18 0000000000000000
[  993.273843] be40: ffffffc0b194beb0 ffffffc0002fc908 00000000ea4d9516 cb88537fdc8ba66e
[  993.273847] be60: ffffffc0b1948000 ffffffc000a04000 00000000000000f0 0000000000000186
[  993.273851] be80: 0000000000000011 00000000400f0030 00000000ea4d9516 ffffffffffffffff
[  993.273855] bea0: 0000000000000000 0000000000400800 0000000000000000 ffffffc000202d28
[  993.273858] bec0: 0000000011dd2e1c 0000000000000080 0000000000000001 0000000000000000
[  993.273862] bee0: 0000000011dd2e18 0000000000000000 0000000000000001 00000000000000f0
[  993.273867] bf00: 0000000000000001 0000000011dd2e1c 0000000000000080 00000000e3a8a994
[  993.273870] bf20: 00000000000000f0 00000000e3a8a970 00000000ea4d4ee7 0000000000000000
[  993.273874] bf40: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  993.273878] bf60: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  993.273882] bf80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  993.273886] bfa0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  993.273890] bfc0: 00000000ea4d9514 00000000400f0030 0000000011dd2e1c ffffffffffffffff
[  993.273894] bfe0: 0000000000000000 0000000000000000 d0901ab519f1eb0a 9f7b0adba80b164f
[  993.273897] Call trace:
[  993.273902] [<ffffffc0005c4564>] kbase_jd_zap_context+0x80/0xe0
[  993.273907] [<ffffffc0005c910c>] kbase_destroy_context+0x34/0x174
[  993.273912] [<ffffffc0005d25a8>] kbase_release+0x138/0x178
[  993.273918] [<ffffffc000412304>] __fput+0xd0/0x1c4
[  993.273924] [<ffffffc00036ad80>] ____fput+0x1c/0x28
[  993.273929] [<ffffffc0002dbf0c>] task_work_run+0x88/0xd0
[  993.273934] [<ffffffc0002201d0>] do_exit+0x248/0x744
[  993.273938] [<ffffffc0002209c4>] do_group_exit+0x98/0xa8
[  993.273942] [<ffffffc00022b128>] get_signal+0x124/0x570
[  993.273948] [<ffffffc000208978>] do_notify_resume+0xa8/0x56c
[  993.273952] [<ffffffc000202d28>] work_pending+0x1c/0x20
[  993.273957] Code: 8b080276 52880057 aa1603f8 1400000d (b940c308) 
[  993.275130] ---[ end trace ebacce565756c7fa ]---

---

Given that a graphics test was running and we got a kbase crash, this seems like a real problem.  I know we've been tweaking some graphics stuff on ARM recently, so I'm giving this to gurchetansingh to investigate.  It's possible that the failure is flaky and the issue is already on ToT, since nothing in the blame list looks terribly relevant.
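
One detail in the register dump is worth a quick check: x17, x24 and x25 all hold dead000000000100, and the faulting address is dead0000000001c0. That pattern normally corresponds to LIST_POISON1, the value list_del() writes into a removed entry's next pointer. The sketch below assumes this kernel uses the usual arm64 poison base of 0xdead000000000000 (CONFIG_ILLEGAL_POINTER_VALUE) and the 0x100/0x200 offsets from include/linux/poison.h; the helper name classify_pointer is made up for illustration.

```python
# Assumed values: the usual arm64 CONFIG_ILLEGAL_POINTER_VALUE and the
# LIST_POISON offsets from include/linux/poison.h in a 4.4-era kernel.
POISON_POINTER_DELTA = 0xDEAD000000000000
LIST_POISON1 = POISON_POINTER_DELTA + 0x100  # written to entry->next by list_del()
LIST_POISON2 = POISON_POINTER_DELTA + 0x200  # written to entry->prev by list_del()


def classify_pointer(value):
    """Hypothetical helper: name a register value if it is a list poison."""
    if value == LIST_POISON1:
        return "LIST_POISON1"
    if value == LIST_POISON2:
        return "LIST_POISON2"
    return "other"


# Register values copied from the oops above.
registers = {
    "x17": 0xDEAD000000000100,
    "x24": 0xDEAD000000000100,
    "x25": 0xDEAD000000000100,
    "x19": 0xFFFFFF8006570000,
}
fault_address = 0xDEAD0000000001C0

# Registers that hold exactly LIST_POISON1.
poisoned = sorted(name for name, v in registers.items()
                  if classify_pointer(v) == "LIST_POISON1")
# Distance of the faulting address from the poison value.
offset_from_poison = fault_address - LIST_POISON1

print(poisoned)                 # ['x17', 'x24', 'x25']
print(hex(offset_from_poison))  # 0xc0
```

If the assumption holds, the fault is a read 0xc0 bytes past a poisoned list pointer, which would be consistent with walking onto a list entry that had already been removed rather than with random memory corruption.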



 
Cc: dbehr@chromium.org
I couldn't find the exact symbols in the paladin, but I downloaded the debug.tgz from kevin release R65-10302.0.0.  I then ran this:

  aarch64-cros-linux-gnu-gdb debug/boot/vmlinux

That brought me into gdb.  It wasn't much fun trying to get GDB to resolve to my source files, but I finally ended up with roughly:

  mkdir -p tmp/portage/sys-kernel/chromeos-kernel-4_4-4.4.110-r1317/work
  cd tmp/portage/sys-kernel/chromeos-kernel-4_4-4.4.110-r1317/work
  ln -s /path/to/third_party/kernel/v4.4 chromeos-kernel-4_4-4.4.110

...and then I re-ran GDB.  Now, I can run:

  disass /s kbase_jd_zap_context

We were crashing at kbase_jd_zap_context+0x80, which is +128 decimal.  Thus, the instruction of interest is the one at <+128> in this listing:

---

(gdb)   disass /s kbase_jd_zap_context
Dump of assembler code for function kbase_jd_zap_context:
../../../../../tmp/portage/sys-kernel/chromeos-kernel-4_4-4.4.110-r1317/work/chromeos-kernel-4_4-4.4.110/drivers/gpu/arm/midgard/mali_kbase_jd.c:
1673    {
   0xffffffc0005c44e4 <+0>:     stp     x24, x23, [sp, #-64]!
   0xffffffc0005c44e8 <+4>:     stp     x22, x21, [sp, #16]
   0xffffffc0005c44ec <+8>:     stp     x20, x19, [sp, #32]
   0xffffffc0005c44f0 <+12>:    stp     x29, x30, [sp, #48]
   0xffffffc0005c44f4 <+16>:    add     x29, sp, #0x30
   0xffffffc0005c44f8 <+20>:    mov     x19, x0
   0xffffffc0005c44fc <+24>:    bl      0xffffffc000204260 <_mcount>

1674            struct kbase_jd_atom *katom;
1675            struct list_head *entry, *tmp;
1676            struct kbase_device *kbdev;
1677
1678            KBASE_DEBUG_ASSERT(kctx);
1679
1680            kbdev = kctx->kbdev;
1681
1682            KBASE_TRACE_ADD(kbdev, JD_ZAP_CONTEXT, kctx, NULL, 0u, 0u);
1683
1684            kbase_js_zap_context(kctx);
   0xffffffc0005c4500 <+28>:    mov     x0, x19
   0xffffffc0005c4504 <+32>:    bl      0xffffffc0005c8334 <kbase_js_zap_context>

1685
1686            mutex_lock(&kctx->jctx.lock);
   0xffffffc0005c4508 <+36>:    add     x20, x19, #0x368
   0xffffffc0005c450c <+40>:    mov     x0, x20
   0xffffffc0005c4510 <+44>:    bl      0xffffffc000930a34 <mutex_lock>

1687
1688            /*
1689             * While holding the struct kbase_jd_context lock clean up jobs which are known to kbase but are
1690             * queued outside the job scheduler.
1691             */
1692
1693            hrtimer_cancel(&kctx->soft_event_timeout);
   0xffffffc0005c4514 <+48>:    mov     w8, #0x5150                     // #20816
   0xffffffc0005c4518 <+52>:    movk    w8, #0x2, lsl #16
   0xffffffc0005c451c <+56>:    add     x0, x19, x8
   0xffffffc0005c4520 <+60>:    bl      0xffffffc0002f6b30 <hrtimer_cancel>

1694            list_for_each_safe(entry, tmp, &kctx->waiting_soft_jobs) {
   0xffffffc0005c4524 <+64>:    mov     w8, #0x45f8                     // #17912
   0xffffffc0005c4528 <+68>:    movk    w8, #0x2, lsl #16
   0xffffffc0005c452c <+72>:    add     x21, x19, x8
   0xffffffc0005c4530 <+76>:    ldr     x0, [x21]
   0xffffffc0005c4534 <+80>:    b       0xffffffc0005c4544 <kbase_jd_zap_context+96>
   0xffffffc0005c4538 <+84>:    ldr     x22, [x0], #-104

1695                    katom = list_entry(entry, struct kbase_jd_atom, dep_item[0]);
1696                    kbase_cancel_soft_job(katom);
   0xffffffc0005c453c <+88>:    bl      0xffffffc0005cc214 <kbase_cancel_soft_job>
   0xffffffc0005c4540 <+92>:    mov     x0, x22

1694            list_for_each_safe(entry, tmp, &kctx->waiting_soft_jobs) {
   0xffffffc0005c4544 <+96>:    cmp     x0, x21
   0xffffffc0005c4548 <+100>:   b.ne    0xffffffc0005c4538 <kbase_jd_zap_context+84>  // b.any

1697            }
1698
1699
1700    #if defined(CONFIG_KDS) || defined(CONFIG_DRM_DMA_SYNC)
1701            /* For each job waiting on a resource, cancel the wait and force the job to
1702             * complete early, this is done so that we don't leave jobs outstanding waiting
1703             * on kds resources which may never be released when contexts are zapped, resulting
1704             * in a hang.
1705             *
1706             * Note that we can safely iterate over the list as the struct kbase_jd_context lock is held,
1707             * this prevents items being removed when calling job_done_nolock in kbase_cancel_kds_wait_job.
1708             */
1709
1710            list_for_each(entry, &kctx->waiting_resource) {
   0xffffffc0005c454c <+104>:   mov     w8, #0x4620                     // #17952
   0xffffffc0005c4550 <+108>:   movk    w8, #0x2, lsl #16
   0xffffffc0005c4554 <+112>:   add     x22, x19, x8
   0xffffffc0005c4558 <+116>:   mov     w23, #0x4002                    // #16386
   0xffffffc0005c455c <+120>:   mov     x24, x22
   0xffffffc0005c4560 <+124>:   b       0xffffffc0005c4594 <kbase_jd_zap_context+176>

214             if (katom->status == KBASE_JD_ATOM_STATE_QUEUED) {
   0xffffffc0005c4564 <+128>:   ldr     w8, [x24, #192]
   0xffffffc0005c4568 <+132>:   cmp     w8, #0x1
   0xffffffc0005c456c <+136>:   b.ne    0xffffffc0005c4594 <kbase_jd_zap_context+176>  // b.any

---

It looks like 'katom' is bogus.
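
The arithmetic behind that conclusion can be checked against the oops. This is a minimal sketch using only numbers quoted above (the pc, the function's start address from the listing, x24 from the register dump, and the #192 immediate from the ldr); the variable names are just for illustration.

```python
# Values copied from the oops and the gdb listing.
pc = 0xFFFFFFC0005C4564              # "pc :" in the oops
func_start = 0xFFFFFFC0005C44E4      # <+0> of kbase_jd_zap_context
x24 = 0xDEAD000000000100             # base register of the faulting ldr
ldr_immediate = 192                  # ldr w8, [x24, #192]
reported_fault = 0xDEAD0000000001C0  # "Unable to handle kernel paging request"

# Offset of the faulting instruction within the function.
pc_offset = pc - func_start
# Address the ldr tried to read.
effective_address = x24 + ldr_immediate

print(pc_offset, hex(pc_offset))                                    # 128 0x80
print(hex(effective_address), effective_address == reported_fault)  # 0xdead0000000001c0 True
```

Both numbers line up: the faulting instruction is the load at <+128>, and it faults because x24 is not a valid kernel pointer, so the katom->status read goes through a bogus katom.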
Mergedinto: 774348
Status: Duplicate (was: Untriaged)
