kevin CQ: crashed in graphics_Idle
Issue description

Last night a kevin CQ ran:
https://luci-milo.appspot.com/buildbot/chromeos/kevin-paladin/3548

It failed with this message:
graphics_Idle: FAIL: Autotest client terminated unexpectedly: DUT rebooted during the test run.
http://cautotest-prod/tko/retrieve_logs.cgi?job=/results/169496184-chromeos-test/

There aren't too many logs there. ...but I went to cautotest and found the next thing to run (a provision). That led me here:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos6-row2-rack24-host5/3135715-provision/20181201001633/sysinfo/pstore/sys/fs/pstore/

...where we have a ramoops that looks very relevant. Specifically this bit:

[ 993.272152] mali ff9a0000.gpu: error detected from slot 0, job status 0x00000004 (TERMINATED)
[ 993.272259] Unable to handle kernel paging request at virtual address dead0000000001c0
[ 993.272267] pgd = ffffffc0ec821000
[ 993.272271] [dead0000000001c0] *pgd=00000000eca22003, *pud=00000000eca22003, *pmd=0000000000000000
[ 993.272283] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[ 993.272288] Modules linked in: cmac rfcomm btusb btrtl btbcm btintel bluetooth uinput mwifiex_pcie mwifiex uvcvideo zram bridge stp llc ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_mark fuse snd_seq_dummy snd_seq snd_seq_device cfg80211 ip6table_filter cdc_ether usbnet r8152 mii joydev
[ 993.272354] CPU: 4 PID: 6045 Comm: TaskSchedulerCo Not tainted 4.4.110-12483-gbb83cd9af3a7-dirty #1
[ 993.272358] Hardware name: Google Kevin (DT)
[ 993.272362] task: ffffffc0b7b88000 ti: ffffffc0b1948000 task.ti: ffffffc0b1948000
[ 993.272372] PC is at kbase_jd_zap_context+0x80/0xe0
[ 993.272376] LR is at kbase_jd_zap_context+0xa0/0xe0
[ 993.272380] pc : [<ffffffc0005c4564>] lr : [<ffffffc0005c4584>] pstate: 80000145
[ 993.272383] sp : ffffffc0b194bac0
[ 993.272386] x29: ffffffc0b194baf0 x28: ffffffc0dcc31100
[ 993.272393] x27: 0000000000000008 x26: ffffffc0ede661d8
[ 993.272399] x25: dead000000000100 x24: dead000000000100
[ 993.272405] x23: 0000000000004002 x22: ffffff8006594620
[ 993.272411] x21: ffffff8006580068 x20: ffffff8006570368
[ 993.272417] x19: ffffff8006570000 x18: 0000000000000001
[ 993.272424] x17: dead000000000100 x16: 0000000000000000
[ 993.272430] x15: ffffffffffffffff x14: ffffff80065800c0
[ 993.272437] x13: 0000000000000000 x12: 0000000000000018
[ 993.272443] x11: ffffffc0b7b88000 x10: 0000000000000000
[ 993.272449] x9 : cb88537fdc8ba66e x8 : 0000000000000000
[ 993.272455] x7 : 0000000000000003 x6 : 0000000000000000
[ 993.272462] x5 : 0000000000000000 x4 : 0000000000000000
[ 993.272467] x3 : 0000000000000000 x2 : 0000000000000001
[ 993.272474] x1 : 0000000000000140 x0 : 0000000000000000
... snip ...
[ 993.273722]
[ 993.273727] Process TaskSchedulerCo (pid: 6045, stack limit = 0xffffffc0b1948040)
[ 993.273730] Stack: (0xffffffc0b194bac0 to 0xffffffc0b194c000)
[ 993.273735] bac0: 0000000000000001 ffffffc0ede661d8 ffffffc0ede661e8 ffffff80065950f8
[ 993.273739] bae0: ffffffc0ede64000 ffffff8006570000 ffffffc0b194bb40 ffffffc0005c910c
[ 993.273743] bb00: ffffffc0ede661d8 dead000000000100 0000000000000001 ffffffc0ede661d8
[ 993.273746] bb20: ffffffc0ede661e8 ffffff80065950f8 ffffffc0ede64000 ffffff8006570000
[ 993.273750] bb40: ffffffc0b194bbc0 ffffffc0005d25a8 ffffffc0b194bba0 0000000000000000
[ 993.273754] bb60: 0000000000000000 cb88537fdc8ba66e ffffffc0dcc31100 cb88537fdc8ba66e
[ 993.273757] bb80: ffffffc0b1948000 0000000000000008 ffffffc0ee5c1240 ffffffc0ee5c1240
[ 993.273761] bba0: ffffffc0bb680a10 ffffffc0eea09a20 ffffffc0bb680a00 ffffffc0ecc561e0
[ 993.273765] bbc0: ffffffc0b194bc10 ffffffc000412304 0000000000000009 0000000000000000
[ 993.273769] bbe0: 0000000000000000 0000000000000001 ffffffc0bb681400 ffffffc001154810
[ 993.273773] bc00: ffffffc0b7b88000 ffffffc0bb680a00 ffffffc0b194bc30 ffffffc00036ad80
[ 993.273777] bc20: ffffffc000c15b4a ffffffc000c15b4a ffffffc0b194bc60 ffffffc0002dbf0c
[ 993.273781] bc40: ffffffc0ddcd1c00 0000000000000001 0000000000000009 ffffffc0b7b88000
[ 993.273785] bc60: ffffffc0b194bcd0 ffffffc0002201d0 ffffffc0b194bcd4 ffffffc0b7b886a0
[ 993.273788] bc80: ffffffc0dcc31100 0000000000000008 ffffffc0dcc31208 0000000000000009
[ 993.273792] bca0: ffffffc0b1948000 0000000000000001 ffffffc0ddcd1c00 ffffffc0b1948000
[ 993.273796] bcc0: 0000000000000100 0000000000000009 ffffffc0b194bd00 ffffffc0002209c4
[ 993.273800] bce0: 0000000000040004 ffffffc0b194bdd0 0000000000000100 ffffffc0b194bdb0
[ 993.273804] bd00: ffffffc0b194bd80 ffffffc00022b128 00000000ffffffff ffffffc0ddcd1c00
[ 993.273808] bd20: ffffffc0dcc31908 ffffffc00108ca50 ffffffc0b1948000 ffffffc000a04000
[ 993.273812] bd40: 00000000000000f0 0000000000000186 00000000ea4d9514 00000000000000f0
[ 993.273816] bd60: 00000000fffffe00 00000000ea4d9516 ffffffc0b194bec0 0000000000400809
[ 993.273820] bd80: ffffffc0b194beb0 ffffffc000208978 ffffffc0b1948000 ffffffc000a04000
[ 993.273824] bda0: 00000000000000f0 0000000000000080 0000000000000000 00000000ffffffff
[ 993.273827] bdc0: 0000000000000000 0000000011dd2e18 0000000000000009 0000000000000000
[ 993.273831] bde0: 0000000000000000 ffffffc00021c1a0 ffffffff00000000 cb88537fdc8ba66e
[ 993.273835] be00: 0000000000000000 ffffffc0002050a4 0000000000000000 0000000011dd2e1c
[ 993.273839] be20: 0000000000000080 0000000000000001 0000000011dd2e18 0000000000000000
[ 993.273843] be40: ffffffc0b194beb0 ffffffc0002fc908 00000000ea4d9516 cb88537fdc8ba66e
[ 993.273847] be60: ffffffc0b1948000 ffffffc000a04000 00000000000000f0 0000000000000186
[ 993.273851] be80: 0000000000000011 00000000400f0030 00000000ea4d9516 ffffffffffffffff
[ 993.273855] bea0: 0000000000000000 0000000000400800 0000000000000000 ffffffc000202d28
[ 993.273858] bec0: 0000000011dd2e1c 0000000000000080 0000000000000001 0000000000000000
[ 993.273862] bee0: 0000000011dd2e18 0000000000000000 0000000000000001 00000000000000f0
[ 993.273867] bf00: 0000000000000001 0000000011dd2e1c 0000000000000080 00000000e3a8a994
[ 993.273870] bf20: 00000000000000f0 00000000e3a8a970 00000000ea4d4ee7 0000000000000000
[ 993.273874] bf40: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 993.273878] bf60: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 993.273882] bf80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 993.273886] bfa0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 993.273890] bfc0: 00000000ea4d9514 00000000400f0030 0000000011dd2e1c ffffffffffffffff
[ 993.273894] bfe0: 0000000000000000 0000000000000000 d0901ab519f1eb0a 9f7b0adba80b164f
[ 993.273897] Call trace:
[ 993.273902] [<ffffffc0005c4564>] kbase_jd_zap_context+0x80/0xe0
[ 993.273907] [<ffffffc0005c910c>] kbase_destroy_context+0x34/0x174
[ 993.273912] [<ffffffc0005d25a8>] kbase_release+0x138/0x178
[ 993.273918] [<ffffffc000412304>] __fput+0xd0/0x1c4
[ 993.273924] [<ffffffc00036ad80>] ____fput+0x1c/0x28
[ 993.273929] [<ffffffc0002dbf0c>] task_work_run+0x88/0xd0
[ 993.273934] [<ffffffc0002201d0>] do_exit+0x248/0x744
[ 993.273938] [<ffffffc0002209c4>] do_group_exit+0x98/0xa8
[ 993.273942] [<ffffffc00022b128>] get_signal+0x124/0x570
[ 993.273948] [<ffffffc000208978>] do_notify_resume+0xa8/0x56c
[ 993.273952] [<ffffffc000202d28>] work_pending+0x1c/0x20
[ 993.273957] Code: 8b080276 52880057 aa1603f8 1400000d (b940c308)
[ 993.275130] ---[ end trace ebacce565756c7fa ]---

---

Given that a graphics test was running and we got a kbase crash, this seems like a real problem. I know we've been tweaking some graphics stuff on ARM recently, so I'm giving this to gurchetansingh to investigate.

It's possible that the failure is flaky and the issue is already on ToT, since nothing in the blame list looks terribly relevant.
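When triaging a ramoops like the one above, it can help to pull the symbol+offset/size entries out mechanically before going to a debugger. The following is only a minimal Python sketch: `FRAME_RE`, `parse_frames`, and `SAMPLE` are illustrative names (not part of any ChromeOS or autotest tooling), and the sample lines are copied from the oops above.

```python
import re

# Matches entries such as "kbase_jd_zap_context+0x80/0xe0":
# symbol name, hex offset into the function, hex size of the function.
FRAME_RE = re.compile(
    r"(?P<symbol>[A-Za-z_][A-Za-z0-9_.]*)"
    r"\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_frames(oops_text):
    """Return (symbol, offset, size) tuples in order of appearance."""
    return [
        (m.group("symbol"), int(m.group("offset"), 16), int(m.group("size"), 16))
        for m in FRAME_RE.finditer(oops_text)
    ]


# A few lines copied from the ramoops in the issue description.
SAMPLE = """\
[ 993.272372] PC is at kbase_jd_zap_context+0x80/0xe0
[ 993.272376] LR is at kbase_jd_zap_context+0xa0/0xe0
[ 993.273902] [<ffffffc0005c4564>] kbase_jd_zap_context+0x80/0xe0
[ 993.273907] [<ffffffc0005c910c>] kbase_destroy_context+0x34/0x174
[ 993.273924] [<ffffffc00036ad80>] ____fput+0x1c/0x28
"""

frames = parse_frames(SAMPLE)
for symbol, offset, size in frames:
    # Print the offset in decimal too, since gdb's disassembly uses <+N> decimal.
    print(f"{symbol}: +{offset:#x} (+{offset} decimal) of {size:#x}")
```

The decimal offset is the number to look for in gdb's `<+N>` annotations when disassembling the function.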
Jan 12 2018
I couldn't find the exact symbols in the paladin, but I downloaded the debug.tgz from kevin release R65-10302.0.0. I then ran this:

aarch64-cros-linux-gnu-gdb debug/boot/vmlinux

That brought me into gdb. It wasn't too much fun trying to get GDB to resolve to my source files, but I finally ended up with roughly:

mkdir -p tmp/portage/sys-kernel/chromeos-kernel-4_4-4.4.110-r1317/work
cd tmp/portage/sys-kernel/chromeos-kernel-4_4-4.4.110-r1317/work
ln -s /path/to/third_party/kernel/v4.4 chromeos-kernel-4_4-4.4.110

...and then I re-ran GDB. Now, I can run:

disass /s kbase_jd_zap_context

We were crashing at kbase_jd_zap_context+0x80, which is +128 decimal. Thus,

---
(gdb) disass /s kbase_jd_zap_context
Dump of assembler code for function kbase_jd_zap_context:
../../../../../tmp/portage/sys-kernel/chromeos-kernel-4_4-4.4.110-r1317/work/chromeos-kernel-4_4-4.4.110/drivers/gpu/arm/midgard/mali_kbase_jd.c:
1673 {
   0xffffffc0005c44e4 <+0>: stp x24, x23, [sp, #-64]!
   0xffffffc0005c44e8 <+4>: stp x22, x21, [sp, #16]
   0xffffffc0005c44ec <+8>: stp x20, x19, [sp, #32]
   0xffffffc0005c44f0 <+12>: stp x29, x30, [sp, #48]
   0xffffffc0005c44f4 <+16>: add x29, sp, #0x30
   0xffffffc0005c44f8 <+20>: mov x19, x0
   0xffffffc0005c44fc <+24>: bl 0xffffffc000204260 <_mcount>
1674 struct kbase_jd_atom *katom;
1675 struct list_head *entry, *tmp;
1676 struct kbase_device *kbdev;
1677
1678 KBASE_DEBUG_ASSERT(kctx);
1679
1680 kbdev = kctx->kbdev;
1681
1682 KBASE_TRACE_ADD(kbdev, JD_ZAP_CONTEXT, kctx, NULL, 0u, 0u);
1683
1684 kbase_js_zap_context(kctx);
   0xffffffc0005c4500 <+28>: mov x0, x19
   0xffffffc0005c4504 <+32>: bl 0xffffffc0005c8334 <kbase_js_zap_context>
1685
1686 mutex_lock(&kctx->jctx.lock);
   0xffffffc0005c4508 <+36>: add x20, x19, #0x368
   0xffffffc0005c450c <+40>: mov x0, x20
   0xffffffc0005c4510 <+44>: bl 0xffffffc000930a34 <mutex_lock>
1687
1688 /*
1689 * While holding the struct kbase_jd_context lock clean up jobs which are known to kbase but are
1690 * queued outside the job scheduler.
1691 */
1692
1693 hrtimer_cancel(&kctx->soft_event_timeout);
   0xffffffc0005c4514 <+48>: mov w8, #0x5150 // #20816
   0xffffffc0005c4518 <+52>: movk w8, #0x2, lsl #16
   0xffffffc0005c451c <+56>: add x0, x19, x8
   0xffffffc0005c4520 <+60>: bl 0xffffffc0002f6b30 <hrtimer_cancel>
1694 list_for_each_safe(entry, tmp, &kctx->waiting_soft_jobs) {
   0xffffffc0005c4524 <+64>: mov w8, #0x45f8 // #17912
   0xffffffc0005c4528 <+68>: movk w8, #0x2, lsl #16
   0xffffffc0005c452c <+72>: add x21, x19, x8
   0xffffffc0005c4530 <+76>: ldr x0, [x21]
   0xffffffc0005c4534 <+80>: b 0xffffffc0005c4544 <kbase_jd_zap_context+96>
   0xffffffc0005c4538 <+84>: ldr x22, [x0], #-104
1695 katom = list_entry(entry, struct kbase_jd_atom, dep_item[0]);
1696 kbase_cancel_soft_job(katom);
   0xffffffc0005c453c <+88>: bl 0xffffffc0005cc214 <kbase_cancel_soft_job>
   0xffffffc0005c4540 <+92>: mov x0, x22
1694 list_for_each_safe(entry, tmp, &kctx->waiting_soft_jobs) {
   0xffffffc0005c4544 <+96>: cmp x0, x21
   0xffffffc0005c4548 <+100>: b.ne 0xffffffc0005c4538 <kbase_jd_zap_context+84> // b.any
1697 }
1698
1699
1700 #if defined(CONFIG_KDS) || defined(CONFIG_DRM_DMA_SYNC)
1701 /* For each job waiting on a resource, cancel the wait and force the job to
1702 * complete early, this is done so that we don't leave jobs outstanding waiting
1703 * on kds resources which may never be released when contexts are zapped, resulting
1704 * in a hang.
1705 *
1706 * Note that we can safely iterate over the list as the struct kbase_jd_context lock is held,
1707 * this prevents items being removed when calling job_done_nolock in kbase_cancel_kds_wait_job.
1708 */
1709
1710 list_for_each(entry, &kctx->waiting_resource) {
   0xffffffc0005c454c <+104>: mov w8, #0x4620 // #17952
   0xffffffc0005c4550 <+108>: movk w8, #0x2, lsl #16
   0xffffffc0005c4554 <+112>: add x22, x19, x8
   0xffffffc0005c4558 <+116>: mov w23, #0x4002 // #16386
   0xffffffc0005c455c <+120>: mov x24, x22
   0xffffffc0005c4560 <+124>: b 0xffffffc0005c4594 <kbase_jd_zap_context+176>
214 if (katom->status == KBASE_JD_ATOM_STATE_QUEUED) {
   0xffffffc0005c4564 <+128>: ldr w8, [x24, #192]
   0xffffffc0005c4568 <+132>: cmp w8, #0x1
   0xffffffc0005c456c <+136>: b.ne 0xffffffc0005c4594 <kbase_jd_zap_context+176> // b.any
---

It looks like 'katom' is bogus.
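One way to sanity-check that reading is to redo the address arithmetic from the register dump. The sketch below is a hedged illustration, not a confirmed diagnosis: the register values and the `#192` offset come from the oops and the disassembly above, while the `LIST_POISON1` definition (0x100 plus `POISON_POINTER_DELTA`, with the delta taken to be 0xdead000000000000 on arm64) is an assumption about this kernel's configuration. `explain_fault` is an illustrative helper name.

```python
# Values copied from the oops and the disassembly at +128.
FAULT_ADDR = 0xDEAD0000000001C0   # "Unable to handle kernel paging request at ..."
X24 = 0xDEAD000000000100          # x24 from the register dump
LDR_OFFSET = 192                  # from "ldr w8, [x24, #192]"

# Assumption: include/linux/poison.h defines LIST_POISON1 as
# 0x100 + POISON_POINTER_DELTA, and this arm64 build uses
# CONFIG_ILLEGAL_POINTER_VALUE = 0xdead000000000000.
POISON_POINTER_DELTA = 0xDEAD000000000000
LIST_POISON1 = 0x100 + POISON_POINTER_DELTA


def explain_fault(base, offset, fault):
    """Check whether base + offset explains the fault and whether base is list poison."""
    return {
        "address_matches": base + offset == fault,
        "base_is_list_poison1": base == LIST_POISON1,
    }


result = explain_fault(X24, LDR_OFFSET, FAULT_ADDR)
print(result)
```

If both checks hold, the faulting load went through a pointer equal to the value list_del() writes into a removed entry's `next` field, which would be consistent with the loop over kctx->waiting_resource following an entry that had already been removed from the list. That interpretation still needs to be confirmed against the driver source.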
Jan 12 2018
Comment 1 by diand...@chromium.org, Jan 12 2018