tracking: CQ is broken by DUT reboot issues |
|||||||||||||||
Issue descriptionThis week there're two similar failed CQ: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/16420 https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/16411 Their common points include: 1. More than 1 board failed, actually at least 3 boards failed. 2. Some tests failed due to DUT reboot in the test: In kevin-paladin of master build 16411: graphics_Idle: FAIL: Autotest client terminated unexpectedly: DUT rebooted during the test run. In peppy-paladin of master build 16420: logging_UserCrash: FAIL: Autotest client terminated unexpectedly: DUT rebooted during the test run. 3. Some provision failures: In reef-uni-paladin of master build 16411: RootfsUpdateError('Update failed with unexpected update status: UPDATE_STATUS_IDLE',) In edgar-paladin of master build 16420 (The real error, not the raised error in builder link): RootfsUpdateError("Failed to perform rootfs update: RootfsUpdateError('Update failed with unexpected update status: UPDATE_STATUS_IDLE',) 4. Cheet tests failed due to DUT reboot in the test: In wizpig-paladin of master build 16411: cheets_StartAndroid.stress: FAIL: Autotest client terminated unexpectedly: DUT rebooted during the test run. In sentry-paladin of master build 16420: cheets_StartAndroid.stress: FAIL: Autotest client terminated unexpectedly: DUT rebooted during the test run. I doubted there's a bad CL, but the only common CL list for the 2 rounds are 677132 677141 679211 679264 687976, 4 of which has been successfully submitted, and one of them (687976) is about waterfall config, make me hard to believe it breaks these boards. So there's a bug in ToT? or just coincidence? Filed this bug for tracking and discussing. CC next week's deputy & secondary.
,
Sep 29 2017
9975 did not have an android uprev or chrome uprev. so the changes that can affect is chrome os changes only? https://crosland.corp.google.com/log/9974.0.0..9975.0.0
,
Sep 29 2017
Regarding these:
> In kevin-paladin of master build 16411:
> graphics_Idle: FAIL: Autotest client terminated unexpectedly: DUT rebooted during the test run.
> In peppy-paladin of master build 16420:
> logging_UserCrash: FAIL: Autotest client terminated unexpectedly: DUT rebooted during the test run.
The symptom is superficially similar to the cheets failures:
system crash during testing. However, it's possible that the
problems are different.
So, the next actions:
* Haul in a current sheriff. Maybe also an ARC constable.
Better still, haul them all in:
https://youtu.be/RZnQxEFJE_c?t=15s
* The sheriffs should take a look at a reasonable sampling of
the failures, and figure out what crashed, at least to the
level of identifying a stack trace.
* There should be new bugs for all the distinct crashes found.
Probably, all the bugs should be listed as Dev- and Beta-stoppers.
,
Sep 29 2017
> 9975 did not have an android uprev or chrome uprev. > so the changes that can affect is chrome os changes only? > > https://crosland.corp.google.com/log/9974.0.0..9975.0.0 That's a plausible blamelist, but note that I simplified the description of the problem. All 9975 failures were ABORTS. That _could_ be caused by a problem in ARC introduced in the build, or it _could_ be a problem in the infrastructure, or it _could_ be a random Chrome OS that just happened to pick an ARC test to be its victim. All that means we need to look at the crashes to narrow down the nature of the problem.
,
Sep 29 2017
Took a look at samus failure samus-release/R63-9982.0.0/bvt-perbuild/cheets_StartAndroid.stress.9 ... logs tell me test failed with exit code 255 and failed to salvage files via rsync because ssh no longer connects to the machine. https://wmatrix.googleplex.com/testrun/unfiltered?test_ids=564701257 09/27 23:53:22.606 DEBUG| autotest:1277| AUTOTEST_STATUS::START ---- ---- timestamp=1506581601 localtime=Sep 27 23:53:21 09/27 23:53:22.608 INFO | server_job:0214| START ---- ---- timestamp=1506581601 localtime=Sep 27 23:53:21 09/27 23:53:22.676 DEBUG| autotest:1277| AUTOTEST_STATUS:: START cheets_StartAndroid.stress cheets_StartAndroid.stress timestamp=1506581602 localtime=Sep 27 23:53:22 09/27 23:53:22.677 INFO | server_job:0214| START cheets_StartAndroid.stress cheets_StartAndroid.stress timestamp=1506581602 localtime=Sep 27 23:53:22 09/28 00:24:09.229 DEBUG| autotest:0952| Result exit status is 255.
,
Sep 29 2017
From what I understand, cheets_StartAndroid.stress tries to log into chrome OS 10 times. minnie log file on the same revision 9982 looks better but failing differently: https://wmatrix.googleplex.com/testrun/unfiltered?test_ids=564631010 python2.7 crashed and dumped core. That probably caused the test failure. https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/145353456-chromeos-test/chromeos2-row6-rack8-host13/crashinfo.chromeos2-row6-rack8-host13 I can see log files for cheets_StartAndroid.stress itself and the test seems to have succeeded: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/145353456-chromeos-test/chromeos2-row6-rack8-host13/cheets_StartAndroid.stress 09/27 22:32:41.641 INFO | arc_common:0037| Waiting for Android to boot completely. 09/27 22:32:41.642 DEBUG| utils:0212| Running 'android-sh -c "getprop sys.boot_completed"' 09/27 22:32:43.677 DEBUG| utils:0212| Running 'android-sh -c "getprop sys.boot_completed"' 09/27 22:32:45.710 DEBUG| utils:0212| Running 'android-sh -c "getprop sys.boot_completed"' 09/27 22:32:47.760 DEBUG| utils:0212| Running 'android-sh -c "getprop sys.boot_completed"' 09/27 22:32:49.798 DEBUG| utils:0212| Running 'android-sh -c "getprop sys.boot_completed"' 09/27 22:32:52.182 DEBUG| utils:0212| Running 'android-sh -c "getprop sys.boot_completed"' 09/27 22:32:54.512 DEBUG| utils:0212| Running 'android-sh -c "getprop sys.boot_completed"' 09/27 22:32:56.883 DEBUG| utils:0212| Running 'android-sh -c "getprop sys.boot_completed"' 09/27 22:32:59.262 DEBUG| utils:0212| Running 'android-sh -c "getprop sys.boot_completed"' 09/27 22:32:59.860 INFO | arc_common:0043| Android has booted completely. 09/27 22:33:01.864 DEBUG| arc_util:0040| ARC is enabled in mode enabled 09/27 22:33:01.864 INFO | arc_util:0104| Saving Android dumpstate. 09/27 22:33:19.167 DEBUG| logging_manager:0627| Logging subprocess finished 09/27 22:33:19.170 DEBUG| logging_manager:0627| Logging subprocess finished
,
Sep 29 2017
Due to it's possibly a bug in ToT and may raise more failures for following deputies: cc arc++ constable @lhchavez to investigate the cheet failure. assign to current sheriff @vbendeb to investigate others (maybe they're in the same root cause with cheet).
,
Sep 29 2017
FYI, arc++ constable is uekawa@ (non-PST) and elijahtaylor@ (PST) this week.
,
Sep 29 2017
,
Sep 29 2017
I've tried running the test locally, it seems to generate a lot of crashdumps on each invocation. $ test_that --board=kevin --no-quickmerge kevin cheets_StartAndroid.stress.1 . . 23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/sslh_fork.20170929.092540.2170.dmp 23:24:00 INFO | autoserv| INFO ---- ---- timestamp=1506659040 localtime=Sep 28 23:24:00 Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/sslh_fork.20170929.092540.2170.dmp 23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170929.092616.1558.dmp 23:24:00 INFO | autoserv| INFO ---- ---- timestamp=1506659040 localtime=Sep 28 23:24:00 Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170929.092616.1558.dmp 23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/sslh_fork.20170925.160532.2217.dmp 23:24:00 INFO | autoserv| INFO ---- ---- timestamp=1506659040 localtime=Sep 28 23:24:00 Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/sslh_fork.20170925.160532.2217.dmp 23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170925.160515.1547.dmp 23:24:00 INFO | autoserv| INFO ---- ---- timestamp=1506659040 localtime=Sep 28 23:24:00 Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170925.160515.1547.dmp 23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170928.101033.1545.dmp 23:24:00 INFO | autoserv| INFO ---- ---- timestamp=1506659040 localtime=Sep 28 23:24:00 Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170928.101033.1545.dmp 23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/sslh_fork.20170929.092639.2613.dmp 23:24:00 INFO | autoserv| INFO ---- ---- timestamp=1506659040 localtime=Sep 28 23:24:00 Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/sslh_fork.20170929.092639.2613.dmp 23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170926.103025.1554.dmp 23:24:00 INFO | autoserv| INFO ---- ---- timestamp=1506659040 localtime=Sep 28 23:24:00 Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170926.103025.1554.dmp 23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170929.092715.1533.dmp 23:24:00 INFO | autoserv| INFO ---- ---- timestamp=1506659040 localtime=Sep 28 23:24:00 Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170929.092715.1533.dmp 23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170925.160608.1540.dmp 23:24:00 INFO | autoserv| INFO ---- ---- timestamp=1506659040 localtime=Sep 28 23:24:00 Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170925.160608.1540.dmp 23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/sslh_fork.20170928.101054.2785.dmp 23:24:00 INFO | autoserv| INFO ---- ---- timestamp=1506659040 localtime=Sep 28 23:24:00 Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/sslh_fork.20170928.101054.2785.dmp 23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/sslh_fork.20170926.102850.2639.dmp 23:24:00 INFO | autoserv| INFO ---- ---- timestamp=1506659040 localtime=Sep 28 23:24:00 Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/sslh_fork.20170926.102850.2639.dmp 23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/sslh_fork.20170926.102949.2201.dmp 23:24:00 INFO | autoserv| INFO ---- ---- timestamp=1506659040 localtime=Sep 28 23:24:00 Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/sslh_fork.20170926.102949.2201.dmp 23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170928.101101.1600.dmp 23:24:00 INFO | autoserv| INFO ---- ---- timestamp=1506659040 localtime=Sep 28 23:24:00 Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170928.101101.1600.dmp 23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/sslh_fork.20170928.100956.2209.dmp 23:24:00 INFO | autoserv| INFO ---- ---- timestamp=1506659040 localtime=Sep 28 23:24:00 Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/sslh_fork.20170928.100956.2209.dmp 23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170926.102931.1563.dmp 23:24:00 INFO | autoserv| INFO ---- ---- timestamp=1506659040 localtime=Sep 28 23:24:00 Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170926.102931.1563.dmp 23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/chrome.20170905.150319.1599.dmp 23:24:00 INFO | autoserv| INFO ---- ---- timestamp=1506659040 localtime=Sep 28 23:24:00 Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/chrome.20170905.150319.1599.dmp
,
Sep 29 2017
oops, I wasn't looking at the timestamps, local crashdumps were just previous old crash dumps. so far the test isn't failing locally on my kevin. Not sure if this is just flakily failing 1/100 time.
,
Sep 29 2017
On login screen, I see lots of error messages on ui.log Hmm... the timestamp seems to be weird. 0929/003000 is PST, not local time. localhost ui # date -u Fri Sep 29 07:30:32 UTC 2017 localhost ui # date Fri Sep 29 16:30:43 JST 2017 and about 4 log lines are spit out per second, my ui log seems to have 10k lines. [17008:17008:0929/002957.418666:ERROR:gl_image_native_pixmap.cc(240)] Failed to flush rendering seems to be a bug for that crbug.com/765515
,
Sep 29 2017
I looked at another log kevin-release/R63-9980.0.0/bvt-perbuild/cheets_StartAndroid.stress.1 it seems to be failing at third "restart ui" and the rest of cheets_StartAndroid.stress.DEBUG are corrupted. Hmm... pstore/console-ramoops contains: [ 342.638989] arc0: renamed from slave_android [ 342.745473] binder: 16538:16552 transaction failed 29189, size 0-0 [ 342.939470] Unable to handle kernel NULL pointer dereference at virtual address 00000001 [ 342.939490] pgd = ffffffc0b6a67000 [ 342.939496] [00000001] *pgd=00000000bcec9003, *pud=00000000bcec9003, *pmd=0000000000000000 [ 342.939512] Internal error: Oops: 96000006 [#1] PREEMPT SMP [ 342.943068] Call trace: [ 342.943080] [<ffffffc0003338bc>] kmem_cache_alloc+0xb8/0x23c [ 342.943091] [<ffffffc0003acb50>] __kernfs_new_node+0x5c/0xd8 [ 342.943098] [<ffffffc0003acacc>] kernfs_new_node+0x44/0x6c [ 342.943106] [<ffffffc0003aed30>] __kernfs_create_file+0x50/0xd0 [ 342.943114] [<ffffffc0003af57c>] sysfs_add_file_mode_ns+0x100/0x168 [ 342.943122] [<ffffffc0003b06d0>] internal_create_group+0x120/0x28c [ 342.943129] [<ffffffc0003b05a4>] sysfs_create_group+0x28/0x34 [ 342.943135] [<ffffffc000336160>] sysfs_slab_add+0x110/0x1a0 [ 342.943142] [<ffffffc000335ffc>] __kmem_cache_create+0x2c8/0x31c [ 342.943150] [<ffffffc0003141ac>] kmem_cache_create+0x12c/0x1b0 [ 342.943160] [<ffffffc0007fc380>] nf_conntrack_init_net+0x114/0x244 [ 342.943168] [<ffffffc0007fc8f0>] nf_conntrack_pernet_init+0x20/0x16c [ 342.943176] [<ffffffc0007bae54>] ops_init+0xd0/0x120 [ 342.943183] [<ffffffc0007ba580>] setup_net+0x90/0x12c [ 342.943189] [<ffffffc0007ba468>] copy_net_ns+0xa8/0x130 [ 342.943197] [<ffffffc00023d760>] create_new_namespaces+0x118/0x190 [ 342.943204] [<ffffffc00023d950>] unshare_nsproxy_namespaces+0x7c/0xa4 [ 342.943214] [<ffffffc00021e374>] SyS_unshare+0x1a0/0x30c [ 342.943223] [<ffffffc000203e34>] el0_svc_naked+0x24/0x28 [ 342.943230] Code: b98022aa b940196c 910003e9 9272c529 (f86a6ac8) [ 342.943622] ---[ end trace 70c4ff6dce8c98ef ]--- so probably a kernel panic.
,
Sep 29 2017
Doug,there seems to be a kernel panic happening on Kevin when recent builds are running ARC tests, can you please take a look.
,
Sep 29 2017
Oooo, our friend netfilter. Added Kevin and Brian. I'll take a quick look and see if I see something obvious.
,
Sep 29 2017
Actually, +mka too on the off chance this is somehow related to clang. :-P
,
Sep 29 2017
Hard to tell if it is related to clang. kevin was switched to clang with 9967.0.0 (09/23). Since this doesn't seem to be easily reproducible it might be necessary to revert the switch to clang to find out if we don't find other culprits.
,
Sep 29 2017
At: nf_conntrack_init_net+0x114/0x244 ...in decimal (to match gdb): nf_conntrack_init_net+276/580 --- 1. Let's get a little more details on the crash. No links above, so I guess I gotta find it myself. :( 2. Hrm, there were lots of references to the paladins failing above, but I guess this is just about "kevin-release/R63-9980.0.0/bvt-perbuild/cheets_StartAndroid.stress.1" now? 3. Check wmatrix, go to "suites", go to "bvt-perbuild", go to "R63", click on "cheets_StartAndroid.stress", find the line for kevin/R63-9980.0.0, click "logs". Should go to <https://wmatrix.googleplex.com/testrun_log/unfiltered?test_ids=564442454> which takes you to storage. 4. Search around and find <https://storage.cloud.google.com/chromeos-autotest-results/145280500-chromeos-test/chromeos6-row2-rack24-host1/crashinfo.chromeos6-row2-rack24-host1/pstore/pstore/console-ramoops>. ...so interestingly enough, the crash above isn't the first strange thing in the log. First I see: [ 1.235702] WARNING: at /mnt/host/source/src/third_party/kernel/v4.4/fs/kernfs/dir.c:533 That's this warning: > void kernfs_get(struct kernfs_node *kn) > { > if (kn) { > WARN_ON(!atomic_read(&kn->count)); > atomic_inc(&kn->count); > } > } That's a really unexpected warning. Especially 1.2 seconds into boot (!). Then more errors, like: [ 233.197309] kernfs_put: misc/cpu_dma_latency: released with incorrect active_ref 0 [ 233.197338] ------------[ cut here ]------------ [ 233.197343] WARNING: at /mnt/host/source/src/third_party/kernel/v4.4/fs/kernfs/dir.c:562 ... [ 292.784386] WARNING: at /mnt/host/source/src/third_party/kernel/v4.4/fs/kernfs/dir.c:533 Then: [ 329.660611] ida_remove called for id=20089 which is not allocated. [ 329.660633] ------------[ cut here ]------------ [ 329.660638] WARNING: at /mnt/host/source/src/third_party/kernel/v4.4/lib/idr.c:1051 ...and finally the crash in question: [ 342.939470] Unable to handle kernel NULL pointer dereference at virtual address 00000001 [ 342.939490] pgd = ffffffc0b6a67000 [ 342.939496] [00000001] *pgd=00000000bcec9003, *pud=00000000bcec9003, *pmd=0000000000000000 [ 342.939512] Internal error: Oops: 96000006 [#1] PREEMPT SMP [ 342.939520] Modules linked in: ip6t_REJECT nf_reject_ipv6 xt_TCPMSS ip6table_mangle veth uinput rfcomm cmac btusb btrtl btbcm btintel uvcvideo mwifiex_pcie videobuf2_vmalloc mwifiex xt_nat bluetooth bridge zram stp llc ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat fuse xt_mark cfg80211 ip6table_filter cdc_ether usbnet r8152 mii joydev [ 342.939629] CPU: 1 PID: 16586 Comm: mount-passthrou Tainted: G W 4.4.86-11747-gb3bf69ebcfe9 #1 [ 342.939635] Hardware name: Google Kevin (DT) [ 342.939641] task: ffffffc0b82fe800 ti: ffffffc0db6a0000 task.ti: ffffffc0db6a0000 [ 342.939655] PC is at kmem_cache_alloc+0xb8/0x23c [ 342.939663] LR is at kmem_cache_alloc+0x30/0x23c [ 342.939669] pc : [<ffffffc0003338bc>] lr : [<ffffffc000333834>] pstate: 60000145 [ 342.939673] sp : ffffffc0db6a39f0 [ 342.939679] x29: ffffffc0db6a3a20 x28: 00000000000011b4 [ 342.939688] x27: 000000000000ee4b x26: ffffffc0010befe8 [ 342.939697] x25: 00000000024000c0 x24: ffffffc0011c3000 [ 342.939707] x23: 000000000000a251 x22: 0000000000000001 [ 342.939716] x21: ffffffc0eef08800 x20: 00000000024080c0 [ 342.939726] x19: ffffffc0003acb50 x18: 0000000000000007 [ 342.939735] x17: 000000000000001f x16: 0000000000000001 [ 342.939745] x15: 0000000000000680 x14: 0000000000000000 [ 342.939754] x13: 0000000000486941 x12: 0000000000000000 [ 342.939764] x11: ffffffc0db6a0000 x10: 0000000000000000 [ 342.939773] x9 : ffffffc0db6a0000 x8 : ffffffc0db6a39f0 [ 342.939782] x7 : 0000000000000000 x6 : 0000000000000000 [ 342.939791] x5 : ffffffc0010bf340 x4 : ffffffc000a1bb48 [ 342.939800] x3 : 0000000000000002 x2 : 0000000000000000 [ 342.939809] x1 : 0000000000000167 x0 : ffffffc0eef08800 --- Something is weird. My gut instinct would be bad hardware, but I'll keep digging.
,
Sep 29 2017
OK, I don't think it's bad hardware
===
* When I build the kernel myself functions seem to be a different size (?), so grabbing official symbols for 9980.0.0 on goldeneye. Basically you want "debug-kevin.tgz" in the artifacts directory. It's big.
* Get the kernel symbols out with:
tar xvf ~/Downloads/dev-channel%2Fkevin%2F9980.0.0%2Fdebug-kevin.tgz debug/boot/vmlinux
* Start gdb: aarch64-cros-linux-gnu-gdb /b/tip/tmp/debug/boot/vmlinux
* Now:
(gdb) set pagination off
(gdb) disass /s kmem_cache_alloc
Dump of assembler code for function kmem_cache_alloc:
/mnt/host/source/src/third_party/kernel/v4.4/mm/slub.c:
2522 {
0xffffffc000333804 <+0>: stp x24, x23, [sp,#-64]!
0xffffffc000333808 <+4>: stp x22, x21, [sp,#16]
0xffffffc00033380c <+8>: stp x20, x19, [sp,#32]
...
...
/mnt/host/source/src/third_party/kernel/v4.4/mm/slub.c:
247 return *(void **)(object + s->offset);
0xffffffc0003338bc <+184>: ldr x8, [x22,x10]
* Note that 184 is 0xb8. Also note that x22 + x10 = 0x1 which matches where the illegal access was. Not sure that helps us much, though. :(
===
I picked the same host and picked a random other test (started at cautotest/, went to "View Host", picked "chromeos6-row2-rack24-host1", picked something random and looked at its logs.
I ended up going to:
https://storage.cloud.google.com/chromeos-autotest-results/145696267-chromeos-test/chromeos6-row2-rack24-host1/sysinfo/var/log/messages
...and I found in the logs more failures that look similar to the weird "dir" ones above. :-(
2017-09-29T05:50:05.908291-07:00 WARNING kernel: [ 2.235854] WARNING: at /mnt/host/source/src/third_party/kernel/v4.4/fs/kernfs/dir.c:533
...
2017-09-29T05:50:05.916280-07:00 WARNING kernel: [ 2.243532] kernfs_put: power/control: released with incorrect active_ref 0
2017-09-29T05:50:05.916307-07:00 WARNING kernel: [ 2.243550] ------------[ cut here ]------------
2017-09-29T05:50:05.916310-07:00 WARNING kernel: [ 2.243555] WARNING: at /mnt/host/source/src/third_party/kernel/v4.4/fs/kernfs/dir.c:562
===
This is really weird...I'll try grabbing an official build and see if I can repro that way on my hardware. I can also dig through other hardware. No idea if the "dir" warning is somehow the root cause of the other but it's suspicious...
,
Sep 29 2017
OK, I put official 9980.0.0 on my kevin and installed it to emmc. I then rebooted. I noticed that it rebooted twice and found in console-ramoops more mysterious stuff: [ 0.894320] GPT: Use GNU Parted to correct GPT errors. [ 0.894358] mmcblk1: p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 [ 0.998757] mmc0: MAN_BKOPS_EN bit is not set [ 1.013091] mmc0: new HS400 Enhanced strobe MMC card at address 0001 [ 1.013872] mmcblk0: mmc0:0001 500073 29.1 GiB [ 1.014367] mmcblk0boot0: mmc0:0001 500073 partition 1 4.00 MiB [ 1.014843] mmcblk0boot1: mmc0:0001 500073 partition 2 4.00 MiB [ 1.015250] mmcblk0rpmb: mmc0:0001 500073 partition 3 4.00 MiB [ 1.019811] mmcblk0: p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 [ 1.259699] Unable to handle kernel NULL pointer dereference at virtual address 00000000 [ 1.259733] pgd = ffffffc001210000 [ 1.259739] [00000000] *pgd=0000000000221003, *pud=0000000000221003, *pmd=0000000000222003, *pte=00e80000fee00707 [ 1.259760] Internal error: Oops: 96000005 [#1] PREEMPT SMP [ 1.259768] Modules linked in: [ 1.259782] CPU: 0 PID: 4 Comm: kworker/0:0 Not tainted 4.4.86-11747-gb3bf69ebcfe9 #1 [ 1.259787] Hardware name: Google Kevin (DT) [ 1.259806] Workqueue: events dwc3_rockchip_otg_extcon_evt_work [ 1.259814] task: ffffffc0eefea700 ti: ffffffc0ef004000 task.ti: ffffffc0ef004000 [ 1.259827] PC is at reset_control_deassert+0x18/0x40 [ 1.259838] LR is at tcphy_phy_init+0x280/0x334 [ 1.259844] pc : [<ffffffc000517de4>] lr : [<ffffffc0004c7500>] pstate: 60000045 [ 1.259850] sp : ffffffc0ef007820 [ 1.259855] x29: ffffffc0ef007830 x28: 0000000000000001 [ 1.259865] x27: 0000000000000001 x26: ffffffc0ef1c4640 [ 1.259875] x25: ffffffc0ef180c10 x24: ffffffc0ef1c4628 [ 1.259884] x23: 000000005533194a x22: 0000000000000002 [ 1.259894] x21: 0000000000000002 x20: 0000000000000002 [ 1.259903] x19: ffffffc0ed73c000 x18: 0000000000000000 [ 1.259913] x17: 0000000000000010 x16: ffffff80002fefff [ 1.259923] x15: ffffff80002ff000 x14: 0000000000011c8c [ 1.259932] x13: 00000000000113a0 x12: 0000000000010c0c [ 1.259942] x11: 0000000000008fff x10: 000000000000c104 [ 1.259951] x9 : 0000000000030020 x8 : ffffff80040b0020 [ 1.259961] x7 : 0000000000000000 x6 : 000000000000003f [ 1.259971] x5 : 0000000000000000 x4 : 0000000000000004 [ 1.259980] x3 : ffffffc0eeeab004 x2 : 0000000000000001 [ 1.259990] x1 : 0000000000000001 x0 : 0000000000000000 ... ... [ 1.262572] Call trace: [ 1.262580] [<ffffffc000517de4>] reset_control_deassert+0x18/0x40 [ 1.262588] [<ffffffc0004c7500>] tcphy_phy_init+0x280/0x334 [ 1.262596] [<ffffffc0004c7be4>] rockchip_usb3_phy_power_on+0x70/0x13c [ 1.262603] [<ffffffc0004c327c>] phy_power_on+0xa8/0xe0 [ 1.262614] [<ffffffc0006372ec>] dwc3_core_init+0x508/0x838 [ 1.262622] [<ffffffc000636b94>] dwc3_probe+0x9c0/0xb50 [ 1.262632] [<ffffffc0005c0c9c>] platform_drv_probe+0x5c/0xb0 [ 1.262640] [<ffffffc0005be894>] driver_probe_device+0x224/0x264 [ 1.262648] [<ffffffc0005bef64>] __device_attach_driver+0x84/0x9c [ 1.262657] [<ffffffc0005bcd58>] bus_for_each_drv+0x80/0xc8 [ 1.262663] [<ffffffc0005be9f4>] __device_attach+0xa4/0x128 [ 1.262676] [<ffffffc0005bea98>] device_initial_probe+0x20/0x2c [ 1.262682] [<ffffffc0005bcf98>] bus_probe_device+0x34/0x8c [ 1.262692] [<ffffffc0005ba180>] device_add+0x4b4/0x540 [ 1.262703] [<ffffffc0007372b0>] of_device_add+0x3c/0x50 [ 1.262712] [<ffffffc000737c64>] of_platform_device_create_pdata+0x94/0xd0 [ 1.262719] [<ffffffc000737ec4>] of_platform_bus_create+0x158/0x310 [ 1.262727] [<ffffffc0007380e4>] of_platform_populate+0x68/0xa8 [ 1.262734] [<ffffffc000639a10>] dwc3_rockchip_of_populate+0x30/0x68 [ 1.262741] [<ffffffc000639b08>] dwc3_rockchip_otg_extcon_evt_work+0x70/0xf4 [ 1.262753] [<ffffffc000238e70>] process_one_work+0x198/0x2fc [ 1.262759] [<ffffffc000238910>] worker_thread+0x190/0x2b0 [ 1.262769] [<ffffffc00023d568>] kthread+0xe0/0xf0 [ 1.262779] [<ffffffc000203dd0>] ret_from_fork+0x10/0x40 [ 1.262787] Code: 910043fd aa0003f3 d503201f f9400260 (f9400008) [ 1.262804] ---[ end trace d6feebb2bf4439be ]--- === First I guess I should figure out just how crashy 9980.0.0 is: 1. Crash (above) 2. No crash (above) 3. No crash 4. No crash 5. No crash 6. No crash 7. No crash 8. No crash 9. No crash 10. No crash 11. No crash Hrm, so I guess it's not actually as likely as I first thought? I checked for crashes / warnings with: grep 'PC is' /dev/pstore/console-ramoops dmesg | grep "PC is" === I guess I need to automate this a little bit... Hopefully I can get something to reproduce with serial console... === Oooo, jackpot! slub_debug to the rescue: [ 3.337475] ============================================================================= [ 3.345646] BUG kmalloc-256 (Not tainted): Redzone overwritten [ 3.351471] ----------------------------------------------------------------------------- [ 3.351471] [ 3.361115] Disabling lock debugging due to kernel taint [ 3.366421] INFO: 0xffffffc0ebe88600-0xffffffc0ebe88607. First byte 0x0 instead of 0xcc [ 3.374422] INFO: Allocated in rockchip_sound_probe+0x64/0x278 age=7 cpu=5 pid=1 [ 3.381813] alloc_debug_processing+0x7c/0x108 [ 3.386253] ___slab_alloc+0x270/0x310 [ 3.389999] __kmalloc_track_caller+0x224/0x234 [ 3.394526] devm_kmalloc+0x34/0x80 [ 3.398010] rockchip_sound_probe+0x64/0x278 [ 3.402276] platform_drv_probe+0x5c/0xb0 [ 3.406281] driver_probe_device+0x224/0x264 [ 3.410546] __driver_attach+0x68/0x94 [ 3.414293] bus_for_each_dev+0x7c/0xc4 [ 3.418125] driver_attach+0x2c/0x38 [ 3.421697] bus_add_driver+0x128/0x1f0 [ 3.425528] driver_register+0x90/0xdc [ 3.429271] __platform_driver_register+0x4c/0x58 [ 3.433971] rockchip_sound_driver_init+0x18/0x20 [ 3.438671] do_one_initcall+0x10c/0x1cc [ 3.442590] kernel_init_freeable+0x1a4/0x23c [ 3.446943] INFO: Slab 0xffffffbdc3b02200 objects=25 used=5 fp=0xffffffc0ebe8b700 flags=0x4081 [ 3.455545] INFO: Object 0xffffffc0ebe88500 @offset=1280 fp=0xffffffc0ebe88780 [ 3.455545] [ 3.464236] Bytes b4 ffffffc0ebe884f0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ [ 3.473705] Object ffffffc0ebe88500: 00 fe e8 eb c0 ff ff ff 00 5f d7 ec c0 ff ff ff ........._...... [ 3.483001] Object ffffffc0ebe88510: 38 c4 5d 00 c0 ff ff ff 18 ee c5 00 c0 ff ff ff 8.]............. [ 3.492296] Object ffffffc0ebe88520: 60 00 00 00 00 00 00 00 19 f9 c9 00 c0 ff ff ff `............... [ 3.501591] Object ffffffc0ebe88530: 00 00 00 00 00 00 00 00 5b b9 c9 00 c0 ff ff ff ........[....... [ 3.510886] Object ffffffc0ebe88540: 00 00 00 00 00 00 00 00 e2 de c9 00 c0 ff ff ff ................ [ 3.520181] Object ffffffc0ebe88550: 00 00 00 00 00 00 00 00 50 f9 c9 00 c0 ff ff ff ........P....... [ 3.529476] Object ffffffc0ebe88560: 00 00 00 00 00 00 00 00 e9 de c9 00 c0 ff ff ff ................ [ 3.538772] Object ffffffc0ebe88570: 00 00 00 00 00 00 00 00 50 f9 c9 00 c0 ff ff ff ........P....... [ 3.548066] Object ffffffc0ebe88580: 00 00 00 00 00 00 00 00 46 f5 c9 00 c0 ff ff ff ........F....... [ 3.557362] Object ffffffc0ebe88590: 00 00 00 00 00 00 00 00 9d d7 c9 00 c0 ff ff ff ................ [ 3.566657] Object ffffffc0ebe885a0: 00 00 00 00 00 00 00 00 46 f5 c9 00 c0 ff ff ff ........F....... [ 3.575952] Object ffffffc0ebe885b0: 00 00 00 00 00 00 00 00 a1 d7 c9 00 c0 ff ff ff ................ [ 3.585247] Object ffffffc0ebe885c0: 00 00 00 00 00 00 00 00 2f ce c9 00 c0 ff ff ff ......../....... [ 3.594542] Object ffffffc0ebe885d0: 00 00 00 00 00 00 00 00 35 f9 c9 00 c0 ff ff ff ........5....... [ 3.603837] Object ffffffc0ebe885e0: 00 00 00 00 00 00 00 00 c4 ca c4 00 c0 ff ff ff ................ [ 3.613132] Object ffffffc0ebe885f0: 00 00 00 00 00 00 00 00 c1 e0 c9 00 c0 ff ff ff ................ [ 3.622427] Redzone ffffffc0ebe88600: 00 00 00 00 00 00 00 00 ........ [ 3.631116] Padding ffffffc0ebe88740: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ [ 3.640499] Padding ffffffc0ebe88750: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ [ 3.649881] Padding ffffffc0ebe88760: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ [ 3.659262] Padding ffffffc0ebe88770: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ [ 3.668647] CPU: 5 PID: 1 Comm: swapper/0 Tainted: G B 4.4.86 #49 --- I believe the allocation routes = devm_kzalloc(dev, sizeof(rockchip_routes), GFP_KERNEL); ...oh yeah. That changed recently and looks wrong... Looks like it broke in <https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/674847> === I'll post a patch shortly.
,
Sep 29 2017
Ouch, missed this in the review and it didn't blow up in manual testing :( Thanks for tracking this down and documenting your steps!
,
Sep 29 2017
OK, kevin should be much more robust once this lands: * https://chromium-review.googlesource.com/#/c/chromiumos/third_party/kernel/+/693319 FROMLIST: ASoC: rockchip: Allocate enough memory so we don't overflow routes ...note that the above change won't help any other random DUT reboot issues, but since you assigned this bug to me specifically for the kevin one I'm assuming that you wanted me to hijack this bug for kevin. Having a single big bug saying "the CQ fails sometimes" isn't super useful anyway since it'd be a catchall for every bug we ever will have... If you want to keep tracking down other random CQ failures, maybe it makes sense to start a bug for each different looking failure?
,
Sep 29 2017
Re#22, I want this bug to focus on the summarize of random reboot problem in CQ. Could file a separate bug for the kevin and mark it blocking this one? Thanks! Change the name back. Assign to next week's deputy to monitor this problem and its derivative bugs.
,
Sep 29 2017
Super-bug = good luck and adios!
,
Sep 29 2017
@23: OK, if you're sure. IMHO super bugs aren't super useful. Next time maybe better to file the sub-bug and assign it to me so I don't pollute your umbrella bug then. Given that the CL is already marked to go in (referencing this bug) and no merge is needed, I probably won't file a separate bug unless someone can come up with a good reason.
,
Sep 30 2017
Re #25, can you confirm whether the failures described in #0 all come from the same problem that you're trying to fix, or what part of the failures can be fixed by your CL? My original intent to file this bug is to question whether they happened due to a same bug or should belong to different bugs. If the reason is the same, we can re-name this bug directly to the real issue. If not, I want to find several proper assigners and create several sub bugs for them, and this bug is still used to make sure all failures have been weel dealt with, and it can be closed when all failures are fully represented by these sub bugs. Sorry for the trouble that there's some miscommunication here, I should follow this bug and file a new bug when the topic is narrowed down to 'kevin'.
,
Sep 30 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/7708447f54d518ddbb7da9c66cac9a1c6310cc4a commit 7708447f54d518ddbb7da9c66cac9a1c6310cc4a Author: Douglas Anderson <dianders@chromium.org> Date: Sat Sep 30 03:16:26 2017 FROMLIST: ASoC: rockchip: Allocate enough memory so we don't overflow routes In the recent commit d9f9c167edae ("ASoC: rockchip: Init dapm routes dynamically") we improperly allocated memory for the card->dapm_routes causing us to overflow the allocation on every boot. Oops. Let's allocate the correct amount of memory. We'll also add a check to make sure that we don't overrun memory even if we encounter some sort of weird device tree. BUG= chromium:769990 TEST=No more slub debug errors; audio still works Change-Id: Ibfacae048394247fb77f8f6b9395027318a7718e Fixes: d9f9c167edae ("ASoC: rockchip: Init dapm routes dynamically") Signed-off-by: Douglas Anderson <dianders@chromium.org> (am from https://patchwork.kernel.org/patch/9978763/) Reviewed-on: https://chromium-review.googlesource.com/693319 Reviewed-by: Matthias Kaehlcke <mka@chromium.org> Reviewed-by: Jeffy Chen <jeffy.chen@rock-chips.com> [modify] https://crrev.com/7708447f54d518ddbb7da9c66cac9a1c6310cc4a/sound/soc/rockchip/rk3399_gru_sound.c
,
Oct 2 2017
#26, the problem fixed in #27 is just for Kevin (part of "4." on #0), so the bigger problem isn't handled yet.
,
Oct 2 2017
26/28: I'm not CCed on this bug anymore, so if you have questions either CC me again temporarily or send me an email... ...it looks like uekawa already answered you, but basically the problem above touches code that's only executed on kevin and devices with the same SoC (bob, scarlet). The fix can't fix anyone else. --- > and this bug is still used to make sure all failures have been weel dealt with So you're going to keep this bug open until no DUTs ever crash?
,
Oct 27 2017
I buy #29. DUT reboot issues are pervasive and for bazillion reasons. This one is done. I opened three more in my deputy shift ;)
,
Jan 22 2018
,
Jan 23 2018
|
|||||||||||||||
►
Sign in to add a comment |
|||||||||||||||
Comment 1 by jrbarnette@chromium.org
, Sep 29 2017Components: Platform>ARC
Regarding this: > cheets_StartAndroid.stress: FAIL: Autotest client terminated unexpectedly: DUT rebooted during the test run. Look here: https://wmatrix.googleplex.com/failures/unfiltered?hide_missing=True&tests=cheets_StartAndroid.stress&releases=63 This test seems to be having problems since R63-9975.0.0. That's the canary that started at 19:00 on 9/25. Also, we can see problems on the branch: https://wmatrix.googleplex.com/failures/unfiltered?hide_missing=True&tests=cheets_StartAndroid.stress&releases=62 The problems started at about the same time. There's a good chance that there's a new bug that got committed to the tree round about Tuesday, 9/25 at the latest.