New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 769990 link

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Oct 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: ----

Blocked on:
issue 765515
issue 770031



Sign in to add a comment

tracking: CQ is broken by DUT reboot issues

Project Member Reported by xixuan@chromium.org, Sep 28 2017

Issue description

This week there're two similar failed CQ:
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/16420
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/16411

Their common points include:
1. More than 1 board failed, actually at least 3 boards failed.

2. Some tests failed due to DUT reboot in the test:
In kevin-paladin of master build 16411:
graphics_Idle: FAIL: Autotest client terminated unexpectedly: DUT rebooted during the test run.
In peppy-paladin of master build 16420:
logging_UserCrash: FAIL: Autotest client terminated unexpectedly: DUT rebooted during the test run.

3. Some provision failures:
In reef-uni-paladin of master build 16411:
RootfsUpdateError('Update failed with unexpected update status: UPDATE_STATUS_IDLE',)
In edgar-paladin of master build 16420 (The real error, not the raised error in builder link):
RootfsUpdateError("Failed to perform rootfs update: RootfsUpdateError('Update failed with unexpected update status: UPDATE_STATUS_IDLE',)

4. Cheet tests failed due to DUT reboot in the test:
In wizpig-paladin of master build 16411:
cheets_StartAndroid.stress: FAIL: Autotest client terminated unexpectedly: DUT rebooted during the test run.
In sentry-paladin of master build 16420:
cheets_StartAndroid.stress: FAIL: Autotest client terminated unexpectedly: DUT rebooted during the test run.


I doubted there's a bad CL, but the only common CL list for the 2 rounds are 
677132
677141
679211
679264
687976,
4 of which has been successfully submitted, and one of them (687976) is about waterfall config, make me hard to believe it breaks these boards. So there's a bug in ToT? or just coincidence?

Filed this bug for tracking and discussing. CC next week's deputy & secondary. 
 
Cc: chinyue@chromium.org mxt@chromium.org elijahtaylor@chromium.org uekawa@chromium.org
Components: Platform>ARC
Regarding this:

> cheets_StartAndroid.stress: FAIL: Autotest client terminated unexpectedly: DUT rebooted during the test run.

Look here:
    https://wmatrix.googleplex.com/failures/unfiltered?hide_missing=True&tests=cheets_StartAndroid.stress&releases=63

This test seems to be having problems since R63-9975.0.0.
That's the canary that started at 19:00 on 9/25.

Also, we can see problems on the branch:
    https://wmatrix.googleplex.com/failures/unfiltered?hide_missing=True&tests=cheets_StartAndroid.stress&releases=62

The problems started at about the same time.

There's a good chance that there's a new bug that got
committed to the tree round about Tuesday, 9/25 at the
latest.

Comment 2 by uekawa@google.com, Sep 29 2017

9975 did not have an android uprev or chrome uprev.
so the changes that can affect is chrome os changes only?

https://crosland.corp.google.com/log/9974.0.0..9975.0.0


Regarding these:

> In kevin-paladin of master build 16411:
> graphics_Idle: FAIL: Autotest client terminated unexpectedly: DUT rebooted during the test run.
> In peppy-paladin of master build 16420:
> logging_UserCrash: FAIL: Autotest client terminated unexpectedly: DUT rebooted during the test run.

The symptom is superficially similar to the cheets failures:
system crash during testing.  However, it's possible that the
problems are different.

So, the next actions:
  * Haul in a current sheriff.  Maybe also an ARC constable.
    Better still, haul them all in:
        https://youtu.be/RZnQxEFJE_c?t=15s
  * The sheriffs should take a look at a reasonable sampling of
    the failures, and figure out what crashed, at least to the
    level of identifying a stack trace.
  * There should be new bugs for all the distinct crashes found.
    Probably, all the bugs should be listed as Dev- and Beta-stoppers.

> 9975 did not have an android uprev or chrome uprev.
> so the changes that can affect is chrome os changes only?
>
> https://crosland.corp.google.com/log/9974.0.0..9975.0.0

That's a plausible blamelist, but note that I simplified the
description of the problem.  All 9975 failures were ABORTS.
That _could_ be caused by a problem in ARC introduced in the
build, or it _could_ be a problem in the infrastructure, or
it _could_ be a random Chrome OS that just happened to pick an
ARC test to be its victim.

All that means we need to look at the crashes to narrow down
the nature of the problem.

Comment 5 by uekawa@google.com, Sep 29 2017

Took a look at samus failure 
samus-release/R63-9982.0.0/bvt-perbuild/cheets_StartAndroid.stress.9

... logs tell me test failed with exit code 255 and failed to salvage files via rsync because ssh no longer connects to the machine.




https://wmatrix.googleplex.com/testrun/unfiltered?test_ids=564701257


09/27 23:53:22.606 DEBUG|          autotest:1277| AUTOTEST_STATUS::START	----	----	timestamp=1506581601	localtime=Sep 27 23:53:21	
09/27 23:53:22.608 INFO |        server_job:0214| START	----	----	timestamp=1506581601	localtime=Sep 27 23:53:21	
09/27 23:53:22.676 DEBUG|          autotest:1277| AUTOTEST_STATUS::	START	cheets_StartAndroid.stress	cheets_StartAndroid.stress	timestamp=1506581602	localtime=Sep 27 23:53:22	
09/27 23:53:22.677 INFO |        server_job:0214| 	START	cheets_StartAndroid.stress	cheets_StartAndroid.stress	timestamp=1506581602	localtime=Sep 27 23:53:22	
09/28 00:24:09.229 DEBUG|          autotest:0952| Result exit status is 255.


Comment 6 by uekawa@google.com, Sep 29 2017

From what I understand, cheets_StartAndroid.stress tries to log into chrome OS 10 times.


minnie log file on the same revision 9982 looks better but failing differently:
https://wmatrix.googleplex.com/testrun/unfiltered?test_ids=564631010


python2.7 crashed and dumped core. That probably caused the test failure.
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/145353456-chromeos-test/chromeos2-row6-rack8-host13/crashinfo.chromeos2-row6-rack8-host13



I can see log files for cheets_StartAndroid.stress itself and the test seems to have succeeded:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/145353456-chromeos-test/chromeos2-row6-rack8-host13/cheets_StartAndroid.stress





09/27 22:32:41.641 INFO |        arc_common:0037| Waiting for Android to boot completely.
09/27 22:32:41.642 DEBUG|             utils:0212| Running 'android-sh -c "getprop sys.boot_completed"'
09/27 22:32:43.677 DEBUG|             utils:0212| Running 'android-sh -c "getprop sys.boot_completed"'
09/27 22:32:45.710 DEBUG|             utils:0212| Running 'android-sh -c "getprop sys.boot_completed"'
09/27 22:32:47.760 DEBUG|             utils:0212| Running 'android-sh -c "getprop sys.boot_completed"'
09/27 22:32:49.798 DEBUG|             utils:0212| Running 'android-sh -c "getprop sys.boot_completed"'
09/27 22:32:52.182 DEBUG|             utils:0212| Running 'android-sh -c "getprop sys.boot_completed"'
09/27 22:32:54.512 DEBUG|             utils:0212| Running 'android-sh -c "getprop sys.boot_completed"'
09/27 22:32:56.883 DEBUG|             utils:0212| Running 'android-sh -c "getprop sys.boot_completed"'
09/27 22:32:59.262 DEBUG|             utils:0212| Running 'android-sh -c "getprop sys.boot_completed"'
09/27 22:32:59.860 INFO |        arc_common:0043| Android has booted completely.
09/27 22:33:01.864 DEBUG|          arc_util:0040| ARC is enabled in mode enabled
09/27 22:33:01.864 INFO |          arc_util:0104| Saving Android dumpstate.
09/27 22:33:19.167 DEBUG|   logging_manager:0627| Logging subprocess finished
09/27 22:33:19.170 DEBUG|   logging_manager:0627| Logging subprocess finished



Comment 7 by xixuan@chromium.org, Sep 29 2017

Cc: lhchavez@chromium.org
Owner: vbendeb@chromium.org
Due to it's possibly a bug in ToT and may raise more failures for following deputies:

cc arc++ constable @lhchavez to investigate the cheet failure.

assign to current sheriff @vbendeb to investigate others (maybe they're in the same root cause with cheet). 

Comment 8 by uekawa@google.com, Sep 29 2017

FYI, 
arc++ constable is uekawa@ (non-PST) and elijahtaylor@ (PST) this week.

Comment 9 by uekawa@google.com, Sep 29 2017

Blockedon: 770031

Comment 10 by uekawa@google.com, Sep 29 2017

I've tried running the test locally, it seems to generate a lot of crashdumps on each invocation.

$ test_that --board=kevin --no-quickmerge kevin cheets_StartAndroid.stress.1
 .
 .
23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/sslh_fork.20170929.092540.2170.dmp
23:24:00 INFO | autoserv| INFO	----	----	timestamp=1506659040	localtime=Sep 28 23:24:00	Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/sslh_fork.20170929.092540.2170.dmp
23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170929.092616.1558.dmp
23:24:00 INFO | autoserv| INFO	----	----	timestamp=1506659040	localtime=Sep 28 23:24:00	Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170929.092616.1558.dmp
23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/sslh_fork.20170925.160532.2217.dmp
23:24:00 INFO | autoserv| INFO	----	----	timestamp=1506659040	localtime=Sep 28 23:24:00	Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/sslh_fork.20170925.160532.2217.dmp
23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170925.160515.1547.dmp
23:24:00 INFO | autoserv| INFO	----	----	timestamp=1506659040	localtime=Sep 28 23:24:00	Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170925.160515.1547.dmp
23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170928.101033.1545.dmp
23:24:00 INFO | autoserv| INFO	----	----	timestamp=1506659040	localtime=Sep 28 23:24:00	Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170928.101033.1545.dmp
23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/sslh_fork.20170929.092639.2613.dmp
23:24:00 INFO | autoserv| INFO	----	----	timestamp=1506659040	localtime=Sep 28 23:24:00	Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/sslh_fork.20170929.092639.2613.dmp
23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170926.103025.1554.dmp
23:24:00 INFO | autoserv| INFO	----	----	timestamp=1506659040	localtime=Sep 28 23:24:00	Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170926.103025.1554.dmp
23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170929.092715.1533.dmp
23:24:00 INFO | autoserv| INFO	----	----	timestamp=1506659040	localtime=Sep 28 23:24:00	Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170929.092715.1533.dmp
23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170925.160608.1540.dmp
23:24:00 INFO | autoserv| INFO	----	----	timestamp=1506659040	localtime=Sep 28 23:24:00	Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170925.160608.1540.dmp
23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/sslh_fork.20170928.101054.2785.dmp
23:24:00 INFO | autoserv| INFO	----	----	timestamp=1506659040	localtime=Sep 28 23:24:00	Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/sslh_fork.20170928.101054.2785.dmp
23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/sslh_fork.20170926.102850.2639.dmp
23:24:00 INFO | autoserv| INFO	----	----	timestamp=1506659040	localtime=Sep 28 23:24:00	Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/sslh_fork.20170926.102850.2639.dmp
23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/sslh_fork.20170926.102949.2201.dmp
23:24:00 INFO | autoserv| INFO	----	----	timestamp=1506659040	localtime=Sep 28 23:24:00	Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/sslh_fork.20170926.102949.2201.dmp
23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170928.101101.1600.dmp
23:24:00 INFO | autoserv| INFO	----	----	timestamp=1506659040	localtime=Sep 28 23:24:00	Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170928.101101.1600.dmp
23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/sslh_fork.20170928.100956.2209.dmp
23:24:00 INFO | autoserv| INFO	----	----	timestamp=1506659040	localtime=Sep 28 23:24:00	Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/sslh_fork.20170928.100956.2209.dmp
23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170926.102931.1563.dmp
23:24:00 INFO | autoserv| INFO	----	----	timestamp=1506659040	localtime=Sep 28 23:24:00	Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/set_wifi_transmit_power.20170926.102931.1563.dmp
23:24:00 INFO | autoserv| Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/chrome.20170905.150319.1599.dmp
23:24:00 INFO | autoserv| INFO	----	----	timestamp=1506659040	localtime=Sep 28 23:24:00	Local crashdump exists: /tmp/test_that_results_Z7P3GS/results-1-cheets_StartAndroid.stress.1/sysinfo/var/spool/crash/chrome.20170905.150319.1599.dmp

Comment 11 by uekawa@google.com, Sep 29 2017

oops, I wasn't looking at the timestamps, local crashdumps were just previous old crash dumps.

so far the test isn't failing locally on my kevin. Not sure if this is just flakily failing 1/100 time.

Comment 12 by uekawa@google.com, Sep 29 2017

Blockedon: 765515
On login screen, I see lots of error messages on ui.log 
Hmm... the timestamp seems to be weird. 0929/003000 is PST, not local time.

localhost ui # date -u 
Fri Sep 29 07:30:32 UTC 2017
localhost ui # date
Fri Sep 29 16:30:43 JST 2017

and about 4 log lines are spit out per second, my ui log seems to have 10k lines.

[17008:17008:0929/002957.418666:ERROR:gl_image_native_pixmap.cc(240)] Failed to flush rendering

seems to be a bug for that crbug.com/765515


Comment 13 by uekawa@google.com, Sep 29 2017

I looked at another log

 kevin-release/R63-9980.0.0/bvt-perbuild/cheets_StartAndroid.stress.1

it seems to be failing at third "restart ui" and the rest of cheets_StartAndroid.stress.DEBUG are corrupted.
Hmm... 

pstore/console-ramoops contains:

[  342.638989] arc0: renamed from slave_android
[  342.745473] binder: 16538:16552 transaction failed 29189, size 0-0
[  342.939470] Unable to handle kernel NULL pointer dereference at virtual address 00000001
[  342.939490] pgd = ffffffc0b6a67000
[  342.939496] [00000001] *pgd=00000000bcec9003, *pud=00000000bcec9003, *pmd=0000000000000000
[  342.939512] Internal error: Oops: 96000006 [#1] PREEMPT SMP


[  342.943068] Call trace:
[  342.943080] [<ffffffc0003338bc>] kmem_cache_alloc+0xb8/0x23c
[  342.943091] [<ffffffc0003acb50>] __kernfs_new_node+0x5c/0xd8
[  342.943098] [<ffffffc0003acacc>] kernfs_new_node+0x44/0x6c
[  342.943106] [<ffffffc0003aed30>] __kernfs_create_file+0x50/0xd0
[  342.943114] [<ffffffc0003af57c>] sysfs_add_file_mode_ns+0x100/0x168
[  342.943122] [<ffffffc0003b06d0>] internal_create_group+0x120/0x28c
[  342.943129] [<ffffffc0003b05a4>] sysfs_create_group+0x28/0x34
[  342.943135] [<ffffffc000336160>] sysfs_slab_add+0x110/0x1a0
[  342.943142] [<ffffffc000335ffc>] __kmem_cache_create+0x2c8/0x31c
[  342.943150] [<ffffffc0003141ac>] kmem_cache_create+0x12c/0x1b0
[  342.943160] [<ffffffc0007fc380>] nf_conntrack_init_net+0x114/0x244
[  342.943168] [<ffffffc0007fc8f0>] nf_conntrack_pernet_init+0x20/0x16c
[  342.943176] [<ffffffc0007bae54>] ops_init+0xd0/0x120
[  342.943183] [<ffffffc0007ba580>] setup_net+0x90/0x12c
[  342.943189] [<ffffffc0007ba468>] copy_net_ns+0xa8/0x130
[  342.943197] [<ffffffc00023d760>] create_new_namespaces+0x118/0x190
[  342.943204] [<ffffffc00023d950>] unshare_nsproxy_namespaces+0x7c/0xa4
[  342.943214] [<ffffffc00021e374>] SyS_unshare+0x1a0/0x30c
[  342.943223] [<ffffffc000203e34>] el0_svc_naked+0x24/0x28
[  342.943230] Code: b98022aa b940196c 910003e9 9272c529 (f86a6ac8) 
[  342.943622] ---[ end trace 70c4ff6dce8c98ef ]---


so probably a kernel panic.

Owner: diand...@chromium.org
Doug,there seems to be a kernel panic happening on Kevin when recent builds are running ARC tests, can you please take a look.
Cc: briannorris@chromium.org cernekee@chromium.org
Oooo, our friend netfilter.  Added Kevin and Brian.  I'll take a quick look and see if I see something obvious.
Cc: mka@chromium.org
Actually, +mka too on the off chance this is somehow related to clang.  :-P

Comment 17 by mka@chromium.org, Sep 29 2017

Hard to tell if it is related to clang. kevin was switched to clang with 9967.0.0 (09/23).

Since this doesn't seem to be easily reproducible it might be necessary to revert the switch to clang to find out if we don't find other culprits.
At: nf_conntrack_init_net+0x114/0x244 

...in decimal (to match gdb): nf_conntrack_init_net+276/580

---

1. Let's get a little more details on the crash.  No links above, so I guess I gotta find it myself.  :(

2. Hrm, there were lots of references to the paladins failing above, but I guess this is just about "kevin-release/R63-9980.0.0/bvt-perbuild/cheets_StartAndroid.stress.1" now?  

3. Check wmatrix, go to "suites", go to "bvt-perbuild", go to "R63", click on "cheets_StartAndroid.stress", find the line for kevin/R63-9980.0.0, click "logs".  Should go to <https://wmatrix.googleplex.com/testrun_log/unfiltered?test_ids=564442454> which takes you to storage.

4. Search around and find <https://storage.cloud.google.com/chromeos-autotest-results/145280500-chromeos-test/chromeos6-row2-rack24-host1/crashinfo.chromeos6-row2-rack24-host1/pstore/pstore/console-ramoops>.


...so interestingly enough, the crash above isn't the first strange thing in the log.  First I see:

[    1.235702] WARNING: at /mnt/host/source/src/third_party/kernel/v4.4/fs/kernfs/dir.c:533

That's this warning:

> void kernfs_get(struct kernfs_node *kn)
> {
> 	if (kn) {
> 		WARN_ON(!atomic_read(&kn->count));
> 		atomic_inc(&kn->count);
> 	}
> }

That's a really unexpected warning.  Especially 1.2 seconds into boot (!).  Then more errors, like:

[  233.197309] kernfs_put: misc/cpu_dma_latency: released with incorrect active_ref 0
[  233.197338] ------------[ cut here ]------------
[  233.197343] WARNING: at /mnt/host/source/src/third_party/kernel/v4.4/fs/kernfs/dir.c:562
...
[  292.784386] WARNING: at /mnt/host/source/src/third_party/kernel/v4.4/fs/kernfs/dir.c:533


Then:

[  329.660611] ida_remove called for id=20089 which is not allocated.
[  329.660633] ------------[ cut here ]------------
[  329.660638] WARNING: at /mnt/host/source/src/third_party/kernel/v4.4/lib/idr.c:1051

...and finally the crash in question:

[  342.939470] Unable to handle kernel NULL pointer dereference at virtual address 00000001
[  342.939490] pgd = ffffffc0b6a67000
[  342.939496] [00000001] *pgd=00000000bcec9003, *pud=00000000bcec9003, *pmd=0000000000000000
[  342.939512] Internal error: Oops: 96000006 [#1] PREEMPT SMP
[  342.939520] Modules linked in: ip6t_REJECT nf_reject_ipv6 xt_TCPMSS ip6table_mangle veth uinput rfcomm cmac btusb btrtl btbcm btintel uvcvideo mwifiex_pcie videobuf2_vmalloc mwifiex xt_nat bluetooth bridge zram stp llc ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat fuse xt_mark cfg80211 ip6table_filter cdc_ether usbnet r8152 mii joydev
[  342.939629] CPU: 1 PID: 16586 Comm: mount-passthrou Tainted: G        W       4.4.86-11747-gb3bf69ebcfe9 #1
[  342.939635] Hardware name: Google Kevin (DT)
[  342.939641] task: ffffffc0b82fe800 ti: ffffffc0db6a0000 task.ti: ffffffc0db6a0000
[  342.939655] PC is at kmem_cache_alloc+0xb8/0x23c
[  342.939663] LR is at kmem_cache_alloc+0x30/0x23c
[  342.939669] pc : [<ffffffc0003338bc>] lr : [<ffffffc000333834>] pstate: 60000145
[  342.939673] sp : ffffffc0db6a39f0
[  342.939679] x29: ffffffc0db6a3a20 x28: 00000000000011b4 
[  342.939688] x27: 000000000000ee4b x26: ffffffc0010befe8 
[  342.939697] x25: 00000000024000c0 x24: ffffffc0011c3000 
[  342.939707] x23: 000000000000a251 x22: 0000000000000001 
[  342.939716] x21: ffffffc0eef08800 x20: 00000000024080c0 
[  342.939726] x19: ffffffc0003acb50 x18: 0000000000000007 
[  342.939735] x17: 000000000000001f x16: 0000000000000001 
[  342.939745] x15: 0000000000000680 x14: 0000000000000000 
[  342.939754] x13: 0000000000486941 x12: 0000000000000000 
[  342.939764] x11: ffffffc0db6a0000 x10: 0000000000000000 
[  342.939773] x9 : ffffffc0db6a0000 x8 : ffffffc0db6a39f0 
[  342.939782] x7 : 0000000000000000 x6 : 0000000000000000 
[  342.939791] x5 : ffffffc0010bf340 x4 : ffffffc000a1bb48 
[  342.939800] x3 : 0000000000000002 x2 : 0000000000000000 
[  342.939809] x1 : 0000000000000167 x0 : ffffffc0eef08800 

---

Something is weird.  My gut instinct would be bad hardware, but I'll keep digging.
OK, I don't think it's bad hardware

===

* When I build the kernel myself functions seem to be a different size (?), so grabbing official symbols for 9980.0.0 on goldeneye.  Basically you want "debug-kevin.tgz" in the artifacts directory.  It's big.

* Get the kernel symbols out with: 

  tar xvf ~/Downloads/dev-channel%2Fkevin%2F9980.0.0%2Fdebug-kevin.tgz  debug/boot/vmlinux

* Start gdb: aarch64-cros-linux-gnu-gdb /b/tip/tmp/debug/boot/vmlinux

* Now:

(gdb) set pagination off
(gdb) disass /s kmem_cache_alloc
Dump of assembler code for function kmem_cache_alloc:
/mnt/host/source/src/third_party/kernel/v4.4/mm/slub.c:
2522    {
   0xffffffc000333804 <+0>: stp x24, x23, [sp,#-64]!
   0xffffffc000333808 <+4>: stp x22, x21, [sp,#16]
   0xffffffc00033380c <+8>: stp x20, x19, [sp,#32]
...
...
/mnt/host/source/src/third_party/kernel/v4.4/mm/slub.c:
247             return *(void **)(object + s->offset);
   0xffffffc0003338bc <+184>:   ldr     x8, [x22,x10]


* Note that 184 is 0xb8.  Also note that x22 + x10 = 0x1 which matches where the illegal access was.  Not sure that helps us much, though.  :(

===

I picked the same host and picked a random other test (started at cautotest/, went to "View Host", picked "chromeos6-row2-rack24-host1", picked something random and looked at its logs.

I ended up going to:
  https://storage.cloud.google.com/chromeos-autotest-results/145696267-chromeos-test/chromeos6-row2-rack24-host1/sysinfo/var/log/messages

...and I found in the logs more failures that look similar to the weird "dir" ones above.  :-(

2017-09-29T05:50:05.908291-07:00 WARNING kernel: [    2.235854] WARNING: at /mnt/host/source/src/third_party/kernel/v4.4/fs/kernfs/dir.c:533
...
2017-09-29T05:50:05.916280-07:00 WARNING kernel: [    2.243532] kernfs_put: power/control: released with incorrect active_ref 0
2017-09-29T05:50:05.916307-07:00 WARNING kernel: [    2.243550] ------------[ cut here ]------------
2017-09-29T05:50:05.916310-07:00 WARNING kernel: [    2.243555] WARNING: at /mnt/host/source/src/third_party/kernel/v4.4/fs/kernfs/dir.c:562

===

This is really weird...I'll try grabbing an official build and see if I can repro that way on my hardware.  I can also dig through other hardware.  No idea if the "dir" warning is somehow the root cause of the other but it's suspicious...

OK, I put official 9980.0.0 on my kevin and installed it to emmc.  I then rebooted.  I noticed that it rebooted twice and found in console-ramoops more mysterious stuff:

[    0.894320] GPT: Use GNU Parted to correct GPT errors.
[    0.894358]  mmcblk1: p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12
[    0.998757] mmc0: MAN_BKOPS_EN bit is not set
[    1.013091] mmc0: new HS400 Enhanced strobe MMC card at address 0001
[    1.013872] mmcblk0: mmc0:0001 500073 29.1 GiB 
[    1.014367] mmcblk0boot0: mmc0:0001 500073 partition 1 4.00 MiB
[    1.014843] mmcblk0boot1: mmc0:0001 500073 partition 2 4.00 MiB
[    1.015250] mmcblk0rpmb: mmc0:0001 500073 partition 3 4.00 MiB
[    1.019811]  mmcblk0: p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12
[    1.259699] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[    1.259733] pgd = ffffffc001210000
[    1.259739] [00000000] *pgd=0000000000221003, *pud=0000000000221003, *pmd=0000000000222003, *pte=00e80000fee00707
[    1.259760] Internal error: Oops: 96000005 [#1] PREEMPT SMP
[    1.259768] Modules linked in:
[    1.259782] CPU: 0 PID: 4 Comm: kworker/0:0 Not tainted 4.4.86-11747-gb3bf69ebcfe9 #1
[    1.259787] Hardware name: Google Kevin (DT)
[    1.259806] Workqueue: events dwc3_rockchip_otg_extcon_evt_work
[    1.259814] task: ffffffc0eefea700 ti: ffffffc0ef004000 task.ti: ffffffc0ef004000
[    1.259827] PC is at reset_control_deassert+0x18/0x40
[    1.259838] LR is at tcphy_phy_init+0x280/0x334
[    1.259844] pc : [<ffffffc000517de4>] lr : [<ffffffc0004c7500>] pstate: 60000045
[    1.259850] sp : ffffffc0ef007820
[    1.259855] x29: ffffffc0ef007830 x28: 0000000000000001 
[    1.259865] x27: 0000000000000001 x26: ffffffc0ef1c4640 
[    1.259875] x25: ffffffc0ef180c10 x24: ffffffc0ef1c4628 
[    1.259884] x23: 000000005533194a x22: 0000000000000002 
[    1.259894] x21: 0000000000000002 x20: 0000000000000002 
[    1.259903] x19: ffffffc0ed73c000 x18: 0000000000000000 
[    1.259913] x17: 0000000000000010 x16: ffffff80002fefff 
[    1.259923] x15: ffffff80002ff000 x14: 0000000000011c8c 
[    1.259932] x13: 00000000000113a0 x12: 0000000000010c0c 
[    1.259942] x11: 0000000000008fff x10: 000000000000c104 
[    1.259951] x9 : 0000000000030020 x8 : ffffff80040b0020 
[    1.259961] x7 : 0000000000000000 x6 : 000000000000003f 
[    1.259971] x5 : 0000000000000000 x4 : 0000000000000004 
[    1.259980] x3 : ffffffc0eeeab004 x2 : 0000000000000001 
[    1.259990] x1 : 0000000000000001 x0 : 0000000000000000 
...
...
[    1.262572] Call trace:
[    1.262580] [<ffffffc000517de4>] reset_control_deassert+0x18/0x40
[    1.262588] [<ffffffc0004c7500>] tcphy_phy_init+0x280/0x334
[    1.262596] [<ffffffc0004c7be4>] rockchip_usb3_phy_power_on+0x70/0x13c
[    1.262603] [<ffffffc0004c327c>] phy_power_on+0xa8/0xe0
[    1.262614] [<ffffffc0006372ec>] dwc3_core_init+0x508/0x838
[    1.262622] [<ffffffc000636b94>] dwc3_probe+0x9c0/0xb50
[    1.262632] [<ffffffc0005c0c9c>] platform_drv_probe+0x5c/0xb0
[    1.262640] [<ffffffc0005be894>] driver_probe_device+0x224/0x264
[    1.262648] [<ffffffc0005bef64>] __device_attach_driver+0x84/0x9c
[    1.262657] [<ffffffc0005bcd58>] bus_for_each_drv+0x80/0xc8
[    1.262663] [<ffffffc0005be9f4>] __device_attach+0xa4/0x128
[    1.262676] [<ffffffc0005bea98>] device_initial_probe+0x20/0x2c
[    1.262682] [<ffffffc0005bcf98>] bus_probe_device+0x34/0x8c
[    1.262692] [<ffffffc0005ba180>] device_add+0x4b4/0x540
[    1.262703] [<ffffffc0007372b0>] of_device_add+0x3c/0x50
[    1.262712] [<ffffffc000737c64>] of_platform_device_create_pdata+0x94/0xd0
[    1.262719] [<ffffffc000737ec4>] of_platform_bus_create+0x158/0x310
[    1.262727] [<ffffffc0007380e4>] of_platform_populate+0x68/0xa8
[    1.262734] [<ffffffc000639a10>] dwc3_rockchip_of_populate+0x30/0x68
[    1.262741] [<ffffffc000639b08>] dwc3_rockchip_otg_extcon_evt_work+0x70/0xf4
[    1.262753] [<ffffffc000238e70>] process_one_work+0x198/0x2fc
[    1.262759] [<ffffffc000238910>] worker_thread+0x190/0x2b0
[    1.262769] [<ffffffc00023d568>] kthread+0xe0/0xf0
[    1.262779] [<ffffffc000203dd0>] ret_from_fork+0x10/0x40
[    1.262787] Code: 910043fd aa0003f3 d503201f f9400260 (f9400008) 
[    1.262804] ---[ end trace d6feebb2bf4439be ]---

===

First I guess I should figure out just how crashy 9980.0.0 is:

1.  Crash (above)
2.  No crash (above)
3.  No crash
4.  No crash
5.  No crash
6.  No crash
7.  No crash
8.  No crash
9.  No crash
10. No crash
11. No crash

Hrm, so I guess it's not actually as likely as I first thought?  I checked for crashes / warnings with:
  grep 'PC is' /dev/pstore/console-ramoops
  dmesg | grep "PC is"

===

I guess I need to automate this a little bit...  Hopefully I can get something to reproduce with serial console...

===

Oooo, jackpot!  slub_debug to the rescue:

[    3.337475] =============================================================================
[    3.345646] BUG kmalloc-256 (Not tainted): Redzone overwritten
[    3.351471] -----------------------------------------------------------------------------
[    3.351471]
[    3.361115] Disabling lock debugging due to kernel taint
[    3.366421] INFO: 0xffffffc0ebe88600-0xffffffc0ebe88607. First byte 0x0 instead of 0xcc
[    3.374422] INFO: Allocated in rockchip_sound_probe+0x64/0x278 age=7 cpu=5 pid=1
[    3.381813]  alloc_debug_processing+0x7c/0x108
[    3.386253]  ___slab_alloc+0x270/0x310
[    3.389999]  __kmalloc_track_caller+0x224/0x234
[    3.394526]  devm_kmalloc+0x34/0x80
[    3.398010]  rockchip_sound_probe+0x64/0x278
[    3.402276]  platform_drv_probe+0x5c/0xb0
[    3.406281]  driver_probe_device+0x224/0x264
[    3.410546]  __driver_attach+0x68/0x94
[    3.414293]  bus_for_each_dev+0x7c/0xc4
[    3.418125]  driver_attach+0x2c/0x38
[    3.421697]  bus_add_driver+0x128/0x1f0
[    3.425528]  driver_register+0x90/0xdc
[    3.429271]  __platform_driver_register+0x4c/0x58
[    3.433971]  rockchip_sound_driver_init+0x18/0x20
[    3.438671]  do_one_initcall+0x10c/0x1cc
[    3.442590]  kernel_init_freeable+0x1a4/0x23c
[    3.446943] INFO: Slab 0xffffffbdc3b02200 objects=25 used=5 fp=0xffffffc0ebe8b700 flags=0x4081
[    3.455545] INFO: Object 0xffffffc0ebe88500 @offset=1280 fp=0xffffffc0ebe88780
[    3.455545]
[    3.464236] Bytes b4 ffffffc0ebe884f0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[    3.473705] Object ffffffc0ebe88500: 00 fe e8 eb c0 ff ff ff 00 5f d7 ec c0 ff ff ff  ........._......
[    3.483001] Object ffffffc0ebe88510: 38 c4 5d 00 c0 ff ff ff 18 ee c5 00 c0 ff ff ff  8.].............
[    3.492296] Object ffffffc0ebe88520: 60 00 00 00 00 00 00 00 19 f9 c9 00 c0 ff ff ff  `...............
[    3.501591] Object ffffffc0ebe88530: 00 00 00 00 00 00 00 00 5b b9 c9 00 c0 ff ff ff  ........[.......
[    3.510886] Object ffffffc0ebe88540: 00 00 00 00 00 00 00 00 e2 de c9 00 c0 ff ff ff  ................
[    3.520181] Object ffffffc0ebe88550: 00 00 00 00 00 00 00 00 50 f9 c9 00 c0 ff ff ff  ........P.......
[    3.529476] Object ffffffc0ebe88560: 00 00 00 00 00 00 00 00 e9 de c9 00 c0 ff ff ff  ................
[    3.538772] Object ffffffc0ebe88570: 00 00 00 00 00 00 00 00 50 f9 c9 00 c0 ff ff ff  ........P.......
[    3.548066] Object ffffffc0ebe88580: 00 00 00 00 00 00 00 00 46 f5 c9 00 c0 ff ff ff  ........F.......
[    3.557362] Object ffffffc0ebe88590: 00 00 00 00 00 00 00 00 9d d7 c9 00 c0 ff ff ff  ................
[    3.566657] Object ffffffc0ebe885a0: 00 00 00 00 00 00 00 00 46 f5 c9 00 c0 ff ff ff  ........F.......
[    3.575952] Object ffffffc0ebe885b0: 00 00 00 00 00 00 00 00 a1 d7 c9 00 c0 ff ff ff  ................
[    3.585247] Object ffffffc0ebe885c0: 00 00 00 00 00 00 00 00 2f ce c9 00 c0 ff ff ff  ......../.......
[    3.594542] Object ffffffc0ebe885d0: 00 00 00 00 00 00 00 00 35 f9 c9 00 c0 ff ff ff  ........5.......
[    3.603837] Object ffffffc0ebe885e0: 00 00 00 00 00 00 00 00 c4 ca c4 00 c0 ff ff ff  ................
[    3.613132] Object ffffffc0ebe885f0: 00 00 00 00 00 00 00 00 c1 e0 c9 00 c0 ff ff ff  ................
[    3.622427] Redzone ffffffc0ebe88600: 00 00 00 00 00 00 00 00                          ........
[    3.631116] Padding ffffffc0ebe88740: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[    3.640499] Padding ffffffc0ebe88750: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[    3.649881] Padding ffffffc0ebe88760: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[    3.659262] Padding ffffffc0ebe88770: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[    3.668647] CPU: 5 PID: 1 Comm: swapper/0 Tainted: G    B           4.4.86 #49

---

I believe the allocation 

	routes = devm_kzalloc(dev, sizeof(rockchip_routes),
			      GFP_KERNEL);

...oh yeah.  That changed recently and looks wrong...

Looks like it broke in <https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/674847>

===

I'll post a patch shortly.




Comment 21 by mka@chromium.org, Sep 29 2017

Ouch, missed this in the review and it didn't blow up in manual testing :(

Thanks for tracking this down and documenting your steps!
Status: Started (was: Untriaged)
Summary: kevin: sometimes get mystery random failures / crashes, often at bootup (was: tracking: CQ is broken by DUT reboot issues)
OK, kevin should be much more robust once this lands:

* https://chromium-review.googlesource.com/#/c/chromiumos/third_party/kernel/+/693319 
  FROMLIST: ASoC: rockchip: Allocate enough memory so we don't overflow routes

...note that the above change won't help any other random DUT reboot issues, but since you assigned this bug to me specifically for the kevin one I'm assuming that you wanted me to hijack this bug for kevin.  Having a single big bug saying "the CQ fails sometimes" isn't super useful anyway since it'd be a catchall for every bug we ever will have...

If you want to keep tracking down other random CQ failures, maybe it makes sense to start a bug for each different looking failure?
Owner: pho...@chromium.org
Summary: tracking: CQ is broken by DUT reboot issues (was: kevin: sometimes get mystery random failures / crashes, often at bootup)
Re#22, I want this bug to focus on the summarize of random reboot problem in CQ. Could file a separate bug for the kevin and mark it blocking this one? Thanks!

Change the name back. Assign to next week's deputy to monitor this problem and its derivative bugs.
Cc: -briannorris@chromium.org
Super-bug = good luck and adios!
Cc: -mka@chromium.org -cernekee@chromium.org
@23: OK, if you're sure.  IMHO super bugs aren't super useful.  Next time maybe better to file the sub-bug and assign it to me so I don't pollute your umbrella bug then.

Given that the CL is already marked to go in (referencing this bug) and no merge is needed, I probably won't file a separate bug unless someone can come up with a good reason.
Re #25, can you confirm whether the failures described in #0 all come from the same problem that you're trying to fix, or what part of the failures can be fixed by your CL?

My original intent to file this bug is to question whether they happened due to a same bug or should belong to different bugs.
If the reason is the same, we can re-name this bug directly to the real issue.
If not, I want to find several proper assigners and create several sub bugs for them, and this bug is still used to make sure all failures have been weel dealt with, and it can be closed when all failures are fully represented by these sub bugs.

Sorry for the trouble that there's some miscommunication here, I should follow this bug and file a new bug when the topic is narrowed down to 'kevin'.
Project Member

Comment 27 by bugdroid1@chromium.org, Sep 30 2017

Labels: merge-merged-chromeos-4.4
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/7708447f54d518ddbb7da9c66cac9a1c6310cc4a

commit 7708447f54d518ddbb7da9c66cac9a1c6310cc4a
Author: Douglas Anderson <dianders@chromium.org>
Date: Sat Sep 30 03:16:26 2017

FROMLIST: ASoC: rockchip: Allocate enough memory so we don't overflow routes

In the recent commit d9f9c167edae ("ASoC: rockchip: Init dapm routes
dynamically") we improperly allocated memory for the card->dapm_routes
causing us to overflow the allocation on every boot.  Oops.

Let's allocate the correct amount of memory.  We'll also add a check
to make sure that we don't overrun memory even if we encounter some
sort of weird device tree.

BUG= chromium:769990 
TEST=No more slub debug errors; audio still works

Change-Id: Ibfacae048394247fb77f8f6b9395027318a7718e
Fixes: d9f9c167edae ("ASoC: rockchip: Init dapm routes dynamically")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
(am from https://patchwork.kernel.org/patch/9978763/)
Reviewed-on: https://chromium-review.googlesource.com/693319
Reviewed-by: Matthias Kaehlcke <mka@chromium.org>
Reviewed-by: Jeffy Chen <jeffy.chen@rock-chips.com>

[modify] https://crrev.com/7708447f54d518ddbb7da9c66cac9a1c6310cc4a/sound/soc/rockchip/rk3399_gru_sound.c

Comment 28 by uekawa@google.com, Oct 2 2017

#26, the problem fixed in #27 is just for Kevin (part of "4." on #0), so the bigger problem isn't handled yet.

26/28: I'm not CCed on this bug anymore, so if you have questions either CC me again temporarily or send me an email...

...it looks like uekawa already answered you, but basically the problem above touches code that's only executed on kevin and devices with the same SoC (bob, scarlet).  The fix can't fix anyone else.

---

> and this bug is still used to make sure all failures have been weel dealt with

So you're going to keep this bug open until no DUTs ever crash?

Status: Fixed (was: Started)
I buy #29.
DUT reboot issues are pervasive and for bazillion reasons. This one is done. I opened three more in my deputy shift ;)

Comment 31 by dchan@chromium.org, Jan 22 2018

Status: Archived (was: Fixed)

Comment 32 by dchan@chromium.org, Jan 23 2018

Status: Fixed (was: Archived)

Sign in to add a comment