Issue 686830

Issue metadata

Status: Archived
Owner:
Closed: Feb 2017
Cc:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



stressapptest can crash when trying to print Hardware Errors

Project Member Reported by diand...@chromium.org, Jan 30 2017

Issue description

Running:

stressapptest -M 1500 -s 100000

...on hardware with memory errors crashes stressapptest rather than printing the errors.

---

tl;dr: We need this patch to stressapptest:

$ diff -Naur os.cc.bak os.cc
--- os.cc.bak   2017-01-30 10:33:56.570855707 -0800
+++ os.cc       2017-01-30 10:34:34.947362047 -0800
@@ -63,6 +63,7 @@
   dynamic_mapped_shmem_ = false;
   mmapped_allocation_ = false;
   shmid_ = 0;
+  channels_ = NULL;
 
   time_initialized_ = 0;

---

An example crash was:

[168032.493255] stressapptest[21400]: unhandled level 2 translation fault (11) at 0xa0a24ff5, esr 0x92000006
[168032.503417] pgd = ffffffc09c8f6000
[168032.506963] [a0a24ff5] *pgd=00000000792c3003, *pud=00000000792c3003, *pmd=0000000000000000
[168032.515534]
[168032.517221] CPU: 2 PID: 21400 Comm: stressapptest Tainted: G        W       4.4.21 #501
[168032.525331] Hardware name: Google Kevin (DT)
[168032.529770] task: ffffffc0ca223800 ti: ffffffc0785d8000 task.ti: ffffffc0785d8000
[168032.537383] PC is at 0xab0012b0
[168032.540671] LR is at 0x33333333
[168032.543935] pc : [<00000000ab0012b0>] lr : [<0000000033333333>] pstate: 200d0030
[168032.551490] sp : 00000000f3cb8d58
[168032.554948] x12: 0000000055555555
[168032.558515] x11: 0000000001010101 x10: 00000000010f0f0f
[168032.564001] x9 : 00000000f3cb8de0 x8 : 0000000000000100
[168032.569519] x7 : 00000000f3cb8d78 x6 : 0000000000000000
[168032.575061] x5 : 0000000000000000 x4 : 0000000006060603
[168032.580550] x3 : 0000000000000000 x2 : 000000000002ac6c
[168032.586036] x1 : 00000000a0a24ff5 x0 : 0000000010a228f8
[168032.591529]
Segmentation fault (core dumped)

---

My debugging was:

As far as I can tell, it always crashes dereferencing r1, and r1 is always 0xa0a24ff5 (shown as x1 in the register dump above, and also the faulting address).

I downloaded the stressapptest.debug symbols from the server, then combined:
  eu-unstrip /b/tip/tmp/stressapptest /b/tip/tmp/9202_symb/stressapptest.debug

I then objdumped:
  armv7a-cros-linux-gnueabi-objdump -drSF /b/tip/tmp/9202_symb/stressapptest.debug | less

I then looked for an instruction whose address ends in 2b0 (matching the low bits of the crashing PC, 0xab0012b0) and that dereferences r1.  That gave me one hit:

---

00004238 <_ZN7OsLayer8FindDimmEyPci> (File Offset: 0x4238):
    4238:       e92d 4ff0       stmdb   sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
    423c:       af03            add     r7, sp, #12
    423e:       b085            sub     sp, #20
    4240:       494d            ldr     r1, [pc, #308]  ; (4378 <_ZN7OsLayer8FindDimmEyPci+0x140> (File Offset: 0x4378))
    4242:       4479            add     r1, pc
    4244:       6809            ldr     r1, [r1, #0]
    4246:       6809            ldr     r1, [r1, #0]
    4248:       9104            str     r1, [sp, #16]
    424a:       6bc1            ldr     r1, [r0, #60]   ; 0x3c
    424c:       e9d7 9806       ldrd    r9, r8, [r7, #24]
    4250:       2900            cmp     r1, #0
    4252:       d063            beq.n   431c <_ZN7OsLayer8FindDimmEyPci+0xe4> (File Offset: 0x431c)
    4254:       e9d0 6510       ldrd    r6, r5, [r0, #64]       ; 0x40
    4258:       f04f 3c55       mov.w   ip, #1431655765 ; 0x55555555
    425c:       f04f 3e33       mov.w   lr, #858993459  ; 0x33333333
    4260:       f640 7a0f       movw    sl, #3855       ; 0xf0f
    4264:       f04f 3b01       mov.w   fp, #16843009   ; 0x1010101
    4268:       6c80            ldr     r0, [r0, #72]   ; 0x48
    426a:       f2c0 1a0f       movt    sl, #271        ; 0x10f
    426e:       4016            ands    r6, r2
    4270:       401d            ands    r5, r3
    4272:       ea0c 0456       and.w   r4, ip, r6, lsr #1
    4276:       1b36            subs    r6, r6, r4
    4278:       ea0e 0496       and.w   r4, lr, r6, lsr #2
    427c:       f026 36cc       bic.w   r6, r6, #3435973836     ; 0xcccccccc
    4280:       4434            add     r4, r6
    4282:       ea0c 0655       and.w   r6, ip, r5, lsr #1
    4286:       1bae            subs    r6, r5, r6
    4288:       eb04 1414       add.w   r4, r4, r4, lsr #4
    428c:       ea0e 0596       and.w   r5, lr, r6, lsr #2
    4290:       f026 36cc       bic.w   r6, r6, #3435973836     ; 0xcccccccc
    4294:       ea04 040a       and.w   r4, r4, sl
    4298:       442e            add     r6, r5
    429a:       fb04 f40b       mul.w   r4, r4, fp
    429e:       eb06 1616       add.w   r6, r6, r6, lsr #4
    42a2:       ea06 060a       and.w   r6, r6, sl
    42a6:       fb06 f60b       mul.w   r6, r6, fp
    42aa:       4066            eors    r6, r4
    42ac:       f3c6 6600       ubfx    r6, r6, #24, #1
      operator[](size_type __n) _GLIBCXX_NOEXCEPT
      {
#if __google_stl_debug_vector
        _M_range_check(__n);
#endif
        return *(this->_M_impl._M_start + __n);
    42b0:       6809            ldr     r1, [r1, #0]
    42b2:       ea46 0646       orr.w   r6, r6, r6, lsl #1
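
The search for an address ending in 2b0 works because a loaded binary is mapped on page boundaries, so the low bits of the runtime PC match the low bits of the offset in the objdump listing. A minimal sketch of that check, assuming 4 KiB pages (the page size is an assumption, and PageOffset and CouldMatch are hypothetical helper names, not part of stressapptest):

```cpp
#include <cassert>
#include <stdint.h>

// Low 12 bits of an address: its offset within a 4 KiB page (assumed size).
inline uint32_t PageOffset(uint32_t address) {
  return address & 0xfffu;
}

// True when a runtime PC could correspond to the instruction at file_offset,
// assuming the loader preserved page alignment when mapping the binary.
inline bool CouldMatch(uint32_t runtime_pc, uint32_t file_offset) {
  return PageOffset(runtime_pc) == PageOffset(file_offset);
}
```

With the values from this report, CouldMatch(0xab0012b0, 0x42b0) holds, while the function entry at 0x4238 does not match. The check only narrows the candidates; the r1 dereference is what singles out the instruction.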

---

I ran:
   armv7a-cros-linux-gnueabi-gdb /b/tip/tmp/9202_symb/stressapptest.debug

...and gdb handles this better: "disass /s 0x4238" interleaves the source lines and shows that we are in "OsLayer::FindDimm", more specifically:

279       uint32 high = static_cast<uint32>((addr & channel_hash_) >> 32);
280       vector<string>& channel = (*channels_)[
281           __builtin_parity(high) ^ __builtin_parity(low)];
   0x00004272 <+58>:    and.w   r4, r12, r6, lsr #1
   0x00004276 <+62>:    subs    r6, r6, r4
   0x00004278 <+64>:    and.w   r4, lr, r6, lsr #2
   0x0000427c <+68>:    bic.w   r6, r6, #3435973836     ; 0xcccccccc
   0x00004280 <+72>:    add     r4, r6
   0x00004282 <+74>:    and.w   r6, r12, r5, lsr #1
   0x00004286 <+78>:    subs    r6, r5, r6
   0x00004288 <+80>:    add.w   r4, r4, r4, lsr #4
   0x0000428c <+84>:    and.w   r5, lr, r6, lsr #2
   0x00004290 <+88>:    bic.w   r6, r6, #3435973836     ; 0xcccccccc
   0x00004294 <+92>:    and.w   r4, r4, r10
   0x00004298 <+96>:    add     r6, r5
   0x0000429a <+98>:    mul.w   r4, r4, r11
   0x0000429e <+102>:   add.w   r6, r6, r6, lsr #4
   0x000042a2 <+106>:   and.w   r6, r6, r10
   0x000042a6 <+110>:   mul.w   r6, r6, r11
   0x000042aa <+114>:   eors    r6, r4
   0x000042ac <+116>:   ubfx    r6, r6, #24, #1

/usr/bin/../lib/gcc/armv7a-cros-linux-gnueabi/4.9.x/include/g++-v4/bits/stl_vector.h:
866             return *(this->_M_impl._M_start + __n);
   0x000042b0 <+120>:   ldr     r1, [r1, #0]
   0x000042b2 <+122>:   orr.w   r6, r6, r6, lsl #1

728             return size_type(this->_M_impl._M_finish - this->_M_impl._M_start);
   0x000042b6 <+126>:   ldr.w   r4, [r1, r6, lsl #2]
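
The long run of and/sub/add/mul instructions above is the compiler's expansion of the two __builtin_parity() calls; the vector index is just the parity of the address bits selected by channel_hash_. A minimal sketch of that index computation, assuming "low" is the lower 32 bits of the masked address (that line is not shown in the listing) and using ChannelIndex as a hypothetical name:

```cpp
#include <cassert>
#include <stdint.h>

// Returns 0 or 1: the parity of the address bits selected by channel_hash.
// __builtin_parity is a GCC/Clang builtin, as used in the source shown above.
inline int ChannelIndex(uint64_t addr, uint64_t channel_hash) {
  uint64_t masked = addr & channel_hash;
  uint32_t low = static_cast<uint32_t>(masked);         // assumed definition
  uint32_t high = static_cast<uint32_t>(masked >> 32);  // as on source line 279
  return __builtin_parity(high) ^ __builtin_parity(low);
}
```

The index itself is always valid (0 or 1), so the fault has to come from the pointer being indexed, not from the arithmetic.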

---

So if I understand correctly, we are trying to dereference "channels_"?  I think that will be uninitialized if SetDramMappingParams() hasn't been called.  My C++ is rusty, but I _think_ that there's no guarantee that uninitialized member variables are 0, right?

...and it looks as if SetDramMappingParams() is only called if:

  if (channels_.size() > 0) {
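
On the C++ question: a pointer member that a user-written constructor never assigns is left with an indeterminate value, so it is not guaranteed to be 0. A minimal sketch of the pattern and the fix, using a hypothetical DimmLookup class as a stand-in for OsLayer (the class, SetChannels and FindChannel are illustrative names, not stressapptest's API):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for OsLayer; only the members relevant to the crash.
class DimmLookup {
 public:
  // The one-line fix: give the pointer a known value at construction.
  DimmLookup() : channels_(NULL) {}

  // Plays the role of SetDramMappingParams(): it may never be called.
  void SetChannels(std::vector<std::vector<std::string> >* channels) {
    assert(channels != NULL);
    channels_ = channels;
  }

  // Returns -1 when no channel map was configured or the index is out of
  // range, instead of dereferencing the pointer unconditionally.
  int FindChannel(unsigned index) const {
    if (channels_ == NULL || index >= channels_->size())
      return -1;
    return static_cast<int>(index);
  }

 private:
  std::vector<std::vector<std::string> >* channels_;
};
```

Initializing to NULL only helps if the lookup path tests for it before dereferencing; without the initializer, that test compares against whatever happened to be in memory.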

 
Cc: vapier@chromium.org diand...@chromium.org
This is a simple fix. However, stressapptest is included in Chrome OS in a pretty roundabout way, so it may be worthwhile to add it to third_party/chromiumos-overlay and have a local copy of the source.
Dunno if we need a local copy or not.  I was thinking we could just fix it "upstream" and then submit a roll to portage to pick it up, then get the newer portage package.  That seems like the cleanest way...

Comment 3 by vapier@chromium.org, Jan 31 2017

yeah, we want to stop building any random tarballs+source out of the autotools repo and instead have dedicated ebuilds in chromiumos-overlay (or portage-stable if possible)

if you want to post a CL that fixes the ebuild, we can get that into Gentoo
Here's the fixed upstream: https://github.com/stressapptest/stressapptest/releases (1.0.8)

vapier, can you pull the new version into Gentoo, then Chrome OS?

Comment 5 by vapier@chromium.org, Jan 31 2017

ah, easy enough! :)

now in Gentoo:
https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=8c2a6c26aa3f4902ea1b86a0fac9b1d27d51874d

you should be able to pull it into portage-stable via cros_portage_upgrade
Cc: nsanders@chromium.org
Status: Fixed (was: Untriaged)
Should be checked in now.

Comment 7 by dchan@google.com, Apr 17 2017

Labels: VerifyIn-59

Comment 8 by dchan@google.com, May 30 2017

Labels: VerifyIn-60

Comment 9 by dchan@chromium.org, Aug 1 2017

Labels: VerifyIn-61

Comment 10 by dchan@chromium.org, Oct 14 2017

Status: Archived (was: Fixed)
