Issue metadata
Sign in to add a comment
|
veyron: Device crash reboots due to CPU hang, triggered by addition of -fomit-frame-pointers to cflags |
|||||||||||||||||||||||||
Issue description
Glancing at two crashes on veyron_minnie:
1. f2f90cb610000000: Looks like some yet unknown CPU errata. CPU0 is wedged on PC 0x2d64ad24, which is userspace.
Since CPU0 is wedged we get various errors about interrupts not being serviced, too.
2. 77ee0a8c80000000: Looks like the same CPU errata. CPU3 wedged on PC 0x372c7d24.
At this point we don't know much more than that. :(
...but we've had a number of difficult to track down errata for A17. This one sounds a little like <https://chromium-review.googlesource.com/331746> did back in the day...
,
Apr 10 2017
The other bug had some steps to reproduce: --- What steps will reproduce the problem? (1)Sign in to chrome or Browse as guest>> Go to youtube.com and play any video (2)Now start clicking on the video continuously to pause & play and Observe. Expected: Chrome should not crash while continuously clicking on pause and play in youtube.com Actual: Instead chrome crash is seen. This is regression issue as no such crash is seen in M58. Crash id's: f2f90cb610000000,77ee0a8c80000000 Note: 1.Issue is only seen in Minnie device. 2.Issue is not seen in same chrome version Daisy and Candy Attaching screen-cast for reference. --- That means we probably have a chance of debugging this, actually...
,
Apr 10 2017
Not reproduce the crash(comment#2) on Chrome 59.0.3065.0/CrOS9449.0.0 -Cyan, Samus.
,
Apr 10 2017
@3: I'd expect this would reproduce on any veyron device. ...since you logged in as guest, it seems unlikely this is ARC++ related.
,
Apr 10 2017
Unable to reproduce the issue on 9449.0.0/59.0.3065.0 - Minnie device when logged in to an user account.
,
Apr 13 2017
Looks like this is also reported in bug #711087 . There we have a hang: <3>[ 290.958742] CPU1 PC: <331f9cd4> 0x331f9cd4 --- As per all bugs, I see: - 59.0.3054.0/9413.0.0 - maybe reproduced for wez@ at login ( bug #711087 , original report) - 59.0.3065.0/9448.0.0 - reproduces pause / unpause video ( bug #709945 ) - 59.0.3065.0/9449.0.0 - the version on wez@'s crash report ( bug #711087 ) - also the version in comment #5 above that says "Unable to reproduce the issue"
,
Apr 13 2017
OK, I've reproduced this pretty easily on "9454.0.0". --- My best guess is that this in the skia function setLine: https://cs.chromium.org/chromium/src/third_party/skia/src/core/SkEdge.cpp?type=cs&q=setLine+package:%5Echromium$&l=33 Why do I guess that? I've done these steps twice: -- 0. Run w/ a custom kernel with kdb 1. Start up a youtube video and from command line, run this (save results somewhere): grep 'chrome/chrome$' /proc/*/maps 2. Reproduce the crash (the most recent time I didn't even need to do the pause/unpause) 3. When you see the failure, you'll be in kdb and see a line like: [ 1006.023641] CPU0 PC: <34895cd4> 0x34895cd4 4. In kdb, run: [3]kdb> set LINES 100000 [3]kdb> ps 5. Look at the kdb output for a process in "R" state (running) on the stuck CPU. Hint, look for "Error: no saved data for this cpu". I see: 0xe7d3a880 3340 1501 1 0 R 0xe7d3ac88 CompositorTileW Error: no saved data for this cpu 6. See if you can find the process ID (3340) in your maps from step #1. If not, maybe the parent (1501) is close enough. That's what I used. AKA: /proc/1501/maps:3292a000-32a00000 r-xp 00000000 b3:13 72886 /opt/google/chrome/chrome /proc/1501/maps:34800000-37caf000 r-xp 01ed6000 b3:13 72886 /opt/google/chrome/chrome /proc/1501/maps:37caf000-37fac000 r--p 05384000 b3:13 72886 /opt/google/chrome/chrome /proc/1501/maps:37fac000-37fc1000 rw-p 05681000 b3:13 72886 /opt/google/chrome/chrome 7. Do the math finding out the offset for your address. In [42]: hex(0x34895cd4-0x34800000+0x01ed6000) Out[42]: '0x1f6bcd4' 8. Use magic steps to get gdb to help you debug Chrome: a) Download image and "debug-veyron-minnie.tgz" to a folder that chroot can see. b) Unzip mkdir /b/tip/tmp/minnie-9454.0.0-symbols cd /b/tip/tmp/minnie-9454.0.0-symbols tar xvf ~/Downloads/canary-channel%2Fveyron-minnie%2F9454.0.0%2Fdebug-veyron-minnie.tgz tar xvf ~/Downloads/canary-channel%2Fveyron-minnie%2F9454.0.0%2FChromeOS-test-R59-9454.0.0-veyron-minnie.tar.xz c) Mount (in chroot); START=$(($(cgpt show chromiumos_test_image.bin | grep ROOT-A | awk '{print $1};') * 512)) mkdir r sudo mount -oro -oloop,offset=${START} chromiumos_test_image.bin r d) Run (in chroot): armv7a-cros-linux-gnueabi-gdb \ --ex "set sysroot r" \ --ex "set debug-file-directory debug/" \ --ex "file r/opt/google/chrome/chrome" 9. In debugger, type: disass 0x1f6bcd4 --- If you do that, you'll see: (gdb) disass 0x1f6bcd4 Dump of assembler code for function setLine: 0x01f6bcd4 <+0>: stmdb sp!, {r4, r5, r6, r7, r8, r9, r10, r11, lr} 0x01f6bcd8 <+4>: sub sp, #12 0x01f6bcda <+6>: mov r4, r0 0x01f6bcdc <+8>: movs r0, #0 0x01f6bcde <+10>: str r0, [r4, #8] 0x01f6bce0 <+12>: movw r3, #16383 ; 0x3fff 0x01f6bce4 <+16>: vmov.f32 s0, #16 ; 0x40800000 4.0 0x01f6bce8 <+20>: vldr s4, [r2, #4] 0x01f6bcec <+24>: vmul.f32 s4, s4, s0 0x01f6bcf0 <+28>: vldr s2, [r2] 0x01f6bcf4 <+32>: mov.w r2, #8192 ; 0x2000 0x01f6bcf8 <+36>: vldr s8, [r1, #4] 0x01f6bcfc <+40>: vldr s10, [pc, #524] ; 0x1f6bf0c <setLine+568> 0x01f6bd00 <+44>: vmul.f32 s2, s2, s0 0x01f6bd04 <+48>: vmul.f32 s8, s8, s0 0x01f6bd08 <+52>: vmul.f32 s4, s4, s10 0x01f6bd0c <+56>: vldr s6, [r1] 0x01f6bd10 <+60>: vmul.f32 s0, s6, s0 0x01f6bd14 <+64>: vmul.f32 s2, s2, s10 0x01f6bd18 <+68>: vmul.f32 s6, s8, s10 0x01f6bd1c <+72>: vcvt.s32.f32 s4, s4 0x01f6bd20 <+76>: vmul.f32 s0, s0, s10 0x01f6bd24 <+80>: vcvt.s32.f32 s2, s2 0x01f6bd28 <+84>: vmov r1, s4 0x01f6bd2c <+88>: vcvt.s32.f32 s6, s6 0x01f6bd30 <+92>: mov.w r1, r1, lsl #10 0x01f6bd34 <+96>: add.w r1, r2, r1, asr #2 0x01f6bd38 <+100>: vcvt.s32.f32 s0, s0 0x01f6bd3c <+104>: bic.w r8, r1, r3 0x01f6bd40 <+108>: vmov r3, s6 0x01f6bd44 <+112>: vmov r1, s2 0x01f6bd48 <+116>: mov r11, r8 0x01f6bd4a <+118>: mov.w r3, r3, lsl #10 0x01f6bd4e <+122>: add.w r9, r2, r3, asr #2 0x01f6bd52 <+126>: vmov r2, s0 0x01f6bd56 <+130>: mov.w r1, r1, lsl #10 0x01f6bd5a <+134>: bfc r9, #0, #14 0x01f6bd5e <+138>: mov.w r2, r2, lsl #10 0x01f6bd62 <+142>: mov.w r3, r2, asr #2 0x01f6bd66 <+146>: cmp r9, r8 0x01f6bd68 <+148>: mov r10, r3 0x01f6bd6a <+150>: mov r5, r9 0x01f6bd6c <+152>: ittt gt ... ... === IMPORTANT NOTE! IMPORTANT NOTE! IMPORTANT NOTE! IMPORTANT NOTE! The "PC" reported above is approximate. It is a PC that was recently executed by the CPU but not guaranteed to be the actual PC that the CPU is actually at right now. ...so we're likely not actually on the "stmdb sp!, {r4, r5, r6, r7, r8, r9, r10, r11, lr}" but some number of instructions after it. === The "setLine" seems like a really good candidate for causing a deadlocking CPU errata on veyron because it appears to use floating point instructions. Many previous A12^H^H^H A17 CPU errata were related to floating point and/or neon instructions, so this seems pretty likely.
,
Apr 13 2017
Wow, the "setLine" function's disassmebly changed quite drastically recently. I'll see if I can bisect, but I happened to have 9369.0.0 on my PC already. Here's the disassembly of the two. This looks a bit more like it might have been a compiler flag change??
,
Apr 13 2017
Ugh, there's more than one setLine in Chrome. Sigh. Here's the proper one from 9369.0.0, found via "disass SkEdgeBuilder::setLine" It's still pretty different, though.
,
Apr 13 2017
,
Apr 13 2017
Presuming setLine changing in a major way is actually the root cause, the regression would have happened between 9438.0.0 and 9439.0.0 ...now I guess we can actually check if that's true by actually putting builds on devices and trying to reproduce. Diffs: * https://crosland.corp.google.com/log/9438.0.0..9439.0.0 * https://chromium.googlesource.com/chromium/src/+log/59.0.3064.0..59.0.3065.0?pretty=fuller&n=10000 Nothing terribly fishy on the Chrome OS side, so presumably something on the Chrome side?
,
Apr 13 2017
,
Apr 13 2017
Anyone know if we are using "jumper" ? One of the commits in this range is: https://github.com/google/skia/commit/3b80558bd2d1f68c175475343328ee14a71d1a3b jumper, turn off a few fancy features That looks the most fishy. Ah, also attaching the exact disassembly between 9438 and 9439... I'd love some help tracking down what changed this disassembly. While the new disassembly is probably fine in theory, it seems like it's tickling this new mystery errata bigtime. As a stopgap until we get it solved, it would be good to find a way to revert back. Personally I'm not setup to build Chrome. Maybe it's time to get setup, but also maybe someone else could help me poke? --- Current list of things to do: 1. Confirm that regression started at 9439. 2. See if we can nail down exactly why the disassembly changed. Change it back and see if problem goes away. 3. Work with ARM to find Chicken Bits to set. See if we can find one that makes the problem go away. 4. Try to reproduce the problem in a standalone binary.
,
Apr 13 2017
The "jumper, turn off a few fancy features" CL is a no-op. Can't cause any problems. It turns off fancy features in a code generation step that have no impact on the results of that step, which are also checked into Skia (src/jumper/SkJumper_generated.S). SkJumper is a little interpreter that we use to do a lot of CPU-bound work in Skia. It's pretty much confined to two object files, SkJumper.o and SkJumper_generated.o. The ones in SkJumper_generated are assembly, and look like sk_<name>_<arch>, where arch is something like sse2, vfp4, or arm64. You're all still pretty sure the crash is in SkEdge::setLine()? What's a minnie and a veyron?
,
Apr 13 2017
> You're all still pretty sure the crash is in SkEdge::setLine()?
It seems fairly likely given all the above, but not 100%.
> What's a minnie and a veyron?
veyron_minnie (the device "minnie" in the family "veyron") is the Asus Chromebook Flip, a 32-bit ARM (rk3288 SoC) Chromebook. It's likely that problems affecting one device in the "veyron" family affect all of them.
---
As an FYI, I've confirmed 9439 reproduces:
[ 531.252136] CPU3 PC: <32c90d24> 0x32c90d24
...and nicely the lower 3 nybbles match the address of setLine for that version
Dump of assembler code for function setLine:
0x01f6bd24 <+0>: stmdb sp!, {r4, r5, r6, r7, r8, r9, r10, r11, lr}
---
Trying 9438, though confirming that the problem _doesn't_ reproduce will be slightly harder, of course.
,
Apr 13 2017
Here's all the Skia changes in the window: https://skia.googlesource.com/skia.git/+log/45dcc0c42c40..9a121cc6a That file hasn't changed in 4 months, and that function in 2 years. :(
,
Apr 13 2017
Hard for me to prove it can't happen, but I spent quite some time in 9438.0.0 and I couldn't reproduce the problem there. If anyone else wants to try, please do...
,
Apr 13 2017
Wez were you signed in as guest? Comment 5: didn't see the crash when signed in as guest, but saw it when logged into own user account.
,
Apr 13 2017
I haven't managed to make a standalone reproduction of this, which isn't a terrible surprise given the nature of these types of problems. They tend to need very complicated setups. --- * laszio@ has volunteered to try to track down why the disassembly is so different. * bbatacha@arm.com is checking for likely chicken bits. The plan is to have something by next Monday. I'm not sure there's too much more I can do for this right now other than wait for those two results. I could try to object code patch the Chrome binary to see if I can make the problem go away, but that seems like a lot of work when the other avenues are better uses of time. === For now assigning to laszio@chromium.org as he is the one who is likely to make the next step.
,
Apr 14 2017
Re #18: I was logged-in with my corp account; I had quite a few tabs open, so issue may be specific to something one of them is rendering..?
,
Apr 14 2017
Some updates: 1. The setLine shown by gdb should be "SkAnalyticEdge::setLine(const SkPoint& p0, const SkPoint& p1)" https://cs.chromium.org/chromium/src/third_party/skia/src/core/SkAnalyticEdge.h?type=cs&l=133 rather than the link mentioned in comment 7. They both have difference across 9438 and 9439, though. mtklein@, is there any (indirect) change to SkAnalyticEdge::setLine recently? 2. AutoFDO has nothing to do with this; I checked veyron_minnie-chrome-pfq, which doesn't use AutoFDO, and the differences between the builds are there. The size of the functions matches the builds with AutoFDO; I would guess AutoFDO didn't touch them at all.
,
Apr 14 2017
I stand corrected. ARM will try to have something by Tuesday, not Monday.
,
Apr 14 2017
Still don't see anything that'd affect that in that CL window, no.
,
Apr 14 2017
@21: Ah, OK. This still works to find the assembly, right? disass SkEdgeBuilder::setLine When I do that on 9454 I get the right address... Ah, yeah. That seems to be right: (gdb) info symbol SkEdgeBuilder::setLine SkAnalyticEdge::setLine(SkPoint const&, SkPoint const&) in section .text of /mnt/host/source/tmp/minnie-9454.0.0-symbols/r/opt/google/chrome/chrome -- I'm assuming that laszio@ is still digging into why the assembly changed.
,
Apr 14 2017
OK, there is a difference in compiler flags: the newer build has -fomit-frame-pointer while the older one doesn't. There is no source code change. The flag change is in chromium, not in CrOS; I used the same CrOS checkout and the ebuilds: chromeos-chrome-59.0.3064.0_rc-r1.ebuild chromeos-chrome-59.0.3065.0_rc-r1.ebuild The only difference besides autofdo in the ebuild is + sys-apps/dbus The change is probably introduced by https://codereview.chromium.org/2782063005 Issue 2782063005: Explicitly specify whether to emit frame pointers by default.
,
Apr 14 2017
BTW, when I tried to remove the -fomit-frame-pointer and the difference between 3064 and 3065 disappeared. So this is the reason why the assembly changes.
,
Apr 14 2017
erikchen@: Currently this regression is being blamed on your CL <https://codereview.chromium.org/2782063005>. Quick summary: it appears that your change in compiler flags is tickling a CPU errata on ARM A12 CPUs (like veyron). Can you submit a CL to change the compiler flags back to the way they were until we can find a way to workaround the errata?
,
Apr 14 2017
Drafted a quick-fix to un-do explicit omit-frame-pointers in ARM 32-bit builds.
,
Apr 14 2017
Filed issue 711784 for the follow-up toolchain work.
,
Apr 14 2017
@29: more accurately, issue 711784 is about somehow addressing the errata. We'll use this bug here to track the workaround of just undoing the compiler flag.
,
Apr 14 2017
BTW: it's hard to prove 100%, but I think this patch "fixed" it for me:
---
--- a/src/core/SkAnalyticEdge.h
+++ b/src/core/SkAnalyticEdge.h
@@ -24,6 +24,7 @@ struct SkAnalyticEdge {
// During aaa_walk_edges, if this edge is a left edge,
// then fRiteE is its corresponding right edge. Otherwise it's nullptr.
SkAnalyticEdge* fRiteE;
+ volatile int doug;
SkFixed fX;
SkFixed fDX;
@@ -138,9 +139,13 @@ bool SkAnalyticEdge::setLine(const SkPoint& p0, const SkPoint& p1) {
const int accuracy = kDefaultAccuracy;
const int multiplier = (1 << kDefaultAccuracy);
SkFixed x0 = SkFDot6ToFixed(SkScalarToFDot6(p0.fX * multiplier)) >> accuracy;
+ doug = x0;
SkFixed y0 = SnapY(SkFDot6ToFixed(SkScalarToFDot6(p0.fY * multiplier)) >> accuracy);
+ doug = y0;
SkFixed x1 = SkFDot6ToFixed(SkScalarToFDot6(p1.fX * multiplier)) >> accuracy;
+ doug = x1;
SkFixed y1 = SnapY(SkFDot6ToFixed(SkScalarToFDot6(p1.fY * multiplier)) >> accuracy);
+ doug = y1;
int winding = 1;
---
I'll do more digging and testing and confirming Monday and also try to get assembly of the minimal change to make it work. I'm already on the bus and this bug requires me to interact with the device.
---
wez@ has his patch to adjust compiler flags at <https://codereview.chromium.org/2820803003/>
,
Apr 14 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/bec44505ea17b6b8fbd366ab22ca724f3c448fe6 commit bec44505ea17b6b8fbd366ab22ca724f3c448fe6 Author: wez <wez@chromium.org> Date: Fri Apr 14 23:55:20 2017 Remove explicit -fomit_frame_pointer from ARM 32-bit builds. -fomit_frame_pointer causes the assembly generated from SkEdge::setLine to trigger a CPU errata in ARM A12/A17 devices, so we are temporarily removing the flag until the toolchain can be fixed to avoid that. Revert this when issue 711784 is resolved. BUG= 710131 , 706654 , 711784 Review-Url: https://codereview.chromium.org/2820803003 Cr-Commit-Position: refs/heads/master@{#464832} [modify] https://crrev.com/bec44505ea17b6b8fbd366ab22ca724f3c448fe6/build/config/compiler/BUILD.gn
,
Apr 15 2017
#CBC-RS/TC-watchlist CBC topic https://goo.gl/rvK5Ky
,
Apr 17 2017
We will almost certainly need a merge request for this. I can't verify this on R60 ToT at the moment because Chrome isn't uprevving. However, I managed to get setup to build Chrome. I can confirm that I can reproduce the problem and that the changes fixes it. I suggest merging to R59. --- See also b/37405018 for someone else running into this on veryon.
,
Apr 17 2017
This bug requires manual review: Reverts referenced in bugdroid comments after merge request. Please contact the milestone owner if you have questions. Owners: amineer@(Android), cmasso@(iOS), gkihumba@(ChromeOS), Abdul Syed@(Desktop) For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Apr 17 2017
Probably don't want to do any more R59 dev releases for veyron without fixing this.
,
Apr 17 2017
,
Apr 17 2017
,
Apr 17 2017
I can't make the Chrome pick, but wez@ can.
,
Apr 17 2017
By the way, we may need to keep frame pointer for another reason: CWP. gmx@ may provide more details.
,
Apr 17 2017
@40: OK, good to know. If so we should probably make this explicit rather than implicit. Also: it would be good to know if this is just for CWP whether the gain from CWP outweighs the loss from the extra bloat of using frame pointers.
,
Apr 17 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/3dcba7c1facb5bb49535476185c5a396669d52f0 commit 3dcba7c1facb5bb49535476185c5a396669d52f0 Author: Wez <wez@chromium.org> Date: Mon Apr 17 22:20:54 2017 Remove explicit -fomit_frame_pointer from ARM 32-bit builds. -fomit_frame_pointer causes the assembly generated from SkEdge::setLine to trigger a CPU errata in ARM A12/A17 devices, so we are temporarily removing the flag until the toolchain can be fixed to avoid that. Revert this when issue 711784 is resolved. BUG= 710131 , 706654 , 711784 Review-Url: https://codereview.chromium.org/2820803003 Cr-Commit-Position: refs/heads/master@{#464832} (cherry picked from commit bec44505ea17b6b8fbd366ab22ca724f3c448fe6) Review-Url: https://codereview.chromium.org/2819373002 . Cr-Commit-Position: refs/branch-heads/3071@{#26} Cr-Branched-From: a106f0abbf69dad349d4aaf4bcc4f5d376dd2377-refs/heads/master@{#464641} [modify] https://crrev.com/3dcba7c1facb5bb49535476185c5a396669d52f0/build/config/compiler/BUILD.gn
,
Apr 17 2017
,
Apr 17 2017
As a reminder, future work to track down the CPU errata and make sure we can't hit this again is in bug #711784 If there is a reason why we _need_ Frame Pointers, please open a new bug and provide a link to it here. Preferably no more comments in this bug, since it's marked as "Fixed"
,
Apr 17 2017
We can talk about CWP and its need of frame pointers in bug #706654
,
Apr 17 2017
@41: I agree that having an explicit setting for enabling frame pointers makes sense. I thought the chromeos toolchain enables them explicitly with -fno-omit-frame-pointer, but I don't know if that is really true. Can the frame pointers still be enabled explicitly on Chrome OS? Is there a different flag that the toolchain needs to use? If not, is there a bug that covers the removal of frame pointers for chrome on ChromeOS where we can discuss the value of having them or not? Thanks.
,
Apr 17 2017
@46: let's talk about it in bug #706654
,
Apr 19 2017
Issue 712898 has been merged into this issue.
,
Apr 20 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/38f02400540606a7c418f9fa5148d3a07fd8b4fa commit 38f02400540606a7c418f9fa5148d3a07fd8b4fa Author: wez <wez@chromium.org> Date: Thu Apr 20 18:41:09 2017 Enable frame pointers explicitly under ARM32. This replaces the quick-fix in crrev.com/2820803003, to address issues with ARM32 builds when frame pointers are disabled. This CL explicitly enables frame pointers in ARM32 builds, and pulls out the ARM32 and ARM64 special-cases to be handling separately from Debug, profiling and sanitizer build configurations, for clarity. BUG= 710131 , 706654 , 711784 Review-Url: https://codereview.chromium.org/2829433003 Cr-Commit-Position: refs/heads/master@{#466080} [modify] https://crrev.com/38f02400540606a7c418f9fa5148d3a07fd8b4fa/build/config/compiler/BUILD.gn [modify] https://crrev.com/38f02400540606a7c418f9fa5148d3a07fd8b4fa/build/config/compiler/compiler.gni
,
Apr 20 2017
Please request merge into M59
,
Apr 20 2017
Chrome version: 59.0.3071.8
,
Apr 20 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/1e22c07d34fdb782f532155ae5fca2f048a5e2f3 commit 1e22c07d34fdb782f532155ae5fca2f048a5e2f3 Author: Wez <wez@chromium.org> Date: Thu Apr 20 20:46:57 2017 Enable frame pointers explicitly under ARM32. This replaces the quick-fix in crrev.com/2820803003, to address issues with ARM32 builds when frame pointers are disabled. This CL explicitly enables frame pointers in ARM32 builds, and pulls out the ARM32 and ARM64 special-cases to be handling separately from Debug, profiling and sanitizer build configurations, for clarity. BUG= 710131 , 706654 , 711784 Review-Url: https://codereview.chromium.org/2829433003 Cr-Commit-Position: refs/heads/master@{#466080} (cherry picked from commit 38f02400540606a7c418f9fa5148d3a07fd8b4fa) Review-Url: https://codereview.chromium.org/2831733005 . Cr-Commit-Position: refs/branch-heads/3071@{#98} Cr-Branched-From: a106f0abbf69dad349d4aaf4bcc4f5d376dd2377-refs/heads/master@{#464641} [modify] https://crrev.com/1e22c07d34fdb782f532155ae5fca2f048a5e2f3/build/config/compiler/BUILD.gn [modify] https://crrev.com/1e22c07d34fdb782f532155ae5fca2f048a5e2f3/build/config/compiler/compiler.gni
,
Apr 21 2017
I am seeing inconsistent crash on https://www.cheapoair.com/ with 4/10 frequency on latest Chrome version# 59.0.3071.15/9460.5.0 dev channel veryon minnie. Below are the details of crash id's: 8bddc3e190000000,845aadd640000000,47cda38e80000000,f109975630000000
,
Apr 21 2017
,
Apr 21 2017
@53: yes, looks like you're still running into this. :( Digging...
Well, it definitely looks like wez's change is there. ...and it also looks like the PC being blamed is still setLine. From one of your crashes:
<3>[ 116.990523] CPU0 PC: <3beeda40> 0x3beeda40
Here you can see the start of setLine seems to match this address (note the distinctive "a40" match):
(gdb) disass 0x1fc7a40
Dump of assembler code for function _ZN14SkAnalyticEdge7setLineERK7SkPointS2_:
0x01fc7a40 <+0>: push {r4, r5, r6, r7, lr}
0x01fc7a42 <+2>: add r7, sp, #12
0x01fc7a44 <+4>: stmdb sp!, {r8, r9, r10, r11}
0x01fc7a48 <+8>: sub sp, #20
0x01fc7a4a <+10>: mov r4, r0
0x01fc7a4c <+12>: movs r0, #0
0x01fc7a4e <+14>: str r0, [r4, #8]
0x01fc7a50 <+16>: movw r3, #16383 ; 0x3fff
0x01fc7a54 <+20>: vmov.f32 s0, #16 ; 0x40800000 4.0
0x01fc7a58 <+24>: vldr s4, [r2, #4]
0x01fc7a5c <+28>: vmul.f32 s4, s4, s0
0x01fc7a60 <+32>: vldr s2, [r2]
0x01fc7a64 <+36>: mov.w r2, #8192 ; 0x2000
0x01fc7a68 <+40>: vldr s8, [r1, #4]
0x01fc7a6c <+44>: vldr s10, [pc, #544] ; 0x1fc7c90 <_ZN14SkAnalyticEdge7setLineERK7SkPointS2_+592>
...and from the above you can see that, indeed, frame pointers are enabled.
---
So why are we still seeing this? Well, it sure looks like we either have had additional changes (compiler, flags, or code) and (yet again) we're triggering this errata. We knew we had a ticking time bomb here when we "fixed" it by turning frame pointers back on, I just don't think anyone expected the bomb to go off so quickly.
I'll see if I can figure out what could have happened. FWIW: here's the assembly in 9460.5.0
Sigh.
,
Apr 21 2017
,
Apr 21 2017
If it's always setLine on ARM32, maybe we can "implement" it in a separate .S file by just copying the working version?
,
Apr 21 2017
@57: Argh. I tried my own local build again (where I could reproduce the problem previously) and on that build wez's fix fixes it. While it's always hard to confirm the "fixed" case (possibly I just got several minutes of "luck" in a row), it seems so consistently fixed across several tests of several minutes each that I'm not totally willing to believe that. I just tried 9460.5.0 w/ my same test (constantly click on a youtube video to pause / play / fullscreen / out of fullscreen). ...and I reproduced it just fine. So my test case is fine and I can confirm 9460.5.0 is broken. --- So the bad news: on my local build the assembly for setLine looks _identical_ to the one from 9460.5.0. ...so presumably, like all of the similar A17 errata, there need to be more magic sauce aside from just the assembly near the failing instruction. It could come down to cache alignment, previous instructions executed, etc. I guess it's worth it to start doing more object code patching now to see if I can find a minimal fix.
,
Apr 21 2017
BTW: if we're willing to do code changes, we could (presumably) add a single "nop" into the function and that (might) have as good a chance as anything else of fixing it. Basically this:
dianders@tictac:src ((96d5b9a273a5...))$ git diff
diff --git a/src/core/SkAnalyticEdge.h b/src/core/SkAnalyticEdge.h
index 65fb11c69c41..77d30eca8dc8 100644
--- a/src/core/SkAnalyticEdge.h
+++ b/src/core/SkAnalyticEdge.h
@@ -138,6 +138,7 @@ bool SkAnalyticEdge::setLine(const SkPoint& p0, const SkPoint& p1) {
const int accuracy = kDefaultAccuracy;
const int multiplier = (1 << kDefaultAccuracy);
SkFixed x0 = SkFDot6ToFixed(SkScalarToFDot6(p0.fX * multiplier)) >> accuracy;
+ asm volatile ("nop");
SkFixed y0 = SnapY(SkFDot6ToFixed(SkScalarToFDot6(p0.fY * multiplier)) >> accuracy);
SkFixed x1 = SkFDot6ToFixed(SkScalarToFDot6(p1.fX * multiplier)) >> accuracy;
SkFixed y1 = SnapY(SkFDot6ToFixed(SkScalarToFDot6(p1.fY * multiplier)) >> accuracy);
...but you'd want to put that around some sort of ifdefs. I confirmed previously that "fixed" it for me, but it's definitely a bit of a shot in the dark...
,
Apr 21 2017
I'd happily upstream
#if defined(__arm__)
asm volatile("nop"); // crbug.com/710131
#endif
,
Apr 21 2017
E.g. https://skia-review.googlesource.com/c/14062/. When we've got something we like I'd be happy to pick it onto our M59 branch too.
,
Apr 21 2017
Have we heard anything from ARM folks?
,
Apr 21 2017
mtklein: trying to get setup to build _exactly_ R59 so we don't get bit again, then I'll try that patch. gkihumba: yes, we've been chatting with them via email and in bug #711784 . We've also got a bug opened internally in the ARM bug tracker. It's likely getting this resolved won't be a super fast thing, though.
,
Apr 21 2017
I've been unable to get an _exact_ R59 9460.5.0 / 59.0.3071.15 image and the one I produce doesn't show the problem. I might need to find some help from someone who builds Chrome to help me get something that's more exact. It's entirely possible that any small thing could make a difference--it might be that somehow cache alignment of setLine might be important?
,
Apr 24 2017
Issue 714127 has been merged into this issue.
,
Apr 25 2017
In bug #711784 I tried a whole bunch of poking and prodding to see exactly what would make this bug come and go. You can see my experiments there. The net result of that is that _I think_ the safest solution would be to add a "dsb" instruction rather than a "nop". A "dsb" is a data synchronization barrier and basically tells the CPU to finish up outstanding memory transactions before continuing. Presumably executing that instruction should get the cpu out of any state that might have been important to trigger this errata. @mtklein: would you take this (couched in "#if defined(__arm__)")? If you're OK with this, can you post the patch and land it? I'm OOTO for the next two days for the most part. :( --- diff --git a/src/core/SkAnalyticEdge.h b/src/core/SkAnalyticEdge.h index 65fb11c69c41..50f3f5a89b9a 100644 --- a/src/core/SkAnalyticEdge.h +++ b/src/core/SkAnalyticEdge.h @@ -131,6 +131,7 @@ struct SkAnalyticCubicEdge : public SkAnalyticEdge { }; bool SkAnalyticEdge::setLine(const SkPoint& p0, const SkPoint& p1) { + asm volatile("dsb"); fRiteE = nullptr; // We must set X/Y using the same way (e.g., times 4, to FDot6, then to Fixed) as Quads/Cubics. --- I still can't actually confirm that this really fixes things on R59, but it seems likely to and likely to be a lot safer than the "nop". Personally I'd say send it in and we can revert it if it doesn't work. Also note that I wish we could limit this to just "cortex-12" (that's the flag we have set: -mcpu=cortex-a12 -mtune=cortex-a12) but I couldn't find a way to do that. Note that my own personal R59 build doesn't show the problem (and has the same assembly as the official build), but presumably that's because of Auto FDO. I've been doing all my testing (and reproducing) with the frame pointers patched reverted, which lets me see it in my local build. I started trying to build what I think is an R59 auto FDO build, but it took more than 4 hours to build and didn't finish by the time I left. :(
,
Apr 25 2017
,
Apr 25 2017
The following revision refers to this bug: https://skia.googlesource.com/skia/+/101806f4526d0ba5d48515c82a28dbf6956aca4d commit 101806f4526d0ba5d48515c82a28dbf6956aca4d Author: Mike Klein <mtklein@chromium.org> Date: Tue Apr 25 22:47:52 2017 Inject a dsb into SkAnalyticEdge::setLine() to work around ARMv7 CPU erratum. BUG= chromium:710131 Change-Id: I4568bc24cc8fabb6f9df3b7645b01a98e11791b8 Reviewed-on: https://skia-review.googlesource.com/14062 Reviewed-by: Doug Anderson <dianders@google.com> Commit-Queue: Mike Klein <mtklein@chromium.org> [modify] https://crrev.com/101806f4526d0ba5d48515c82a28dbf6956aca4d/src/core/SkAnalyticEdge.h
,
Apr 26 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/9fbda07cfc941305ace888e3d3b742cdb0295bbf commit 9fbda07cfc941305ace888e3d3b742cdb0295bbf Author: skia-deps-roller@chromium.org <skia-deps-roller@chromium.org> Date: Wed Apr 26 16:37:00 2017 Roll src/third_party/skia/ c1889823d..c7be00366 (26 commits) https://skia.googlesource.com/skia.git/+log/c1889823de68..c7be00366bb0 $ git log c1889823d..c7be00366 --date=short --no-merges --format='%ad %ae %s' 2017-04-25 mtklein remove to_2dot2 and from_2dot2 2017-04-26 bungeman Remove SK_IGNORE_GASP_VERSION_CHECK. 2017-04-25 bsalomon Link to vulkan in tools but not library. 2017-04-26 robertphillips Prevent creation of zero-sized proxies 2017-04-25 cblume Make SkNoncopyable movable 2017-04-21 mtklein Inject a dsb into SkAnalyticEdge::setLine() to work around ARMv7 CPU erratum. 2017-04-25 mtklein remove SkOpts::run_pipeline() declaration. 2017-04-25 benjaminwagner Update PixelC to latest build. 2017-04-25 bsalomon Add ability to relinquish control of VkDevice and VkInstance lifetime to GrVkBackendContext 2017-04-25 ethannicholas sksl can now fold constant vector or matrix equality expressions 2017-04-25 brianosman Fix writePixels of sRGB data to legacy GPU surface 2017-04-25 bsalomon Add GrVkInterface constructor with sepereate instance/device proc getters. 2017-04-25 stephana Disable msaa on all current iOS devices 2017-04-25 egdaniel Use system Vulkan headers except when no building with vulkan support 2017-04-25 msarett Make SkColorSpaceXformer::apply(SkPaint) safe to call recursively 2017-04-25 brianosman Remove compressed texture support from cacherator 2017-04-25 bungeman Improve variation comment for iOS. 2017-04-25 mtklein xform saveLayer() backdrop 2017-04-25 stephana Enable ios on Raspberry Pi 2017-04-24 msarett Balance save()/restore() calls in dont_clip_to_layer gm 2017-04-25 msarett Delete SkGTypeface and SkGScalerContext 2017-04-25 msarett SkMergeImageFilter: fModes might be nullptr 2017-04-25 brianosman Name 'client' parameter so comment makes sense 2017-04-19 halcanary SkDiscardableMemoryPool: modernize 2017-04-24 scroggo Improve the Codec_end test and add fixes 2017-04-24 robertphillips Rm makeRenderTargetContext in favor of deferred version (take 3) Created with: roll-dep src/third_party/skia BUG= 710131 Documentation for the AutoRoller is here: https://skia.googlesource.com/buildbot/+/master/autoroll/README.md If the roll is causing failures, see: http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls CQ_INCLUDE_TRYBOTS=master.tryserver.blink:linux_trusty_blink_rel;master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel;master.tryserver.chromium.android:android_optional_gpu_tests_rel TBR=benjaminwagner@chromium.org Change-Id: I7be0640ddc22b704be4d62f67ad2481606ecde48 Reviewed-on: https://chromium-review.googlesource.com/488122 Reviewed-by: Skia Deps Roller <skia-deps-roller@chromium.org> Commit-Queue: Skia Deps Roller <skia-deps-roller@chromium.org> Cr-Commit-Position: refs/heads/master@{#467345} [modify] https://crrev.com/9fbda07cfc941305ace888e3d3b742cdb0295bbf/DEPS
,
Apr 26 2017
Sorry it took a while to make it into Chromium head. Nothing wrong with it I suspect, just other unrelated issues in that roll. If nothing screams I'll pick that CL into M59 tomorrow.
,
Apr 26 2017
@70: Great, thanks! When I get in tomorrow I'll do some testing with this change on an actual official build too (presuming Chrome rolls before tomorrow morning).
,
Apr 26 2017
Note that we'll need to merge https://codereview.chromium.org/2845433002/ into M59 to make sure frame pointers are enabled in x64 binaries (in practice this enables them for all ChromeOS, as was the case before) to support CWP. Grace, shall I go ahead and merge that now?
,
Apr 26 2017
Yes, please go ahead and merge that. How is cl in #68 looking on tot?
,
Apr 26 2017
As far as I know, fine. Want me to merge it into M59 now?
,
Apr 26 2017
I think that'll be https://skia-review.googlesource.com/c/14407/ whenever we want to land it (ignore the CQ bots... I forgot it doesn't work against branches). We've just had a little power outage here in Chapel Hill so many of our bots are down or recovering and the local tree is closed, but I'm happy to just jam it if it's urgent.
,
Apr 26 2017
As long as we merge it in the next 24 hrs it should be ok. Test team will verify fix on M59 on Friday.
,
Apr 26 2017
Sounds good. I'll land it when the tree opens up or late tomorrow morning, whichever's sooner.
,
Apr 26 2017
Oh! Should have looked before I wrote that. Our tree is back. Here we go.
,
Apr 26 2017
The following revision refers to this bug: https://skia.googlesource.com/skia/+/3027fda3cfe425b6fae53ef59923d554402c1414 commit 3027fda3cfe425b6fae53ef59923d554402c1414 Author: Mike Klein <mtklein@chromium.org> Date: Wed Apr 26 22:13:25 2017 Inject a dsb into SkAnalyticEdge::setLine() to work around ARMv7 CPU erratum. BUG= chromium:710131 Change-Id: I4568bc24cc8fabb6f9df3b7645b01a98e11791b8 Reviewed-on: https://skia-review.googlesource.com/14062 Reviewed-by: Doug Anderson <dianders@google.com> Commit-Queue: Mike Klein <mtklein@chromium.org> Reviewed-on: https://skia-review.googlesource.com/14407 Reviewed-by: Mike Klein <mtklein@chromium.org> [modify] https://crrev.com/3027fda3cfe425b6fae53ef59923d554402c1414/src/core/SkAnalyticEdge.h
,
Apr 27 2017
I'm not really sure if it is related to this bug, but an ARC test (cheets_CTSHelper.stress; which does (login+launch ARC+logout)*10 times) started failing again by cpu lockup kernel panic only on veyron_* family. https://wmatrix.googleplex.com/unfiltered?hide_missing=True&tests=cheets_CTSHelper.stress&days_back=30&platforms=veyron_minnie,veyron_fievel,veyron_tiger,veyron_speedy <0>[ 1147.822274] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3 Roll of Chrome 60.0.3077.0 once fixed the issue, thanks to the frame-pointer change in this issue I thought, but it looks coming back again in R60-9496.0.0 = Chrome 60.0.3080.3 roll.
,
Apr 27 2017
@80: It's entirely possible. Can you give more logs? Specifically, if you see something in the logs where one CPU fails to stop and has an address in userspace, it's likely this issue: <3>[ 290.958742] CPU1 PC: <331f9cd4> 0x331f9cd4 As far as we can tell, the frame pointer revert is inconsistent at best. As far as we can tell, this was a ticking time bomb. The frame pointer happened to tickle it first, but even with the revert it looks like things have changed and we're tickling again. For the record, we're still pretty much crossing our fingers that the "dsb" will fix it a little better. We have strong confidence that the dsb won't do anything terribly bad, but it's a bit of a shot in the dark that it will fix things across many builds. So far I've been unable to find an official build that includes it. :( It looks like Chrome hasn't revved in ToT yet... QUESTION: do we need to manually rev skia in R59 to get the above fix? I don't see any skia mentions in <https://chromium.googlesource.com/chromium/src/+log/59.0.3071.25..59.0.3071.28?pretty=fuller&n=10000>
,
Apr 27 2017
Here's the logs on R60-9497.0.0 veyron_minnie (attached one of them as an example): https://wmatrix.googleplex.com/failures/unfiltered?platforms=veyron_minnie&tests=cheets_CTSHelper.stress&days_back=30&builds=R60-9497.0.0&hide_missing=True >> Logs >> ../crashinfo.*.tgz >> *.kcrash Yes, some CPU is stopping at some address. Below is the addressed of 5 crashes in the link above. <3>[ 1557.005114] CPU1 PC: <33b94658> 0x33b94658 <3>[ 1275.083341] CPU2 PC: <39e98658> 0x39e98658 <3>[ 1148.989250] CPU3 PC: <38a5d658> 0x38a5d658 <3>[ 369.128921] CPU1 PC: <39634658> 0x39634658 <3>[ 1005.435413] CPU3 PC: <2ff19658> 0x2ff19658
,
Apr 27 2017
@82: yup, that's this errata. Well, it's good that it's being caught by a test case. Now we just need a Chrome uprev.
,
Apr 28 2017
,
Apr 28 2017
FYI: problem seems "gone" for now. Specifically: * I confirmed that 9460.14.0 (59.0.3071.30) has the dsb. ...and I can't reproduce it there. As suggested by bbatacha@ I tried object-code patching the "dsb" into two "nops" and I _still_ couldn't reproduce it (though my previous testing ( bug #711784 , comment 13) also showed that two "nop" instructions at _this_ location also fixed it, it was nops at a different location that could still reproduce it where a dsb would fix it). * On 9460.13.0 (59.0.3071.28) there is no dsb, but I still can't reproduce the problems. Actually, on 9460.13.0 I can't even find setLine, so possibly the function got inlined on that version? It seems like with AFDO each version is subtly different. * I have confirmed that on 9460.11.0 (59.0.3071.25) I _can_ reproduce the problem. --- So summary: * I'll call this "Fixed" and unblock any builds 9460.14.0 or later (technically we might unblock 9460.13.0 since I can't reproduce, but I'd still rather not). * We'll just have to keep fingers crossed that the dsb() patch stops the problem. * We need to keep bug #711784 (fix the root cause) as a high priority. --- Thanks to everyone who's been helping with this.
,
Apr 28 2017
...oh, and also note that there are _no_ R60 builds with this fixed, since Chrome appears to be failing to uprev. We're still stuck on 60.0.3080.3 which doesn't have the dsb() hack.
,
May 1 2017
The test I mentioned in #c80 started reliably passing after R60-9503.0.0 and R59-9460.13.0. It looks good but I'm a bit confused because the Chrome roll for the "dsb" change only happened at R60-9505.0.0... The problem seems gone, in any way, though.
,
May 1 2017
@87: great to hear. We'll certainly need to keep our eyes peeled to see if this shows up again, but hopefully this holds until we can find the root cause. Not sure why 9505.0.0 would have passed. Could be luck or could be that some bits of code got rearranged somewhere that made the problem not hit. The problem is certainly slippery.
,
May 1 2017
Crashes are not reproducible in Chrome OS 9460.18.0, 59.0.3071.33 on Minnie.
,
May 1 2017
Verified by playing youtube video on 9460.18.0, 59.0.3071.33 on minnie.
,
May 2 2017
I'll second that verification: No video-playing app I use has crashed on Tiger or Fievel with 9460.18.0.
,
May 2 2017
One random note is that it looks like this crash did exist even in M-58, it was just a lot more rare...
Looking randomly at 6ffc109e80000000 (9334.58.0 on jaq):
<4>[ 5318.928607] SMP: failed to stop secondary CPUs
<3>[ 5318.928625] CPU3 PC: <2e58a254> 0x2e58a254
---
Even on M-58 it looks like it was SkAnalyticEdge::setLine since the location of the function appears to match (254 matches 254):
Dump of assembler code for function SkAnalyticEdge::setLine(SkPoint const&, SkPoint const&):
0x01fbc254 <+0>: push {r4, r5, r6, r7, lr}
0x01fbc256 <+2>: add r7, sp, #12
0x01fbc258 <+4>: stmdb sp!, {r8, r9, r10, r11}
,
Jul 16 2017
The following revision refers to this bug: https://skia.googlesource.com/skia/+/42102420c02a525f57264c124ef941882d22c5a1 commit 42102420c02a525f57264c124ef941882d22c5a1 Author: Mike Klein <mtklein@google.com> Date: Sun Jul 16 20:24:07 2017 Revert "Inject a dsb into SkAnalyticEdge::setLine() to work around ARMv7 CPU erratum." This reverts commit 101806f4526d0ba5d48515c82a28dbf6956aca4d. Reason for revert: no more need. BUG= chromium:740805 Original change's description: > Inject a dsb into SkAnalyticEdge::setLine() to work around ARMv7 CPU erratum. > > BUG= chromium:710131 > > Change-Id: I4568bc24cc8fabb6f9df3b7645b01a98e11791b8 > Reviewed-on: https://skia-review.googlesource.com/14062 > Reviewed-by: Doug Anderson <dianders@google.com> > Commit-Queue: Mike Klein <mtklein@chromium.org> TBR=mtklein@chromium.org,dianders@google.com # Not skipping CQ checks because original CL landed > 1 day ago. Bug: chromium:710131 Change-Id: Ib5b58a951ae56e0f0523cfed27754e0cd150e6ce Reviewed-on: https://skia-review.googlesource.com/23840 Reviewed-by: Mike Klein <mtklein@google.com> Commit-Queue: Mike Klein <mtklein@google.com> [modify] https://crrev.com/42102420c02a525f57264c124ef941882d22c5a1/src/core/SkAnalyticEdge.h |
||||||||||||||||||||||||||
►
Sign in to add a comment |
||||||||||||||||||||||||||
Comment 1 by gkihumba@google.com
, Apr 10 2017