Alleged armv8 hw bug on SIGILL + fault addr not 4-byte aligned + snapdragon 820 |
||||||
Issue description
Summarizing the investigation from Issue 655913 to remove the noise specific to that bug.
jkummerow@ and I are theorizing that crashes that meet the following conditions:
- OS: Android (M seems more likely, but generally L+)
- Chip: Snapdragon 820, which shows as: CPU Architecture, Info: "arm, ARMv1 Qualcomm part(0x51002150)...".
Popular devices having this chip are SM-G930V, SM-G935V, HTC 10, OnePlus 3.
- Crash reason: SIGILL
- Crash address: 4byte instruction spanning across a 64B boundary (0x???be, 0x???fe, and any other X such that X & 0x3f == 0x3e)
are possibly due to a HW bug. Possibly a cache line migration issue between big-little cores.
A list of crashes that meet this criteria is [1]. Unfortunately [1] is a superset and includes also real crashes (real: not due to alleged HW bug). This is because everything that hits a CHECK/RELEASE_ASSERT/V8_IMMEDIATE_CRASH on arm{v7,v8} executing 32-bit code (we ship only 32-bit code on arm today) also shows up as a crash report with SIGILL (the fault addr might be aligned just by chance) and there is no easy way to filter them out in the crash/ query.
How does this affect stability?
-------------------------------
If the theory is valid, this would mean that whenever we ship a release that has a very hot function that contains a 4byte instruction spanning across a cache line, such function is very likely to trigger the alleged bug and hence create, on a large scale, a noticeable number of crashes having the same signature.
This would be a plausible explanation for cases like:
- blink::AXObjectCacheImpl::focusedObject' being very crashy but only in 52.0.2743.98 [2]
- content::BlinkAXTreeSource::GetChildren being very crashy only in 52.0.2743.98 and 53.0.2785.124
- Issue 655913, where v8::internal::Parser::ParseWhileStatement became suddenly crashy in 54.0.2840.61 without any v8 change from the previous release.
The fact that a hot function meets the triggering criteria above is totally arbitrary and out of developers control. It's all a matter of (bad) luck and can go on and off just by changing one line of code and rebuilding.
[1]
https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_Android%27%20%20AND%20crash.reason%3D%27SIGILL%27%20AND%20cpu.info%3D%27ARMv1%20Qualcomm%20part(0x51002150)%20features%3A%20half%2Cthumb%2Cfastmult%2Cvfpv2%2Cedsp%2Cneon%2Cvfpv3%2Ctls%2Cvfpv4%2Cidiva%2Cidivt%27%20AND%20crash.address%20%26%200x3F%20%3D%200x3E&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D
[2] https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_Android%27%20%20AND%20crash.reason%3D%27SIGILL%27%20AND%20cpu.info%3D%27ARMv1%20Qualcomm%20part(0x51002150)%20features%3A%20half%2Cthumb%2Cfastmult%2Cvfpv2%2Cedsp%2Cneon%2Cvfpv3%2Ctls%2Cvfpv4%2Cidiva%2Cidivt%27%20AND%20crash.address%20%26%200x3F%20%3D%200x3E%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27blink%3A%3AAXObjectCacheImpl%3A%3AfocusedObject%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D
,
Oct 17 2016
> Aren't arm instructions aligned Yes, at 2 bytes. To be clear 2-byte misalignment is not supposed to be a problem on arm. However, it seem to be a problem on this specific SoC together with some other conditions that make the issue very hard to repro (the theorized cache migration above). > and fixed size? No. only arm ISA is 4 byte fixed. THUMB-2 is either 2 or 4 bytes. Most of chrome code is compiled as THUMB-2.
,
Oct 17 2016
yes, it's thumb code. We ship 32bit binaries on arm64, and thumb is enabled for C++ code
,
Oct 18 2016
I made a chart of the distribution of (crash.address & 0xFFF). Not sure if it's trying to tell us anything.
,
Oct 18 2016
Do I read this correctly that the crashing instructions all have the address at 0x***E?
,
Oct 18 2016
Well, ending in "E" is a necessary condition in order to have a 4 byte instruction misaligned by -2 bytes w.r.t an alignment of 16 bytes or integer multiples of it.
,
Oct 18 2016
,
Oct 18 2016
Nothing private here as far as I can tell.
,
Oct 19 2016
,
Oct 19 2016
I tried to create a repro and run it on a OnePlus2 (which has a SnapDragon 820), but didn't succeed. So there is definitely something more other than migrations and jumping across cache lines that needs to happen. Attaching in the case somebody wants to do some other experiment on top.
,
Oct 19 2016
*typo in #10 above: s/OnePlus2/OnePlus3/
,
Oct 24 2016
,
Nov 6 2017
https://bugs.chromium.org/p/chromium/issues/detail?id=781710 is another example of this. It has 1092 reports that match cpu.Info LIKE '%0x51002150%'.
,
Nov 6 2017
Issue 781710 has been merged into this issue.
,
Feb 12 2018
Have we evaluated whether it is practical to get a compiler workaround for this? It would certainly be possible to insert a two-byte NOP instruction if the compiler (or linker?) detected that a four-byte instruction would cross a 64-byte boundary. The binary-size overhead would probably be modest (at most an expansion by 66/64 which is 3.125% but almost certainly less). I am not qualified to comment on the complexity and practicality of such a solution.
,
Feb 13 2018
Technically makes sense but I am not sure we should spend too much time trying to workaround what it seems a CPU/SoC bug that affects one specific SoC only (Snapdragon 820).
,
Jun 8 2018
Issue 849301 has been merged into this issue.
,
Sep 8
No longer on the Chrome team, e-mail me @google.com if any attention still required from me here, otherwise good luck! |
||||||
►
Sign in to add a comment |
||||||
Comment 1 by yangguo@chromium.org
, Oct 17 2016