New issue
Advanced search Search tips

Issue 654213 link

Starred by 25 users

Issue metadata

Status: Fixed
Owner:
Closed: Oct 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 0
Type: Bug-Regression



Sign in to add a comment

Chrome: Crash Report - SkNx<8,unsigned short>::SkNx<8,unsigned short>

Project Member Reported by ajha@chromium.org, Oct 8 2016

Issue description

Product name: Chrome
Magic Signature: SkNx<8,unsigned short>::SkNx<8,unsigned short>

Current link:
https://crash.corp.google.com/browse?q=custom_data.ChromeCrashProto.magic_signature_1.name%3D'SkNx%3C8%2Cunsigned%20short%3E%3A%3ASkNx%3C8%2Cunsigned%20short%3E'%20AND%20product.name%3D'Chrome'%20AND%20product.version%3D'56.0.2884.0'%20AND%20ReportID%3D'288bfb8b00000000'&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D#3


Search properties:
custom_data.chromecrashproto.magic_signature_1.name: SkNx<8,unsigned short>::SkNx<8,unsigned short>
product.name: Chrome
product.version: 56.0.2884.0
reportid: 288bfb8b00000000

Metadata :
Product Name: Chrome
Product Version: 56.0.2884.0
Report ID: 288bfb8b00000000
Report Time: Sat, 08 Oct 2016 15:41:39 GMT
Uptime: 6000 ms
Cumulative Uptime: 0 ms
User Email: 
OS Name: Windows NT
OS Version: 10.0.14393 0
CPU Architecture: x86
CPU Info: AuthenticAMD family 16 model 5 stepping 3

Stack trace:
============
Thread 0 CRASHED [EXCEPTION_ILLEGAL_INSTRUCTION @ 0x0f98d4a0 ] MAGIC SIGNATURE THREAD
0x0f98d4a0	(chrome.dll -sknx_sse.h:301 )	SkNx<8,unsigned short>::SkNx<8,unsigned short>(__m128i const &)
0x10570078	(chrome.dll -sk4px_sse2.h:40 )	`anonymous namespace'::Sk4px::mulWiden
0x75211223	(KERNELBASE.dll + 0x000d1223 )	LocalBaseRegQueryValue
0x105702b6	(chrome.dll -skblitmask_opts.h:141 )	<lambda_1a787da9eb016a642625d8670baa4bef>::operator()
0x105702b6	(chrome.dll -skblitmask_opts.h:141 )	<lambda_1a787da9eb016a642625d8670baa4bef>::operator()
0x105a7528	(chrome.dll -skpath.cpp:2452 )	SkPath::internalGetConvexity()
0x77b096cf	(ntdll.dll + 0x000996cf )	wcstombs

Link to the list of the builds:
===============================
https://crash.corp.google.com/browse?q=custom_data.ChromeCrashProto.magic_signature_1.name%3D%27SkNx%3C8%2Cunsigned%20short%3E%3A%3ASkNx%3C8%2Cunsigned%20short%3E%27%20AND%20product.name%3D%27Chrome%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D

Note:
=====
1. This is #1 browser crash on Windows canary(56.0.2884.0) and constitutes 99.04% of crashes.(5338 crashes from 2700 clients)

Considering below as the changelog:
===================================
https://chromium.googlesource.com/chromium/src/+log/55.0.2883.0..56.0.2884.0?pretty=fuller&n=10000

Suspecting: https://chromium.googlesource.com/skia.git/+/33cbfd75afdd383770bb6253c06ba819a2481a35 from skia changelog.

mtklein@: Could you please take a look at these crashes and revert the CL if the change is related.

Thank you!


 

Comment 1 by ajha@chromium.org, Oct 9 2016

or, Could be related to https://chromium.googlesource.com/skia.git/+/a71e151c6f0be68dc96ad2d169bbc31edca8f946 from the skia changelog by mtklein@. 
Project Member

Comment 2 by sheriffbot@chromium.org, Oct 9 2016

Labels: FoundIn-M-56 Fracas
Users experienced this crash on the following builds:

Win Canary 56.0.2884.0 -  2850.25 CPM, 16388 reports, 7595 clients (signature SkNx<8,unsigned short>::SkNx<8,unsigned short>)

If this update was incorrect, please add "Fracas-Wrong" label to prevent future updates.

- Go/Fracas
Project Member

Comment 3 by sheriffbot@chromium.org, Oct 9 2016

Labels: ReleaseBlock-Dev
This crash has high impact on Chrome's stability.
Signature: SkNx<8,unsigned short>::SkNx<8,unsigned short>.
Channel: canary. Platform: win.
Labeling  issue 654213  with ReleaseBlock-Dev.


If this update was incorrect, please add "Fracas-Wrong" label to prevent future updates.

- Go/Fracas
I think the CL from Comment 1 created an ODR violation for methods like SkNx<8,unsigned short>::SkNx<8,unsigned short>(), and the linker's picking the "wrong" one.  (Though it is somewhat surprising that those methods are not completely inlined.)

In any case, I'm reverting it now.
Cc: nyerramilli@chromium.org
thanks mtklein@..

Note:
This is #1 browser crash on Windows latest canary(56.0.2886.0) and constitutes 87.31% of crashes.(2340 crashes from 1529 clients)
Sorry if I'm dense... is there something new I should be taking away from the Note in Comment 5?  Are we noting that the numbers reported in crash/ and previous comments are wrong?  The absolute numbers in both previous crash reports (Comment 0, Comment 2) are higher than what you've just written there, and crash/ shows me higher still (~54K crashes across 3 M56 builds).
We're waiting on this roll to land: https://codereview.chromium.org/2406883002
The revert has been successfully landed here: https://codereview.chromium.org/2408753002

Thank you!
Cc: sebmarchand@chromium.org mtklein@chromium.org brucedaw...@chromium.org primiano@chromium.org
 Issue 654193  has been merged into this issue.

Comment 10 by wfh@chromium.org, Oct 10 2016

Labels: -Restrict-View-Google
If the bug is that AVX instructions are being executed on non-AVX processors then it is *possible* that this is related to a SKIA build configuration problem that could (through ODR violations) lead to AVX instructions being generated in undesired places. However if this was the case then I would expect a significant number of reports of this happening.

So, if the reports are isolated and rare then please ignore this and I'll continue my investigation as time permits. However if there is any indication that AVX instructions are consistently being executed on non-AVX machines, please let me know.

I think the story here was something like this:

  1) We have a class SkNx<8, uint16_t> with a bunch of methods inlined into its declaration, including the constructor SkNx(const __m128i&).  This is a pretty simple constructor: it just assigns the argument to its one __m128i field.
  2) This code all tends to be used in situations where all the methods inline.  On most compilers this constructor inlines away completely to zero instructions.  MSVC has decided instead to not inline it, leading to a standalone copy in at least one .o we link.  It's implementation is probably something simple like "movaps xmm0 -> my-m128i-field".  They're still inline, of course, so the linker dedups them just fine.
  3) This CL started using this type from a compilation unit that compiled with /arch:AVX2.  We only call functions in that compilation unit after a runtime CPUID check.  But that compilation unit has now created a second instance of that inline constructor, still just as trivial but now VEX encoded, so something like "vmovaps xmm0 -> my-m128i-field".  This is an ODR violation.
  4) The linker links code not behind the AVX2 runtime check against the vmovaps version of this constructor.  Machines without AVX can't execute those VEX encoded instructions.  Boom.

This will crash every Windows machine without AVX, and maybe even some with AVX but not AVX2.  That said, we've already reverted the bad CL, so I'm not really sure you need to be investigating anything.

Comment 13 by wfh@chromium.org, Oct 10 2016

 Issue 654566  has been merged into this issue.
Yep, that's the bug.

Reverting the CL has fixed this particular instance, but I believe that the underlying issue is still there, waiting to be triggered. If an inlined function is compiled from two different translation units, with different compilation settings, then that is arguably an ODR violation. The linker is free to choose whichever version it wants - we've lost control. Whether it is in practice depends on whether the different settings matter (/O1 versus /O2 is relatively benign) and, probably depends on whether the functions are inlined.

So, if we have *any* inlined functions that are compiled both with and without /arch:AVX then there is nothing that we can say with certainty about those functions. That means that /arch:AVX (and friends) is an extremely sharp sword which we need to use with great care. LTCG and PGO make it more likely that this can hit us at unexpected time.

Workarounds:
- One solution is to put all /arch:AVX object files in a separate DLL.
- Another solution is to put all #include directives (seriously, all of them), from files compiled with /arch:AVX into a separate namespace. But I don't know if this works since having headers from base in an unexpected namespace seems likely to cause problems, and this advice presumably applies to system header files that contain inline functions.
- Another solution is to not use /arch:AVX. AVX intrinsics are safe (although annoying) but they avoid this problem.

P.S. This issue was reported to me (originally as a VC++ compiler bug) by some developers working on the Chromium Embedded Framework, which is why I've already had some discussions with the VC++ team about this.

Yeah, after thinking about this more, I've realized this isn't at all MSVC or AVX specific.  It's not safe for any non-non-exported function that's not __attribute__((always_inline)) or __forceinline to have different definitions depending on the preprocessor or any other compiler setting.

What bugs me is that we'll have to use __attribute__((always_inline)) or __forceinline for member functions... as far as I can tell there's no way in standard C++ to make a member function compilation-unit-static without making the whole type anonymously namespaced.  I could convert a lot of these classes' member functions to be static inline functions, but a couple really want to stay in the class: constructors, operator[], that sort of thing.

Anyway, it seems like we've got a clear forward direction.  MSVC is not inlining this code as well as GCC and Clang do, so we'll want to sprinkle __forceinline over the codebase no matter how else we proceed, for performance's sake if for nothing else.  That forced inlining ought to moot any linking problems, right?

Do you know, is there a way to slice by compiler in crash/?  There's a Clang/Windows Canary trial going on now right?  It'd be worthwhile to confirm that none of the Clang Canaries have crashed.

Comment 16 by grt@chromium.org, Oct 11 2016

Hmm. I don't see a crash key indicating clang vs MSVC. I fear you might need to split based on the version number. For example, the current x86 clang canary is 56.0.2887.2 while the MSVC build is 56.0.2887.0.
Thank-you all. 

"Relaunch" this AM to update the then current (working) 32 bit to 56.0.2887.1 worked. Subsequent "uninstall" followed by re-install of 64 bit version has been all joy. Currently running: Version 56.0.2887.0 canary (64-bit) with all tabs working as expected.
Version 56.0.2887.0 canary (64-bit) is all joy here too again. Win 10 Anniversary 1607 x64.
Comment 15: There's no Clang/MSVC experiment going on, we just have a few users using the win-clang Canary (~5, mostly the Windows Clang team), so you can't really compare these builds with the MSVC ones.
Ah, gotcha.  Sorry to confuse things!
Chrome Canary 64 bits 56.0.2887.0 works fine in Windows 10 anniversary update build 14393.222!

Comment 22 by leo....@gmail.com, Oct 11 2016

Confirmed now working on Version 56.0.2887.0 canary (64-bit). Win 10 Anniversary 1607 x64.
Thank you all - Version 56.0.2887.0 canary (64-bit) is now working for me again. The earth is again rotating in the correct direction ;)
> That forced inlining ought to moot any linking problems, right?

If the code is inlined then we should be safe. Not in any sort of provable computer science way, but practically speaking it should avoid the issue. There will continue to be a risk that 'random' perturbations could cause these bugs to return at surprising times. Having sufficient testing coverage on non-AVX machines will help to manage this risk.

Do we know how much benefit we are getting from using /arch:AVX?

Are you asking the benefit from using the /arch:AVX switch, or the benefit from using AVX instructions generally?

We're actually using /arch:AVX2 for this particular code (SkOpts_hsw.cpp), and using it only on machines that target the wider Haswell feature set (AVX, AVX2, FMA, F16C, BMI1, BMI2).  With all that together, we see between 2 and 3x speedups on the code paths we're trying to improve when using Clang or GCC, closer to 2 for most normal stuff and closer to 3 when we're dealing with half floats.

I've never timed MSVC's performance, mostly because this code is designed to work best when we can pass many vector registers as function arguments, and even __vectorcall is pretty poor there (six total arguments shared between scalars and vectors).  For Windows we're eagerly awaiting Clang where we can selectively switch functions over to use the SysV calling convention with its more generous selection of function argument registers.  Ideally this code passes around 3-4 size_t and 8 __m256 through a little pipeline of 5 to 15 functions tail-calling into each other.

Some small amounts of that 2-3x speedup come from Clang and GCC generating BMI1/BMI2 instructions themselves from portable C++ code, which I'm not sure if /arch:AVX2 lets cl.exe do.  I don't think it's anything critical, just a couple little things like replacing >>-then-& with bextr.  Most of the time we're working on vectors anyway, so the (scalar-only) BMI instructions aren't super critical.

We really only depend on the compiler's direct AVX/AVX2 code generation for a few little oddball things... the code is mostly intrinsics.  But whether we use /arch:AVX or /arch:AVX2 or stick to intrinsics exclusively, we'll still have the same ODR problems if any of this CPU-feature-dependent code is turned into a standalone function.  This isn't just a MSVC thing either... we probably have the same problems in debug builds with GCC and Clang... it's just that only MSVC fails to inline this code when optimizing.  

Again, this is not AVX specific either: it happens between SSE2, SSSE3, SSE4.1, etc.  E.g. the best way to implement Sk4i::operator+ depends on whether or not we've got SSE4.1, and zero-extending 4 bytes up to 4 ints has something like four different best implementations depending on the instruction sets we can use.  We've got ODR danger in all those places if not inlined.
It seems like the crash has been fixed. Can we update the bug status to reflect this? 

Comment 27 by byczybyk@gmail.com, Oct 11 2016

yes it is, thank you!
Status: Fixed (was: Assigned)
> But whether we use /arch:AVX or /arch:AVX2 or stick to intrinsics
> exclusively, we'll still have the same ODR problems if any of this CPU-
> feature-dependent code is turned into a standalone function.

Isn't the problem with /arch:AVX, compared to intrinsics, that we lose control over which functions might contains AVX instructions? That is, wich /arch:AVX there is the risk that *any* inline function compiled that way might end up containing AVX instructions and then being used in non-AVX-safe paths. With intrinsics we get to hand choose which functions are at risk, and we can even rename or namespace those.
I think I understand what you're getting at, but no, this is not a problem in practice.   We only ever call any function from files compiled with funky compiler settings (-msse4.1, -mavx2, /arch:AVX, /arch:AVX2, whatever) after a runtime CPU check.  The other compilers that require, e.g., -mavx2 to use AVX2 intrinsics keep us honest there already.
We have a crasher on canary right now - it looks like Chrome crashes on all non-AVX compatible processors due to a variant on this issue...

 crbug.com/666707 
can i help somehow?

Comment 33 by fdd...@gmail.com, Nov 19 2016

For CEF i'm reorder all modules and going good all time. Unless you doesnt stop violate ODR - this trick solves problem in bith way: code is valid AND it is correctly inlined. Rewrite to be C++ compliant and no hacks need...
Canary crashes on start. Icon flashes then disappears. Same as October.
Project Member

Comment 35 by bugdroid1@chromium.org, Jan 12 2017

The following revision refers to this bug:
  https://skia.googlesource.com/skia.git/+/d32303e7276b8a824a4d95e4b1676f9eb201ed5e

commit d32303e7276b8a824a4d95e4b1676f9eb201ed5e
Author: Mike Klein <mtklein@chromium.org>
Date: Tue Jan 10 13:42:16 2017

disable runtime detected AVX2 raster pipelines

It's proving too difficult to keep on top of all the ways we might cause ODR violations that crash Chrome.  I'd rather focus on other ways of running the pipelines that won't have that particular problem.  Our -Fast bots will keep testing and benchmarking AVX2 raster pipelines.

BUG=chromium:679147, chromium:654213 ,chromium:664864, chromium:666707 ,etc.

CQ_INCLUDE_TRYBOTS=skia.primary:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD

Change-Id: I35ba8f5f4303107237fd78a6ce442d7c26e5fbef
Reviewed-on: https://skia-review.googlesource.com/6827
Commit-Queue: Mike Klein <mtklein@chromium.org>
Reviewed-by: Mike Klein <mtklein@chromium.org>

[modify] https://crrev.com/d32303e7276b8a824a4d95e4b1676f9eb201ed5e/src/opts/SkOpts_hsw.cpp

Project Member

Comment 36 by bugdroid1@chromium.org, Jan 12 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/9f158899c5bec9e547cbcc49aff6f087620f6129

commit 9f158899c5bec9e547cbcc49aff6f087620f6129
Author: skia-deps-roller <skia-deps-roller@chromium.org>
Date: Thu Jan 12 16:45:36 2017

Roll src/third_party/skia/ 0c2997b6d..d32303e72 (2 commits).

https://skia.googlesource.com/skia.git/+log/0c2997b6d85e..d32303e7276b

$ git log 0c2997b6d..d32303e72 --date=short --no-merges --format='%ad %ae %s'
2017-01-10 mtklein disable runtime detected AVX2 raster pipelines
2017-01-12 djsollen Remove defunct include/images directory from GN.

BUG=679147, 654213 ,664864, 666707 ,etc.

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls

CQ_INCLUDE_TRYBOTS=master.tryserver.blink:linux_trusty_blink_rel
TBR=kjlubick@google.com

Review-Url: https://codereview.chromium.org/2633483002
Cr-Commit-Position: refs/heads/master@{#443259}

[modify] https://crrev.com/9f158899c5bec9e547cbcc49aff6f087620f6129/DEPS

Now working after update 2976.5 

 

From: bugdro… via monorail [mailto:monorail+v2.3275348242@chromium.org] 
Sent: Friday, January 13, 2017 2:47 AM
To: denny.sabri@virginbroadband.com.au
Subject:  Issue 654213  in chromium: Chrome: Crash Report - SkNx<8,unsigned short>::SkNx<8,unsigned short>

 


Comment #36 on  issue 654213  by  <mailto:bugdroid1@chromium.org> bugdroid1@chromium.org: Chrome: Crash Report - SkNx<8,unsigned short>::SkNx<8,unsigned short>
 <https://bugs.chromium.org/p/chromium/issues/detail?id=654213#c36> https://bugs.chromium.org/p/chromium/issues/detail?id=654213#c36

The following revision refers to this bug:
 <https://chromium.googlesource.com/chromium/src.git/+/9f158899c5bec9e547cbcc49aff6f087620f6129> https://chromium.googlesource.com/chromium/src.git/+/9f158899c5bec9e547cbcc49aff6f087620f6129

commit 9f158899c5bec9e547cbcc49aff6f087620f6129
Author: skia-deps-roller < <mailto:skia-deps-roller@chromium.org> skia-deps-roller@chromium.org>
Date: Thu Jan 12 16:45:36 2017

Roll src/third_party/skia/ 0c2997b6d..d32303e72 (2 commits).

 <https://skia.googlesource.com/skia.git/+log/0c2997b6d85e..d32303e7276b> https://skia.googlesource.com/skia.git/+log/0c2997b6d85e..d32303e7276b

$ git log 0c2997b6d..d32303e72 --date=short --no-merges --format='%ad %ae %s'
2017-01-10 mtklein disable runtime detected AVX2 raster pipelines
2017-01-12 djsollen Remove defunct include/images directory from GN.

BUG=679147, 654213 ,664864, 666707 ,etc.

Documentation for the AutoRoller is here:
 <https://skia.googlesource.com/buildbot/+/master/autoroll/README.md> https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
 <http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls> http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls

CQ_INCLUDE_TRYBOTS=master.tryserver.blink:linux_trusty_blink_rel
 <mailto:TBR=kjlubick@google.com> TBR=kjlubick@google.com

Review-Url:  <https://codereview.chromium.org/2633483002> https://codereview.chromium.org/2633483002
Cr-Commit-Position: refs/heads/master@{#443259}

[modify]  <https://crrev.com/9f158899c5bec9e547cbcc49aff6f087620f6129/DEPS> https://crrev.com/9f158899c5bec9e547cbcc49aff6f087620f6129/DEPS

Comment 38 by fdd...@gmail.com, Jan 14 2017

It is hard to track all changes. I'm just repeat for clarification:

1. Code should met ODR. That means no nasty ifdefs should be used for same type. Every type/arch mod should lands on own namespace. As for me - this rule perfectly stupid, because every artifact and configuration are available during compilation (ltcg linking). They ignore dependencies and follow standard, but i'm again personally can't accept language which has specification which should be named: book of undefined behaviours. It just no make any sense. But returns to this issue...

2. Code should met ODR rule always, or unexpected problems will follow all time, unfortunately without warnings from compilers/linkers.

3. We initially hit in this with CEF with VS2015U3. Technically it is not compiler bug. More details on https://bitbucket.org/chromiumembedded/cef/issues/1999 . My workaround solves this WITHOUT touching codebase. Please read carefully to understand how bad code generated in worst case.

4. If code will not met to ODR - you will return to this issue all time. Before 2015U3 it had been work just by luck. On linux/gcc as i'm heard is more aggressive inlining and ignorance of ODR/arch = it work, but no one inspect so deep as i'm with windows (from info kn this issue).

As you see - the only way is make code ODR compliant, or no guarantees from any modern compiler. And not only skia can potentially hit in this. I'm hate all this situation just because it is clear initial developer ideas, but ODR-based code generation make them wrong. Of course it is very strange fir code which work correctly for years. Wonderful world. :)

Sign in to add a comment