
Issue 644525: Occasional Chrome Win builder compilation fail due to Windows kernel bug

Reported by lushnikov@chromium.org, Sep 6 2016 Project Member

Issue description

Over the period of the last 200 builds of the Chrome Win builder, three compilation errors with the same symptoms occurred:

https://build.chromium.org/p/chromium.chrome/builders/Google%20Chrome%20Win/builds/10530

https://build.chromium.org/p/chromium.chrome/builders/Google%20Chrome%20Win/builds/10602

https://build.chromium.org/p/chromium.chrome/builders/Google%20Chrome%20Win/builds/10609

Although the error is in the tracing component, the catapult team insists it has nothing to do with them.

Suspected CL: https://codereview.chromium.org/2239383002

Grigoriy, could you please take a look at this?
 

Comment 1 by kraynov@chromium.org, Sep 7 2016

Cc: primiano@chromium.org
Will take a look shortly

Comment 2 by primiano@chromium.org, Sep 7 2016

Uhm, it's quite odd. The command line of the protoc wrapper seems correct; if I copy/paste it on my machine it works.
The other odd thing is that this seems to be the only bot where this is flaky. I glanced through all the other bots and they never got flakes.
Any clue about what is special about that machine?

Comment 3 by primiano@chromium.org, Sep 7 2016

Cc: brettw@chromium.org
Adding Brett as he also made some changes recently to proto_library.gni.

Bit of a summary:
- there are at least two cases where *all* the protoc invocations fail with an access violation (-1073741819 == 0xc0000005) code. These are [1,2].
- this seems to happen only on *one* Windows bot (the one pointed out in #0). I looked around on the other Windows bots and this flake seems to happen only there. (It would be great if I could repro on my machine, but both kraynov and I tried without success.)
- So either there is something wrong with this bot (RAM? HDD?), or this is some subtle race which reproduces only there due to some odd timing reasons.
- in [1,2] *all* the protoc invocations fail. The command line itself passed to protoc_wrapper.py seems correct to me, so I'd rule out a bug in the gn changes. I'd also rule out a bug in protoc itself: I find it quite odd that we would hit such a hypothetical internal bug either "always", 100 times in a row, or "never".
- I downloaded the trace of the ninja build (attaching here) and the sequencing seems right: all the proto_gen steps happen after protoc.exe has been linked.
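
A sequencing check like the one in the last bullet can be sketched in a few lines. This is only an illustration, assuming the trace uses Trace Event Format complete events ("ph": "X", with "ts" and "dur" in microseconds). The event names and the steps_before_link_end helper are invented for the example and are not taken from the attached trace.

```python
def steps_before_link_end(events, link_name, step_marker):
    """Return the names of matching steps that started before the link step ended."""
    link = next(e for e in events if e["name"] == link_name)
    link_end = link["ts"] + link["dur"]
    return [
        e["name"]
        for e in events
        if step_marker in e["name"] and e["ts"] < link_end
    ]


# Hypothetical events, not real data from the bot.
events = [
    {"name": "protoc.exe", "ph": "X", "ts": 1000, "dur": 500},
    {"name": "gen/a.pb.h", "ph": "X", "ts": 1600, "dur": 50},
    {"name": "gen/b.pb.h", "ph": "X", "ts": 1700, "dur": 40},
]

# An empty list means every generation step started after the link finished.
early_steps = steps_before_link_end(events, "protoc.exe", ".pb.")
print(early_steps)
```

An empty result only shows that ninja scheduled the steps in the right order; it says nothing about whether the bytes of the executable were fully visible to the loader at that point.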

Is it possible that somehow protoc.exe is still being written after the linker returns (or is corrupted?) and we end up running a corrupt exe? I am not sure what to think here.

I would be very curious to pull the protoc.exe binary from the bot but there doesn't seem to be any easy way.

[1] https://build.chromium.org/p/chromium.chrome/builders/Google%20Chrome%20Win/builds/10530
[2] https://build.chromium.org/p/chromium.chrome/builders/Google%20Chrome%20Win/builds/10618/steps/compile/logs/stdio
trace-build.json
1.3 MB

Comment 4 by primiano@chromium.org, Sep 7 2016

Another possibility is that one of the recent changes to proto_library.gni changed the order of parameters and, even though the command line is correct, it triggers some internal protoc bug.
After having seen https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77497 this week I stopped believing in idempotence of args.
kraynov@ is now looking at diffing the protoc invocation args on windows to see if and what changed.

Comment 5 by scottmg@chromium.org, Sep 9 2016

Since I think those logs will soon disappear, here they are:

#10602

[8369/34405] ACTION //components/tracing/proto:protos_gen(//build/toolchain/win:x86)
FAILED: gen/components/tracing/proto/event.pbzero.h gen/components/tracing/proto/event.pbzero.cc gen/components/tracing/proto/events_chunk.pbzero.h gen/components/tracing/proto/events_chunk.pbzero.cc 
C:/b/depot_tools/python276_bin/python.exe ../../tools/protoc_wrapper/protoc_wrapper.py event.proto events_chunk.proto --protoc ./../Release/protoc --proto-in-dir ../../components/tracing/proto --plugin proto_zero_plugin.exe --plugin-out-dir gen/components/tracing/proto --plugin-options wrapper_namespace=pbzero
--plugin_out: protoc-gen-plugin: Plugin failed with status code 3221225477.

Protoc has returned non-zero status: 1 .


#10609

[8642/34410] ACTION //components/tracing/proto:protos_gen(//build/toolchain/win:x86)
FAILED: gen/components/tracing/proto/event.pbzero.h gen/components/tracing/proto/event.pbzero.cc gen/components/tracing/proto/events_chunk.pbzero.h gen/components/tracing/proto/events_chunk.pbzero.cc 
C:/b/depot_tools/python276_bin/python.exe ../../tools/protoc_wrapper/protoc_wrapper.py event.proto events_chunk.proto --protoc ./../Release/protoc --proto-in-dir ../../components/tracing/proto --plugin proto_zero_plugin.exe --plugin-out-dir gen/components/tracing/proto --plugin-options wrapper_namespace=pbzero
--plugin_out: protoc-gen-plugin: Plugin failed with status code 3221225477.

Protoc has returned non-zero status: 1 .


#10618 (too big for here)

https://gist.github.com/sgraham/d6b447a018a93fdfa2efb1c64b793ad6


All GPFs (the first two printed unsigned, the last signed).
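
The two numbers that appear in these logs, 3221225477 and -1073741819, are the same 32-bit value printed as unsigned and signed respectively. A small sketch of the conversion follows; the helper names are made up for illustration.

```python
STATUS_ACCESS_VIOLATION = 0xC0000005  # NTSTATUS code for an access violation


def to_unsigned32(code):
    """Interpret an exit code as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF


def to_signed32(code):
    """Interpret an exit code as a signed 32-bit value."""
    code &= 0xFFFFFFFF
    return code - 0x100000000 if code & 0x80000000 else code


for raw in (3221225477, -1073741819):
    print(raw, "->", hex(to_unsigned32(raw)))
```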

All 3 built protoc.exe during the build.

I tried running this https://gist.github.com/sgraham/25e3c7a8bc51f2cf7f81d1fce739b196 for a while with similar build settings, but no luck on reproing a crash so far. I don't have goma on though.

It also doesn't seem to have happened in the last 48h.

Comment 6 by primiano@chromium.org, Sep 9 2016

The other thing to point out, is that in 2 cases mentioned in [3] the failure has been totally unrelated with the plugin and tracing, and I have seen general GPF of conventional protoc executions:

FAILED: gen/chrome/browser/profile_resetter/profile_reset_report.pb.h gen/chrome/browser/profile_resetter/profile_reset_report.pb.cc pyproto/chrome/browser/profile_resetter/profile_reset_report_pb2.py 
C:/b/depot_tools/python276_bin/python.exe ../../tools/protoc_wrapper/protoc_wrapper.py profile_reset_report.proto --protoc ./../Release/protoc --proto-in-dir ../../chrome/browser/profile_resetter --cc-out-dir gen/chrome/browser/profile_resetter --py-out-dir pyproto/chrome/browser/profile_resetter
Protoc has returned non-zero status: -1073741819 .

Similarly, *all* other protoc invocations failed.
I wonder if the goma cache somehow got polluted with the wrong exes?

Comment 7 by bugdroid1@chromium.org, Sep 9 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e4dfc3b51a21c5c771efe6108d0e75c4239706ee

commit e4dfc3b51a21c5c771efe6108d0e75c4239706ee
Author: kraynov <kraynov@chromium.org>
Date: Fri Sep 09 09:46:29 2016

Reorder protobuf compiler command line arguments.

Attempt to chase non-repoducible bug. CL crrev.com/2239383002 has
changed a command line for protoc and there are rare occasional
crashes on Win32. It could be because of some bug in protoc's parser.
This change restores a command line like before the suspecting CL
only for the most common case with only one proto file in a library.

BUG= 644525 

Review-Url: https://codereview.chromium.org/2324823002
Cr-Commit-Position: refs/heads/master@{#417544}

[modify] https://crrev.com/e4dfc3b51a21c5c771efe6108d0e75c4239706ee/tools/protoc_wrapper/protoc_wrapper.py

Comment 8 by picksi@chromium.org, Sep 12 2016

If, as I recall, this turned out to be a cache not being correctly cleared, can we file a bug to update cache dependencies to stop this happening again? If we already did this can we link it here? Thanks.

Comment 9 by primiano@chromium.org, Sep 12 2016

Status: WontFix (was: Assigned)
We don't know for sure what the real problem was. We ended up suspecting some pollution of the goma cache, as in some instances *all* the executions of protoc failed.
The bot seems reliably green (the failures on Sep 8 are unrelated).
Marking this as WontFix. Reopen if the issue shows up again.

Comment 10 by sullivan@chromium.org, Sep 12 2016

Cc: dominicc@chromium.org kraynov@chromium.org lushnikov@chromium.org
 Issue 644497  has been merged into this issue.

Comment 11 by kraynov@chromium.org, Sep 22 2016

Cc: hbos@chromium.org
 Issue 648566  has been merged into this issue.

Comment 12 by kraynov@chromium.org, Sep 22 2016

Status: Unconfirmed (was: WontFix)

Comment 13 by primiano@chromium.org, Sep 22 2016

Status: Assigned (was: Unconfirmed)
Hmm this has happened again in  Issue 648566 , and again in https://build.chromium.org/p/chromium.chrome/builders/Google%20Chrome%20Win/builds/11038/steps/compile/logs/stdio *all* instances of protoc failed.

kraynov@ can you please sync and gn before and after your changes on windows and see what changed in the .ninja files with your CL?
I wonder if you dropped some dependency as a side-effect there.

Comment 14 by thomasanderson@chromium.org, Sep 22 2016

Labels: -OS-Linux OS-Windows

Comment 15 by primiano@chromium.org, Sep 23 2016

Blockedon: 649702

Comment 17 by primiano@chromium.org, Sep 26 2016

I asked to have the bot replaced, to rule out some weird HW issue.
Out of curiosity, is this bot part of the standard sheriff rotation? It would be great if I could rdesktop into it when this issue happens, to see if the binary is sane or whatnot.

Comment 18 by xyzzyz@chromium.org, Sep 26 2016

Cc: xyzzyz@chromium.org

Comment 20 by lushnikov@chromium.org, Oct 11 2016

Cc: -lushnikov@chromium.org

Comment 21 by kraynov@chromium.org, Oct 18 2016

Trying to go with this approach crrev.com/2427943002

Comment 22 by bugdroid1@chromium.org, Oct 20 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/10009d80608ae4f8c40f60a5a768cf00f7169bf4

commit 10009d80608ae4f8c40f60a5a768cf00f7169bf4
Author: kraynov <kraynov@chromium.org>
Date: Thu Oct 20 02:03:59 2016

Explicit dependency on executables in proto_library.gni.

GN uses wrapper script to invoke protobuf compiler and
script depends on it using deps variable in GN action.
This change makes this script also dependent on protoc
(or plugin) executable explicitly.

BUG= 644525 

Review-Url: https://chromiumcodereview.appspot.com/2427943002
Cr-Commit-Position: refs/heads/master@{#426374}

[modify] https://crrev.com/10009d80608ae4f8c40f60a5a768cf00f7169bf4/third_party/protobuf/proto_library.gni

Comment 23 by kraynov@chromium.org, Oct 26 2016

Blockedon: -649702
NextAction: 2016-11-21
Okay, will wait and see.
It's now 6 days after the last "fix" and nothing has failed.
Going to close on 21 Nov 2016 if no failures occur.

Comment 24 by kraynov@chromium.org, Nov 11 2016

No failures between 20 Oct and 11 Nov.
Will monitor until 21 Nov 2016.

Comment 25 by kraynov@chromium.org, Nov 21 2016

Status: Verified (was: Assigned)
This has not happened for 1 month since this fix: https://crrev.com/2427943002
Closing the bug :)

Comment 26 by thakis@chromium.org, Sep 3 2017

Status: Started (was: Verified)
Still (or again?) seeing this on https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Ftryserver.chromium.win%2Fwin_chrome_official%2F196%2F%2B%2Frecipes%2Fsteps%2Fcompile__with_patch_%2F0%2Fstdout

FAILED: gen/components/tracing/proto/event.pbzero.h gen/components/tracing/proto/event.pbzero.cc gen/components/tracing/proto/events_chunk.pbzero.h gen/components/tracing/proto/events_chunk.pbzero.cc 
E:/b/depot_tools/win_tools-2_7_6_bin/python/bin/python.exe ../../tools/protoc_wrapper/protoc_wrapper.py event.proto events_chunk.proto --protoc ./protoc.exe --proto-in-dir ../../components/tracing/proto --plugin proto_zero_plugin.exe --plugin-out-dir gen/components/tracing/proto --plugin-options wrapper_namespace=pbzero
--plugin_out: protoc-gen-plugin: Plugin failed with status code 3221225477.
Protoc has returned non-zero status: 1 .

Comment 27 by kraynov@chromium.org, Sep 4 2017

Ooooh, things get really interesting.

When I worked on this bug a year ago it was tricky to figure out what was going wrong with deps, but I fixed it in https://codereview.chromium.org/2427943002
The fact that this fix worked supposedly proves that the problem was a missing protoc plugin executable.

Now we have these lines (from fix above):
if (generate_with_plugin) {
  inputs += [ plugin_path ]
  ...
  if (defined(plugin_host_label)) {
    # Action depends on native generator plugin but for host toolchain only.
    deps += [ plugin_host_label ]
  }
}

I really suspect this change caused the problem to happen again
https://chromium-review.googlesource.com/584832

If so, it means that the action to generate protobuf stubs really relies on having the protoc plugin (and, I guess, protoc itself) in *data_deps*. And since *deps* on executables are no longer treated as an action's *data_deps* (aka runtime deps) by default, the failure happens again.

Two questions:
1. It seems easy to fix by adding data_deps. But does it really make sense?
2. Why Windows?

Either it's a GN bug or we should add data_deps in proto_library.gni.

Comment 28 by kraynov@chromium.org, Sep 4 2017

Cc: agrieve@chromium.org

Comment 29 by kraynov@chromium.org, Sep 4 2017

Sorry for the misleading comment #27.
Actually that should be the case only if we have shared libraries.
I will check it with the GN args used on that builder.
Please ignore comment #27 until further update.

Comment 30 by kraynov@chromium.org, Sep 4 2017

Owner: agrieve@chromium.org
It doesn't seem that there are any DLLs the protobuf plugin depends on.
Tried on Windows with these flags:
is_debug = false
is_official_build = true
strip_absolute_paths_from_debug_symbols = true
target_cpu = "x86"

agrieve@ could you help me with that, any ideas? Thanks!

Comment 31 by tapted@chromium.org, Oct 27 2017

Issue 765323 has been merged into this issue.

Comment 32 by brucedaw...@chromium.org, Nov 17 2017

Cc: jojwang@chromium.org jbudorick@chromium.org dpranke@chromium.org brucedaw...@chromium.org no...@chromium.org
 Issue 782128  has been merged into this issue.

Comment 33 by brucedaw...@chromium.org, Nov 17 2017

I hit this on one of my workstations and was able to investigate. In this case it was genstring.exe that was crashing. When I ran it it crashed in mainCRTStartup and the assembly language looked like this:

000000014000109B 00 00                add         byte ptr [rax],al  
000000014000109D 00 00                add         byte ptr [rax],al  
000000014000109F 00 00                add         byte ptr [rax],al  
00000001400010A1 00 00                add         byte ptr [rax],al  
00000001400010A3 00 00                add         byte ptr [rax],al  
mainCRTStartup:
00000001400010A5 00 00                add         byte ptr [rax],al  
00000001400010A7 00 00                add         byte ptr [rax],al  
00000001400010A9 00 00                add         byte ptr [rax],al  
00000001400010AB 00 00                add         byte ptr [rax],al  
00000001400010AD 00 00                add         byte ptr [rax],al  
_get_startup_commit_mode:
00000001400010AF 00 00                add         byte ptr [rax],al  
00000001400010B1 00 00                add         byte ptr [rax],al  
00000001400010B3 00 00                add         byte ptr [rax],al  
00000001400010B5 00 00                add         byte ptr [rax],al  
00000001400010B7 00 00                add         byte ptr [rax],al  

I then forced a relink (no recompilation) and on the next run it worked and the code for mainCRTStartup looked like this:

__GSHandlerCheckCommon:
00000001400010A0 E9 1B 3F 00 00       jmp         __GSHandlerCheckCommon (0140004FC0h)  
mainCRTStartup:
00000001400010A5 E9 B6 22 00 00       jmp         mainCRTStartup (0140003360h)  
__scrt_get_dyn_tls_dtor_callback:
00000001400010AA E9 21 34 00 00       jmp         __scrt_get_dyn_tls_dtor_callback (01400044D0h)  
_get_startup_commit_mode:
00000001400010AF E9 BC 32 00 00       jmp         _get_startup_commit_mode (0140004370h)  

What's going on is that this is an array of five-byte thunks, used in incremental linking to let the linker move functions around easily. In the bad builds the thunks are all zeroes which tends to be crashy.
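
To make the thunk layout concrete: each good thunk above is an E9 opcode (jmp rel32) followed by a little-endian 32-bit displacement, and the target is the address of the next instruction plus that displacement. The sketch below decodes the bytes from the listings; decode_thunk is a hypothetical helper written for this illustration.

```python
import struct


def decode_thunk(address, code):
    """Return the jump target of a 5-byte `jmp rel32` thunk, or None if it is not one."""
    if len(code) != 5:
        raise ValueError("an incremental-linking thunk is five bytes long")
    if code[0] != 0xE9:
        return None  # e.g. an all-zero thunk from a bad build
    (rel32,) = struct.unpack("<i", code[1:])
    return address + 5 + rel32


good = decode_thunk(0x1400010A5, bytes.fromhex("E9B6220000"))
bad = decode_thunk(0x1400010A5, bytes(5))
print(hex(good), bad)
```

The decoded target for the good mainCRTStartup thunk matches the 0140003360h shown by the disassembler, while the zeroed thunk has no jump at all and the CPU runs `add byte ptr [rax],al` instead.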

So...

1) It's not a compiler bug. The object files are fine because relinking fixes the issue. But we already knew this because the bug happened with both VC++ and clang
2) It is an incremental linking linker bug.


It was pointed out on one of the many other related bugs that we could avoid this bug by disabling incremental linking on the affected binaries:

+    if (is_win) {
+      configs -= [ "//build/config/win:default_incremental_linking" ]
+      configs += [ "//build/config/win:no_incremental_linking" ]
+    }

They are all small enough that incremental linking buys us little and, apparently, costs us a lot. Somebody (agrieve@ or I) should apply this to the list of binaries where this error is happening. I'll file a VS bug first but I want to wait for their current bug tracker outage to end so that I can see if I have already filed this.

Comment 34 by brucedaw...@chromium.org, Nov 17 2017

Owner: brucedaw...@chromium.org
I'll grab this.

Comment 35 by brucedaw...@chromium.org, Nov 17 2017

Binaries to disable incremental linking for:

protoc.exe (by far the most frequent victim)
genstring.exe
brotli.exe
yasm.exe
mksnapshot.exe

Most of these binaries are tiny - less than 1.5 MB - so incremental linking is not important. protoc.exe is 4 MB but it hits this issue more than any other binary so the fix *must* be applied to it. mksnapshot.exe may be hitting this issue very rarely, and it is 17 MB and its linking can be a build bottleneck so I am *not* going to apply the fix to it.

crrev.com/c/777764 applies this fix to the four binaries promised, plus some other similar (small and listed in the same BUILD.gn files) binaries in order to increase coverage.

Comment 36 by bugdroid1@chromium.org, Nov 20 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/1942fd8a7fe9fc609f51ef1fbff210ba5f356415

commit 1942fd8a7fe9fc609f51ef1fbff210ba5f356415
Author: Bruce Dawson <brucedawson@chromium.org>
Date: Mon Nov 20 23:33:47 2017

Disable incremental linking for some tools

We occasionally get build crashes because binaries (usually protoc.exe,
but others as well) are generated incorrectly. The symptom is that the
incremental linking thunks contain all zeroes instead of a branch
instruction, leading to crashes, usually access violations. This is
presumed to be a bug in the MSVC++ incremental linker.

This turns off incremental linking for four of the binaries that hit
this issue most frequently, and some of their neighbors. These binaries
are all small enough that incremental linking is not important so there
is no real downside to making this change.

Testing over the weekend shows that this error, or something very like
it, can happen even with incremental linking disabled. I hope that this
will reduce the frequency of the failures and there is no downside so
I'm going to proceed and see if it helps.

Bug:  644525 
Cq-Include-Trybots: master.tryserver.chromium.android:android_cronet_tester;master.tryserver.chromium.mac:ios-simulator-cronet
Change-Id: I0a9b33b0ad8335868e8e6f227f9a21e5ddeff6e4
Reviewed-on: https://chromium-review.googlesource.com/777764
Commit-Queue: Bruce Dawson <brucedawson@chromium.org>
Reviewed-by: John Abd-El-Malek <jam@chromium.org>
Cr-Commit-Position: refs/heads/master@{#517990}
[modify] https://crrev.com/1942fd8a7fe9fc609f51ef1fbff210ba5f356415/net/tools/transport_security_state_generator/BUILD.gn
[modify] https://crrev.com/1942fd8a7fe9fc609f51ef1fbff210ba5f356415/third_party/brotli/BUILD.gn
[modify] https://crrev.com/1942fd8a7fe9fc609f51ef1fbff210ba5f356415/third_party/protobuf/BUILD.gn
[modify] https://crrev.com/1942fd8a7fe9fc609f51ef1fbff210ba5f356415/third_party/yasm/BUILD.gn

Comment 37 by brucedaw...@chromium.org, Nov 21 2017

I *hope* that this fixes this bug, or at least lowers the frequency. It seems quite possible that it won't fix it, in which case paying attention to whether the frequency of these issues has dropped is important.

And, capturing a bad binary and its .pdb file is also important.

Comment 38 by mbonadei@chromium.org, Nov 21 2017

Cc: mbonadei@chromium.org
Hi,

it seems that the WebRTC win32_asan trybot started to fail after WebRTC rolled https://chromium.googlesource.com/chromium/src.git/+/1942fd8a7fe9fc609f51ef1fbff210ba5f356415.

FAILED: obj/third_party/openh264/openh264_common_yasm/asm_inc.o 
C:/b/depot_tools/win_tools-2_7_6_bin/python/bin/python.exe ../../third_party/yasm/run_yasm.py ./yasm -DPREFIX -fwin32 -m x86 -I../../third_party/openh264/src/codec/api/svc -I../../third_party/openh264/src/codec/common/inc -I../../third_party/openh264/src/codec/common/src -I. -I../.. -Igen -DX86_32 -o obj/third_party/openh264/openh264_common_yasm/asm_inc.o ../../third_party/openh264/src/codec/common/x86/asm_inc.asm
==5684==ERROR: AddressSanitizer failed to allocate 0x16000000 (369098752) bytes at 0x3a000000 (error code: 1455)
==5684==ReserveShadowMemoryRange failed while trying to map 0x16000000 bytes. Perhaps you're using ulimit -v

And

FAILED: obj/third_party/libvpx/libvpx_yasm/sad_sse4.o 
C:/b/depot_tools/win_tools-2_7_6_bin/python/bin/python.exe ../../third_party/yasm/run_yasm.py ./yasm -DPREFIX -fwin32 -m x86 -I../../third_party/libvpx/source/config -I../../third_party/libvpx/source/config/win/ia32 -I../../third_party/libvpx/source/libvpx -I. -I../.. -Igen -DCHROMIUM -o obj/third_party/libvpx/libvpx_yasm/sad_sse4.o ../../third_party/libvpx/source/libvpx/vpx_dsp/x86/sad_sse4.asm
==4116==ERROR: AddressSanitizer failed to allocate 0x16000000 (369098752) bytes at 0x3a000000 (error code: 1455)
==4116==ReserveShadowMemoryRange failed while trying to map 0x16000000 bytes. Perhaps you're using ulimit -v

Can this be related?

Comment 39 by mbonadei@chromium.org, Nov 22 2017

Blocking: webrtc:8562

Comment 40 by mbonadei@chromium.org, Nov 22 2017

The commit rolled again and win32_asan is not finding anything now. Feel free to ignore the previous 2 messages.

Comment 41 by jojwang@google.com, Nov 22 2017

Cc: -jojwang@chromium.org

Comment 42 by brucedaw...@chromium.org, Dec 4 2017

I just hit this again on my workstation so I had a chance to run the broken executable and see what it looked like. The crash call stack looked like this:

>	protoc.exe!google::protobuf::internal::Acquire_Load(int const volatile *)
 	protoc.exe!google::protobuf::GoogleOnceInit(int *,void (*)(void))
 	protoc.exe!google::protobuf::compiler::protobuf_google_2fprotobuf_2fcompiler_2fplugin_2eproto::AddDescriptors(void)
 	protoc.exe!google::protobuf::compiler::protobuf_google_2fprotobuf_2fcompiler_2fplugin_2eproto::StaticDescriptorInitializer::StaticDescriptorInitializer(void)
 	protoc.exe!google::protobuf::compiler::protobuf_google_2fprotobuf_2fcompiler_2fplugin_2eproto::AddDescriptors(void)
 	protoc.exe!google::protobuf::compiler::CodeGeneratorResponse::~CodeGeneratorResponse(void)
 	ucrtbased.dll!_initterm(void(*)() * first, void(*)() * last)
 	protoc.exe!__scrt_common_main_seh()
 	protoc.exe!__scrt_common_main()
 	protoc.exe!mainCRTStartup()
 	kernel32.dll!@BaseThreadInitThunk@12()
 	ntdll.dll!__RtlUserThreadStart()
 	ntdll.dll!__RtlUserThreadStart@8()

and the crashing function looked like this:

google::protobuf::internal::Acquire_Load:
006006E0 00 00                add         byte ptr [eax],al  
006006E2 00 00                add         byte ptr [eax],al  
006006E4 00 00                add         byte ptr [eax],al  
006006E6 00 00                add         byte ptr [eax],al  
006006E8 00 00                add         byte ptr [eax],al  
006006EA 00 00                add         byte ptr [eax],al  
006006EC 00 00                add         byte ptr [eax],al  
006006EE 00 00                add         byte ptr [eax],al  
006006F0 00 00                add         byte ptr [eax],al  
006006F2 00 00                add         byte ptr [eax],al  
006006F4 00 00                add         byte ptr [eax],al  
006006F6 00 00                add         byte ptr [eax],al  
006006F8 00 00                add         byte ptr [eax],al  
006006FA 00 00                add         byte ptr [eax],al  
006006FC 00 00                add         byte ptr [eax],al  
006006FE 00 00                add         byte ptr [eax],al 

So, disabling incremental linking did not solve the problem.

However, it may have reduced the frequency. I looked at the last 473 builds (from 24038 when my change landed to 24510, the most recent) and there were only 2 failures caused by this issue (24058 and 24223), compared to 3 in 200 builds (according to the original complaint).

Much may have happened in 15 months to change the frequency so it's hard to know how to interpret this - maybe it's better?

Comment 43 by no...@chromium.org, Dec 5 2017

Cc: -no...@chromium.org

Comment 44 by brucedaw...@chromium.org, Dec 11 2017

Cc: bashi@chromium.org sergeyu@chromium.org pfeldman@chromium.org ksakamoto@chromium.org eustas@chromium.org
 Issue 739916  has been merged into this issue.

Comment 45 by eustas@chromium.org, Dec 11 2017

FWIW: when brotli.exe is linked, the following warning is emitted:
"LINK : /LTCG specified but no code generation required; remove /LTCG from the link command line to improve linker performance"

Comment 46 by brucedaw...@chromium.org, Dec 11 2017

The LTCG warning is not related. It is impractical to get LTCG settings completely in sync so this type of warning is pretty much unavoidable. It will go away when we switch to clang-cl as we won't be trying to use LTCG anymore.

Comment 47 by kbr@chromium.org, Dec 15 2017

Blocking: 793708

Comment 48 by mbonadei@chromium.org, Dec 18 2017

Blocking: -webrtc:8562
Cc: -mbonadei@chromium.org

Comment 49 by brucedaw...@chromium.org, Dec 18 2017

I just hit this bug on my local machine while using use_lld=true. That makes me very confused.

Initially the bug happened with the VC++ compiler and linker. When I reproduced the bug locally I found that relinking would resolve the issue. That together with the fact that the bug continued with clang-cl and the VC++ linker strongly suggested that it was a linker bug. Initially I thought it was an incremental linking bug, but disabling incremental linking failed to resolve the issue.

But now it has happened (albeit only once) with the lld linker.

I just checked the results of the last 500 builds here:

https://ci.chromium.org/buildbot/chromium.chrome/Google%20Chrome%20Win/?limit=500

This was from 24523 to 25022. There were twelve build failures and one exception. Eight of the build failures appeared to be from this bug (brotli.exe or mksnapshot).

So, 8/500 as of this date. Let's see what happens to the frequency if/when we switch to use_lld = true.
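
For a rough sense of how much these tallies can say, the sketch below puts a 95% Wilson score interval around each failure rate quoted in this thread (3/200, 2/473 and 8/500). The choice of a Wilson interval is an assumption made for illustration and is not something anyone on this bug computed. It also treats builds as independent, which may not hold.

```python
import math


def wilson_interval(failures, builds, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    p = failures / builds
    denom = 1 + z * z / builds
    center = (p + z * z / (2 * builds)) / denom
    half = z * math.sqrt(p * (1 - p) / builds + z * z / (4 * builds * builds)) / denom
    return center - half, center + half


samples = {
    "original report": (3, 200),
    "builds 24038 to 24510": (2, 473),
    "builds 24523 to 25022": (8, 500),
}

intervals = {}
for label, (failures, builds) in samples.items():
    intervals[label] = wilson_interval(failures, builds)
    low, high = intervals[label]
    print(f"{label}: {failures / builds:.2%} (interval {low:.2%} to {high:.2%})")
```

With counts this small the three intervals overlap, so the tallies alone do not establish that the failure rate changed in either direction.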

Comment 50 by brucedaw...@chromium.org, Dec 18 2017

I just hit another one of these locally, this time while linking with VC++ and compiling with clang.

I took the opportunity to debug and found that genperf.exe and re2c.exe, which were both crashing multiple times during the build, worked fine under the debugger and seemed to create reasonable results.

The initial bug was a bug with instructions being replaced in the binary with zeroes. However the problem is that 0xC0000005 (the hex equivalent of the -1073741819 error code we see most often) just means access violation and can be caused by any number of errors.

The next step is probably to dump more information when these failures happen: a symbolized call stack, the address being dereferenced, the faulting instruction, and perhaps the code bytes near the instruction pointer. This should give us enough clues to let us figure out whether this is one bug or many.
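
As a much smaller first step than a symbolized call stack, a wrapper could at least record a decoded exit status and a hash of the exact binary that failed, so that failures can be correlated later. The sketch below is hypothetical and is not how protoc_wrapper.py behaves; run_and_report and describe_exit are invented names, and the demo runs the Python interpreter rather than a build tool.

```python
import hashlib
import subprocess
import sys

STATUS_ACCESS_VIOLATION = 0xC0000005


def describe_exit(returncode):
    """Render an exit code, naming the access-violation status when it matches."""
    unsigned = returncode & 0xFFFFFFFF
    if unsigned == STATUS_ACCESS_VIOLATION:
        return "access violation (0xC0000005)"
    return f"exit code {returncode} (0x{unsigned:08X})"


def run_and_report(command):
    """Run command; on failure return a line with the status and the binary's SHA-256."""
    result = subprocess.run(command)
    if result.returncode == 0:
        return None
    # command[0] is assumed to be a path to the executable itself.
    with open(command[0], "rb") as binary:
        digest = hashlib.sha256(binary.read()).hexdigest()
    return f"{command[0]}: {describe_exit(result.returncode)}, sha256={digest}"


# Demo with a process that deliberately exits with status 3.
report = run_and_report([sys.executable, "-c", "import sys; sys.exit(3)"])
print(report)
```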

Comment 52 by brucedaw...@chromium.org, Jan 8 2018

Cc: zturner@chromium.org
I have carefully eliminated all possible causes of this bug and can therefore conclude that it is not happening and we must be experiencing mass hysteria.

In my most recent test I did clean builds in a loop until one of them failed, with genmodule.exe crashing. My build settings were:

is_component_build = true
is_debug = true
target_cpu = "x86"
enable_nacl = false
remove_webcore_debug_symbols = true
symbol_level = 1
use_lld = true
use_jumbo_build = true
is_clang = true

Note in particular that the VC++ compiler and VC++ linker have both been eliminated as causes because both are disabled. Incremental linking is eliminated as a cause because lld doesn't support incremental linking. Goma is eliminated as a cause because it is disabled. And yet, genmodule crashed, with a long run of zero-byte instructions in mainCRTStartup, as shown below:

0:000> kc
  *** Stack trace for last set context - .thread/.cxr resets it
 # 
00 genmodule!mainCRTStartup
01 ntdll!__RtlUserThreadStart
02 ntdll!_RtlUserThreadStart

0:000> uf eip
Flow analysis was incomplete, some code may be missing
genmodule!mainCRTStartup [f:\dd\vctools\crt\vcstartup\src\startup\exe_main.cpp @ 15]:
   15 00408880 0000            add     byte ptr [eax],al
   15 00408882 0000            add     byte ptr [eax],al
   16 00408884 0000            add     byte ptr [eax],al
   16 00408886 0000            add     byte ptr [eax],al
   17 00408888 0000            add     byte ptr [eax],al
0040888a 0000            add     byte ptr [eax],al
0040888c 0000            add     byte ptr [eax],al
0040888e 0000            add     byte ptr [eax],al
  274 00408890 0000            add     byte ptr [eax],al
  274 00408892 0000            add     byte ptr [eax],al
  275 00408894 0000            add     byte ptr [eax],al
  275 00408896 0000            add     byte ptr [eax],al
... (goes on for a *long* time)

So, I deleted genmodule.exe and relinked. No other build steps ran. This should have produced identical results because it was initially a clean build, and yet, the relink fixed the issue.

Here is what the stack and disassembly looked like in the successful run:

0:000> kc
 # 
00 genmodule!mainCRTStartup
01 kernel32!BaseThreadInitThunk
02 ntdll!__RtlUserThreadStart
03 ntdll!_RtlUserThreadStart
0:000> uf eip
genmodule!mainCRTStartup [f:\dd\vctools\crt\vcstartup\src\startup\exe_main.cpp @ 15]:
   15 00408880 55              push    ebp
   15 00408881 8bec            mov     ebp,esp
   16 00408883 e8e8ffffff      call    genmodule!__scrt_common_main (00408870)
   17 00408888 5d              pop     ebp
   17 00408889 c3              ret

Note that mainCRTStartup comes from the C runtime, from a static .lib file that is part of the depot tools toolchain package and which never changes.


Now that we are starting to move towards our own linker we have some additional options for tracking down this bug - custom instrumentation in the linker itself. Any thoughts on that zturner@?

I have attached the broken genmodule.exe and genmodule.exe.pdb files, FWIW.
genmodule.exe.pdb
324 KB

Comment 53 by zturner@google.com, Jan 8 2018

Must be Spectre / Meltdown.

On a more serious note though, yes we are only limited by our imagination as to what we can add to the /verbose linker option.

Comment 54 by brucedaw...@chromium.org, Jan 8 2018

This is my fancy buildtests.bat script in case anybody else wants to leave this running overnight:

set basesettings="goma_dir=\"C:\src\goma\goma-win64\" is_component_build=true is_debug=true target_cpu=\"x86\" enable_nacl=false remove_webcore_debug_symbols=true
set testsettings=symbol_level=1 use_goma=false use_lld=true use_jumbo_build=true is_clang=true

:restart
@echo on
call gn gen out\BuildTest --args=%basesettings% %testsettings%" >nul
@echo on
call gn clean out\BuildTest & call gn gen out\BuildTest
@echo on
call ninja -C out\BuildTest chrome
@echo on
@if errorlevel 1 goto buildfailure
goto restart
:buildfailure
@echo Hey, the build failed. Huh.

Comment 55 by brucedaw...@chromium.org, Jan 8 2018

I decided to take a second look at the corrupt version of genmodule.exe to get a better sense of how widespread the all-zero instructions were. I ran it under the debugger and... it ran just fine. I stepped in and looked at the previously crashing function and it looked just fine. To be specific, the code bytes in the file are *fine*. The problem is that when the binary got loaded during the build some of the bytes showed up as zero.

I then added genmodule.exe to my local symbol store, loaded the crash dump, and ran !chkimg:

0:000> !chkimg
9322 errors : @$ip (00407000-0040b67b)

Well yeah, having 9322 code bytes that are incorrect in memory is likely to cause crashes. So, the problem happens when loading the executable. It feels like the executable is getting loaded and run before the writes to the binary have completed, which brings us back to some sort of weird OS problem.

As a side note, while looking at the binary metadata I noticed that the time date stamp is zero with lld. I assume that this is an effort to get reproducible builds. We should probably instead follow the guidelines of the Windows 10 build system:
https://blogs.msdn.microsoft.com/oldnewthing/20180103-00/?p=97705



So, is this perhaps a ninja race condition of some sort? Or an OS bug, somehow tickled by whatever anti-malware software we have running on my desktop and on our build machines (which I have been told are running none)? It smells like an OS bug but I don't know how to zero in on the specifics.

Comment 56 by zturner@google.com, Jan 8 2018

Definitely not an AV bug, because I've seen this on my local machine and I'm off corp, so I *know* I don't have AV installed.

When it's happening, what happens if you run `sync` (from sysinternals suite) and then try again?

Comment 57 by brucedaw...@chromium.org, Jan 10 2018

I generated a list of all of the .exe files generated during a build of the chrome.exe target, not including chrome.exe itself. This should represent the set of binaries that are vulnerable to this bug and it does indeed seem to map well to the binaries that have caused failures:

brotli.exe
flatc.exe
character_data_generator.exe
genmodule.exe
genmacro.exe
genstring.exe
genversion.exe
genperf.exe
re2c.exe
protoc.exe
yasm.exe
transport_security_state_generator.exe
mksnapshot.exe
viz.service.exe
ui.service.exe
v8_context_snapshot_generator.exe

There are a total of 16 of them.

Comment 58 by brucedaw...@chromium.org, Jan 11 2018

Cc: thakis@chromium.org peria@chromium.org h...@chromium.org kbr@chromium.org yangguo@chromium.org
 Issue 793708  has been merged into this issue.

Comment 59 by brucedaw...@chromium.org, Jan 11 2018

 Issue 727649  has been merged into this issue.

Comment 60 by brucedaw...@chromium.org, Jan 13 2018

Cc: mstensho@chromium.org thomasanderson@chromium.org futhark@chromium.org p...@chromium.org
 Issue 772827  has been merged into this issue.

Comment 61 by peria@chromium.org, Jan 16 2018

Cc: -peria@chromium.org

Comment 62 by brucedaw...@chromium.org, Jan 17 2018

I did 1,000+ builds over the long weekend to better understand this issue.

Normal builds - 7 failures in 200 builds
7-second sleep after creating .exes - 6 failures in ~300 builds (maybe better, maybe just random)
sync after creating .exes - 0 failures in 558 builds

So, we have a 2-3.5% failure rate normally, and calling sysinternals sync.exe drops the failure rate to 0% with a high degree of confidence.

I've attempted to reproduce this bug outside of a Chrome build by running a program that repeatedly writes the contents of protoc.exe to protoc%04d.exe and then runs that executable, monitoring for crashes, while randomly changing the thread affinity before each call to CreateFile and each call to CreateProcess. Unfortunately I cannot repro the bug.
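
The repro attempt described above could look roughly like the sketch below. It is not the actual test program: the random thread-affinity changes are left out, and the `repro_loop` name and its `launcher` parameter are hypothetical, with `launcher` present only so the loop can be tried with a script instead of a real executable.

```python
import os
import shutil
import subprocess


def repro_loop(source_exe, work_dir, iterations, launcher=(), args=()):
    """Copy source_exe to numbered files, run each copy, return the failures.

    Hypothetical helper: each iteration writes a fresh copy named
    protocNNNN.exe and runs it immediately, which is the write-then-execute
    pattern that the build hits.
    """
    failures = []
    for i in range(iterations):
        target = os.path.join(work_dir, "protoc%04d.exe" % i)
        # Write the bytes (and permission bits) of the source to a new file.
        shutil.copy(source_exe, target)
        result = subprocess.run(
            list(launcher) + [target] + list(args),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            # Keep the failing copy on disk so it can be inspected later.
            failures.append((target, result.returncode))
        else:
            os.remove(target)
    return failures
```

On Windows a crash of the copied binary would show up here as a large negative return code such as -1073741819.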

The next step is to try a custom version of lld that calls FlushFileBuffers on the target, which is similar to what sync.exe does, but doesn't require administrator privileges.

Another test is to use a stand-alone program to call FlushFileBuffers on the target after linking completes. This hack would have the advantage of also working with link.exe.
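
A stand-alone flush helper along those lines might be as small as the sketch below. It relies on Python's `os.fsync`, which the Python documentation describes as using the MS `_commit()` function on Windows, rather than calling FlushFileBuffers directly. That substitution and the `flush_to_disk` name are assumptions of this sketch.

```python
import os


def flush_to_disk(path):
    """Ask the OS to write any cached data for an existing file to disk."""
    # Open read/write without truncating; flushing needs a writable handle.
    # O_BINARY only exists on Windows, so fall back to 0 elsewhere.
    fd = os.open(path, os.O_RDWR | getattr(os, "O_BINARY", 0))
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
```

It would be called on the linker's output file after the link step finishes, for example `flush_to_disk("protoc.exe")`.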

Comment 63 by brucedaw...@chromium.org, Jan 29 2018

Cc: engedy@chromium.org wvo@google.com pkalinnikov@chromium.org battre@chromium.org
 Issue 707586  has been merged into this issue.

Comment 64 by bugdroid1@chromium.org, Jan 31 2018

Project Member
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/052a09014b2018e2a1e3b7f046ea9ad3355b831e

commit 052a09014b2018e2a1e3b7f046ea9ad3355b831e
Author: Bruce Dawson <brucedawson@chromium.org>
Date: Wed Jan 31 01:10:10 2018

Avoid Windows kernel bug using Python hack

On about 3-4% of Chrome builds on my workstation one of the executables
generated and then used during the build will crash. The binary on disk
is always fine but the loader sometimes maps in pages of zeroes where it
should be mapping in pages from the just-generated binary. Having a page
of zeroes where you are expecting useful instructions tends to lead to
crashes.

This appears to be a bug in the OS disk cache. My suspicion is that this
kernel bug only happens on multi-socket systems, but this is
speculation.

This bug happens regardless of which compiler or linker is used, and
appears to happen on multiple Windows versions. The best reproes have
been on Windows 10 Creators Update, or at least that is where I have
done most of my testing.

Extensive testing - hundreds of overnight builds - has shown that the
problem goes away if FlushFileBuffers is called on the output file
after linking is finished. Eventually this fix/hack will be coded into
lld-link.exe, but for now it is put in tool_wrapper.py to fix the bug
for both link.exe and lld-link.exe.

Earlier versions of this fix only applied it to files with .exe
extensions. However the bug is believed to have happened with DLLs, and
may also affect .lib files created by the linkers, so now it is done
always. The belief is that the performance impact will be negligible.

Importing of win32file required some trickiness because in the context
of ninja builds of Chrome the depot_tools python.bat file is apparently
not called. This means that the python directory is not added to the
system path. The python runtime correctly finds win32file.pyd and calls
LoadLibrary on it but the OS then finds its dependencies in another
version of python installed on the system and the DLL load fails if
those are 64-bit instead of 32-bit.

Bug:  644525 
Change-Id: I71d63b47050385e2e5ba46ced9c8018220370ba7
Reviewed-on: https://chromium-review.googlesource.com/876683
Commit-Queue: Bruce Dawson <brucedawson@chromium.org>
Reviewed-by: Zachary Turner <zturner@chromium.org>
Reviewed-by: Scott Graham <scottmg@chromium.org>
Cr-Commit-Position: refs/heads/master@{#533137}
[modify] https://crrev.com/052a09014b2018e2a1e3b7f046ea9ad3355b831e/build/toolchain/win/tool_wrapper.py
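
As a rough illustration of the approach in this commit (not the actual tool_wrapper.py code), a wrapper can run the link command, find the `/OUT:` argument, and flush that file once the link succeeds. The function name is hypothetical, `os.fsync` stands in for the win32file FlushFileBuffers call that the commit describes, and an `/OUT:` that appears only inside a response file is not handled.

```python
import os
import subprocess


def run_link_and_flush(command):
    """Run a link command; on success, flush its /OUT: file to disk."""
    returncode = subprocess.call(command)
    if returncode != 0:
        return returncode
    for arg in command:
        # Linker switches are case-insensitive, so compare in upper case.
        if arg.upper().startswith("/OUT:"):
            output_path = arg[len("/OUT:"):]
            fd = os.open(output_path, os.O_RDWR | getattr(os, "O_BINARY", 0))
            try:
                os.fsync(fd)  # force cached writes for the output to disk
            finally:
                os.close(fd)
    return returncode
```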

Comment 65 by brucedaw...@chromium.org, Jan 31 2018

Shortly after landing the change above but before it had time to take effect this tree-closing failure happened:

https://logs.chromium.org/v/?s=chromium%2Fbb%2Fchromium%2FWin_x64%2F18699%2F%2B%2Frecipes%2Fsteps%2Fcompile%2F0%2Fstdout

transport_security_state_generator.exe failed with exit code -1073741819
transport_security_state_generator.exe failed with exit code -1073741819
transport_security_state_generator.exe failed with exit code -1073741819
transport_security_state_generator.exe failed with exit code -1073741819

Looking back slightly further in the tree status I found this failure:

https://logs.chromium.org/v/?s=chromium%2Fbb%2Fchromium%2FWin_x64%2F18676%2F%2B%2Frecipes%2Fsteps%2Fcompile%2F0%2Fstdout

transport_security_state_generator.exe failed with exit code -1073741819

These are presumed to be this bug, showing up for what is hopefully the last time.

Comment 66 by tapted@chromium.org, Jan 31 2018

Possibly an unrelated flake, but noting it here:

https://ci.chromium.org/buildbot/chromium.win/WinMSVC64/1684

just failed with
FAILED: obj/third_party/WebKit/Source/core/css/css_7.lib 
E:/b/depot_tools/win_tools-2_7_6_bin/python/bin/python.exe ../../build/toolchain/win/tool_wrapper.py link-wrapper environment.x64 False lib.exe /nologo /ignore:4221 /OUT:obj/third_party/WebKit/Source/core/css/css_7.lib @obj/third_party/WebKit/Source/core/css/css_7.lib.rsp
Microsoft (R) Library Manager Version 14.11.25507.1
Copyright (C) Microsoft Corporation.  All rights reserved.
..
obj/third_party/WebKit/Source/core/css/css_7/WebkitBorderBeforeWidthCustom.obj 
obj/third_party/WebKit/Source/core/css/css_7/WebkitBorderBeforeWidthCustom.obj : fatal error LNK1136: invalid or corrupt file

Comment 67 by bugdroid1@chromium.org, Jan 31 2018

Project Member
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/ba4ede2f67fa09c512ea6063411618acad6516f7

commit ba4ede2f67fa09c512ea6063411618acad6516f7
Author: John Abd-El-Malek <jam@chromium.org>
Date: Wed Jan 31 02:42:42 2018

Reland "Move url_loader_unittest.cc and network_service_unittest.cc to services/network."

This is a reland of e1c22cc3b5dc2a4276839d053bdf12ddee14dbea. The change didn't actually break the build, it was  http://crbug.com/644525 

Bug:  644525 

Original change's description:
> Move url_loader_unittest.cc and network_service_unittest.cc to services/network.
>
> Their content dependencies have been removed in previous changes.
>
> Bug:  753658 
> Cq-Include-Trybots: master.tryserver.chromium.linux:linux_mojo
> Change-Id: Ic632759ef35cfef4e707d29acdc361ae836e9b62
> Reviewed-on: https://chromium-review.googlesource.com/893651
> Commit-Queue: John Abd-El-Malek <jam@chromium.org>
> Reviewed-by: Tom Sepez <tsepez@chromium.org>
> Reviewed-by: Ken Rockot <rockot@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#533113}

Bug:  753658 
Change-Id: If780785790fb2b56aff1d91ddeb781bab2e127a3
Cq-Include-Trybots: master.tryserver.chromium.linux:linux_mojo
Reviewed-on: https://chromium-review.googlesource.com/894895
Reviewed-by: John Abd-El-Malek <jam@chromium.org>
Cr-Commit-Position: refs/heads/master@{#533175}
[modify] https://crrev.com/ba4ede2f67fa09c512ea6063411618acad6516f7/content/network/BUILD.gn
[modify] https://crrev.com/ba4ede2f67fa09c512ea6063411618acad6516f7/content/network/DEPS
[modify] https://crrev.com/ba4ede2f67fa09c512ea6063411618acad6516f7/content/test/BUILD.gn
[modify] https://crrev.com/ba4ede2f67fa09c512ea6063411618acad6516f7/content/test/unittests_manifest.json
[modify] https://crrev.com/ba4ede2f67fa09c512ea6063411618acad6516f7/content/utility/BUILD.gn
[modify] https://crrev.com/ba4ede2f67fa09c512ea6063411618acad6516f7/services/BUILD.gn
[modify] https://crrev.com/ba4ede2f67fa09c512ea6063411618acad6516f7/services/network/BUILD.gn
[rename] https://crrev.com/ba4ede2f67fa09c512ea6063411618acad6516f7/services/network/network_service_unittest.cc
[add] https://crrev.com/ba4ede2f67fa09c512ea6063411618acad6516f7/services/network/test/OWNERS
[add] https://crrev.com/ba4ede2f67fa09c512ea6063411618acad6516f7/services/network/test/service_unittest_manifest.json
[rename] https://crrev.com/ba4ede2f67fa09c512ea6063411618acad6516f7/services/network/url_loader_unittest.cc
[modify] https://crrev.com/ba4ede2f67fa09c512ea6063411618acad6516f7/services/test/BUILD.gn
[add] https://crrev.com/ba4ede2f67fa09c512ea6063411618acad6516f7/services/test/data/content-sniffer-test0.html
[copy] https://crrev.com/ba4ede2f67fa09c512ea6063411618acad6516f7/services/test/data/content-sniffer-test0.html.mock-http-headers
[add] https://crrev.com/ba4ede2f67fa09c512ea6063411618acad6516f7/services/test/data/content-sniffer-test1.html
[add] https://crrev.com/ba4ede2f67fa09c512ea6063411618acad6516f7/services/test/data/content-sniffer-test1.html.mock-http-headers
[add] https://crrev.com/ba4ede2f67fa09c512ea6063411618acad6516f7/services/test/data/content-sniffer-test2.html
[add] https://crrev.com/ba4ede2f67fa09c512ea6063411618acad6516f7/services/test/data/content-sniffer-test2.html.mock-http-headers
[rename] https://crrev.com/ba4ede2f67fa09c512ea6063411618acad6516f7/services/test/data/content-sniffer-test4.html
[rename] https://crrev.com/ba4ede2f67fa09c512ea6063411618acad6516f7/services/test/data/content-sniffer-test4.html.mock-http-headers
[copy] https://crrev.com/ba4ede2f67fa09c512ea6063411618acad6516f7/services/test/data/empty.html
[add] https://crrev.com/ba4ede2f67fa09c512ea6063411618acad6516f7/services/test/data/hello.html
[add] https://crrev.com/ba4ede2f67fa09c512ea6063411618acad6516f7/services/test/data/hello.html.mock-http-headers
[rename] https://crrev.com/ba4ede2f67fa09c512ea6063411618acad6516f7/services/test/data/nocache.html
[rename] https://crrev.com/ba4ede2f67fa09c512ea6063411618acad6516f7/services/test/data/nocache.html.mock-http-headers
[add] https://crrev.com/ba4ede2f67fa09c512ea6063411618acad6516f7/services/test/data/nosniff-test.html
[add] https://crrev.com/ba4ede2f67fa09c512ea6063411618acad6516f7/services/test/data/nosniff-test.html.mock-http-headers
[add] https://crrev.com/ba4ede2f67fa09c512ea6063411618acad6516f7/services/test/data/redirect307-to-echo
[add] https://crrev.com/ba4ede2f67fa09c512ea6063411618acad6516f7/services/test/data/redirect307-to-echo.mock-http-headers
[add] https://crrev.com/ba4ede2f67fa09c512ea6063411618acad6516f7/services/test/data/simple_page.html

Comment 68 by brucedaw...@chromium.org, Jan 31 2018

Summary: Occasional Chrome Win builder compilation fail due to Windows kernel bug (was: Occasional Chrome Win builder compilation fail)
Updating title

Comment 69 by bugdroid1@chromium.org, Jan 31 2018

Project Member
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/7f063158562d2516620386721deb1339a3160e19

commit 7f063158562d2516620386721deb1339a3160e19
Author: Bruce Dawson <brucedawson@chromium.org>
Date: Wed Jan 31 21:33:08 2018

Revert "Disable incremental linking for some tools"

This reverts commit 1942fd8a7fe9fc609f51ef1fbff210ba5f356415.

Reason for revert: This was a hack that attempted to fix random link
failures. This hack ultimately didn't work because the crashes we
were seeing were due to a kernel bug, not a linker bug.

Original change's description:
> Disable incremental linking for some tools
> 
> We occasionally get build crashes because binaries (usually protoc.exe,
> but others as well) are generated incorrectly. The symptom is that the
> incremental linking thunks contain all zeroes instead of a branch
> instruction, leading to crashes, usually access violations. This is
> presumed to be a bug in the MSVC++ incremental linker.
> 
> This turns off incremental linking for four of the binaries that hit
> this issue most frequently, and some of their neighbors. These binaries
> are all small enough that incremental linking is not important so there
> is no real downside to making this change.
> 
> Testing over the weekend shows that this error, or something very like
> it, can happen even with incremental linking disabled. I hope that this
> will reduce the frequency of the failures and there is no downside so
> I'm going to proceed and see if it helps.
> 
> Bug:  644525 
> Cq-Include-Trybots: master.tryserver.chromium.android:android_cronet_tester;master.tryserver.chromium.mac:ios-simulator-cronet
> Change-Id: I0a9b33b0ad8335868e8e6f227f9a21e5ddeff6e4
> Reviewed-on: https://chromium-review.googlesource.com/777764
> Commit-Queue: Bruce Dawson <brucedawson@chromium.org>
> Reviewed-by: John Abd-El-Malek <jam@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#517990}

TBR=jam@chromium.org,brucedawson@chromium.org

# Not skipping CQ checks because original CL landed > 1 day ago.

Bug:  644525 
Change-Id: Ib822f0850cdffe7cdf0112aac5c45a2200b63adf
Cq-Include-Trybots: master.tryserver.chromium.android:android_cronet_tester;master.tryserver.chromium.mac:ios-simulator-cronet
Reviewed-on: https://chromium-review.googlesource.com/894448
Reviewed-by: Bruce Dawson <brucedawson@chromium.org>
Reviewed-by: John Abd-El-Malek <jam@chromium.org>
Commit-Queue: Bruce Dawson <brucedawson@chromium.org>
Cr-Commit-Position: refs/heads/master@{#533419}
[modify] https://crrev.com/7f063158562d2516620386721deb1339a3160e19/net/tools/transport_security_state_generator/BUILD.gn
[modify] https://crrev.com/7f063158562d2516620386721deb1339a3160e19/third_party/brotli/BUILD.gn
[modify] https://crrev.com/7f063158562d2516620386721deb1339a3160e19/third_party/protobuf/BUILD.gn
[modify] https://crrev.com/7f063158562d2516620386721deb1339a3160e19/third_party/yasm/BUILD.gn

Comment 70 by bugdroid1@chromium.org, Feb 1 2018

Project Member
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/f1de72ddf27a1e397aa3cc24f0db96319d501702

commit f1de72ddf27a1e397aa3cc24f0db96319d501702
Author: Scott Graham <scottmg@chromium.org>
Date: Thu Feb 01 18:39:37 2018

Fix linux/win cross-compile

After https://chromium-review.googlesource.com/c/chromium/src/+/876683.

Bug:  644525 
Change-Id: Ic8f71d5f4cababaa252fd4c9155ea8b34b29158b
Reviewed-on: https://chromium-review.googlesource.com/896695
Reviewed-by: Bruce Dawson <brucedawson@chromium.org>
Commit-Queue: Scott Graham <scottmg@chromium.org>
Cr-Commit-Position: refs/heads/master@{#533749}
[modify] https://crrev.com/f1de72ddf27a1e397aa3cc24f0db96319d501702/build/toolchain/win/tool_wrapper.py

Comment 71 by brucedaw...@chromium.org, Feb 1 2018

The Google Chrome Win bot seemed to hit this bug fairly regularly, three times in two hundred builds in the initial report. The first build with this fix was:

https://ci.chromium.org/buildbot/chromium.chrome/Google%20Chrome%20Win/26484

The latest successful build is 26564 so that's 81 builds without this failure (there were two clearly unrelated failures on builds 26543 and 26544 but I'm counting those as successes for the purposes of this bug).

81 builds without a failure is a good start but could easily be due to chance. The waterfall (so, more builders tracked) has also not hit this since the change landed. I'll check in again over the next few days.

Comment 72 by brucedaw...@chromium.org, Feb 5 2018

The Google Chrome Win builder is now up to build 26695. There have been three build failures since the fix landed in build 26484, all clearly unrelated to this bug. Normally this bug would have caused ~3.4 failures over this many builds. This means:

1) The odds of this successful run being due to chance are down to about 3.3%
2) This fix appears to have resolved roughly half of our build breaks
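
The arithmetic behind those numbers can be checked with a short sketch. It assumes the 1.6% per-build failure rate quoted later in this thread, independent builds, and an inclusive build count (26484 through 26695 is 212 builds). The function names are made up for illustration.

```python
def expected_failures(failure_rate, builds):
    """Average number of failures expected over the given number of builds."""
    return failure_rate * builds


def chance_of_clean_run(failure_rate, builds):
    """Probability that every build succeeds if failures are independent."""
    return (1.0 - failure_rate) ** builds


builds = 26695 - 26484 + 1  # inclusive count of builds since the fix
rate = 0.016                # assumed pre-fix failure rate per build

print("expected failures: %.1f" % expected_failures(rate, builds))
print("chance of a clean run: %.1f%%" % (100 * chance_of_clean_run(rate, builds)))
```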

I'll keep monitoring.

Note that Microsoft has indicated that they have put an equivalent workaround into their linker, and lld-link.exe will also get the workaround, so eventually the Python-based workaround will be unnecessary.
https://developercommunity.visualstudio.com/content/problem/191590/linker-needs-workaround-for-windows-kernel-bug.html

Comment 73 by brucedaw...@chromium.org, Feb 10 2018

Status: fixed (was: Started)
We're now at build 26878 on Google Chrome Win and there have been no recurrences. That's 395 builds when we used to have a 1.6% error rate. The odds of that happening due to chance are 0.17%. In fact there have only been four failures total on that builder in the last 395 builds, one due to goma failing to start, two due to some bad code landing, and one of unclear cause. I also checked all of the Chromium waterfall failures since the fix landed and there have been none due to this bug.

I'm calling it. The workaround works. I'm closing this bug as fixed. But, I am in contact with Microsoft engineers and I filed a formal support request - 118021017623739.

The one thing that I'm continuing to monitor is a very low frequency of failures of this type:

[753/4922] LINK chrome_elf_unittests.exe chrome_elf_unittests.exe.pdb
FAILED: chrome_elf_unittests.exe chrome_elf_unittests.exe.pdb 
C:/b/depot_tools/win_tools-2_7_6_bin/python/bin/python.exe ../../build/toolchain/win/tool_wrapper.py link-wrapper environment.x64 False link.exe /nologo /OUT:./chrome_elf_unittests.exe /PDB:./chrome_elf_unittests.exe.pdb @./chrome_elf_unittests.exe.rsp
 : fatal error LNK1103: debugging information corrupt; recompile module

https://logs.chromium.org/v/?s=chromium%2Fbb%2Fchromium.win%2FWin_x64_Builder__dbg_%2F62782%2F%2B%2Frecipes%2Fsteps%2Fcompile%2F0%2Fstdout

This happens in maybe 0.1% of builds so maybe we don't care but it is *possible* that this is the same bug. We will see what Microsoft says.

Comment 74 by bugdroid1@chromium.org, Feb 15 2018

Project Member
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/8c50b3ed049c736406943ac24dc40fc8ac1e782c

commit 8c50b3ed049c736406943ac24dc40fc8ac1e782c
Author: Bruce Dawson <brucedawson@chromium.org>
Date: Thu Feb 15 00:08:36 2018

Print Windows crash codes in hex

Failure codes such as STATUS_ACCESS_VIOLATION are easily recognizable
and easily differentiated when printed in hex (0xC0000005) but are
cryptic and conflated when printed as decimal (-1073741819). This change
teaches two of our wrapper scripts to print large negative numbers as
hex so that those skilled in the Windows arts can automatically say
"access violation" or "not access violation."

I also removed an inelegant trailing period, for consistency.

In testing with artificially inserted error codes the output is:
  Protoc has returned non-zero status: -99
  Protoc has returned non-zero status: 0xC0000005
  genperf.exe failed with exit code -99
  re2c.exe failed with exit code 0xC0000005

Bug:  803617 , 644525 
Change-Id: I627754976ff04e334010d36e5734d73421523e47
Reviewed-on: https://chromium-review.googlesource.com/917101
Commit-Queue: Bruce Dawson <brucedawson@chromium.org>
Reviewed-by: Dirk Pranke <dpranke@chromium.org>
Cr-Commit-Position: refs/heads/master@{#536899}
[modify] https://crrev.com/8c50b3ed049c736406943ac24dc40fc8ac1e782c/build/gn_run_binary.py
[modify] https://crrev.com/8c50b3ed049c736406943ac24dc40fc8ac1e782c/tools/protoc_wrapper/protoc_wrapper.py
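
The conversion this commit describes can be sketched as follows. This is an illustration rather than the code that landed: the function name and the -100 cutoff for "large negative" are assumptions, chosen so that the two sample codes from the commit message format the same way.

```python
def format_exit_code(code):
    """Format a process exit code, using unsigned hex for large negatives."""
    if code <= -100:
        # Reinterpret the signed 32-bit value as unsigned so that
        # -1073741819 prints as the familiar 0xC0000005.
        return "0x%08X" % (code & 0xFFFFFFFF)
    return str(code)


print("genperf.exe failed with exit code " + format_exit_code(-99))
print("re2c.exe failed with exit code " + format_exit_code(-1073741819))
```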

Comment 75 by brucedaw...@chromium.org, Feb 24 2018

 Issue 722117  has been merged into this issue.

Comment 76 by brucedaw...@chromium.org, Mar 9 2018

I might as well complete the circle and link from this bug to the blog post I wrote about it:

https://randomascii.wordpress.com/2018/02/25/compiler-bug-linker-bug-windows-kernel-bug/

Comment 77 by yangguo@google.com, Jan 23

Cc: hablich@chromium.org d...@chromium.org wfh@chromium.org qyears...@chromium.org hidehiko@chromium.org machenb...@chromium.org
 Issue 700525  has been merged into this issue.

Comment 78 by mstensho@chromium.org, Jan 23

Cc: -mstensho@chromium.org
