goma doesn't handle -imsvc correctly when it's in a response file, leading to miscompiles |
|||||||
Issue descriptione.g. https://build.chromium.org/p/chromium.fyi/builders/CrWinClang/builds/9654/steps/compile/logs/stdio It always complains about failing to run either genmodule (without error), protoc, or when compiling a file generated by protoc: FAILED: gen/third_party/yasm/module.c C:/b/depot_tools/python276_bin/python.exe ../../build/gn_run_binary.py genmodule.exe ../../third_party/yasm/source/patched-yasm/libyasm/module.in ../../third_party/yasm/source/config/win/Makefile gen/third_party/yasm/module.c FAILED: pyproto/components/copresence/proto/data_pb2.py gen/components/copresence/proto/data.pb.cc gen/components/copresence/proto/data.pb.h C:/b/depot_tools/python276_bin/python.exe ../../tools/protoc_wrapper/protoc_wrapper.py --protobuf gen/components/copresence/proto/data.pb.h --proto-in-dir ../../components/copresence/proto --proto-in-file data.proto --use-system-protobuf=0 -- ./protoc --cpp_out gen/components/copresence/proto --python_out pyproto/components/copresence/proto config_data.proto:16640432:16640432: Couldn't parse default value "-.". config_data.proto:16640432:16640432: Couldn't parse default value "-.". config_data.proto:16640432:16640432: Couldn't parse default value "-.". config_data.proto:16640432:16640432: Couldn't parse default value "-.". config_data.proto:16640432:16640432: Couldn't parse default value "-.". data.proto: Import "config_data.proto" was not found or had errors. data.proto:16640432:16640432: "DirectiveConfiguration" is not defined. https://build.chromium.org/p/chromium.fyi/builders/CrWinClang/builds/9597/steps/compile/logs/stdio C:\b\build\slave\cache\cipd\goma/gomacc.exe ../../third_party/llvm-build/Release+Asserts/bin/clang-cl.exe /nologo /showIncludes /FC @clang_nacl_win64/obj/components/metrics/proto/proto/system_profile.pb.obj.rsp /c clang_nacl_win64/gen/components/metrics/proto/system_profile.pb.cc /Foclang_nacl_win64/obj/components/metrics/proto/proto/system_profile.pb.obj /Fd"clang_nacl_win64/obj/components/metrics/proto/proto_cc.pdb" In file included from clang_nacl_win64/gen/components/metrics/proto/system_profile.pb.cc:5: clang_nacl_win64/gen/components/metrics/proto/system_profile.pb.h(5464,58): error: source file is not valid UTF-8 primary_screen_scale_factor_ = 1.1019004687438281311395<D8><F1>[f; ^ The last error is especially weird. I ssh'd in and checked what the protoc invocation by protoc_wrapper.py is, it's: $ ./protoc --cpp_out gen/components/copresence/proto --python_out pyproto/components/copresence/proto --proto_path=../../components/copresence/proto ../../components/copresence/proto/rpcs.proto config_data.proto:3467696:3467696: Couldn't parse default value "-.". config_data.proto:3467696:3467696: Couldn't parse default value "-.". config_data.proto:3467696:3467696: Couldn't parse default value "-.". config_data.proto:3467696:3467696: Couldn't parse default value "-.". config_data.proto:3467696:3467696: Couldn't parse default value "-.". data.proto: Import "config_data.proto" was not found or had errors. data.proto:3467696:3467696: "DirectiveConfiguration" is not defined. rpcs.proto: Import "config_data.proto" was not found or had errors. rpcs.proto: Import "data.proto" was not found or had errors. rpcs.proto:3467696:3467696: "ClientVersion" is not defined. rpcs.proto:3467696:3467696: "ClientVersion" is not defined. rpcs.proto:3467696:3467696: "DeviceFingerprint" is not defined. rpcs.proto:3467696:3467696: "DebugInfo" is not defined. rpcs.proto:3467696:3467696: "Status" is not defined. rpcs.proto:3467696:3467696: "Configuration" is not defined. rpcs.proto:3467696:3467696: "PushServiceRegistration" is not defined. rpcs.proto:3467696:3467696: "DeviceIdentifiers" is not defined. rpcs.proto:3467696:3467696: "TokenObservation" is not defined. rpcs.proto:3467696:3467696: "DeviceState" is not defined. rpcs.proto:3467696:3467696: "PublishedMessage" is not defined. rpcs.proto:3467696:3467696: "Subscription" is not defined. rpcs.proto:3467696:3467696: "Token" is not defined. rpcs.proto:3467696:3467696: "SubscribedMessage" is not defined. rpcs.proto:3467696:3467696: "Directive" is not defined. rpcs.proto:3467696:3467696: "MessageResult" is not defined. rpcs.proto:3467696:3467696: "SubscriptionResult" is not defined. The last time I switched this bot, compile worked fine, not sure what broke. (dpranke, shout if you've seen something like this before, else I'll just debug a bit, feel free to uncc yourself in that case)
,
Jun 7 2016
Aha, we apparently updated to a new protobuf 4 days ago...
,
Jun 7 2016
https://chromium.googlesource.com/chromium/src/+blame/master/third_party/protobuf/src/google/protobuf/descriptor.cc#4348 claims that that update touched the "Couldn't parse default value" error line, but https://chromium.googlesource.com/chromium/src/+/a5b3d51ac5c84564b9671fbdb6afb54a527ebd90%5E%21/#F641 doesn't show anything. Hmmm.
,
Jun 7 2016
Ah no, that line was touched in the previous update in april.
,
Jun 7 2016
clang_nacl_win64/gen/components/metrics/proto/system_profile.pb.h(5464,58): error: source file is not valid UTF-8 primary_screen_scale_factor_ = 1.1019004687438281311395<D8><F1>[f; If you look at line 5464 of generated system_profile.pb.h, it's part of a method clear_primary_scale_factor(): https://cs.chromium.org/chromium/src/out/Debug/gen/components/metrics/proto/system_profile.pb.h?q=system_profile.pb.h&sq=package:chromium&dr&l=5464 I see no conceivable way of protoc compiler generating such weird float constant there, not to mention the random high byte characters. If it's not reproducible on other machines, my guess would be a hardware failure, more likely memory (because some garbage was interpreted as a float).
,
Jun 7 2016
Seems very unlikely that memory went bad in the same build that I switched the bot to gn, no? (but the line/col numbers in the error message look like garbage too)
,
Jun 7 2016
I managed to repro the genperf crash locally by a) completely blowing away my out dir b) enabling goma. It seems that this doesn't repro without goma (?)
,
Jun 7 2016
This reliably repros the genperf crash. Not using goma or not using clang or not doing official builds makes it go away. C:\src\chrome\src>type out\gnnew\args.gn is_chrome_branded = true is_clang = true is_debug = false is_official_build = true symbol_level = 1 target_cpu = "x86" use_goma = true C:\src\chrome\src>ninja -C out\gnnew gen/third_party/yasm/include/x86insn_nasm.c -t clean Cleaning... 25 files. C:\src\chrome\src>ninja -C out\gnnew gen/third_party/yasm/include/x86insn_nasm.c ninja: Entering directory `out\gnnew' [21/23] LINK genperf.exe LINK : /LTCG specified but no code generation required; remove /LTCG from the link command line to improve linker perfor mance [23/23] ACTION //third_party/yasm:compile_gperf_for_include(//build/toolchain/win:clang_x86) FAILED: gen/third_party/yasm/include/x86insn_nasm.c C:/python_27_amd64/files/python.exe ../../build/gn_run_binary.py genperf.exe gen/third_party/yasm/include/x86insn_nasm.g perf gen/third_party/yasm/include/x86insn_nasm.c ninja: build stopped: subcommand failed.
,
Jun 7 2016
It also works with the 05-09 goma client release.
,
Jun 7 2016
Hm, I then tried with the Jun 3 release (worked) and then with Jun 6 again, and now it doesn't happen any more :-/ But since I saw this on my box, it's likely not memory corruption on the slave. Will try to get it to repro again.
,
Jun 7 2016
Ok, I managed to repro this with all 3 gomas I tried (Jun 6, 4, May 9) by doing: C:\src\chrome\src>type repro.bat rmdir /s/q out\gn2 mkdir out\gn2 copy out\gnnew\args.gn out\gn2 gn gen out\gn2 and then run repro followed by ninja -C out\gnnew gen/third_party/yasm/include/x86insn_nasm.c -j200 and if it doesn't repro then, by adding "-t clean" and then running the build command again. I couldn't repro this without -j200 if I did a fully clean build, so I suppose something's timing-dependent. Not sure if this is goma's fault, or if goma just changes timings enough for something to go wrong that would go wrong if the timing was different for other reasons.
,
Jun 7 2016
Disabling incremental linking doesn't change things. Trying to load the genperf crash into msvc's debugger makes msvc hang on startup. Running it in windbg tells me 0:000> g ModLoad: 00000000`77a80000 00000000`77b9f000 WOW64_IMAGE_SECTION ModLoad: 00000000`75ec0000 00000000`75fd0000 WOW64_IMAGE_SECTION ModLoad: 00000000`77a80000 00000000`77b9f000 NOT_AN_IMAGE ModLoad: 00000000`77980000 00000000`77a7a000 NOT_AN_IMAGE ModLoad: 00000000`75ec0000 00000000`75fd0000 C:\Windows\syswow64\kernel32.dll ModLoad: 00000000`77590000 00000000`775d7000 C:\Windows\syswow64\KERNELBASE.dll ModLoad: 00000000`761b0000 00000000`76251000 C:\Windows\syswow64\ADVAPI32.dll ModLoad: 00000000`76260000 00000000`7630c000 C:\Windows\syswow64\msvcrt.dll ModLoad: 00000000`75c90000 00000000`75ca9000 C:\Windows\SysWOW64\sechost.dll ModLoad: 00000000`775e0000 00000000`776d0000 C:\Windows\syswow64\RPCRT4.dll ModLoad: 00000000`756d0000 00000000`75730000 C:\Windows\syswow64\SspiCli.dll ModLoad: 00000000`756c0000 00000000`756cc000 C:\Windows\syswow64\CRYPTBASE.dll (14c4.34e0): WOW64 breakpoint - code 4000001f (first chance) First chance exceptions are reported before any exception handling. This exception may be expected and handled. *** ERROR: Symbol file could not be found. Defaulted to export symbols for ntdll32.dll - ntdll32!LdrVerifyImageMatchesChecksum+0xf1f: 77e20e14 cc int 3 0:000:x86> k ChildEBP RetAddr WARNING: Stack unwind information not available. Following frames may be wrong. 0024fa9c 77e00f27 ntdll32!LdrVerifyImageMatchesChecksum+0xf1f 0024fc18 77dc90b5 ntdll32!RtlUlonglongByteSwap+0x4217 0024fc68 77db9889 ntdll32!RtlSetUnhandledExceptionFilter+0x50 0024fca0 77bd7fae ntdll32!LdrInitializeThunk+0x10 0024fca0 00000000 ntdll!RtlInitializeCriticalSectionEx+0x103e rnk, does that mean anything to you?
,
Jun 7 2016
I compared the bad executable with the good executable (by disabling goma), and I compared the good executable against another good executable (by building twice). The only part where the good and bad executable differ where the two good executables don't differ is in two places in the disasm. The bad executable starts with: 00401000: 55 push ebp 00401001: 89 E5 mov ebp,esp 00401003: 53 push ebx 00401004: 57 push edi 00401005: 56 push esi 00401006: 50 push eax 00401007: 8B 7D 0C mov edi,dword ptr [ebp+0Ch] 0040100A: 8D 5D F4 lea ebx,[ebp-0Ch] 0040100D: 89 7B FC mov dword ptr [ebx-4],edi The bad with: 00401000: 55 push ebp 00401001: 89 E5 mov ebp,esp 00401003: 53 push ebx 00401004: 57 push edi 00401005: 56 push esi 00401006: 50 push eax 00401007: 8B 7D 0C mov edi,dword ptr [ebp+0Ch] 0040100A: 8D 5D 10 lea ebx,[ebp+10h] 0040100D: 89 5D F0 mov dword ptr [ebp-10h],ebx (the last two lines of this is different) A bit later, the bad executable has 00402057: 55 push ebp 00402058: 89 E5 mov ebp,esp 0040205A: 53 push ebx 0040205B: 57 push edi 0040205C: 56 push esi 0040205D: 50 push eax 0040205E: 8B 7D 0C mov edi,dword ptr [ebp+0Ch] 00402061: 8B 75 08 mov esi,dword ptr [ebp+8] 00402064: 8D 5D F4 lea ebx,[ebp-0Ch] 00402067: 89 7B FC mov dword ptr [ebx-4],edi while the good has 00402057: 55 push ebp 00402058: 89 E5 mov ebp,esp 0040205A: 53 push ebx 0040205B: 57 push edi 0040205C: 56 push esi 0040205D: 50 push eax 0040205E: 8B 75 08 mov esi,dword ptr [ebp+8] 00402061: 8B 7D 0C mov edi,dword ptr [ebp+0Ch] 00402064: 8D 5D 10 lea ebx,[ebp+10h] 00402067: 89 5D F0 mov dword ptr [ebp-10h],ebx Other than that, the disassembly is identical (for the two good executables, all disassembly is identical). Looks like things are slightly different in two prologues. (most prologues are identical though)
,
Jun 7 2016
Sorry, "The bad with:" in the previous comment should be "The good with:" :-( It's always bad first, good second.
,
Jun 7 2016
So it's a miscompilation then? Are all bad executables identical, or do they differ in different places? If the latter, then maybe one of the goma slaves has binary or memory corruption.
,
Jun 7 2016
genperf consists of 6 obj files. Of these, 2 have the "bad" instructions: thakis@THAKIS6-W MINGW64 /c/src/chrome/src/out/gn2 (master) $ xxd -p obj/third_party/yasm/genperf/genperf.obj | tr -d '\n' | grep -c '8d5df4897bfc' 1 thakis@THAKIS6-W MINGW64 /c/src/chrome/src/out/gn2 (master) $ xxd -p obj/third_party/yasm/genperf/perfect.obj | tr -d '\n' | grep -c '8d5df4897bfc' 1 thakis@THAKIS6-W MINGW64 /c/src/chrome/src/out/gn2 (master) $ xxd -p obj/third_party/yasm/yasm_utils/phash.obj | tr -d '\n' | grep -c '8d5df4897bfc' 0 thakis@THAKIS6-W MINGW64 /c/src/chrome/src/out/gn2 (master) $ xxd -p obj/third_party/yasm/yasm_utils/xmalloc.obj | tr -d '\n' | grep -c '8d5df4897bfc' 0 thakis@THAKIS6-W MINGW64 /c/src/chrome/src/out/gn2 (master) $ xxd -p obj/third_party/yasm/yasm_utils/xstrdup.obj | tr -d '\n' | grep -c '8d5df4897bfc' 0 thakis@THAKIS6-W MINGW64 /c/src/chrome/src/out/gn2 (master) $ xxd -p obj/third_party/yasm/genperf/genperf.obj | tr -d '\n' | grep -c '8d5d10895df0' 0 Running `dumpbin /disasm gn2\obj\third_party\yasm\genperf\genperf.obj` suggests that the different instructions are in _sprintf and _fprintf.
,
Jun 7 2016
hm, maybe an include path ordering thing wrt vadefs.h in goma? let me try that.
,
Jun 7 2016
Mmm, vadefs.h is a good theory. The LEA instructions that differ are either for the first parameter or for a local stack variable.
,
Jun 7 2016
Looks like I can reliably repro without -j200 if I run `set GOMA_USE_LOCAL=0` first.
,
Jun 7 2016
If I remove -imsvc (like the following patch), the problem disappears. So this is due to goma not handling -imsvc quite right, and since it's in sprintf it somehow being due to vadefs.h not getting picked up seems very likely. I've failed to make a reduced repro so far, but I'll keep at it for a bit longer. `set GOMA_USE_LOCAL=0` is very useful for debugging goma stuff.
diff --git a/build/toolchain/win/BUILD.gn b/build/toolchain/win/BUILD.gn
index 097229d..f3b7264 100644
--- a/build/toolchain/win/BUILD.gn
+++ b/build/toolchain/win/BUILD.gn
@@ -91,7 +91,7 @@ template("msvc_toolchain") {
sys_include_flags = "${invoker.sys_include_flags} " # Note trailing space.
} else {
# clang-cl doesn't need this env hoop, so omit it there.
- assert(!invoker.is_clang)
+ #assert(!invoker.is_clang)
env_wrapper = "ninja -t msvc -e $env -- " # Note trailing space.
sys_include_flags = ""
}
@@ -318,7 +318,7 @@ if (target_cpu == "x86") {
cl = "${goma_prefix}$prefix/${clang_cl}"
toolchain_os = "win"
is_clang = true
- sys_include_flags = "${x86_toolchain_data.include_flags}"
+ #sys_include_flags = "${x86_toolchain_data.include_flags}"
}
}
@@ -363,7 +363,7 @@ template("win_x64_toolchains") {
cl = "${goma_prefix}$prefix/${clang_cl}"
toolchain_os = "win"
is_clang = true
- sys_include_flags = "${x64_toolchain_data.include_flags}"
+ #sys_include_flags = "${x64_toolchain_data.include_flags}"
# This shares a name with a global and forward_variables_from won't clobber
# the existing value, so we need to set it explicitly.
,
Jun 8 2016
Aha, looks like the trick to repro is to put -imsvc in an rsp file.
,
Jun 8 2016
Ok, repro steps (and retitling):
In a shell that did NOT run vcvarsall / vcvars32 / setenv.cmd, do:
c:\src\chrome\src> set GOMA_USE_LOCAL=0
C:\src\chrome\src>type file.c
#include <stdio.h>
int main() {
char buf[80];
sprintf(buf, "hi %d", 10);
}
C:\src\chrome\src>type rspfile.rsp
-imsvcc:\src\depot_tools\win_toolchain\vs_files\95ddda401ec5678f15eeed01d2bee08fcbc5ee97\win_sdk\bin\..\..\win_sdk\Inclu
de\10.0.10586.0\ucrt -imsvcc:\src\depot_tools\win_toolchain\vs_files\95ddda401ec5678f15eeed01d2bee08fcbc5ee97\win_sdk\bi
n\..\..\VC\include
C:\src\chrome\src>type build.bat
C:\goma\goma-win64/gomacc.exe third_party\llvm-build\Release+Asserts\bin\clang-cl.exe -fmsc-version=1900 -m32 /O1 /Ob2 /
Oy- /Oi /MT /TC /Fofile.obj /c file.c @rspfile.rsp
Then in a different shell that did run e.g. depot_tools\win_toolchain\vs_files\95ddda401ec5678f15eeed01d2bee08fcbc5ee97\win_sdk\bin\SetEnv.cmd
dumpbin /disasm file.obj > file_asm_goma.txt
This file will contain
0000000A: 8D 5D F4 lea ebx,[ebp-0Ch]
0000000D: 89 7B FC mov dword ptr [ebx-4],edi
Now, remove "C:\goma\goma-win64/gomacc.exe" from build.bat and run build and dumpbin again. Now these two lines will change to
0000000A: 8D 5D 10 lea ebx,[ebp+10h]
0000000D: 89 5D F0 mov dword ptr [ebp-10h],ebx
This is because goma now for some reason doesn't pick up clang's vadefs.h.
I'll file an internal bug with goma team for this.
,
Jun 8 2016
Filed https://b/29190906
,
Jun 21 2016
,
Jun 21 2016
I believe it's fixed. Could you test it?
,
Jun 21 2016
This works now, thanks! (I flipped https://build.chromium.org/p/chromium.fyi/builders/CrWinClang to gn, and last time it couldn't finish the build since host tools got miscompiled. Now the build succeeds, and all tests stay green.) |
|||||||
►
Sign in to add a comment |
|||||||
Comment 1 by dpranke@chromium.org
, Jun 7 2016