New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 618038 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2016
Cc:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 3
Type: Bug

Blocking:
issue 354261
issue 498033
issue 605318



Sign in to add a comment

goma doesn't handle -imsvc correctly when it's in a response file, leading to miscompiles

Project Member Reported by thakis@chromium.org, Jun 7 2016

Issue description

e.g. https://build.chromium.org/p/chromium.fyi/builders/CrWinClang/builds/9654/steps/compile/logs/stdio

It always complains about failing to run either genmodule (without error), protoc, or when compiling a file generated by protoc:

FAILED: gen/third_party/yasm/module.c 
C:/b/depot_tools/python276_bin/python.exe ../../build/gn_run_binary.py genmodule.exe ../../third_party/yasm/source/patched-yasm/libyasm/module.in ../../third_party/yasm/source/config/win/Makefile gen/third_party/yasm/module.c


FAILED: pyproto/components/copresence/proto/data_pb2.py gen/components/copresence/proto/data.pb.cc gen/components/copresence/proto/data.pb.h 
C:/b/depot_tools/python276_bin/python.exe ../../tools/protoc_wrapper/protoc_wrapper.py --protobuf gen/components/copresence/proto/data.pb.h --proto-in-dir ../../components/copresence/proto --proto-in-file data.proto --use-system-protobuf=0 -- ./protoc --cpp_out gen/components/copresence/proto --python_out pyproto/components/copresence/proto
config_data.proto:16640432:16640432: Couldn't parse default value "-.".

config_data.proto:16640432:16640432: Couldn't parse default value "-.".

config_data.proto:16640432:16640432: Couldn't parse default value "-.".

config_data.proto:16640432:16640432: Couldn't parse default value "-.".

config_data.proto:16640432:16640432: Couldn't parse default value "-.".

data.proto: Import "config_data.proto" was not found or had errors.

data.proto:16640432:16640432: "DirectiveConfiguration" is not defined.


https://build.chromium.org/p/chromium.fyi/builders/CrWinClang/builds/9597/steps/compile/logs/stdio

C:\b\build\slave\cache\cipd\goma/gomacc.exe ../../third_party/llvm-build/Release+Asserts/bin/clang-cl.exe /nologo /showIncludes /FC @clang_nacl_win64/obj/components/metrics/proto/proto/system_profile.pb.obj.rsp /c clang_nacl_win64/gen/components/metrics/proto/system_profile.pb.cc /Foclang_nacl_win64/obj/components/metrics/proto/proto/system_profile.pb.obj /Fd"clang_nacl_win64/obj/components/metrics/proto/proto_cc.pdb"
In file included from clang_nacl_win64/gen/components/metrics/proto/system_profile.pb.cc:5:
clang_nacl_win64/gen/components/metrics/proto/system_profile.pb.h(5464,58):  error: source file is not valid UTF-8
  primary_screen_scale_factor_ = 1.1019004687438281311395<D8><F1>[f;
                                                         ^


The last error is especially weird. I ssh'd in and checked what the protoc invocation by protoc_wrapper.py is, it's:


$ ./protoc --cpp_out gen/components/copresence/proto --python_out pyproto/components/copresence/proto --proto_path=../../components/copresence/proto ../../components/copresence/proto/rpcs.proto
config_data.proto:3467696:3467696: Couldn't parse default value "-.".
config_data.proto:3467696:3467696: Couldn't parse default value "-.".
config_data.proto:3467696:3467696: Couldn't parse default value "-.".
config_data.proto:3467696:3467696: Couldn't parse default value "-.".
config_data.proto:3467696:3467696: Couldn't parse default value "-.".
data.proto: Import "config_data.proto" was not found or had errors.
data.proto:3467696:3467696: "DirectiveConfiguration" is not defined.
rpcs.proto: Import "config_data.proto" was not found or had errors.
rpcs.proto: Import "data.proto" was not found or had errors.
rpcs.proto:3467696:3467696: "ClientVersion" is not defined.
rpcs.proto:3467696:3467696: "ClientVersion" is not defined.
rpcs.proto:3467696:3467696: "DeviceFingerprint" is not defined.
rpcs.proto:3467696:3467696: "DebugInfo" is not defined.
rpcs.proto:3467696:3467696: "Status" is not defined.
rpcs.proto:3467696:3467696: "Configuration" is not defined.
rpcs.proto:3467696:3467696: "PushServiceRegistration" is not defined.
rpcs.proto:3467696:3467696: "DeviceIdentifiers" is not defined.
rpcs.proto:3467696:3467696: "TokenObservation" is not defined.
rpcs.proto:3467696:3467696: "DeviceState" is not defined.
rpcs.proto:3467696:3467696: "PublishedMessage" is not defined.
rpcs.proto:3467696:3467696: "Subscription" is not defined.
rpcs.proto:3467696:3467696: "Token" is not defined.
rpcs.proto:3467696:3467696: "SubscribedMessage" is not defined.
rpcs.proto:3467696:3467696: "Directive" is not defined.
rpcs.proto:3467696:3467696: "MessageResult" is not defined.
rpcs.proto:3467696:3467696: "SubscriptionResult" is not defined.




The last time I switched this bot, compile worked fine, not sure what broke. (dpranke, shout if you've seen something like this before, else I'll just debug a bit, feel free to uncc yourself in that case)
 
That doesn't ring a bell to me at all, but I'll be curious to see what the problem is.
Cc: xyzzyz@chromium.org
Aha, we apparently updated to a new protobuf 4 days ago...
Ah no, that line was touched in the previous update in april.
clang_nacl_win64/gen/components/metrics/proto/system_profile.pb.h(5464,58):  error: source file is not valid UTF-8
  primary_screen_scale_factor_ = 1.1019004687438281311395<D8><F1>[f;

If you look at line 5464 of generated system_profile.pb.h, it's part of a method clear_primary_scale_factor(): https://cs.chromium.org/chromium/src/out/Debug/gen/components/metrics/proto/system_profile.pb.h?q=system_profile.pb.h&sq=package:chromium&dr&l=5464

I see no conceivable way of protoc compiler generating such weird float constant there, not to mention the random high byte characters.

If it's not reproducible on other machines, my guess would be a hardware failure, more likely memory (because some garbage was interpreted as a float).
Seems very unlikely that memory went bad in the same build that I switched the bot to gn, no?

(but the line/col numbers in the error message look like garbage too)
I managed to repro the genperf crash locally by a) completely blowing away my out dir b) enabling goma. It seems that this doesn't repro without goma (?)
Cc: ukai@chromium.org
This reliably repros the genperf crash. Not using goma or not using clang or not doing official builds makes it go away.

C:\src\chrome\src>type out\gnnew\args.gn
is_chrome_branded = true
is_clang = true
is_debug = false
is_official_build = true
symbol_level = 1
target_cpu = "x86"
use_goma = true

C:\src\chrome\src>ninja -C out\gnnew gen/third_party/yasm/include/x86insn_nasm.c -t clean
Cleaning... 25 files.

C:\src\chrome\src>ninja -C out\gnnew gen/third_party/yasm/include/x86insn_nasm.c
ninja: Entering directory `out\gnnew'
[21/23] LINK genperf.exe
LINK : /LTCG specified but no code generation required; remove /LTCG from the link command line to improve linker perfor
mance
[23/23] ACTION //third_party/yasm:compile_gperf_for_include(//build/toolchain/win:clang_x86)
FAILED: gen/third_party/yasm/include/x86insn_nasm.c
C:/python_27_amd64/files/python.exe ../../build/gn_run_binary.py genperf.exe gen/third_party/yasm/include/x86insn_nasm.g
perf gen/third_party/yasm/include/x86insn_nasm.c
ninja: build stopped: subcommand failed.
It also works with the 05-09 goma client release.
Hm, I then tried with the Jun 3 release (worked) and then with Jun 6 again, and now it doesn't happen any more :-/

But since I saw this on my box, it's likely not memory corruption on the slave. Will try to get it to repro again.
Ok, I managed to repro this with all 3 gomas I tried (Jun 6, 4, May 9) by doing:

C:\src\chrome\src>type repro.bat
rmdir /s/q out\gn2
mkdir out\gn2
copy out\gnnew\args.gn out\gn2
gn gen out\gn2

and then run repro followed by

ninja -C out\gnnew gen/third_party/yasm/include/x86insn_nasm.c -j200

and if it doesn't repro then, by adding "-t clean" and then running the build command again. I couldn't repro this without -j200 if I did a fully clean build, so I suppose something's timing-dependent. Not sure if this is goma's fault, or if goma just changes timings enough for something to go wrong that would go wrong if the timing was different for other reasons.
Cc: r...@chromium.org
Disabling incremental linking doesn't change things.

Trying to load the genperf crash into msvc's debugger makes msvc hang on startup.

Running it in windbg tells me

0:000> g
ModLoad: 00000000`77a80000 00000000`77b9f000   WOW64_IMAGE_SECTION
ModLoad: 00000000`75ec0000 00000000`75fd0000   WOW64_IMAGE_SECTION
ModLoad: 00000000`77a80000 00000000`77b9f000   NOT_AN_IMAGE
ModLoad: 00000000`77980000 00000000`77a7a000   NOT_AN_IMAGE
ModLoad: 00000000`75ec0000 00000000`75fd0000   C:\Windows\syswow64\kernel32.dll
ModLoad: 00000000`77590000 00000000`775d7000   C:\Windows\syswow64\KERNELBASE.dll
ModLoad: 00000000`761b0000 00000000`76251000   C:\Windows\syswow64\ADVAPI32.dll
ModLoad: 00000000`76260000 00000000`7630c000   C:\Windows\syswow64\msvcrt.dll
ModLoad: 00000000`75c90000 00000000`75ca9000   C:\Windows\SysWOW64\sechost.dll
ModLoad: 00000000`775e0000 00000000`776d0000   C:\Windows\syswow64\RPCRT4.dll
ModLoad: 00000000`756d0000 00000000`75730000   C:\Windows\syswow64\SspiCli.dll
ModLoad: 00000000`756c0000 00000000`756cc000   C:\Windows\syswow64\CRYPTBASE.dll
(14c4.34e0): WOW64 breakpoint - code 4000001f (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for ntdll32.dll - 
ntdll32!LdrVerifyImageMatchesChecksum+0xf1f:
77e20e14 cc              int     3
0:000:x86> k
ChildEBP RetAddr  
WARNING: Stack unwind information not available. Following frames may be wrong.
0024fa9c 77e00f27 ntdll32!LdrVerifyImageMatchesChecksum+0xf1f
0024fc18 77dc90b5 ntdll32!RtlUlonglongByteSwap+0x4217
0024fc68 77db9889 ntdll32!RtlSetUnhandledExceptionFilter+0x50
0024fca0 77bd7fae ntdll32!LdrInitializeThunk+0x10
0024fca0 00000000 ntdll!RtlInitializeCriticalSectionEx+0x103e



rnk, does that mean anything to you?
I compared the bad executable with the good executable (by disabling goma), and I compared the good executable against another good executable (by building twice).

The only part where the good and bad executable differ where the two good executables don't differ is in two places in the disasm. The bad executable starts with:

  00401000: 55                 push        ebp
  00401001: 89 E5              mov         ebp,esp
  00401003: 53                 push        ebx
  00401004: 57                 push        edi
  00401005: 56                 push        esi
  00401006: 50                 push        eax
  00401007: 8B 7D 0C           mov         edi,dword ptr [ebp+0Ch]
  0040100A: 8D 5D F4           lea         ebx,[ebp-0Ch]
  0040100D: 89 7B FC           mov         dword ptr [ebx-4],edi

The bad with:


  00401000: 55                 push        ebp
  00401001: 89 E5              mov         ebp,esp
  00401003: 53                 push        ebx
  00401004: 57                 push        edi
  00401005: 56                 push        esi
  00401006: 50                 push        eax
  00401007: 8B 7D 0C           mov         edi,dword ptr [ebp+0Ch]
  0040100A: 8D 5D 10           lea         ebx,[ebp+10h]
  0040100D: 89 5D F0           mov         dword ptr [ebp-10h],ebx


(the last two lines of this is different)

A bit later, the bad executable has

  00402057: 55                 push        ebp
  00402058: 89 E5              mov         ebp,esp
  0040205A: 53                 push        ebx
  0040205B: 57                 push        edi
  0040205C: 56                 push        esi
  0040205D: 50                 push        eax
  0040205E: 8B 7D 0C           mov         edi,dword ptr [ebp+0Ch]
  00402061: 8B 75 08           mov         esi,dword ptr [ebp+8]
  00402064: 8D 5D F4           lea         ebx,[ebp-0Ch]
  00402067: 89 7B FC           mov         dword ptr [ebx-4],edi



while the good has

  00402057: 55                 push        ebp
  00402058: 89 E5              mov         ebp,esp
  0040205A: 53                 push        ebx
  0040205B: 57                 push        edi
  0040205C: 56                 push        esi
  0040205D: 50                 push        eax
  0040205E: 8B 75 08           mov         esi,dword ptr [ebp+8]
  00402061: 8B 7D 0C           mov         edi,dword ptr [ebp+0Ch]
  00402064: 8D 5D 10           lea         ebx,[ebp+10h]
  00402067: 89 5D F0           mov         dword ptr [ebp-10h],ebx



Other than that, the disassembly is identical (for the two good executables, all disassembly is identical).


Looks like things are slightly different in two prologues. (most prologues are identical though)
Sorry, "The bad with:" in the previous comment should be "The good with:" :-( It's always bad first, good second.
So it's a miscompilation then? Are all bad executables identical, or do they differ in different places? If the latter, then maybe one of the goma slaves has binary or memory corruption. 
genperf consists of 6 obj files. Of these, 2 have the "bad" instructions:

thakis@THAKIS6-W MINGW64 /c/src/chrome/src/out/gn2 (master)
$ xxd -p obj/third_party/yasm/genperf/genperf.obj | tr -d '\n' | grep -c '8d5df4897bfc'
1

thakis@THAKIS6-W MINGW64 /c/src/chrome/src/out/gn2 (master)
$ xxd -p obj/third_party/yasm/genperf/perfect.obj | tr -d '\n' | grep -c '8d5df4897bfc'
1

thakis@THAKIS6-W MINGW64 /c/src/chrome/src/out/gn2 (master)
$ xxd -p obj/third_party/yasm/yasm_utils/phash.obj | tr -d '\n' | grep -c '8d5df4897bfc'
0

thakis@THAKIS6-W MINGW64 /c/src/chrome/src/out/gn2 (master)
$ xxd -p obj/third_party/yasm/yasm_utils/xmalloc.obj | tr -d '\n' | grep -c '8d5df4897bfc'
0

thakis@THAKIS6-W MINGW64 /c/src/chrome/src/out/gn2 (master)
$ xxd -p obj/third_party/yasm/yasm_utils/xstrdup.obj | tr -d '\n' | grep -c '8d5df4897bfc'
0

thakis@THAKIS6-W MINGW64 /c/src/chrome/src/out/gn2 (master)
$ xxd -p obj/third_party/yasm/genperf/genperf.obj | tr -d '\n' | grep -c '8d5d10895df0'
0



Running `dumpbin /disasm gn2\obj\third_party\yasm\genperf\genperf.obj` suggests that the different instructions are in _sprintf and _fprintf.
hm, maybe an include path ordering thing wrt vadefs.h in goma? let me try that.

Comment 18 by r...@chromium.org, Jun 7 2016

Mmm, vadefs.h is a good theory. The LEA instructions that differ are either for the first parameter or for a local stack variable.
Looks like I can reliably repro without -j200 if I run `set GOMA_USE_LOCAL=0` first.
If I remove -imsvc (like the following patch), the problem disappears. So this is due to goma not handling -imsvc quite right, and since it's in sprintf it somehow being due to vadefs.h not getting picked up seems very likely. I've failed to make a reduced repro so far, but I'll keep at it for a bit longer. `set GOMA_USE_LOCAL=0` is very useful for debugging goma stuff.



diff --git a/build/toolchain/win/BUILD.gn b/build/toolchain/win/BUILD.gn
index 097229d..f3b7264 100644
--- a/build/toolchain/win/BUILD.gn
+++ b/build/toolchain/win/BUILD.gn
@@ -91,7 +91,7 @@ template("msvc_toolchain") {
     sys_include_flags = "${invoker.sys_include_flags} "  # Note trailing space.
   } else {
     # clang-cl doesn't need this env hoop, so omit it there.
-    assert(!invoker.is_clang)
+    #assert(!invoker.is_clang)
     env_wrapper = "ninja -t msvc -e $env -- "  # Note trailing space.
     sys_include_flags = ""
   }
@@ -318,7 +318,7 @@ if (target_cpu == "x86") {
     cl = "${goma_prefix}$prefix/${clang_cl}"
     toolchain_os = "win"
     is_clang = true
-    sys_include_flags = "${x86_toolchain_data.include_flags}"
+    #sys_include_flags = "${x86_toolchain_data.include_flags}"
   }
 }

@@ -363,7 +363,7 @@ template("win_x64_toolchains") {
     cl = "${goma_prefix}$prefix/${clang_cl}"
     toolchain_os = "win"
     is_clang = true
-    sys_include_flags = "${x64_toolchain_data.include_flags}"
+    #sys_include_flags = "${x64_toolchain_data.include_flags}"

     # This shares a name with a global and forward_variables_from won't clobber
     # the existing value, so we need to set it explicitly.
Aha, looks like the trick to repro is to put -imsvc in an rsp file.
Blocking: -82385 498033
Summary: goma doesn't handle -imsvc correctly when it's in a response file, leading to miscompiles (was: CrWinClang broken after switching it to gn)
Ok, repro steps (and retitling):

In a shell that did NOT run vcvarsall / vcvars32 / setenv.cmd, do:

c:\src\chrome\src> set GOMA_USE_LOCAL=0
C:\src\chrome\src>type file.c
#include <stdio.h>

int main() {
  char buf[80];
  sprintf(buf, "hi %d", 10);
}

C:\src\chrome\src>type rspfile.rsp
-imsvcc:\src\depot_tools\win_toolchain\vs_files\95ddda401ec5678f15eeed01d2bee08fcbc5ee97\win_sdk\bin\..\..\win_sdk\Inclu
de\10.0.10586.0\ucrt -imsvcc:\src\depot_tools\win_toolchain\vs_files\95ddda401ec5678f15eeed01d2bee08fcbc5ee97\win_sdk\bi
n\..\..\VC\include

C:\src\chrome\src>type build.bat
C:\goma\goma-win64/gomacc.exe third_party\llvm-build\Release+Asserts\bin\clang-cl.exe -fmsc-version=1900 -m32 /O1 /Ob2 /
Oy- /Oi /MT /TC /Fofile.obj /c file.c @rspfile.rsp


Then in a different shell that did run e.g. depot_tools\win_toolchain\vs_files\95ddda401ec5678f15eeed01d2bee08fcbc5ee97\win_sdk\bin\SetEnv.cmd

dumpbin /disasm file.obj > file_asm_goma.txt


This file will contain

  0000000A: 8D 5D F4           lea         ebx,[ebp-0Ch]
  0000000D: 89 7B FC           mov         dword ptr [ebx-4],edi

Now, remove "C:\goma\goma-win64/gomacc.exe" from build.bat and run build and dumpbin again. Now these two lines will change to

  0000000A: 8D 5D 10           lea         ebx,[ebp+10h]
  0000000D: 89 5D F0           mov         dword ptr [ebp-10h],ebx


This is because goma now for some reason doesn't pick up clang's vadefs.h.

I'll file an internal bug with goma team for this.
Filed https://b/29190906
Blocking: 605318 354261

Comment 25 by ukai@chromium.org, Jun 21 2016

Cc: shinyak@chromium.org
I believe it's fixed. Could you test it?
Owner: shinyak@chromium.org
Status: Fixed (was: Untriaged)
This works now, thanks! (I flipped https://build.chromium.org/p/chromium.fyi/builders/CrWinClang to gn, and last time it couldn't finish the build since host tools got miscompiled. Now the build succeeds, and all tests stay green.)

Sign in to add a comment