ToTWinThinLTO64, ToTWinCFI and ToTWinCFI64 failing to compile with undefined symbols
Issue description

Logs: https://logs.chromium.org/v/?s=chromium%2Fbb%2Fchromium.clang%2FToTWinThinLTO64%2F117%2F%2B%2Frecipes%2Fsteps%2Fcompile%2F0%2Fstdout

Excerpt:

[52427/52435] LINK webkit_unit_tests.exe webkit_unit_tests.exe.pdb
FAILED: webkit_unit_tests.exe webkit_unit_tests.exe.pdb
cmd /c C:/b/depot_tools/win_tools-2_7_6_bin/python/bin/python.exe ../../build/toolchain/win/tool_wrapper.py delete-file ./webkit_unit_tests.exe.pdb && C:/b/depot_tools/win_tools-2_7_6_bin/python/bin/python.exe ../../build/toolchain/win/tool_wrapper.py link-wrapper environment.x64 False ../../third_party/llvm-build/Release+Asserts/bin/lld-link.exe /nologo /OUT:./webkit_unit_tests.exe /PDB:./webkit_unit_tests.exe.pdb @./webkit_unit_tests.exe.rsp
..\..\third_party\llvm-build\Release+Asserts\bin\lld-link.exe: error: lto.tmp: undefined symbol: ?Replay@ClipDisplayItem@blink@@UEBAXAEAVGraphicsContext@2@@Z
..\..\third_party\llvm-build\Release+Asserts\bin\lld-link.exe: error: lto.tmp: undefined symbol: ?AppendToWebDisplayItemList@ClipDisplayItem@blink@@UEBAXAEBVLayoutSize@2@PEAVWebDisplayItemList@2@@Z
..\..\third_party\llvm-build\Release+Asserts\bin\lld-link.exe: error: lto.tmp: undefined symbol: ?Replay@EndClipDisplayItem@blink@@UEBAXAEAVGraphicsContext@2@@Z
..\..\third_party\llvm-build\Release+Asserts\bin\lld-link.exe: error: lto.tmp: undefined symbol: ?AppendToWebDisplayItemList@EndClipDisplayItem@blink@@UEBAXAEBVLayoutSize@2@PEAVWebDisplayItemList@2@@Z
[52428/52435] LINK unit_tests.exe unit_tests.exe.pdb
FAILED: unit_tests.exe unit_tests.exe.pdb
cmd /c C:/b/depot_tools/win_tools-2_7_6_bin/python/bin/python.exe ../../build/toolchain/win/tool_wrapper.py delete-file ./unit_tests.exe.pdb && C:/b/depot_tools/win_tools-2_7_6_bin/python/bin/python.exe ../../build/toolchain/win/tool_wrapper.py link-wrapper environment.x64 False ../../third_party/llvm-build/Release+Asserts/bin/lld-link.exe /nologo /OUT:./unit_tests.exe /PDB:./unit_tests.exe.pdb @./unit_tests.exe.rsp
..\..\third_party\llvm-build\Release+Asserts\bin\lld-link.exe: error: lto.tmp: undefined symbol: ??_GBidirectionalStreamQuicImpl@net@@UEAAPEAXI@Z
..\..\third_party\llvm-build\Release+Asserts\bin\lld-link.exe: error: lto.tmp: undefined symbol: ?Start@BidirectionalStreamQuicImpl@net@@UEAAXPEBUBidirectionalStreamRequestInfo@2@AEBVNetLogWithSource@2@_NPEAVDelegate@BidirectionalStreamImpl@2@V?$unique_ptr@VTimer@base@@U?$default_delete@VTimer@base@@@std@@@std@@@Z
..\..\third_party\llvm-build\Release+Asserts\bin\lld-link.exe: error: lto.tmp: undefined symbol: ?SendRequestHeaders@BidirectionalStreamQuicImpl@net@@UEAAXXZ
..\..\third_party\llvm-build\Release+Asserts\bin\lld-link.exe: error: lto.tmp: undefined symbol: ?ReadData@BidirectionalStreamQuicImpl@net@@UEAAHPEAVIOBuffer@2@H@Z
..\..\third_party\llvm-build\Release+Asserts\bin\lld-link.exe: error: lto.tmp: undefined symbol: ?SendvData@BidirectionalStreamQuicImpl@net@@UEAAXAEBV?$vector@V?$scoped_refptr@VIOBuffer@net@@@@V?$allocator@V?$scoped_refptr@VIOBuffer@net@@@@@std@@@std@@AEBV?$vector@HV?$allocator@H@std@@@4@_N@Z
..\..\third_party\llvm-build\Release+Asserts\bin\lld-link.exe: error: lto.tmp: undefined symbol: ?GetProtocol@BidirectionalStreamQuicImpl@net@@UEBA?AW4NextProto@2@XZ
..\..\third_party\llvm-build\Release+Asserts\bin\lld-link.exe: error: lto.tmp: undefined symbol: ?GetTotalReceivedBytes@BidirectionalStreamQuicImpl@net@@UEBA_JXZ
..\..\third_party\llvm-build\Release+Asserts\bin\lld-link.exe: error: lto.tmp: undefined symbol: ?GetTotalSentBytes@BidirectionalStreamQuicImpl@net@@UEBA_JXZ
..\..\third_party\llvm-build\Release+Asserts\bin\lld-link.exe: error: lto.tmp: undefined symbol: ?GetLoadTimingInfo@BidirectionalStreamQuicImpl@net@@UEBA_NPEAULoadTimingInfo@2@@Z
..\..\third_party\llvm-build\Release+Asserts\bin\lld-link.exe: error: lto.tmp: undefined symbol: ?PopulateNetErrorDetails@BidirectionalStreamQuicImpl@net@@UEAAXPEAUNetErrorDetails@2@@Z
..\..\third_party\llvm-build\Release+Asserts\bin\lld-link.exe: error: lto.tmp: undefined symbol: ??0BidirectionalStreamQuicImpl@net@@QEAA@V?$unique_ptr@VHandle@QuicChromiumClientSession@net@@U?$default_delete@VHandle@QuicChromiumClientSession@net@@@std@@@std@@@Z
,
Nov 17 2017
I managed to reproduce locally. args.gn:

clang_use_chrome_plugins = false
is_chrome_branded = true
is_clang = true
is_component_build = false
is_debug = false
is_official_build = true
symbol_level = 2
use_lld = true
use_thin_lto = true
# Local settings to use clang
clang_base_path = "C:\\path\\to\\clang"
llvm_force_head_revision = false

Then build with ninja -k 100 all. Building individual targets did not reproduce the issue for me. The targets that are seeing undefined symbols for me locally, and the symbols that are undefined, are different from the ones on the bot. I suspect this is a problem with the toolchain.
,
Nov 17 2017
I'm also seeing "clang\bin\lld-link.exe: error: unsupported relocation type 0x0".
,
Feb 15 2018
The failure is definitely non-deterministic. Using the same chromium revision and clang revision and running ninja all, I got:

1. Build failed after building about 56,000 targets, with undefined symbol errors.
2. Repeatedly running ninja all, the build made progress - more targets were built until it would eventually fail again. Errors included both undefined symbol and a few instances of "unsupported relocation type 0x0".
3. After running ninja all enough times, the build eventually succeeded.

Removing the directory, creating it again with the same args.gn, and still using the same revision of chromium and clang:

4. The build failed with undefined symbol errors after about 60,000 targets had been built. The first failing target was content_perftests, which had never failed to build in any of the prior ninja all runs.
,
Feb 15 2018
This also affects the Windows CFI bots.
,
Feb 23 2018
Another interesting tidbit:

..\..\..\..\clang\bin\lld-link.exe: error: libconcrt.lib(utils.obj): undefined symbol: swprintf_s

That symbol is defined in a library that is provided by the C runtime. In fact, I think it is defined in the very place where we claim it is undefined.
,
Mar 14 2018
,
Apr 6 2018
Update: After several builds that did not run into the undefined symbols error, https://ci.chromium.org/buildbot/chromium.clang/ToTWinThinLTO64/1023 encountered it again. Unsupported relocation type 0x0 also makes an appearance in that build.
,
Apr 19 2018
Analysis is still ongoing, but here are some interesting findings. Unsupported relocation type 0x0 means a relocation type of 0, which should never happen. Undefined symbols in COFF files have section number IMAGE_SYM_UNDEFINED, which is 0, and value 0. Some of the symbols I've seen this error on are defined by an object that participates in the link, but not actually used by any other object in the link. This suggests that the unsupported relocation type 0x0 and undefined symbols reported in the object that actually defines them could be caused by reading bytes as 0 when the bytes in the input file actually are not 0.
,
Apr 25 2018
Some builds (e.g. https://ci.chromium.org/buildbot/chromium.clang/ToTWinThinLTO64/1168) fail with:

lld-link.exe: error: broken object file: lto.tmp

The error message doesn't point out the specific problem, but one of the possible causes is a symbol referring to section 0, which is a reserved value.
,
Apr 25 2018
All "broken object file" fatals come from createDefined() in lld/COFF/InputFiles.cpp. Before we call createDefined(), we check if a symbol is undefined (in ObjFile::initializeSymbols()). This looks at the symbols section number and value. The broken object file fatal in createDefined() is triggered by the section number alone. So we can get there by having a symbol whose value is nonzero but whose section number is zero. The section number comes after the value in the COFF file format, so this is consistent with reading the file contents as zeros after some point.
,
Apr 25 2018
The zeros make this sound somewhat similar to the zeros in executables in crbug.com/644525, but there are a couple of differences. First of all, this isn't about executables, but rather about object files and archive files. Secondly, the symbols we report as undefined all have names. I'm not sure how having consecutive zeros causes us to have symbol names, given that the name must either be in-line or an index into the string table, and in neither case would we end up with symbol names (certainly not multiple symbol names) if the bytes in the symbol record were 0. Thirdly, we managed to get executables to work more reliably by calling FlushFileBuffers on the files that we write. In the case here, many of the failures we are seeing are not in files that were written by us, or even recently. libconcrt.lib(utils.obj), for example, is an object in an archive file that is provided by the system.
,
Apr 25 2018
Summary of what I know so far:

1. The probable cause of all these errors is reading bytes as 0 when they are not 0. This is clear in the case of "Unsupported relocation type 0x0". It also explains why symbols that are only defined but not used by any object in the link show up in "undefined symbol" messages (undefined symbols in COFF have value 0 and section number 0). A value of 0 for the section number also causes "broken object file" error messages. It also explains how a symbol can be reported as undefined by the object file that actually defines it.

2. The errors do not reproduce reliably. Re-running a build that failed may result in a successful build, or a build that fails in a different place. I haven't been able to reproduce these failures locally for weeks. The bots that see the failures can see over twenty successful builds before hitting the errors again.

3. The errors usually occur after many targets (~50,000) have been built, but https://logs.chromium.org/v/?s=chromium%2Fbb%2Fchromium.clang%2FToTWinThinLTO64%2F1151%2F%2B%2Frecipes%2Fsteps%2Fcompile%2F0%2Fstdout shows undefined symbol errors after only a few hundred targets were built. Some of the smaller targets that the errors have occurred on are angle_shader_translator, gcp_portmon64.dll, image_operations_bench, virtual_driver_setup, and zucchini.

4. Symbols that are reported as undefined often include symbols from uuid.lib. Frequently, these symbols are reported as undefined on targets that do not actually use them.

5. Some errors occur in files that we write as part of the build (lto.tmp, for example), but many occur in files that aren't written as part of the build, such as libconcrt.lib or uuid.lib. I originally suspected a problem with the lack of atomicity of filesystem operations, but I don't see how that would cause us to falsely read zeros from files that are provided by the system, existed before the build started, and presumably aren't changed as part of the build.

6. As far as I understand the COFF file format, symbol records have the name first, followed by the value, followed by the section number, and then some other fields. The name is either a string of up to 8 bytes, or 4 zero bytes followed by a 4-byte offset into the string table. An offset into the string table should never be 0, so the name field in a COFF symbol should never be all zeros, either. This means that, if we read zeros after a point in a file, we could end up with a symbol with a name and a value but a section number erroneously read as 0, which causes a "broken object file" diagnostic, or with both value and section number 0, which causes an "undefined symbol" diagnostic. However, for multiple symbols to be reported as undefined, another symbol record would have to have a non-zero name field, which wouldn't happen if we only read zeros. Also, I think the string table usually comes after the symbol records. One possible explanation for why we still report multiple symbols in the same object file could be that we read nonzero bytes later in the file and don't report on the symbols that we detected as not having a name.
,
Apr 25 2018
cc'ing some people to get more eyes on this. If you guys have ideas about what could be causing this or how to investigate, I'd love to hear. I'll be on vacation next week, but I'm hoping we can fix this soon after, so that we can start shipping Chrome with CFI in the not too distant future.
,
Apr 25 2018
A summary of what I have found so far is in https://bugs.chromium.org/p/chromium/issues/detail?id=786127#c13
,
Apr 26 2018
The bug in 644525 shouldn't be applicable - Microsoft promised that it should only apply when loading a PE file.
When zturner@ and I were investigating that bug we discussed the possibility of adding extra instrumentation and checks to lld-link.exe to detect the problem and print more information. It turns out it wouldn't have helped with that particular bug, but it might help with this one. The linker is already detecting these failures so we could get it to print additional information.
I tried to find where swprintf_s is supposed to come from by compiling this code and linking it with /VERBOSE, but it wouldn't tell me anything for some reason:
#include <stdio.h>

int main()
{
    wchar_t buffer[100];
    swprintf_s(buffer, L"A number is %d.\n", 10);
    wprintf(L"%s", buffer);
    return 0;
}
,
Apr 27 2018
@brucedawson, by "wouldn't tell me anything", do you mean that /verbose doesn't produce the output about symbol resolution at all, or that swprintf_s is not in there? When I compile the program you gave, the object file I get does not actually reference swprintf_s. Rather, there is a definition of swprintf_s in the object file, which references _vswprintf_s_l, also defined in the object file, which references __stdio_common_vswprintf_s. That symbol is in libucrt.lib, although lld-link's /verbose output doesn't seem to show it. In the case mentioned earlier in this thread, swprintf_s is reported as undefined, but the location where it is reported as being referenced is actually the definition: utils.obj in libconcrt.lib.
,
Apr 27 2018
When I said "wouldn't tell me anything" I meant that it didn't say where swprintf_s came from. As you said this is because swprintf_s is defined in my .obj flie due to this macro-created inline definition of it:
__inline int __cdecl swprintf_s(
    wchar_t*       const _Buffer,
    size_t         const _BufferCount,
    wchar_t const* const _Format,
    ...)
#line 1538 "C:\\Program Files (x86)\\Windows Kits\\10\\include\\10.0.16299.0\\ucrt\\corecrt_wstdio.h"
{
    int _Result;
    va_list _ArgList;
    ((void)(__vcrt_va_start_verify_argument_type<decltype(_Format)>(), ((void)(__va_start(&_ArgList, _Format)))));
    _Result = _vswprintf_s_l(_Buffer, _BufferCount, _Format, 0, _ArgList);
    ((void)(_ArgList = (va_list)0));
    return _Result;
}
I don't understand this sentence:
> the location where it is reported as being referenced is actually the definition: utils.obj in libconcrt.lib.
Every .obj file that references swprintf_s will necessarily define it, as an inline function, and there will be multiple definitions.
I can't tell if this is a compiler bug, linker bug, or hardware error. It seems important that once the problem starts happening it keeps happening for a while - see builds 1178 to 1180 here:
https://ci.chromium.org/buildbot/chromium.clang/ToTWinThinLTO64/?limit=50
,
May 9 2018
inglorion: Did you get a chance to try my suggestion of using a different cache directory for each link?
,
May 9 2018
pcc, not yet. I still think that's worth doing even if it doesn't explain everything we're seeing, so I'll take a stab at it tomorrow. In the meantime, I noticed that a number of places where we see strange values being read come after we write to files through raw_fd_ostream, and I noticed that the FlushFileBuffers call that we added does not apply to that code path, so I put up a quick diff to make raw_fd_ostream call FlushFileBuffers on Windows, too (https://reviews.llvm.org/D46621).
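For illustration, flushing the OS buffers for the descriptor that raw_fd_ostream wraps boils down to something like the following on Windows (a minimal sketch with a hypothetical helper name, not the actual D46621 patch):

    #ifdef _WIN32
    #include <io.h>        // _get_osfhandle
    #include <windows.h>   // FlushFileBuffers

    // Hypothetical helper: ask the OS to flush everything it has buffered for
    // this CRT file descriptor to the device before the descriptor is closed.
    static bool flushOSBuffersForFD(int FD) {
      HANDLE H = reinterpret_cast<HANDLE>(_get_osfhandle(FD));
      if (H == INVALID_HANDLE_VALUE)
        return false;
      return ::FlushFileBuffers(H) != 0;
    }
    #endif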
,
Jun 7 2018
I wrote a small lld patch that dumped the object file to stdout if we hit an unsupported reloc error:
diff --git a/lld/COFF/Chunks.cpp b/lld/COFF/Chunks.cpp
index 6d8a0c7c1f4..2931bc23d21 100644
--- a/lld/COFF/Chunks.cpp
+++ b/lld/COFF/Chunks.cpp
@@ -97,9 +97,11 @@ void SectionChunk::applyRelX64(uint8_t *Off, uint16_t Type, OutputSection *OS,
   case IMAGE_REL_AMD64_REL32_5: add32(Off, S - P - 9); break;
   case IMAGE_REL_AMD64_SECTION: applySecIdx(Off, OS); break;
   case IMAGE_REL_AMD64_SECREL: applySecRel(this, Off, OS, S); break;
-  default:
+  default: {
+    llvm::outs() << toHex(File->MB.getBuffer()) << '\n';
     fatal("unsupported relocation type 0x" + Twine::utohexstr(Type));
   }
+  }
 }
 
 void SectionChunk::applyRelX86(uint8_t *Off, uint16_t Type, OutputSection *OS,
and used it to build chromium in a loop on my machine overnight with this args.gn:
is_clang = true
is_component_build = false
is_debug = false
is_official_build = true
symbol_level = 2
use_lld = true
use_thin_lto = true
The error ended up happening twice and I was able to capture both object files. I've attached them to this comment as file1.o and file2.o. All other builds (a total of 7 ran) ended up failing due to undefined symbols.
Looking at file1.o I see a suspiciously page-sized block of zeros from 0xcc000 to 0xcd000. There was one file of the same size in thinlto-cache, so I compared it to file1.o. The only difference was that there appeared to be correct content in the zeroed region. I've attached it as cache1.o.
In file2.o there are two blocks of zeros: one from 0x2000 to 0x8000 and another from 0x3c000 to 0x6d000. I couldn't find a corresponding file in thinlto-cache this time.
This does seem to confirm the hypothesis that parts of files are being read as zeros. The fact that they're all on page boundaries would tend to make me suspect the operating system.
,
Jun 7 2018
Thanks, Peter. That is super helpful. Also interesting that we can have zeros, valid data, and zeros again. I suspected as much, but it's good to have this confirmed with an actual capture. Speaking of suspicions, I also suspect that this only affects files we mmap and not files we read. Is that also something you were able to determine?
,
Jun 7 2018
But aren’t we using the same MemoryBuffer APIs from llvm? Don’t those have the workaround of calling FlushFileBuffers on close?
,
Jun 7 2018
We aren't following the code path that flushes object files because we aren't using memory-mapped I/O to write them (only to read them). I figured that the problem may have something to do with the write/mmap pattern that we use. So I hacked together a "subcommand" for llvm-lto2 that basically follows the file I/O pattern that the linker uses during LTO and verifies that the mapped file content is as expected. I've attached the patch adding the subcommand. If I run this subcommand on Windows I see a lot of verifier failures. So that's certainly interesting.
,
Jun 7 2018
Try hacking together something that adds FILE_FLAG_WRITE_THROUGH when opening a file for write and see if that fixes it.
,
Jun 7 2018
@pcc, I also wanted to ask what version of Windows you're running. I've been wondering if perhaps we're hitting a bug on the bot that doesn't exist in the version of Windows I have on my local machine. My local machine has version 10.0.14393.
,
Jun 7 2018
zturner: I'll try that. inglorion: Mine is version 10.0.17134.48.
,
Jun 7 2018
I assume that we are closing the file after writing to it before we open the file for reading? This bug certainly smells similar to the 644525 bug, but Microsoft assures me that the root-cause of 644525 is not applicable to a scenario like this. So, if it is an OS bug then it must be similar-but-different. If you have a program that reproes the bug then I could share that with Microsoft. Adding a call to FlushFileBuffers before closing the file seems like a good attempted mitigation step, even if it isn't exactly the same bug.
,
Jun 7 2018
> I assume that we are closing the file after writing to it before we open the file for reading?

We aren't, actually. We open the file for read/write, write the object file, and then mmap it using the same handle. Maybe this isn't guaranteed to work on Windows?
,
Jun 7 2018
I think that is not guaranteed to work on Windows, so that's the problem. I'll double check.
,
Jun 7 2018
I think that should be guaranteed to work as long as you are *NOT* using the CRT IO functions. If you are using the CRT functions though, then the CRT does its own buffering, so you could write via one handle, the CRT doesn't flush, then open via another handle and it doesn't see the changes that the CRT has buffered. If you're writing via a native OS handle though, this should definitely work.
,
Jun 7 2018
> I think that should be guaranteed to work as long as you are *NOT* using the CRT IO functions.

That could explain it then. The object file writer uses raw_ostream (which translates down to CRT I/O). I found a _commit function which seems to be the equivalent of fflush for CRT file handles: https://msdn.microsoft.com/en-us/library/17618685.aspx. I'll see if that helps.
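For illustration, the experiment amounts to something like this after the object file bytes have been written to the descriptor (hypothetical helper, not the actual patch):

    #ifdef _WIN32
    #include <io.h>   // _commit

    // Hypothetical helper: _commit forces the file's contents to disk for a CRT
    // file descriptor; it returns 0 on success, -1 (with errno set) on failure.
    static bool commitFD(int FD) {
      return _commit(FD) == 0;
    }
    #endif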
,
Jun 7 2018
Note https://reviews.llvm.org/D46621 which adds a FlushFileBuffers call to raw_fd_ostream.
,
Jun 7 2018
Though I guess if it's a buffer internal to the CRT then FlushFileBuffers wouldn't help.
,
Jun 7 2018
Ah, yeah.
,
Jun 7 2018
It looks like _commit() is just a thin wrapper over FlushFileBuffers, and the only time the CRT does any internal buffering is when you have a FILE*. That said, I still think my comment in #31 is accurate. Specifically, Windows is supposed to guarantee correct operation. Perhaps it's just not satisfying the guarantee. So I think FlushFileBuffers is still worth a try.
,
Jun 7 2018
I tried adding the call to _commit, and I was still seeing failures (diff attached). Since it's just a wrapper around FlushFileBuffers, I guess that wouldn't help either. Next thing I'll try is closing and reopening the file handle.
,
Jun 7 2018
The other thing to consider is that we're also seeing failures in files that we don't modify as part of the build, for example utils.obj in libconcrt.lib (which is provided by the system). That may be a different problem, or there may be a common cause and solution.
,
Jun 7 2018
Closing and reopening seems to help; I see no failures with that. But one downside is that it would mean that our handling of temporary files would no longer allow for files to be kept atomically. We currently go to great lengths to make temporary files (such as cache files, before they are added to the cache) delete-on-close from the point at which they are created until the point at which they are ready to be committed to the cache. (Specifically, we open cache files with FILE_FLAG_DELETE_ON_CLOSE and later use this function to clear the delete-on-close property: http://llvm-cs.pcc.me.uk/lib/Support/Windows/Path.inc#429 ). With close/reopen, there would be a small window in which the file would be left on disk without being delete-on-close. I tried opening the file read-only (without closing it) and mmap'ing with the opened file handle, but that didn't solve the problem. Guess I'll try a few other things.
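For reference, clearing (or setting) the delete-on-close bit through an open handle is done with SetFileInformationByHandle and FILE_DISPOSITION_INFO, roughly like this (a sketch of the mechanism the linked Path.inc function uses, not the exact LLVM code):

    #include <windows.h>

    // Hypothetical helper: set or clear the pending delete-on-close disposition
    // on an open handle. The handle must have been opened with DELETE access.
    static bool setDeleteOnClose(HANDLE H, bool Delete) {
      FILE_DISPOSITION_INFO Info;
      Info.DeleteFile = Delete ? TRUE : FALSE;
      return ::SetFileInformationByHandle(H, FileDispositionInfo, &Info, sizeof(Info)) != 0;
    }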
,
Jun 7 2018
Did you try FILE_FLAG_WRITE_THROUGH yet?
,
Jun 7 2018
I tried it, but it slowed down my test program significantly and I didn't wait for it to finish.
,
Jun 7 2018
It might help if we had a diagram that showed all the various actors: 1) who is opening the file and the exact flags they're using, 2) who is mapping the file and what flags they're using, 3) who is closing the file, and the order in which all these things happen. I'm trying to picture all this in my head. That said, I think all of this goes back to my very early premise that the file system really shouldn't be used for atomicity on Windows; you need some kind of external synchronization mechanism. In any case, maybe bruce can ping his contacts on the kernel team with this info and see if we've found *another* kernel bug.
,
Jun 7 2018
It would also be helpful to have a repro of the bug that used OS functions directly - CreateFile, WriteFile, CloseHandle, etc. I verified this (that the CRT's open/write/close functions go almost straight to the OS) by stepping into a debug version of this program - the CRT source code is in C:\Program Files (x86)\Windows Kits\10\Source\10.0.17134.0\ucrt\lowio\write.cpp:
#define _CRT_SECURE_NO_WARNINGS
#include <io.h>
#include <fcntl.h>
#include <sys/stat.h>

int main() {
    const char buffer[] = "Some test text.\n";
    // _O_WRONLY and a permission mode are needed for the open/write to succeed.
    int fh = _open("c:\\src\\temp\\outtest.bin", _O_WRONLY | _O_CREAT | _O_BINARY,
                   _S_IREAD | _S_IWRITE);
    _write(fh, buffer, sizeof(buffer));
    _close(fh);
    return 0;
}
The CRT open/write/close functions map fairly directly to the OS functions (they don't add any buffering as far as I can tell), but they do perform some games to map various flags correctly. The OS team generally doesn't want to take responsibility for CRT code, so they would want a native repro.
I'm assuming that by CRT I/O you mean open/write/close rather than fopen/fwrite/fclose, which do default to adding buffering.
,
Jun 7 2018
Yea we don't use any fopen/fwrite/fclose calls, so ignore my previous comment about CRT buffering.
,
Jun 8 2018
The issue can be reproduced within a single process. Here's what a program seems to need to do to reproduce the bug, at least on my machine:

1) Do this at least 4000 times:
- use CreateFile to open a file with FILE_FLAG_DELETE_ON_CLOSE, obtaining handle h1
- use WriteFile on h1 to write 4966353 bytes of content to the file
- use CreateFileMapping/MapViewOfFile on h1 to create a mapping for the file, then close the handle from CreateFileMapping
- use CreateFile with OPEN_EXISTING and without FILE_FLAG_DELETE_ON_CLOSE to obtain a second handle to the file, h2
- close h1
- use SetFileInformationByHandle on h2 to clear the delete-on-close bit on the file and rename the file
- close h2

2) Do this at least 2000 times:
- use CreateFile with OPEN_EXISTING and FILE_FLAG_DELETE_ON_CLOSE to obtain a handle to one of the files created in step 1
- close the handle, which would delete the file if not for the mapping created in step 1

3) Read the mappings created during step 1 and observe that those corresponding to the files "deleted" in step 2 are incorrect.

I haven't determined whether every element of this is required, and the magic numbers may be particular to my machine. I'll be playing with my program to see if I can find a smaller reproducer as well as a good way to work around the bug. I've attached a standalone reproducer that uses only the Windows API. If I run it, I see output like this:

0 bad at 1
2 bad at 1
4 bad at 1
6 bad at 1
8 bad at 1
10 bad at 1
12 bad at 1
14 bad at 1
16 bad at 1
18 bad at 1
20 bad at 1
22 bad at 1
24 bad at 1
...
,
Jun 8 2018
I was able to reduce my reproducer to the following:

1) Do this at least 4000 times:
- use CreateFile to open a file with FILE_FLAG_DELETE_ON_CLOSE, obtaining handle h1
- use WriteFile on h1 to write 4966353 bytes of content to the file
- use CreateFileMapping/MapViewOfFile on h1 to create a mapping for the file, but don't close the handle from CreateFileMapping
- close h1

2) Read the mappings created during step 1 and observe that some of them (mostly the ones created earlier) are incorrect.

I've attached a program that does that. If I make either of the two following changes to the program:
- don't close h1
- use DuplicateHandle to duplicate h1 before closing it
I can no longer reproduce. Note also that I can reproduce if I keep the file mapping handle open, which means that only actual file handles seem to matter here. That's certainly odd, because I would have at least expected the file mapping handle to keep a reference to the file handle.

So the issue seems to be:
- FILE_FLAG_DELETE_ON_CLOSE only cares about file handles, not file mappings or file mapping handles
- if the file handle refcount drops to 0, the file is deleted, but any mapped pages still point to the kernel memory originally used to map the files
- the kernel may later reclaim some of those pages, causing them to be read as zeros

It seems like we should be able to work around the bug by duplicating the file handle when we create a file mapping, and closing the duplicated handle when we destroy it. I'll implement that and build chromium in a loop overnight and see if I can still reproduce.
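Since the attached programs aren't included here, the reduced reproducer described above would look roughly like this (sizes and iteration counts are the ones mentioned in the comment; error checking is omitted, and the whole thing should be read as an illustrative sketch, not the attached program):

    #include <windows.h>
    #include <cstdio>
    #include <cstring>
    #include <string>
    #include <vector>

    int main() {
      const size_t Size = 4966353;    // size mentioned above
      const int Iterations = 4000;    // iteration count mentioned above
      std::vector<char> Buf(Size, 'x');
      std::vector<const char *> Views;

      // Step 1: create a delete-on-close file, write ~5 MB, map it, keep the
      // mapping handle open, then close the original file handle.
      for (int I = 0; I < Iterations; ++I) {
        std::string Path = "repro" + std::to_string(I) + ".tmp";
        HANDLE H1 = CreateFileA(Path.c_str(), GENERIC_READ | GENERIC_WRITE | DELETE,
                                FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                                nullptr, CREATE_ALWAYS,
                                FILE_ATTRIBUTE_NORMAL | FILE_FLAG_DELETE_ON_CLOSE, nullptr);
        DWORD Written = 0;
        WriteFile(H1, Buf.data(), (DWORD)Size, &Written, nullptr);
        HANDLE Mapping = CreateFileMappingA(H1, nullptr, PAGE_READONLY, 0, 0, nullptr);
        const char *View =
            static_cast<const char *>(MapViewOfFile(Mapping, FILE_MAP_READ, 0, 0, 0));
        Views.push_back(View);
        CloseHandle(H1);   // close the file handle; the mapping handle stays open
      }

      // Step 2: read the views back; with the bug, some of them come back as zeros.
      for (int I = 0; I < Iterations; ++I)
        if (memcmp(Views[I], Buf.data(), Size) != 0)
          printf("%d bad\n", I);
      return 0;
    }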
,
Jun 8 2018
I found this interesting tidbit in the documentation of UnmapViewOfFile (https://msdn.microsoft.com/en-us/library/windows/desktop/aa366882.aspx):

"Although an application may close the file handle used to create a file mapping object, the system holds the corresponding file open until the last view of the file is unmapped. Files for which the last view has not yet been unmapped are held open with no sharing restrictions."

So if the steps you describe are correct (specifically, if you have called MapViewOfFile before closing the handle h1), then this does look like a bug. The system should be holding the file open until the view is unmapped.
,
Jun 8 2018
Yes, that's exactly what the program is doing. Having read that documentation, I agree that this looks like a bug.
,
Jun 8 2018
Has anybody found a link to what the actual guarantees are? i.e., how do we know this is a bug in Windows rather than incorrect usage? I hope the workaround works.
,
Jun 8 2018
Bruce: See comment 47. Microsoft says: "- Although an application may close the file handle used to create a file mapping object, ******the system holds the corresponding file open until the last view of the file is unmapped******." So we have FILE_FLAG_DELETE_ON_CLOSE, then we create a file mapping M to H1, then we map a view V to M. At this point, according to the documentation, the system should hold the file H1 open until the view is closed. But we close H1 and the underlying file seems to be deleted anyway.
,
Jun 8 2018
The workaround from c#46 almost worked, but it caused builds to start failing with permission denied errors while opening cache files. After running a build under process monitor, this is what I think was happening:
- if a file handle was opened with FILE_FLAG_DELETE_ON_CLOSE, that causes subsequent open requests to fail, per the documentation for CreateFile
- closing the FILE_FLAG_DELETE_ON_CLOSE file handle allows the open requests to succeed (provided of course that the program has arranged not to delete the file on close, for example by using the trick linked from c#39)
- but if the FILE_FLAG_DELETE_ON_CLOSE handle had been duplicated, the file isn't actually closed until the file is unmapped, so open requests from other processes fail as long as the original process has it mapped

That sent me back to the drawing board for another workaround. I started with this partially-reduced reproducer (attached as main3.cpp):

1) do this at least 4000 times:
- use CreateFile to open a file with FILE_FLAG_DELETE_ON_CLOSE, obtaining handle h1
- use WriteFile on h1 to write 4966353 bytes of content to the file
- use CreateFileMapping/MapViewOfFile on h1 to create a mapping for the file, then close the handle from CreateFileMapping
- use CreateFile with OPEN_EXISTING and without FILE_FLAG_DELETE_ON_CLOSE to obtain a second handle to the file, h2
- close h1
- use SetFileInformationByHandle on h2 to clear the delete-on-close bit on the file
- close h2

2) do this at least 2000 times:
- use CreateFile with OPEN_EXISTING and FILE_FLAG_DELETE_ON_CLOSE to obtain a handle to one of the files created in step 1
- close the handle, which would delete the file if not for the mapping created in step 1

3) read the mappings created during step 1

and changed it as follows (attached as main4.cpp):

1) do this at least 4000 times:
- use CreateFile to open a file with FILE_FLAG_DELETE_ON_CLOSE, obtaining handle h1
- use WriteFile on h1 to write 4966353 bytes of content to the file
- use CreateFile with OPEN_EXISTING and without FILE_FLAG_DELETE_ON_CLOSE to obtain a second handle to the file, h2
- close h1
- use SetFileInformationByHandle on h2 to clear the delete-on-close bit on the file
- use CreateFileMapping/MapViewOfFile on h2 to create a mapping for the file, then close the handle from CreateFileMapping
- close h2

2) do this at least 2000 times:
- use CreateFile with OPEN_EXISTING and FILE_FLAG_DELETE_ON_CLOSE to obtain a handle to one of the files created in step 1
- close the handle, which would delete the file if not for the mapping created in step 1

3) read the mappings created during step 1

The key difference is that I create the mapping on h2 rather than h1. With that, I can no longer reproduce the bug using my test program. What this tells me is that the bug only repros on mappings created from file handles that were opened with FILE_FLAG_DELETE_ON_CLOSE, and a potential workaround is to use a non-FILE_FLAG_DELETE_ON_CLOSE file handle to map the file (see the sketch below). I implemented this workaround in LLVM and started a build. It's gotten further than my original workaround, so I'm hopeful. I'll do builds in a loop over the weekend and see if it still repros.
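In code, the key difference described above (creating the mapping from h2, the handle opened without FILE_FLAG_DELETE_ON_CLOSE) looks roughly like this illustrative fragment (hypothetical helper name, not the attached main3.cpp/main4.cpp):

    #include <windows.h>

    // Illustrative fragment: write through a FILE_FLAG_DELETE_ON_CLOSE handle (H1),
    // but create the file mapping from a second handle (H2) opened without that flag.
    static const void *writeAndMapViaSecondHandle(const char *Path, const void *Data, DWORD Size) {
      HANDLE H1 = CreateFileA(Path, GENERIC_READ | GENERIC_WRITE | DELETE,
                              FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                              nullptr, CREATE_ALWAYS,
                              FILE_ATTRIBUTE_NORMAL | FILE_FLAG_DELETE_ON_CLOSE, nullptr);
      DWORD Written = 0;
      WriteFile(H1, Data, Size, &Written, nullptr);

      HANDLE H2 = CreateFileA(Path, GENERIC_READ | DELETE,
                              FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                              nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
      CloseHandle(H1);
      // Clearing the delete-on-close bit on H2 via SetFileInformationByHandle is
      // elided here (see the earlier sketch); the point is mapping from H2, not H1.
      HANDLE Mapping = CreateFileMappingA(H2, nullptr, PAGE_READONLY, 0, 0, nullptr);
      const void *View = MapViewOfFile(Mapping, FILE_MAP_READ, 0, 0, 0);
      CloseHandle(Mapping);
      CloseHandle(H2);
      return View;   // error checking omitted; illustrative only
    }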
,
Jun 8 2018
I've spent a fair amount of time reading documentation on MSDN trying to figure out if we're using this incorrectly. My conclusion is that we may want to check with Microsoft to make sure.

From the MapViewOfFile remarks (https://msdn.microsoft.com/en-us/library/windows/desktop/aa366761(v=vs.85).aspx):

"""With one important exception, file views derived from any file mapping object that is backed by the same file are coherent or identical at a specific time. Coherency is guaranteed for views within a process and for views that are mapped by different processes. The exception is related to remote files."""

Since we're not using remote files here, this suggests that views of the same file are guaranteed to be coherent. On the other hand, from the same page:

"""A mapped view of a file is not guaranteed to be coherent with a file that is being accessed by the ReadFile or WriteFile function."""

I don't know how to interpret that (particularly the WriteFile part), and the documentation does not clarify it. Does calling WriteFile, waiting for it to return, then calling CreateFileMapping and MapViewOfFile count as "being accessed by WriteFile" or not? Does closing the handle, ending the thread, and synchronizing on the thread ending create an expectation that reading from the mapping sees what was written by WriteFile? If not, what does?

The documentation for FlushFileBuffers (https://msdn.microsoft.com/en-us/library/windows/desktop/aa364439(v=vs.85).aspx) says:

"""Typically the WriteFile and WriteFileEx functions write data to an internal buffer that the operating system writes to a disk or communication pipe on a regular basis. The FlushFileBuffers function writes all the buffered information for a specified file to the device or pipe."""

This suggests that perhaps WriteFile is not expected to always immediately affect wherever the file mapping looks (cache or storage, I imagine). Hence my earlier attempts to use FlushFileBuffers. But @pcc tried that, and reported that it didn't make the problem go away.

As @zturner pointed out, the expectation for file mappings is that the file is held open as long as a mapping exists, even if the file handle is closed. So @pcc's finding that not closing the file handle, or duplicating the file handle first, makes a difference seems surprising. On the other hand, the documentation for CreateFile (https://msdn.microsoft.com/en-us/library/windows/desktop/aa363858(v=vs.85).aspx) has this to say about FILE_FLAG_DELETE_ON_CLOSE:

"""The file is to be deleted immediately after all of its handles are closed, which includes the specified handle and any other open or duplicated handles."""

If I understand correctly, file handles are not the same as file objects, and so although the file object remains even if all handles are closed, the actual file is deleted once the last _handle_ is closed. That does seem to explain the behavior pcc is seeing. I found this blog post that tells a similar story: https://blogs.msdn.microsoft.com/oldnewthing/20160108-00/?p=92821

So I think we may be holding it wrong, but I'm not 100% confident. If we are, I hope we can get the documentation to be clearer about that. If what we're doing is supposed to work but doesn't, well, then I would like that fixed. Either way I think it would be good to reach out to Microsoft.
,
Jun 8 2018
>> """A mapped view of a file is not guaranteed to be coherent with a file that is being accessed by the ReadFile or WriteFile function.""" > I don't know how to interpret that (particularly the WriteFile part), and the documentation does not clarify it. I saw this comment too while reading the document. This part is not an issue for us. We are not accessing anything with WriteFile / ReadFile.
,
Jun 8 2018
> I saw this comment too while reading the document. This part is not an issue for us. We are not accessing anything with WriteFile / ReadFile.

Right. I interpret it as "being accessed while the mapping is active", which isn't the case for us.
,
Jun 9 2018
I got another undefined symbol error from one of the links, so it looks like the workaround from c#51 is not enough. My assumption is that the bug can also affect mappings created from non-FILE_FLAG_DELETE_ON_CLOSE handles in some cases, so I'm going to see whether combining the c#51 workaround with the c#46 workaround helps.
,
Jun 11 2018
The good news is that with both workarounds I'm no longer seeing undefined symbol errors. The bad news is that all of my weekend builds failed with some variation on:

[54104/58195] LINK headless_unittests.exe headless_unittests.exe.pdb
FAILED: headless_unittests.exe headless_unittests.exe.pdb
cmd /c C:/src/depot_tools/win_tools-2_7_6_bin/python/bin/python.exe ../../build/toolchain/win/tool_wrapper.py delete-file ./headless_unittests.exe.pdb && C:/src/depot_tools/win_tools-2_7_6_bin/python/bin/python.exe ../../build/toolchain/win/tool_wrapper.py link-wrapper environment.x64 False C:/src/l/ra/bin/lld-link.exe /nologo /OUT:./headless_unittests.exe /PDB:./headless_unittests.exe.pdb @./headless_unittests.exe.rsp
[54105/58195] LINK stress_cache.exe stress_cache.exe.pdb
[54106/58195] LINK headless_shell.exe headless_shell.exe.pdb
FAILED: headless_shell.exe headless_shell.exe.pdb
cmd /c C:/src/depot_tools/win_tools-2_7_6_bin/python/bin/python.exe ../../build/toolchain/win/tool_wrapper.py delete-file ./headless_shell.exe.pdb && C:/src/depot_tools/win_tools-2_7_6_bin/python/bin/python.exe ../../build/toolchain/win/tool_wrapper.py link-wrapper environment.x64 False C:/src/l/ra/bin/lld-link.exe /nologo /OUT:./headless_shell.exe /PDB:./headless_shell.exe.pdb @./headless_shell.exe.rsp
[54107/58195] LINK net_unittests.exe net_unittests.exe.pdb
FAILED: net_unittests.exe net_unittests.exe.pdb
cmd /c C:/src/depot_tools/win_tools-2_7_6_bin/python/bin/python.exe ../../build/toolchain/win/tool_wrapper.py delete-file ./net_unittests.exe.pdb && C:/src/depot_tools/win_tools-2_7_6_bin/python/bin/python.exe ../../build/toolchain/win/tool_wrapper.py link-wrapper environment.x64 False C:/src/l/ra/bin/lld-link.exe /nologo /OUT:./net_unittests.exe /PDB:./net_unittests.exe.pdb @./net_unittests.exe.rsp
[54108/58195] LINK headless_example.exe headless_example.exe.pdb
ninja: build stopped: subcommand failed.

I'm not sure what this means. My hypothesis is that my machine simply ran out of memory while linking these files, so we need to use a smaller pool depth for ThinLTO link jobs on Windows. I'll rebuild with a smaller pool depth and see if that makes the problem go away on my machine. In any event I'll send a patch for review since the undefined symbol problem seems to have been fixed at least.
,
Jun 11 2018
,
Jun 12 2018
> I'll rebuild with a smaller pool depth and see if that makes the problem go away on my machine.

Even with pool depth 2 I still saw this happen, but much later. I've patched ninja to print the process exit code (https://github.com/pcc/ninja/tree/exit-code) and am rebuilding to see if that tells me anything interesting.
,
Jun 12 2018
Do you think the compiler is crashing? You can check the event log to see. Or, if it seems to be crashing then collect crash dumps. You can configure WER to do that for you - instructions are here: https://msdn.microsoft.com/en-us/library/windows/desktop/bb787181(v=vs.85).aspx It's a good idea to have this configured anyway. How long does the linking take? You can also try monitoring memory usage to see if it's getting to a critical level.
,
Jun 12 2018
I can't seem to see any relevant crashes in the event log. I looked for log entries under Windows Logs -> Application where the source is "Application Error". With the rebuild I managed to reproduce one failure relatively quickly. My patched ninja told me that the exit code was 0xc0000409 which apparently means STATUS_STACK_BUFFER_OVERRUN (stack overflow). I'm not sure yet why the frequency of this error is correlated with higher concurrency, but I'm not ruling out some sort of memory issue that's being reported in an obtuse way. I'll rebuild again while monitoring memory usage.
,
Jun 12 2018
It doesn't look like an out of memory situation at least. I sampled the "Available Physical Memory" from systeminfo.exe every second during a build and the lowest value was 23GB. I also set up the registry as instructed in c#59 and didn't collect any dumps that way. Is there maybe a way to get Windows to be more aggressive about collecting dumps?
,
Jun 12 2018
0xc0000409 is a tricky error code, both because its name is ambiguous and because it has been overloaded.

0xC00000FD is STATUS_STACK_OVERFLOW and means that we have run out of stack, with the solution either being to use less stack or to make the stack larger. This is not what you are seeing.

0xC0000409 is STATUS_STACK_BUFFER_OVERRUN, which used to always mean that a buffer overrun on a stack-based buffer was detected - i.e., a stack canary was tromped. Its meaning has since been overloaded so that it is also used to indicate a CFG (Control Flow Guard) error - an indirect call going somewhere that it ought not to. So, 0xC0000409 is used for suspected security-related bugs.

I'm not sure why these would be happening randomly, but it smells like some missing error checking and perhaps a race condition in lld. One option would be to modify the ninja files so that the linker is always run under a debugger. Something like this will do the trick:

third_party\depot_tools\win_toolchain\vs_files\3bc0ec615cf20ee342f3bc29bc991b5ad66d8d2c\win_sdk\Debuggers\x64\cdb -g -G -o "c:\src\chromium3\src\third_party\llvm-build\Release+Asserts\bin\lld-link.exe"

This runs cdb from the current toolchain directory with "-g -G -o" and then invokes lld-link.exe. The "-g -G -o" options tell cdb not to halt at process start or stop and to debug child processes. This should reliably catch crashes, I believe. Ideally the WER instructions I shared would capture these crashes without having to involve cdb, but who knows? Sometimes the post-mortem debugger fails to catch crashes and there isn't much to be done about it.
,
Jun 12 2018
Thanks Bruce. Since I presumably wouldn't be able to take control of the debugger while it's running under ninja I hacked my src tree to run linker jobs under cdb so that it would create a dump file if the linker crashes, like this:
diff --git a/build/toolchain/win/tool_wrapper.py b/build/toolchain/win/tool_wrapper.py
index b2cb093377ee..70522bc6acc7 100644
--- a/build/toolchain/win/tool_wrapper.py
+++ b/build/toolchain/win/tool_wrapper.py
@@ -157,6 +157,7 @@ class WinTool(object):
       m = _LINK_EXE_OUT_ARG.match(arg)
       if m:
         pe_name = m.group('out')
+        args = ['C:\\Program Files (x86)\\Windows Kits\\10\\Debuggers\\x64\\cdb.exe', '-g', '-G', '-c', ".dump /u /ma link.dmp"] + args
     for retry_count in range(_LINKER_RETRIES):
       retry = False
       link = subprocess.Popen(args, shell=sys.platform == 'win32', env=env,
and kicked off another build.
,
Jun 12 2018
With that I was able to capture a dump and it turned out to be a bug in the code that I added in https://reviews.llvm.org/D48051 that tries to handle the fallback case if we were unable to rename one of the cache files. Specifically we were trying to map the file descriptor -1, which led to what looks like a CFG failure in the CRT. I think we should be able to remove that code after changing LLVM to use SetFileInformationByHandle to manage the delete-on-close bit instead of FILE_FLAG_DELETE_ON_CLOSE, and at the same time that should let us remove another couple of hacks too.
,
Jun 13 2018
Fix landed in LLVM r334630.
,
Jun 14 2018