New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 786127 link

Starred by 4 users

Issue metadata

Status: Fixed
Owner:
Closed: Jul 30
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 3
Type: Bug

Blocked on:
issue 852679

Blocking:
issue 598772



Sign in to add a comment

TotWinThinLTO64, ToTWinCFI and ToTWinCFI64 failing compile with undefined symbols

Project Member Reported by inglorion@chromium.org, Nov 16 2017

Issue description

Logs: https://logs.chromium.org/v/?s=chromium%2Fbb%2Fchromium.clang%2FToTWinThinLTO64%2F117%2F%2B%2Frecipes%2Fsteps%2Fcompile%2F0%2Fstdout

Excerpt:
[52427/52435] LINK webkit_unit_tests.exe webkit_unit_tests.exe.pdb
FAILED: webkit_unit_tests.exe webkit_unit_tests.exe.pdb 
cmd /c C:/b/depot_tools/win_tools-2_7_6_bin/python/bin/python.exe ../../build/toolchain/win/tool_wrapper.py delete-file ./webkit_unit_tests.exe.pdb && C:/b/depot_tools/win_tools-2_7_6_bin/python/bin/python.exe ../../build/toolchain/win/tool_wrapper.py link-wrapper environment.x64 False ../../third_party/llvm-build/Release+Asserts/bin/lld-link.exe /nologo /OUT:./webkit_unit_tests.exe /PDB:./webkit_unit_tests.exe.pdb @./webkit_unit_tests.exe.rsp
..\..\third_party\llvm-build\Release+Asserts\bin\lld-link.exe: error: lto.tmp: undefined symbol: ?Replay@ClipDisplayItem@blink@@UEBAXAEAVGraphicsContext@2@@Z
..\..\third_party\llvm-build\Release+Asserts\bin\lld-link.exe: error: lto.tmp: undefined symbol: ?AppendToWebDisplayItemList@ClipDisplayItem@blink@@UEBAXAEBVLayoutSize@2@PEAVWebDisplayItemList@2@@Z
..\..\third_party\llvm-build\Release+Asserts\bin\lld-link.exe: error: lto.tmp: undefined symbol: ?Replay@EndClipDisplayItem@blink@@UEBAXAEAVGraphicsContext@2@@Z
..\..\third_party\llvm-build\Release+Asserts\bin\lld-link.exe: error: lto.tmp: undefined symbol: ?AppendToWebDisplayItemList@EndClipDisplayItem@blink@@UEBAXAEBVLayoutSize@2@PEAVWebDisplayItemList@2@@Z
[52428/52435] LINK unit_tests.exe unit_tests.exe.pdb
FAILED: unit_tests.exe unit_tests.exe.pdb 
cmd /c C:/b/depot_tools/win_tools-2_7_6_bin/python/bin/python.exe ../../build/toolchain/win/tool_wrapper.py delete-file ./unit_tests.exe.pdb && C:/b/depot_tools/win_tools-2_7_6_bin/python/bin/python.exe ../../build/toolchain/win/tool_wrapper.py link-wrapper environment.x64 False ../../third_party/llvm-build/Release+Asserts/bin/lld-link.exe /nologo /OUT:./unit_tests.exe /PDB:./unit_tests.exe.pdb @./unit_tests.exe.rsp
..\..\third_party\llvm-build\Release+Asserts\bin\lld-link.exe: error: lto.tmp: undefined symbol: ??_GBidirectionalStreamQuicImpl@net@@UEAAPEAXI@Z
..\..\third_party\llvm-build\Release+Asserts\bin\lld-link.exe: error: lto.tmp: undefined symbol: ?Start@BidirectionalStreamQuicImpl@net@@UEAAXPEBUBidirectionalStreamRequestInfo@2@AEBVNetLogWithSource@2@_NPEAVDelegate@BidirectionalStreamImpl@2@V?$unique_ptr@VTimer@base@@U?$default_delete@VTimer@base@@@std@@@std@@@Z
..\..\third_party\llvm-build\Release+Asserts\bin\lld-link.exe: error: lto.tmp: undefined symbol: ?SendRequestHeaders@BidirectionalStreamQuicImpl@net@@UEAAXXZ
..\..\third_party\llvm-build\Release+Asserts\bin\lld-link.exe: error: lto.tmp: undefined symbol: ?ReadData@BidirectionalStreamQuicImpl@net@@UEAAHPEAVIOBuffer@2@H@Z
..\..\third_party\llvm-build\Release+Asserts\bin\lld-link.exe: error: lto.tmp: undefined symbol: ?SendvData@BidirectionalStreamQuicImpl@net@@UEAAXAEBV?$vector@V?$scoped_refptr@VIOBuffer@net@@@@V?$allocator@V?$scoped_refptr@VIOBuffer@net@@@@@std@@@std@@AEBV?$vector@HV?$allocator@H@std@@@4@_N@Z
..\..\third_party\llvm-build\Release+Asserts\bin\lld-link.exe: error: lto.tmp: undefined symbol: ?GetProtocol@BidirectionalStreamQuicImpl@net@@UEBA?AW4NextProto@2@XZ
..\..\third_party\llvm-build\Release+Asserts\bin\lld-link.exe: error: lto.tmp: undefined symbol: ?GetTotalReceivedBytes@BidirectionalStreamQuicImpl@net@@UEBA_JXZ
..\..\third_party\llvm-build\Release+Asserts\bin\lld-link.exe: error: lto.tmp: undefined symbol: ?GetTotalSentBytes@BidirectionalStreamQuicImpl@net@@UEBA_JXZ
..\..\third_party\llvm-build\Release+Asserts\bin\lld-link.exe: error: lto.tmp: undefined symbol: ?GetLoadTimingInfo@BidirectionalStreamQuicImpl@net@@UEBA_NPEAULoadTimingInfo@2@@Z
..\..\third_party\llvm-build\Release+Asserts\bin\lld-link.exe: error: lto.tmp: undefined symbol: ?PopulateNetErrorDetails@BidirectionalStreamQuicImpl@net@@UEAAXPEAUNetErrorDetails@2@@Z
..\..\third_party\llvm-build\Release+Asserts\bin\lld-link.exe: error: lto.tmp: undefined symbol: ??0BidirectionalStreamQuicImpl@net@@QEAA@V?$unique_ptr@VHandle@QuicChromiumClientSession@net@@U?$default_delete@VHandle@QuicChromiumClientSession@net@@@std@@@std@@@Z

 
Components: Build
Labels: TE-NeedsTriageHelp
As the issue is related to compiling, which seems to be out of TE-scope. Hence, adding label TE-NeedsTriageHelp for further investigation
Labels: clang
Status: Assigned (was: Unconfirmed)
I managed to reproduce locally. args.gn:

clang_use_chrome_plugins = false
is_chrome_branded = true
is_clang = true
is_component_build = false
is_debug = false
is_official_build = true
symbol_level = 2
use_lld = true
use_thin_lto = true

# Local settings to use clang
clang_base_path = "C:\\path\\to\\clang"
llvm_force_head_revision = false


Then build with ninja -k 100 all. Building individual targets did not reproduce the issue for me. The targets that are seeing undefined symbols for me locally, and the symbols that are undefined, are different from the ones on the bot. I suspect this is a problem with the toolchain.

I'm also seeing "clang\bin\lld-link.exe: error: unsupported relocation type 0x0".
The failure is definitely non-deterministic. Using the same chromium revision and clang revision and running ninja all, I got:

1. Build failed after building about 56,000 targets, with undefined symbol errors.

2. Repeatedly running ninja all, the build made progress - more targets were built until it would eventually fail again. Errors included both undefined symbol and a few instances of "unsupported relocation type 0x0".

3. After running ninja all enough times, the build eventually succeeded.

Removing the directory, creating it again with the same args.gn, and still using the same revision of chromium and clang,

4. The build failed with undefined symbol errors after about 60,000 targets had been built. The first failing target was content_perftests, which had never failed to build in any of the prior ninja all runs.
Summary: TotWinThinLTO64, ToTWinCFI and ToTWinCFI64 failing compile with undefined symbols (was: TotWinThinLTO64 compile failing with undefined symbols)
This also affects the Windows CFI bots.
Another interesting tidbit:

..\..\..\..\clang\bin\lld-link.exe: error: libconcrt.lib(utils.obj): undefined symbol: swprintf_s

That symbol is defined in a library that is provided by the C runtime. In fact, I think it is defined in the very place where we claim it is undefined.
Blocking: 598772
Update: After several builds that did not run into the undefined symbols error, https://ci.chromium.org/buildbot/chromium.clang/ToTWinThinLTO64/1023 encountered it again. Unsupported relocation type 0x0 also makes an appearance in that build.
Analysis is still ongoing, but here are some interesting findings.

Unsupported relocation type 0x0 means a relocation type of 0, which should never happen.
Undefined symbols in COFF files have section number IMAGE_SYM_UNDEFINED, which is 0, and value 0.

Some of the symbols I've seen this error on are defined by an object that participates in the link, but not actually used by any other object in the link.

This suggests that the unsupported relocation type 0x0 and undefined symbols reported in the object that actually defines them could be caused by reading bytes as 0 when the bytes in the input file actually are not 0.
Some builds (e.g. https://ci.chromium.org/buildbot/chromium.clang/ToTWinThinLTO64/1168) fail with lld-link.exe: error: broken object file: lto.tmp

The error message doesn't point out the specific problem, but one of the possible causes is a symbol referring to section 0, which is a reserved value.
All "broken object file" fatals come from createDefined() in lld/COFF/InputFiles.cpp. Before we call createDefined(), we check if a symbol is undefined (in ObjFile::initializeSymbols()). This looks at the symbols section number and value. The broken object file fatal in createDefined() is triggered by the section number alone. So we can get there by having a
symbol whose value is nonzero but whose section number is zero. The section number comes after the value in the COFF file format, so this is consistent with reading the file contents as zeros after some point.
The zeros make this sound somewhat similar to the zeros in executables in  crbug.com/644525 , but there are a couple of differences.

First of all, this isn't about executables, but rather about object files and archive files.

Secondly, the symbols we report as undefined all have names. I'm not sure how having consecutive zeros causes us to have symbol names, given that the name must either be in-line or an index into the string table, and in neither case would we end up with symbol names (certainly not multiple symbol names) if the bytes in the symbol record were 0.

Thirdly, we managed to get executables to work more reliably by calling FlushFileBuffers on the files that we write. In the case here, many of the failures we are seeing are not in files that were written by us, or even recently. libconcrt.lib(utils.obj), for example, is an object in an archive file that is provided by the system.
Summary of what I know so far:

1. The probable cause of all these errors is reading bytes as 0 when
they are not 0. This is clear in the case of "Unsupported relocation
type 0x0". It also explains why symbols that are only defined but not
used by any object in the link show up in "undefined symbol" messages
(undefined symbols in COFF have value 0 and section number 0). A value
of 0 for the section number also causes "broken object file" error
messages. It also explains how a symbol can be reported as undefined
by the object file that actually defines it.

2. The errors do not reproduce reliably. Re-running a build that
failed may result in a successful build, or a build that failed in a
different place. I haven't been able to reproduce these failures
locally for weeks. The bots that see the failures can see over twenty
successful builds before hitting the errors again.

3. The errors usually occur after many targets (~50,000) have been
built, but
https://logs.chromium.org/v/?s=chromium%2Fbb%2Fchromium.clang%2FToTWinThinLTO64%2F1151%2F%2B%2Frecipes%2Fsteps%2Fcompile%2F0%2Fstdout
shows undefined symbol errors after only a few hundred targets were
built. Some of the smaller targets that the errors have occurred on
are angle_shader_translator, gcp_portmon64.dll,
image_operations_bench, virtual_driver_setup, and zucchini.

4. Symbols that are reported as undefined often include symbols from
uuid.lib. Frequently, these symbols are reported as undefined on
targets that do not actually use them.

5. Some errors occur in files that we write as part of the build
(lto.tmp, for example), but many occur in files that aren't written as
part of the build, such as in libconcrt.lib or uuid.lib. I originally
suspected a problem with the lack of atomicity of filesystem
operations, but I don't see how that would cause us to falsely read
zeros from files that are provided by the system, existed before the
build started, and presumably aren't changed as part of the build.

6. As far as I understand the COFF file format, symbol records have
the name first, followed by the value, followed by the section number,
and then some other fields. The name is either a string of up to 8
bytes, or 4 zero bytes followed by a 4-byte offset into the string
table. An offset into the string table should never be 0, so the name
field in a COFF symbol should never be all zeros, either. This means
that, if we read zeros after a point in a file, we could end up with a
symbol with a name and a value, but erroneously read the section
number as 0, which causes a "broken object file" diagnostic, or both
value and section number 0, which causes an "undefined symbol"
diagnostic. However, for multiple symbols to be reported as undefined,
another symbol record should have a non-zero name field, which
wouldn't happen if we only read zeros. Also, I think the string table
usually comes after the symbol records. One possible explanation for
why we still report multiple symbols in the same object file could be
that we read nonzero bytes later in the file and don't report on the
symbols that we detected as not having a name.


Cc: ruiu@google.com r...@chromium.org p...@chromium.org brucedaw...@chromium.org
cc'ing some people to get more eyes on this. If you guys have ideas about what could be causing this or how to investigate, I'd love to hear. I'll be on vacation next week, but I'm hoping we can fix this soon after, so that we can start shipping Chrome with CFI in the not too distant future.
A summary of what I have found so far is in https://bugs.chromium.org/p/chromium/issues/detail?id=786127#c13
Cc: zturner@chromium.org
The bug in 644525 shouldn't be applicable - Microsoft promised that it should only apply when loading a PE file.

When zturner@ and I were investigating that bug we discussed the possibility of adding extra instrumentation and checks to lld-link.exe to detect the problem and print more information. It turns out it wouldn't have helped with that particular bug, but it might help with this one. The linker is already detecting these failures so we could get it to print additional information.

I tried to find where swprintf_s is supposed to come from by compiling this code and linking it with /VERBOSE but wouldn't tell me anything for some reason.

#include <stdio.h>

int main()
{
  wchar_t buffer[100];
  swprintf_s(buffer, L"A number is %d.\n", 10);
  wprintf(L"%s", buffer);
  return 0;
}

@brucedawson, when I compile that program, by "wouldn't tell me anything" do you mean that /verbose doesn't produce the output about symbol resolution at all, or that swprintf_s is not in there?

When I compile the program you gave, the object file I get does not actually reference swprintf_s. Rather, there is a definition of swprintf_s in the object file, which references _vswprintf_s_l, also defined in the object file, which references __stdio_common_vswprintf_s. That symbol is in libucrt.lib, although lld-link's /verbose output doesn't seem to show it.

In the case mentioned earlier in this thread, swprintf_s is reported as undefined, but the location where it is reported as being referenced is actually the definition: utils.obj in libconcrt.lib.
When I said "wouldn't tell me anything" I meant that it didn't say where swprintf_s came from. As you said this is because swprintf_s is defined in my .obj flie due to this macro-created inline definition of it:

        __inline int __cdecl swprintf_s(
               wchar_t*       const _Buffer,
                                                       size_t         const _BufferCount,
                                wchar_t const* const _Format,
            ...)
#line 1538 "C:\\Program Files (x86)\\Windows Kits\\10\\include\\10.0.16299.0\\ucrt\\corecrt_wstdio.h"
        {
            int _Result;
            va_list _ArgList;
            ((void)(__vcrt_va_start_verify_argument_type<decltype(_Format)>(), ((void)(__va_start(&_ArgList, _Format)))));
            _Result = _vswprintf_s_l(_Buffer, _BufferCount, _Format, 0, _ArgList);
            ((void)(_ArgList = (va_list)0));
            return _Result;
        }


I don't understand this sentence:

> the location where it is reported as being referenced is actually the definition: utils.obj in libconcrt.lib.

Every .obj file that references swprintf_s will necessarily define it, as an inline function, and there will be multiple definitions.

I can't tell if this is a compiler bug, linker bug, or hardware error. It seems important that once the problem starts happening it keeps happening for a while - see builds 1178 to 1180 here:
https://ci.chromium.org/buildbot/chromium.clang/ToTWinThinLTO64/?limit=50


Comment 19 by p...@chromium.org, May 9 2018

inglorion: Did you get a chance to try my suggestion of using a different cache directory for each link?
pcc, not yet. I still think that's worth doing even if it doesn't explain everything we're seeing, so I'll take a stab at it tomorrow. In the meantime, I noticed that a number of places where we see strange values being read come after we write to files through raw_fd_ostream, and I noticed that the FlushFileBuffers call that we added does not apply to that code path, so I put up a quick diff to make raw_fd_ostream call FlushFileBuffers on Windows, too (https://reviews.llvm.org/D46621).

Comment 21 by p...@chromium.org, Jun 7 2018

I wrote a small lld patch that dumped the object file to stdout if we hit an unsupported reloc error:

diff --git a/lld/COFF/Chunks.cpp b/lld/COFF/Chunks.cpp
index 6d8a0c7c1f4..2931bc23d21 100644
--- a/lld/COFF/Chunks.cpp
+++ b/lld/COFF/Chunks.cpp
@@ -97,9 +97,11 @@ void SectionChunk::applyRelX64(uint8_t *Off, uint16_t Type, OutputSection *OS,
   case IMAGE_REL_AMD64_REL32_5:  add32(Off, S - P - 9); break;
   case IMAGE_REL_AMD64_SECTION:  applySecIdx(Off, OS); break;
   case IMAGE_REL_AMD64_SECREL:   applySecRel(this, Off, OS, S); break;
-  default:
+  default: {
+    llvm::outs() << toHex(File->MB.getBuffer()) << '\n';
     fatal("unsupported relocation type 0x" + Twine::utohexstr(Type));
   }
+  }
 }

 void SectionChunk::applyRelX86(uint8_t *Off, uint16_t Type, OutputSection *OS,

and used it to build chromium in a loop on my machine overnight with this args.gn:

is_clang = true
is_component_build = false
is_debug = false
is_official_build = true
symbol_level = 2
use_lld = true
use_thin_lto = true

The error ended up happening twice and I was able to capture both object files. I've attached them to this comment as file1.o and file2.o. All other builds (a total of 7 ran) ended up failing due to undefined symbols.

Looking at file1.o I see a suspiciously page-sized block of zeros from 0xcc000 to 0xcd000. There was one file of the same size in thinlto-cache, so I compared it to file1.o. The only difference was that there appeared to be correct content in the zeroed region. I've attached it as cache1.o.

In file2.o there are two blocks of zeros: one from 0x2000 to 0x8000 and another from 0x3c000 to 0x6d000. I couldn't find a corresponding file in thinlto-cache this time.

This does seem to confirm the hypothesis that parts of files are being read as zeros. The fact that they're all on page boundaries would tend to make me suspect the operating system.
cache1.o
4.7 MB Download
file1.o
4.7 MB Download
file2.o
2.3 MB Download
Thanks, Peter. That is super helpful. Also interesting that we can have zeros, valid data, and zeros again. I suspected as much, but it's good to have this confirmed with an actual capture. Speaking of suspicions, I also suspect that this only affects files we mmap and not files we read. Is that also something you were able to determine?
But aren’t we using the same MemoryBuffer APIs from llvm? Don’t those have
the workaround of calling FlushFileBuffers on close?

Comment 24 by p...@chromium.org, Jun 7 2018

We aren't following the code path that flushes object files because we aren't using memory-mapped I/O to write them (only to read them).

I figured that the problem may have something to do with the write/mmap pattern that we use. So I hacked together a "subcommand" for llvm-lto2 that basically follows the file I/O pattern that the linker uses during LTO and verifies that the mapped file content is as expected. I've attached the patch adding the subcommand.

If I run this subcommand on Windows I see a lot of verifier failures. So that's certainly interesting.
lto.diff
4.8 KB Download
Try hacking together something that adds FILE_FLAG_WRITE_THROUGH when
opening a file for wrie and see if that fixes it
@pcc, I also wanted to ask what version of Windows you're running. I've been wondering if perhaps we're hitting a bug on the bot that doesn't exist in the version of Windows I have on my local machine. My local machine has version  10.0.14393.

Comment 27 by p...@chromium.org, Jun 7 2018

zturner: I'll try that.

inglorion: Mine is version 10.0.17134.48.
I assume that we are closing the file after writing to it before we open the file for reading?

This bug certainly smells similar to the 644525 bug, but Microsoft assures me that the root-cause of 644525 is not applicable to a scenario like this. So, if it is an OS bug then it must be similar-but-different. If you have a program that reproes the bug then I could share that with Microsoft.

Adding a call to FlushFileBuffers before closing the file seems like a good attempted mitigation step, even if it isn't exactly the same bug.

Comment 29 by p...@chromium.org, Jun 7 2018

> I assume that we are closing the file after writing to it before we open the file for reading?

We aren't actually. We open the file for read/write, write the object file and then mmap it using the same handle. Maybe this isn't guaranteed to work on Windows?
I think that is not guaranteed to work on Windows, so that's the problem. I'll double check.
I think that should be guaranteed to work as long as you are *NOT* using the CRT IO functions.  If you are using the CRT functions though, then the CRT does its own buffering, so you could write via one handle, the CRT doesnt' flush, then open via another handle and it doesn't see the changes that the CRT has buffered.

If you're writing via a native OS handle though, this should definitely work.

Comment 32 by p...@chromium.org, Jun 7 2018

> I think that should be guaranteed to work as long as you are *NOT* using the CRT IO functions.

That could explain it then. The object file writer uses raw_ostream (which translates down to CRT I/O).

I found a _commit function which seems to be the equivalent of fflush for CRT file handles: https://msdn.microsoft.com/en-us/library/17618685.aspx

I'll see if that helps.
Note https://reviews.llvm.org/D46621 which adds a FlushFileBuffers call to raw_fd_ostream.

Comment 34 by p...@chromium.org, Jun 7 2018

Though I guess if it's a buffer internal to the CRT then FlushFileBuffers wouldn't help.
Ah, yeah.
It looks like _commit() is just a thin wrapper over FlushFileBuffers, and the only time the CRT does any internal buffering is when you have a FILE*.  That said, I still think my comment in #31 is accurate.  Specifically, Windows is supposed to guarantee correct operation.  Perhaps it's just not satisfying the guarantee.

So I think FlushFileBuffers is still worth a try

Comment 37 by p...@chromium.org, Jun 7 2018

I tried adding the call to _commit, and I was still seeing failures (diff attached). Since it's just a wrapper around FlushFileBuffers, I guess that wouldn't help either.

Next thing I'll try is closing and reopening the file handle.
lto.diff
5.4 KB Download
The other thing to consider is that we're also seeing failures in files that we don't modify as part of the build, for example utils.obj in libconcrt.lib (which is provided by the system). That may be a different problem, or there may be a common cause and solution.

Comment 39 by p...@chromium.org, Jun 7 2018

Closing and reopening seems to help, I see no failures with that.

But one downside is that it would mean that our handling of temporary files would no longer allow for files to be kept atomically. We currently go to great lengths to make temporary files (such as cache files, before they are added to the cache) delete-on-close from the point at which they are created until the point at which they are ready to be committed to the cache. (Specifically, we open cache files with FILE_FLAG_DELETE_ON_CLOSE and later use this function to clear the delete-on-close property: http://llvm-cs.pcc.me.uk/lib/Support/Windows/Path.inc#429 ). With close/reopen, there would be a small window in which the file would be left on disk without being delete-on-close.

I tried opening the file read-only (without closing it) and mmap'ing with the opened file handle, but that didn't solve the problem. Guess I'll try a few other things.
Did you try FILE_FLAG_WRITE_THROUGH yet?

Comment 41 by p...@chromium.org, Jun 7 2018

I tried it, but it slowed down my test program significantly and I didn't wait for it to finish.
It might help if we had a diagram that showed all the various actors, 

1) who is opening the file and the exact flags that they're using
2) then who is mapping the file and what flags they're using
3) who is closing the file

and the order of all these things happening.  I'm trying to picture all this in my head.

That said, I think all of this goes back to my very early premise that the file system really shouldn't be used for atomicity on Windows, you need some kind of external synchronization mechanism.

In any case, maybe bruce can ping his contacts on the kernel team with this info and see if we've found *another* kernel bug.
It would also be helpful to have a repro of the bug that used OS functions directly - CreateFile, WriteFile, CloseHandle, etc. I verified this by stepping in to a debug version of this program - the CRT source code is in C:\Program Files (x86)\Windows Kits\10\Source\10.0.17134.0\ucrt\lowio\write.cpp

#define _CRT_SECURE_NO_WARNINGS
#include <io.h>
#include <fcntl.h>

int main() {
  const char buffer[] = "Some test text.\n";
  int fh = _open("c:\\src\\temp\\outtest.bin", _O_CREAT | _O_BINARY);
  _write(fh, buffer, sizeof(buffer));
  _close(fh);
  return 0;
}

Although the CRT open/write/close functions map fairly directly to the OS functions (they don't add any buffering as far as I can tell) but they do perform some games to map various flags correctly. The OS team generally doesn't want to take responsibility for CRT code so they would want a native repro.

I'm assuming that by CRT I/O you mean open/write/close rather than fopen/fwrite/fclose, which do default to adding buffering.

Yea we don't use any fopen/fwrite/fclose calls, so ignore my previous comment about CRT buffering.

Comment 45 by p...@chromium.org, Jun 8 2018

The issue can be reproduced within a single process.

Here's what a program seems to need to do to reproduce the bug, at least on my machine:
1) do this at least 4000 times:
  - use CreateFile to open a file with FILE_FLAG_DELETE_ON_CLOSE obtaining handle h1
  - use WriteFile on h1 to write 4966353 bytes of content to the file
  - use CreateFileMapping/MapViewOfFile on h1 to create a mapping for the file, then close the handle from CreateFileMapping
  - use CreateFile with OPEN_EXISTING and without FILE_FLAG_DELETE_ON_CLOSE to obtain a second handle to the file, h2
  - close h1
  - use SetFileInformationByHandle on h2 to clear the delete-on-close bit on the file and rename the file
  - close h2
2) do this at least 2000 times:
  - use CreateFile with OPEN_EXISTING and FILE_FLAG_DELETE_ON_CLOSE to obtain a handle to one of the files created in step 1
  - close the handle, which would delete the file if not for the mapping created in step 1
3) read the mappings created during step 1 and observe that those corresponding to the files "deleted" in step 2 are incorrect.

I haven't determined whether every element of this is required, and the magic numbers may be particular to my machine. I'll be playing with my program to see if I can find a smaller reproducer as well as a good way to work around the bug.

I've attached a standalone reproducer that uses only the Windows API. If I run it, I see output like this:

0 bad at 1
2 bad at 1
4 bad at 1
6 bad at 1
8 bad at 1
10 bad at 1
12 bad at 1
14 bad at 1
16 bad at 1
18 bad at 1
20 bad at 1
22 bad at 1
24 bad at 1
...
main.cpp
2.8 KB View Download

Comment 46 by p...@chromium.org, Jun 8 2018

I was able to reduce my reproducer to the following:

1) do this at least 4000 times:
  - use CreateFile to open a file with FILE_FLAG_DELETE_ON_CLOSE obtaining handle h1
  - use WriteFile on h1 to write 4966353 bytes of content to the file
  - use CreateFileMapping/MapViewOfFile on h1 to create a mapping for the file, but don't close the handle from CreateFileMapping
  - close h1
2) read the mappings created during step 1 and observe that some of them (mostly the ones created earlier) are incorrect.

I've attached a program that does that.

If I make either of the two following changes to the program:
- don't close h1
- use DuplicateHandle to duplicate h1 before closing it
I can no longer reproduce.

Note also that I can reproduce if I keep the file mapping handle open, which means that only actual file handles seem to matter here. That's certainly odd because I would have at least expected the file mapping handle to keep a reference to the file handle.

So the issue seems to be:
- FILE_FLAG_DELETE_ON_CLOSE only cares about file handles, not file mappings or file mapping handles
- if the file handle refcount drops to 0, the file is deleted, but any mapped pages still point to the kernel memory originally used to map the files
- the kernel may later reclaim some of those pages, causing them to be read as zeros

It seems like we should be able to work around the bug by duplicating the file handle when we create a file mapping, and closing the duplicated handle when we destroy it. I'll implement that and build chromium in a loop overnight and see if I can still reproduce.
main2.cpp
1.6 KB View Download
I found this interesting tidbit from the documentation of UnmapViewOfFile.

https://msdn.microsoft.com/en-us/library/windows/desktop/aa366882.aspx

- Although an application may close the file handle used to create a file mapping object, the system holds the corresponding file open until the last view of the file is unmapped. Files for which the last view has not yet been unmapped are held open with no sharing restrictions.

So if the steps you describe are correct (specifically, if you have called MapViewOfFile before closing the handle to h1), then this does look like a bug.  The system should be holding the file open until the view is unampped.

Comment 48 by p...@chromium.org, Jun 8 2018

Yes, that's exactly what the program is doing. Upon reading that documentation I'm agreed that this looks like a bug.
Has anybody found a link to what the actual guarantees are? i.e.; how do we know this is a bug in Windows rather than incorrect usage?

I hope the workaround works.
Bruce: See comment 47.  Microsoft says:  "- Although an application may close the file handle used to create a file mapping object, ******the system holds the corresponding file open until the last view of the file is unmapped******."

So we have FILE_FLAG_DELETE_ON_CLOSE, then we create a file mapping M to H1, then we map a view V to M.  At this point, according to the documentation, the system should hold the file H1 open until the view is closed.  But we close H and the underlying file seems to be deleted anyway.

Comment 51 by p...@chromium.org, Jun 8 2018

The workaround from c#46 almost worked, but it caused builds to start failing with permission denied errors while opening cache files. After running a build under process monitor this is what I think was happening:

- if a file handle was opened with FILE_FLAG_DELETE_ON_CLOSE, that causes subsequent open requests to fail, per the documentation for CreateFile
- closing the FILE_FLAG_DELETE_ON_CLOSE file handle allows the open requests to succeed (provided of course that the program has arranged not to delete the file on close, for example by using the trick linked from c#39)
- but if the FILE_FLAG_DELETE_ON_CLOSE handle had been duplicated, the file isn't actually closed until the file is unmapped, so open requests from other processes fail as long as the original process has it mapped

That sent me back to the drawing board for another workaround. I started with this partially-reduced reproducer (attached as main3.cpp):

1) do this at least 4000 times:
  - use CreateFile to open a file with FILE_FLAG_DELETE_ON_CLOSE obtaining handle h1
  - use WriteFile on h1 to write 4966353 bytes of content to the file
  - use CreateFileMapping/MapViewOfFile on h1 to create a mapping for the file, then close the handle from CreateFileMapping
  - use CreateFile with OPEN_EXISTING and without FILE_FLAG_DELETE_ON_CLOSE to obtain a second handle to the file, h2
  - close h1
  - use SetFileInformationByHandle on h2 to clear the delete-on-close bit on the file
  - close h2
2) do this at least 2000 times:
  - use CreateFile with OPEN_EXISTING and FILE_FLAG_DELETE_ON_CLOSE to obtain a handle to one of the files created in step 1
  - close the handle, which would delete the file if not for the mapping created in step 1
3) read the mappings created during step 1

and changed it as follows (attached as main4.cpp):

1) do this at least 4000 times:
  - use CreateFile to open a file with FILE_FLAG_DELETE_ON_CLOSE obtaining handle h1
  - use WriteFile on h1 to write 4966353 bytes of content to the file
  - use CreateFile with OPEN_EXISTING and without FILE_FLAG_DELETE_ON_CLOSE to obtain a second handle to the file, h2
  - close h1
  - use SetFileInformationByHandle on h2 to clear the delete-on-close bit on the file
  - use CreateFileMapping/MapViewOfFile on h2 to create a mapping for the file, then close the handle from CreateFileMapping
  - close h2
2) do this at least 2000 times:
  - use CreateFile with OPEN_EXISTING and FILE_FLAG_DELETE_ON_CLOSE to obtain a handle to one of the files created in step 1
  - close the handle, which would delete the file if not for the mapping created in step 1
3) read the mappings created during step 1

The key difference is that I create the mapping on h2 rather than h1. With that I can no longer reproduce the bug using my test program. What this tells me is that the bug only repros on mappings created from file handles that were opened with FILE_FLAG_DELETE_ON_CLOSE, and a potential workaround is to use a non-FILE_FLAG_DELETE_ON_CLOSE file handle to map the file.

I implemented this workaround in LLVM and started a build. It's gotten further than my original workaround, so I'm hopeful. I'll do builds in a loop over the weekend and see if it still repros.
main3.cpp
2.4 KB View Download
main4.cpp
2.4 KB View Download
I've spent a fair amount of time reading documentation on MSDN trying to figure out if we're using this incorrectly. My conclusion is that we may want to check with Microsoft to make sure.

From the MapViewOfFile remarks (https://msdn.microsoft.com/en-us/library/windows/desktop/aa366761(v=vs.85).aspx):

"""With one important exception, file views derived from any file mapping object that is backed by the same file are coherent or identical at a specific time. Coherency is guaranteed for views within a process and for views that are mapped by different processes.

The exception is related to remote files."""

Since we're not using remote files here, this suggests that views of the same file are guaranteed to be coherent.

On the other hand, from the same page:

"""A mapped view of a file is not guaranteed to be coherent with a file that is being accessed by the ReadFile or WriteFile function."""

I don't know how to interpret that (particularly the WriteFile part), and the documentation does not clarify it. Does calling WriteFile, waiting for it to return, then calling CreatFileMapping and MapViewOfFile count as "being accessed by WriteFile" or not? Does closing the handle, ending the thread, and synchronizing on the thread ending create an expectation that reading from the mapping sees what was written by WriteFile? If not, what does?

The documentation for FlushFileBuffers (https://msdn.microsoft.com/en-us/library/windows/desktop/aa364439(v=vs.85).aspx) says:

"""Typically the WriteFile and WriteFileEx functions write data to an internal buffer that the operating system writes to a disk or communication pipe on a regular basis. The FlushFileBuffers function writes all the buffered information for a specified file to the device or pipe."""

This suggests that perhaps WriteFile is not expected to always immediately actually affect wherever the file mapping looks (cache or storage, I imagine). Hence my earlier attempts to use FlushFileBuffers. But @pcc tried that, and reported that it didn't make the problem go away.

As @zturner pointed out, the expectation for file mappings is that the file is held open as long as a mapping exists, even if the file handle is closed. So @pcc's finding that not closing the file handle or duplicating the file handle first makes a difference seems surprising.

On the other hand, the documentation for CreateFile (https://msdn.microsoft.com/en-us/library/windows/desktop/aa363858(v=vs.85).aspx) has this to say about FILE_FLAG_DELETE_ON_CLOSE:

"""The file is to be deleted immediately after all of its handles are closed, which includes the specified handle and any other open or duplicated handles."""

If I understand correctly, file handles are not the same as file objects, and so although the file object remains even if all handles are closed, the actual file is deleted once the last _handle_ is closed. That does seem to explain the behavior pcc is seeing. I found this blog post that tells a similar story: https://blogs.msdn.microsoft.com/oldnewthing/20160108-00/?p=92821

So I think we may be holding it wrong, but I'm not 100% confident. If we are, I hope we can get the documentation to be clearer about that. If what we're doing is supposed to work but doesn't, well, then I would like that fixed. Either way I think it would be good to reach out to Microsoft.
>> """A mapped view of a file is not guaranteed to be coherent with a file that is being accessed by the ReadFile or WriteFile function."""

> I don't know how to interpret that (particularly the WriteFile part), and the documentation does not clarify it. 

I saw this comment too while reading the document.  This part is not an issue for us.  We are not accessing anything with WriteFile / ReadFile.

Comment 54 by p...@chromium.org, Jun 8 2018

> I saw this comment too while reading the document.  This part is not an issue for us.  We are not accessing anything with WriteFile / ReadFile.

Right. I interpret it as "being accessed while the mapping is active", which isn't the case for us.

Comment 55 by p...@chromium.org, Jun 9 2018

I got another undefined symbol error from one of the links, so it looks like the workaround from c#51 is not enough. My assumption is that the bug can also affect  mappings created from non-FILE_FLAG_DELETE_ON_CLOSE handles in some cases, so I'm going to see whether combining the c#51 workaround with the c#46 workaround helps.

Comment 56 by p...@chromium.org, Jun 11 2018

The good news is that with both workarounds I'm no longer seeing undefined symbol errors. The bad news is that all of my weekend builds failed with some variation on:

[54104/58195] LINK headless_unittests.exe headless_unittests.exe.pdb
FAILED: headless_unittests.exe headless_unittests.exe.pdb 
cmd /c C:/src/depot_tools/win_tools-2_7_6_bin/python/bin/python.exe ../../build/toolchain/win/tool_wrapper.py delete-file ./headless_unittests.exe.pdb && C:/src/depot_tools/win_tools-2_7_6_bin/python/bin/python.exe ../../build/toolchain/win/tool_wrapper.py link-wrapper environment.x64 False C:/src/l/ra/bin/lld-link.exe /nologo /OUT:./headless_unittests.exe /PDB:./headless_unittests.exe.pdb @./headless_unittests.exe.rsp
[54105/58195] LINK stress_cache.exe stress_cache.exe.pdb
[54106/58195] LINK headless_shell.exe headless_shell.exe.pdb
FAILED: headless_shell.exe headless_shell.exe.pdb 
cmd /c C:/src/depot_tools/win_tools-2_7_6_bin/python/bin/python.exe ../../build/toolchain/win/tool_wrapper.py delete-file ./headless_shell.exe.pdb && C:/src/depot_tools/win_tools-2_7_6_bin/python/bin/python.exe ../../build/toolchain/win/tool_wrapper.py link-wrapper environment.x64 False C:/src/l/ra/bin/lld-link.exe /nologo /OUT:./headless_shell.exe /PDB:./headless_shell.exe.pdb @./headless_shell.exe.rsp
[54107/58195] LINK net_unittests.exe net_unittests.exe.pdb
FAILED: net_unittests.exe net_unittests.exe.pdb 
cmd /c C:/src/depot_tools/win_tools-2_7_6_bin/python/bin/python.exe ../../build/toolchain/win/tool_wrapper.py delete-file ./net_unittests.exe.pdb && C:/src/depot_tools/win_tools-2_7_6_bin/python/bin/python.exe ../../build/toolchain/win/tool_wrapper.py link-wrapper environment.x64 False C:/src/l/ra/bin/lld-link.exe /nologo /OUT:./net_unittests.exe /PDB:./net_unittests.exe.pdb @./net_unittests.exe.rsp
[54108/58195] LINK headless_example.exe headless_example.exe.pdb
ninja: build stopped: subcommand failed.

I'm not sure what this means. My hypothesis is that my machine simply ran out of memory while linking these files, so we need to use a smaller pool depth for ThinLTO link jobs on Windows. I'll rebuild with a smaller pool depth and see if that makes the problem go away on my machine.

In any event I'll send a patch for review since the undefined symbol problem seems to have been fixed at least.

Comment 58 by p...@chromium.org, Jun 12 2018

> I'll rebuild with a smaller pool depth and see if that makes the problem go away on my machine.

Even with pool depth 2 I still saw this happen, but much later.

I've patched ninja to print the process exit code: https://github.com/pcc/ninja/tree/exit-code
and rebuilding to see if that tells me anything interesting.
Do you think the compiler is crashing? You can check the event log to see. Or, if it seems to be crashing then collect crash dumps. You can configure WER to do that for you - instructions are here:

https://msdn.microsoft.com/en-us/library/windows/desktop/bb787181(v=vs.85).aspx

It's a good idea to have this configured anyway.

How long does the linking take? You can also try monitoring memory usage to see if it's getting to a critical level.

Comment 60 by p...@chromium.org, Jun 12 2018

I can't seem to see any relevant crashes in the event log. I looked for log entries under Windows Logs -> Application where the source is "Application Error".

With the rebuild I managed to reproduce one failure relatively quickly. My patched ninja told me that the exit code was 0xc0000409 which apparently means STATUS_STACK_BUFFER_OVERRUN (stack overflow). I'm not sure yet why the frequency of this error is correlated with higher concurrency, but I'm not ruling out some sort of memory issue that's being reported in an obtuse way. I'll rebuild again while monitoring memory usage.

Comment 61 by p...@chromium.org, Jun 12 2018

It doesn't look like an out of memory situation at least. I sampled the "Available Physical Memory" from systeminfo.exe every second during a build and the lowest value was 23GB.

I also set up the registry as instructed in c#59 and didn't collect any dumps that way.

Is there maybe a way to get Windows to be more aggressive about collecting dumps?
0xc0000409 is a tricky error code, both because its name is ambiguous and it has been overloaded.

0xC00000FD is STATUS_STACK_OVERFLOW and means that we have run out of stack, with the solution either being to use less stack or to make the stack larger. This is not what you are seeing.
0xC0000409 is STATUS_STACK_BUFFER_OVERRUN which used to always mean that a buffer overrun on a stack based buffer was detected - i.e.; a stack canary was tromped. It's meaning has since been overloaded so that it is also used to indicate a CFG (Control Flow Guard) error - an indirect call going somewhere that it ought-not-to

So, 0xC0000409 is used for suspected security related bugs. I'm not sure why these would be happening randomly but it smells like some missing error checking and perhaps a race condition in lld. One option would be to modify the ninja files so that the linker is always run under a debugger. Something like this will do the trick:

third_party\depot_tools\win_toolchain\vs_files\3bc0ec615cf20ee342f3bc29bc991b5ad66d8d2c\win_sdk\Debuggers\x64\cdb -g -G -o "c:\src\chromium3\src\third_party\llvm-build\Release+Asserts\bin\lld-link.exe"

This runs cdb from the current toolchain directory with "-g -G -o" and then invokes lld-link.exe. The "-g -G -o" options tell cdb not to halt at process start or stop and to debug child processes. This should reliably catch crashes, I believe.

Ideally the WER instructions I shared would capture these crashes without having to involve cdb, but who knows? Sometimes the post-mortem debugger fails to catch crashes and there isn't much to be done about it.

Comment 63 by p...@chromium.org, Jun 12 2018

Thanks Bruce. Since I presumably wouldn't be able to take control of the debugger while it's running under ninja I hacked my src tree to run linker jobs under cdb so that it would create a dump file if the linker crashes, like this:

diff --git a/build/toolchain/win/tool_wrapper.py b/build/toolchain/win/tool_wrapper.py
index b2cb093377ee..70522bc6acc7 100644
--- a/build/toolchain/win/tool_wrapper.py
+++ b/build/toolchain/win/tool_wrapper.py
@@ -157,6 +157,7 @@ class WinTool(object):
       m = _LINK_EXE_OUT_ARG.match(arg)
       if m:
         pe_name = m.group('out')
+    args = ['C:\\Program Files (x86)\\Windows Kits\\10\\Debuggers\\x64\\cdb.exe', '-g', '-G', '-c', ".dump /u /ma link.dmp"] + args
     for retry_count in range(_LINKER_RETRIES):
       retry = False
       link = subprocess.Popen(args, shell=sys.platform == 'win32', env=env,

and kicked off another build.

Comment 64 by p...@chromium.org, Jun 12 2018

With that I was able to capture a dump and it turned out to be a bug in the code that I added in https://reviews.llvm.org/D48051 that tries to handle the fallback case if we were unable to rename one of the cache files. Specifically we were trying to map the file descriptor -1, which led to what looks like a CFG failure in the CRT. I think we should be able to remove that code after changing LLVM to use SetFileInformationByHandle to manage the delete-on-close bit instead of FILE_FLAG_DELETE_ON_CLOSE, and at the same time that should let us remove another couple of hacks too.

Comment 65 by p...@chromium.org, Jun 13 2018

Fix landed in LLVM r334630.

Comment 66 by h...@chromium.org, Jun 14 2018

Blockedon: 852679
Status: Fixed (was: Assigned)

Sign in to add a comment