
Issue 739282

Starred by 13 users

Issue metadata

Status: Fixed
Owner:
Closed: Aug 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows , Mac
Pri: 1
Type: Bug

Blocked on:
issue 416072

Blocking:
issue 752009


Participants' hotlists:
Intensive-Care-Bugs



Windows and Mac compile breaks with link error every day at 12:30am PST

Project Member Reported by tsergeant@chromium.org, Jul 5 2017

Issue description

I noticed a weird pattern during my sheriff shift, which is confirmed by https://chromium-status.appspot.com/?limit=100: 

Almost every day, the tree closes because of a compile failure on Windows at 12:31am PST.

Here's the latest 5 such failures (times in AEST):

Wed, 05 Jul 17:31  https://build.chromium.org/p/chromium.webkit/builders/WebKit%20Win%20Builder%20%28dbg%29/builds/122201
Tue, 04 Jul 17:31  https://build.chromium.org/p/chromium.win/builders/Win%20x64%20Builder%20%28dbg%29/builds/56002
Mon, 03 Jul 17:33  https://build.chromium.org/p/chromium.webkit/builders/WebKit%20Mac%20Builder%20%28dbg%29/builds/205660
Sat, 01 Jul 17:57  https://build.chromium.org/p/chromium.win/builders/Win%20x64%20Builder%20%28dbg%29/builds/55940
Fri, 30 Jun 17:32  https://build.chromium.org/p/chromium.webkit/builders/WebKit%20Win%20Builder%20%28dbg%29/builds/122006

Going back two weeks, I only see two days where the tree *didn't* close at 12:30am PST. It seems to have only been introduced in the last few weeks, though: The earliest similar failure I can see is Tue, 13 Jun 17:31 (https://build.chromium.org/p/chromium.webkit/builders/WebKit%20Win%20x64%20Builder%20%28dbg%29/builds/109797)

The compile always fails with some variety of link error due to an unopenable file:

FAILED: blink_core.dll blink_core.dll.lib blink_core.dll.pdb 
C:/b/depot_tools/python276_bin/python.exe ../../build/toolchain/win/tool_wrapper.py link-wrapper environment.x64 False link.exe /nologo /IMPLIB:./blink_core.dll.lib /DLL /OUT:./blink_core.dll /PDB:./blink_core.dll.pdb @./blink_core.dll.rsp
LINK : fatal error LNK1181: cannot open input file 'gles2_c_lib.dll.lib'

Sheriffs have been reopening the tree immediately afterwards, since this is "just a flake", but it's clearly more than that: Something is going on in the Windows compile or infra which affects the first build after midnight PST. Maybe some sort of cache gets cleared? I have no idea.

Tagging as Sheriff so that hopefully tomorrow's sheriff can be aware of it, and tagging as Infra so that someone with more knowledge about this can look into it.
 
Cc: wfh@chromium.org brucedaw...@chromium.org
Components: -Infra Infra>Labs
Labels: OS-Windows
Owner: friedman@chromium.org
Status: Assigned (was: Untriaged)
Interesting. If it's happening once a day on different builders, it seems unlikely that it's a src-side thing.

friedman@ - any chance there's some job running on the builders that is locking files and keeping them open?
Wow. That's crazy. I *love* this bug.

And by "love" I mean "hate". Anti-virus scanning seems like a reasonable theory.

Windows Defender is disabled on the hosts.

I see that C:\infra-tools\cleanup_disk.exe is launched at 12:30 a.m. every day.


Cc: estaab@chromium.org
Components: -Infra>Labs Infra>Platform
Owner: sergeybe...@chromium.org
That sounds like our problem, then. Nice find, pschmidt@!

@sergeyberezin - are you still the current de facto owner for cleanup_disk? Maybe we have a problem on Windows where it is keeping files open?
Labels: -Infra-Troopers -Sheriff-Chromium

Comment 6 by tkent@chromium.org, Jul 11 2017

Labels: OS-Mac
It seems this issue affects Mac too.

Failing build: https://build.chromium.org/p/chromium.webkit/builders/WebKit%20Mac%20Builder%20%28dbg%29/builds/206577
The next build passes.

[4069/6846] LINK ./wtf_unittests
FAILED: wtf_unittests 
TOOL_VERSION=1490108443 ../../build/toolchain/mac/linker_driver.py ../../third_party/llvm-build/Release+Asserts/bin/clang++  -stdlib=libc++ -arch x86_64 -Werror -isysroot ../../../../../../../Applications/Xcode8.0.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.sdk -mmacosx-version-min=10.9.0 -Wl,-ObjC -Wl,-rpath,@loader_path/. -Wl,-rpath,@loader_path/../../.. -Wl,-pie -o "./wtf_unittests" -Wl,-filelist,"./wtf_unittests.rsp" ./libwtf.dylib ./libbase.dylib ./libicui18n.dylib ./libicuuc.dylib ./libbase_i18n.dylib -framework AppKit
clang: error: no such file or directory: './libicui18n.dylib'
clang: error: no such file or directory: './libicuuc.dylib'
Traceback (most recent call last):
  File "../../build/toolchain/mac/linker_driver.py", line 229, in <module>
    Main(sys.argv)
  File "../../build/toolchain/mac/linker_driver.py", line 79, in Main
    subprocess.check_call(compiler_driver_args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['../../third_party/llvm-build/Release+Asserts/bin/clang++', '-stdlib=libc++', '-arch', 'x86_64', '-Werror', '-isysroot', '../../../../../../../Applications/Xcode8.0.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.sdk', '-mmacosx-version-min=10.9.0', '-Wl,-ObjC', '-Wl,-rpath,@loader_path/.', '-Wl,-rpath,@loader_path/../../..', '-Wl,-pie', '-o', './wtf_unittests', '-Wl,-filelist,./wtf_unittests.rsp', './libwtf.dylib', './libbase.dylib', './libicui18n.dylib', './libicuuc.dylib', './libbase_i18n.dylib', '-framework', 'AppKit']' returned non-zero exit status 1

Comment 7 by engedy@chromium.org, Jul 11 2017

Same on Mac Builder (dbg), build 130771.

See: https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Fchromium.mac%2FMac_Builder__dbg_%2F130771%2F%2B%2Frecipes%2Fsteps%2Fcompile%2F0%2Fstdout

[7458/7936] SOLINK libblink_platform.dylib libblink_platform.dylib.TOC
FAILED: libblink_platform.dylib libblink_platform.dylib.TOC
clang: error: no such file or directory: './libmojo_public_system.dylib'
clang: error: no such file or directory: './libicui18n.dylib'
clang: error: no such file or directory: './libicuuc.dylib'
clang: error: no such file or directory: './librange.dylib'
clang: error: no such file or directory: './libprotobuf_lite.dylib'
clang: error: no such file or directory: './libboringssl.dylib'
clang: error: no such file or directory: './libfreetype.dylib'
Traceback (most recent call last):
  File "../../build/toolchain/mac/linker_driver.py", line 229, in <module>
    Main(sys.argv)
  File "../../build/toolchain/mac/linker_driver.py", line 79, in Main
    subprocess.check_call(compiler_driver_args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
Labels: Sheriff-Chromium
Putting the sheriff label back on this so arriving sheriffs can see if on sheriff-o-matic, since it seems pertinent.

Comment 9 by engedy@chromium.org, Jul 11 2017

Labels: -Pri-2 Pri-1
Furthermore, I think the scale of the breakages (and the resulting sheriff-confusion) is now large enough to warrant increasing the priority.
Summary: Windows and Mac compile breaks with link error every day at 12:30am PST (was: Windows compile breaks with link error every day at 12:30am PST)
Having this bug in sheriff-o-matic is very useful for reducing confusion, thanks tsergeant!

A couple more data points:
https://build.chromium.org/p/chromium.win/builders/WinClang64%20%28dbg%29/builds/14593
https://build.chromium.org/p/chromium.webkit/builders/WebKit%20Mac%20Builder%20%28dbg%29/builds/206910

Comment 12 by w...@chromium.org, Jul 17 2017

Labels: M-62
Hey Sergey, is this something you think you can take a look at? It seems our cleanup_disk pattern is a bit aggressive and is deleting files that are still in use.

Comment 14 by ojan@chromium.org, Jul 24 2017

Ping
Components: Infra>Client>Chrome
Cc: sergeybe...@chromium.org
Owner: jbudorick@chromium.org
Unfortunately I don't have time this week. Reassigning to jbudorick@ to find a better owner in the meantime.

As an idea, if someone can find out which globs cause the trouble and increase the file age in the configs, this might help reduce the frequency of collisions:
https://chrome-internal.googlesource.com/infra/infra_internal/+/master/go/src/infra_internal/tools/cleanup_disk/cmd/cleanup_disk/main.go#32

Once updated, the tool needs to be deployed via puppet: https://chrome-internal.googlesource.com/infra/puppet/+/master/puppetm/etc/puppet/hieradata/cipd.yaml#189
The globs that fail here appear to be the output directory ones. My suspicion is that it's deleting infrequently modified build products because the age check is based on modification time rather than access time. The latter doesn't seem to be as easily available in go, though.
Updating of last access time is disabled by default on Windows. Enabling it has performance implications because now reads become writes.

https://www.groovypost.com/howto/microsoft/enable-last-access-time-stamp-to-files-folder-windows-7/
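As an illustration of the mtime-vs-atime discussion above, here is a minimal Python sketch of an age check keyed on modification time (the real cleanup_disk is a Go tool; the function and parameter names here are made up for illustration):

```python
import os
import time

def cleanup_old_files(root, max_age_seconds, now=None):
    """Delete files whose *modification* time is older than max_age_seconds.

    This mirrors the suspected failure mode: a build output that was written
    days ago but is still read by every build looks "old" to an mtime-based
    check, because reads do not update st_mtime (and on Windows, last-access
    time updates are disabled by default anyway).
    """
    now = time.time() if now is None else now
    deleted = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if now - os.stat(path).st_mtime > max_age_seconds:
                os.remove(path)
                deleted.append(path)
    return deleted
```

A file that is read on every build but rarely rewritten fails this check just like a genuinely stale file, which would match long-lived outputs such as `gles2_c_lib.dll.lib` vanishing between builds.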

Blockedon: 416072
Makes sense.

I think the right way to solve this is via the mutex charliea is working on over in issue 416072, which should prevent puppet (and thus cleanup_disk) from running during a build. In the meantime, I can bump the MaxAge for files in the output directories to reduce the frequency at which this happens.
Note that cleanup_disk runs independently of Puppet. Puppet just installs the cron task. Though we probably can somehow wrap cleanup_disk cron itself with the mutex.

Also, IIUC, the mutex thing is only for Swarming bots, which we can pause reasonably well. Pausing buildbot builds will require more work.
#20: re the cron: is that true of puppet on mac & win as well?

for the purposes of this bug, we could get away with wrapping compilation in the mutex rather than the entire build.
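The mutex idea above could be sketched roughly like this in Python — a crude cross-process lock built on an exclusive lockfile (the lock path and class are hypothetical; the actual mutex work in issue 416072 is not shown here, and stale-lock recovery is omitted):

```python
import errno
import os
import time

class BuildLock(object):
    """Crude cross-process mutex via an exclusive lockfile.

    If both the build (or just the compile step) and cleanup_disk took a
    lock like this around the output directory, the cleaner could never
    delete files mid-link.
    """

    def __init__(self, path):
        self.path = path
        self.fd = None

    def acquire(self, timeout=60, poll=0.1):
        deadline = time.time() + timeout
        while True:
            try:
                # O_EXCL makes creation atomic: exactly one process wins.
                self.fd = os.open(self.path,
                                  os.O_CREAT | os.O_EXCL | os.O_WRONLY)
                return
            except OSError as e:
                if e.errno != errno.EEXIST or time.time() > deadline:
                    raise
                time.sleep(poll)

    def release(self):
        os.close(self.fd)
        os.remove(self.path)

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, *exc):
        self.release()
```

With this shape, cleanup_disk would do `with BuildLock(path): cleanup()` and the recipe would hold the same lock around compilation, so the two can never overlap.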

Comment 22 by bugdroid1@chromium.org, Jul 27 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/infra_internal/+/48a234beab93efa307f49624c43739995043ff8a

commit 48a234beab93efa307f49624c43739995043ff8a
Author: John Budorick <jbudorick@google.com>
Date: Thu Jul 27 00:59:24 2017

Canary deployment of #22 is blocked on fixing https://uberchromegw.corp.google.com/i/internal.infra/builders/infra-internal-continuous-win-32, which I'll revisit tomorrow.

re #23: manually ran gclient sync on vm97-m7, which failed on permissions issues in /b/internal.DEPS but pulled down enough for the bot to succeed on its next run.

Comment 25 by bugdroid1@chromium.org, Jul 27 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/puppet/+/d007a896ad47ba726e6f0e8bea93fd346b526589

commit d007a896ad47ba726e6f0e8bea93fd346b526589
Author: John Budorick <jbudorick@google.com>
Date: Thu Jul 27 22:21:45 2017

Labels: -Sheriff-Chromium
Removed Sheriff-Chromium label.

Comment 27 by olka@chromium.org, Aug 11 2017

Labels: Sheriff-Chromium
Putting Sheriff-Chromium label back. Knowing about this bug would save 20 minutes of tree downtime for me today - and some nerves :).
Yeah, it's been a zany week or two. Haven't had a chance to ensure that the canary changes are ok.
Cc: infe...@chromium.org mstarzinger@chromium.org thakis@chromium.org dpranke@chromium.org pschmidt@chromium.org mmoroz@chromium.org friedman@chromium.org scottmg@chromium.org js...@chromium.org machenb...@chromium.org
Issue 741603 has been merged into this issue.

Comment 31 by mmoss@chromium.org, Aug 16 2017

Blocking: 752009
How long does the cleanup script take to run?

Comment 33 by ClusterFuzz, Aug 16 2017

Labels: ClusterFuzz-Verified
Status: Verified (was: Assigned)
ClusterFuzz testcase 6098683887353856 is verified as fixed, so closing issue as verified.

If this is incorrect, please add ClusterFuzz-Wrong label and re-open the issue.
Labels: -ClusterFuzz-Verified ClusterFuzz-Wrong
Status: Started (was: Verified)

Comment 35 by bugdroid1@chromium.org, Aug 16 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/puppet/+/1d0c1bd88aad94bf9f5d1d9f63c29c87880ab381

commit 1d0c1bd88aad94bf9f5d1d9f63c29c87880ab381
Author: John Budorick <jbudorick@google.com>
Date: Wed Aug 16 22:28:48 2017

Status: Fixed (was: Started)
It looks like changing the "output directory contents MaxAge" in #35 helped: all green in https://build.chromium.org/p/chromium.win/builders/Win%20x64%20Builder%20%28dbg%29?numbuilds=200
and I also checked https://build.chromium.org/p/chromium.webkit/builders/WebKit%20Win%20Builder%20%28dbg%29/ which has the last failure at https://build.chromium.org/p/chromium.webkit/builders/WebKit%20Win%20Builder%20%28dbg%29/builds/123964 (Aug 15 00:12).

Assuming this is fixed.
Status: Assigned (was: Fixed)
We only have data for one day - is that enough to be sure?

I'd like to keep this open until we're really sure we have a good fix. The current solution looks more like a workaround, and this is broken by design: an asynchronous service should never mess with the output folders. Why aren't we teaching the recipes to properly, synchronously clean things up?

Remark: CL description says 8 days, code says 2 days:
https://chrome-internal-review.googlesource.com/c/419008/3/go/src/infra_internal/tools/cleanup_disk/cmd/cleanup_disk/main.go

I don't get how the fix helps, maybe somebody can explain:
How does an extension from 10 hours to 48 help? If the copied files get a fresh time stamp, 10 hours should already be good enough. And if the old stats are preserved on copy, it's possible that also a larger number is too small.

Comment 38 by mmoss@chromium.org, Aug 18 2017

FYI, last night's official builds had yet another missing-files failure (Issue 756760), which is assumed to have this same underlying cause (Issue 752009), so it seems #35 either didn't fix things, or that fix didn't propagate to the official builders for some reason.
I don't think just adjusting the age parameter is sufficient to call this fixed.

We need to figure out a way to not touch files in a user directory while that directory is being used. I think the best path to this is likely to be to either use something like the mutex or just move the cleanup to be a step run in the build.
#37, 39: this should definitely still be open.

#37: the CL I landed was intended to reduce the frequency; I didn't expect it to permanently fix the problem. (I did drop the time from 8 days to 2 days while working on the CL and forgot to update the description...)

I'll remove output directory cleaning from cleanup_disk & try to work something similar into the recipe.
How about teaching the copy wrapper to update the stats of the destination files to get an updated timestamp? That way they wouldn't get purged in the first place. Not sure if we can update the stats in a race-condition-free way.
#41: I think that could result in quite a bit of unnecessary rebuilding.
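The timestamp-refresh idea above might look like this (a hypothetical sketch, not the actual copy wrapper):

```python
import os
import shutil

def copy_and_touch(src, dst):
    """Copy a file, then reset the destination's timestamps to 'now'.

    shutil.copy2 preserves the source's timestamps, which can leave dst
    looking "old" to an mtime-based cleaner; os.utime(dst, None) then
    resets both atime and mtime to the current time. The downside raised
    in the reply applies: ninja keys rebuilds off mtime, so the fresh
    timestamp may trigger unnecessary rebuilding of dependents.
    """
    shutil.copy2(src, dst)
    os.utime(dst, None)
```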

Comment 43 by bugdroid1@chromium.org, Aug 18 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/infra_internal/+/f0e1674d9e1c55cd6069fea56838337e4265e589

commit f0e1674d9e1c55cd6069fea56838337e4265e589
Author: John Budorick <jbudorick@google.com>
Date: Fri Aug 18 18:20:03 2017


Comment 44 by bugdroid1@chromium.org, Aug 18 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/puppet/+/e1a0fcde726d73dec2122bc1f04aa54ead8a0cc5

commit e1a0fcde726d73dec2122bc1f04aa54ead8a0cc5
Author: John Budorick <jbudorick@google.com>
Date: Fri Aug 18 20:49:59 2017


Comment 45 by bugdroid1@chromium.org, Aug 19 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infra/puppet/+/060d44b43970f8ff6f0353219cbdcc75c0d7cafc

commit 060d44b43970f8ff6f0353219cbdcc75c0d7cafc
Author: John Budorick <jbudorick@google.com>
Date: Sat Aug 19 17:53:09 2017

Status: Fixed (was: Assigned)
Immediate issue should now be fixed. After talking with dpranke, I've filed https://bugs.chromium.org/p/chromium/issues/detail?id=757251 to revisit how cleanup_disk handles output directories.
Labels: -ClusterFuzz-Wrong
We have made a bunch of changes on ClusterFuzz side, so resetting ClusterFuzz-Wrong label.
