Archiving layout test results fails when there are many results to upload.
Issue description:

On builds
https://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_ng/builds/394718
https://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_ng/builds/396878
the archive_webkit_tests_results step fails with an error code; however, the file does seem to have been uploaded:

gsutil.py du gs://chromium-layout-test-archives/win_chromium_rel_ng/396878/layout-test-results.zip
672998799  gs://chromium-layout-test-archives/win_chromium_rel_ng/396878/layout-test-results.zip
gsutil.py du gs://chromium-layout-test-archives/win_chromium_rel_ng/394718/layout-test-results.zip
671461099  gs://chromium-layout-test-archives/win_chromium_rel_ng/394718/layout-test-results.zip
Mar 13 2017
Two points of data:

1. The build is purple because the gsutil.py command is timing out after 3600 seconds (1 hour):

E:\b\depot_tools\python276_bin\python.exe E:\b\rr\tmp6lagfk\rw\checkout\scripts\slave\.recipe_deps\depot_tools\gsutil.py -- -q -h "Cache-Control:public, max-age=31556926" cp file://E:\b\c\chrome_staging\layout-test-results.zip gs://chromium-layout-test-archives/win10_blink_rel/2236/layout-test-results.zip
E:\b\depot_tools\python276_bin\python.exe E:\b\rr\tmp6lagfk\rw\checkout\scripts\slave\.recipe_deps\depot_tools\gsutil.py -- -m -q -h "Cache-Control:public, max-age=31556926" cp -R E:\b\rr\tmp6lagfk\w\layout-test-results gs://chromium-layout-test-archives/win10_blink_rel/2236
command timed out: 3600 seconds without output, attempting to kill
E:\b\depot_tools\python276_bin\python.exe E:\b\rr\tmp6lagfk\rw\checkout\scripts\slave\.recipe_deps\depot_tools\gsutil.py -- -q -h "Cache-Control:public, max-age=31556926" cp file://E:\b\c\chrome_staging\LAST_CHANGE gs://chromium-layout-test-archives/win10_blink_rel/2236/layout-test-results/LAST_CHANGE
program finished with exit code 1

2. The archive_webkit_tests_results step is returning non-zero, which is why the step is red.

(2) is probably because of (1): gsutil is returning a non-zero exit code because it was killed.
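The kill above comes from a no-output watchdog, not from gsutil itself. One generic way around such watchdogs is to emit a periodic keepalive line while the real command runs. A minimal sketch, assuming nothing about the recipe code; the wrapper name and the interval are made up:

```python
import subprocess
import sys
import threading

def run_with_heartbeat(cmd, interval=300):
    """Run cmd while printing a keepalive line every `interval` seconds,
    so a no-output watchdog (like buildbot's 3600-second limit) sees
    activity and does not kill the step. Hypothetical wrapper, not the
    recipe's actual code.
    """
    done = threading.Event()

    def heartbeat():
        elapsed = 0
        # Event.wait returns False on timeout, True once set().
        while not done.wait(interval):
            elapsed += interval
            print('still running after %d seconds...' % elapsed)
            sys.stdout.flush()

    t = threading.Thread(target=heartbeat)
    t.daemon = True
    t.start()
    try:
        return subprocess.call(cmd)
    finally:
        done.set()

# A quick command finishes before the first heartbeat is ever printed.
exit_code = run_with_heartbeat([sys.executable, '-c', 'print("done")'], interval=60)
```

This keeps the watchdog happy, but as noted later in the thread, the upload would still take a very long time to finish.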
,
Mar 13 2017
Ah, so when trying to upload some large files, it fails due to the timeout? In general, I think cases where the file to upload is hundreds of MB are very rare, and if archive_webkit_tests_results occasionally fails to upload layout-test-results.zip because of this, that should be OK. So I think what we want to do here is handle this case more gracefully: maybe by skipping the upload for large files, or maybe by aborting the upload within the step itself and continuing if it takes more than a certain amount of time?
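The two mitigations suggested here (skip very large files, or bound the upload time and continue) could be combined into a best-effort wrapper. This is a hypothetical sketch, not the actual archive step; the 500 MB cutoff and 1800-second timeout are invented, and it relies on Python 3's subprocess timeout support:

```python
import os
import subprocess

def upload_best_effort(cmd, path=None, timeout_s=1800, max_bytes=500 * 1024 * 1024):
    """Try an upload, but never fail the step over it: skip files above a
    size cutoff, and treat a timeout as a warning rather than an error.
    Hypothetical helper; the cutoff and timeout values are illustrative.
    """
    if path is not None and os.path.getsize(path) > max_bytes:
        print('skipping upload: %s is larger than %d bytes' % (path, max_bytes))
        return 0
    try:
        return subprocess.call(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The child has been killed and reaped; report success anyway so
        # the step stays green.
        print('upload timed out after %d seconds; continuing anyway' % timeout_s)
        return 0
```

Either branch returns 0, so a slow or oversized upload no longer turns the step red.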
Mar 13 2017
As mentioned in the thread on chrome-infrastructure-team@, I would actually need this case to work rather than fail early, since I rely on webkit-patch rebaseline-cl to fetch the rebaselines for this large change, which affects font files. I have no alternative means of retrieving all Windows rebaselines across the different Windows versions, such as Win 8 and Win 10. Since Dirk asked on the email thread about the size of the ZIPs: the file contains actual, expected, and diff output for all test cases, plus the same data for retry_(1|2|3), which is also a large number of tests due to the large number of failures. I am surprised by the slow upload speed, though: downloading this file (gs://chromium-layout-test-archives/win_chromium_rel_ng/396878/layout-test-results.zip) using 'gsutil cp' in the office takes less than 15 seconds.
Mar 13 2017
Ah, interesting. I forgot about the retries. We should turn those off for this change. Upload speed for gsutil can vary widely. It also doesn't help that we upload multiple copies of things.
Mar 13 2017
In any case, it sounds like the zip files aren't the problem here; looking at comment #2, it looks like it gets killed when uploading the individual files after the zip is already uploaded, since that takes at least an hour and doesn't log anything while it's running. Previously, we stopped it from logging anything there (when invoking `gsutil.py -- -m ...`) because it was really verbose: https://chromium-review.googlesource.com/c/419058/. Conceivably, if it logged stuff there it wouldn't be killed after an hour, but it would still take a really long time to finish uploading.
Mar 13 2017
So the timeout is because of buildbot, not because of gsutil? I vaguely remember that being a problem on another builder: if there was no stdout for an hour, the step would time out.
Mar 13 2017
I can't actually tell from the logs what happened at the end of the run. I'm not sure if LogDog is choking on them or if maybe I'm looking at the wrong things. dnj@ - any idea? If nothing else, these might be good stress-test cases for LogDog :).
Mar 13 2017
Heya - the log is really long, but LogDog does return the full log for me. I posted the last lines of it in my response in #2.
Mar 13 2017
Ah, thanks.
Mar 15 2017
What are the next steps here? Unfortunately I am quite blocked from landing a Windows font improvement because of this issue. Thanks for your help again.
Mar 15 2017
I think the next thing you should do is modify run-webkit-tests as part of your change to turn off the retries: https://cs.chromium.org/chromium/src/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/run_webkit_tests.py?rcl=b2e8c35d8bde821a28d0a648d75998efed7ad1c8&l=363 and change the `3` to `0`, and see if that's enough to get you through. You might also modify the code to not save the -expected and -diff files, though of course that makes it impossible to review the changes.

Also, it looks like the Windows version of this step isn't passing --exit-after-n-failures-or-timeouts or --exit-after-n-failures, which is perhaps good in this case, but bad generally. We should fix that, and change the code so that if we get too many failures we don't bother to retry things.

Generally speaking, I think this is a tough thing to ask the infrastructure to support, because it is so far beyond what anything else needs to do when handling failures. When I did large rebaselining efforts in the past, I would do the rebaselining on my own machine and commit the results directly. I'm not saying we can't make the infrastructure work for this case, but doing that well might take far longer than the time it would take devs to work around it.

One way to partially address the upload problem might be to upload a single copy of the zip file and write an app that can fetch individual file results out of the zip on the fly, rather than uploading multiple copies of things (see bug 310382).
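The suggested edit targets the retry default in run_webkit_tests.py's option parsing. A runnable sketch of what the changed option would look like; the option name `--num-retries` and the surrounding details are approximations, not copied from the linked revision:

```python
import optparse

# Sketch: the retry default (the `3` at the linked line) becomes 0, so
# failing tests are not re-run and no retry_1/2/3 result trees are
# produced for archiving.
parser = optparse.OptionParser()
parser.add_option('--num-retries', type='int', default=0,
                  help='Number of times to retry failing tests (default was 3).')

options, _ = parser.parse_args([])
print(options.num_retries)  # prints 0
```

With retries off, the results directory only contains one set of actual/expected/diff files, which shrinks the archive considerably.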
Mar 15 2017
Thanks, Dirk, that sounds very reasonable, and thank you for the suggestions.

> change the `3` to `0` and see if that's enough to get you through.

I'll try that. The other idea I just had: if the ZIP uploads succeed, could I perhaps modify the webkit-patch rebaseline-cl script to fetch those, extract them locally to get to the results, and ignore the exception state of the bot? Or could I use functions of webkit-patch to merge and de-duplicate baselines locally into my commit from the layout-test-results.zip on the servers? For example, would it work to download the win10 results, extract/copy the failed ones to LayoutTests/platform/win, and then use webkit-patch optimize-baselines to de-duplicate them? Any other suggestions on this, Quinten?

> I would do the rebaselining on my own machine and commit them directly.

Right, for Win10 I could do that, but I don't have Windows 7, and I am guessing that getting a separate machine and setting it up would be a much longer round trip, so I hope to be able to use the bot results.
Mar 22 2017
> If the ZIP uploads succeed, could I perhaps modify the webkit-patch rebaseline-cl script to fetch those and extract them locally [...] Any other suggestions on this, Quinten?

Sorry for the slow response. Ignoring the exception state of the bot seems reasonable to me, although I'm not sure whether it theoretically should already be doing that :-/ I just messed around with it a bit (https://codereview.chromium.org/2767233002), and I think that theoretically rebaseline-cl should try to download results for purple builds (but it will still give up and abort if it fails to download any of the results, e.g. if they were only partly uploaded). My hypothesis now is that it's trying to download but

Downloading layout-test-results.zip for each builder, extracting locally, and merging the files into LayoutTests/platform/* should work, but sounds like a lot of hassle :-(

Any more luck on this? Is this still blocking work?
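The "download each builder's zip and merge locally" fallback can be sketched as a small helper. This is hypothetical code, not part of webkit-patch; it assumes the zip has a top-level layout-test-results/ directory and only handles the copy step, with `webkit-patch optimize-baselines` run afterwards to de-duplicate:

```python
import io
import os
import zipfile

def merge_actuals_into_baselines(zip_bytes, platform_dir):
    """Copy *-actual.* results out of a layout-test-results.zip into
    platform_dir as *-expected.* baselines. Hypothetical helper (not part
    of webkit-patch); returns the list of files it wrote.
    """
    written = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if name.endswith('/') or '-actual.' not in os.path.basename(name):
                continue
            # Drop the leading layout-test-results/ directory and turn
            # e.g. fast/text/foo-actual.png into fast/text/foo-expected.png.
            rel = name.split('/', 1)[-1]
            target = os.path.join(
                platform_dir, *rel.replace('-actual.', '-expected.').split('/'))
            target_dir = os.path.dirname(target)
            if target_dir and not os.path.isdir(target_dir):
                os.makedirs(target_dir)
            with open(target, 'wb') as out:
                out.write(zf.read(name))
            written.append(target)
    return written
```

Repeating this per builder's zip (win7, win10, ...) and then optimizing baselines would approximate what rebaseline-cl does, just without requiring every bot's results to be available.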
Mar 23 2017
Thank you, Quinten. It's not imminently blocking anymore at the moment, since I am resorting to a more manual approach similar to what Dirk described. I did the initial set of platform/win-specific baselines, and now, at least on Windows, the subsequent test result uploads succeed, but I am running into issue 703680. It would still be very helpful (though that's more of a feature request; I could file it separately) if rebaseline-cl did not insist on all bot results being available. But if I understood correctly, there are some limitations, as this could lead to inconsistent baselines?
Mar 23 2017
That's right. In the first version, rebaseline-cl insists on all results being available because that makes de-duping baselines simpler and generally means we get correct baselines for all platforms as long as the tests aren't flaky. In practice, though, we can usually get correct baselines with just one or a few platforms' results; this is bug 673966, and it's the next major thing I want to fix with rebaseline-cl :-)

P.S. I'm not sure if you ran into this, but when trying to do a relatively large rebaseline today, rebaseline-cl got stuck, apparently because of some performance regressions I introduced very recently (fix: https://codereview.chromium.org/2770023003).
Mar 28 2017
Random thought: would isolating these files rather than uploading them directly be any better? It would save bandwidth by not uploading identical content, has intelligent retries built in, and doesn't depend on gsutil and its quirks. It might even be faster to isolate the original files rather than their zip, since isolate only uploads files it hasn't seen before.
Mar 28 2017
In the meantime, I worked on the rebaselines manually and managed to land the original change. However, this situation will likely come up again occasionally when we have larger updates to FreeType or to our font rendering settings on other platforms. I am reducing the priority for now. Thanks to everyone who has helped on this issue so far.
Apr 6 2018
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue. Sorry for the inconvenience if the bug really should have been left as Available. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Apr 6 2018
This is most likely fixed now, because it looks like it was hanging on the recursive upload of the directories (not the zip files), and we don't do that any more (thanks to martiniss' work to make things work with just the zip files).
Comment 1 by drott@chromium.org, Mar 8 2017