Opening test results with 1200 failures (for Mac OS 10.13 bot) fails |
||||
Issue descriptionOpening webkit_layout_test results on https://ci.chromium.org/buildbot/tryserver.blink/mac10.13_blink_rel/23 https://test-results.appspot.com/data/layout_results/mac10_13_blink_rel/23/layout-test-results/results.html fails with a short text "Exceeded soft private memory limit of 1024 MB with 1361 MB after servicing 419 requests total" martiniss@, could you take a look?
,
Feb 20 2018
My temporary fix works, demo at https://14080-59dd026-tainted-martiniss-dot-test-results-test-hrd.appspot.com/data/layout_results/mac10_13_blink_rel/23/layout-test-results/results.html It won't be super fast because everything isn't cached. If that's a big issue let me know and I can figure out a way to make it work. This fix should land and go live today hopefully.
,
Feb 20 2018
The following revision refers to this bug: https://chromium.googlesource.com/infra/infra/+/4eb8e34668c12b7bf53fdcfa55518b1c885c6d02 commit 4eb8e34668c12b7bf53fdcfa55518b1c885c6d02 Author: Stephen Martinis <martiniss@chromium.org> Date: Tue Feb 20 19:26:44 2018 Test results zip: handle large number of failures If we have too many failures, the test results server takes too long trying to cache data and OOMs every time a user requests the data. Only cache some of the failures, in case we have a huge number of them. Bug: 813803 Change-Id: I1b38ecf831b099809d6c66c7411afd69cdfea77c Reviewed-on: https://chromium-review.googlesource.com/926908 Reviewed-by: Sean McCullough <seanmccullough@chromium.org> Commit-Queue: Stephen Martinis <martiniss@chromium.org> [modify] https://crrev.com/4eb8e34668c12b7bf53fdcfa55518b1c885c6d02/go/src/infra/appengine/test-results/frontend/zip.go
,
Feb 20 2018
Thank you for quickly attacking this problem.
,
Feb 20 2018
Fix should be live
,
Feb 26 2018
I am running out of memory still: https://test-results.appspot.com/data/layout_results/linux_trusty_blink_rel/24383/layout-test-results/failing_results.json This has 2859 failures, and it is difficult to do the needed rebaseline with the server unable to handle it.
,
Feb 26 2018
I am not sure, but maybe replacing the count limit with a size limit would help.
Instead of:
if len(toPut) > 500 {
break
}
Maybe:
var totalSize uint64
...
for _,f := ... {
...
totalSize += f.UncompressedSize
if totalSize > 200 * 1024 * 1024 {
break
}
??
,
Feb 26 2018
Actually I'm a bit confused. It looks like this is uncompressing the same entry multiple times:
for _, f := range zr.File {
for _, test := range failedTests {
if strings.Contains(f.Name, test) {
freader, err := f.Open()
res, err := ioutil.ReadAll(freader)
.....
Though I am likely misunderstanding the code.
,
Feb 26 2018
This bug has nothing to do with why https://test-results.appspot.com/data/layout_results/linux_trusty_blink_rel/24383/layout-test-results/failing_results.json is failing. This URL is failing to load because the zip file is 600 MB, and the test results server frontends are running out of memory when retrieving the zip file, as far as I can tell. This bug fixed an issue with retrieving the results.html file. This URL is retrieving the results json file; the logic you're talking about in #7 and #8 isn't triggered by this URL at all. I'm not sure if there's a way to fix this... :/
,
Feb 26 2018
So there are two different failure points behind the same symptom (OOM): too many entries (failing at warming up cache), and too large ZIP (failing at retrieving). The former has been fixed in crrev.com/c/926908 . The latter is relatively rare but still concerning. For example, agoode@'s CL https://crrev.com/c/933805. Changes like this won't be able to land without fixing this bug. Is there any kind of temporary disk storage on appengine to store the zip? Or is there a way to increase the memory limit? I'd guess the upper bound of the zip size is probably ~1GB, as agoode@'s CL is almost the worst scenario... If neither is feasible, there is a theoretically possible but painful fix. IIRC, zip files can be indexed. So perhaps we can use HTTP range requests to get the metadata of the zip file first, and only download the necessary bits to extract the desired file... And we might need to hand write a custom zip library for this...
,
Feb 26 2018
> So perhaps we can use HTTP range requests to get the metadata of the zip file > first, and only download the necessary bits to extract the desired file... > And we might need to hand write a custom zip library for this... Yup, it shouldn't be hard to write a custom reader to extract an individual member, it's something like three reads.
,
Feb 26 2018
I'm fiddling with a CL to do the partial reads for zips. I have a working CL but it's really slow because the default go zip implementation reads 4096 byte chunks at a time, and the file we want to download would take ~3k of those reads, which is way too many RPCs. I'm looking to add a buffer which would speed this up.
,
Feb 26 2018
Ok, I have a WIP CL which seems to work locally. I'll hopefully be able to land this today.
,
Feb 26 2018
Thanks! And sorry for my confusion.
,
Feb 26 2018
I'm still seeing some OOM 500s when browsing results in a ~200MB zip file. https://test-results.appspot.com/data/layout_results/mac10_13_blink_rel/37/layout-test-results/results.html Some failures contained messages like "Exceeded soft private memory limit of 1024 MB with 1329 MB after servicing 805 requests total", while the others were empty. All failures went away after retry, which suggests the zip file isn't too large to retrieve and unzip. Is the threshold in crrev.com/c/926908 still a bit too high? Or is there some other sources of OOM?
,
Feb 26 2018
I noticed this as well while doing my testing. The change I'm going to land seems to make these go away; it means the app engine servers don't need to keep the whole zip file in memory, which is probably why we're OOMing so much.
,
Feb 27 2018
The following revision refers to this bug: https://chromium.googlesource.com/infra/infra/+/6a350b45a2b612a1831885c8e74a8c6cbb8fe3ce commit 6a350b45a2b612a1831885c8e74a8c6cbb8fe3ce Author: Stephen Martinis <martiniss@chromium.org> Date: Tue Feb 27 22:17:15 2018 Test results zip: Handle huge zip file Bug: 813803 Change-Id: I3982017e94c5c6e8ae0940853e69bf00fca45bbe Reviewed-on: https://chromium-review.googlesource.com/938394 Commit-Queue: Stephen Martinis <martiniss@chromium.org> Reviewed-by: Sean McCullough <seanmccullough@chromium.org> [modify] https://crrev.com/6a350b45a2b612a1831885c8e74a8c6cbb8fe3ce/go/src/infra/appengine/test-results/frontend/zip.go
,
Feb 27 2018
Fix should be live now.
,
Feb 28 2018
It works, thanks! |
||||
►
Sign in to add a comment |
||||
Comment 1 by martiniss@chromium.org
, Feb 20 2018