New issue
Advanced search Search tips

Issue 813803 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Feb 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug



Sign in to add a comment

Opening test results with 1200 failures (for Mac OS 10.13 bot) fails

Project Member Reported by drott@chromium.org, Feb 20 2018

Issue description

Opening webkit_layout_test results on
https://ci.chromium.org/buildbot/tryserver.blink/mac10.13_blink_rel/23

https://test-results.appspot.com/data/layout_results/mac10_13_blink_rel/23/layout-test-results/results.html

fails with a short text 
"Exceeded soft private memory limit of 1024 MB with 1361 MB after servicing 419 requests total"


martiniss@, could you take a look?

 
Status: Started (was: Assigned)
That's failing because there were ~1200 failed tests that it's trying to cache, which is probably taking too long and too much memory. 

The easiest thing to do is pick a limit for the number of tests to cache. I'l implement that first, and if it turns out we really need to cache all of them, I can add that later. 
My temporary fix works, demo at https://14080-59dd026-tainted-martiniss-dot-test-results-test-hrd.appspot.com/data/layout_results/mac10_13_blink_rel/23/layout-test-results/results.html

It won't be super fast because everything isn't cached. If that's a big issue let me know and I can figure out a way to make it work.

This fix should land and go live today hopefully.
Project Member

Comment 3 by bugdroid1@chromium.org, Feb 20 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/4eb8e34668c12b7bf53fdcfa55518b1c885c6d02

commit 4eb8e34668c12b7bf53fdcfa55518b1c885c6d02
Author: Stephen Martinis <martiniss@chromium.org>
Date: Tue Feb 20 19:26:44 2018

Test results zip: handle large number of failures

If we have too many failures, the test results server takes too long
trying to cache data and OOMs every time a user requests the data. Only
cache some of the failures, in case we have a huge number of them.

Bug:  813803 
Change-Id: I1b38ecf831b099809d6c66c7411afd69cdfea77c
Reviewed-on: https://chromium-review.googlesource.com/926908
Reviewed-by: Sean McCullough <seanmccullough@chromium.org>
Commit-Queue: Stephen Martinis <martiniss@chromium.org>

[modify] https://crrev.com/4eb8e34668c12b7bf53fdcfa55518b1c885c6d02/go/src/infra/appengine/test-results/frontend/zip.go

Comment 4 by drott@chromium.org, Feb 20 2018

Thank you for quickly attacking this problem.
Status: Fixed (was: Started)
Fix should be live

Comment 6 by agoode@chromium.org, Feb 26 2018

Status: Started (was: Fixed)
I am running out of memory still:
https://test-results.appspot.com/data/layout_results/linux_trusty_blink_rel/24383/layout-test-results/failing_results.json

This has 2859 failures, and it is difficult to do the needed rebaseline with the server unable to handle it.

Comment 7 by agoode@chromium.org, Feb 26 2018

I am not sure, but maybe replacing the count limit with a size limit would help.

Instead of:
if len(toPut) > 500 {
  break
}


Maybe:
var totalSize uint64
...
for _,f := ... {
 ...
 totalSize += f.UncompressedSize
 if totalSize > 200 * 1024 * 1024 {
   break
 }

??

Comment 8 by agoode@chromium.org, Feb 26 2018

Actually I'm a bit confused. It looks like this is uncompressing the same entry multiple times:

	for _, f := range zr.File {
		for _, test := range failedTests {
			if strings.Contains(f.Name, test) {
				freader, err := f.Open()
				res, err := ioutil.ReadAll(freader)
.....


Though I am likely misunderstanding the code.
This bug has nothing to do with why 
https://test-results.appspot.com/data/layout_results/linux_trusty_blink_rel/24383/layout-test-results/failing_results.json is failing.

This URL is failing to load because the zip file is 600 MB, and the test results server frontends are running out of memory when retrieving the zip file, as far as I can tell. 

This bug fixed an issue with retrieving the results.html file. This URL is retrieving the results json file; the logic you're talking about in #7 and #8 isn't triggered by this URL at all.

I'm not sure if there's a way to fix this... :/ 
So there are two different failure points behind the same symptom (OOM): too many entries (failing at warming up cache), and too large ZIP (failing at retrieving). The former has been fixed in crrev.com/c/926908 .

The latter is relatively rare but still concerning. For example, agoode@'s CL https://crrev.com/c/933805. Changes like this won't be able to land without fixing this bug.

Is there any kind of temporary disk storage on appengine to store the zip? Or is there a way to increase the memory limit? I'd guess the upper bound of the zip size is probably ~1GB, as agoode@'s CL is almost the worst scenario...

If neither is feasible, there is a theoretically possible but painful fix. IIRC, zip files can be indexed. So perhaps we can use HTTP range requests to get the metadata of the zip file first, and only download the necessary bits to extract the desired file... And we might need to hand write a custom zip library for this...
> So perhaps we can use HTTP range requests to get the metadata of the zip file 
> first, and only download the necessary bits to extract the desired file...
> And we might need to hand write a custom zip library for this...

Yup, it shouldn't be hard to write a custom reader to extract an individual member, it's something like three reads.
I'm fiddling with a CL to do the partial reads for zips. I have a working CL but it's really slow because the default go zip implementation reads 4096 byte chunks at a time, and the file we want to download would take ~3k of those reads, which is way too many RPCs. I'm looking to add a buffer which would speed this up.
Ok, I have a WIP CL which seems to work locally. I'll hopefully be able to land this today.
Thanks! And sorry for my confusion.
I'm still seeing some OOM 500s when browsing results in a ~200MB zip file.

https://test-results.appspot.com/data/layout_results/mac10_13_blink_rel/37/layout-test-results/results.html

Some failures contained messages like "Exceeded soft private memory limit of 1024 MB with 1329 MB after servicing 805 requests total", while the others were empty. All failures went away after retry, which suggests the zip file isn't too large to retrieve and unzip.

Is the threshold in crrev.com/c/926908 still a bit too high? Or is there some other sources of OOM?
I noticed this as well while doing my testing. The change I'm going to land seems to make these go away; it means the app engine servers don't need to keep the whole zip file in memory, which is probably why we're OOMing so much.
Project Member

Comment 17 by bugdroid1@chromium.org, Feb 27 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/6a350b45a2b612a1831885c8e74a8c6cbb8fe3ce

commit 6a350b45a2b612a1831885c8e74a8c6cbb8fe3ce
Author: Stephen Martinis <martiniss@chromium.org>
Date: Tue Feb 27 22:17:15 2018

Test results zip: Handle huge zip file

Bug:  813803 
Change-Id: I3982017e94c5c6e8ae0940853e69bf00fca45bbe
Reviewed-on: https://chromium-review.googlesource.com/938394
Commit-Queue: Stephen Martinis <martiniss@chromium.org>
Reviewed-by: Sean McCullough <seanmccullough@chromium.org>

[modify] https://crrev.com/6a350b45a2b612a1831885c8e74a8c6cbb8fe3ce/go/src/infra/appengine/test-results/frontend/zip.go

Status: Fixed (was: Started)
Fix should be live now.
It works, thanks!

Sign in to add a comment