New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 713196 link

Starred by 3 users

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 2
Type: Bug

Blocked on:
issue 735306
issue 717736
issue 735303
issue 758750

Blocking:
issue 790689



Sign in to add a comment

ANGLE CQ Win32/Release flakily triggers python exception when processing logs

Project Member Reported by jmad...@chromium.org, Apr 19 2017

Issue description

I believe what's happening here is the logs for this bot are so large that they break 32-bit Python's memory limit on upload. The bots have a step where they re-encode the whole log, and this seems to be where it crashes.

Link to waterfall:

https://build.chromium.org/p/tryserver.chromium.angle/builders/win_angle_rel_ng?numbuilds=50

Link to failing builds:

https://build.chromium.org/p/tryserver.chromium.angle/builders/win_angle_rel_ng/builds/4563
https://build.chromium.org/p/tryserver.chromium.angle/builders/win_angle_rel_ng/builds/4557
https://build.chromium.org/p/tryserver.chromium.angle/builders/win_angle_rel_ng/builds/4529

Several others are apparent on the waterfall currently.

The error in question:

Traceback (most recent call last):
  File "E:\b\rr\tmps320xb\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\util.py", line 150, in raises
    yield
  File "E:\b\rr\tmps320xb\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\run.py", line 255, in run_step
    open_step = self._step_runner.open_step(step_config)
  File "E:\b\rr\tmps320xb\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\step_runner.py", line 176, in open_step
    step_config, recipe_test_api.DisabledTestData()
  File "E:\b\rr\tmps320xb\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\step_runner.py", line 525, in render_step
    new_cmd.extend(item.render(tdata))
  File "E:\b\rr\tmps320xb\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_modules\raw_io\api.py", line 38, in render
    os.write(input_fd, self.encode(self.data))
  File "E:\b\rr\tmps320xb\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_modules\raw_io\api.py", line 67, in encode
    return data.decode('utf-8', 'replace').encode('utf-8')
MemoryError

 
Another thing to note is that the failing step is
"Upload to test-results [angle_deqp_gles3_d3d11_tests on ATI GPU on Windows (with patch) on Windows-2008ServerR2-SP1]",
however it fails only on ANGLE CQ, but
"Upload to test-results [angle_deqp_gles3_d3d11_tests on ATI GPU on Windows on Windows-2008ServerR2-SP1]"
succeeds on GPU.FYI waterfall, for example:
https://build.chromium.org/p/chromium.gpu.fyi/builders/Win7%20Release%20%28AMD%29/builds/2455
Cc: -iannucci@chromium.org ynovikov@chromium.org phajdan.jr@chromium.org
Owner: iannucci@chromium.org
This is the first occurrence I found of this problem:
https://luci-milo.appspot.com/buildbot/tryserver.chromium.angle/win_angle_rel_ng/4156,
dated "Wed, 2017-03-22 7:02:37 PM (local time)".

Not sure what to contribute it to.

iannucci@, could you please take a look, you seem to have touched this code recently?
Also ccing phajdan.jr@.

If the problem is with parsing stdout of these tests, we've switched Android to using XML output to produce JSON, maybe this will help here, too?
Cc: tandrii@chromium.org serg...@chromium.org
Robbie, Pawel, Sergiy, Andrii, can someone please look at this?
Project Member

Comment 4 by bugdroid1@chromium.org, Apr 20 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/angle/angle/+/de32c2e1a388b87d929bcd5bacf3189c370c8fa4

commit de32c2e1a388b87d929bcd5bacf3189c370c8fa4
Author: Jamie Madill <jmadill@chromium.org>
Date: Thu Apr 20 20:40:09 2017

dEQP: Remove shader compile/link timing output.

This should reduce our log sizes significantly. Will upstream the fix
to dEQP and then remove the custom workaround once we roll.

BUG=chromium:713196

Change-Id: I0ddb678bf5256bd5f05782ca40d49501e84e4cae
Reviewed-on: https://chromium-review.googlesource.com/483047
Reviewed-by: Geoff Lang <geofflang@chromium.org>
Commit-Queue: Jamie Madill <jmadill@chromium.org>

[add] https://crrev.com/de32c2e1a388b87d929bcd5bacf3189c370c8fa4/src/tests/deqp_support/qpTestLog.c
[modify] https://crrev.com/de32c2e1a388b87d929bcd5bacf3189c370c8fa4/src/tests/deqp.gypi

Comment 5 by iannu...@google.com, Apr 20 2017

Owner: phajdan.jr@chromium.org
MemoryError happens when you load too much data into the recipe; this means that you're likely ingesting a 500+MB log (i.e. with raw_io.output or json.output) into the recipe_engine.

There's not much I can do about this (besides adding safeguards to raw_io to truncate the log data to avoid a crash, which obviously has other downsides). I would recommend altering the test/test_runner to output less data to the files that recipe_engine is ingesting (it looks like that's what happened in #4?). Log data that goes to stdout/stderr doesn't have a size limit.

Pawel is most familiar with this recipe, I think, so perhaps he can comment more.
There are 12 shards, each one has ~15MB output.json.
That's 180 MB, much less than 500MB.

Also, I'm not sure the formula should be 12*15, as only 1/12th of each output.json is unique. Makes sense to combine the 12 into 1, which I estimate would be around 20MB.
Cc: -serg...@chromium.org
Project Member

Comment 8 by bugdroid1@chromium.org, Apr 24 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/cda13a9e3a043b37ecc7f28a60333091a80bdf3e

commit cda13a9e3a043b37ecc7f28a60333091a80bdf3e
Author: jmadill <jmadill@chromium.org>
Date: Mon Apr 24 16:42:09 2017

Roll ANGLE ba992ab..331f6db

https://chromium.googlesource.com/angle/angle.git/+log/ba992ab..331f6db

BUG= chromium:644033 ,chromium:713196

TBR=geofflang@chromium.org

TEST=bots

CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.win:win_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.android:android_optional_gpu_tests_rel

Review-Url: https://codereview.chromium.org/2842433003
Cr-Commit-Position: refs/heads/master@{#466660}

[modify] https://crrev.com/cda13a9e3a043b37ecc7f28a60333091a80bdf3e/DEPS

Cc: phajdan@google.com
Owner: ----
Status: Untriaged (was: Assigned)
I'd say we either need 64-bit python, or the amount of processed data needs to be greatly reduced.

Let me know if/how can I best further help here.
Owner: no...@chromium.org
Status: Assigned (was: Untriaged)
Nodir is working on rebuilding python from scratch. Since we only support 64 bits OSes now, it could be a good thing to settle for x64 python build.
Is it possible to check what is the data size?
As I mentioned in #6, I don't think it exceeds 180MB, and if the merge of JSONs from all shards is done correctly, it should be around 20MB.

I think verifying that we handle the data efficiently has some advantages to utilizing more memory.
Cc: no...@chromium.org
Owner: ----
Status: Available (was: Assigned)
Actually Nodir should not own *this* bug, but we could mark the python rebuild bug as a blocker here.

Comment 13 by no...@chromium.org, Apr 27 2017

i am building python only on linux and mac ( bug 714852 ). I don't build 64bit Python on Windows. It would be great to do, but it is outside of my scope.
This bug hasn't been updated in 4 days and it's still in the trooper queue.

maruel/nodir@: Who owns 64bit python on windows, then?
phajdan: How can ynovikov@chromium.org check the size of stdout for this step?

# 14
no one owns it. We never (?) had 64bit python on windows
Blockedon: 717736
I filed a bug to do this and marked it as a blocker. I still recommend trimming down the JSON data and I'll leave this issue focused on the JSON data or whatever workaround that can be implemented by Angle team.
Labels: -Infra-Troopers
Since this is blocked on a non-trooper issue, I'm removing the troopers label from it.
I think I don't fully understand what you're trying to do in your recipe steps here, but doing anything where you're trying to send > 1 MB of JSON directly back through the recipe engine and recipe logs is almost certainly the wrong thing to do (i.e., this isn't what we designed things to do and we're probably using it wrong).

So, rather than trying to work around things by getting a 64-bit python, we need to figure out if there's a different way to do what you're trying to do.
Cc: -tandrii@chromium.org
I'm not really familiar with the recipe myself, but I think what we are trying to send are the test results. angle_deqp_gles3_d3d11_tests suite has more than 40,000 tests. It's split into 12 shards, with each shard's output.json being ~15MB in size. I suspect it tries to upload 12*15=180MB of data, though most of every shard's output.json is duplicated in another shards - only the tests which ran in this shard are unique, so I think if all the shards are merged before uploading, the size should be much smaller, maybe around 20MB.

Can someone who is more familiar with recipes confirm the above? Ken? Jamie?

Comment 21 by kbr@chromium.org, May 4 2017

Owner: phajdan.jr@chromium.org
Status: Assigned (was: Available)
Here's one of the shards' output.

The bulk of the output seems to be a "test_locations" dictionary that seems to include the line number of every test in the binary, not just those run in the shard. (Note that for example:
functional_buffer_copy_single_buffer_transform_feedback_uniform
shows up in the "all_tests" list and in the "test_locations" dictionary, but not in the "per_iteration_data" which seems to contain the results from just the tests that ran in this shard.)

If the base/test/launcher harness either skipped writing the test_locations dictionary entirely, or filtered it so it only contained those tests that ran, that would save enough space that this wouldn't work.

Pawel, could you look into adding such an option?

angle_deqp_gles3_d3d11_tests_shard.json.gz
332 KB Download
Cc: -no...@chromium.org
Cc: serg...@chromium.org
I want to look into this for a bit; I think there's a few things we're doing here that we should probably change.

Also, +sergiyb for the test_locations issue, particularly since he claimed that using a trie for test names for this wasn't needed :).
Components: -Infra
Owner: dpranke@chromium.org
Components: -Internals>GPU>Testing -Internals>GPU>ANGLE Infra
Labels: -Pri-1 Pri-2
dpranke@ removing GPU categories and bumping priority to P2.  Please triage as appropriate.

Thanks.
Labels: -Pri-2 Pri-1
This is a bit of a bigger problem now. It seems that whenever a Windows ANGLE bot run has any kind of error, the logs seem to explode, and get large enough to trigger the Python exception. So, every failure becomes a purple exception, with the tests at the end being cut off, and the failing steps not being reported correctly.

Example failures:

https://build.chromium.org/p/tryserver.chromium.angle/builders/win_angle_rel_ng/builds/5331

Here one step has a few flaky timeouts, which is a problem we need to fix, but it seems to crash the recipe, preventing the tests at the end from running. Upgrading priority because this does impact our ability to land changes through CQ flakiness.
Components: Internals>GPU>ANGLE
Labels: Hotlist-PixelWrangler
Adding a few tags.. I know this bug isn't really an ANGLE bug but want to keep it visible.
Seems to be failing on the ANGLE CQ for otherwise green CLs now as well, not sure what changed:

https://build.chromium.org/p/tryserver.chromium.angle/builders/win_angle_rel_ng/builds/5365

Dirk would you or someone else have a chance to look at this? I'm concerned about the ability for our CQ to run on Windows (our most important platform). We could start removing tests to just get things going again.
Yes, sorry for the long delay :( I'll get on this today.
Blockedon: 735303 735306
Components: Infra>Platform>Recipes Infra>Client>Chrome
Owner: jmad...@chromium.org
My guess is that your test suite has gotten big enough so that when we're splitting and merging data you're running out of memory regardless of whether something failed or not.

Unfortunately, I haven't been able to reconstruct the context I had in my head when I made the comment in #23. I think there are a few things wrong:

1) as noted in comment #21, we're logging all of the tests we find, not just the ones in the shard. This is unnecessary and expensive.

2) we're re-rending the json output for reasons I don't fully understand (to try and prevent us from sending illegal utf8 data, I guess). 

3) We read the entire json results file into memory to look at a single field (the valid field). Maybe we should switch things to just trust the error code, or maybe we need to split the summary fields out from the detail fields, or call out to a helper script to just extract the summary fields.

@jmadill - I suggest you disable some swath of tests temporarily to get this green again. I'll bump this bug back to you to do so for now. I've filed bugs  735303  and 735306 and asked sergiyb@ and phajdan to look at the issues, respectively; we should be able to fix some of the underlying issues pretty quickly.

An alternative is to switch to a 64-bit python, but I think the bottom line is that we're doing stupid, expensive things here and we need to stop doing them rather than paper over them. All of this excessive logging makes other parts of the build output unusable as a result, and switching to 64-bit won't fix that.
Thanks Dirk. Geoff landed a CL to disable some tests in https://chromium-review.googlesource.com/c/541616/ .. can you re-assign to the appropriate owner?
Owner: geoffl...@chromium.org
Let's leave this open for a while and see if things green up, and then revert (re-enable) the tests and see if it stays that way, and then we can close this. 
Components: -Infra
Project Member

Comment 35 by bugdroid1@chromium.org, Jul 4 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/164eca3f4584601d870c2751229b0e9abf33eb15

commit 164eca3f4584601d870c2751229b0e9abf33eb15
Author: Geoff Lang <geofflang@chromium.org>
Date: Tue Jul 04 15:52:17 2017

Re-enable the dEQP ES3 tests for AMD on the GPU FYI waterfall.

TBR=kbr@chromium.org

BUG=713196

Cq-Include-Trybots: master.tryserver.chromium.android:android_optional_gpu_tests_rel;master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel
Change-Id: Ib0d3b1996325ebfa9342240b17393e4c2f4a84cd
Reviewed-on: https://chromium-review.googlesource.com/558807
Commit-Queue: Geoff Lang <geofflang@chromium.org>
Reviewed-by: Geoff Lang <geofflang@chromium.org>
Cr-Commit-Position: refs/heads/master@{#484118}
[modify] https://crrev.com/164eca3f4584601d870c2751229b0e9abf33eb15/content/test/gpu/generate_buildbot_json.py
[modify] https://crrev.com/164eca3f4584601d870c2751229b0e9abf33eb15/testing/buildbot/chromium.gpu.fyi.json

Project Member

Comment 36 by bugdroid1@chromium.org, Jul 5 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/2e23bfcc4750f3ffc7945cac8a01c9b9c61c4081

commit 2e23bfcc4750f3ffc7945cac8a01c9b9c61c4081
Author: Jamie Madill <jmadill@chromium.org>
Date: Wed Jul 05 14:09:28 2017

Revert "Re-enable the dEQP ES3 tests for AMD on the GPU FYI waterfall."

This reverts commit 164eca3f4584601d870c2751229b0e9abf33eb15.

Reason for revert: Recipe engine bug still happening.

https://build.chromium.org/p/tryserver.chromium.angle/builders/win_angle_rel_ng/builds/5598

BUG=713196

Original change's description:
> Re-enable the dEQP ES3 tests for AMD on the GPU FYI waterfall.
> 
> TBR=kbr@chromium.org
> 
> BUG=713196
> 
> Cq-Include-Trybots: master.tryserver.chromium.android:android_optional_gpu_tests_rel;master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel
> Change-Id: Ib0d3b1996325ebfa9342240b17393e4c2f4a84cd
> Reviewed-on: https://chromium-review.googlesource.com/558807
> Commit-Queue: Geoff Lang <geofflang@chromium.org>
> Reviewed-by: Geoff Lang <geofflang@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#484118}

TBR=geofflang@chromium.org,kbr@chromium.org

Change-Id: I4f4bfd177ea15d239bb2dcaca889cf65c56b1963
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Bug: 713196
Cq-Include-Trybots: master.tryserver.chromium.android:android_optional_gpu_tests_rel;master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel
Reviewed-on: https://chromium-review.googlesource.com/559449
Reviewed-by: Jamie Madill <jmadill@chromium.org>
Commit-Queue: Jamie Madill <jmadill@chromium.org>
Cr-Commit-Position: refs/heads/master@{#484263}
[modify] https://crrev.com/2e23bfcc4750f3ffc7945cac8a01c9b9c61c4081/content/test/gpu/generate_buildbot_json.py
[modify] https://crrev.com/2e23bfcc4750f3ffc7945cac8a01c9b9c61c4081/testing/buildbot/chromium.gpu.fyi.json

Cc: -serg...@chromium.org

Comment 38 by kbr@chromium.org, Aug 30 2017

Blockedon: 758750
Project Member

Comment 39 by bugdroid1@chromium.org, Sep 13 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/2f45957366026d96f5d5d2e2137b65058978d2b2

commit 2f45957366026d96f5d5d2e2137b65058978d2b2
Author: Jamie Madill <jmadill@chromium.org>
Date: Wed Sep 13 01:14:41 2017

ANGLE: Re-enable dEQP GLES3 tests on AMD.

Now that we have enough CQ capacity, put the bots back in operation.

BUG=713196, 727437 
NOTRY=true

Cq-Include-Trybots: master.tryserver.chromium.android:android_optional_gpu_tests_rel;master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel
Change-Id: I797460bac0313027f2f01031f1a49d223d4887df
Reviewed-on: https://chromium-review.googlesource.com/663533
Commit-Queue: Kenneth Russell <kbr@chromium.org>
Reviewed-by: Kenneth Russell <kbr@chromium.org>
Reviewed-by: Jamie Madill <jmadill@chromium.org>
Cr-Commit-Position: refs/heads/master@{#501497}
[modify] https://crrev.com/2f45957366026d96f5d5d2e2137b65058978d2b2/content/test/gpu/generate_buildbot_json.py
[modify] https://crrev.com/2f45957366026d96f5d5d2e2137b65058978d2b2/testing/buildbot/chromium.gpu.fyi.json

Status: Fixed (was: Assigned)
Owner: ----
Status: Available (was: Fixed)
This started happening again today on the ANGLE CQ. I commented more in issue 735306, but any possible resolution would be welcome since it's blocking ANGLE jobs from landing currently.

Comment 42 by kbr@chromium.org, Oct 26 2017

One note: all of the tests redundantly print the test's name while running the test. For example, from this shard:
https://chromium-swarm.appspot.com/task?id=396f2a5486c4ff10&refresh=10&show_raw=1

which is shard #0 of:
angle_deqp_gles3_d3d11_tests on ATI GPU on Windows (with patch) on Windows-2008ServerR2-SP1

from this build:
https://ci.chromium.org/buildbot/tryserver.chromium.angle/win_angle_deqp_rel_ng/988

we can see this pattern throughout the logs:

[ RUN      ] dEQP_GLES3.Default/functional_state_query_boolean_scissor_test_getinteger64
dEQP-GLES3.functional.state_query.boolean.scissor_test_getinteger64
[       OK ] dEQP_GLES3.Default/functional_state_query_boolean_scissor_test_getinteger64 (66 ms)

Just eliminating that useless print would probably reduce the log size by 25%. Jamie, do you think you could look into doing that as a short-term mitigation?

Owner: jmad...@chromium.org
Status: Assigned (was: Available)
Ken, it's not a useless print, it's the only way we have of correlating the test name with the real dEQP name and the expectations file. We use it for updating expectations. If this would get the bots working again we could do it, but I'd rather just disable some AMD tests before removing something we depend on for normal operation.

I'll look at disabling the GLES3 dEQP tests in the interim.
GLES >AMD< dEQP tests. Those are mostly redundant.
Can infra people just tell us the location if the merging code? Reducing
the size complexity seems a trivial task, if they don't have time to do it,
we can do it ourselves.
Project Member

Comment 46 by bugdroid1@chromium.org, Oct 26 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/4389564169d607678a653a3b5be327b58efdbe81

commit 4389564169d607678a653a3b5be327b58efdbe81
Author: Jamie Madill <jmadill@chromium.org>
Date: Thu Oct 26 18:57:17 2017

Temporarily disable ANGLE dEQP GLES3 AMD tests.

Between the WebGL 2 CTS and the NVIDIA GLES 3 tests, the AMD versions
of the GLES3 dEQP tests are mostly redundant. Currentlty the bots are
triggering a recipe engine bug due to very large log sizes, so disable
these (mostly redundant) tests until we get a fix for the recipe.

BUG=713196
TBR=kbr@chromium.org

NOTRY=true

Cq-Include-Trybots: master.tryserver.chromium.android:android_optional_gpu_tests_rel;master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel
Change-Id: Id0dc064891fead9be4026fa1dd723cebbdf015ec
Reviewed-on: https://chromium-review.googlesource.com/739468
Commit-Queue: Jamie Madill <jmadill@chromium.org>
Reviewed-by: Jamie Madill <jmadill@chromium.org>
Cr-Commit-Position: refs/heads/master@{#511901}
[modify] https://crrev.com/4389564169d607678a653a3b5be327b58efdbe81/content/test/gpu/generate_buildbot_json.py
[modify] https://crrev.com/4389564169d607678a653a3b5be327b58efdbe81/testing/buildbot/chromium.gpu.fyi.json

Owner: ----
Status: Available (was: Assigned)
Re-marking this as available, since the underlying bug is present and could trigger when more tests are (re-)added.

Comment 48 by kbr@chromium.org, Oct 26 2017

Jamie, thanks, I understand regarding the printing of the test names. The same issue exists for the WebGL conformance tests.

Actually, I'm not sure that's merging is the step causing the problem.

I've got an idea - the step that fails is uploading the test results.
That step is broken anyway because it tries to upload AMD results to the same file that NVIDIA results are uploaded to.
Also, I've looked here:
https://test-results.appspot.com/testfile?testtype=angle_deqp_gles3_d3d11_tests%20%28with%20patch%29
and the result files are around 15MB, not something that would cause running out of memory.

Can we maybe disable uploading the test results?

Meanwhile, someone should investigate the real problem of leaky memory, which I guess leaks from step to step until memory is exhausted. Maybe print used memory in the end of each step?
Blocking: 790689

Comment 52 by kbr@chromium.org, Dec 1 2017

Issue 790689 has been merged into this issue.
GPU triage: how should we proceed here? This looks important, but a P1 needs an owner.

Comment 54 by piman@chromium.org, Feb 12 2018

Labels: -Pri-1 Pri-2
GPU triage: lowering priority
Cc: iannu...@google.com

Sign in to add a comment