ANGLE CQ Win32/Release flakily triggers python exception when processing logs |
|||||||||||||||||||||||||||||||
Issue descriptionI believe what's happening here is the logs for this bot are so large that they break 32-bit Python's memory limit on upload. The bots have a step where they re-encode the whole log, and this seems to be where it crashes. Link to waterfall: https://build.chromium.org/p/tryserver.chromium.angle/builders/win_angle_rel_ng?numbuilds=50 Link to failing builds: https://build.chromium.org/p/tryserver.chromium.angle/builders/win_angle_rel_ng/builds/4563 https://build.chromium.org/p/tryserver.chromium.angle/builders/win_angle_rel_ng/builds/4557 https://build.chromium.org/p/tryserver.chromium.angle/builders/win_angle_rel_ng/builds/4529 Several others are apparent on the waterfall currently. The error in question: Traceback (most recent call last): File "E:\b\rr\tmps320xb\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\util.py", line 150, in raises yield File "E:\b\rr\tmps320xb\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\run.py", line 255, in run_step open_step = self._step_runner.open_step(step_config) File "E:\b\rr\tmps320xb\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\step_runner.py", line 176, in open_step step_config, recipe_test_api.DisabledTestData() File "E:\b\rr\tmps320xb\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\step_runner.py", line 525, in render_step new_cmd.extend(item.render(tdata)) File "E:\b\rr\tmps320xb\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_modules\raw_io\api.py", line 38, in render os.write(input_fd, self.encode(self.data)) File "E:\b\rr\tmps320xb\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_modules\raw_io\api.py", line 67, in encode return data.decode('utf-8', 'replace').encode('utf-8') MemoryError
,
Apr 19 2017
This is the first occurrence I found of this problem: https://luci-milo.appspot.com/buildbot/tryserver.chromium.angle/win_angle_rel_ng/4156, dated "Wed, 2017-03-22 7:02:37 PM (local time)". Not sure what to contribute it to. iannucci@, could you please take a look, you seem to have touched this code recently? Also ccing phajdan.jr@. If the problem is with parsing stdout of these tests, we've switched Android to using XML output to produce JSON, maybe this will help here, too?
,
Apr 20 2017
Robbie, Pawel, Sergiy, Andrii, can someone please look at this?
,
Apr 20 2017
The following revision refers to this bug: https://chromium.googlesource.com/angle/angle/+/de32c2e1a388b87d929bcd5bacf3189c370c8fa4 commit de32c2e1a388b87d929bcd5bacf3189c370c8fa4 Author: Jamie Madill <jmadill@chromium.org> Date: Thu Apr 20 20:40:09 2017 dEQP: Remove shader compile/link timing output. This should reduce our log sizes significantly. Will upstream the fix to dEQP and then remove the custom workaround once we roll. BUG=chromium:713196 Change-Id: I0ddb678bf5256bd5f05782ca40d49501e84e4cae Reviewed-on: https://chromium-review.googlesource.com/483047 Reviewed-by: Geoff Lang <geofflang@chromium.org> Commit-Queue: Jamie Madill <jmadill@chromium.org> [add] https://crrev.com/de32c2e1a388b87d929bcd5bacf3189c370c8fa4/src/tests/deqp_support/qpTestLog.c [modify] https://crrev.com/de32c2e1a388b87d929bcd5bacf3189c370c8fa4/src/tests/deqp.gypi
,
Apr 20 2017
MemoryError happens when you load too much data into the recipe; this means that you're likely ingesting a 500+MB log (i.e. with raw_io.output or json.output) into the recipe_engine. There's not much I can do about this (besides adding safeguards to raw_io to truncate the log data to avoid a crash, which obviously has other downsides). I would recommend altering the test/test_runner to output less data to the files that recipe_engine is ingesting (it looks like that's what happened in #4?). Log data that goes to stdout/stderr doesn't have a size limit. Pawel is most familiar with this recipe, I think, so perhaps he can comment more.
,
Apr 20 2017
There are 12 shards, each one has ~15MB output.json. That's 180 MB, much less than 500MB. Also, I'm not sure the formula should be 12*15, as only 1/12th of each output.json is unique. Makes sense to combine the 12 into 1, which I estimate would be around 20MB.
,
Apr 21 2017
,
Apr 24 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/cda13a9e3a043b37ecc7f28a60333091a80bdf3e commit cda13a9e3a043b37ecc7f28a60333091a80bdf3e Author: jmadill <jmadill@chromium.org> Date: Mon Apr 24 16:42:09 2017 Roll ANGLE ba992ab..331f6db https://chromium.googlesource.com/angle/angle.git/+log/ba992ab..331f6db BUG= chromium:644033 ,chromium:713196 TBR=geofflang@chromium.org TEST=bots CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.win:win_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.android:android_optional_gpu_tests_rel Review-Url: https://codereview.chromium.org/2842433003 Cr-Commit-Position: refs/heads/master@{#466660} [modify] https://crrev.com/cda13a9e3a043b37ecc7f28a60333091a80bdf3e/DEPS
,
Apr 27 2017
I'd say we either need 64-bit python, or the amount of processed data needs to be greatly reduced. Let me know if/how can I best further help here.
,
Apr 27 2017
Nodir is working on rebuilding python from scratch. Since we only support 64 bits OSes now, it could be a good thing to settle for x64 python build.
,
Apr 27 2017
Is it possible to check what is the data size? As I mentioned in #6, I don't think it exceeds 180MB, and if the merge of JSONs from all shards is done correctly, it should be around 20MB. I think verifying that we handle the data efficiently has some advantages to utilizing more memory.
,
Apr 27 2017
Actually Nodir should not own *this* bug, but we could mark the python rebuild bug as a blocker here.
,
Apr 27 2017
i am building python only on linux and mac ( bug 714852 ). I don't build 64bit Python on Windows. It would be great to do, but it is outside of my scope.
,
May 2 2017
This bug hasn't been updated in 4 days and it's still in the trooper queue. maruel/nodir@: Who owns 64bit python on windows, then? phajdan: How can ynovikov@chromium.org check the size of stdout for this step?
,
May 2 2017
# 14 no one owns it. We never (?) had 64bit python on windows
,
May 2 2017
I filed a bug to do this and marked it as a blocker. I still recommend trimming down the JSON data and I'll leave this issue focused on the JSON data or whatever workaround that can be implemented by Angle team.
,
May 3 2017
Since this is blocked on a non-trooper issue, I'm removing the troopers label from it.
,
May 4 2017
I think I don't fully understand what you're trying to do in your recipe steps here, but doing anything where you're trying to send > 1 MB of JSON directly back through the recipe engine and recipe logs is almost certainly the wrong thing to do (i.e., this isn't what we designed things to do and we're probably using it wrong). So, rather than trying to work around things by getting a 64-bit python, we need to figure out if there's a different way to do what you're trying to do.
,
May 4 2017
,
May 4 2017
I'm not really familiar with the recipe myself, but I think what we are trying to send are the test results. angle_deqp_gles3_d3d11_tests suite has more than 40,000 tests. It's split into 12 shards, with each shard's output.json being ~15MB in size. I suspect it tries to upload 12*15=180MB of data, though most of every shard's output.json is duplicated in another shards - only the tests which ran in this shard are unique, so I think if all the shards are merged before uploading, the size should be much smaller, maybe around 20MB. Can someone who is more familiar with recipes confirm the above? Ken? Jamie?
,
May 4 2017
Here's one of the shards' output. The bulk of the output seems to be a "test_locations" dictionary that seems to include the line number of every test in the binary, not just those run in the shard. (Note that for example: functional_buffer_copy_single_buffer_transform_feedback_uniform shows up in the "all_tests" list and in the "test_locations" dictionary, but not in the "per_iteration_data" which seems to contain the results from just the tests that ran in this shard.) If the base/test/launcher harness either skipped writing the test_locations dictionary entirely, or filtered it so it only contained those tests that ran, that would save enough space that this wouldn't work. Pawel, could you look into adding such an option?
,
May 4 2017
,
May 4 2017
I want to look into this for a bit; I think there's a few things we're doing here that we should probably change. Also, +sergiyb for the test_locations issue, particularly since he claimed that using a trie for test names for this wasn't needed :).
,
May 5 2017
,
May 8 2017
,
Jun 9 2017
dpranke@ removing GPU categories and bumping priority to P2. Please triage as appropriate. Thanks.
,
Jun 19 2017
This is a bit of a bigger problem now. It seems that whenever a Windows ANGLE bot run has any kind of error, the logs seem to explode, and get large enough to trigger the Python exception. So, every failure becomes a purple exception, with the tests at the end being cut off, and the failing steps not being reported correctly. Example failures: https://build.chromium.org/p/tryserver.chromium.angle/builders/win_angle_rel_ng/builds/5331 Here one step has a few flaky timeouts, which is a problem we need to fix, but it seems to crash the recipe, preventing the tests at the end from running. Upgrading priority because this does impact our ability to land changes through CQ flakiness.
,
Jun 19 2017
Adding a few tags.. I know this bug isn't really an ANGLE bug but want to keep it visible.
,
Jun 20 2017
Seems to be failing on the ANGLE CQ for otherwise green CLs now as well, not sure what changed: https://build.chromium.org/p/tryserver.chromium.angle/builders/win_angle_rel_ng/builds/5365 Dirk would you or someone else have a chance to look at this? I'm concerned about the ability for our CQ to run on Windows (our most important platform). We could start removing tests to just get things going again.
,
Jun 20 2017
Yes, sorry for the long delay :( I'll get on this today.
,
Jun 21 2017
My guess is that your test suite has gotten big enough so that when we're splitting and merging data you're running out of memory regardless of whether something failed or not. Unfortunately, I haven't been able to reconstruct the context I had in my head when I made the comment in #23. I think there are a few things wrong: 1) as noted in comment #21, we're logging all of the tests we find, not just the ones in the shard. This is unnecessary and expensive. 2) we're re-rending the json output for reasons I don't fully understand (to try and prevent us from sending illegal utf8 data, I guess). 3) We read the entire json results file into memory to look at a single field (the valid field). Maybe we should switch things to just trust the error code, or maybe we need to split the summary fields out from the detail fields, or call out to a helper script to just extract the summary fields. @jmadill - I suggest you disable some swath of tests temporarily to get this green again. I'll bump this bug back to you to do so for now. I've filed bugs 735303 and 735306 and asked sergiyb@ and phajdan to look at the issues, respectively; we should be able to fix some of the underlying issues pretty quickly. An alternative is to switch to a 64-bit python, but I think the bottom line is that we're doing stupid, expensive things here and we need to stop doing them rather than paper over them. All of this excessive logging makes other parts of the build output unusable as a result, and switching to 64-bit won't fix that.
,
Jun 21 2017
Thanks Dirk. Geoff landed a CL to disable some tests in https://chromium-review.googlesource.com/c/541616/ .. can you re-assign to the appropriate owner?
,
Jun 21 2017
Let's leave this open for a while and see if things green up, and then revert (re-enable) the tests and see if it stays that way, and then we can close this.
,
Jun 22 2017
,
Jul 4 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/164eca3f4584601d870c2751229b0e9abf33eb15 commit 164eca3f4584601d870c2751229b0e9abf33eb15 Author: Geoff Lang <geofflang@chromium.org> Date: Tue Jul 04 15:52:17 2017 Re-enable the dEQP ES3 tests for AMD on the GPU FYI waterfall. TBR=kbr@chromium.org BUG=713196 Cq-Include-Trybots: master.tryserver.chromium.android:android_optional_gpu_tests_rel;master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel Change-Id: Ib0d3b1996325ebfa9342240b17393e4c2f4a84cd Reviewed-on: https://chromium-review.googlesource.com/558807 Commit-Queue: Geoff Lang <geofflang@chromium.org> Reviewed-by: Geoff Lang <geofflang@chromium.org> Cr-Commit-Position: refs/heads/master@{#484118} [modify] https://crrev.com/164eca3f4584601d870c2751229b0e9abf33eb15/content/test/gpu/generate_buildbot_json.py [modify] https://crrev.com/164eca3f4584601d870c2751229b0e9abf33eb15/testing/buildbot/chromium.gpu.fyi.json
,
Jul 5 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/2e23bfcc4750f3ffc7945cac8a01c9b9c61c4081 commit 2e23bfcc4750f3ffc7945cac8a01c9b9c61c4081 Author: Jamie Madill <jmadill@chromium.org> Date: Wed Jul 05 14:09:28 2017 Revert "Re-enable the dEQP ES3 tests for AMD on the GPU FYI waterfall." This reverts commit 164eca3f4584601d870c2751229b0e9abf33eb15. Reason for revert: Recipe engine bug still happening. https://build.chromium.org/p/tryserver.chromium.angle/builders/win_angle_rel_ng/builds/5598 BUG=713196 Original change's description: > Re-enable the dEQP ES3 tests for AMD on the GPU FYI waterfall. > > TBR=kbr@chromium.org > > BUG=713196 > > Cq-Include-Trybots: master.tryserver.chromium.android:android_optional_gpu_tests_rel;master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel > Change-Id: Ib0d3b1996325ebfa9342240b17393e4c2f4a84cd > Reviewed-on: https://chromium-review.googlesource.com/558807 > Commit-Queue: Geoff Lang <geofflang@chromium.org> > Reviewed-by: Geoff Lang <geofflang@chromium.org> > Cr-Commit-Position: refs/heads/master@{#484118} TBR=geofflang@chromium.org,kbr@chromium.org Change-Id: I4f4bfd177ea15d239bb2dcaca889cf65c56b1963 No-Presubmit: true No-Tree-Checks: true No-Try: true Bug: 713196 Cq-Include-Trybots: master.tryserver.chromium.android:android_optional_gpu_tests_rel;master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel Reviewed-on: https://chromium-review.googlesource.com/559449 Reviewed-by: Jamie Madill <jmadill@chromium.org> Commit-Queue: Jamie Madill <jmadill@chromium.org> Cr-Commit-Position: refs/heads/master@{#484263} [modify] https://crrev.com/2e23bfcc4750f3ffc7945cac8a01c9b9c61c4081/content/test/gpu/generate_buildbot_json.py [modify] https://crrev.com/2e23bfcc4750f3ffc7945cac8a01c9b9c61c4081/testing/buildbot/chromium.gpu.fyi.json
,
Jul 5 2017
,
Aug 30 2017
,
Sep 13 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/2f45957366026d96f5d5d2e2137b65058978d2b2 commit 2f45957366026d96f5d5d2e2137b65058978d2b2 Author: Jamie Madill <jmadill@chromium.org> Date: Wed Sep 13 01:14:41 2017 ANGLE: Re-enable dEQP GLES3 tests on AMD. Now that we have enough CQ capacity, put the bots back in operation. BUG=713196, 727437 NOTRY=true Cq-Include-Trybots: master.tryserver.chromium.android:android_optional_gpu_tests_rel;master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel Change-Id: I797460bac0313027f2f01031f1a49d223d4887df Reviewed-on: https://chromium-review.googlesource.com/663533 Commit-Queue: Kenneth Russell <kbr@chromium.org> Reviewed-by: Kenneth Russell <kbr@chromium.org> Reviewed-by: Jamie Madill <jmadill@chromium.org> Cr-Commit-Position: refs/heads/master@{#501497} [modify] https://crrev.com/2f45957366026d96f5d5d2e2137b65058978d2b2/content/test/gpu/generate_buildbot_json.py [modify] https://crrev.com/2f45957366026d96f5d5d2e2137b65058978d2b2/testing/buildbot/chromium.gpu.fyi.json
,
Oct 17 2017
,
Oct 26 2017
This started happening again today on the ANGLE CQ. I commented more in issue 735306, but any possible resolution would be welcome since it's blocking ANGLE jobs from landing currently.
,
Oct 26 2017
One note: all of the tests redundantly print the test's name while running the test. For example, from this shard: https://chromium-swarm.appspot.com/task?id=396f2a5486c4ff10&refresh=10&show_raw=1 which is shard #0 of: angle_deqp_gles3_d3d11_tests on ATI GPU on Windows (with patch) on Windows-2008ServerR2-SP1 from this build: https://ci.chromium.org/buildbot/tryserver.chromium.angle/win_angle_deqp_rel_ng/988 we can see this pattern throughout the logs: [ RUN ] dEQP_GLES3.Default/functional_state_query_boolean_scissor_test_getinteger64 dEQP-GLES3.functional.state_query.boolean.scissor_test_getinteger64 [ OK ] dEQP_GLES3.Default/functional_state_query_boolean_scissor_test_getinteger64 (66 ms) Just eliminating that useless print would probably reduce the log size by 25%. Jamie, do you think you could look into doing that as a short-term mitigation?
,
Oct 26 2017
Ken, it's not a useless print, it's the only way we have of correlating the test name with the real dEQP name and the expectations file. We use it for updating expectations. If this would get the bots working again we could do it, but I'd rather just disable some AMD tests before removing something we depend on for normal operation. I'll look at disabling the GLES3 dEQP tests in the interim.
,
Oct 26 2017
GLES >AMD< dEQP tests. Those are mostly redundant.
,
Oct 26 2017
Can infra people just tell us the location if the merging code? Reducing the size complexity seems a trivial task, if they don't have time to do it, we can do it ourselves.
,
Oct 26 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/4389564169d607678a653a3b5be327b58efdbe81 commit 4389564169d607678a653a3b5be327b58efdbe81 Author: Jamie Madill <jmadill@chromium.org> Date: Thu Oct 26 18:57:17 2017 Temporarily disable ANGLE dEQP GLES3 AMD tests. Between the WebGL 2 CTS and the NVIDIA GLES 3 tests, the AMD versions of the GLES3 dEQP tests are mostly redundant. Currentlty the bots are triggering a recipe engine bug due to very large log sizes, so disable these (mostly redundant) tests until we get a fix for the recipe. BUG=713196 TBR=kbr@chromium.org NOTRY=true Cq-Include-Trybots: master.tryserver.chromium.android:android_optional_gpu_tests_rel;master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel Change-Id: Id0dc064891fead9be4026fa1dd723cebbdf015ec Reviewed-on: https://chromium-review.googlesource.com/739468 Commit-Queue: Jamie Madill <jmadill@chromium.org> Reviewed-by: Jamie Madill <jmadill@chromium.org> Cr-Commit-Position: refs/heads/master@{#511901} [modify] https://crrev.com/4389564169d607678a653a3b5be327b58efdbe81/content/test/gpu/generate_buildbot_json.py [modify] https://crrev.com/4389564169d607678a653a3b5be327b58efdbe81/testing/buildbot/chromium.gpu.fyi.json
,
Oct 26 2017
Re-marking this as available, since the underlying bug is present and could trigger when more tests are (re-)added.
,
Oct 26 2017
Jamie, thanks, I understand regarding the printing of the test names. The same issue exists for the WebGL conformance tests.
,
Oct 26 2017
#45: the default gtest merge logic is https://codesearch.chromium.org/chromium/build/scripts/slave/recipe_modules/swarming/resources/standard_gtest_merge.py
,
Oct 27 2017
Actually, I'm not sure that's merging is the step causing the problem. I've got an idea - the step that fails is uploading the test results. That step is broken anyway because it tries to upload AMD results to the same file that NVIDIA results are uploaded to. Also, I've looked here: https://test-results.appspot.com/testfile?testtype=angle_deqp_gles3_d3d11_tests%20%28with%20patch%29 and the result files are around 15MB, not something that would cause running out of memory. Can we maybe disable uploading the test results? Meanwhile, someone should investigate the real problem of leaky memory, which I guess leaks from step to step until memory is exhausted. Maybe print used memory in the end of each step?
,
Nov 30 2017
,
Dec 1 2017
Issue 790689 has been merged into this issue.
,
Feb 5 2018
GPU triage: how should we proceed here? This looks important, but a P1 needs an owner.
,
Feb 12 2018
GPU triage: lowering priority
,
Oct 19
|
|||||||||||||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||||||||||||
Comment 1 by ynovikov@chromium.org
, Apr 19 2017