figure out how to stop logging gigantic JSON files in the build output |
|||||||||||||||
Issue descriptionAs seen in bug 713196, it looks like when we have step results that contain json output, we're actually logging the contents of the files, which is (of course) rather expensive if the test result files are big. We should figure out if there is a way to avoid logging the file. In addition, it looks like we are even re-encoding the file if we think the json is invalid utf-8. this seems perhaps overly paranoid, and we should see if we can figure out how to avoid doing this, i.e. from the description of that bug the stack trace looks like: File "E:\b\rr\tmps320xb\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\util.py", line 150, in raises yield File "E:\b\rr\tmps320xb\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\run.py", line 255, in run_step open_step = self._step_runner.open_step(step_config) File "E:\b\rr\tmps320xb\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\step_runner.py", line 176, in open_step step_config, recipe_test_api.DisabledTestData() File "E:\b\rr\tmps320xb\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\step_runner.py", line 525, in render_step new_cmd.extend(item.render(tdata)) File "E:\b\rr\tmps320xb\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_modules\raw_io\api.py", line 38, in render os.write(input_fd, self.encode(self.data)) File "E:\b\rr\tmps320xb\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_modules\raw_io\api.py", line 67, in encode return data.decode('utf-8', 'replace').encode('utf-8') Paweł, can you look at this ASAP and see if there are places where we can eliminate touches of the data and/or copies of it, in the hopes of avoiding some of the out-of-memory issues? If necessary, maybe we need to only log truncated parts of the data, or something? I've also filed bug 735303 to try and reduce the size of the JSON files, but that seems like a separate issue.
,
Jun 21 2017
,
Jul 31 2017
(please ping me or use @google.com for urgent issues) See related issue 686280 , issue 664352 , issue 689886 , https://codereview.chromium.org/2672593002 , https://codereview.chromium.org/2678343010 . It's not obvious how to proceed. Where applicable, specific steps could disable logging. We could even consider flipping the defaults.
,
Jul 31 2017
,
Aug 1 2017
I think there are two things we should specifically look at. First, using https://luci-milo.appspot.com/buildbot/tryserver.chromium.angle/win_angle_rel_ng/6012 as an example build, if you look at the test results files for the webgl2_conformance_tests, they are quite large: https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Ftryserver.chromium.angle%2Fwin_angle_rel_ng%2F6012%2F%2B%2Frecipes%2Fsteps%2Fwebgl2_conformance_d3d11_passthrough_tests_on_NVIDIA_GPU_on_Windows__with_patch__on_Windows-2008ServerR2-SP1%2F0%2Fstdout https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Ftryserver.chromium.angle%2Fwin_angle_rel_ng%2F6012%2F%2B%2Frecipes%2Fsteps%2Fwebgl2_conformance_d3d11_passthrough_tests_on_NVIDIA_GPU_on_Windows__with_patch__on_Windows-2008ServerR2-SP1%2F0%2Flogs%2Fjson.output%2F0 If you look at the log output from the build itself, you can see that the contents of the test_results file is also being logged to stdout, in STEP_LOG_LINES: https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Ftryserver.chromium.angle%2Fwin_angle_rel_ng%2F6012%2F%2B%2Frecipes%2Fstdout It seems like the STEP_LOG_LINE doesn't really add anything useful here, though IIUC it is how we get the "json.output" link on step 94 of the build. So, next step #1 is: is there some other way to handle this so that we can not write all of the STEP_LOG_LINES? Next step #2 is: is there some way we can avoid the raw_io encode()/decode() pass that caused the OutOfMemory exception? It looks like that code was probably added in the CL you linked to. These are probably two different things, so maybe we should split one of them off into a separate thing, though maybe if we fix the first we don't have to worry about the second.
,
Sep 8 2017
My change at https://chromium.googlesource.com/chromium/tools/build/+/1f48a3e6f509fa9e8647e04efcfffdf0fff2c35d%5E%21/#F0 tried to mitigate some of this issue in the swarming/api.py module see - https://cs.chromium.org/chromium/build/scripts/slave/recipe_modules/swarming/api.py?type=cs&q=filter_outdir&sq=package:chromium&l=29 I have a generic solution at https://codereview.chromium.org/2768883004/ I would actually like to rebase that CL and land it.
,
Sep 8 2017
Marking this for retriage.
,
Oct 25 2017
Robbie, please triage as i don't understand based on your reply here https://codereview.chromium.org/2768883004/ if it should be landed and/or abandoned.
,
Oct 26 2017
This error is popping up again on the ANGLE CQ: https://build.chromium.org/p/tryserver.chromium.angle/builders/win_angle_deqp_rel_ng Unsure what is making it trip the exception now. We might have just added one or two more tests than the (?) maximum. First failing build: https://build.chromium.org/p/tryserver.chromium.angle/builders/win_angle_deqp_rel_ng/builds/988 Every build is now failing with an exception: Traceback (most recent call last): File "E:\b\rr\tmpawvt0l\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\util.py", line 150, in raises yield File "E:\b\rr\tmpawvt0l\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\run.py", line 273, in run_step open_step = self._step_runner.open_step(step_config) File "E:\b\rr\tmpawvt0l\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\step_runner.py", line 178, in open_step step_config, recipe_test_api.DisabledTestData() File "E:\b\rr\tmpawvt0l\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\step_runner.py", line 601, in render_step new_cmd.extend(item.render(tdata)) File "E:\b\rr\tmpawvt0l\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_modules\raw_io\api.py", line 40, in render os.write(input_fd, self.encode(self.data)) File "E:\b\rr\tmpawvt0l\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_modules\raw_io\api.py", line 69, in encode return data.decode('utf-8', 'replace').encode('utf-8') MemoryError Leaving this classified as Pri-1, this is currently causing the ANGLE CQ to fail almost every time on this step. Any possible resolution would be welcome. Tagging this with ANGLE to keep it visible.
,
Oct 26 2017
,
Feb 26 2018
Ping on P1 issue that has been untouched for 4 months. Is the CL mentioned in comments #6/#8 useful for a resolution?
,
Aug 20
,
Aug 20
https://ci.chromium.org/buildbot/chromium.clang/CrWinAsan/1159 is purple because of this. All the tests pass, but we get https://logs.chromium.org/v/?s=chromium%2Fbb%2Fchromium.clang%2FCrWinAsan%2F1159%2F%2B%2Frecipes%2Fsteps%2FRecipe_engine_bug%2F0%2Flogs%2Fexception%2F0 Traceback (most recent call last): File "C:\b\rr\tmpqsv4gm\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\util.py", line 151, in raises yield File "C:\b\rr\tmpqsv4gm\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\run.py", line 226, in run_step open_step = self._step_runner.open_step(step_config) File "C:\b\rr\tmpqsv4gm\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\step_runner.py", line 178, in open_step step_config, recipe_test_api.DisabledTestData() File "C:\b\rr\tmpqsv4gm\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\step_runner.py", line 612, in render_step new_cmd.extend(item.render(tdata)) File "C:\b\rr\tmpqsv4gm\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_modules\raw_io\api.py", line 88, in render os.write(input_fd, self.encode(self.data)) File "C:\b\rr\tmpqsv4gm\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_modules\raw_io\api.py", line 118, in encode return data.decode('utf-8', 'replace').encode('utf-8') MemoryError A very similar bot (https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/win-asan) is completely green. What's the status here? Looks like this is affecting several builders.
,
Aug 20
,
Aug 20
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/recipes-py/+/53053ef702fea3816a322b4ecc697348e7002555 commit 53053ef702fea3816a322b4ecc697348e7002555 Author: Nico Weber <thakis@chromium.org> Date: Mon Aug 20 17:47:53 2018 Don't keep multiple copies of data in memory to sanitize utf-8. I locally made a 30MB file containing many lines that each contained 60 'a' characters, and put an invalid utf-8 character (0x80) in the middle of the file. Reading just the file made python use 36MB according to /usr/bin/time -r (mac) / /usr/bin/time -f %M (linux). Converting the data to sanitized utf-8 using open('out.txt', 'w').write(data.decode('utf-8', 'replace').encode('utf-8')) made the program take 160MB. Using the streaming approach from this CL made the program take 40MB while taking the same amount of time, so from 400% memory overhead to 10% overhead. So hopefully this will help with the bug below. Bug: chromium:735306 Change-Id: I39f5933e8b9ba22ea7c7d67a137ee45451d47e0c Reviewed-on: https://chromium-review.googlesource.com/1181246 Reviewed-by: Robbie Iannucci <iannucci@chromium.org> Reviewed-by: Stephen Martinis <martiniss@chromium.org> Commit-Queue: Nico Weber <thakis@chromium.org> [modify] https://crrev.com/53053ef702fea3816a322b4ecc697348e7002555/recipe_modules/raw_io/api.py
,
Aug 20
,
Aug 21
Do I need to do anything to deploy that recipes change everywhere? My fix landed 25h ago, but e.g. https://ci.chromium.org/buildbot/chromium.clang/ToTWin64%28dbg%29/1030 happened several hours after it and the stack still shows the pre-patch code in the backtrace.
,
Aug 21
I would've expected that to be deployed already. Also, per the comments on https://crrev.com/c/1181246 - do we know *which* step is trying to log this data, and whether there's something we should possibly be turning off (per iannucci's comment "Logs of JSON files are totally controllable (and can always be turned off),") .
,
Aug 21
Yes the latest recipes from build should already include that change. It landed in the build repo here: https://chromium.googlesource.com/chromium/tools/build/+/9266bb6a12842a4a4c0e8da0ac994e1bdb5509c0
,
Aug 21
Hm, weird. https://build.chromium.org/deprecated/chromium.clang/builders/ToTWin(dbg)/builds/1161/steps/update_scripts/logs/stdio says: build (Elapsed: 0:00:02) ---------------------------------------- [0:00:01] Started. _____ build at refs/remotes/origin/master ________ running 'git -c core.deltaBaseCacheLimit=512m fetch origin --prune --verbose' in 'C:\b\build' [0:00:02] From https://chromium.googlesource.com/a/chromium/tools/build [0:00:02] = [up to date] infra/config -> origin/infra/config [0:00:02] = [up to date] master -> origin/master [0:00:02] = [up to date] recipe-package-canary -> origin/recipe-package-canary [0:00:03] Fetching origin ________ running 'git reset --hard HEAD' in 'C:\b\build' [0:00:03] HEAD is now at be05c2358b Use the 'generated' named cache in CS gen recipe. According to https://chromium.googlesource.com/chromium/tools/build/+log , that's past the revision you linked to. Yet, at that build: https://logs.chromium.org/v/?s=chromium%2Fbb%2Fchromium.clang%2FToTWin_dbg_%2F1161%2F%2B%2Frecipes%2Fsteps%2FRecipe_engine_bug%2F0%2Flogs%2Fexception%2F0 File "C:\b\rr\tmphvmek0\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_modules\raw_io\api.py", line 198, in decode else result.decode('utf-8', 'replace').encode('utf-8')) ...oh, I failed to update the code around line 198, I only touched line 118. CL coming up... (Re: "do we know *which* step...": I don't; is there some easy way to tell?)
,
Aug 21
Actually, this one is a bit harder to fix since OutputPlaceholder.result() currently also returns the string data, which means the streaming approach I used for InputPlaceholder doesn't just work. I'll look at it a bit more, but maybe someone who actually knows all this recipes stuff might have to take over.
,
Aug 21
It sure looks like the failing step in question is viz_browser_tests. The way that output placeholder works is that it returns a string (the whole file). It's not meant for huge files. Do we know how large of a gtest result viz_browser_tests is producing?
,
Aug 21
The stack trace indicates that this is the placeholder which is running the engine out of memory https://cs.chromium.org/chromium/build/scripts/slave/recipe_modules/test_utils/api.py?q=%22test_utils/api.py%22&sq=package:chromium&dr&l=24
,
Aug 21
Here's the exception on a run without any test failures: https://ci.chromium.org/buildbot/chromium.clang/ToTWin%28dll%29/1144 (But that was before my fix above; maybe now the bot is only purple if something fails? If so, that'd be a huge progression already.) Looks like the code you linked to was added in 2015 in https://chromium.googlesource.com/chromium/tools/build/+/9ef8797bdd687558b57d22f467ac8f3c56510dc9 The output can be seen here (I think?) https://logs.chromium.org/v/?s=chromium%2Fbb%2Fchromium.clang%2FToTWin_dll_%2F1147%2F%2B%2Frecipes%2Fsteps%2Ftest_pre_run%2F0%2Fsteps%2Fs__trigger__viz_browser_tests%2F0%2Flogs%2Fjson.output%2F0 As far as I can tell it's around 6kB, so not all that large. Or I'm looking at the wrong json output.
,
Aug 21
Yeah, 6kb should be fine; however that's a successful run. Is it possible that the test produces tons of output when it fails?
,
Aug 21
I'm also working on a change to drop logs from memory after they're emitted to logdog/buildbot. Currently they're kept in memory for the length of the recipe run, which could cause similar issues (if there are many large files).
,
Aug 21
the browser_test results and webkit_layout_tests results are just gonna be large (10's of MB of data), because we have 10's of thousands of tests. Even the test names by themselves end up being large. We need a way for that data to not be read in or touched by the interpreter more times than strictly needed.
,
Aug 21
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/recipes-py/+/2310fc75c6b6cd672558934d92f04e6ccd449140 commit 2310fc75c6b6cd672558934d92f04e6ccd449140 Author: Nico Weber <thakis@chromium.org> Date: Tue Aug 21 20:48:15 2018 Keep fewer copies of data to sanitize utf-8 in OutputPlaceholders too. This does for OutputPlaceholders what https://chromium-review.googlesource.com/c/1181246/ did for InputPlaceholders. In the other CL, we had a string that was maybe utf-8 and wanted to write it to a file. Here we have a file that maybe contains utf-8 and want to convert it to a string. We can read the file in chunks so we don't have to keep the whole contents in memory, but we then need to combine all the converted strings to one full string. That requires one copy, so this only brings us from 400% memory overhead to around 100% overhead. Still much better, but not as much better as the other change. With a larger change in recipes (e.g. by making result() return a callable instead of a string) larger savings are possible, but let's go with the smaller patch for now. Bug: chromium:735306 Change-Id: I8eaac86d2526119c28a61dd36cf244b32a30950c Reviewed-on: https://chromium-review.googlesource.com/1184059 Reviewed-by: Robbie Iannucci <iannucci@chromium.org> Commit-Queue: Nico Weber <thakis@chromium.org> [modify] https://crrev.com/2310fc75c6b6cd672558934d92f04e6ccd449140/recipe_modules/raw_io/api.py
,
Aug 22
Our bots are now no longer purple (but red :-D -- but for unrelated reasons), so for us this is fixed after the two changes I made.
,
Aug 22
Thanks Nico! Should help the ANGLE bots as well I assume.
,
Aug 22
Yes, probably.
,
Aug 27
>> Looks like the code you linked to was added in 2015 in https://chromium.googlesource.com/chromium/tools/build/+/9ef8797bdd687558b57d22f467ac8f3c56510dc9 It was added earlier in https://chromium.googlesource.com/chromium/tools/build/+/909cb8303196a9db4013b4168fa088ba2872a0af by phajdan.jr@chromium.org. I just moved it to a different module in that commit. Neither me nor Pawel are working on ChOps team anymore though...
,
Oct 5
The following revision refers to this bug: https://chrome-internal.googlesource.com/chrome/tools/build_limited/scripts/slave/+/ae7409a088bda7db0ce99bea46fa9e3a66dab5c1 commit ae7409a088bda7db0ce99bea46fa9e3a66dab5c1 Author: Robert Iannucci <iannucci@chromium.org> Date: Fri Oct 05 17:41:46 2018
,
Oct 5
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/recipes-py/+/acbe9e3142610d958181460254565ef27ad47dc0 commit acbe9e3142610d958181460254565ef27ad47dc0 Author: Robert Iannucci <iannucci@chromium.org> Date: Fri Oct 05 17:59:10 2018 [StepPresentation] Remove all logs from memory during finalization. Previously the recipe engine kept all log content in memory for the duration of the recipe execution. No recipes have been found that actually need to read this data (which can't read it in some alternate mechanism). Since the logs can be quite long, this should reduce the memory consumption of most recipe executions, sometimes by a lot. Because the logs are no longer kept in memory, the recipe engine will now pop an assertion if a recipe attempts to read/write a log after the step has been finalized. R=martiniss@chromium.org, nodir@chromium.org Bug: 833539 ,735306 Change-Id: I8450b473eeaaafcdc0a87c7c542f961848327050 Reviewed-on: https://chromium-review.googlesource.com/c/1171206 Commit-Queue: Robbie Iannucci <iannucci@chromium.org> Reviewed-by: Stephen Martinis <martiniss@chromium.org> Reviewed-by: Nodir Turakulov <nodir@chromium.org> [modify] https://crrev.com/acbe9e3142610d958181460254565ef27ad47dc0/recipe_engine/recipe_test_api.py [modify] https://crrev.com/acbe9e3142610d958181460254565ef27ad47dc0/recipe_engine/types.py
,
Oct 18
,
Oct 18
,
Dec 14
[GPU Triage Council] |
|||||||||||||||
►
Sign in to add a comment |
|||||||||||||||
Comment 1 by dpranke@chromium.org
, Jun 21 2017