New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 735306 link

Starred by 3 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocking:
issue 713196



Sign in to add a comment

figure out how to stop logging gigantic JSON files in the build output

Project Member Reported by dpranke@chromium.org, Jun 21 2017

Issue description

As seen in bug 713196, it looks like when we have step results that contain json output, we're actually logging the contents of the files, which is (of course) rather expensive if the test result files are big. We should figure out if there is a way to avoid logging the file.

In addition, it looks like we are even re-encoding the file if we think the json is invalid utf-8. this seems perhaps overly paranoid, and we should see if we can figure out how to avoid doing this, i.e. from the description of that bug the stack trace looks like:

 File "E:\b\rr\tmps320xb\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\util.py", line 150, in raises
    yield
  File "E:\b\rr\tmps320xb\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\run.py", line 255, in run_step
    open_step = self._step_runner.open_step(step_config)
  File "E:\b\rr\tmps320xb\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\step_runner.py", line 176, in open_step
    step_config, recipe_test_api.DisabledTestData()
  File "E:\b\rr\tmps320xb\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\step_runner.py", line 525, in render_step
    new_cmd.extend(item.render(tdata))
  File "E:\b\rr\tmps320xb\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_modules\raw_io\api.py", line 38, in render
    os.write(input_fd, self.encode(self.data))
  File "E:\b\rr\tmps320xb\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_modules\raw_io\api.py", line 67, in encode
    return data.decode('utf-8', 'replace').encode('utf-8')

Paweł, can you look at this ASAP and see if there are places where we can eliminate touches of the data and/or copies of it, in the hopes of avoiding some of the out-of-memory issues? If necessary, maybe we need to only log truncated parts of the data, or something?

I've also filed  bug 735303  to try and reduce the size of the JSON files, but that seems like a separate issue.
 
Components: Infra>Platform>Recipes
Blocking: 713196
Cc: phajdan@google.com no...@chromium.org martiniss@chromium.org
(please ping me or use @google.com for urgent issues)

See related  issue 686280  , issue 664352 , issue 689886 , https://codereview.chromium.org/2672593002 , https://codereview.chromium.org/2678343010 .

It's not obvious how to proceed.

Where applicable, specific steps could disable logging. We could even consider flipping the defaults.
Cc: -serg...@chromium.org
I think there are two things we should specifically look at.

First, using https://luci-milo.appspot.com/buildbot/tryserver.chromium.angle/win_angle_rel_ng/6012 as an example build, 

if you look at the test results files for the webgl2_conformance_tests, they are quite large:

https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Ftryserver.chromium.angle%2Fwin_angle_rel_ng%2F6012%2F%2B%2Frecipes%2Fsteps%2Fwebgl2_conformance_d3d11_passthrough_tests_on_NVIDIA_GPU_on_Windows__with_patch__on_Windows-2008ServerR2-SP1%2F0%2Fstdout

https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Ftryserver.chromium.angle%2Fwin_angle_rel_ng%2F6012%2F%2B%2Frecipes%2Fsteps%2Fwebgl2_conformance_d3d11_passthrough_tests_on_NVIDIA_GPU_on_Windows__with_patch__on_Windows-2008ServerR2-SP1%2F0%2Flogs%2Fjson.output%2F0

If you look at the log output from the build itself, you can see that the contents of the test_results file is also being logged to stdout, in STEP_LOG_LINES:

https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Ftryserver.chromium.angle%2Fwin_angle_rel_ng%2F6012%2F%2B%2Frecipes%2Fstdout

It seems like the STEP_LOG_LINE doesn't really add anything useful here, though IIUC it is how we get the "json.output" link on step 94 of the build. 

So, next step #1 is: is there some other way to handle this so that we can not write all of the STEP_LOG_LINES?

Next step #2 is: is there some way we can avoid the raw_io encode()/decode() pass that caused the OutOfMemory exception? It looks like that code was probably added in the CL you linked to.

These are probably two different things, so maybe we should split one of them off into a separate thing, though maybe if we fix the first we don't have to worry about the second.
Cc: tansell@chromium.org
My change at https://chromium.googlesource.com/chromium/tools/build/+/1f48a3e6f509fa9e8647e04efcfffdf0fff2c35d%5E%21/#F0 tried to mitigate some of this issue in the swarming/api.py module see - https://cs.chromium.org/chromium/build/scripts/slave/recipe_modules/swarming/api.py?type=cs&q=filter_outdir&sq=package:chromium&l=29

I have a generic solution at https://codereview.chromium.org/2768883004/

I would actually like to rebase that CL and land it.
Cc: phajdan.jr@chromium.org
Owner: ----
Status: Untriaged (was: Assigned)
Marking this for retriage.
Owner: iannucci@chromium.org
Status: Assigned (was: Untriaged)
Robbie, please triage as i don't understand based on your reply here https://codereview.chromium.org/2768883004/ if it should be landed and/or abandoned.
Components: Internals>GPU>ANGLE
Labels: Hotlist-PixelWrangler
This error is popping up again on the ANGLE CQ:

https://build.chromium.org/p/tryserver.chromium.angle/builders/win_angle_deqp_rel_ng

Unsure what is making it trip the exception now. We might have just added one or two more tests than the (?) maximum.

First failing build:

https://build.chromium.org/p/tryserver.chromium.angle/builders/win_angle_deqp_rel_ng/builds/988

Every build is now failing with an exception:

Traceback (most recent call last):
  File "E:\b\rr\tmpawvt0l\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\util.py", line 150, in raises
    yield
  File "E:\b\rr\tmpawvt0l\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\run.py", line 273, in run_step
    open_step = self._step_runner.open_step(step_config)
  File "E:\b\rr\tmpawvt0l\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\step_runner.py", line 178, in open_step
    step_config, recipe_test_api.DisabledTestData()
  File "E:\b\rr\tmpawvt0l\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\step_runner.py", line 601, in render_step
    new_cmd.extend(item.render(tdata))
  File "E:\b\rr\tmpawvt0l\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_modules\raw_io\api.py", line 40, in render
    os.write(input_fd, self.encode(self.data))
  File "E:\b\rr\tmpawvt0l\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_modules\raw_io\api.py", line 69, in encode
    return data.decode('utf-8', 'replace').encode('utf-8')
MemoryError

Leaving this classified as Pri-1, this is currently causing the ANGLE CQ to fail almost every time on this step. Any possible resolution would be welcome. Tagging this with ANGLE to keep it visible.

Comment 10 by kbr@chromium.org, Oct 26 2017

Cc: kbr@chromium.org

Comment 11 by piman@chromium.org, Feb 26 2018

Ping on P1 issue that has been untouched for 4 months. Is the CL mentioned in comments #6/#8 useful for a resolution?
Cc: jbudorick@chromium.org
 Issue 833539  has been merged into this issue.
https://ci.chromium.org/buildbot/chromium.clang/CrWinAsan/1159 is purple because of this. All the tests pass, but we get https://logs.chromium.org/v/?s=chromium%2Fbb%2Fchromium.clang%2FCrWinAsan%2F1159%2F%2B%2Frecipes%2Fsteps%2FRecipe_engine_bug%2F0%2Flogs%2Fexception%2F0

Traceback (most recent call last):
  File "C:\b\rr\tmpqsv4gm\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\util.py", line 151, in raises
    yield
  File "C:\b\rr\tmpqsv4gm\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\run.py", line 226, in run_step
    open_step = self._step_runner.open_step(step_config)
  File "C:\b\rr\tmpqsv4gm\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\step_runner.py", line 178, in open_step
    step_config, recipe_test_api.DisabledTestData()
  File "C:\b\rr\tmpqsv4gm\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_engine\step_runner.py", line 612, in render_step
    new_cmd.extend(item.render(tdata))
  File "C:\b\rr\tmpqsv4gm\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_modules\raw_io\api.py", line 88, in render
    os.write(input_fd, self.encode(self.data))
  File "C:\b\rr\tmpqsv4gm\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_modules\raw_io\api.py", line 118, in encode
    return data.decode('utf-8', 'replace').encode('utf-8')
MemoryError


A very similar bot (https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/win-asan) is completely green.

What's the status here? Looks like this is affecting several builders.
Project Member

Comment 15 by bugdroid1@chromium.org, Aug 20

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/recipes-py/+/53053ef702fea3816a322b4ecc697348e7002555

commit 53053ef702fea3816a322b4ecc697348e7002555
Author: Nico Weber <thakis@chromium.org>
Date: Mon Aug 20 17:47:53 2018

Don't keep multiple copies of data in memory to sanitize utf-8.

I locally made a 30MB file containing many lines that each contained 60 'a'
characters, and put an invalid utf-8 character (0x80) in the middle of the file.
Reading just the file made python use 36MB according to /usr/bin/time -r (mac)
/ /usr/bin/time -f %M (linux).

Converting the data to sanitized utf-8 using
  open('out.txt', 'w').write(data.decode('utf-8', 'replace').encode('utf-8'))
made the program take 160MB.

Using the streaming approach from this CL made the program take 40MB while
taking the same amount of time, so from 400% memory overhead to 10% overhead.

So hopefully this will help with the bug below.

Bug: chromium:735306
Change-Id: I39f5933e8b9ba22ea7c7d67a137ee45451d47e0c
Reviewed-on: https://chromium-review.googlesource.com/1181246
Reviewed-by: Robbie Iannucci <iannucci@chromium.org>
Reviewed-by: Stephen Martinis <martiniss@chromium.org>
Commit-Queue: Nico Weber <thakis@chromium.org>

[modify] https://crrev.com/53053ef702fea3816a322b4ecc697348e7002555/recipe_modules/raw_io/api.py

Cc: -no...@chromium.org
Do I need to do anything to deploy that recipes change everywhere? My fix landed 25h ago, but e.g. https://ci.chromium.org/buildbot/chromium.clang/ToTWin64%28dbg%29/1030 happened several hours after it and the stack still shows the pre-patch code in the backtrace.
I would've expected that to be deployed already. 

Also, per the comments on https://crrev.com/c/1181246 - do we know *which* step is trying to log this data, and whether there's something we should possibly be turning off (per iannucci's comment "Logs of JSON files are totally controllable (and can always be turned off),") .
Yes the latest recipes from build should already include that change. It landed in the build repo here: https://chromium.googlesource.com/chromium/tools/build/+/9266bb6a12842a4a4c0e8da0ac994e1bdb5509c0
Hm, weird.

https://build.chromium.org/deprecated/chromium.clang/builders/ToTWin(dbg)/builds/1161/steps/update_scripts/logs/stdio says:

build (Elapsed: 0:00:02)
----------------------------------------
[0:00:01] Started.
_____ build at refs/remotes/origin/master

________ running 'git -c core.deltaBaseCacheLimit=512m fetch origin --prune --verbose' in 'C:\b\build'
[0:00:02] From https://chromium.googlesource.com/a/chromium/tools/build
[0:00:02]  = [up to date]            infra/config          -> origin/infra/config
[0:00:02]  = [up to date]            master                -> origin/master
[0:00:02]  = [up to date]            recipe-package-canary -> origin/recipe-package-canary
[0:00:03] Fetching origin

________ running 'git reset --hard HEAD' in 'C:\b\build'
[0:00:03] HEAD is now at be05c2358b Use the 'generated' named cache in CS gen recipe.


According to https://chromium.googlesource.com/chromium/tools/build/+log , that's past the revision you linked to. Yet, at that build:

https://logs.chromium.org/v/?s=chromium%2Fbb%2Fchromium.clang%2FToTWin_dbg_%2F1161%2F%2B%2Frecipes%2Fsteps%2FRecipe_engine_bug%2F0%2Flogs%2Fexception%2F0

  File "C:\b\rr\tmphvmek0\rw\checkout\scripts\slave\.recipe_deps\recipe_engine\recipe_modules\raw_io\api.py", line 198, in decode
    else result.decode('utf-8', 'replace').encode('utf-8'))


...oh, I failed to update the code around line 198, I only touched line 118. CL coming up...


(Re: "do we know *which* step...": I don't; is there some easy way to tell?)
Actually, this one is a bit harder to fix since OutputPlaceholder.result() currently also returns the string data, which means the streaming approach I used for InputPlaceholder doesn't just work. I'll look at it a bit more, but maybe someone who actually knows all this recipes stuff might have to take over.
It sure looks like the failing step in question is viz_browser_tests.

The way that output placeholder works is that it returns a string (the whole file). It's not meant for huge files. Do we know how large of a gtest result viz_browser_tests is producing?
The stack trace indicates that this is the placeholder which is running the engine out of memory https://cs.chromium.org/chromium/build/scripts/slave/recipe_modules/test_utils/api.py?q=%22test_utils/api.py%22&sq=package:chromium&dr&l=24
Cc: serg...@chromium.org
Here's the exception on a run without any test failures: https://ci.chromium.org/buildbot/chromium.clang/ToTWin%28dll%29/1144

(But that was before my fix above; maybe now the bot is only purple if something fails? If so, that'd be a huge progression already.)

Looks like the code you linked to was added in 2015 in https://chromium.googlesource.com/chromium/tools/build/+/9ef8797bdd687558b57d22f467ac8f3c56510dc9

The output can be seen here (I think?) https://logs.chromium.org/v/?s=chromium%2Fbb%2Fchromium.clang%2FToTWin_dll_%2F1147%2F%2B%2Frecipes%2Fsteps%2Ftest_pre_run%2F0%2Fsteps%2Fs__trigger__viz_browser_tests%2F0%2Flogs%2Fjson.output%2F0 As far as I can tell it's around 6kB, so not all that large. Or I'm looking at the wrong json output.
Yeah, 6kb should be fine; however that's a successful run. Is it possible that the test produces tons of output when it fails?
I'm also working on a change to drop logs from memory after they're emitted to logdog/buildbot. Currently they're kept in memory for the length of the recipe run, which could cause similar issues (if there are many large files).
the browser_test results and webkit_layout_tests results are just gonna be large (10's of MB of data), because we have 10's of thousands of tests. Even the test names by themselves end up being large. 

We need a way for that data to not be read in or touched by the interpreter more times than strictly needed.
Project Member

Comment 28 by bugdroid1@chromium.org, Aug 21

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/recipes-py/+/2310fc75c6b6cd672558934d92f04e6ccd449140

commit 2310fc75c6b6cd672558934d92f04e6ccd449140
Author: Nico Weber <thakis@chromium.org>
Date: Tue Aug 21 20:48:15 2018

Keep fewer copies of data to sanitize utf-8 in OutputPlaceholders too.

This does for OutputPlaceholders what https://chromium-review.googlesource.com/c/1181246/
did for InputPlaceholders. In the other CL, we had a string that was maybe utf-8
and wanted to write it to a file.

Here we have a file that maybe contains utf-8 and want to convert it to a string.

We can read the file in chunks so we don't have to keep the whole contents
in memory, but we then need to combine all the converted strings to one full
string. That requires one copy, so this only brings us from 400% memory overhead
to around 100% overhead. Still much better, but not as much better as the
other change.

With a larger change in recipes (e.g. by making result() return a callable
instead of a string) larger savings are possible, but let's go with the
smaller patch for now.

Bug: chromium:735306
Change-Id: I8eaac86d2526119c28a61dd36cf244b32a30950c
Reviewed-on: https://chromium-review.googlesource.com/1184059
Reviewed-by: Robbie Iannucci <iannucci@chromium.org>
Commit-Queue: Nico Weber <thakis@chromium.org>

[modify] https://crrev.com/2310fc75c6b6cd672558934d92f04e6ccd449140/recipe_modules/raw_io/api.py

Our bots are now no longer purple (but red :-D -- but for unrelated reasons), so for us this is fixed after the two changes I made.
Thanks Nico! Should help the ANGLE bots as well I assume.
Yes, probably.
Cc: -serg...@chromium.org -phajdan.jr@chromium.org -phajdan@google.com
>> Looks like the code you linked to was added in 2015 in https://chromium.googlesource.com/chromium/tools/build/+/9ef8797bdd687558b57d22f467ac8f3c56510dc9

It was added earlier in https://chromium.googlesource.com/chromium/tools/build/+/909cb8303196a9db4013b4168fa088ba2872a0af by phajdan.jr@chromium.org. I just moved it to a different module in that commit. Neither me nor Pawel are working on ChOps team anymore though...
Project Member

Comment 33 by bugdroid1@chromium.org, Oct 5

Project Member

Comment 34 by bugdroid1@chromium.org, Oct 5

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/recipes-py/+/acbe9e3142610d958181460254565ef27ad47dc0

commit acbe9e3142610d958181460254565ef27ad47dc0
Author: Robert Iannucci <iannucci@chromium.org>
Date: Fri Oct 05 17:59:10 2018

[StepPresentation] Remove all logs from memory during finalization.

Previously the recipe engine kept all log content in memory for
the duration of the recipe execution. No recipes have been found
that actually need to read this data (which can't read it in some
alternate mechanism). Since the logs can be quite long, this should
reduce the memory consumption of most recipe executions, sometimes
by a lot.

Because the logs are no longer kept in memory, the recipe engine
will now pop an assertion if a recipe attempts to read/write a log
after the step has been finalized.

R=martiniss@chromium.org, nodir@chromium.org

Bug:  833539 ,735306
Change-Id: I8450b473eeaaafcdc0a87c7c542f961848327050
Reviewed-on: https://chromium-review.googlesource.com/c/1171206
Commit-Queue: Robbie Iannucci <iannucci@chromium.org>
Reviewed-by: Stephen Martinis <martiniss@chromium.org>
Reviewed-by: Nodir Turakulov <nodir@chromium.org>

[modify] https://crrev.com/acbe9e3142610d958181460254565ef27ad47dc0/recipe_engine/recipe_test_api.py
[modify] https://crrev.com/acbe9e3142610d958181460254565ef27ad47dc0/recipe_engine/types.py

Owner: iannu...@google.com
Cc: -iannucci@chromium.org iannu...@google.com
Labels: -Pri-1 Pri-2
[GPU Triage Council]

Sign in to add a comment