New issue
Advanced search Search tips

Issue 665159 link

Starred by 2 users

Issue metadata

Status: Available
Merged: issue 660713
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug



Sign in to add a comment

Oversized output.json causes infra failure

Project Member Reported by katthomas@chromium.org, Nov 14 2016

Issue description

Build: https://luci-milo.appspot.com/buildbot/tryserver.chromium.linux/linux_chromium_tsan_rel_ng/328

Step 32 has a missing shard because output.json is too big, causing a purple infra failure. 

Doesn't it make sense for this step to be red? If the test launcher is barfing this much output, it's likely due to an issue with the patch, right?


 
@maruel can you comment on the questions raised in #1?
Ping, m-a?
Oh sorry I thought I had commented.

The short answer is "Yes". But I don't plan to work on this.
Labels: Infra-Failures
Cc: mar...@chromium.org
Owner: ----
Removing M-A as owner, since he clearly says he isn't working on it ;)
Cc: djd@chromium.org mcgreevy@chromium.org
+dave and micheal, in case they want to pick up something else swarming related?

We are looking for an owner on this, as it is a P1 and contributing to issues with our infra failure rates.
Cc: phajdan.jr@chromium.org iannucci@chromium.org
 Issue 660713  has been merged into this issue.
@iannucci gave some suggestions in  crbug.com/660713 

Comment 9 by iannu...@google.com, Dec 16 2016

Mergedinto: 660713
Status: Duplicate (was: Assigned)
dedup'ing
Status: Available (was: Duplicate)
Un-de-duping ... it's not clear to me that this is particularly a duplicate of 660713; I think whatever is going on here is probably in important ways different from what is going on there.
Owner: mcgreevy@chromium.org
I've taken a look at this today.  Here's what I've found:


collect_gtest_task refuses to load output.json files larger than 100MB :

  https://cs.chromium.org/chromium/build/scripts/slave/recipe_modules/swarming/resources/collect_gtest_task.py?rcl=0&l=126


The output.json in the build which is linked from this bug was 127MB.  The bulk of this came from "output_snippet" fields (total of 54MB) and "output_snippets_base64" (which is simply a base64 encoding of output_snippet; total of 72MB).  There are 1669 individual tests, which means that the average output_snippet is 33KB, and the average output_snippet_base64 is 43KB.  collect_gtest_task refuses to load this file as 127MB > 100MB.


There is code here, which is presumably run before rendering the output.json data --

  https://cs.chromium.org/chromium/src/base/test/launcher/test_results_tracker.cc?rcl=0&l=355 

-- which seems to truncate output_snippet strings if they are > 300KB. However, I'm not sure where that code is called from, and in any event it would not have any effect here (the max output_snippet in the output.json was 101KB, below the 300KB threshold).  (Aside: it looks like this trimming, were it to happen, would be reflected in the output_snippet_base64 data too.)


The output.json is actually rendered here --

  https://cs.chromium.org/chromium/src/base/test/launcher/test_results_tracker.cc?rcl=0&l=304

-- and no checking is done on its total length at this point.


Now, it would be nice to simply report instances of very large output.json as test failures, rather than infrastructure failures, but 

(a) It's possible that we could have infrastructure failure that generate >100MB output.json files, so I don't think we can just assume that a large output.json indicates test failures.

(b) IIUC the code which actually checks the tests statuses uses merged output.json data produced by collect_gtest_task, so if collect_gtest_task refuses to merge the data, we can't tell whether an error was a test or infra failure.


I propose: trimming the output snippets in SaveSummaryAsJson, to ensure that output.json is <100MB and the downstream infrastructure can load this file and consequently report the test failures in the usual way. It might be enough just to replace output_snippets with a string like "lengthy output elided" once we pass 50MB (arbitrarily chosen threshold) of test output.  I think there's precedent for not including all output_snippet data in the output.json, as the test_results_tracker code linked above truncates snippets over 300KB. 


Have I missed something? Is there anything which assumes that output_snippet data is not elided?

Does anyone have an alternative proposal?


Incidentally, although collect_gtest_task place a limit of 100MB on each output.json shard that it loads, there does not seem to be any limit on the size of the merged output that it writes.  So this could be very large in the case of many shards.
Status: Assigned (was: Available)
There is an owner on this bug, but the status was not "Assigned" or "Started". Fixing. If you do not own this bug, please remove yourself as the owner and make the status "Available".

Comment 13 by djd@chromium.org, Dec 20 2016

@mcgreevy: truncating output_snippet seems like the best approach in the absence of any reason for keeping it all.

It seems really odd that we're sending both plain+base64 copies of this data too. Since this is all happening in the JSON, which is already going to be encoding non-ASCII data, is there a good reason for having both?
Likely historical reasons. I wouldn't be surprised it started with plain text, then realized that the output is not necessarily UTF-8 encoded then the base64 was added as a stop gap, but the unencoded form was never removed.

Worth looking at https://chromium.googlesource.com/chromium/src/+/master/base/test/launcher

I recommend removing the unencoded form, that's a low hanging fruit.

Comment 15 by djd@chromium.org, Dec 20 2016

M-A, do you see any potential problems with mcgreevy's other suggestion (truncating too-long output_snippets)? It would be good to clean that up too.
Totally fine with that too. It'd recommend middle-out compression. :D Trim the middle lines of the logs, no the trailing lines.
I didn't state my proposal very clearly in comment 11. I was not suggesting truncating individual output_snippet lines, because figuring out how severely to truncate them would require knowledge of the total output size (so, probably doing two passes), unless we want to be super aggressive in truncating.

What I actually suggest is something simpler: render the output_snippets in full (the 300KB limit @ test_results_tracker.cc:355 notwithstanding) until the output size reaches 50MB.  Then stop rendering output_snippets entirely, instead rendering some short text like "lengthy output elided".

So, the output from the firsts tests would be included in full, and that of later tests would be completely dropped, but only in cases where the total output is >50MB.

This would be simple to code, efficient, and would not affect the test output except in cases where it really is excessive.
Re: output_snippet vs output_snippet_base64: the claimed reason for including the base64 version is:

// Also include the raw version (base64-encoded so that it can be safely
// JSON-serialized - there are no guarantees about character encoding
// of the snippet). This can be very useful piece of information when
// debugging a test failure related to character encoding.

(note: when the snippet contains non-utf8-encoded data, the output_snippet value is set to "<non-UTF-8 snippet, see output_snippet_base64>")

My guess was that the output_snippet was retained for humans to read, but I don't know if people do look at it or not. I also wouldn't be surprised if some code uses output_snippet instead of output_snippet_base64, just so they don't have to decode it, without realizing that the output_snippet is not reliable.  I am concerned that removing it wholesale would break a bunch of things which depend on it, and I don't know how I'd check what those dependencies are.
Project Member

Comment 19 by bugdroid1@chromium.org, Dec 21 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build.git/+/df3cf941f58d305637209df35f378401e7bd8a4a

commit df3cf941f58d305637209df35f378401e7bd8a4a
Author: Michael McGreevy <mcgreevy@chromium.org>
Date: Tue Dec 20 23:47:52 2016

Distinguish between missing vs large gtest output.

collect_gtest_task outputs a STEP_WARNING when output.json files fail
to load, including cases where it refuses to load them because they are
too large. This change outputs a separate error message for the
too-large case.

BUG=665159

Change-Id: I7e2852df74213d3924747e54cf53f475ad3f8bf5
Reviewed-on: https://chromium-review.googlesource.com/422687
Commit-Queue: Michael McGreevy <mcgreevy@chromium.org>
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/df3cf941f58d305637209df35f378401e7bd8a4a/scripts/slave/recipe_modules/swarming/resources/collect_gtest_task.py
[modify] https://crrev.com/df3cf941f58d305637209df35f378401e7bd8a4a/scripts/slave/recipe_modules/swarming/tests/collect_gtest_task_test.py

#c17: whatever that works.
Labels: -Pri-1 Pri-2
I'm waiting for review of https://codereview.chromium.org/2592923002/ , which should fix the underlying issue.

Now that df3cf94 has landed, the cause of this issue is better communicated to the users, so I don't think this is P1 any more; dropping it to P2.
bump - let's see if we can unblock the CL and make some progress here ...
Labels: Hotlist-Infra-Failures
Owner: tansell@chromium.org
tansell@ says he has a change that fixes this for any large json file.
Cc: -phajdan.jr@chromium.org -mcgreevy@chromium.org -djd@chromium.org -katthomas@chromium.org
Components: -Infra>Platform>Swarming Infra>Platform>Swarming>Admin
Owner: ----
Status: Available (was: Assigned)
Cc: -iannucci@chromium.org iannu...@google.com

Sign in to add a comment