Change test_launcher to write output.json at the start of the execution and on-going |
||||||||
Issue description
Current state:
- Chromium recipe isolate and triggers a test as a swarming task.
- In the Swarming task:
- test_launcher starts.
- When it runs in parallel (bot) mode, it starts N child processes.
- Each child processes runs their chunk of test cases.
- They write the results in a temporary file upon exit as instructed by the master. See test_launcher.cc:DoLaunchChildTestProcess()
- The main test_launcher processes gather results and writes it to ${ISOLATED_OUTDIR}/output.json.
- The recipe collects the results.
Relevant code: //base/test/launcher
Problems:
Source:
- When the master test_launcher process is forcibly killed mid-execution, ${ISOLATED_OUTDIR}/output.json is not written at all.
Effects:
- When the recipe collects the task results and there is no output.json, it doesn't have enough information to report anything else than InfraFailure.
- We lose all information about which test cases had time to succeed, fail and which one weren't run.
Notes:
- It doesn't really matter why the process was killed in the context of this issue, the goal here is to know that at least some test cases ran or not.
- The query written by Katie shown that this is happening on a continuous basis.
- This doesn't fix the problem when the bot reboots mid-task, this will be addressed separately in issue 645280 .
AI:
- Change //base/test/launcher to write output.json right at the start and mark it clear that there are test cases pending execution.
- This likely require change in the format.
- Optional, needs discussion: every time there's more information (or buffer for N seconds), update the file. This could cause a small (large?) perf hit but would make the data much more useful. This is trickier since it needs collaboration between the children and master processes.
Changes to the recipes are outside the scope of this bug, will file another one.
,
Sep 23 2016
,
Sep 23 2016
What should we do about Telemetry based tests (SwarmingIsolatedScriptTest)? The same policy change should take effect there as well? I assume webkit_tests should do the same once they're Swarmed?
,
Sep 23 2016
Hmm ... I'm not sure I agree with this basic idea, though I'd have to think about it more. Maybe it would be better if the test launcher was streaming results back over a socket or a pipe? Alternatively maybe we should just make the individual jobs small enough that we don't have to wait too long for them to die and it's not that painful to retry them? Generally in my test harnesses I've tried hard to make sure that the harness itself never dies, even though the underlying tests might, but of course we can't make that always work.
,
Sep 23 2016
json format is not very streaming friendly. They way we do it in telemetry is for each page (~ test) run in a benchmark, we stream gtest output that looks s.t like: [ Run ] text_10000_pixels_per_second .... [ OK ] text_10000_pixels_per_second (18481 ms) [ Run ] text_15000_pixels_per_second ..... [ OK ] text_15000_pixels_per_second (14745 ms) ... (example log: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Win%208%20Perf%20%281%29/builds/7742/steps/thread_times.tough_scrolling_cases/logs/stdio) Then we output json file at the end that contains detailed info of tests which is more bot friendly. I think this is enough to figure out: which tests have run, for how long, and which ones are currently stuck.
,
Sep 23 2016
The JSON format was not designed to be streamable, but it wouldn't be too hard to come up with a variant that was.
,
Oct 6 2016
- We lose all information about which test cases had time to succeed, fail and which one weren't run. This isn't totally true, correct? We can see that information in the log.
,
Oct 6 2016
Right, but the whole point of writing the output json is to provide the data in a known format that gets uploaded somewhere, so that we don't have to parse logs (that are potentially in ad-hoc or varying formats).
,
Oct 10 2016
,
Oct 10 2016
Why did you mark this Restrict-View-Google? There's nothing that needs to be private in this ...
,
Oct 11 2016
Removing RVG, there's no reason for this flag AFAIK. Re comment #4: streaming may become an option eventually but this is longer term anyway. This issue is about short term improvement. This is the wrong channel to talk about this, ping me directly for more details. Re comment #6: in the proposal I note that test_launcher could write periodically. It's up to Pawel to decide on this.
,
Oct 11 2016
Uploaded https://codereview.chromium.org/2411003003 . We could make it more sophisticated later.
,
Oct 11 2016
,
Oct 12 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/5dc1ee3fccf13062e1ee9b1fbb25189f617c74c8 commit 5dc1ee3fccf13062e1ee9b1fbb25189f617c74c8 Author: phajdan.jr <phajdan.jr@chromium.org> Date: Wed Oct 12 10:46:14 2016 Save an early test summary in case the test launcher crashes or gets killed This may turn some infra failures (purple) into INVALID_TEST_RESULTS (red) in case the test launcher is for example early terminated on swarming. See bug for more details. BUG= 649831 Review-Url: https://codereview.chromium.org/2411003003 Cr-Commit-Position: refs/heads/master@{#424709} [modify] https://crrev.com/5dc1ee3fccf13062e1ee9b1fbb25189f617c74c8/base/test/launcher/test_launcher.cc [modify] https://crrev.com/5dc1ee3fccf13062e1ee9b1fbb25189f617c74c8/base/test/launcher/test_launcher.h [modify] https://crrev.com/5dc1ee3fccf13062e1ee9b1fbb25189f617c74c8/base/test/launcher/test_results_tracker.cc [modify] https://crrev.com/5dc1ee3fccf13062e1ee9b1fbb25189f617c74c8/base/test/launcher/test_results_tracker.h
,
Oct 22 2016
Issue 654074 has been merged into this issue.
,
Oct 22 2016
,
Oct 27 2016
Closing. We can revisit this later in case a more sophisticated fix is needed. |
||||||||
►
Sign in to add a comment |
||||||||
Comment 1 by mar...@chromium.org
, Sep 23 2016