Dashboard Upload step skipped if benchmark step is red |
||||||
Issue descriptionIt appears that we don't run the Dashboard Upload step at all if a benchmark is red. This means that we're losing good data from those runs. Unless I'm missing something, if only one story out of 25 fails, we should still upload the data for the other 24 stories, right? Here's an example from the perf waterfall where blink_perf.dom fails and there's no Dashboard Upload step https://ci.chromium.org/buildbot/chromium.perf/Android%20Nexus5X%20Perf/897 This is also a problem because we can't get a signal from benchmarks that fail on the histogram FYI bot.
,
Dec 8 2017
Friendly ping. Emily, could you look into this? Being able to get data from even benchmarks that have failing stories is blocking our correctness testing and rollout of the histogram pipeline.
,
Dec 8 2017
Yes I will take a look this afternoon.
,
Dec 8 2017
Ok, I don't know any perf specific logic that says if any fail, don't upload any. If you look here where we upload: https://cs.chromium.org/chromium/build/scripts/slave/recipe_modules/chromium_tests/steps.py?l=1620 There is no check for that. Given that we don't even see a "blink_perf.dom Dashboard Upload" step on the build page you referenced (https://ci.chromium.org/buildbot/chromium.perf/Android%20Nexus5X%20Perf/897), it is not making it to this part of the recipe. That being said, when I look at the logs of the failing step (https://logs.chromium.org/v/?s=chrome%2Fbb%2Fchromium.perf%2FAndroid_Nexus5X_Perf%2F897%2F%2B%2Frecipes%2Fsteps%2Fblink_perf.dom_on_Android%2F0%2Fstdout) I see the following line: "WARNING:root:collect_cmd had non-zero return code: 255" So maybe something is failing on the collect step of the bot? John is more familiar with the swarming collect recipe so maybe he can point us to where to investigate next what is going wrong with perf.
,
Dec 8 2017
,
Dec 11 2017
It looks like the issue isn't in the swarming collection logic but in this weird bit of code in validate_task_results: https://cs.chromium.org/chromium/build/scripts/slave/recipe_modules/chromium_tests/steps.py?rcl=2b868edbb217a1fcda0bfe85b30a9619942b1f61&l=1600 (This if block appears to have been motivated by the layout tests. I'm not sure I agree with the decision to treat valid-looking failures with exit codes >101 as invalid; revisiting that (likely by changing RWT) may be necessary here.) In any event, telemetry appears to be exiting w/ -1, which gets handled as 255. That's bigger than 101, so the recipe doesn't think that the results are valid and doesn't attempt to upload them.
,
Dec 11 2017
Ethan is the return code something that has changed with histogram set? Maybe that is a design discussion we need to have on failures and what we return. I am not as familiar with the telemetry code to know what valid return codes are.
,
Dec 11 2017
No, we don't use the return code for anything to do with histograms. I also don't really know anything about Telemetry's return codes. Could Ashley or Ned be the right person to look at this?
,
Dec 11 2017
The return_code logic of Telemetry is in https://cs.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/internal/story_runner.py?rcl=665b8b46209c1de8a1878c40b5cbb29d3e06c922&l=367 It return 255 for the case of unhandleable exception (line 374) . If that mean s.t special in recipe, we can always adjust the return code of Telemetry. There is no other existing clients that rely on the return code of Telemetry that I know of (except non zero means something bad happened)
,
Jan 4 2018
Ping on this -- can we adjust the return code of Telemetry? It seems like anything <= 101 should be fine according to jbudorick's comment.
,
Jan 4 2018
I can help with changing Telemetry return code
,
Jan 5 2018
The following revision refers to this bug: https://chromium.googlesource.com/catapult/+/ccd8621a0389fea575a8b929b00629c9bea453f6 commit ccd8621a0389fea575a8b929b00629c9bea453f6 Author: Nghia Nguyen <nednguyen@google.com> Date: Fri Jan 05 12:12:59 2018 Simplify Telemetry return code Chromium service has trouble with handling return_code larger than 101. Since there is no use in having return_code reflecting the number of test failures, we simplify Telemetry return code to: 1 - if the failures are handleable (recorded in telemetry failure results) 2 - if the failure is unhandleable Bug: chromium:792681 Change-Id: Ide651bba4ce3bded0a35d4b4efa9a707355149cf Reviewed-on: https://chromium-review.googlesource.com/850745 Reviewed-by: Juan Antonio Navarro Pérez <perezju@chromium.org> Commit-Queue: Ned Nguyen <nednguyen@google.com> [modify] https://crrev.com/ccd8621a0389fea575a8b929b00629c9bea453f6/telemetry/telemetry/internal/story_runner.py
,
Jan 10 2018
This should be fixed
,
Jan 10 2018
The following revision refers to this bug: https://chromium.googlesource.com/catapult/+/21f0b3554207e6e8178fa4e8f0c91b1249aeec79 commit 21f0b3554207e6e8178fa4e8f0c91b1249aeec79 Author: Nghia Nguyen <nednguyen@google.com> Date: Wed Jan 10 14:27:16 2018 Add test coverage for Telemetry benchmark return_code This also update documentation of story_runner.RunBenchmark Bug: chromium:792681 Change-Id: Iec50e507f0e72c98b7b5c9a0371ad3ee8796f5a8 Reviewed-on: https://chromium-review.googlesource.com/859877 Commit-Queue: Ned Nguyen <nednguyen@google.com> Reviewed-by: Juan Antonio Navarro Pérez <perezju@chromium.org> [modify] https://crrev.com/21f0b3554207e6e8178fa4e8f0c91b1249aeec79/telemetry/telemetry/internal/story_runner.py [modify] https://crrev.com/21f0b3554207e6e8178fa4e8f0c91b1249aeec79/telemetry/telemetry/internal/story_runner_unittest.py
,
Jan 10 2018
Thanks a lot, Ned! |
||||||
►
Sign in to add a comment |
||||||
Comment 1 by nedngu...@google.com
, Dec 6 2017Status: Assigned (was: Untriaged)