Improve retry logic in results_dashboard.py |
||||||||
Issue descriptionhttps://cs.chromium.org/chromium/src/tools/perf/core/results_dashboard.py The existing retry logic is pretty basic, if the upload fails under certain conditions the data is serialized to disk and retried again and again forever. This causes a few problems, first of which is that if you get some "bad" data into that path, the waterfall goes red until someone goes in and manually purges every bot. On things like timeouts which result in a 500, we don't retry since appengine also 500's on internal errors, so we have to be conservative and just not retry in those cases even though they may succeed on subsequent attempts. Maybe it could be extended with some things like retry_count, and then we could retry on 500's.
,
Jul 23
Starting the work by simplifying results_dashboard.py script first
,
Jul 24
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/0f9b32a0e4878013110937ab0dd08ca54dd09dec commit 0f9b32a0e4878013110937ab0dd08ca54dd09dec Author: Ned Nguyen <nednguyen@google.com> Date: Tue Jul 24 00:08:19 2018 Refactor results_dashboard script to get rid of file caching logic The reason is that logic was never activated, so getting rid of it for simplification. Later CLs will add the retry logic from clean state. Cq-Include-Trybots: master.tryserver.chromium.perf:obbs_fyi Change-Id: Iad93ad54c0c5f30c8a207c9b703b10a562f9f04f BUG: chromium:864565 Cq-Include-Trybots: master.tryserver.chromium.perf:obbs_fyi Change-Id: Iad93ad54c0c5f30c8a207c9b703b10a562f9f04f Reviewed-on: https://chromium-review.googlesource.com/1146145 Commit-Queue: Ned Nguyen <nednguyen@google.com> Reviewed-by: Simon Hatch <simonhatch@chromium.org> Cr-Commit-Position: refs/heads/master@{#577356} [modify] https://crrev.com/0f9b32a0e4878013110937ab0dd08ca54dd09dec/tools/perf/core/results_dashboard.py [modify] https://crrev.com/0f9b32a0e4878013110937ab0dd08ca54dd09dec/tools/perf/core/upload_results_to_perf_dashboard.py [modify] https://crrev.com/0f9b32a0e4878013110937ab0dd08ca54dd09dec/tools/perf/process_perf_results.py
,
Jul 24
This is actually P1 because most of the perf waterfall failure I see these days are due to 500 server error in uploading
,
Jul 24
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/881f1317306d626dbfe1de1a8c9aa113ef44eb63 commit 881f1317306d626dbfe1de1a8c9aa113ef44eb63 Author: Ned Nguyen <nednguyen@google.com> Date: Tue Jul 24 21:10:09 2018 Add retry logic to results_dashboard uploading In this first attempt, we retry upload the whole perf dashboard data (JSONnizable) for any recoverable error. Subsequent CL may improve this further by splitting up the perf dashboard data to small parts and only try to reupload the parts which the upload failed. This CL also increase the expiration of the oath token generated for each perf benchmark's data to 30 minutes (used to be 1 minutes by default) Bug: 864565 Cq-Include-Trybots: master.tryserver.chromium.perf:obbs_fyi Change-Id: If4cb959cc62c5da3baa4a36a5ae91c10ca6d4d02 Reviewed-on: https://chromium-review.googlesource.com/1148228 Commit-Queue: Ned Nguyen <nednguyen@google.com> Reviewed-by: Simon Hatch <simonhatch@chromium.org> Cr-Commit-Position: refs/heads/master@{#577682} [modify] https://crrev.com/881f1317306d626dbfe1de1a8c9aa113ef44eb63/tools/perf/core/oauth_api.py [modify] https://crrev.com/881f1317306d626dbfe1de1a8c9aa113ef44eb63/tools/perf/core/results_dashboard.py [add] https://crrev.com/881f1317306d626dbfe1de1a8c9aa113ef44eb63/tools/perf/core/results_dashboard_unittest.py [modify] https://crrev.com/881f1317306d626dbfe1de1a8c9aa113ef44eb63/tools/perf/process_perf_results.py
,
Jul 25
Another upload in https://logs.chromium.org/v/?s=chrome%2Fbb%2Fchromium.perf%2FAndroid_Nexus5_Perf%2F2097%2F%2B%2Frecipes%2Fsteps%2Fperformance_test_suite_on_Android_device_Nexus_5%2F0%2Flogs%2FMerge_script_log%2F0 failed. It's because 500 error is still unrecoverable. I am switching perf dashboard upload to retry on 500s now.
,
Jul 25
,
Jul 25
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/9310bd64689bd0875556215e7149c7d844a06d64 commit 9310bd64689bd0875556215e7149c7d844a06d64 Author: Ned Nguyen <nednguyen@google.com> Date: Wed Jul 25 16:11:47 2018 Refactor token generation logic into tools/perf/core/results_dashboard.py This has several benefits: 1) No longer generating the authentication token for each benchmark results too early. Before this CL, we generate authentication for every benchmark results first, then process each perf results & upload them. With such logic, the last benchmark to be uploaded may have token generated for a long period of time. 2) Enables generating a new token for every retry attempt in core/results_dashboard.py. Bug:864565 Cq-Include-Trybots: master.tryserver.chromium.perf:obbs_fyi Change-Id: I8b2d681efdaa241a8af19a7ffbf126523caa12a9 NOTRY=true # CQ flakes Change-Id: I8b2d681efdaa241a8af19a7ffbf126523caa12a9 Reviewed-on: https://chromium-review.googlesource.com/1150080 Commit-Queue: Ned Nguyen <nednguyen@google.com> Reviewed-by: Simon Hatch <simonhatch@chromium.org> Reviewed-by: Juan Antonio Navarro Pérez <perezju@chromium.org> Cr-Commit-Position: refs/heads/master@{#577916} [delete] https://crrev.com/e975a3e7a531508aa41248f4ae21a11c7ed667d7/tools/perf/core/oauth_api.py [modify] https://crrev.com/9310bd64689bd0875556215e7149c7d844a06d64/tools/perf/core/results_dashboard.py [modify] https://crrev.com/9310bd64689bd0875556215e7149c7d844a06d64/tools/perf/core/results_dashboard_unittest.py [modify] https://crrev.com/9310bd64689bd0875556215e7149c7d844a06d64/tools/perf/core/upload_results_to_perf_dashboard.py [modify] https://crrev.com/9310bd64689bd0875556215e7149c7d844a06d64/tools/perf/process_perf_results.py [modify] https://crrev.com/9310bd64689bd0875556215e7149c7d844a06d64/tools/perf/process_perf_results_unittest.py
,
Jul 25
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/b42bce5039412dc6215539a365bb6182ead45d32 commit b42bce5039412dc6215539a365bb6182ead45d32 Author: Ned Nguyen <nednguyen@google.com> Date: Wed Jul 25 17:18:01 2018 Retry uploading to perf dashboard in cases server reponses 500 Bug: 864565 Cq-Include-Trybots: master.tryserver.chromium.perf:obbs_fyi Change-Id: I8ae19f0318802404f354fc474a62299aad713afd Reviewed-on: https://chromium-review.googlesource.com/1149582 Commit-Queue: Ned Nguyen <nednguyen@google.com> Reviewed-by: Simon Hatch <simonhatch@chromium.org> Cr-Commit-Position: refs/heads/master@{#577953} [modify] https://crrev.com/b42bce5039412dc6215539a365bb6182ead45d32/tools/perf/core/results_dashboard.py
,
Jul 25
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/4c5e7afaaf0586e6e5f33a8ca4c303643e74edb4 commit 4c5e7afaaf0586e6e5f33a8ca4c303643e74edb4 Author: Ned Nguyen <nednguyen@google.com> Date: Wed Jul 25 22:16:06 2018 Refactor results_dashboard code to be cleaner Bug: 864565 Cq-Include-Trybots: master.tryserver.chromium.perf:obbs_fyi Change-Id: I1fbea8b4afd7f6be773db2ea5f83e0a0aa549b41 Reviewed-on: https://chromium-review.googlesource.com/1150431 Reviewed-by: Simon Hatch <simonhatch@chromium.org> Commit-Queue: Ned Nguyen <nednguyen@google.com> Cr-Commit-Position: refs/heads/master@{#578095} [modify] https://crrev.com/4c5e7afaaf0586e6e5f33a8ca4c303643e74edb4/tools/perf/core/results_dashboard.py [modify] https://crrev.com/4c5e7afaaf0586e6e5f33a8ca4c303643e74edb4/tools/perf/core/results_dashboard_unittest.py
,
Aug 5
This bug is done. Though we still have problems with uploading to perf dashboard due to timedout in internal dashboard server (tracked in issue 867379)
,
Aug 15
Talking to Ethan & Simon. There are still further improvements we can make to the retry logic. Right now we retry right after the failure, whereas it generally takes perf dashboard server at least 10 seconds to spin up a new instance. For now, I will add s.t like: wait_time_to_retry = 30s for _ in max_num_retry: sleep(wait_time_to_retry) upload_to_perf_dashboard() wait_time_to_retry *= 2
,
Aug 15
,
Aug 15
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/4a023983ea91852ee5ea3a6b0bdc3749ac4a7258 commit 4a023983ea91852ee5ea3a6b0bdc3749ac4a7258 Author: Ned Nguyen <nednguyen@google.com> Date: Wed Aug 15 22:39:57 2018 Add wait time between retry attempts to upload to perf dashboard This implements exponential backoff to increase the wait time after each retry attempt. This CL also increase the default number of retry attempts to 4. Bug:864565 Cq-Include-Trybots: master.tryserver.chromium.perf:obbs_fyi Change-Id: I99c167e1329bbb83d86d310e2acf164cd3f3e7cf NOTRY=true # linux-chromeos-rel flake Change-Id: I99c167e1329bbb83d86d310e2acf164cd3f3e7cf Reviewed-on: https://chromium-review.googlesource.com/1176089 Commit-Queue: Ned Nguyen <nednguyen@google.com> Reviewed-by: Annie Sullivan <sullivan@chromium.org> Cr-Commit-Position: refs/heads/master@{#583422} [modify] https://crrev.com/4a023983ea91852ee5ea3a6b0bdc3749ac4a7258/tools/perf/core/results_dashboard.py [modify] https://crrev.com/4a023983ea91852ee5ea3a6b0bdc3749ac4a7258/tools/perf/core/results_dashboard_unittest.py [modify] https://crrev.com/4a023983ea91852ee5ea3a6b0bdc3749ac4a7258/tools/perf/core/upload_results_to_perf_dashboard.py [modify] https://crrev.com/4a023983ea91852ee5ea3a6b0bdc3749ac4a7258/tools/perf/process_perf_results.py
,
Sep 6
|
||||||||
►
Sign in to add a comment |
||||||||
Comment 1 by simonhatch@chromium.org
, Jul 23