New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 864565 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Sep 6
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocking:
issue 867379



Sign in to add a comment

Improve retry logic in results_dashboard.py

Project Member Reported by simonhatch@chromium.org, Jul 17

Issue description

https://cs.chromium.org/chromium/src/tools/perf/core/results_dashboard.py

The existing retry logic is pretty basic, if the upload fails under certain conditions the data is serialized to disk and retried again and again forever. This causes a few problems, first of which is that if you get some "bad" data into that path, the waterfall goes red until someone goes in and manually purges every bot. On things like timeouts which result in a 500, we don't retry since appengine also 500's on internal errors, so we have to be conservative and just not retry in those cases even though they may succeed on subsequent attempts.

Maybe it could be extended with some things like retry_count, and then we could retry on 500's.
 
So while reviewing https://chromium-review.googlesource.com/c/chromium/src/+/1146145 it turns out that the OBBS script never had working retries to begin with. When it was forked from the original, the cache file started going into a temp directory that got purged immediately after the script was called.
Owner: nednguyen@chromium.org
Status: Started (was: Untriaged)
Starting the work by simplifying results_dashboard.py script first
Project Member

Comment 3 by bugdroid1@chromium.org, Jul 24

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/0f9b32a0e4878013110937ab0dd08ca54dd09dec

commit 0f9b32a0e4878013110937ab0dd08ca54dd09dec
Author: Ned Nguyen <nednguyen@google.com>
Date: Tue Jul 24 00:08:19 2018

Refactor results_dashboard script to get rid of file caching logic

The reason is that logic was never activated, so getting rid of it for simplification. Later CLs will add the retry logic from clean state.

Cq-Include-Trybots: master.tryserver.chromium.perf:obbs_fyi
Change-Id: Iad93ad54c0c5f30c8a207c9b703b10a562f9f04f

BUG:  chromium:864565 
Cq-Include-Trybots: master.tryserver.chromium.perf:obbs_fyi
Change-Id: Iad93ad54c0c5f30c8a207c9b703b10a562f9f04f
Reviewed-on: https://chromium-review.googlesource.com/1146145
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Reviewed-by: Simon Hatch <simonhatch@chromium.org>
Cr-Commit-Position: refs/heads/master@{#577356}
[modify] https://crrev.com/0f9b32a0e4878013110937ab0dd08ca54dd09dec/tools/perf/core/results_dashboard.py
[modify] https://crrev.com/0f9b32a0e4878013110937ab0dd08ca54dd09dec/tools/perf/core/upload_results_to_perf_dashboard.py
[modify] https://crrev.com/0f9b32a0e4878013110937ab0dd08ca54dd09dec/tools/perf/process_perf_results.py

Labels: -Pri-3 Pri-1
This is actually P1 because most of the perf waterfall failure I see these days are due to 500 server error in uploading
Project Member

Comment 5 by bugdroid1@chromium.org, Jul 24

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/881f1317306d626dbfe1de1a8c9aa113ef44eb63

commit 881f1317306d626dbfe1de1a8c9aa113ef44eb63
Author: Ned Nguyen <nednguyen@google.com>
Date: Tue Jul 24 21:10:09 2018

Add retry logic to results_dashboard uploading

In this first attempt, we retry upload the whole perf dashboard data (JSONnizable)
for any recoverable error.

Subsequent CL may improve this further by splitting up the perf dashboard data
to small parts and only try to reupload the parts which the upload failed.

This CL also increase the expiration of the oath token generated for each perf benchmark's data to 30 minutes (used to be 1 minutes by default)

Bug:  864565 
Cq-Include-Trybots: master.tryserver.chromium.perf:obbs_fyi
Change-Id: If4cb959cc62c5da3baa4a36a5ae91c10ca6d4d02
Reviewed-on: https://chromium-review.googlesource.com/1148228
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Reviewed-by: Simon Hatch <simonhatch@chromium.org>
Cr-Commit-Position: refs/heads/master@{#577682}
[modify] https://crrev.com/881f1317306d626dbfe1de1a8c9aa113ef44eb63/tools/perf/core/oauth_api.py
[modify] https://crrev.com/881f1317306d626dbfe1de1a8c9aa113ef44eb63/tools/perf/core/results_dashboard.py
[add] https://crrev.com/881f1317306d626dbfe1de1a8c9aa113ef44eb63/tools/perf/core/results_dashboard_unittest.py
[modify] https://crrev.com/881f1317306d626dbfe1de1a8c9aa113ef44eb63/tools/perf/process_perf_results.py

Another upload in https://logs.chromium.org/v/?s=chrome%2Fbb%2Fchromium.perf%2FAndroid_Nexus5_Perf%2F2097%2F%2B%2Frecipes%2Fsteps%2Fperformance_test_suite_on_Android_device_Nexus_5%2F0%2Flogs%2FMerge_script_log%2F0 failed. It's because 500 error is still unrecoverable. I am switching perf dashboard upload to retry on 500s now.
Blockedon: 867379
Project Member

Comment 8 by bugdroid1@chromium.org, Jul 25

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/9310bd64689bd0875556215e7149c7d844a06d64

commit 9310bd64689bd0875556215e7149c7d844a06d64
Author: Ned Nguyen <nednguyen@google.com>
Date: Wed Jul 25 16:11:47 2018

Refactor token generation logic into tools/perf/core/results_dashboard.py

This has several benefits:
1) No longer generating the authentication token for each benchmark results
too early. Before this CL, we generate authentication for every benchmark
results first, then process each perf results & upload them. With such logic,
the last benchmark to be uploaded may have token generated for a long period of
time.
2) Enables generating a new token for every retry attempt in
core/results_dashboard.py.

Bug:864565
Cq-Include-Trybots: master.tryserver.chromium.perf:obbs_fyi
Change-Id: I8b2d681efdaa241a8af19a7ffbf126523caa12a9

NOTRY=true # CQ flakes

Change-Id: I8b2d681efdaa241a8af19a7ffbf126523caa12a9
Reviewed-on: https://chromium-review.googlesource.com/1150080
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Reviewed-by: Simon Hatch <simonhatch@chromium.org>
Reviewed-by: Juan Antonio Navarro Pérez <perezju@chromium.org>
Cr-Commit-Position: refs/heads/master@{#577916}
[delete] https://crrev.com/e975a3e7a531508aa41248f4ae21a11c7ed667d7/tools/perf/core/oauth_api.py
[modify] https://crrev.com/9310bd64689bd0875556215e7149c7d844a06d64/tools/perf/core/results_dashboard.py
[modify] https://crrev.com/9310bd64689bd0875556215e7149c7d844a06d64/tools/perf/core/results_dashboard_unittest.py
[modify] https://crrev.com/9310bd64689bd0875556215e7149c7d844a06d64/tools/perf/core/upload_results_to_perf_dashboard.py
[modify] https://crrev.com/9310bd64689bd0875556215e7149c7d844a06d64/tools/perf/process_perf_results.py
[modify] https://crrev.com/9310bd64689bd0875556215e7149c7d844a06d64/tools/perf/process_perf_results_unittest.py

Project Member

Comment 9 by bugdroid1@chromium.org, Jul 25

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/b42bce5039412dc6215539a365bb6182ead45d32

commit b42bce5039412dc6215539a365bb6182ead45d32
Author: Ned Nguyen <nednguyen@google.com>
Date: Wed Jul 25 17:18:01 2018

Retry uploading to perf dashboard in cases server reponses 500

Bug:  864565 
Cq-Include-Trybots: master.tryserver.chromium.perf:obbs_fyi
Change-Id: I8ae19f0318802404f354fc474a62299aad713afd
Reviewed-on: https://chromium-review.googlesource.com/1149582
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Reviewed-by: Simon Hatch <simonhatch@chromium.org>
Cr-Commit-Position: refs/heads/master@{#577953}
[modify] https://crrev.com/b42bce5039412dc6215539a365bb6182ead45d32/tools/perf/core/results_dashboard.py

Project Member

Comment 10 by bugdroid1@chromium.org, Jul 25

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/4c5e7afaaf0586e6e5f33a8ca4c303643e74edb4

commit 4c5e7afaaf0586e6e5f33a8ca4c303643e74edb4
Author: Ned Nguyen <nednguyen@google.com>
Date: Wed Jul 25 22:16:06 2018

Refactor results_dashboard code to be cleaner

Bug:  864565 
Cq-Include-Trybots: master.tryserver.chromium.perf:obbs_fyi
Change-Id: I1fbea8b4afd7f6be773db2ea5f83e0a0aa549b41
Reviewed-on: https://chromium-review.googlesource.com/1150431
Reviewed-by: Simon Hatch <simonhatch@chromium.org>
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Cr-Commit-Position: refs/heads/master@{#578095}
[modify] https://crrev.com/4c5e7afaaf0586e6e5f33a8ca4c303643e74edb4/tools/perf/core/results_dashboard.py
[modify] https://crrev.com/4c5e7afaaf0586e6e5f33a8ca4c303643e74edb4/tools/perf/core/results_dashboard_unittest.py

Blockedon: -867379
Cc: perezju@chromium.org simonhatch@chromium.org
Status: Fixed (was: Started)
This bug is done. Though we still have problems with uploading to perf dashboard due to  timedout in internal dashboard server (tracked in issue 867379)
Status: Started (was: Fixed)
Talking to Ethan & Simon. There are still further improvements we can make to the retry logic. Right now we retry right after the failure, whereas it generally takes perf dashboard server at least 10 seconds to spin up a new instance. 

For now, I will add s.t like:

wait_time_to_retry = 30s
for _ in max_num_retry:
  sleep(wait_time_to_retry)
  upload_to_perf_dashboard()
  wait_time_to_retry *= 2
Blocking: 867379
Project Member

Comment 14 by bugdroid1@chromium.org, Aug 15

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/4a023983ea91852ee5ea3a6b0bdc3749ac4a7258

commit 4a023983ea91852ee5ea3a6b0bdc3749ac4a7258
Author: Ned Nguyen <nednguyen@google.com>
Date: Wed Aug 15 22:39:57 2018

Add wait time between retry attempts to upload to perf dashboard

This implements exponential backoff to increase the wait time after each
retry attempt.

This CL also increase the default number of retry attempts to 4.

Bug:864565
Cq-Include-Trybots: master.tryserver.chromium.perf:obbs_fyi
Change-Id: I99c167e1329bbb83d86d310e2acf164cd3f3e7cf

NOTRY=true # linux-chromeos-rel flake

Change-Id: I99c167e1329bbb83d86d310e2acf164cd3f3e7cf
Reviewed-on: https://chromium-review.googlesource.com/1176089
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Reviewed-by: Annie Sullivan <sullivan@chromium.org>
Cr-Commit-Position: refs/heads/master@{#583422}
[modify] https://crrev.com/4a023983ea91852ee5ea3a6b0bdc3749ac4a7258/tools/perf/core/results_dashboard.py
[modify] https://crrev.com/4a023983ea91852ee5ea3a6b0bdc3749ac4a7258/tools/perf/core/results_dashboard_unittest.py
[modify] https://crrev.com/4a023983ea91852ee5ea3a6b0bdc3749ac4a7258/tools/perf/core/upload_results_to_perf_dashboard.py
[modify] https://crrev.com/4a023983ea91852ee5ea3a6b0bdc3749ac4a7258/tools/perf/process_perf_results.py

Status: Fixed (was: Started)

Sign in to add a comment