
Issue 669160


Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug




Improvements to perf dashboard upload error handling

Project Member Reported by sullivan@chromium.org, Nov 28 2016

Issue description

The way the perf recipe does uploading right now:
1) Write the current chartjson to a file in a specific directory
2) Upload all the files in that directory
3) Only remove files if upload succeeds, so that failed uploads are retried
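For illustration, the three steps above could be sketched roughly as follows. This is not the recipe's actual code: `upload_pending`, `results_dir`, and the `upload` callable (assumed to return True on success) are hypothetical names chosen for the example.

```python
import os


def upload_pending(results_dir, upload):
    """Try to upload every file in results_dir; keep the ones that fail.

    `upload` is a hypothetical callable that takes the file contents and
    returns True on success. Files are removed only after a successful
    upload, so anything left behind is retried on the next run.
    """
    failed = []
    for name in sorted(os.listdir(results_dir)):
        path = os.path.join(results_dir, name)
        with open(path) as f:
            data = f.read()
        if upload(data):
            os.remove(path)  # step 3: delete only on success
        else:
            failed.append(name)  # stays on disk for the next run
    return failed
```

Note that a file whose contents can never be uploaded stays in `results_dir` indefinitely under this scheme, which is the failure mode described below.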

The good thing about this is that if there is a transient error, an IP that wasn't whitelisted, or a bug that got fixed dashboard-side, we don't lose any data; the upload is simply retried on each run.

The bad thing about this is that if there is an error generating data, the bad file fails to upload on every subsequent run, and the error persists until someone logs onto the bot and deletes the bad data.

I think there are a few things we could do to make this work better:
1) Store the results files in directories based on known error types. 403 and 503 are for IP whitelist denials and quota overages, so those are transient. Other errors are much more likely to be permanent (e.g. corrupted chartjson).
2) Always upload the most recent run first. 
  2a) If it fails, set the step to purple, store the failed chartjson in the appropriate directory, and stop.
  2b) If it succeeds, upload any files that had transient errors. If that succeeds, attempt uploading files that had non-transient errors.
3) Delete files which had permanent errors after some number of retries.
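A minimal sketch of how this proposal might look, under several assumptions that are not in the issue: `upload` is a hypothetical callable returning an HTTP status code with 200 meaning success, the directory names `transient` and `permanent` are invented, the retry count is tracked by appending a `.retry` suffix to the file name, and "purple" is represented as a plain return value rather than a real recipe step status.

```python
import os

# Per the issue: 403 = IP whitelist denial, 503 = quota overage.
TRANSIENT_STATUSES = {403, 503}


def classify(status):
    """Map a failed upload's HTTP status to an error directory name."""
    return "transient" if status in TRANSIENT_STATUSES else "permanent"


def retry_dir(root, kind, upload, max_retries=None):
    """Retry every file in root/kind; return True if all of them uploaded.

    If max_retries is set, a file is deleted once it has failed that many
    retries. The count is kept as repeated ".retry" suffixes (an assumption
    made to keep this example small).
    """
    d = os.path.join(root, kind)
    os.makedirs(d, exist_ok=True)
    all_ok = True
    for name in sorted(os.listdir(d)):
        path = os.path.join(d, name)
        with open(path) as f:
            data = f.read()
        if upload(data) == 200:
            os.remove(path)
            continue
        all_ok = False
        if max_retries is not None:
            tries = name.count(".retry") + 1
            if tries >= max_retries:
                os.remove(path)  # step 3: give up on this file
            else:
                os.rename(path, path + ".retry")
    return all_ok


def handle_run(root, name, data, upload, max_retries=3):
    """Upload the newest result first, then retry older failures."""
    status = upload(data)
    if status != 200:
        # Step 2a: store the failed chartjson by error type and stop.
        d = os.path.join(root, classify(status))
        os.makedirs(d, exist_ok=True)
        with open(os.path.join(d, name), "w") as f:
            f.write(data)
        return "purple"
    # Step 2b: transient failures first, then non-transient ones.
    if retry_dir(root, "transient", upload):
        retry_dir(root, "permanent", upload, max_retries)
    return "ok"
```

With this layout a corrupted file can no longer block the newest run's upload, and it is cleaned up automatically after `max_retries` failed retries instead of requiring someone to log onto the bot.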


This is something we hit every few months. Last seen here: https://github.com/catapult-project/catapult/issues/3033


Filing as P3 and assigning to Emily as an improvement to the recipe post-swarming.
 
+1 to this overall request. The case in 3033 was a 500 because of too-large JSON, which is a bug that we plan to fix (the inability to upload sufficiently large JSON), but uploading to the perf dashboard in general should not break just because that bug has not been fixed yet.
Components: Speed>Dashboard
Labels: -Performance-Dashboard
Components: Speed>Benchmarks>Waterfall
Labels: -Performance-Waterfall
Status: Assigned (was: Untriaged)
Cc: -eakuefner@chromium.org
