The way the perf recipe does uploading right now (roughly sketched below):
1) Write the current chartjson to a file in a specific directory
2) Upload all the files in that directory
3) Only remove files if upload succeeds, so that failed uploads are retried
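For context, here is a minimal sketch of that flow, assuming a single staging directory and a generic upload callable; the directory path and function names are illustrative, not the recipe's actual identifiers.

import os

PENDING_DIR = '/path/to/pending_uploads'  # hypothetical staging directory


def stage_chartjson(chartjson_text, run_id):
    # 1) Write the current run's chartjson into the staging directory.
    with open(os.path.join(PENDING_DIR, '%s.json' % run_id), 'w') as f:
        f.write(chartjson_text)


def upload_pending(upload):
    # 2) Try to upload every staged file; `upload` is assumed to return
    #    True on success.
    for name in os.listdir(PENDING_DIR):
        path = os.path.join(PENDING_DIR, name)
        # 3) Only delete on success, so failed uploads are retried next run.
        if upload(path):
            os.remove(path)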
The good thing about this is that if there is a transient error, an IP that wasn't whitelisted, or a bug that got fixed dashboard-side, we don't lose any data; the upload just retries on each run.
The bad thing about this is that if there is an error generating the data, the bad file keeps failing to upload on every run until someone logs onto the bot and deletes it.
I think there are a few things we could do to make this work better (a rough sketch follows the list):
1) Store result files in directories based on known error types. 403 and 503 correspond to IP whitelist denials and quota overages, so those are transient. Other errors are much more likely to be permanent (e.g. corrupted chartjson).
2) Always upload the most recent run first.
2a) If it fails, set the step to purple, store the failed chartjson in the appropriate directory, and stop.
2b) If it succeeds, upload any files that had transient errors. If those all succeed, attempt to upload the files that had non-transient errors.
3) After some number of retries, delete files that had permanent errors.
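A rough sketch of what 1-3 could look like; the directory layout, the retry cap, and the helpers (an upload callable returning an HTTP status code, mark_step_purple) are assumptions for illustration, not the recipe's actual API.

import os

TRANSIENT_DIR = '/path/to/failed/transient'   # 403/503: whitelist, quota
PERMANENT_DIR = '/path/to/failed/permanent'   # e.g. corrupted chartjson
TRANSIENT_STATUSES = (403, 503)
MAX_PERMANENT_RETRIES = 5                     # arbitrary cap for step 3


def classify_failure(status_code):
    # 1) Pick a failure directory based on the known error types.
    return TRANSIENT_DIR if status_code in TRANSIENT_STATUSES else PERMANENT_DIR


def upload_with_triage(current_chartjson_path, upload, mark_step_purple):
    # 2) Always upload the most recent run first.
    status = upload(current_chartjson_path)
    if status != 200:
        # 2a) Stash the failed file by error type, flag the step, and stop.
        dest = classify_failure(status)
        os.rename(current_chartjson_path,
                  os.path.join(dest, os.path.basename(current_chartjson_path)))
        mark_step_purple()
        return
    # 2b) Retry transient failures; only if they all clear, retry permanent ones.
    if _retry_directory(TRANSIENT_DIR, upload):
        # 3) Permanent failures are dropped after MAX_PERMANENT_RETRIES attempts.
        _retry_directory(PERMANENT_DIR, upload, retry_cap=MAX_PERMANENT_RETRIES)


def _retry_directory(directory, upload, retry_cap=None):
    # Retry every stored file; return True if everything uploaded.
    all_uploaded = True
    for name in os.listdir(directory):
        if name.endswith('.retries'):
            continue  # sidecar retry counters, not result files
        path = os.path.join(directory, name)
        if upload(path) == 200:
            _forget(path)
        elif retry_cap is not None and _bump_retry_count(path) >= retry_cap:
            _forget(path)  # give up on files that keep failing permanently
        else:
            all_uploaded = False
    return all_uploaded


def _forget(path):
    os.remove(path)
    if os.path.exists(path + '.retries'):
        os.remove(path + '.retries')


def _bump_retry_count(path):
    # Track per-file retry attempts in a sidecar text file.
    counter = path + '.retries'
    count = 1
    if os.path.exists(counter):
        with open(counter) as f:
            count = int(f.read() or 0) + 1
    with open(counter, 'w') as f:
        f.write(str(count))
    return count

Whether the retry count lives in a sidecar file or somewhere else is an implementation detail; the key point is that permanently-failing files eventually age out instead of piling up on the bot.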
This is something we hit every few months. Last seen here: https://github.com/catapult-project/catapult/issues/3033
Filing as P3 and assigning to Emily as an improvement to the recipe post-swarming.
Comment 1 by eakuefner@chromium.org, Nov 28 2016