New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 829962 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Apr 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



Sign in to add a comment

chromium.perf and chromium.perf.fyi test upload failures

Project Member Reported by seanmccullough@chromium.org, Apr 6 2018

Issue description

This is one of the "build number conflict" errors.  From Emily's email:

"""
for a few days we have been seeing a dashboard upload turn yellow on our bots: 

https://ci.chromium.org/buildbot/chromium.perf.fyi/Mac%2010.12%20Laptop%20Low%20End/681

With the following error stack trace: 

Traceback (most recent call last):
  File "/b/rr/tmpsv2pD9/rw/checkout/scripts/slave/recipe_modules/test_results/resources/upload_test_results.py", line 216, in <module>
    sys.exit(main(sys.argv[1:]))
  File "/b/rr/tmpsv2pD9/rw/checkout/scripts/slave/recipe_modules/test_results/resources/upload_test_results.py", line 207, in main
    options.test_results_server, attrs, files, 120)
  File "/b/rr/tmpsv2pD9/rw/checkout/scripts/slave/recipe_modules/test_results/resources/test_results_uploader.py", line 51, in upload_test_results
    timeout_secs)
  File "/b/rr/tmpsv2pD9/rw/checkout/scripts/slave/recipe_modules/test_results/resources/test_results_uploader.py", line 135, in _retry_exp_backoff
    % (e.code, e.filename, e.read()))
test_results_uploader.PermanentError: Received HTTP status 409 loading "https://test-results.appspot.com/testfile/upload": build number conflict
step returned non-zero exit code: 1


But it does look like some data at least is getting to the dashboard: 

https://test-results.appspot.com/dashboards/flakiness_dashboard.html#testType=performance_test_suite&builder=chromium.perf.fyi%3AMac%2010.12%20Laptop%20Low%20End

Digging into the code you can see this is where it fails, but we definitely don't have any duplicate step names as we only have two :)

Here is a link to our merged json: 

https://logs.chromium.org/v/?s=chrome%2Fbb%2Fchromium.perf.fyi%2FMac_10.12_Laptop_Low_End%2F681%2F%2B%2Frecipes%2Fsteps%2Fperformance_test_suite_on_Intel_GPU_on_Mac_on_Mac-10.12%2F0%2Flogs%2Fjson.output%2F0

but it sounds like it is more an issue with a previous upload and duplicate entries?  
"""

 
The biggest question here is whether or not this is a failure on the perf side or the dashboard side.  It appears to be that the data is still getting uploaded, but whether or not it all is there is hard to reason about.  

Is this something you have seen before on the dashboard side?
Re #1, yes we have seen things like this on the server before.

IIRC the test-results server does some "normalization" or "cleaning" on the step name to derive a "test name" - and this is what it uses to look up the set of previous test runs to add the new batch to. If it normalizes two different test steps to the same "test name" then it can fool the code into thinking we already uploaded the results for this step/build number pair. Hence the error.

The original reason for normalizing step names is not totally clear to me, but I believe it has to do with tests that embed configuration settings in the step name itself.

Some example unit tests for what this step-name-cleaning tries to do: https://cs.chromium.org/chromium/infra/go/src/infra/appengine/test-results/frontend/builders_test.go?type=cs&q=cleanTestType+file:%5Einfra/go/src/infra/appengine/test-results/+package:%5Echromium$&l=29

I'm not yet sure how we can fix this or why it's happening to this particular builder/test.
Components: Speed>Benchmarks>Waterfall
Labels: -Pri-2 Pri-1
Ok so this is happening on the main waterfall as well now across several testers.

One example is Win 10 Perf which has been stable.  If you look at Win 10 perf there are several failures here: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Win%2010%20Perf/builds/2345

with the same log statement if you look at the log: https://logs.chromium.org/v/?s=chrome%2Fbb%2Fchromium.perf%2FWin_10_Perf%2F2345%2F%2B%2Frecipes%2Fsteps%2FUpload_to_test-results__oortonline_tbmv2_on_Intel_GPU_on_Windows_%2F0%2Fstdout

Our names are pretty short and don't conflict
Summary: chromium.perf and chromium.perf.fyi test upload failures (was: chromium.perf.fyi test upload failures)
That Win 10 Perf builder is actually failing because it's trying to upload a test results file that doesn't include required information:

"test_results_uploader.PermanentError: Received HTTP status 400 loading "https://test-results.appspot.com/testfile/upload": uploaded file is missing required num_failures_by_type field"

So that's a different issue and can probably be fixed on the perf recipe side.
Cc: nednguyen@chromium.org ashleymarie@chromium.org
Ah interesting, that is a separate error.  

We are seeing it on some builders though like Win 7 ATI GPU Perf: 

https://logs.chromium.org/v/?s=chrome%2Fbb%2Fchromium.perf%2FWin_7_ATI_GPU_Perf%2F2067%2F%2B%2Frecipes%2Fsteps%2FUpload_to_test-results__smoothness.key_desktop_move_cases_on_ATI_GPU_on_Windows_%2F0%2Fstdout

Same build number conflict.  

Is there a way to debug these further?  Do you log anything further in the flakiness dashboard?
Status: Started (was: Available)
Current theory of what's going on: 

0. Test uploader attempts an upload.
1. Server does *some* work and manages to save some part of the results for that build number, but then errors out with a 500 before finishing.
2. Uploader dutifully retries on 500s, so it re-uploads same results file
3. Server checks and sees that it has *something* from this build already and errors with 409

Example recipe-side log:

2018-04-05 04:25:51,648 - root - INFO - Input JSON file probably has full json results format
2018-04-05 04:25:51,692 - root - INFO - Uploading JSON files for builder "Mac 10.12 Laptop Low End" to server "test-results.appspot.com"
2018-04-05 04:25:51,694 - root - INFO - Sending request to https://test-results.appspot.com/testfile/upload at 2018-04-05 11:25:51.694169 UTC
2018-04-05 04:25:52,895 - root - WARNING - Received HTTP status 500 loading "https://test-results.appspot.com/testfile/upload".  Retrying in 10 seconds...
2018-04-05 04:26:02,900 - root - INFO - Sending request to https://test-results.appspot.com/testfile/upload at 2018-04-05 11:26:02.900506 UTC
Traceback (most recent call last):
  File "/b/rr/tmp1urfch/rw/checkout/scripts/slave/recipe_modules/test_results/resources/upload_test_results.py", line 216, in <module>
    sys.exit(main(sys.argv[1:]))
  File "/b/rr/tmp1urfch/rw/checkout/scripts/slave/recipe_modules/test_results/resources/upload_test_results.py", line 207, in main
    options.test_results_server, attrs, files, 120)
  File "/b/rr/tmp1urfch/rw/checkout/scripts/slave/recipe_modules/test_results/resources/test_results_uploader.py", line 51, in upload_test_results
    timeout_secs)
  File "/b/rr/tmp1urfch/rw/checkout/scripts/slave/recipe_modules/test_results/resources/test_results_uploader.py", line 135, in _retry_exp_backoff
    % (e.code, e.filename, e.read()))
test_results_uploader.PermanentError: Received HTTP status 409 loading "https://test-results.appspot.com/testfile/upload": build number conflict
step returned non-zero exit code: 1

And I see a panic (which 500s) in some of the corresponding server logs.  Will now attempt to repro in unit tests...

Comment 8 by eyaich@chromium.org, Apr 12 2018

Ok so it sounds like the retry logic needs to be more robust to either overwrite the data or pick up where it left off on a retry.

Is there any more information on the server side as to the initial 500?  Trying to determine if there is a perf error on the data.
It looks like a nil pointer dereference somewhere in the event logging (to bigquery) code.  Still debugging (with one of perf's 13059 line JSON files :) on my workstation.

 
I think this will be a server-side fix, so your stuff probably doesn't need to change. I *think*.

Retrying on 500s is perfectly reasonable.
Pushed this change to prod.  Will check back in a few to see if uploads are succeeding.
Status: Fixed (was: Started)
https://ci.chromium.org/buildbot/chromium.perf/Win%207%20ATI%20GPU%20Perf/2081 succeeded on the test upload step. 

Please re-open if you're still seeing 500s followed by a retry that gets a 409 build number conflict.

Sign in to add a comment