chromium.perf and chromium.perf.fyi test upload failures |
||||||
Issue descriptionThis is one of the "build number conflict" errors. From Emily's email: """ for a few days we have been seeing a dashboard upload turn yellow on our bots: https://ci.chromium.org/buildbot/chromium.perf.fyi/Mac%2010.12%20Laptop%20Low%20End/681 With the following error stack trace: Traceback (most recent call last): File "/b/rr/tmpsv2pD9/rw/checkout/scripts/slave/recipe_modules/test_results/resources/upload_test_results.py", line 216, in <module> sys.exit(main(sys.argv[1:])) File "/b/rr/tmpsv2pD9/rw/checkout/scripts/slave/recipe_modules/test_results/resources/upload_test_results.py", line 207, in main options.test_results_server, attrs, files, 120) File "/b/rr/tmpsv2pD9/rw/checkout/scripts/slave/recipe_modules/test_results/resources/test_results_uploader.py", line 51, in upload_test_results timeout_secs) File "/b/rr/tmpsv2pD9/rw/checkout/scripts/slave/recipe_modules/test_results/resources/test_results_uploader.py", line 135, in _retry_exp_backoff % (e.code, e.filename, e.read())) test_results_uploader.PermanentError: Received HTTP status 409 loading "https://test-results.appspot.com/testfile/upload": build number conflict step returned non-zero exit code: 1 But it does look like some data at least is getting to the dashboard: https://test-results.appspot.com/dashboards/flakiness_dashboard.html#testType=performance_test_suite&builder=chromium.perf.fyi%3AMac%2010.12%20Laptop%20Low%20End Digging into the code you can see this is where it fails, but we definitely don't have any duplicate step names as we only have two :) Here is a link to our merged json: https://logs.chromium.org/v/?s=chrome%2Fbb%2Fchromium.perf.fyi%2FMac_10.12_Laptop_Low_End%2F681%2F%2B%2Frecipes%2Fsteps%2Fperformance_test_suite_on_Intel_GPU_on_Mac_on_Mac-10.12%2F0%2Flogs%2Fjson.output%2F0 but it sounds like it is more an issue with a previous upload and duplicate entries? """
,
Apr 9 2018
Re #1, yes we have seen things like this on the server before. IIRC the test-results server does some "normalization" or "cleaning" on the step name to derive a "test name" - and this is what it uses to look up the set of previous test runs to add the new batch to. If it normalizes two different test steps to the same "test name" then it can fool the code into thinking we already uploaded the results for this step/build number pair. Hence the error. The original reason for normalizing step names is not totally clear to me, but I believe it has to do with tests that embed configuration settings in the step name itself. Some example unit tests for what this step-name-cleaning tries to do: https://cs.chromium.org/chromium/infra/go/src/infra/appengine/test-results/frontend/builders_test.go?type=cs&q=cleanTestType+file:%5Einfra/go/src/infra/appengine/test-results/+package:%5Echromium$&l=29 I'm not yet sure how we can fix this or why it's happening to this particular builder/test.
,
Apr 9 2018
Ok so this is happening on the main waterfall as well now across several testers. One example is Win 10 Perf which has been stable. If you look at Win 10 perf there are several failures here: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Win%2010%20Perf/builds/2345 with the same log statement if you look at the log: https://logs.chromium.org/v/?s=chrome%2Fbb%2Fchromium.perf%2FWin_10_Perf%2F2345%2F%2B%2Frecipes%2Fsteps%2FUpload_to_test-results__oortonline_tbmv2_on_Intel_GPU_on_Windows_%2F0%2Fstdout Our names are pretty short and don't conflict
,
Apr 9 2018
,
Apr 9 2018
That Win 10 Perf builder is actually failing because it's trying to upload a test results file that doesn't include required information: "test_results_uploader.PermanentError: Received HTTP status 400 loading "https://test-results.appspot.com/testfile/upload": uploaded file is missing required num_failures_by_type field" So that's a different issue and can probably be fixed on the perf recipe side.
,
Apr 9 2018
Ah interesting, that is a separate error. We are seeing it on some builders though like Win 7 ATI GPU Perf: https://logs.chromium.org/v/?s=chrome%2Fbb%2Fchromium.perf%2FWin_7_ATI_GPU_Perf%2F2067%2F%2B%2Frecipes%2Fsteps%2FUpload_to_test-results__smoothness.key_desktop_move_cases_on_ATI_GPU_on_Windows_%2F0%2Fstdout Same build number conflict. Is there a way to debug these further? Do you log anything further in the flakiness dashboard?
,
Apr 12 2018
Current theory of what's going on: 0. Test uploader attempts an upload. 1. Server does *some* work and manages to save some part of the results for that build number, but then errors out with a 500 before finishing. 2. Uploader dutifully retries on 500s, so it re-uploads same results file 3. Server checks and sees that it has *something* from this build already and errors with 409 Example recipe-side log: 2018-04-05 04:25:51,648 - root - INFO - Input JSON file probably has full json results format 2018-04-05 04:25:51,692 - root - INFO - Uploading JSON files for builder "Mac 10.12 Laptop Low End" to server "test-results.appspot.com" 2018-04-05 04:25:51,694 - root - INFO - Sending request to https://test-results.appspot.com/testfile/upload at 2018-04-05 11:25:51.694169 UTC 2018-04-05 04:25:52,895 - root - WARNING - Received HTTP status 500 loading "https://test-results.appspot.com/testfile/upload". Retrying in 10 seconds... 2018-04-05 04:26:02,900 - root - INFO - Sending request to https://test-results.appspot.com/testfile/upload at 2018-04-05 11:26:02.900506 UTC Traceback (most recent call last): File "/b/rr/tmp1urfch/rw/checkout/scripts/slave/recipe_modules/test_results/resources/upload_test_results.py", line 216, in <module> sys.exit(main(sys.argv[1:])) File "/b/rr/tmp1urfch/rw/checkout/scripts/slave/recipe_modules/test_results/resources/upload_test_results.py", line 207, in main options.test_results_server, attrs, files, 120) File "/b/rr/tmp1urfch/rw/checkout/scripts/slave/recipe_modules/test_results/resources/test_results_uploader.py", line 51, in upload_test_results timeout_secs) File "/b/rr/tmp1urfch/rw/checkout/scripts/slave/recipe_modules/test_results/resources/test_results_uploader.py", line 135, in _retry_exp_backoff % (e.code, e.filename, e.read())) test_results_uploader.PermanentError: Received HTTP status 409 loading "https://test-results.appspot.com/testfile/upload": build number conflict step returned non-zero exit code: 1 And I see a panic (which 500s) in some of the corresponding server logs. Will now attempt to repro in unit tests...
,
Apr 12 2018
Ok so it sounds like the retry logic needs to be more robust to either overwrite the data or pick up where it left off on a retry. Is there any more information on the server side as to the initial 500? Trying to determine if there is a perf error on the data.
,
Apr 12 2018
It looks like a nil pointer dereference somewhere in the event logging (to bigquery) code. Still debugging (with one of perf's 13059 line JSON files :) on my workstation.
,
Apr 12 2018
I think this will be a server-side fix, so your stuff probably doesn't need to change. I *think*. Retrying on 500s is perfectly reasonable.
,
Apr 12 2018
The following revision refers to this bug: https://chromium.googlesource.com/infra/infra/+/b7f7266244e8f425e44ca6063eccf4fa04e01702 commit b7f7266244e8f425e44ca6063eccf4fa04e01702 Author: Sean McCullough <seanmccullough@chromium.org> Date: Thu Apr 12 18:45:55 2018 [test-results] add nil checks for runtimes Bug: 829962 Change-Id: I95f60698d2e92eaacac0c316d719b8114966d045 Reviewed-on: https://chromium-review.googlesource.com/1010826 Reviewed-by: Tiffany Zhang <zhangtiff@chromium.org> Commit-Queue: Sean McCullough <seanmccullough@chromium.org> [modify] https://crrev.com/b7f7266244e8f425e44ca6063eccf4fa04e01702/go/src/infra/appengine/test-results/frontend/event_logging.go [modify] https://crrev.com/b7f7266244e8f425e44ca6063eccf4fa04e01702/go/src/infra/appengine/test-results/frontend/event_logging_test.go
,
Apr 12 2018
Pushed this change to prod. Will check back in a few to see if uploads are succeeding.
,
Apr 12 2018
https://ci.chromium.org/buildbot/chromium.perf/Win%207%20ATI%20GPU%20Perf/2081 succeeded on the test upload step. Please re-open if you're still seeing 500s followed by a retry that gets a 409 build number conflict. |
||||||
►
Sign in to add a comment |
||||||
Comment 1 by eyaich@chromium.org
, Apr 9 2018