Upload to test-results steps are slow |
||
Issue descriptionIn many CQ builds, total time of upload test result steps become more than 10 minute. https://pantheon.corp.google.com/bigquery?project=cr-buildbucket&j=bquxjob_5ce5df07_1651df71751&page=queryresults test upload steps consumed 10~35% of linux_chromium_rel_ng builder's cycle time. This is the slowest test upload steps. https://pantheon.corp.google.com/bigquery?project=cr-buildbucket&j=bquxjob_15036e85_1651df8da2a&page=queryresults Why many upload steps took more than 10 seconds? Cannot be optimized?
,
Aug 9
@martiniss - as long as the longest test step takes longer than (# of test steps) * 10s, that's true :). It would be good to be sure we understood where the time was being spent: how much of it is server-side in test-results vs. the network vs. client-side, at least.
,
Aug 9
I added a bit of logging to look at this. https://logs.chromium.org/logs/chromium/buildbucket/cr-buildbucket.appspot.com/8938643118147300432/+/steps/Upload_to_test-results__content_browsertests__with_patch__/0/stdout is a random longer upload. I've attached a screenshot showing the server side log of this upload call. It takes roughly 100 ms for the request to start (469 - 376). So I don't think network latency is a big issue. The majority of the time is spent reading stuff from datastore, and I think changing that requires re architecting the webapp.
,
Aug 9
I worked on this a bit in https://crbug.com/795118, but never really got anywhere.
,
Aug 9
I wonder improving the time for datastore requires large change like re-architecturing webapp. Can we parallelize some api calls by using go routine in this file? https://cs.chromium.org/chromium/infra/go/src/infra/appengine/test-results/model/test_file.go https://pantheon.corp.google.com/traces/traces?project=test-results-hrd&tid=1183cc1173c7a1302e0caacb6a52867f Also this trace suggests us to use batch put request.
,
Aug 10
Let me take a look some api calls.
,
Aug 10
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/luci-go.git/+/6699ca0cc494721279554bd5a957b81a551a3eb5 commit 6699ca0cc494721279554bd5a957b81a551a3eb5 Author: Takuto Ikuta <tikuta@chromium.org> Date: Fri Aug 10 21:53:22 2018 [bq] Upload data to bigquery concurrently This will make upload test step in CQ faster for the test results having many records like below. https://pantheon.corp.google.com/traces/details/b40b8c2e81f4d402f99d67b7bc936189?spanId=-4956720531504612275&project=test-results-hrd#TIMELINE Bug: 872645 Change-Id: I5eb57fe2ef788fd57253f1b78343a71091a0837a Reviewed-on: https://chromium-review.googlesource.com/1170725 Commit-Queue: Takuto Ikuta <tikuta@chromium.org> Reviewed-by: Nodir Turakulov <nodir@chromium.org> [modify] https://crrev.com/6699ca0cc494721279554bd5a957b81a551a3eb5/common/bq/eventupload.go
,
Aug 10
Nice find.
,
Aug 16
The following revision refers to this bug: https://chromium.googlesource.com/infra/infra/+/a728fea2174b22ec068c1b7cd59f38173e495a29 commit a728fea2174b22ec068c1b7cd59f38173e495a29 Author: Takuto Ikuta <tikuta@chromium.org> Date: Thu Aug 16 21:59:43 2018 [test-results] Parallelize datastore Put in writeDataEntries This will make upload test step in CQ faster for the test results having many records like below. https://pantheon.corp.google.com/traces/details/b40b8c2e81f4d402f99d67b7bc936189?spanId=-4956720531504612275&project=test-results-hrd#TIMELINE Bug: 872645 Change-Id: Id7c9058258f5602a7b50d09cc1d3bb413a17d762 Reviewed-on: https://chromium-review.googlesource.com/1170730 Reviewed-by: Sean McCullough <seanmccullough@chromium.org> Reviewed-by: Stephen Martinis <martiniss@chromium.org> Commit-Queue: Takuto Ikuta <tikuta@chromium.org> [modify] https://crrev.com/a728fea2174b22ec068c1b7cd59f38173e495a29/go/src/infra/appengine/test-results/model/test_file.go
,
Aug 16
I updated test-results service including 2 my patches.
,
Aug 16
Updated version seems to cause many 400 error. https://ci.chromium.org/p/chromium/builders/luci.chromium.try/linux_chromium_rel_ng/166407 I rollbacked to 15929-d0934b2.
,
Aug 16
I see the increased 400s in viceroy: http://shortn/_Z0Yk3SgSle Thanks for the quick rollback. Did you find out what was causing the 400s?
,
Aug 16
I guess that due to re-using buffer. https://chromium-review.googlesource.com/c/infra/infra/+/1170730#message-c4b2d3d4f50ff523552249540f399a30670837a6 Made fix in https://chromium-review.googlesource.com/c/infra/infra/+/1179102
,
Aug 17
Updated version to 17119-a728fea-tainted-tikuta from 17021-9d1383b-tainted-tikuta in staging. Will land the patch next week. https://chromium-review.googlesource.com/c/infra/infra/+/1179102
,
Aug 17
Broken data keeps causing 400. http://shortn/_hhAbo6sRM1 I guess, we need to try deletion of corrupted data when we failed to parse json around here. https://cs.chromium.org/chromium/infra/go/src/infra/appengine/test-results/frontend/upload.go?l=441&rcl=39cd8185c585ff84bade1638d80808386f5d2788
,
Aug 17
The following revision refers to this bug: https://chromium.googlesource.com/infra/infra/+/6f3fc80143378fedb46ba663cc972f6985d817a0 commit 6f3fc80143378fedb46ba663cc972f6985d817a0 Author: Takuto Ikuta <tikuta@chromium.org> Date: Fri Aug 17 08:17:17 2018 Revert "[test-results] Parallelize datastore Put in writeDataEntries" This reverts commit a728fea2174b22ec068c1b7cd59f38173e495a29. Reason for revert: https://bugs.chromium.org/p/chromium/issues/detail?id=872645#c15 I will make fix for the comment. Original change's description: > [test-results] Parallelize datastore Put in writeDataEntries > > This will make upload test step in CQ faster for the test results having many records like below. > https://pantheon.corp.google.com/traces/details/b40b8c2e81f4d402f99d67b7bc936189?spanId=-4956720531504612275&project=test-results-hrd#TIMELINE > > Bug: 872645 > Change-Id: Id7c9058258f5602a7b50d09cc1d3bb413a17d762 > Reviewed-on: https://chromium-review.googlesource.com/1170730 > Reviewed-by: Sean McCullough <seanmccullough@chromium.org> > Reviewed-by: Stephen Martinis <martiniss@chromium.org> > Commit-Queue: Takuto Ikuta <tikuta@chromium.org> TBR=seanmccullough@chromium.org,martiniss@chromium.org,tikuta@chromium.org Change-Id: I204ee3c6a747a5c141af7a21453a29b27c76dc35 No-Presubmit: true No-Tree-Checks: true No-Try: true Bug: 872645 Reviewed-on: https://chromium-review.googlesource.com/1179581 Reviewed-by: Takuto Ikuta <tikuta@chromium.org> Commit-Queue: Takuto Ikuta <tikuta@chromium.org> [modify] https://crrev.com/6f3fc80143378fedb46ba663cc972f6985d817a0/go/src/infra/appengine/test-results/model/test_file.go
,
Aug 17
Are we reporting errors somewhere for data that can't be parsed? That shouldn't be happening, and we should be looking into those errors as bugs.
,
Aug 17
The following revision refers to this bug: https://chromium.googlesource.com/infra/infra/+/fa5bceb38f7a1ec6a0cd6338a4d93f5be01704b3 commit fa5bceb38f7a1ec6a0cd6338a4d93f5be01704b3 Author: Takuto Ikuta <tikuta@chromium.org> Date: Fri Aug 17 21:53:03 2018 [test-results] Delete test result when failed to parse json When json is corrupted, we'd like to delete such entries. Bug: 872645 Change-Id: I64d8b68824ecd2ed8d5b8e92d377006da5e8df22 Reviewed-on: https://chromium-review.googlesource.com/1179547 Reviewed-by: Sean McCullough <seanmccullough@chromium.org> Commit-Queue: Takuto Ikuta <tikuta@chromium.org> [modify] https://crrev.com/fa5bceb38f7a1ec6a0cd6338a4d93f5be01704b3/go/src/infra/appengine/test-results/frontend/upload_test.go [modify] https://crrev.com/fa5bceb38f7a1ec6a0cd6338a4d93f5be01704b3/go/src/infra/appengine/test-results/frontend/upload.go
,
Aug 17
#17, I don't know whether there is some alert for that type of failure. But if we set alert for high qps of 4xx, that can cover such things. Anyway I deployed fix to prod. So 4xx will be reduced gradually.
,
Aug 21
The following revision refers to this bug: https://chromium.googlesource.com/infra/infra/+/172fe908840d73b83d424b51a0bdb6f44867de2c commit 172fe908840d73b83d424b51a0bdb6f44867de2c Author: Takuto Ikuta <tikuta@chromium.org> Date: Tue Aug 21 04:15:49 2018 [test-results] Parallelize datastore Put in writeDataEntries This is re-land of https://chromium-review.googlesource.com/c/infra/infra/+/1170730 This contains fix for the bug pointed out in https://chromium-review.googlesource.com/c/infra/infra/+/1170730#message-c4b2d3d4f50ff523552249540f399a30670837a6 We cannot use buf without copying. Also we need to keep order of keys returned by writeDataEntries. Bug: 872645 Change-Id: I65d972dd57347edfe9ea9a0b9134d80793ed8c43 Reviewed-on: https://chromium-review.googlesource.com/1179102 Commit-Queue: Takuto Ikuta <tikuta@chromium.org> Reviewed-by: Sean McCullough <seanmccullough@chromium.org> [modify] https://crrev.com/172fe908840d73b83d424b51a0bdb6f44867de2c/go/src/infra/appengine/test-results/model/test_file.go [modify] https://crrev.com/172fe908840d73b83d424b51a0bdb6f44867de2c/go/src/infra/appengine/test-results/model/test_file_test.go
,
Aug 21
I updated prod with #20. Anyway, latency for /testfile/upload is improved largely by parallelized bigquery access. https://screenshot.googleplex.com/huJxK4E1fgQ https://pantheon.corp.google.com/traces/tasks/5466587825796156494?project=test-results-hrd&folder&organizationId=433637338589
,
Aug 21
You mean parallelized datastore (not bigquery) access, correct? Also, nice savings!
,
Aug 21
#22, I actually mean parallelized data upload to bigquery by #7. Currently some requests seem taking time on json decode around below yet. https://cs.chromium.org/chromium/infra/go/src/infra/appengine/test-results/frontend/upload.go?l=438&rcl=70aa28683e96030d62d3d9e7e671ee33c0b1258c https://pantheon.corp.google.com/traces/traces?folder=&organizationId=433637338589&project=test-results-hrd&q=%2Ftestfile%2Fupload&tid=13bc8d13204dd0e50e0a727b420ad0bc
,
Aug 24
The following revision refers to this bug: https://chromium.googlesource.com/infra/infra/+/32f7dc2b7fa9e330b6bbcf37c9fed82330f111eb commit 32f7dc2b7fa9e330b6bbcf37c9fed82330f111eb Author: Takuto Ikuta <tikuta@chromium.org> Date: Fri Aug 24 04:39:52 2018 [test-results] add some log to know slow part The trace with log is shown like below. https://screenshot.googleplex.com/ociJUkn8hhz I will use this information for future optimization. Bug: 872645 Change-Id: I27c51e1ca8f04f06a2e8c504b6715781edabcb77 Reviewed-on: https://chromium-review.googlesource.com/1184946 Reviewed-by: Sean McCullough <seanmccullough@chromium.org> [modify] https://crrev.com/32f7dc2b7fa9e330b6bbcf37c9fed82330f111eb/go/src/infra/appengine/test-results/frontend/event_logging.go [modify] https://crrev.com/32f7dc2b7fa9e330b6bbcf37c9fed82330f111eb/go/src/infra/appengine/test-results/frontend/upload.go
,
Aug 24
#24, I deployed the change to prod. |
||
►
Sign in to add a comment |
||
Comment 1 by martiniss@chromium.org
, Aug 9