New issue
Advanced search Search tips

Issue 872645 link

Starred by 1 user

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug



Sign in to add a comment

Upload to test-results steps are slow

Project Member Reported by tikuta@chromium.org, Aug 9

Issue description

In many CQ builds, total time of upload test result steps become more than 10 minute.
https://pantheon.corp.google.com/bigquery?project=cr-buildbucket&j=bquxjob_5ce5df07_1651df71751&page=queryresults

test upload steps consumed 10~35% of linux_chromium_rel_ng builder's cycle time.

This is the slowest test upload steps.
https://pantheon.corp.google.com/bigquery?project=cr-buildbucket&j=bquxjob_15036e85_1651df8da2a&page=queryresults

Why many upload steps took more than 10 seconds?
Cannot be optimized?
 
I looked into this a while ago, and it just looks like there's a lot of data being sent and dumped into the datastore. I don't know if there's much you can do about this. I know that the upload times should be much better when we move to test trove, which sean would know more about.

Re-ordering the test collection times would help with this, because while it would take 10 seconds to upload tests, it theoretically wouldn't affect overall cycle time.
@martiniss - as long as the longest test step takes longer than (# of test steps) * 10s, that's true :).

It would be good to be sure we understood where the time was being spent: how much of it is server-side in test-results vs. the network vs. client-side, at least.
I added a bit of logging to look at this.

https://logs.chromium.org/logs/chromium/buildbucket/cr-buildbucket.appspot.com/8938643118147300432/+/steps/Upload_to_test-results__content_browsertests__with_patch__/0/stdout is a random longer upload. I've attached a screenshot showing the server side log of this upload call. 

It takes roughly 100 ms for the request to start (469 - 376). So I don't think network latency is a big issue.
The majority of the time is spent reading stuff from datastore, and I think changing that requires re architecting the webapp.
O48bbomogAu.png
241 KB View Download
I worked on this a bit in https://crbug.com/795118, but never really got anywhere.
I wonder improving the time for datastore requires large change like re-architecturing webapp.

Can we parallelize some api calls by using go routine in this file?
https://cs.chromium.org/chromium/infra/go/src/infra/appengine/test-results/model/test_file.go

https://pantheon.corp.google.com/traces/traces?project=test-results-hrd&tid=1183cc1173c7a1302e0caacb6a52867f
Also this trace suggests us to use batch put request.

Owner: tikuta@chromium.org
Status: Assigned (was: Untriaged)
Let me take a look some api calls.
Project Member

Comment 7 by bugdroid1@chromium.org, Aug 10

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-go.git/+/6699ca0cc494721279554bd5a957b81a551a3eb5

commit 6699ca0cc494721279554bd5a957b81a551a3eb5
Author: Takuto Ikuta <tikuta@chromium.org>
Date: Fri Aug 10 21:53:22 2018

[bq] Upload data to bigquery concurrently

This will make upload test step in CQ faster for the test results having many records like below.
https://pantheon.corp.google.com/traces/details/b40b8c2e81f4d402f99d67b7bc936189?spanId=-4956720531504612275&project=test-results-hrd#TIMELINE

Bug: 872645
Change-Id: I5eb57fe2ef788fd57253f1b78343a71091a0837a
Reviewed-on: https://chromium-review.googlesource.com/1170725
Commit-Queue: Takuto Ikuta <tikuta@chromium.org>
Reviewed-by: Nodir Turakulov <nodir@chromium.org>

[modify] https://crrev.com/6699ca0cc494721279554bd5a957b81a551a3eb5/common/bq/eventupload.go

Nice find.
Project Member

Comment 9 by bugdroid1@chromium.org, Aug 16

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/a728fea2174b22ec068c1b7cd59f38173e495a29

commit a728fea2174b22ec068c1b7cd59f38173e495a29
Author: Takuto Ikuta <tikuta@chromium.org>
Date: Thu Aug 16 21:59:43 2018

[test-results] Parallelize datastore Put in writeDataEntries

This will make upload test step in CQ faster for the test results having many records like below.
https://pantheon.corp.google.com/traces/details/b40b8c2e81f4d402f99d67b7bc936189?spanId=-4956720531504612275&project=test-results-hrd#TIMELINE

Bug: 872645
Change-Id: Id7c9058258f5602a7b50d09cc1d3bb413a17d762
Reviewed-on: https://chromium-review.googlesource.com/1170730
Reviewed-by: Sean McCullough <seanmccullough@chromium.org>
Reviewed-by: Stephen Martinis <martiniss@chromium.org>
Commit-Queue: Takuto Ikuta <tikuta@chromium.org>

[modify] https://crrev.com/a728fea2174b22ec068c1b7cd59f38173e495a29/go/src/infra/appengine/test-results/model/test_file.go

I updated test-results service including 2 my patches.

Updated version seems to cause many 400 error.
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/linux_chromium_rel_ng/166407

I rollbacked to 15929-d0934b2.
I see the increased 400s in viceroy: http://shortn/_Z0Yk3SgSle

Thanks for the quick rollback. Did you find out what was causing the 400s?
Updated version to 17119-a728fea-tainted-tikuta from 17021-9d1383b-tainted-tikuta in staging.
Will land the patch next week.
https://chromium-review.googlesource.com/c/infra/infra/+/1179102
Broken data keeps causing 400.
http://shortn/_hhAbo6sRM1

I guess, we need to try deletion of corrupted data when we failed to parse json around here.
https://cs.chromium.org/chromium/infra/go/src/infra/appengine/test-results/frontend/upload.go?l=441&rcl=39cd8185c585ff84bade1638d80808386f5d2788
Project Member

Comment 16 by bugdroid1@chromium.org, Aug 17

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/6f3fc80143378fedb46ba663cc972f6985d817a0

commit 6f3fc80143378fedb46ba663cc972f6985d817a0
Author: Takuto Ikuta <tikuta@chromium.org>
Date: Fri Aug 17 08:17:17 2018

Revert "[test-results] Parallelize datastore Put in writeDataEntries"

This reverts commit a728fea2174b22ec068c1b7cd59f38173e495a29.

Reason for revert: https://bugs.chromium.org/p/chromium/issues/detail?id=872645#c15
I will make fix for the comment.

Original change's description:
> [test-results] Parallelize datastore Put in writeDataEntries
> 
> This will make upload test step in CQ faster for the test results having many records like below.
> https://pantheon.corp.google.com/traces/details/b40b8c2e81f4d402f99d67b7bc936189?spanId=-4956720531504612275&project=test-results-hrd#TIMELINE
> 
> Bug: 872645
> Change-Id: Id7c9058258f5602a7b50d09cc1d3bb413a17d762
> Reviewed-on: https://chromium-review.googlesource.com/1170730
> Reviewed-by: Sean McCullough <seanmccullough@chromium.org>
> Reviewed-by: Stephen Martinis <martiniss@chromium.org>
> Commit-Queue: Takuto Ikuta <tikuta@chromium.org>

TBR=seanmccullough@chromium.org,martiniss@chromium.org,tikuta@chromium.org

Change-Id: I204ee3c6a747a5c141af7a21453a29b27c76dc35
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Bug: 872645
Reviewed-on: https://chromium-review.googlesource.com/1179581
Reviewed-by: Takuto Ikuta <tikuta@chromium.org>
Commit-Queue: Takuto Ikuta <tikuta@chromium.org>

[modify] https://crrev.com/6f3fc80143378fedb46ba663cc972f6985d817a0/go/src/infra/appengine/test-results/model/test_file.go

Are we reporting errors somewhere for data that can't be parsed? That shouldn't be happening, and we should be looking into those errors as bugs.
Project Member

Comment 18 by bugdroid1@chromium.org, Aug 17

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/fa5bceb38f7a1ec6a0cd6338a4d93f5be01704b3

commit fa5bceb38f7a1ec6a0cd6338a4d93f5be01704b3
Author: Takuto Ikuta <tikuta@chromium.org>
Date: Fri Aug 17 21:53:03 2018

[test-results] Delete test result when failed to parse json

When json is corrupted, we'd like to delete such entries.

Bug: 872645
Change-Id: I64d8b68824ecd2ed8d5b8e92d377006da5e8df22
Reviewed-on: https://chromium-review.googlesource.com/1179547
Reviewed-by: Sean McCullough <seanmccullough@chromium.org>
Commit-Queue: Takuto Ikuta <tikuta@chromium.org>

[modify] https://crrev.com/fa5bceb38f7a1ec6a0cd6338a4d93f5be01704b3/go/src/infra/appengine/test-results/frontend/upload_test.go
[modify] https://crrev.com/fa5bceb38f7a1ec6a0cd6338a4d93f5be01704b3/go/src/infra/appengine/test-results/frontend/upload.go

#17, I don't know whether there is some alert for that type of failure.
But if we set alert for high qps of 4xx, that can cover such things.

Anyway I deployed fix to prod. So 4xx will be reduced gradually.

Project Member

Comment 20 by bugdroid1@chromium.org, Aug 21

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/172fe908840d73b83d424b51a0bdb6f44867de2c

commit 172fe908840d73b83d424b51a0bdb6f44867de2c
Author: Takuto Ikuta <tikuta@chromium.org>
Date: Tue Aug 21 04:15:49 2018

[test-results] Parallelize datastore Put in writeDataEntries

This is re-land of
https://chromium-review.googlesource.com/c/infra/infra/+/1170730

This contains fix for the bug pointed out in
https://chromium-review.googlesource.com/c/infra/infra/+/1170730#message-c4b2d3d4f50ff523552249540f399a30670837a6

We cannot use buf without copying.

Also we need to keep order of keys returned by writeDataEntries.

Bug: 872645
Change-Id: I65d972dd57347edfe9ea9a0b9134d80793ed8c43
Reviewed-on: https://chromium-review.googlesource.com/1179102
Commit-Queue: Takuto Ikuta <tikuta@chromium.org>
Reviewed-by: Sean McCullough <seanmccullough@chromium.org>

[modify] https://crrev.com/172fe908840d73b83d424b51a0bdb6f44867de2c/go/src/infra/appengine/test-results/model/test_file.go
[modify] https://crrev.com/172fe908840d73b83d424b51a0bdb6f44867de2c/go/src/infra/appengine/test-results/model/test_file_test.go

I updated prod with #20.

Anyway, latency for /testfile/upload is improved largely by parallelized bigquery access.
https://screenshot.googleplex.com/huJxK4E1fgQ
https://pantheon.corp.google.com/traces/tasks/5466587825796156494?project=test-results-hrd&folder&organizationId=433637338589
You mean parallelized datastore (not bigquery) access, correct? Also, nice savings!
Project Member

Comment 24 by bugdroid1@chromium.org, Aug 24

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/32f7dc2b7fa9e330b6bbcf37c9fed82330f111eb

commit 32f7dc2b7fa9e330b6bbcf37c9fed82330f111eb
Author: Takuto Ikuta <tikuta@chromium.org>
Date: Fri Aug 24 04:39:52 2018

[test-results] add some log to know slow part

The trace with log is shown like below.
https://screenshot.googleplex.com/ociJUkn8hhz

I will use this information for future optimization.

Bug: 872645
Change-Id: I27c51e1ca8f04f06a2e8c504b6715781edabcb77
Reviewed-on: https://chromium-review.googlesource.com/1184946
Reviewed-by: Sean McCullough <seanmccullough@chromium.org>

[modify] https://crrev.com/32f7dc2b7fa9e330b6bbcf37c9fed82330f111eb/go/src/infra/appengine/test-results/frontend/event_logging.go
[modify] https://crrev.com/32f7dc2b7fa9e330b6bbcf37c9fed82330f111eb/go/src/infra/appengine/test-results/frontend/upload.go

#24, I deployed the change to prod.

Sign in to add a comment