
Issue 850130

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Nov 19
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Feature




Make PreCQ Launcher resilient to Gerrit/GoB flakiness

Project Member Reported by dhanyaganesh@chromium.org, Jun 6 2018

Issue description

The CL was picked up by the PreCQ on June 5 at 03:55 PM. The runs seem to have passed without any issues, but the PreCQ did not mark the CL as verified until 08:45 AM the next day.

Gerrit: https://chromium-review.googlesource.com/c/chromiumos/chromite/+/1087776

preCQ run: https://chromeos-cl-viewer-ui.googleplex.com/cl_status/chromium-review.googlesource.com/1087776/1
 
Labels: -Pri-3 Pri-1
Another CL affected with same symptoms: https://chromium-review.googlesource.com/c/chromiumos/overlays/portage-stable/+/1080279

It seems that commenting on an affected CL causes the PreCQ launcher to realize that the testing is completed.
Yep. Seems like a trend. But why would the PreCQ test infra care about Gerrit comments?
Potentially related to https://b.corp.google.com/issues/68258327; investigating.
Need to pull the CL Action history table.
Issue 810760 has been merged into this issue.
Status: Started (was: Assigned)
Components: Infra>Client>ChromeOS>CI
Here's the CL Action table:

mysql> SELECT change_source, action, reason, timestamp FROM clActionTable WHERE change_number = '1087776' AND timestamp <= '2018-06-06 14:45:24';
+---------------+---------------------------+--------------------------------+---------------------+
| change_source | action                    | reason                         | timestamp           |
+---------------+---------------------------+--------------------------------+---------------------+
| external      | speculative               | NULL                           | 2018-06-05 21:36:13 |
| external      | validation_pending_pre_cq | daisy_spring-no-vmtest-pre-cq  | 2018-06-05 21:36:29 |
| external      | validation_pending_pre_cq | betty-pre-cq                   | 2018-06-05 21:36:29 |
| external      | validation_pending_pre_cq | cyan-no-vmtest-pre-cq          | 2018-06-05 21:36:29 |
| external      | validation_pending_pre_cq | nyan_blaze-no-vmtest-pre-cq    | 2018-06-05 21:36:29 |
| external      | validation_pending_pre_cq | eve-no-vmtest-pre-cq           | 2018-06-05 21:36:29 |
| external      | validation_pending_pre_cq | reef-no-vmtest-pre-cq          | 2018-06-05 21:36:29 |
| external      | validation_pending_pre_cq | whirlwind-no-vmtest-pre-cq     | 2018-06-05 21:36:29 |
| external      | validation_pending_pre_cq | zako-no-vmtest-pre-cq          | 2018-06-05 21:36:29 |
| external      | validation_pending_pre_cq | samus-no-vmtest-pre-cq         | 2018-06-05 21:36:29 |
| external      | validation_pending_pre_cq | kevin-arcnext-no-vmtest-pre-cq | 2018-06-05 21:36:29 |
| external      | validation_pending_pre_cq | chromite-pre-cq                | 2018-06-05 21:36:29 |
| external      | screened_for_pre_cq       | NULL                           | 2018-06-05 21:36:29 |
| external      | trybot_launching          | daisy_spring-no-vmtest-pre-cq  | 2018-06-05 21:38:55 |
| external      | trybot_launching          | betty-pre-cq                   | 2018-06-05 21:38:55 |
| external      | trybot_launching          | cyan-no-vmtest-pre-cq          | 2018-06-05 21:38:55 |
| external      | trybot_launching          | nyan_blaze-no-vmtest-pre-cq    | 2018-06-05 21:38:55 |
| external      | trybot_launching          | eve-no-vmtest-pre-cq           | 2018-06-05 21:38:55 |
| external      | trybot_launching          | reef-no-vmtest-pre-cq          | 2018-06-05 21:38:55 |
| external      | trybot_launching          | whirlwind-no-vmtest-pre-cq     | 2018-06-05 21:38:55 |
| external      | trybot_launching          | zako-no-vmtest-pre-cq          | 2018-06-05 21:38:55 |
| external      | trybot_launching          | samus-no-vmtest-pre-cq         | 2018-06-05 21:38:55 |
| external      | trybot_launching          | kevin-arcnext-no-vmtest-pre-cq | 2018-06-05 21:38:55 |
| external      | trybot_launching          | chromite-pre-cq                | 2018-06-05 21:38:55 |
| external      | picked_up                 | NULL                           | 2018-06-05 21:43:32 |
| external      | picked_up                 | NULL                           | 2018-06-05 21:50:10 |
| external      | picked_up                 | NULL                           | 2018-06-05 21:50:38 |
| external      | picked_up                 | NULL                           | 2018-06-05 21:50:46 |
| external      | picked_up                 | NULL                           | 2018-06-05 21:50:49 |
| external      | picked_up                 | NULL                           | 2018-06-05 21:51:13 |
| external      | picked_up                 | NULL                           | 2018-06-05 21:51:13 |
| external      | picked_up                 | NULL                           | 2018-06-05 21:51:20 |
| external      | picked_up                 | NULL                           | 2018-06-05 21:51:26 |
| external      | picked_up                 | NULL                           | 2018-06-05 21:51:26 |
| external      | picked_up                 | NULL                           | 2018-06-05 21:51:31 |
| external      | pre_cq_inflight           | NULL                           | 2018-06-05 21:53:18 |
| external      | verified                  | NULL                           | 2018-06-05 22:01:01 |
| external      | verified                  | NULL                           | 2018-06-05 23:05:04 |
| external      | verified                  | NULL                           | 2018-06-05 23:10:14 |
| external      | verified                  | NULL                           | 2018-06-05 23:11:25 |
| external      | verified                  | NULL                           | 2018-06-05 23:14:24 |
| external      | verified                  | NULL                           | 2018-06-05 23:17:15 |
| external      | verified                  | NULL                           | 2018-06-05 23:20:06 |
| external      | verified                  | NULL                           | 2018-06-05 23:25:37 |
| external      | verified                  | NULL                           | 2018-06-05 23:25:41 |
| external      | verified                  | NULL                           | 2018-06-05 23:34:56 |
| external      | verified                  | NULL                           | 2018-06-06 00:04:03 |
| external      | requeued                  | NULL                           | 2018-06-06 14:45:16 |
| external      | pre_cq_fully_verified     | NULL                           | 2018-06-06 14:45:18 |
| external      | pre_cq_passed             | NULL                           | 2018-06-06 14:45:24 |
+---------------+---------------------------+--------------------------------+---------------------+
50 rows in set (0.02 sec)

The next step is probably to pull the PreCQ Launcher logs for this time frame.
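
For reference, the stall visible in the table above can be quantified directly from its timestamps. The sketch below is only illustrative arithmetic on two rows copied from the query output (the last `verified` row and the `requeued` row); the variable names are mine, and it assumes both timestamps are in the same time zone.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the clActionTable output above.
last_verified = datetime.strptime("2018-06-06 00:04:03", FMT)  # final 'verified' row
requeued = datetime.strptime("2018-06-06 14:45:16", FMT)       # 'requeued' row

# Time during which all trybots had reported but the launcher took no action.
stall = requeued - last_verified
print(stall)  # 14:41:13
```

So the launcher sat idle for well over 14 hours after the last trybot result, and `pre_cq_fully_verified` followed within seconds of the requeue.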

Labels: -Pri-1 -Type-Bug Pri-2 Type-Feature
Owner: ----
Status: Available (was: Started)
It looks like there were Gerrit/GoB problems around that time (PDT): 

16:19:18: INFO: RunCommand: repo --time sync --force-sync '--cache-dir=/b/swarming/w/ir/cache/git' -n in /b/swarming/w/ir/cache/cbuild/repository
...
fatal: unable to access 'https://chrome-internal.googlesource.com/chromeos/overlays/overlay-kefka-private/': The requested URL returned error: 502
[W git.go:283] Transient error string identified in STDERR: "fatal: unable to access 'https://chrome-internal.googlesource.com/chromeos/overlays/overlay-kefka-private/': The requested URL returned error: 502\n"
[W git.go:294] Retrying after 3s (rc=128): transient error string encountered
fatal: unable to access 'https://chromium.googlesource.com/a/chromiumos/third_party/bluez/': The requested URL returned error: 502
[W git.go:283] Transient error string identified in STDERR: "fatal: unable to access 'https://chromium.googlesource.com/a/chromiumos/third_party/bluez/': The requested URL returned error: 502\n"
[W git.go:294] Retrying after 3s (rc=128): transient error string encountered
Fetching project chromiumos/third_party/kernel
[W git.go:317] Command completed with rc 0 after 1 transient failure(s).
[W git.go:317] Command completed with rc 0 after 1 transient failure(s).
Fetching project chromiumos/third_party/kernel
fatal: unable to access 'https://chrome-internal.googlesource.com/chromeos/overlays/overlay-arkham-private/': The requested URL returned error: 502
[W git.go:283] Transient error string identified in STDERR: "fatal: unable to access 'https://chrome-internal.googlesource.com/chromeos/overlays/overlay-arkham-private/': The requested URL returned error: 502\n"
[W git.go:294] Retrying after 3s (rc=128): transient error string encountered
fatal: unable to access 'https://chrome-internal.googlesource.com/chromeos/overlays/overlay-cranky-private/': The requested URL returned error: 502
[W git.go:283] Transient error string identified in STDERR: "fatal: unable to access 'https://chrome-internal.googlesource.com/chromeos/overlays/overlay-cranky-private/': The requested URL returned error: 502\n"
[W git.go:294] Retrying after 3s (rc=128): transient error string encountered
fatal: unable to access 'https://chromium.googlesource.com/a/chromiumos/platform/factory/': The requested URL returned error: 502
[W git.go:283] Transient error string identified in STDERR: "fatal: unable to access 'https://chromium.googlesource.com/a/chromiumos/platform/factory/': The requested URL returned error: 502\n"
[W git.go:294] Retrying after 3s (rc=128): transient error string encountered
[W git.go:317] Command completed with rc 0 after 1 transient failure(s).
[W git.go:317] Command completed with rc 0 after 1 transient failure(s).
[W git.go:317] Command completed with rc 0 after 1 transient failure(s).
fatal: unable to access 'https://chrome-internal.googlesource.com/chromeos/overlays/overlay-cave-private/': The requested URL returned error: 502
[W git.go:283] Transient error string identified in STDERR: "fatal: unable to access 'https://chrome-internal.googlesource.com/chromeos/overlays/overlay-cave-private/': The requested URL returned error: 502\n"
[W git.go:294] Retrying after 3s (rc=128): transient error string encountered

The problems continued until almost an hour later:

17:13:25: INFO: RunCommand: repo --time sync --force-sync '--cache-dir=/b/swarming/w/ir/cache/git' -n in /b/swarming/w/ir/cache/cbuild/repository
...
fatal: remote error: Git repository not found
fatal: remote error: Git repository not found
fatal: remote error: Git repository not found
fatal: remote error: Git repository not found
...
17:16:53: WARNING: A transient error occured while querying chrome-internal-review.googlesource.com:
GET /a/changes/?q=commit:f0fcbaf927527c17e78e1044f5f8d04fb6ae6627&n=500&o=DETAILED_ACCOUNTS&o=ALL_REVISIONS&o=DETAILED_LABELS&o=CURRENT_COMMIT&o=CURRENT_REVISION HTTP/1.1
HTTP/1.1 500 Internal Server Error
Response body: 'Internal server error\n'
...
17:29:18: INFO: RunCommand: repo --time sync --force-sync '--cache-dir=/b/swarming/w/ir/cache/git' -n in /b/swarming/w/ir/cache/cbuild/repository
fatal: unable to access 'https://chromium.googlesource.com/a/chromiumos/platform/firmware/': The requested URL returned error: 502
[W git.go:283] Transient error string identified in STDERR: "fatal: unable to access 'https://chromium.googlesource.com/a/chromiumos/platform/firmware/': The requested URL returned error: 502\n"

We need to make the PreCQ Launcher more resilient to this kind of problem by retrying on these errors.
Summary: Make PreCQ Launcher resilient to Gerrit/GoB flakiness (was: PreCQ run was not verified for 17 hours)
Owner: dhanyaganesh@chromium.org
Status: Assigned (was: Available)
Dhanya, let's try to implement a quick fix here while we're working on the other design work.
If you start to delve deeply into the code, I have a lot of suggestions about ways to simplify it....
The error 502s seem to have worked themselves out after retries.
Do we get "fatal: remote error: Git repository not found" errors because of GoB quota issues? Also, does it make sense to retry for this one?
"Git repository not found" is due to infrastructure failures on the GoB side, not quota.
https://docs.google.com/spreadsheets/d/1GOLrmt6R_8qiN39K77vL0TJpt0dbJFYM6SDdquo_JKI/edit?usp=sharing

Adding retry messages based on this spreadsheet[@google-restricted].
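
As a rough illustration of the approach (not the actual git wrapper code, which is Go and lives in retry_regexp.go), retrying on known transient messages could look like the sketch below. The two patterns are copied from the log excerpts earlier in this thread; treating "Git repository not found" as retryable is an assumption based on the comment above that it is a GoB infrastructure failure. `is_transient`, `retry_transient` and the default parameter values are hypothetical names and values chosen for this example.

```python
import re
import time

# Messages seen in the logs in this thread. The real list is maintained
# elsewhere; these two are only examples.
TRANSIENT_PATTERNS = [
    r"The requested URL returned error: 502",
    r"remote error: Git repository not found",  # assumed retryable
]
TRANSIENT_RE = re.compile("|".join(f"(?:{p})" for p in TRANSIENT_PATTERNS))


def is_transient(stderr):
    """Return True if stderr contains a known transient error message."""
    return TRANSIENT_RE.search(stderr) is not None


def retry_transient(run, retries=12, delay=3.0, multiplier=1.5, sleep=time.sleep):
    """Call run() until it succeeds, fails non-transiently, or retries run out.

    run() must return a (returncode, stderr_text) tuple.
    Returns (final returncode, number of transient failures retried).
    """
    for attempt in range(retries + 1):
        rc, stderr = run()
        if rc == 0 or not is_transient(stderr) or attempt == retries:
            return rc, attempt
        sleep(delay)          # back off before the next attempt
        delay *= multiplier   # exponential growth of the wait
```

Here `run` would wrap the real git or repo invocation; passing `sleep` in as a parameter keeps the helper testable without actually waiting.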
Issue 744569 has been merged into this issue.
Project Member

Comment 17 by bugdroid1@chromium.org, Nov 15

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/c9c8a52bfeaf8bc00ece22fdfd447822c8fcad77

commit c9c8a52bfeaf8bc00ece22fdfd447822c8fcad77
Author: Dhanya Ganesh <dhanyaganesh@chromium.org>
Date: Thu Nov 15 21:29:51 2018

git-wrapper: Additional retry messages for GoB flakiness

Added 4 additional messages that have been observed during
GoB flakiness and quota issues. Increased the number of retries
from 10 to 12. This should increase the retry time from 5.6
minutes to 12+ minutes.

BUG= chromium:850130 
TEST=test.py

Change-Id: I92846c1ab9ff2a30a87c247fb62c9f4e899c8a0d
Reviewed-on: https://chromium-review.googlesource.com/c/1332328
Commit-Queue: Robbie Iannucci <iannucci@chromium.org>
Reviewed-by: Robbie Iannucci <iannucci@chromium.org>
Cr-Commit-Position: refs/heads/master@{#19028}
[modify] https://crrev.com/c9c8a52bfeaf8bc00ece22fdfd447822c8fcad77/go/src/infra/tools/git/retry_regexp_test.go
[modify] https://crrev.com/c9c8a52bfeaf8bc00ece22fdfd447822c8fcad77/go/src/infra/tools/git/retry_regexp.go
[modify] https://crrev.com/c9c8a52bfeaf8bc00ece22fdfd447822c8fcad77/go/src/infra/tools/git/main.go
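
The retry-time figures in the commit message above can be sanity-checked with a short calculation. This sketch assumes exponential backoff starting at 3 seconds (the delay shown in the "Retrying after 3s" log lines) with a multiplier of 1.5. The multiplier is not stated anywhere in this thread; it is an assumption that happens to match the quoted numbers. `total_backoff_seconds` is a hypothetical helper name.

```python
def total_backoff_seconds(retries, first_delay=3.0, multiplier=1.5):
    """Sum of the waits before each retry under exponential backoff."""
    return sum(first_delay * multiplier ** i for i in range(retries))


for n in (10, 12):
    minutes = total_backoff_seconds(n) / 60
    print(f"{n} retries: about {minutes:.2f} minutes of waiting")
```

Under these assumptions, 10 retries add up to roughly 340 seconds and 12 retries to roughly 772 seconds, in line with the "5.6 minutes" and "12+ minutes" in the commit message.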

Project Member

Comment 18 by bugdroid1@chromium.org, Nov 16

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/da5491be675c79d05750cef9051038aa905ddaff

commit da5491be675c79d05750cef9051038aa905ddaff
Author: Robert Iannucci <iannucci@chromium.org>
Date: Fri Nov 16 21:57:25 2018

Status: Fixed (was: Assigned)
Closing this. If any new messages need to be added, add me to the new bug.
Status: Assigned (was: Fixed)
This is actually waiting for one more roll to make it to production: 

https://chrome-internal-review.googlesource.com/c/infradata/config/+/719059
Project Member

Comment 21 by bugdroid1@chromium.org, Nov 19

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/fef0eb079005f3a89c9580528f30849c26f20cb8

commit fef0eb079005f3a89c9580528f30849c26f20cb8
Author: Robert Iannucci <iannucci@chromium.org>
Date: Mon Nov 19 21:51:32 2018

Status: Fixed (was: Assigned)
