Make PreCQ Launcher resilient to Gerrit/GoB flakiness
Issue description
The CL was picked up by the PreCQ on June 05 at 03:55 PM. The runs appear to have passed without any issues, but the PreCQ did not mark the CL as verified until 08:45 AM the next day.
Gerrit: https://chromium-review.googlesource.com/c/chromiumos/chromite/+/1087776
PreCQ run: https://chromeos-cl-viewer-ui.googleplex.com/cl_status/chromium-review.googlesource.com/1087776/1
,
Jun 6 2018
Yep. Seems like a trend. But why would the PreCQ test infra care about Gerrit comments?
,
Jun 6 2018
Potentially related to https://b.corp.google.com/issues/68258327; investigating.
,
Jun 6 2018
Need to pull the CL Action history table.
,
Jun 6 2018
Issue 810760 has been merged into this issue.
,
Jun 6 2018
,
Jun 8 2018
Here's the CL Action table:

mysql> SELECT change_source, action, reason, timestamp FROM clActionTable WHERE change_number = '1087776' AND timestamp <= '2018-06-06 14:45:24';
+---------------+---------------------------+--------------------------------+---------------------+
| change_source | action                    | reason                         | timestamp           |
+---------------+---------------------------+--------------------------------+---------------------+
| external      | speculative               | NULL                           | 2018-06-05 21:36:13 |
| external      | validation_pending_pre_cq | daisy_spring-no-vmtest-pre-cq  | 2018-06-05 21:36:29 |
| external      | validation_pending_pre_cq | betty-pre-cq                   | 2018-06-05 21:36:29 |
| external      | validation_pending_pre_cq | cyan-no-vmtest-pre-cq          | 2018-06-05 21:36:29 |
| external      | validation_pending_pre_cq | nyan_blaze-no-vmtest-pre-cq    | 2018-06-05 21:36:29 |
| external      | validation_pending_pre_cq | eve-no-vmtest-pre-cq           | 2018-06-05 21:36:29 |
| external      | validation_pending_pre_cq | reef-no-vmtest-pre-cq          | 2018-06-05 21:36:29 |
| external      | validation_pending_pre_cq | whirlwind-no-vmtest-pre-cq     | 2018-06-05 21:36:29 |
| external      | validation_pending_pre_cq | zako-no-vmtest-pre-cq          | 2018-06-05 21:36:29 |
| external      | validation_pending_pre_cq | samus-no-vmtest-pre-cq         | 2018-06-05 21:36:29 |
| external      | validation_pending_pre_cq | kevin-arcnext-no-vmtest-pre-cq | 2018-06-05 21:36:29 |
| external      | validation_pending_pre_cq | chromite-pre-cq                | 2018-06-05 21:36:29 |
| external      | screened_for_pre_cq       | NULL                           | 2018-06-05 21:36:29 |
| external      | trybot_launching          | daisy_spring-no-vmtest-pre-cq  | 2018-06-05 21:38:55 |
| external      | trybot_launching          | betty-pre-cq                   | 2018-06-05 21:38:55 |
| external      | trybot_launching          | cyan-no-vmtest-pre-cq          | 2018-06-05 21:38:55 |
| external      | trybot_launching          | nyan_blaze-no-vmtest-pre-cq    | 2018-06-05 21:38:55 |
| external      | trybot_launching          | eve-no-vmtest-pre-cq           | 2018-06-05 21:38:55 |
| external      | trybot_launching          | reef-no-vmtest-pre-cq          | 2018-06-05 21:38:55 |
| external      | trybot_launching          | whirlwind-no-vmtest-pre-cq     | 2018-06-05 21:38:55 |
| external      | trybot_launching          | zako-no-vmtest-pre-cq          | 2018-06-05 21:38:55 |
| external      | trybot_launching          | samus-no-vmtest-pre-cq         | 2018-06-05 21:38:55 |
| external      | trybot_launching          | kevin-arcnext-no-vmtest-pre-cq | 2018-06-05 21:38:55 |
| external      | trybot_launching          | chromite-pre-cq                | 2018-06-05 21:38:55 |
| external      | picked_up                 | NULL                           | 2018-06-05 21:43:32 |
| external      | picked_up                 | NULL                           | 2018-06-05 21:50:10 |
| external      | picked_up                 | NULL                           | 2018-06-05 21:50:38 |
| external      | picked_up                 | NULL                           | 2018-06-05 21:50:46 |
| external      | picked_up                 | NULL                           | 2018-06-05 21:50:49 |
| external      | picked_up                 | NULL                           | 2018-06-05 21:51:13 |
| external      | picked_up                 | NULL                           | 2018-06-05 21:51:13 |
| external      | picked_up                 | NULL                           | 2018-06-05 21:51:20 |
| external      | picked_up                 | NULL                           | 2018-06-05 21:51:26 |
| external      | picked_up                 | NULL                           | 2018-06-05 21:51:26 |
| external      | picked_up                 | NULL                           | 2018-06-05 21:51:31 |
| external      | pre_cq_inflight           | NULL                           | 2018-06-05 21:53:18 |
| external      | verified                  | NULL                           | 2018-06-05 22:01:01 |
| external      | verified                  | NULL                           | 2018-06-05 23:05:04 |
| external      | verified                  | NULL                           | 2018-06-05 23:10:14 |
| external      | verified                  | NULL                           | 2018-06-05 23:11:25 |
| external      | verified                  | NULL                           | 2018-06-05 23:14:24 |
| external      | verified                  | NULL                           | 2018-06-05 23:17:15 |
| external      | verified                  | NULL                           | 2018-06-05 23:20:06 |
| external      | verified                  | NULL                           | 2018-06-05 23:25:37 |
| external      | verified                  | NULL                           | 2018-06-05 23:25:41 |
| external      | verified                  | NULL                           | 2018-06-05 23:34:56 |
| external      | verified                  | NULL                           | 2018-06-06 00:04:03 |
| external      | requeued                  | NULL                           | 2018-06-06 14:45:16 |
| external      | pre_cq_fully_verified     | NULL                           | 2018-06-06 14:45:18 |
| external      | pre_cq_passed             | NULL                           | 2018-06-06 14:45:24 |
+---------------+---------------------------+--------------------------------+---------------------+
50 rows in set (0.02 sec)

Next step is probably to pull the PreCQ Launcher logs for this time frame.
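To narrow down which time frame of the launcher logs to pull, the idle window can be computed from the timestamps in the table above. This is a minimal sketch: the helper name `largest_gap` is hypothetical, and only the last few rows of the table are fed in.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def largest_gap(timestamps):
    """Return (gap, start, end) for the widest pair of consecutive timestamps."""
    times = sorted(datetime.strptime(t, FMT) for t in timestamps)
    return max((b - a, a, b) for a, b in zip(times, times[1:]))


# Last rows of the CL Action table, copied from the query output.
gap, start, end = largest_gap([
    "2018-06-05 23:34:56",  # verified
    "2018-06-06 00:04:03",  # verified (last builder)
    "2018-06-06 14:45:16",  # requeued
    "2018-06-06 14:45:18",  # pre_cq_fully_verified
    "2018-06-06 14:45:24",  # pre_cq_passed
])
print(f"idle for {gap} between {start} and {end}")
```

The widest gap sits between the last `verified` row and the `requeued` row, so those two timestamps bound the log window worth inspecting.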
,
Jun 12 2018
It looks like there were Gerrit/GoB problems around that time (PDT):

16:19:18: INFO: RunCommand: repo --time sync --force-sync '--cache-dir=/b/swarming/w/ir/cache/git' -n in /b/swarming/w/ir/cache/cbuild/repository
...
fatal: unable to access 'https://chrome-internal.googlesource.com/chromeos/overlays/overlay-kefka-private/': The requested URL returned error: 502
[W git.go:283] Transient error string identified in STDERR: "fatal: unable to access 'https://chrome-internal.googlesource.com/chromeos/overlays/overlay-kefka-private/': The requested URL returned error: 502\n"
[W git.go:294] Retrying after 3s (rc=128): transient error string encountered
fatal: unable to access 'https://chromium.googlesource.com/a/chromiumos/third_party/bluez/': The requested URL returned error: 502
[W git.go:283] Transient error string identified in STDERR: "fatal: unable to access 'https://chromium.googlesource.com/a/chromiumos/third_party/bluez/': The requested URL returned error: 502\n"
[W git.go:294] Retrying after 3s (rc=128): transient error string encountered
Fetching project chromiumos/third_party/kernel
[W git.go:317] Command completed with rc 0 after 1 transient failure(s).
[W git.go:317] Command completed with rc 0 after 1 transient failure(s).
Fetching project chromiumos/third_party/kernel
fatal: unable to access 'https://chrome-internal.googlesource.com/chromeos/overlays/overlay-arkham-private/': The requested URL returned error: 502
[W git.go:283] Transient error string identified in STDERR: "fatal: unable to access 'https://chrome-internal.googlesource.com/chromeos/overlays/overlay-arkham-private/': The requested URL returned error: 502\n"
[W git.go:294] Retrying after 3s (rc=128): transient error string encountered
fatal: unable to access 'https://chrome-internal.googlesource.com/chromeos/overlays/overlay-cranky-private/': The requested URL returned error: 502
[W git.go:283] Transient error string identified in STDERR: "fatal: unable to access 'https://chrome-internal.googlesource.com/chromeos/overlays/overlay-cranky-private/': The requested URL returned error: 502\n"
[W git.go:294] Retrying after 3s (rc=128): transient error string encountered
fatal: unable to access 'https://chromium.googlesource.com/a/chromiumos/platform/factory/': The requested URL returned error: 502
[W git.go:283] Transient error string identified in STDERR: "fatal: unable to access 'https://chromium.googlesource.com/a/chromiumos/platform/factory/': The requested URL returned error: 502\n"
[W git.go:294] Retrying after 3s (rc=128): transient error string encountered
[W git.go:317] Command completed with rc 0 after 1 transient failure(s).
[W git.go:317] Command completed with rc 0 after 1 transient failure(s).
[W git.go:317] Command completed with rc 0 after 1 transient failure(s).
fatal: unable to access 'https://chrome-internal.googlesource.com/chromeos/overlays/overlay-cave-private/': The requested URL returned error: 502
[W git.go:283] Transient error string identified in STDERR: "fatal: unable to access 'https://chrome-internal.googlesource.com/chromeos/overlays/overlay-cave-private/': The requested URL returned error: 502\n"
[W git.go:294] Retrying after 3s (rc=128): transient error string encountered

Problems continue until almost an hour later:

17:13:25: INFO: RunCommand: repo --time sync --force-sync '--cache-dir=/b/swarming/w/ir/cache/git' -n in /b/swarming/w/ir/cache/cbuild/repository
...
fatal: remote error: Git repository not found
fatal: remote error: Git repository not found
fatal: remote error: Git repository not found
fatal: remote error: Git repository not found
...
17:16:53: WARNING: A transient error occured while querying chrome-internal-review.googlesource.com:
GET /a/changes/?q=commit:f0fcbaf927527c17e78e1044f5f8d04fb6ae6627&n=500&o=DETAILED_ACCOUNTS&o=ALL_REVISIONS&o=DETAILED_LABELS&o=CURRENT_COMMIT&o=CURRENT_REVISION HTTP/1.1
HTTP/1.1 500 Internal Server Error
Response body: 'Internal server error\n'
...
17:29:18: INFO: RunCommand: repo --time sync --force-sync '--cache-dir=/b/swarming/w/ir/cache/git' -n in /b/swarming/w/ir/cache/cbuild/repository
fatal: unable to access 'https://chromium.googlesource.com/a/chromiumos/platform/firmware/': The requested URL returned error: 502
[W git.go:283] Transient error string identified in STDERR: "fatal: unable to access 'https://chromium.googlesource.com/a/chromiumos/platform/firmware/': The requested URL returned error: 502\n"

Need to make the PreCQ launcher more resilient to this kind of problem with retries.
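The kind of retry being asked for can be sketched roughly as follows. This is not the launcher's or the git wrapper's actual code: `run` is a hypothetical callable returning `(returncode, stderr)`, the function names are made up for illustration, and the two patterns are simply lifted from the log excerpts above rather than from any real retry list.

```python
import re
import time

# Illustrative patterns only, taken from the log excerpts in this comment.
# They are assumptions, not the contents of any production retry list.
TRANSIENT_PATTERNS = [
    re.compile(r"The requested URL returned error: 5\d\d"),
    re.compile(r"remote error: Git repository not found"),
]


def is_transient(stderr):
    """Return True if stderr matches any known transient-failure pattern."""
    return any(p.search(stderr) for p in TRANSIENT_PATTERNS)


def run_with_retries(run, max_retries=3, delay=3.0, sleep=time.sleep):
    """Call run() until it succeeds, fails permanently, or retries run out.

    Returns (returncode, stderr, retries_used).
    """
    retries = 0
    while True:
        rc, stderr = run()
        # Stop on success, on a non-transient error, or when out of retries.
        if rc == 0 or not is_transient(stderr) or retries >= max_retries:
            return rc, stderr, retries
        retries += 1
        sleep(delay)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a fixed `delay` is used here for brevity, whereas a growing delay would spread retries over a longer outage.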
,
Jun 12 2018
,
Oct 19
Dhanya, let's try to implement a quick fix here while we're working on the other design work.
,
Oct 19
If you start to delve deeply into the code, I have a lot of suggestions about ways to simplify it....
,
Oct 31
The 502 errors seem to have resolved themselves after retries. Do we get "fatal: remote error: Git repository not found" errors because of GoB quota issues? Also, does it make sense to retry on this one?
,
Oct 31
"Git repository not found" is due to infrastructure failures on the GoB side, not quota.
,
Nov 12
Adding retry messages based on this spreadsheet (Google-restricted):
https://docs.google.com/spreadsheets/d/1GOLrmt6R_8qiN39K77vL0TJpt0dbJFYM6SDdquo_JKI/edit?usp=sharing
,
Nov 13
Issue 744569 has been merged into this issue.
,
Nov 15
The following revision refers to this bug:
https://chromium.googlesource.com/infra/infra/+/c9c8a52bfeaf8bc00ece22fdfd447822c8fcad77

commit c9c8a52bfeaf8bc00ece22fdfd447822c8fcad77
Author: Dhanya Ganesh <dhanyaganesh@chromium.org>
Date: Thu Nov 15 21:29:51 2018

git-wrapper: Additional retry messages for GoB flakiness

Added 4 additional messages that have been observed during GoB flakiness and quota issues. Increased the number of retries from 10 to 12. This should increase the retry time from 5.6 minutes to 12+ minutes.

BUG= chromium:850130
TEST=test.py

Change-Id: I92846c1ab9ff2a30a87c247fb62c9f4e899c8a0d
Reviewed-on: https://chromium-review.googlesource.com/c/1332328
Commit-Queue: Robbie Iannucci <iannucci@chromium.org>
Reviewed-by: Robbie Iannucci <iannucci@chromium.org>
Cr-Commit-Position: refs/heads/master@{#19028}

[modify] https://crrev.com/c9c8a52bfeaf8bc00ece22fdfd447822c8fcad77/go/src/infra/tools/git/retry_regexp_test.go
[modify] https://crrev.com/c9c8a52bfeaf8bc00ece22fdfd447822c8fcad77/go/src/infra/tools/git/retry_regexp.go
[modify] https://crrev.com/c9c8a52bfeaf8bc00ece22fdfd447822c8fcad77/go/src/infra/tools/git/main.go
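The retry-time figures in the commit message are consistent with an exponential backoff. The sketch below assumes a first delay of 3 seconds (matching the "Retrying after 3s" log lines earlier in this thread) and a growth factor of 1.5; the factor is an assumption chosen because it reproduces the quoted numbers, not something the commit message states.

```python
def total_retry_wait(retries, first_delay=3.0, factor=1.5):
    """Sum the sleep before each of `retries` retries, in seconds.

    first_delay and factor are assumed values, not confirmed settings.
    """
    return sum(first_delay * factor ** i for i in range(retries))


for n in (10, 12):
    seconds = total_retry_wait(n)
    print(f"{n} retries: {seconds:.0f}s, about {seconds / 60:.1f} min")
```

Under those assumptions, 10 retries add up to about 340 seconds and 12 retries to about 772 seconds, in line with the "5.6 minutes" and "12+ minutes" in the commit message. Because the delay grows geometrically, the two extra retries more than double the total wait.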
,
Nov 16
The following revision refers to this bug:
https://chrome-internal.googlesource.com/infradata/config/+/da5491be675c79d05750cef9051038aa905ddaff

commit da5491be675c79d05750cef9051038aa905ddaff
Author: Robert Iannucci <iannucci@chromium.org>
Date: Fri Nov 16 21:57:25 2018
,
Nov 16
Closing this. If any new messages need to be added, add me to the new bug.
,
Nov 19
This is actually waiting for one more roll to make it to production: https://chrome-internal-review.googlesource.com/c/infradata/config/+/719059
,
Nov 19
The following revision refers to this bug:
https://chrome-internal.googlesource.com/infradata/config/+/fef0eb079005f3a89c9580528f30849c26f20cb8

commit fef0eb079005f3a89c9580528f30849c26f20cb8
Author: Robert Iannucci <iannucci@chromium.org>
Date: Mon Nov 19 21:51:32 2018
,
Nov 19
Comment 1 by jclinton@chromium.org
, Jun 6 2018