A bunch of stuck jobs due to "ghost" invocation
Issue description

Examples:
https://luci-scheduler.appspot.com/jobs/fuchsia/jiri-gitiles-trigger
https://luci-scheduler.appspot.com/jobs/cobalt-analytics/cobalt-gitiles-trigger
https://luci-scheduler.appspot.com/jobs/fuchsia/cr-catapult-gitiles-trigger

Timing matches luci-config outage (Issue 850796) that caused all jobs to disappear and then 20 min later reappear again.

"jiri-gitiles-trigger" has pending cron trigger: "cron:v1:242487" (4 days ago)

Last triage (4 days ago)
[000 ms] Starting
[117 ms] Pending triggers set: 1 items, 0 garbage
[117 ms] Recently finished set: 0 items, 0 garbage
[117 ms] The preparation is finished
[129 ms] Started the transaction
[129 ms] Number of active invocations: 1
[129 ms] Number of recently finished: 0
[134 ms] Triggers available in this txn: 1
[134 ms] Invoking the triggering policy function
[134 ms] Max concurrent invocations is 1 and there's 1 running => refusing to launch more
[134 ms] The policy requested 0 new invocations
[134 ms] Removing consumed dsset items
[134 ms] Landing the transaction
[232 ms] Done

The important part: "Max concurrent invocations is 1 and there's 1 running". So it thinks there's something running already.

Job.ActiveInvocations in datastore is [9110585685853734112], which matches the last invocation, which is successful: https://luci-scheduler.appspot.com/jobs/fuchsia/jiri-gitiles-trigger/9110585685853734112

So it appears the invocation completion notification has been lost and ActiveInvocations list hasn't been cleaned up. My guess is that the invocation completed after the job has been disabled, and thus the notification was dropped.
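To illustrate why a single stale entry blocks the job indefinitely, here is a minimal sketch of a concurrency-limited triggering policy. It is not the scheduler's actual code: the `Job` class, its field names, and `invocations_to_launch` are hypothetical stand-ins, and the real policy may handle triggers differently. Only the invocation ID and trigger ID come from the report above.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical stand-in for the scheduler's job entity."""
    max_concurrent: int = 1
    active_invocations: list = field(default_factory=list)
    pending_triggers: list = field(default_factory=list)


def invocations_to_launch(job: Job) -> int:
    """Returns how many new invocations a concurrency-limited policy allows."""
    free_slots = job.max_concurrent - len(job.active_invocations)
    if free_slots <= 0:
        # Corresponds to the "refusing to launch more" line in the triage log.
        return 0
    return min(free_slots, len(job.pending_triggers))


# A finished invocation that was never removed from the active list.
stuck = Job(active_invocations=[9110585685853734112],
            pending_triggers=["cron:v1:242487"])
# The same job with a correctly cleaned-up active list.
healthy = Job(pending_triggers=["cron:v1:242487"])
```

Under these assumptions the policy only looks at the length of the active list, so a "ghost" entry is indistinguishable from a genuinely running invocation and the pending trigger is never consumed.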
,
Jun 12 2018
Confirmed: thanks for the quick mitigation, Vadim!
,
Jun 13 2018
This is unrelated to the luci-config outage. I've found several jobs in a similar condition on luci-scheduler-dev, and they broke 2 days ago. Luckily, logs for them are still present. Here's the suspicious part regarding invocation 9109539223601762176: https://paste.googleplex.com/4709887648989184

It seems the triage transaction is not entirely transactional when datastore times out. Somehow the invocation completion notification is consumed from dsset.Set, but the entity is not updated. I suspect the garbage cleaning process removes something that shouldn't be removed.
,
Jun 13 2018
The following revision refers to this bug:
https://chromium.googlesource.com/infra/luci/luci-go.git/+/29ff7eee5d9701d273e4b0652b0c16f27a6a9796

commit 29ff7eee5d9701d273e4b0652b0c16f27a6a9796
Author: Vadim Shtayura <vadimsh@chromium.org>
Date: Wed Jun 13 01:26:33 2018

[scheduler] Boilerplate for adding Admin RPC interface.

It will be used for debugging and various internal manual actions (if any).

Had to move it to 'internal' package to be able to reuse protos defined there.

R=tandrii@chromium.org
BUG= 852142
Change-Id: I11d2aa1c8df0a21d8b0ce00084fd1d891cb8e07d
Reviewed-on: https://chromium-review.googlesource.com/1098220
Commit-Queue: Vadim Shtayura <vadimsh@chromium.org>
Reviewed-by: Andrii Shyshkalov <tandrii@chromium.org>

[add] https://crrev.com/29ff7eee5d9701d273e4b0652b0c16f27a6a9796/scheduler/appengine/apiservers/admin.go
[modify] https://crrev.com/29ff7eee5d9701d273e4b0652b0c16f27a6a9796/scheduler/appengine/frontend/handler.go
[add] https://crrev.com/29ff7eee5d9701d273e4b0652b0c16f27a6a9796/scheduler/appengine/internal/admin.pb.go
[add] https://crrev.com/29ff7eee5d9701d273e4b0652b0c16f27a6a9796/scheduler/appengine/internal/admin.proto
[add] https://crrev.com/29ff7eee5d9701d273e4b0652b0c16f27a6a9796/scheduler/appengine/internal/adminserver_dec.go
[modify] https://crrev.com/29ff7eee5d9701d273e4b0652b0c16f27a6a9796/scheduler/appengine/internal/gen.go
[add] https://crrev.com/29ff7eee5d9701d273e4b0652b0c16f27a6a9796/scheduler/appengine/internal/pb.discovery.go
[modify] https://crrev.com/29ff7eee5d9701d273e4b0652b0c16f27a6a9796/scheduler/appengine/messages/config.pb.go
,
Jun 13 2018
The following revision refers to this bug:
https://chromium.googlesource.com/infra/luci/luci-go.git/+/8a9054c6592f70c36831f6f188c6f3550f859fbb

commit 8a9054c6592f70c36831f6f188c6f3550f859fbb
Author: Vadim Shtayura <vadimsh@chromium.org>
Date: Wed Jun 13 02:22:34 2018

[scheduler] Implement internal.admin.Admin.GetDebugJobState RPC.

Most importantly it assembles, deserializes and returns various sets stored in a serialized form in the datastore (and thus not readable through Cloud Console).

R=tandrii@chromium.org
BUG= 852142
Change-Id: I704d0e0ec399925f9ad11319cf8604f4c70484d6
Reviewed-on: https://chromium-review.googlesource.com/1098232
Commit-Queue: Vadim Shtayura <vadimsh@chromium.org>
Reviewed-by: Andrii Shyshkalov <tandrii@chromium.org>

[modify] https://crrev.com/8a9054c6592f70c36831f6f188c6f3550f859fbb/scheduler/appengine/apiservers/admin.go
[modify] https://crrev.com/8a9054c6592f70c36831f6f188c6f3550f859fbb/scheduler/appengine/engine/engine.go
[modify] https://crrev.com/8a9054c6592f70c36831f6f188c6f3550f859fbb/scheduler/appengine/internal/admin.pb.go
[modify] https://crrev.com/8a9054c6592f70c36831f6f188c6f3550f859fbb/scheduler/appengine/internal/admin.proto
[modify] https://crrev.com/8a9054c6592f70c36831f6f188c6f3550f859fbb/scheduler/appengine/internal/adminserver_dec.go
[modify] https://crrev.com/8a9054c6592f70c36831f6f188c6f3550f859fbb/scheduler/appengine/internal/pb.discovery.go
,
Jun 13 2018
The following revision refers to this bug:
https://chromium.googlesource.com/infra/luci/luci-go.git/+/fc9920598c7b2bed174f0c95546db19d7fc263ac

commit fc9920598c7b2bed174f0c95546db19d7fc263ac
Author: Vadim Shtayura <vadimsh@chromium.org>
Date: Wed Jun 13 02:30:23 2018

[scheduler] Add RPC to fetch single invocation by its ID.

Will be useful when examining ActiveInvocations list.

R=tandrii@chromium.org
BUG= 852142
Change-Id: Ifd57574f227a1c94db1a30c9606ce73693bed4be
Reviewed-on: https://chromium-review.googlesource.com/1098304
Commit-Queue: Vadim Shtayura <vadimsh@chromium.org>
Reviewed-by: Andrii Shyshkalov <tandrii@chromium.org>

[modify] https://crrev.com/fc9920598c7b2bed174f0c95546db19d7fc263ac/scheduler/api/scheduler/v1/pb.discovery.go
[modify] https://crrev.com/fc9920598c7b2bed174f0c95546db19d7fc263ac/scheduler/api/scheduler/v1/scheduler.pb.go
[modify] https://crrev.com/fc9920598c7b2bed174f0c95546db19d7fc263ac/scheduler/api/scheduler/v1/scheduler.proto
[modify] https://crrev.com/fc9920598c7b2bed174f0c95546db19d7fc263ac/scheduler/appengine/apiservers/scheduler.go
[modify] https://crrev.com/fc9920598c7b2bed174f0c95546db19d7fc263ac/scheduler/appengine/apiservers/scheduler_test.go
[modify] https://crrev.com/fc9920598c7b2bed174f0c95546db19d7fc263ac/scheduler/appengine/internal/pb.discovery.go
,
Jun 13 2018
The only other stuck job is gyp/gyp-gitiles-trigger (detected by https://chromium-review.googlesource.com/c/infra/luci/luci-go/+/1098463). I'll manually unstick it.

The actual fix is https://chromium-review.googlesource.com/c/infra/luci/luci-go/+/1098469
,
Jun 13 2018
The following revision refers to this bug:
https://chromium.googlesource.com/infra/luci/luci-go.git/+/24d742e444c84df99629d8a6aff7ca7e6c90f995

commit 24d742e444c84df99629d8a6aff7ca7e6c90f995
Author: Vadim Shtayura <vadimsh@chromium.org>
Date: Wed Jun 13 17:02:07 2018

[scheduler] Add adhoc script to detect jobs with stuck ActiveInvocations list.

R=tandrii@chromium.org
BUG= 852142
Change-Id: Idae7f05c5045a72ff85db8587f8bd74c0b80fb06
Reviewed-on: https://chromium-review.googlesource.com/1098463
Reviewed-by: Andrii Shyshkalov <tandrii@chromium.org>
Commit-Queue: Vadim Shtayura <vadimsh@chromium.org>

[add] https://crrev.com/24d742e444c84df99629d8a6aff7ca7e6c90f995/scheduler/misc/detect_stuck_active_invs.py
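For readers without access to the change, the detection idea can be sketched roughly as follows. This is not the contents of detect_stuck_active_invs.py: the function name, the plain-dict inputs, the status names, and the "example/healthy-job" entry are all assumptions made for illustration. The check itself follows the symptom from the issue description: a job is suspicious if its ActiveInvocations list contains an invocation that has already reached a final status.

```python
# Assumed names for final invocation statuses; the real scheduler may differ.
FINAL_STATUSES = {"SUCCEEDED", "FAILED", "ABORTED"}


def find_stuck_jobs(active_invocations, invocation_status):
    """Maps job ID -> list of 'active' invocation IDs that already finished.

    active_invocations: dict of job ID -> list of invocation IDs.
    invocation_status: dict of invocation ID -> status string.
    """
    stuck = {}
    for job_id, inv_ids in active_invocations.items():
        ghosts = [i for i in inv_ids
                  if invocation_status.get(i) in FINAL_STATUSES]
        if ghosts:
            stuck[job_id] = ghosts
    return stuck


# Illustrative input: one job with a ghost entry, one hypothetical healthy job.
jobs = {
    "fuchsia/jiri-gitiles-trigger": [9110585685853734112],
    "example/healthy-job": [1],
}
statuses = {9110585685853734112: "SUCCEEDED", 1: "RUNNING"}
result = find_stuck_jobs(jobs, statuses)
```

In a real deployment the two inputs would have to come from the scheduler itself, for example via the debug and invocation-fetching RPCs added in the earlier revisions.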
,
Jun 13 2018
The following revision refers to this bug:
https://chromium.googlesource.com/infra/luci/luci-go.git/+/c7578c6d177f2bca6107c6a5b7c03945b0036cbf

commit c7578c6d177f2bca6107c6a5b7c03945b0036cbf
Author: Vadim Shtayura <vadimsh@chromium.org>
Date: Wed Jun 13 17:32:17 2018

[scheduler] Do not delete dsset items if the triage txn fails.

This is likely regressed when the triage log was introduced, since it changed when we call triageOp.finialize(...).

R=tandrii@chromium.org
BUG= 852142
Change-Id: I55fcc2cc56073ad9d048d85f4253d9cfde2d82f4
Reviewed-on: https://chromium-review.googlesource.com/1098469
Reviewed-by: Andrii Shyshkalov <tandrii@chromium.org>
Commit-Queue: Vadim Shtayura <vadimsh@chromium.org>

[modify] https://crrev.com/c7578c6d177f2bca6107c6a5b7c03945b0036cbf/scheduler/appengine/engine/engine.go
[modify] https://crrev.com/c7578c6d177f2bca6107c6a5b7c03945b0036cbf/scheduler/appengine/engine/triage.go
[modify] https://crrev.com/c7578c6d177f2bca6107c6a5b7c03945b0036cbf/scheduler/appengine/engine/triage_test.go
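The ordering this fix enforces can be shown with a small simulation. This is a sketch only, not the engine's code: `triage`, `TxnFailed`, `failing_commit`, and the dict/set data model are hypothetical, and the real implementation is in Go. The point is that the completion notifications are removed from the pending set only after the transaction that updates the job has committed.

```python
class TxnFailed(Exception):
    """Raised when the (simulated) datastore transaction does not commit."""


def failing_commit(new_active):
    """Simulates a datastore timeout while landing the transaction."""
    raise TxnFailed("datastore timeout")


def triage(job, pending_set, commit):
    """One simplified triage pass.

    job: dict with an "active" list of invocation IDs.
    pending_set: set of finished invocation IDs waiting to be applied.
    commit: callable that persists the new active list or raises TxnFailed.
    """
    consumed = set(pending_set)
    new_active = [i for i in job["active"] if i not in consumed]
    commit(new_active)  # May raise; nothing below runs in that case.
    job["active"] = new_active
    # Cleanup happens only after the commit succeeded, so a failed
    # transaction leaves the notifications in place for the next triage.
    pending_set.difference_update(consumed)
```

If the cleanup ran regardless of the commit outcome, a timed-out transaction would drop the notification while leaving the invocation in the active list, which is the stuck state described in this bug.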
,
Jun 13 2018
Comment 1 by vadimsh@chromium.org
, Jun 12 2018