
Issue 852142


Issue metadata

Status: Fixed
Closed: Jun 2018
Pri: 3
Type: Bug




A bunch of stuck jobs due to "ghost" invocation

Project Member Reported by vadimsh@chromium.org, Jun 12 2018

Issue description

Examples:
  https://luci-scheduler.appspot.com/jobs/fuchsia/jiri-gitiles-trigger
  https://luci-scheduler.appspot.com/jobs/cobalt-analytics/cobalt-gitiles-trigger
  https://luci-scheduler.appspot.com/jobs/fuchsia/cr-catapult-gitiles-trigger

Timing matches the luci-config outage (Issue 850796) that caused all jobs to disappear and then reappear 20 min later.

"jiri-gitiles-trigger" has pending cron trigger: "cron:v1:242487" (4 days ago)

Last triage (4 days ago)
[000 ms] Starting
[117 ms] Pending triggers set:  1 items, 0 garbage
[117 ms] Recently finished set: 0 items, 0 garbage
[117 ms] The preparation is finished
[129 ms] Started the transaction
[129 ms] Number of active invocations: 1
[129 ms] Number of recently finished:  0
[134 ms] Triggers available in this txn: 1
[134 ms] Invoking the triggering policy function
[134 ms] Max concurrent invocations is 1 and there's 1 running => refusing to launch more
[134 ms] The policy requested 0 new invocations
[134 ms] Removing consumed dsset items
[134 ms] Landing the transaction
[232 ms] Done

The important part: "Max concurrent invocations is 1 and there's 1 running". So it thinks there's something running already.
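
A minimal sketch of the concurrency check that produces this log line, assuming a simplified policy. The function name `triage_policy` and its parameters are hypothetical and are not the scheduler's actual API; only the trigger ID and invocation ID come from this report.

```python
def triage_policy(max_concurrent, active_invocations, pending_triggers):
    """Return how many new invocations to launch (hypothetical, simplified policy)."""
    free_slots = max_concurrent - len(active_invocations)
    if free_slots <= 0:
        # Corresponds to "refusing to launch more" in the triage log.
        return 0
    # Assumption: at most one new invocation per pending trigger.
    return min(free_slots, len(pending_triggers))


# One stale entry in the active list blocks the pending cron trigger.
stale = triage_policy(1, [9110585685853734112], ["cron:v1:242487"])
# With an empty active list the same trigger would be launched.
healthy = triage_policy(1, [], ["cron:v1:242487"])
print(stale, healthy)
```

As long as the stale ID stays in the active list, every triage reaches the same decision, which is why the job stays stuck rather than recovering on its own.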

Job.ActiveInvocations in datastore is [9110585685853734112], which matches the last invocation, which is successful: https://luci-scheduler.appspot.com/jobs/fuchsia/jiri-gitiles-trigger/9110585685853734112

So it appears the invocation completion notification was lost and the ActiveInvocations list was never cleaned up.

My guess is that the invocation completed after the job had been disabled, and thus the notification was dropped.
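
The stuck condition described here (a finished invocation still listed as active) can be checked mechanically. Below is a rough sketch over in-memory data. The `Job` class, the `find_stuck` helper, the status names, and the "example/healthy-job" entry are all assumptions for illustration; this is not the scheduler's data model.

```python
from dataclasses import dataclass, field

# Assumed names for terminal invocation states; the real enum may differ.
FINAL_STATUSES = {"SUCCEEDED", "FAILED", "ABORTED"}


@dataclass
class Job:
    job_id: str
    active_invocations: list = field(default_factory=list)


def find_stuck(jobs, invocation_status):
    """Map job_id -> invocation IDs that are finished but still listed as active."""
    stuck = {}
    for job in jobs:
        stale = [
            inv for inv in job.active_invocations
            if invocation_status.get(inv) in FINAL_STATUSES
        ]
        if stale:
            stuck[job.job_id] = stale
    return stuck


jobs = [
    Job("fuchsia/jiri-gitiles-trigger", [9110585685853734112]),
    Job("example/healthy-job", [1]),  # made-up job with a genuinely running invocation
]
statuses = {9110585685853734112: "SUCCEEDED", 1: "RUNNING"}
stuck = find_stuck(jobs, statuses)
print(stuck)
```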
 
I'm unsticking the known stuck jobs manually for now:
1. Find the Job entity in the datastore editor and delete the "ActiveInvocations" field.
2. Click the "Abort" button on the job page to initiate a triage.

jiri-gitiles-trigger, cobalt-gitiles-trigger and cr-catapult-gitiles-trigger are unstuck now.
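
To show why the two manual steps are enough, here is a toy model of them, assuming a job is just a dict and that triage launches one invocation when a slot is free. The `triage` function, the dict layout, and the new invocation ID are hypothetical.

```python
def triage(job, new_invocation_id, max_concurrent=1):
    """Launch one invocation for the pending triggers if a slot is free."""
    if len(job["ActiveInvocations"]) >= max_concurrent or not job["PendingTriggers"]:
        return False
    job["ActiveInvocations"].append(new_invocation_id)
    job["PendingTriggers"].clear()
    return True


job = {
    "ActiveInvocations": [9110585685853734112],  # stale entry from this report
    "PendingTriggers": ["cron:v1:242487"],
}

before = triage(job, new_invocation_id=1)  # still stuck: the stale entry occupies the slot
job["ActiveInvocations"] = []              # step 1: clear the field by hand
after = triage(job, new_invocation_id=1)   # step 2: a fresh triage now launches
print(before, after)
```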

Comment 2 by dbort@google.com, Jun 12 2018

Confirmed: thanks for the quick mitigation, Vadim!
This is unrelated to the luci-config outage. I've found several jobs in a similar condition on luci-scheduler-dev, and they broke 2 days ago.

Luckily, logs for them are still present. Here's the suspicious part regarding invocation 9109539223601762176: https://paste.googleplex.com/4709887648989184

It seems the triage transaction is not entirely transactional when datastore times out. Somehow the invocation completion notification is consumed from dsset.Set, but the entity is not updated. I suspect the garbage cleaning process removes something that shouldn't be removed.

Comment 4 by bugdroid1@chromium.org, Jun 13 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-go.git/+/29ff7eee5d9701d273e4b0652b0c16f27a6a9796

commit 29ff7eee5d9701d273e4b0652b0c16f27a6a9796
Author: Vadim Shtayura <vadimsh@chromium.org>
Date: Wed Jun 13 01:26:33 2018

[scheduler] Boilerplate for adding Admin RPC interface.

It will be used for debugging and various internal manual actions (if any).
Had to move it to 'internal' package to be able to reuse protos defined
there.

R=tandrii@chromium.org
BUG= 852142 

Change-Id: I11d2aa1c8df0a21d8b0ce00084fd1d891cb8e07d
Reviewed-on: https://chromium-review.googlesource.com/1098220
Commit-Queue: Vadim Shtayura <vadimsh@chromium.org>
Reviewed-by: Andrii Shyshkalov <tandrii@chromium.org>

[add] https://crrev.com/29ff7eee5d9701d273e4b0652b0c16f27a6a9796/scheduler/appengine/apiservers/admin.go
[modify] https://crrev.com/29ff7eee5d9701d273e4b0652b0c16f27a6a9796/scheduler/appengine/frontend/handler.go
[add] https://crrev.com/29ff7eee5d9701d273e4b0652b0c16f27a6a9796/scheduler/appengine/internal/admin.pb.go
[add] https://crrev.com/29ff7eee5d9701d273e4b0652b0c16f27a6a9796/scheduler/appengine/internal/admin.proto
[add] https://crrev.com/29ff7eee5d9701d273e4b0652b0c16f27a6a9796/scheduler/appengine/internal/adminserver_dec.go
[modify] https://crrev.com/29ff7eee5d9701d273e4b0652b0c16f27a6a9796/scheduler/appengine/internal/gen.go
[add] https://crrev.com/29ff7eee5d9701d273e4b0652b0c16f27a6a9796/scheduler/appengine/internal/pb.discovery.go
[modify] https://crrev.com/29ff7eee5d9701d273e4b0652b0c16f27a6a9796/scheduler/appengine/messages/config.pb.go

The only other stuck job is gyp/gyp-gitiles-trigger (detected by https://chromium-review.googlesource.com/c/infra/luci/luci-go/+/1098463)

I'll manually unstick it.

The actual fix is https://chromium-review.googlesource.com/c/infra/luci/luci-go/+/1098469

Comment 8 by bugdroid1@chromium.org, Jun 13 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-go.git/+/24d742e444c84df99629d8a6aff7ca7e6c90f995

commit 24d742e444c84df99629d8a6aff7ca7e6c90f995
Author: Vadim Shtayura <vadimsh@chromium.org>
Date: Wed Jun 13 17:02:07 2018

[scheduler] Add adhoc script to detect jobs with stuck ActiveInvocations list.

R=tandrii@chromium.org
BUG= 852142 

Change-Id: Idae7f05c5045a72ff85db8587f8bd74c0b80fb06
Reviewed-on: https://chromium-review.googlesource.com/1098463
Reviewed-by: Andrii Shyshkalov <tandrii@chromium.org>
Commit-Queue: Vadim Shtayura <vadimsh@chromium.org>

[add] https://crrev.com/24d742e444c84df99629d8a6aff7ca7e6c90f995/scheduler/misc/detect_stuck_active_invs.py


Comment 9 by bugdroid1@chromium.org, Jun 13 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-go.git/+/c7578c6d177f2bca6107c6a5b7c03945b0036cbf

commit c7578c6d177f2bca6107c6a5b7c03945b0036cbf
Author: Vadim Shtayura <vadimsh@chromium.org>
Date: Wed Jun 13 17:32:17 2018

[scheduler] Do not delete dsset items if the triage txn fails.

This likely regressed when the triage log was introduced, since it changed
when we call triageOp.finalize(...).

R=tandrii@chromium.org
BUG= 852142 

Change-Id: I55fcc2cc56073ad9d048d85f4253d9cfde2d82f4
Reviewed-on: https://chromium-review.googlesource.com/1098469
Reviewed-by: Andrii Shyshkalov <tandrii@chromium.org>
Commit-Queue: Vadim Shtayura <vadimsh@chromium.org>

[modify] https://crrev.com/c7578c6d177f2bca6107c6a5b7c03945b0036cbf/scheduler/appengine/engine/engine.go
[modify] https://crrev.com/c7578c6d177f2bca6107c6a5b7c03945b0036cbf/scheduler/appengine/engine/triage.go
[modify] https://crrev.com/c7578c6d177f2bca6107c6a5b7c03945b0036cbf/scheduler/appengine/engine/triage_test.go
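
A rough sketch of the failure mode this commit addresses, using an in-memory set in place of dsset. All names here (`run_triage`, `TxnFailed`, the `cleanup_on_failure` flag, invocation ID 42) are made up for illustration and do not reflect the actual engine code.

```python
class TxnFailed(Exception):
    """Stand-in for a datastore transaction error such as a timeout."""


def run_triage(pending, job, commit, cleanup_on_failure):
    """Consume completion notifications and apply them to the job.

    pending: set of finished invocation IDs waiting to be processed.
    commit: callable that applies the update and may raise TxnFailed.
    cleanup_on_failure: True models the buggy behavior, False the fix.
    """
    consumed = set(pending)
    try:
        commit(job, consumed)
        committed = True
    except TxnFailed:
        committed = False
    if committed or cleanup_on_failure:
        # Notifications are dropped from the set here.
        pending.difference_update(consumed)
    return committed


def failing_commit(job, consumed):
    raise TxnFailed("datastore timeout")


def ok_commit(job, consumed):
    job["ActiveInvocations"] = [
        inv for inv in job["ActiveInvocations"] if inv not in consumed
    ]


# Buggy flow: the notification is deleted although the job was never updated.
buggy_pending, buggy_job = {42}, {"ActiveInvocations": [42]}
run_triage(buggy_pending, buggy_job, failing_commit, cleanup_on_failure=True)

# Fixed flow: the notification survives the failed txn and a retry applies it.
fixed_pending, fixed_job = {42}, {"ActiveInvocations": [42]}
run_triage(fixed_pending, fixed_job, failing_commit, cleanup_on_failure=False)
after_failure = set(fixed_pending)
run_triage(fixed_pending, fixed_job, ok_commit, cleanup_on_failure=False)
print(buggy_job, fixed_job)
```

In the buggy flow the job keeps invocation 42 in its active list with nothing left to clear it, which matches the "ghost" invocation in the report.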

Status: Fixed (was: Assigned)
