New issue
Advanced search Search tips

Issue 803996 link

Starred by 0 users

Issue metadata

Status: Fixed
Owner:
Closed: Aug 3
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug



Sign in to add a comment

Tricium fails to collect swarming task result sometimes, it is sometimes not ready

Project Member Reported by qyears...@chromium.org, Jan 19 2018

Issue description

Found in the tricium-dev logs.

Full logged error: [driver] Failed to call Driver.Collect :: {"error":"rpc error: code = Internal desc = failed to trigger worker: failed to collect swarming task result: missing isolated output, task id: 3b292fe464967010"}

Example cases:
[driver]: Received collect request (run ID: 4880256448069632, worker: GitFileIsolator_UBUNTU, task ID: 3b294206517ac510)
[driver]: Received collect request (run ID: 6372074683629568, worker: Spacey_UBUNTU, task ID: 3b292fe464967010)

In the first one above, the task failed; in the second it succeed.

Logged here:
https://cs.chromium.org/chromium/infra/go/src/infra/tricium/appengine/driver/rpc_collect.go?l=50

M-A, do you have ideas about where to look next to investigate this? Is it relevant that the error message says "missing isolated output"?
 

Comment 1 by emso@chromium.org, Jan 22 2018

AFAIK, this is because the isolate input isn't available yet. It retries until it succeeds. The way to fix it is probably to wait before asking for the isolated output after being notified of completion. 

Comment 2 by emso@chromium.org, Jan 22 2018

Labels: Tricium
Owner: qyears...@chromium.org
Status: Assigned (was: Available)
Inspecting the logs confirms the above in #1.

For example: https://chromium-swarm.appspot.com/task?id=3c11014474fec310 was running for 5 seconds, 2:02:13 - 2:02:18 PM PST, and in the middle of those 5 seconds there were several failures in a row for the same task, e.g.

2018-03-05 14:02:15.166 PST
[driver]: Received collect request (run ID: 5140140221530112, worker: Spacey_UBUNTU, task ID: 3c11014474fec310)
2018-03-05 14:02:15.220 PST
[driver] Failed to call Driver.Collect :: {"error":"rpc error: code = Internal desc = failed to trigger worker: failed to collect swarming task result: missing isolated output, task id: 3c11014474fec310"}

2018-03-05 14:02:15.765 PST
[driver]: Received collect request (run ID: 5140140221530112, worker: Spacey_UBUNTU, task ID: 3c11014474fec310)
2018-03-05 14:02:15.818 PST
[driver] Failed to call Driver.Collect :: {"error":"rpc error: code = Internal desc = failed to trigger worker: failed to collect swarming task result: missing isolated output, task id: 3c11014474fec310"}

Then immediately after the above, there was a success for the same task:

2018-03-05 14:02:19.671 PST
[driver]: Received collect request (run ID: 5140140221530112, worker: Spacey_UBUNTU, task ID: 3c11014474fec310)
2018-03-05 14:02:20.140 PST
Fetched isolate: "{\"algo\":\"sha-1\",\"files\":{\"tricium/data/results.json\":{\"h\":\"bf21a9e8fbc5a3846fb05b4fa0859e0917b2202f\",\"m\":416,\"s\":2}},\"version\":\"1.6\"}", iso: &{sha-1 [] map[tricium/data/results.json:{bf21a9e8fbc5a3846fb05b4fa0859e0917b2202f <nil> 0xc0087c1378 0xc0087c1398 }] [] <nil>  1.6}
2018-03-05 14:02:20.495 PST
[driver] Successfully completed collect
More notes about this:

The Collect function in common/swarming.go currently says:

  The task in question should be completed before this function is                               
  called and the task should have isolated output. The isolated output                           
  and exit code of the task are returned.

If there's a better/simpler way to ask Swarming "is this task completed yet?"
then we could use that. Otherwise, I propose that we should change Collect
in common/swarming.go to return no output isolate hash and no error in order
to indicate that the task isn't completed yet.

OK, I think I'm starting to understand this slightly better: Tricium is only supposed to try to Collect tasks when it has been notified by Swarming via PubSub that the task is done, so we really are expecting Collect to be called when the task is already completed. Right?

Another example of a sequence of logs, including the pubsub notification:

2018-03-07 14:51:01.614 /_ah/push-handlers/notify
[driver] Unwrapped pubsub message, task ID: "3c1b7a839f287e10", TriggerRequest: run_id:5707166098915328 isolated_input_hash:"556329f3fbb2bc1fe2bca898efe3a75a7c34dc56" worker:"GitFileIsolator_UBUNTU" 
[driver] Enqueued collect request, runID: 5707166098915328, worker: GitFileIsolator_UBUNTU

2018-03-07 14:51:01.848 /driver/internal/collect
[driver] Failed to call Driver.Collect :: {"error":"rpc error: code = Internal desc = failed to trigger worker: failed to collect swarming task result: missing isolated output, task id: 3c1b7a839f287e10"}

[Five more failed requests to /driver/internal/collect with the same result, followed by one successful request a few seconds later, at 2018-03-07 14:51:10.119]

The task page (https://chromium-swarm.appspot.com/task?id=3c1b7a839f287e10&refresh=10&show_raw=1) indicates that the PubSub push handler was called when the task just started and hadn't finished yet. Is this a bug?
As discussed on Tuesday; this will probably involve waiting for a different push notification from swarming; swarming may do pubsub pushes on task start and task complete, we only care about task complete.
Another case study of a particular task:

https://chromium-swarm.appspot.com/task?id=3c6dc7ed88a55d10
Started:   3:24:17 PM (PDT)
Completed: 3:24:22 PM (PDT)

In swarming logs:
Path: /internal/taskqueue/pubsub/3c6dc7ed88a55d10
2018-03-23 15:24:18.212 PDT
Sending PubSub notify to "projects/tricium-dev/topics/worker-completion" (with userdata "CICAgKCjgIEKEigyNjE2Mjc1NmNhM2ExMGUxMGEzZjg1Y2YzNjQ1ZTk0ZTQ0ZTk5MzU5Gg1TcGFjZXlfVUJVTlRV") about completion of "3c6dc7ed88a55d10"

In tricium logs:
Path: /_ah/push-handlers/notify
2018-03-23 15:24:18.265 PDT
[driver] Received pubsub message, messageId: "60552732162855", publishTime: "2018-03-23T22:24:18.246Z"
2018-03-23 15:24:18.265 PDT
[driver] Unwrapped pubsub message, task ID: "3c6dc7ed88a55d10", TriggerRequest: run_id:5633907043074048 isolated_input_hash:"26162756ca3a10e10a3f85cf3645e94e44e99359" worker:"Spacey_UBUNTU" 

This is followed by about 5 failures to collect the results in Tricium, since the task is still running.
Then, a couple seconds later:

In swarming logs:
Path: /swarming/api/v1/bot/task_update/3c6dc7ed88a55d11
2018-03-23 15:24:22.152 PDT
Sending PubSub notify to "projects/tricium-dev/topics/worker-completion" (with userdata "CICAgKCjgIEKEigyNjE2Mjc1NmNhM2ExMGUxMGEzZjg1Y2YzNjQ1ZTk0ZTQ0ZTk5MzU5Gg1TcGFjZXlfVUJVTlRV") about completion of "3c6dc7ed88a55d10" (/base/data/home/apps/s~chromium-swarm/3405-9aff2e7.408512476543677831/server/task_scheduler.py:379)

In tricium logs:
Path: /_ah/push-handlers/notify
2018-03-23 15:24:22.211 PDT
[driver] Received pubsub message, messageId: "60548791753918", publishTime: "2018-03-23T22:24:22.172Z"
2018-03-23 15:24:22.211 PDT
[driver] Unwrapped pubsub message, task ID: "3c6dc7ed88a55d10", TriggerRequest: run_id:5633907043074048 isolated_input_hash:"26162756ca3a10e10a3f85cf3645e94e44e99359" worker:"Spacey_UBUNTU"


The swarming PubSub notify at 15:24:13 'Sending PubSub notify ... about completion of "3c6dc7ed88a55d10"' seems wrong, because that task is not completed at that time. At least, the log message is not accurate.

Comment 8 by mar...@chromium.org, Mar 23 2018

When tricium polls a task and it is still running, it should just let it go.

The function in swarming.go should be changed.
https://cs.chromium.org/chromium/infra/go/src/infra/tricium/appengine/common/swarming.go?sq=package:chromium&dr=CSs&l=138
It shouldn't return an error when the task is incomplete.
Summary: Tricium fails to collect swarming task result sometimes, it is sometimes not ready (was: Tricium: Failed to call Driver.Collect ... failed to collect swarming task result)
Components: Infra>Platform>Tricium
Components: -Infra>CodeAnalysis
Labels: -Tricium
Project Member

Comment 12 by bugdroid1@chromium.org, Jul 3

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/f186cbc9cf8a25072c82136ac26f4a6f1fad5334

commit f186cbc9cf8a25072c82136ac26f4a6f1fad5334
Author: Quinten Yearsley <qyearsley@chromium.org>
Date: Tue Jul 03 18:23:56 2018

[tricium] Do not consider it an error when there is no isolated output

In general, in the logs we are seeing two pubsub notifications
per worker for Tricium functions, the first one is received before
the task is finished, and the second one afterwards.

This changes the behavior of Tricium to do nothing when trying to
collect and finding that there is no isolated output hash yet, so
that it will just do nothing and return 200 response.
The output should still be collected after the second pubsub
notification comes along.

Note: There's a potential concern though that
if swarming's behavior changes so there's not always a pubsub
notification after the task is done, some task output will stay
uncollected and analyzer runs may stay unfinished.

Bug:  803996 
Change-Id: I253c777c6f8328bf307deee90cf7f2b576f1a327
Reviewed-on: https://chromium-review.googlesource.com/1123062
Commit-Queue: Quinten Yearsley <qyearsley@chromium.org>
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/f186cbc9cf8a25072c82136ac26f4a6f1fad5334/go/src/infra/tricium/appengine/driver/rpc_collect_test.go
[modify] https://crrev.com/f186cbc9cf8a25072c82136ac26f4a6f1fad5334/go/src/infra/tricium/appengine/common/swarming.go
[modify] https://crrev.com/f186cbc9cf8a25072c82136ac26f4a6f1fad5334/go/src/infra/tricium/appengine/driver/rpc_collect.go

Project Member

Comment 13 by bugdroid1@chromium.org, Jul 3

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/070dd6cce14cc6c739d8340bef5dfe47713392b6

commit 070dd6cce14cc6c739d8340bef5dfe47713392b6
Author: Quinten Yearsley <qyearsley@chromium.org>
Date: Tue Jul 03 22:53:07 2018

Revert "[tricium] Do not consider it an error when there is no isolated output"

This reverts commit f186cbc9cf8a25072c82136ac26f4a6f1fad5334.

Reason for revert: This doesn't work as intended; results are never collected since collect requests aren't retried like they were before.

I made a proposed fix at https://chromium-review.googlesource.com/c/infra/infra/+/1125303, but I'd like to revert this for now to get the current committed version back to a working state.

Original change's description:
> [tricium] Do not consider it an error when there is no isolated output
> 
> In general, in the logs we are seeing two pubsub notifications
> per worker for Tricium functions, the first one is received before
> the task is finished, and the second one afterwards.
> 
> This changes the behavior of Tricium to do nothing when trying to
> collect and finding that there is no isolated output hash yet, so
> that it will just do nothing and return 200 response.
> The output should still be collected after the second pubsub
> notification comes along.
> 
> Note: There's a potential concern though that
> if swarming's behavior changes so there's not always a pubsub
> notification after the task is done, some task output will stay
> uncollected and analyzer runs may stay unfinished.
> 
> Bug:  803996 
> Change-Id: I253c777c6f8328bf307deee90cf7f2b576f1a327
> Reviewed-on: https://chromium-review.googlesource.com/1123062
> Commit-Queue: Quinten Yearsley <qyearsley@chromium.org>
> Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>

TBR=maruel@chromium.org,qyearsley@chromium.org

Change-Id: I801b58889583501ca077f69c59e3e804a571ac8c
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Bug:  803996 
Reviewed-on: https://chromium-review.googlesource.com/1124801
Reviewed-by: Quinten Yearsley <qyearsley@chromium.org>
Commit-Queue: Quinten Yearsley <qyearsley@chromium.org>

[modify] https://crrev.com/070dd6cce14cc6c739d8340bef5dfe47713392b6/go/src/infra/tricium/appengine/driver/rpc_collect_test.go
[modify] https://crrev.com/070dd6cce14cc6c739d8340bef5dfe47713392b6/go/src/infra/tricium/appengine/common/swarming.go
[modify] https://crrev.com/070dd6cce14cc6c739d8340bef5dfe47713392b6/go/src/infra/tricium/appengine/driver/rpc_collect.go

Status: Fixed (was: Assigned)
See CL: https://chromium-review.googlesource.com/c/infra/infra/+/1125303

I believe this is fixed now.

Sign in to add a comment