Tricium fails to collect swarming task result sometimes, it is sometimes not ready |
|||||||
Issue description
Found in the tricium-dev logs.
Full logged error: [driver] Failed to call Driver.Collect :: {"error":"rpc error: code = Internal desc = failed to trigger worker: failed to collect swarming task result: missing isolated output, task id: 3b292fe464967010"}
Example cases:
[driver]: Received collect request (run ID: 4880256448069632, worker: GitFileIsolator_UBUNTU, task ID: 3b294206517ac510)
[driver]: Received collect request (run ID: 6372074683629568, worker: Spacey_UBUNTU, task ID: 3b292fe464967010)
In the first one above, the task failed; in the second it succeed.
Logged here:
https://cs.chromium.org/chromium/infra/go/src/infra/tricium/appengine/driver/rpc_collect.go?l=50
M-A, do you have ideas about where to look next to investigate this? Is it relevant that the error message says "missing isolated output"?
,
Jan 22 2018
,
Mar 5 2018
Inspecting the logs confirms the above in #1. For example: https://chromium-swarm.appspot.com/task?id=3c11014474fec310 was running for 5 seconds, 2:02:13 - 2:02:18 PM PST, and in the middle of those 5 seconds there were several failures in a row for the same task, e.g. 2018-03-05 14:02:15.166 PST [driver]: Received collect request (run ID: 5140140221530112, worker: Spacey_UBUNTU, task ID: 3c11014474fec310) 2018-03-05 14:02:15.220 PST [driver] Failed to call Driver.Collect :: {"error":"rpc error: code = Internal desc = failed to trigger worker: failed to collect swarming task result: missing isolated output, task id: 3c11014474fec310"} 2018-03-05 14:02:15.765 PST [driver]: Received collect request (run ID: 5140140221530112, worker: Spacey_UBUNTU, task ID: 3c11014474fec310) 2018-03-05 14:02:15.818 PST [driver] Failed to call Driver.Collect :: {"error":"rpc error: code = Internal desc = failed to trigger worker: failed to collect swarming task result: missing isolated output, task id: 3c11014474fec310"} Then immediately after the above, there was a success for the same task: 2018-03-05 14:02:19.671 PST [driver]: Received collect request (run ID: 5140140221530112, worker: Spacey_UBUNTU, task ID: 3c11014474fec310) 2018-03-05 14:02:20.140 PST Fetched isolate: "{\"algo\":\"sha-1\",\"files\":{\"tricium/data/results.json\":{\"h\":\"bf21a9e8fbc5a3846fb05b4fa0859e0917b2202f\",\"m\":416,\"s\":2}},\"version\":\"1.6\"}", iso: &{sha-1 [] map[tricium/data/results.json:{bf21a9e8fbc5a3846fb05b4fa0859e0917b2202f <nil> 0xc0087c1378 0xc0087c1398 }] [] <nil> 1.6} 2018-03-05 14:02:20.495 PST [driver] Successfully completed collect
,
Mar 7 2018
More notes about this: The Collect function in common/swarming.go currently says: The task in question should be completed before this function is called and the task should have isolated output. The isolated output and exit code of the task are returned. If there's a better/simpler way to ask Swarming "is this task completed yet?" then we could use that. Otherwise, I propose that we should change Collect in common/swarming.go to return no output isolate hash and no error in order to indicate that the task isn't completed yet.
,
Mar 7 2018
OK, I think I'm starting to understand this slightly better: Tricium is only supposed to try to Collect tasks when it has been notified by Swarming via PubSub that the task is done, so we really are expecting Collect to be called when the task is already completed. Right? Another example of a sequence of logs, including the pubsub notification: 2018-03-07 14:51:01.614 /_ah/push-handlers/notify [driver] Unwrapped pubsub message, task ID: "3c1b7a839f287e10", TriggerRequest: run_id:5707166098915328 isolated_input_hash:"556329f3fbb2bc1fe2bca898efe3a75a7c34dc56" worker:"GitFileIsolator_UBUNTU" [driver] Enqueued collect request, runID: 5707166098915328, worker: GitFileIsolator_UBUNTU 2018-03-07 14:51:01.848 /driver/internal/collect [driver] Failed to call Driver.Collect :: {"error":"rpc error: code = Internal desc = failed to trigger worker: failed to collect swarming task result: missing isolated output, task id: 3c1b7a839f287e10"} [Five more failed requests to /driver/internal/collect with the same result, followed by one successful request a few seconds later, at 2018-03-07 14:51:10.119] The task page (https://chromium-swarm.appspot.com/task?id=3c1b7a839f287e10&refresh=10&show_raw=1) indicates that the PubSub push handler was called when the task just started and hadn't finished yet. Is this a bug?
,
Mar 15 2018
As discussed on Tuesday; this will probably involve waiting for a different push notification from swarming; swarming may do pubsub pushes on task start and task complete, we only care about task complete.
,
Mar 23 2018
Another case study of a particular task: https://chromium-swarm.appspot.com/task?id=3c6dc7ed88a55d10 Started: 3:24:17 PM (PDT) Completed: 3:24:22 PM (PDT) In swarming logs: Path: /internal/taskqueue/pubsub/3c6dc7ed88a55d10 2018-03-23 15:24:18.212 PDT Sending PubSub notify to "projects/tricium-dev/topics/worker-completion" (with userdata "CICAgKCjgIEKEigyNjE2Mjc1NmNhM2ExMGUxMGEzZjg1Y2YzNjQ1ZTk0ZTQ0ZTk5MzU5Gg1TcGFjZXlfVUJVTlRV") about completion of "3c6dc7ed88a55d10" In tricium logs: Path: /_ah/push-handlers/notify 2018-03-23 15:24:18.265 PDT [driver] Received pubsub message, messageId: "60552732162855", publishTime: "2018-03-23T22:24:18.246Z" 2018-03-23 15:24:18.265 PDT [driver] Unwrapped pubsub message, task ID: "3c6dc7ed88a55d10", TriggerRequest: run_id:5633907043074048 isolated_input_hash:"26162756ca3a10e10a3f85cf3645e94e44e99359" worker:"Spacey_UBUNTU" This is followed by about 5 failures to collect the results in Tricium, since the task is still running. Then, a couple seconds later: In swarming logs: Path: /swarming/api/v1/bot/task_update/3c6dc7ed88a55d11 2018-03-23 15:24:22.152 PDT Sending PubSub notify to "projects/tricium-dev/topics/worker-completion" (with userdata "CICAgKCjgIEKEigyNjE2Mjc1NmNhM2ExMGUxMGEzZjg1Y2YzNjQ1ZTk0ZTQ0ZTk5MzU5Gg1TcGFjZXlfVUJVTlRV") about completion of "3c6dc7ed88a55d10" (/base/data/home/apps/s~chromium-swarm/3405-9aff2e7.408512476543677831/server/task_scheduler.py:379) In tricium logs: Path: /_ah/push-handlers/notify 2018-03-23 15:24:22.211 PDT [driver] Received pubsub message, messageId: "60548791753918", publishTime: "2018-03-23T22:24:22.172Z" 2018-03-23 15:24:22.211 PDT [driver] Unwrapped pubsub message, task ID: "3c6dc7ed88a55d10", TriggerRequest: run_id:5633907043074048 isolated_input_hash:"26162756ca3a10e10a3f85cf3645e94e44e99359" worker:"Spacey_UBUNTU" The swarming PubSub notify at 15:24:13 'Sending PubSub notify ... about completion of "3c6dc7ed88a55d10"' seems wrong, because that task is not completed at that time. At least, the log message is not accurate.
,
Mar 23 2018
When tricium polls a task and it is still running, it should just let it go. The function in swarming.go should be changed. https://cs.chromium.org/chromium/infra/go/src/infra/tricium/appengine/common/swarming.go?sq=package:chromium&dr=CSs&l=138 It shouldn't return an error when the task is incomplete.
,
May 15 2018
,
Jun 15 2018
,
Jun 15 2018
,
Jul 3
The following revision refers to this bug: https://chromium.googlesource.com/infra/infra/+/f186cbc9cf8a25072c82136ac26f4a6f1fad5334 commit f186cbc9cf8a25072c82136ac26f4a6f1fad5334 Author: Quinten Yearsley <qyearsley@chromium.org> Date: Tue Jul 03 18:23:56 2018 [tricium] Do not consider it an error when there is no isolated output In general, in the logs we are seeing two pubsub notifications per worker for Tricium functions, the first one is received before the task is finished, and the second one afterwards. This changes the behavior of Tricium to do nothing when trying to collect and finding that there is no isolated output hash yet, so that it will just do nothing and return 200 response. The output should still be collected after the second pubsub notification comes along. Note: There's a potential concern though that if swarming's behavior changes so there's not always a pubsub notification after the task is done, some task output will stay uncollected and analyzer runs may stay unfinished. Bug: 803996 Change-Id: I253c777c6f8328bf307deee90cf7f2b576f1a327 Reviewed-on: https://chromium-review.googlesource.com/1123062 Commit-Queue: Quinten Yearsley <qyearsley@chromium.org> Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org> [modify] https://crrev.com/f186cbc9cf8a25072c82136ac26f4a6f1fad5334/go/src/infra/tricium/appengine/driver/rpc_collect_test.go [modify] https://crrev.com/f186cbc9cf8a25072c82136ac26f4a6f1fad5334/go/src/infra/tricium/appengine/common/swarming.go [modify] https://crrev.com/f186cbc9cf8a25072c82136ac26f4a6f1fad5334/go/src/infra/tricium/appengine/driver/rpc_collect.go
,
Jul 3
The following revision refers to this bug: https://chromium.googlesource.com/infra/infra/+/070dd6cce14cc6c739d8340bef5dfe47713392b6 commit 070dd6cce14cc6c739d8340bef5dfe47713392b6 Author: Quinten Yearsley <qyearsley@chromium.org> Date: Tue Jul 03 22:53:07 2018 Revert "[tricium] Do not consider it an error when there is no isolated output" This reverts commit f186cbc9cf8a25072c82136ac26f4a6f1fad5334. Reason for revert: This doesn't work as intended; results are never collected since collect requests aren't retried like they were before. I made a proposed fix at https://chromium-review.googlesource.com/c/infra/infra/+/1125303, but I'd like to revert this for now to get the current committed version back to a working state. Original change's description: > [tricium] Do not consider it an error when there is no isolated output > > In general, in the logs we are seeing two pubsub notifications > per worker for Tricium functions, the first one is received before > the task is finished, and the second one afterwards. > > This changes the behavior of Tricium to do nothing when trying to > collect and finding that there is no isolated output hash yet, so > that it will just do nothing and return 200 response. > The output should still be collected after the second pubsub > notification comes along. > > Note: There's a potential concern though that > if swarming's behavior changes so there's not always a pubsub > notification after the task is done, some task output will stay > uncollected and analyzer runs may stay unfinished. > > Bug: 803996 > Change-Id: I253c777c6f8328bf307deee90cf7f2b576f1a327 > Reviewed-on: https://chromium-review.googlesource.com/1123062 > Commit-Queue: Quinten Yearsley <qyearsley@chromium.org> > Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org> TBR=maruel@chromium.org,qyearsley@chromium.org Change-Id: I801b58889583501ca077f69c59e3e804a571ac8c No-Presubmit: true No-Tree-Checks: true No-Try: true Bug: 803996 Reviewed-on: https://chromium-review.googlesource.com/1124801 Reviewed-by: Quinten Yearsley <qyearsley@chromium.org> Commit-Queue: Quinten Yearsley <qyearsley@chromium.org> [modify] https://crrev.com/070dd6cce14cc6c739d8340bef5dfe47713392b6/go/src/infra/tricium/appengine/driver/rpc_collect_test.go [modify] https://crrev.com/070dd6cce14cc6c739d8340bef5dfe47713392b6/go/src/infra/tricium/appengine/common/swarming.go [modify] https://crrev.com/070dd6cce14cc6c739d8340bef5dfe47713392b6/go/src/infra/tricium/appengine/driver/rpc_collect.go
,
Aug 3
See CL: https://chromium-review.googlesource.com/c/infra/infra/+/1125303 I believe this is fixed now. |
|||||||
►
Sign in to add a comment |
|||||||
Comment 1 by emso@chromium.org
, Jan 22 2018