New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 772985 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

swarming.py doesn't generate output for > 8640s, which kills build incorrectly.

Project Member Reported by laszio@chromium.org, Oct 9 2017

Issue description

This seems like the relevant log:

https://uberchromegw.corp.google.com/i/chromeos/builders/amd64-llvm-next-toolchain/builds/407/steps/HWTest%20%5Bbvt-inline%5D/logs/stdio

This command appears to have hung:

12:14:20: INFO: RunCommand: /b/c/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpsJGxJt/tmpktg_qI/temp_summary.json --raw-cmd --task-name amd64-llvm-next-toolchain/R63-10007.0.0-rc1-bvt-inline --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 14400 --io-timeout 14400 --hard-timeout 14400 --expiration 1200 '--tags=priority:DEFAULT' '--tags=suite:bvt-inline' '--tags=build:amd64-llvm-next-toolchain/R63-10007.0.0-rc1' '--tags=task_name:amd64-llvm-next-toolchain/R63-10007.0.0-rc1-bvt-inline' '--tags=board:samus' -- /usr/local/autotest/site_utils/run_suite.py --build amd64-llvm-next-toolchain/R63-10007.0.0-rc1 --board samus --suite_name bvt-inline --pool bvt --num 6 --file_bugs False --priority DEFAULT --timeout_mins 180 --retry True --max_retries 5 --offload_failures_only False --job_keyvals "{'cidb_build_stage_id': 58326711L, 'cidb_build_id': 1920411, 'datastore_parent_key': ('Build', 1920411, 'BuildStage', 58326711L)}" -m 147342251


That triggered this failsafe:

14:38:55: WARNING: No output from <_BackgroundTask(_BackgroundTask-7:7:2, started)> for 8640 seconds


Which failed the build.


I have no idea WHY the swarming request hung.
More interestingly....

I would expect the builder to run "run_suite.py" locally, and for run_suite to make swarming requests. Doing it this way will cause these types of failures any time a suite takes longer than 8640, unless there is spurious and unrelated output on the builder.
The swarming & run_suite.py not hang. They work properly, logs are here:
https://chromeos-proxy.appspot.com/task?id=390bed6b09b6fe10&refresh=10&show_raw=1
https://chromeos-proxy.appspot.com/task?id=390bedaaa5a3b010&refresh=10&show_raw=1

In case there's some credential issue for checking the logs, I paste part of them here:

First swarming call to kick off the tests, then return immediately:

Autotest instance created: cautotest-prod
10-06-2017 [12:14:07] Submitted create_suite_job rpc
10-06-2017 [12:14:19] Created suite job: http://cautotest-prod.corp.google.com/afe/#tab_id=view_job&object_id=147342251
@@@STEP_LINK@Link to suite@http://cautotest-prod.corp.google.com/afe/#tab_id=view_job&object_id=147342251@@@
--create_and_return was specified, terminating now.
Will return from run_suite with status: OK

...

Second swarming call to wait for all tests to pass:

10-06-2017 [15:01:06] Suite job is finished.
10-06-2017 [15:01:06] Start collecting test results and dump them to json.
Suite job                               [ PASSED ]
platform_DMVerityCorruption             [ PASSED ]
cheets_StartAndroid.stress              [ PASSED ]
platform_DMVerityBitCorruption.first    [ PASSED ]
platform_DMVerityBitCorruption.middle   [ PASSED ]
.......


The problem is the builder 12:14:04 kicks off HWTest, 14:38:22 abort this stage. But the tests finished at 15:01:06. It happens several times in different builders, I'm sure there're some other bugs about this but sorry I can't find them.

We can 1) either find out why https://viceroy.corp.google.com/chromeos/suite_details?build_id=1920411 takes too long (more than 2 hours) and reduce it, 2) Or increase builder timeout. If this HWTest is not supposed to run too long, I suggest the first solution since there might exist some inner DUT issues or test issues.
The reason it aborted this state is because it went 8640 seconds without generating any output.

All of our builders have a mechanism that will kill them if they go for too long without generating any logs.

In practice, that failsafe is inconsistently, applied since it only applies to the current foreground stage. When multiple stages are running in parallel, then "foreground" can be semi-randomly selected.
Owner: xixuan@chromium.org
Cc: laszio@chromium.org xixuan@chromium.org
Owner: nxia@chromium.org
I'm not sure whether which solution is best since I'm don't know whether the HWTest usually succeeds in 8640 seconds.

Assign to the secondary deputy, suggest secondary deputy to first check what's the usual time cost for this HWTest stage. If it only fails this time, it should be some bad DUTs issue. If it goes beyond 8640 seconds very often. That's an issue we should bring to desk to discuss how to address it.

It doesn't have to pass inside that timeout. It only has to generate logs.

run_suite used to regularly report it's current state every so often for this exact reason.

But the swarming client doesn't do that, and may even be buffering output.

Comment 8 Deleted

Comment 9 Deleted

Comment 10 by nxia@chromium.org, Oct 9 2017

Re #3:

run_suite can only return result when the all the job complete? can it report progress back in the middle so the buildbot can get some output?

8640s is long enough and I think we shouldn't extend it longer, as it may slow down the build.
Labels: -Pri-2 Pri-1
Summary: swarming.py doesn't generate output for > 8640s, which kills build incorrectly. (was: bvt-inline succeeded but reported as fail / timed-out)
I have one question:

I agree that 8640s is long enough. But my point is a test shouldn't last such long.
Since I'm not clear about what's the difference between 'let swarming server return results in 8640s then continue to wait, until this test finishes' and 'wait for more time than 8640s for the test to finish'? Both the cases will 'slow down the build'.
I believe there are test suites that can take longer than that.

From the builder:

Will return from run_suite with status: OK
12:14:20: INFO: RunCommand: /b/c/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpsJGxJt/tmpktg_qI/temp_summary.json --raw-cmd --task-name amd64-llvm-next-toolchain/R63-10007.0.0-rc1-bvt-inline --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 14400 --io-timeout 14400 --hard-timeout 14400 --expiration 1200 '--tags=priority:DEFAULT' '--tags=suite:bvt-inline' '--tags=build:amd64-llvm-next-toolchain/R63-10007.0.0-rc1' '--tags=task_name:amd64-llvm-next-toolchain/R63-10007.0.0-rc1-bvt-inline' '--tags=board:samus' -- /usr/local/autotest/site_utils/run_suite.py --build amd64-llvm-next-toolchain/R63-10007.0.0-rc1 --board samus --suite_name bvt-inline --pool bvt --num 6 --file_bugs False --priority DEFAULT --timeout_mins 180 --retry True --max_retries 5 --offload_failures_only False --job_keyvals "{'cidb_build_stage_id': 58326711L, 'cidb_build_id': 1920411, 'datastore_parent_key': ('Build', 1920411, 'BuildStage', 58326711L)}" -m 147342251

<Refreshing Token Logs are from a different process and don't count>

14:38:22: WARNING: Killing tasks: [<_BackgroundTask(_BackgroundTask-7:7:2, started)>]


That took 2 hours and 24 minutes (ie: 8640s).



From the test suite start:

10/06 12:14:32.251 INFO |          autoserv:0685| Results placed in /usr/local/autotest/results/147342251-chromeos-test/hostless
10/06 12:14:32.251 DEBUG|          autoserv:0693| autoserv is running in drone chromeos-server11.hot.corp.google.com.

From the test suite finish:


10/06 14:58:40.114 DEBUG|   logging_manager:0627| Logging subprocess finished
10/06 14:58:40.114 DEBUG|   logging_manager:0627| Logging subprocess finished

That took 2 hours and 44 minutes.
Re #13, what's the difference between 'let swarming server return results in 8640s then continue to wait, until this test finishes' and 'extend the 8640s, wait for more time for the test to finish'?
The builder doesn't actually care WHAT is logged, just that there are some being generated.

Logging a message every 15 minutes that says "Waited X minutes of Y timeout" would keep the builder from dying.

Components: Blink>Infra
Components: -Blink>Infra Infra>Platform>Swarming
Doesn't seem to be related to Blink>Infra. Feel free to redirect if Infra>Platform>Swarming isn't the right component.
Components: -Infra>Platform>Swarming Infra>Client>ChromeOS
Owner: xixuan@chromium.org
Status: Fixed (was: Untriaged)
It's fixed.

Sign in to add a comment