Issue 673576

Issue metadata

Status: WontFix
Owner: ----
Closed: Dec 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug




27,000 seconds timeout is difficult to explain

Project Member Reported by semenzato@chromium.org, Dec 13 2016

Issue description

Terra-release build 662 fails with a 27748-second timeout, which is about 7.7 hours.

The "steps" stdio (top step) suggests that the build started at 0:00:00.

At about 2am there is some contact with cidb:

02:08:26: INFO: Running cidb query on pid 8267, repr(query) starts with 'SHOW DATABASES'
02:08:27: INFO: Created cidb engine bot@173.194.81.53 for pid 8267

Further time stamps come much later at about 5am:

Started chromeos-base/chromeos-dev-root-0.0.1-r2 (logged in /tmp/chromeos-dev-root-0.0.1-r2-VWS8DU)
Completed chromeos-base/chromeos-dev-root-0.0.1-r2 (in 0m1.7s)
Pending 0/3, [Time 05:03:59 | Elapsed 0m7.1s | Load 4.22 7.64 11.68]

Then there are gsutil fetches between 5:30 and 7:00:

05:31:47: INFO: Waiting for signer results.
05:31:47: INFO: RunCommand: /b/cbuild/internal_master/.cache/common/gsutil_4.19.tar.gz/gsutil/gsutil -o 'Boto:num_retries=10' cat gs://chromeos-releases/canary-channel/terra/9077.0.0/ChromeOS-recovery-R57-9077.0.0-terra.instructions.json
05:31:52: INFO: RunCommand: /b/cbuild/internal_master/.cache/common/gsutil_4.19.tar.gz/gsutil/gsutil -o 'Boto:num_retries=10' cat gs://chromeos-releases/dev-channel/terra/9077.0.0/ChromeOS-recovery-R57-9077.0.0-terra.instructions.json
05:32:22: INFO: RunCommand: /b/cbuild/internal_master/.cache/common/gsutil_4.19.tar.gz/gsutil/gsutil -o 'Boto:num_retries=10' cat gs://chromeos-releases/canary-channel/terra/9077.0.0/ChromeOS-recovery-R57-9077.0.0-terra.instructions.json

...

06:59:38: INFO: RunCommand: /b/cbuild/internal_master/.cache/common/gsutil_4.19.tar.gz/gsutil/gsutil -o 'Boto:num_retries=10' cat gs://chromeos-releases/dev-channel/terra/9077.0.0/ChromeOS-recovery-R57-9077.0.0-terra.instructions.json
07:00:08: INFO: RunCommand: /b/cbuild/internal_master/.cache/common/gsutil_4.19.tar.gz/gsutil/gsutil -o 'Boto:num_retries=10' cat gs://chromeos-releases/dev-channel/terra/9077.0.0/ChromeOS-recovery-R57-9077.0.0-terra.instructions.json

@@@STEP_TEXT@ChromeOS-recovery-R57-9077.0.0-terra.instructions.json@@@
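The repeated gsutil cat calls above are the paygen stage polling for the signer to publish the instructions JSON. As an illustration only, here is a minimal sketch of that wait loop, assuming a ~30-second poll interval read off the log time stamps; the function name and the timeout value are invented for this sketch, not chromite's actual code:

import subprocess
import time

# Stands in for the bundled gsutil_4.19 binary shown in the log above.
GSUTIL = 'gsutil'

def wait_for_signer_results(instructions_url, poll_seconds=30,
                            timeout_seconds=5400):
    # Poll a gs:// URL until the signer publishes the instructions JSON.
    # The poll interval and timeout are assumptions, not chromite's values.
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        result = subprocess.run(
            [GSUTIL, '-o', 'Boto:num_retries=10', 'cat', instructions_url],
            capture_output=True)
        if result.returncode == 0:
            return result.stdout  # signer results are available
        time.sleep(poll_seconds)  # wait ~30s between attempts, per the log
    raise TimeoutError('no signer results at %s' % instructions_url)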

Payload generation ends at 7:20:

07:21:50: INFO: RunCommand: /b/cbuild/internal_master/.cache/common/gsutil_4.19.tar.gz/gsutil/gsutil cp /tmp/cbuildbot-tmpIJrsWh/tmp16hHem/paygen_payload.pfi29R/delta.json gs://chromeos-releases/canary-channel/terra/9077.0.0/payloads/chromeos_9074.0.0-9077.0.0_terra_canary-channel_delta_mp-v2.bin-f4e70ecd8a69d783f2da5ef2d117fd33.signed.json
07:21:51: INFO: * Finished payload generation in 0:21:30.752503

AU tests make it:

The suite job has another 1:29:38.006588 till timeout.
The suite job has another 0:59:30.131206 till timeout.
Suite job                                        [ PASSED ]
autoupdate_EndToEndTest.paygen_au_canary_full    [ PASSED ]
autoupdate_EndToEndTest.paygen_au_canary_delta   [ PASSED ]
autoupdate_EndToEndTest.paygen_au_canary_full    [ PASSED ]
autoupdate_EndToEndTest.paygen_au_canary_full    [ PASSED ]
autoupdate_EndToEndTest.paygen_au_canary_delta   [ PASSED ]

Suite timings:
Downloads started at 2016-12-12 07:22:13
Payload downloads ended at 2016-12-12 07:22:15
Suite started at 2016-12-12 07:22:26
Artifact downloads ended (at latest) at 2016-12-12 07:22:39
Testing started at 2016-12-12 08:59:03
Testing ended at 2016-12-12 09:24:38

... and then this:

@@@STEP_LINK@Link to suite@http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=90270448@@@
--create_and_return was specified, terminating now.
Will return from run_suite with status: OK
07:32:23: INFO: RunCommand: /b/cbuild/internal_master/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpIJrsWh/tmptOygIY/temp_summary.json --raw-cmd --task-name terra-release/R57-9077.0.0-paygen_au_dev --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 14400 --io-timeout 14400 --hard-timeout 14400 --expiration 1200 '--tags=priority:Build' '--tags=suite:paygen_au_dev' '--tags=build:terra-release/R57-9077.0.0' '--tags=task_name:terra-release/R57-9077.0.0-paygen_au_dev' '--tags=board:terra' -- /usr/local/autotest/site_utils/run_suite.py --build terra-release/R57-9077.0.0 --board terra --suite_name paygen_au_dev --pool bvt --file_bugs True --priority Build --timeout_mins 180 --retry True --suite_min_duts 2 -m 90270448

@@@STEP_FAILURE@@@
09:50:55: ERROR: Timeout occurred- waited 27748 seconds, failing. Timeout reason: This build has reached the timeout deadline set by the master. Either this stage or a previous one took too long (see stage timing historical summary in ReportStage) or the build failed to start on time.

@@@STEP_FAILURE@@@
09:50:56: ERROR: Timeout occurred- waited 27279 seconds, failing. Timeout reason: This build has reached the timeout deadline set by the master. Either this stage or a previous one took too long (see stage timing historical summary in ReportStage) or the build failed to start on time.
09:50:56: INFO: Running cidb query on pid 30249, repr(query) starts with <sqlalchemy.sql.expression.Insert object at 0x7f8dfba528d0>
09:50:57: INFO: Running cidb query on pid 30249, repr(query) starts with <sqlalchemy.sql.expression.Insert object at 0x7f8df6a5b650>


 
FYI: Interpreting the timeout log message: the master builder (the canary master in this case) sets a timeout of 7:50 for the slave builders. This slave hit that timeout and self-destructed.
This timeout ensures that the next batch of canaries has a chance of finishing on time (otherwise, once a slave was delayed, it used to keep delaying that build config until manual intervention).
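As an illustration, a minimal sketch of such a master-imposed deadline, assuming a simple elapsed-time check per stage; the names run_with_deadline and DEADLINE_SECONDS are invented for this sketch, not chromite's actual implementation:

import time

DEADLINE_SECONDS = 7 * 3600 + 50 * 60  # the 7:50 deadline set by the master

def run_with_deadline(stages, deadline_seconds=DEADLINE_SECONDS):
    # Run build stages, failing once the master's deadline has passed.
    start = time.time()
    for stage in stages:
        waited = time.time() - start
        if waited > deadline_seconds:
            # The slave self-destructs rather than delay the next canary run.
            raise TimeoutError(
                'Timeout occurred- waited %d seconds, failing.' % waited)
        stage()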

Both the master and this slave started at 2:00 AM, so it wasn't delayed.
The build health dashboard for this build: http://shortn/_PZFetaY4kg
Comparing with a passing build: http://shortn/_rqYIIerTJn

BuildPackages is the longest "build" step. It takes 2.5 hours in both cases.
OTOH, HWTest and Paygen go up from ~1 hour and ~1.5 hours (overlapping) in the passing build to ~1.5 hours and ~3 hours in the timed-out build.

So I'd blame slowness in the lab for this. Your best bet is to see why the HWTest / Paygen suites were making slow progress.
Cc: pprabhu@chromium.org
Thank you, Prathmesh.

I was looking at the "steps" stdio

https://uberchromegw.corp.google.com/i/chromeos/builders/terra-release/builds/662/steps/steps/logs/stdio

and the time stamps there start at 00:00:00, which is what confused me. Maybe that's the bug?


internal.DEPS (Elapsed: 0:00:00)
----------------------------------------
[0:00:00] Started.
_____ internal.DEPS at refs/remotes/origin/master
[0:00:00] Fetching origin

________ running 'git -c core.deltaBaseCacheLimit=2g fetch origin --prune --verbose' in '/b/internal.DEPS'
[0:00:00] From https://chrome-internal.googlesource.com/chrome/tools/build/internal.DEPS
[0:00:00]  = [up to date]      infra/config -> origin/infra/config
[0:00:00]  = [up to date]      master       -> origin/master

________ running 'git reset --hard HEAD' in '/b/internal.DEPS'
[0:00:00] HEAD is now at 76e3f68 Restore symsrc checkout, accidentally removed in 7e9e5eeb.
Checked out revision 76e3f689b3eed3084f5344b69f138ac24eae3a80
[0:00:00] Finished.
----------------------------------------
gclient(629) ParseDepsFile:Found recursedeps "{'build_internal/scripts/slave': {'deps_file': '.DEPS.git'}}".

build (Elapsed: 0:00:01)
----------------------------------------
[0:00:00] Started.
_____ build at refs/remotes/origin/master
[0:00:01] Fetching origin

________ running 'git -c core.deltaBaseCacheLimit=2g fetch origin --prune --verbose' in '/b/build'
[0:00:01] From https://chromium.googlesource.com/a/chromium/tools/build
[0:00:01]  = [up to date]            infra/config          -> origin/infra/config
[0:00:01]  = [up to date]            master                -> origin/master
[0:00:01]  = [up to date]            recipe-package-canary -> origin/recipe-package-canary

________ running 'git reset --hard HEAD' in '/b/build'
[0:00:01] HEAD is now at 13597f4d50 Added flag to package_index.py to generate index pack in kythe format
Checked out revision 16defb07ee57724842c28417febe64ed24f7d936
[0:00:01] Finished.
----------------------------------------


Status: WontFix (was: Untriaged)
That's, sadly, the output from 'gclient sync'.
Each invocation with the --verbose arg prints the "elapsed time" for that invocation. If you scroll through the logs, you'll find the time resets to 0:00:00 several times, because we sync different chrome-infra and chromium repos via gclient sync.

I don't think there is an easy way to fix this. 
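As an illustration of why the clock resets, a minimal sketch assuming each sync invocation starts its own timer; timed_run is invented for this sketch, not gclient's actual code:

import subprocess
import time

def timed_run(cmd, cwd):
    # Each invocation starts its own clock, so the printed time stamps
    # are relative to this call, not to the start of the build.
    start = time.time()
    print('[0:00:00] Started.')
    subprocess.check_call(cmd, cwd=cwd)
    elapsed = int(time.time() - start)
    print('[%d:%02d:%02d] Finished.' % (
        elapsed // 3600, (elapsed % 3600) // 60, elapsed % 60))

# Syncing each repo in sequence resets the clock every time, as in the
# internal.DEPS and build excerpts above:
# timed_run(['git', 'fetch', 'origin', '--prune', '--verbose'], '/b/internal.DEPS')
# timed_run(['git', 'fetch', 'origin', '--prune', '--verbose'], '/b/build')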
