New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 695546 link

Starred by 1 user

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



Sign in to add a comment

"swarming timeout" in PaygenTestCanary shows tests that started after the timeout

Project Member Reported by semenzato@chromium.org, Feb 23 2017

Issue description

https://uberchromegw.corp.google.com/i/chromeos/builders/pyro-release/builds/415

...
00:30:35: INFO: RunCommand: /b/cbuild/internal_master/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpQHm82u/tmpoR_UYW/temp_summary.json --raw-cmd --task-name pyro-release/R58-9309.0.0-paygen_au_canary --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 14400 --io-timeout 14400 --hard-timeout 14400 --expiration 1200 '--tags=priority:Build' '--tags=suite:paygen_au_canary' '--tags=build:pyro-release/R58-9309.0.0' '--tags=task_name:pyro-release/R58-9309.0.0-paygen_au_canary' '--tags=board:pyro' -- /usr/local/autotest/site_utils/run_suite.py --build pyro-release/R58-9309.0.0 --board pyro --suite_name paygen_au_canary --pool bvt --file_bugs True --priority Build --timeout_mins 180 --retry True --suite_min_duts 2 -m 102894526
...
02:54:07: WARNING: Killing tasks: [<_BackgroundTask(_BackgroundTask-7:6:7:3, started)>]

If I read this correctly, this timed out after 2h 24m (exactly 8612s)

(WHERE IS THIS TIMEOUT COMING FROM?  MORE LOGGING IS NEEDED HERE)

when I go the AFE for job 102894526 I see this list of jobs:

	Priority	Client/Server	Status	Passed Tests
102894894	pyro-release/R58-9309.0.0/paygen_au_canary/autoupdate_EndToEndTest_paygen_au_canary_full_9309.0.0	Build	Server	1 Aborted	0 / 2
102894845	pyro-release/R58-9309.0.0/paygen_au_canary/autoupdate_EndToEndTest_paygen_au_canary_delta_9309.0.0	Build	Server	1 Aborted	2 / 2
102894762	pyro-release/R58-9309.0.0/paygen_au_canary/autoupdate_EndToEndTest_paygen_au_canary_full_9306.0.0	Build	Server	1 Completed	2 / 2
102894693	pyro-release/R58-9309.0.0/paygen_au_canary/autoupdate_EndToEndTest_paygen_au_canary_delta_9306.0.0	Build	Server	1 Completed	2 / 2

The autoserv.DEBUG for job 102894845 starts and ends like this:

02/23 03:05:53.390 INFO |          autoserv:0706| Results placed in /usr/local/autotest/results/102894845-chromeos-test/chromeos2-row4-rack4-host10
...
02/23 03:27:27.568 DEBUG|   logging_manager:0627| Logging subprocess finished

It seems to me that between 0254 and 0305 there should have been enough time to decide not to even start this job.

In any case it would be good to know why the job couldn't start earlier.



 
Something similar happened here:

https://uberchromegw.corp.google.com/i/chromeos/builders/butterfly-release/builds/3448/steps/PaygenTestDev/logs/stdio

The hostless autoserv.DEBUG has this:

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/102886795-chromeos-test/hostless/debug/

02/22 23:22:26.777 DEBUG|             suite:1266| Parsing control files ...
02/22 23:22:30.307 DEBUG|             suite:1331| Parsed 2164 control files.
02/22 23:22:30.316 DEBUG|             suite:0913| Discovered 5 stable tests.
02/22 23:22:30.316 DEBUG|             suite:0915| Discovered 0 unstable tests.
02/22 23:22:30.317 INFO |        server_job:0184| INFO	----	Start paygen_au_dev	timestamp=1487834550	localtime=Feb 22 23:22:30	
02/22 23:22:30.317 DEBUG|             suite:0854| Scheduling autoupdate_EndToEndTest_paygen_au_dev_full_2913.200.0
02/22 23:22:36.174 DEBUG|             suite:1206| Adding job keyval for autoupdate_EndToEndTest_paygen_au_dev_full_2913.200.0=102887931-chromeos-test
02/22 23:22:36.174 DEBUG|             suite:0854| Scheduling autoupdate_EndToEndTest_paygen_au_dev_delta_9309.0.0
02/22 23:22:42.064 DEBUG|             suite:1206| Adding job keyval for autoupdate_EndToEndTest_paygen_au_dev_delta_9309.0.0=102887956-chromeos-test
02/22 23:22:42.064 DEBUG|             suite:0854| Scheduling autoupdate_EndToEndTest_paygen_au_dev_full_9280.0.0
02/22 23:22:47.346 DEBUG|             suite:1206| Adding job keyval for autoupdate_EndToEndTest_paygen_au_dev_full_9280.0.0=102887973-chromeos-test
02/22 23:22:47.347 DEBUG|             suite:0854| Scheduling autoupdate_EndToEndTest_paygen_au_dev_full_9309.0.0
02/22 23:22:51.485 DEBUG|             suite:1206| Adding job keyval for autoupdate_EndToEndTest_paygen_au_dev_full_9309.0.0=102888000-chromeos-test
02/22 23:22:51.485 DEBUG|             suite:0854| Scheduling autoupdate_EndToEndTest_paygen_au_dev_delta_9280.0.0
02/22 23:22:54.645 DEBUG|             suite:1206| Adding job keyval for autoupdate_EndToEndTest_paygen_au_dev_delta_9280.0.0=102888019-chromeos-test
02/22 23:22:54.645 DEBUG|             suite:0927| Scheduled 5 tests, writing the total to keyval.
02/22 23:22:54.645 DEBUG|     dynamic_suite:0593| Waiting on suite.
02/22 23:33:23.489 DEBUG|        retry_util:0136| <class 'urllib2.URLError'>(<urlopen error [Errno 110] Connection timed out>)
02/23 00:03:08.461 DEBUG|        retry_util:0129| ending retries with error: <class 'chromite.lib.timeout_util.TimeoutError'>(Timeout occurred- waited 1800 seconds.)
02/23 00:03:08.462 ERROR|             suite:1047| Exception waiting for results

I checked the first couple of tests, using the job id included above.  The ran fine, but started a few minutes after the timeout expired.



Owner: ----
Status: Available (was: Untriaged)
Labels: -current-issue Hotlist-Fixit
Fixit work needed: review & add more logging 
Labels: Logging-improv
Cc: hidehiko@chromium.org semenzato@chromium.org
Investigated. IIUC;

There are four players in this scenario.
- builder.
- parent job. (= job_id: 102894526)
- problematic job. (= job_id: 102894845) (let's call this job A), and
- job running just before problematic job. (= job_id: 102894693) (let's call this job B).

builder starts the test. So parent job is scheduled.
According to the test logs:
parent_job started at 2017-02-23 00:32, and timed out after 2h 58'43" (about 03:30).
job B started 2017-02-23 02:31, and finished in 23'40" (about 02:54).
job A started 2017-02-23 03:05, and finished in 25'33".

So, the reason of the delayed start looks like just the DUT is busy.
Also, job A is started *before* parent job's timeout, which means, there seems no chance to make job A timeout before starting, unfortunately.

In builder, multiple stages run in parallel, and nested.

The timeout log (SIGXCPU signal) is for the subtask of the builder,
rather than the ones for scheduled jobs.

So, what missing looks;
- propagating timeout from builder -> parent_job -> DUT.
  This is actually what is missing. If this is triggered,
  job A will not be started, or cancelled.
  This is also probably good to the lab's load.

- Adding "previously" running job and its time.
  According to my personal experience, sometimes jobs are delayed just due to DUT is busy, like this case.
  If we can add the line, investigation could be easier. Cons here is; this approach breaks independence of each test.

  Another approach would be printing some warning if the start time is much delayed from scheduled time,
  with a link to a page showing jobs of the DUT. Though, then AFE frontend page looks too heavy for this purpose, so practically unusable now.
  Maybe, we have to speed up it, or another FE is needed.

Components: Infra>Client>ChromeOS>CI
Components: -Infra>Client>ChromeOS
Components: -Infra>Client>ChromeOS>CI Infra>Client>ChromeOS>Test
Labels: -Logging-improv

Sign in to add a comment