Issue 265463 reimages stomping on each other
Reported by scottz@chromium.org, Jul 29 2013
Status: WontFix
Owner: milleral@chromium.org
Closed: Aug 2013
Cc: akes...@chromium.org, beeps@chromium.org, milleral@chromium.org, sbasi@chromium.org
OS: Chrome
Pri: 1
Type: Bug


Timeout causing suite prep failures?
http://chromegw/i/chromeos/builders/lumpy%20canary/builds/2900

07/28 00:18:07.386 DEBUG|  reimager:0211| scheduling reimaging across at most 6 machines
07/28 00:18:51.133 INFO |  frontend:0098| FAILED RPC CALL: add_label {'name': 'cros-version:lumpy-release/R30-4467.0.0'}
07/28 00:18:51.133 DEBUG|  reimager:0577| Version label cros-version:lumpy-release/R30-4467.0.0 already exists
07/28 00:19:24.738 WARNI|  reimager:0503| Could not find 6 hosts to use; unsatisfied dependencies: [HostSpec ['board:lumpy', 'pool:bvt', 'video_decode_v4l2']].
07/28 00:19:24.747 INFO |server_job:0124| START	----	/build/lumpy//usr/local/autotest/client/site_tests/hardware_VideoDecodeCapable/control.v4l2	timestamp=1374995887	localtime=Jul 28 00:18:07	
07/28 00:19:24.754 INFO |server_job:0124| 	TEST_NA	----	/build/lumpy//usr/local/autotest/client/site_tests/hardware_VideoDecodeCapable/control.v4l2	timestamp=1374995964	localtime=Jul 28 00:19:24	Unsatisfiable DEPENDENCIES
07/28 00:19:24.761 INFO |server_job:0124| END TEST_NA	----	/build/lumpy//usr/local/autotest/client/site_tests/hardware_VideoDecodeCapable/control.v4l2	timestamp=1374995964	localtime=Jul 28 00:19:24	
07/28 00:19:29.300 INFO |  reimager:0238| Created re-imaging job: 3979886
07/28 00:49:29.300 ERROR|server_job:0602| Exception escaped control file, job aborting:
Traceback (most recent call last):
  File "/usr/local/autotest/server/server_job.py", line 594, in run
    self._execute_code(server_control_file, namespace)
  File "/usr/local/autotest/server/server_job.py", line 1087, in _execute_code
    execfile(code_file, namespace, namespace)
  File "/usr/local/autotest/results/3979861-chromeos-test/hostless/control.srv", line 68, in <module>
    max_runtime_mins=20, bug_template=_BUG_TEMPLATE, devserver_url=devserver_url)
  File "/usr/local/autotest/server/cros/dynamic_suite/dynamic_suite.py", line 512, in reimage_and_run
    suite_job_id=my_job_id)
  File "/usr/local/autotest/server/cros/dynamic_suite/dynamic_suite.py", line 593, in _perform_reimage_and_run
    spec.devserver.finish_download(spec.build)
  File "/usr/local/autotest/client/common_lib/cros/retry.py", line 176, in func_retry
    raise error.TimeoutException('Call is timed out.')
TimeoutException: Call is timed out.
 
Status: Started
All the re-imaging jobs completed successfully.  It looks like there was a problem waiting for the devserver to finish downloading all the artifacts in this call:

spec.devserver.finish_download(spec.build)

Will see if I can learn more from the devserver logs...

Here's a snippet from the devserver log related to this build:


[28/Jul/2013:00:27:51] DOWNLOADER Downloading artifacts gs://chromeos-image-archive/lumpy-release/R30-4467.0.0/autotest.tar->/home/chromeos-test/images/lumpy-release/R30-4467.0.0.
[28/Jul/2013:00:27:54] DOWNLOADER Downloading artifacts gs://chromeos-image-archive/lumpy-release/R30-4467.0.0/autotest.tar->/home/chromeos-test/images/lumpy-release/R30-4467.0.0.
[28/Jul/2013:00:48:33] DOWNLOADER Downloading artifacts gs://chromeos-image-archive/lumpy-release/R30-4467.0.0/autotest.tar->/home/chromeos-test/images/lumpy-release/R30-4467.0.0.
[28/Jul/2013:00:48:33] DOWNLOADER Downloading artifacts gs://chromeos-image-archive/lumpy-release/R30-4467.0.0/autotest.tar->/home/chromeos-test/images/lumpy-release/R30-4467.0.0.
[28/Jul/2013:00:48:42] DOWNLOADER Downloading artifacts gs://chromeos-image-archive/lumpy-release/R30-4467.0.0/autotest.tar->/home/chromeos-test/images/lumpy-release/R30-4467.0.0.
[28/Jul/2013:00:53:04] UPDATE Responding to client to use url http://172.22.50.2:8082/static/lumpy-release/R30-4467.0.0/update.gz to get image


In particular, note the timestamps: at "28/Jul/2013:00:48:42" the downloader is still downloading artifacts, and it's not until more than four minutes later ("28/Jul/2013:00:53:04") that the devserver responds to the client.

Meanwhile, in the step failure debug log, note the timestamps here:

07/28 00:19:24.761 INFO |server_job:0124| END TEST_NA	----	/build/lumpy//usr/local/autotest/client/site_tests/hardware_VideoDecodeCapable/control.v4l2	timestamp=1374995964	localtime=Jul 28 00:19:24	
07/28 00:19:29.300 INFO |  reimager:0238| Created re-imaging job: 3979886
07/28 00:49:29.300 ERROR|server_job:0602| Exception escaped control file, job aborting:

The job is aborted at "07/28 00:49:29.300", about 4 minutes _before_ the devserver responded to the client.

I'm not sure how long it normally takes the devserver to finish downloading everything, but if it sometimes takes this long, maybe we need a longer timeout when waiting for the devserver to finish downloading everything?
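For reference, the failing call is the devserver wait wrapped in a timeout (per the retry.py frame in the traceback). Below is a minimal sketch of that pattern and of where a longer timeout would be applied; the decorator, the constant FINISH_DOWNLOAD_TIMEOUT_MINS, and the placeholder finish_download body are assumptions for illustration, not the actual autotest code:

import functools
import threading


class TimeoutException(Exception):
    """Raised when the wrapped call does not return in time."""


def with_timeout(timeout_mins):
    """Run the decorated function in a daemon thread and give up after timeout_mins."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            outcome = {}

            def target():
                try:
                    outcome['result'] = func(*args, **kwargs)
                except Exception as e:  # propagate errors raised by the worker
                    outcome['error'] = e

            worker = threading.Thread(target=target)
            worker.daemon = True
            worker.start()
            worker.join(timeout_mins * 60)
            if worker.is_alive():
                raise TimeoutException('Call is timed out.')
            if 'error' in outcome:
                raise outcome['error']
            return outcome.get('result')
        return wrapper
    return decorator


# Hypothetical constant; raising it is one way to give slow artifact staging more headroom.
FINISH_DOWNLOAD_TIMEOUT_MINS = 30


@with_timeout(FINISH_DOWNLOAD_TIMEOUT_MINS)
def finish_download(build):
    # Placeholder: poll the devserver until all artifacts for `build` are staged.
    pass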
Comment 3 by beeps@chromium.org, Jul 29 2013
No, I don't think this is the devserver; I think those machines got reimaged to R29 somehow. It took 30 minutes to time out:
07/28 00:19:29.300 INFO |  reimager:0238| Created re-imaging job: 3979886
07/28 00:49:29.300 ERROR|server_job:0602| Exception escaped control file, job aborting:

And you'll see that all but one of those reimage jobs finished on time, at around:
http://cautotest/afe/#tab_id=view_job&object_id=3979886
localtime=Jul 28 00:25:49
with one finishing at 00:59


If you take the DUTs that were part of those reimage jobs and trace them back in time to see what they were doing around 2013-07-28 00:19:28, you'll see another try job run just after this one. That is, you'll see the R30-4467.0.0 try job, followed by a successful R29-4319.44.0 try job, after which you'll see successful tests for R29-4319.44.0. R29-4319.44.0 was the canary run at 11 from the release master branch.

From cautotest, tracing one of the DUTs back I see:
Type	Job ID	Job Owner	Job Name	Time started	Status
Job	3980919	chromeos-test	lumpy-release/R29-4319.44.0/bvt/security_ProfilePermissions	2013-07-28 01:15:12	Completed
Job	3980652	chromeos-test	lumpy-release/R29-4319.44.0-try	2013-07-28 00:58:51	Completed
Job	3979886	chromeos-test	lumpy-release/R30-4467.0.0-try	2013-07-28 00:19:59	Completed
Job	3979880	chromeos-test	lumpy-release/R30-4467.0.0-try

I believe the same happened for the link canary run failure at the same time. I'm pretty sure the tests should have run before the try job since they're higher priority.
Comment 4 Deleted
Comment 5 by beeps@chromium.org, Jul 29 2013
Cc: sbasi@chromium.org
Summary: reimages stomping on each other (was: Failure due to RPC timeout?)
Interesting - thanks for digging into those logs!

I looked some more, and found out that the one DUT finishing late (at around 00:59) was busy running another job at the same time the other DUTs were reimaging:

http://cautotest/afe/#tab_id=view_job&object_id=3979829

This is "lumpy-release-pgo-generate/R30-4467.0.0/PGO_record/experimental_chromeperf_PGOPageCycler".  That job was originally created at "2013-07-27 23:56:12", but from the Status Log it didn't start running until "Jul 28 00:21:14", and it finished on "Jul 28 00:50:04".

Only after that, the DUT started running the reimage job at "Jul 28 00:51:33" and finished at about 00:59.

So it looks like the reimage job was waiting for the earlier PGOPageCycler job to complete (which happened to be for the same build number).

I suspect that either the reimage job should not have gotten scheduled on this machine in the first place (since that DUT had another job queued up that still needed to run), or else it should have preempted the PGOPageCycler job but did not.
Comment 7 by beeps@chromium.org, Jul 29 2013
No, a reimage won't preempt a test, though I think we should have moved on in the face of a failed reimage. There's code to just move on with the tests and not wait on the reimage if > 50% of the reimages pass, which looks like the case here.
https://gerrit.chromium.org/gerrit/#/c/61698/

So the issue, I think (and I haven't confirmed anything), is that the 50% of reimages that passed then got reimaged to R29, perhaps while we were waiting on this late one, so by the time we checked for the 50% we had 0.
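A rough sketch of the kind of threshold check being described, and of how the race would defeat it; the function name and arguments below are assumptions for illustration, not the actual reimager code:

def enough_reimages_succeeded(statuses, threshold=0.5):
    """Return True if strictly more than `threshold` of the reimage attempts passed.

    `statuses` maps hostname -> bool (True = the host currently carries the
    requested build).  The failure mode described above is that the hosts
    which passed get reimaged again to R29 while we wait on the straggler,
    so by the time the check runs none of them still carry R30 and the
    ratio has dropped to zero.
    """
    if not statuses:
        return False
    passed = sum(1 for ok in statuses.values() if ok)
    return passed / float(len(statuses)) > threshold

# Five of six DUTs reimaged fine, so the suite should proceed ...
print(enough_reimages_succeeded({'dut%d' % i: i != 5 for i in range(6)}))  # True
# ... but if those five are re-purposed to R29 before we check, we're at 0/6.
print(enough_reimages_succeeded({'dut%d' % i: False for i in range(6)}))   # False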
Good theory -- do you know if that logic applies after the reimages have all started and we're just waiting for them to finish, or does it also apply when waiting for all reimaging jobs to start in the first place?  The reason I ask is below.


From looking at the reimager.py code, it looks like it waits for reimaging to both (1) start, and (2) finish.

If we've successfully waited for reimaging to start, we should see this message in the logs: "Re-imaging job running."  However, that message is missing, which suggests that we actually time out while waiting for all the reimage jobs to start in the first place:


07/28 00:19:29.300 INFO |  reimager:0238| Created re-imaging job: 3979886
07/28 00:49:29.300 ERROR|server_job:0602| Exception escaped control file, job aborting:
Traceback (most recent call last):
  File "/usr/local/autotest/server/server_job.py", line 594, in run
    self._execute_code(server_control_file, namespace)
  File "/usr/local/autotest/server/server_job.py", line 1087, in _execute_code
    execfile(code_file, namespace, namespace)
  File "/usr/local/autotest/results/3979861-chromeos-test/hostless/control.srv", line 68, in <module>
    max_runtime_mins=20, bug_template=_BUG_TEMPLATE, devserver_url=devserver_url)
  File "/usr/local/autotest/server/cros/dynamic_suite/dynamic_suite.py", line 512, in reimage_and_run
    suite_job_id=my_job_id)
  File "/usr/local/autotest/server/cros/dynamic_suite/dynamic_suite.py", line 593, in _perform_reimage_and_run
    spec.devserver.finish_download(spec.build)
  File "/usr/local/autotest/client/common_lib/cros/retry.py", line 176, in func_retry
    raise error.TimeoutException('Call is timed out.')
TimeoutException: Call is timed out.

The timestamps from comments 2 and 6 also suggest the same thing, because the parent job was aborted at "07/28 00:49:29.300", but the delayed DUT didn't even start reimaging until a few minutes after that: "Jul 28 00:51:33".

So I believe that, at the point it aborts, reimager.py is still waiting for all of the reimage jobs to start.
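For concreteness, here is a minimal sketch of the two-phase wait being described; the AFE query methods, poll interval, and timeout plumbing are assumptions, and only the "Re-imaging job running." message comes from the discussion above:

import logging
import time


def wait_for_reimage_job(afe, job_id, start_timeout_secs, finish_timeout_secs,
                         poll_secs=30):
    """Wait for a re-imaging job to (1) start and then (2) finish.

    If the job never reaches the running state, we raise in the first loop
    and never log "Re-imaging job running." -- which matches the missing
    log message noted above.  `afe` is assumed to expose job_started() /
    job_finished() style queries; those names are illustrative only.
    """
    deadline = time.time() + start_timeout_secs
    while not afe.job_started(job_id):
        if time.time() > deadline:
            raise TimeoutError('Timed out waiting for job %d to start' % job_id)
        time.sleep(poll_secs)

    logging.info('Re-imaging job running.')

    deadline = time.time() + finish_timeout_secs
    while not afe.job_finished(job_id):
        if time.time() > deadline:
            raise TimeoutError('Timed out waiting for job %d to finish' % job_id)
        time.sleep(poll_secs)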
Cc: akes...@chromium.org
+Aviv in case he has cycles to look at this while lab sheriffing.
Is this bug dead now?
Owner: milleral@chromium.org
Status: WontFix
Try jobs can't stomp on each other if they don't exist anymore.