Retry failure - tried to retry after a successful retry
Reported by jrbarnette@chromium.org, Apr 18 2017
Issue description
This slave paladin run failed:
https://luci-milo.appspot.com/buildbot/chromeos/cyan-paladin/2277
It produced this suite job:
http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=112754039
The suite job logs show this:
====
04/14 18:26:15.619 DEBUG| suite:1095| Scheduling cheets_SELinuxTest, to retry afe job 112754081
04/14 18:26:15.945 DEBUG| suite:1136| Job 112757619 created to retry job 112754081. Have retried for 1 time(s)
[ ... ]
04/14 18:26:15.948 DEBUG| suite:1095| Scheduling cheets_SELinuxTest, to retry afe job 112754081
04/14 18:26:16.332 ERROR| suite:1281| Exception waiting for results
Traceback (most recent call last):
File "/usr/local/autotest/server/cros/dynamic_suite/suite.py", line 1278, in wait
bug_template=bug_template)
File "/usr/local/autotest/server/cros/dynamic_suite/suite.py", line 1306, in _record_result
retry_for=result.id, ignore_errors=True)
File "/usr/local/autotest/server/cros/dynamic_suite/suite.py", line 1131, in _schedule_test
old_job_id=retry_for, new_job_id=job.id)
File "/usr/local/autotest/server/cros/dynamic_suite/suite.py", line 155, in add_retry
old_job_id)
ValueError: We have already retried or attempted to retry job 112754081
====
So, job 112754081 failed, we scheduled job 112757619 to retry it, and that
job passed. Then we tried to retry 112754081 again, and an internal check
stopped us.
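The internal check that stopped the second retry can be pictured as a map from original job ID to retry job ID that refuses a second entry. This is a minimal sketch assuming a plain dict. The class name `RetryTracker` is hypothetical; only the `add_retry` name, its `old_job_id`/`new_job_id` keywords, and the error message come from the traceback above.

```python
class RetryTracker:
    """Hypothetical stand-in for the suite's retry bookkeeping."""

    def __init__(self):
        # Maps an original job ID to the job ID created to retry it.
        self._retry_map = {}

    def add_retry(self, old_job_id, new_job_id):
        # Refuse to record a second retry for the same original job.
        if old_job_id in self._retry_map:
            raise ValueError(
                'We have already retried or attempted to retry job %d'
                % old_job_id)
        self._retry_map[old_job_id] = new_job_id


tracker = RetryTracker()
tracker.add_retry(old_job_id=112754081, new_job_id=112757619)
try:
    # A second retry request for the same job trips the check.
    # -1 is a placeholder ID for illustration, not a real job.
    tracker.add_retry(old_job_id=112754081, new_job_id=-1)
except ValueError as e:
    print(e)  # We have already retried or attempted to retry job 112754081
```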
So far, I think it's happened just this once, but I'm still looking.
May 16 2017
Do we need more investigation on this?
May 31 2017
Jun 1 2017
This is more or less expected behavior. The suite job gets the test job status with a query like:
select * from tko_test_view_2 where job_tag like "112754081-%%"\G;
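To see why one job ID yields several rows, here is a self-contained sketch that runs the same kind of prefix match against an in-memory SQLite table. The table name mirrors `tko_test_view_2`, but the columns (`job_tag`, `test_name`, `status`), the tag suffix `user/host`, and the record names `SERVER_JOB` and `CLIENT_JOB` are assumptions for illustration, not the real TKO schema. In a LIKE pattern, `%%` matches the same strings as a single `%`.

```python
import sqlite3

# In-memory stand-in for the TKO view; the schema here is hypothetical.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE tko_test_view_2 (job_tag TEXT, test_name TEXT, status TEXT)')
conn.executemany(
    'INSERT INTO tko_test_view_2 VALUES (?, ?, ?)',
    [
        # One AFE job produces several TKO records (names assumed).
        ('112754081-user/host', 'SERVER_JOB', 'FAIL'),
        ('112754081-user/host', 'CLIENT_JOB', 'FAIL'),
        ('112754081-user/host', 'cheets_SELinuxTest', 'GOOD'),
        # A record from a different job must not match the prefix.
        ('112757619-user/host', 'cheets_SELinuxTest', 'GOOD'),
    ])

# Same prefix match as the query above: every tag starting with "<job id>-".
rows = conn.execute(
    'SELECT test_name, status FROM tko_test_view_2 '
    'WHERE job_tag LIKE ? ORDER BY rowid',
    ('112754081-%',)).fetchall()
for test_name, status in rows:
    print(test_name, status)
# SERVER_JOB FAIL
# CLIENT_JOB FAIL
# cheets_SELinuxTest GOOD
```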
The code path is:
results_generator = job_status.wait_for_child_results(
self._afe, self._tko, self._suite_job_id)
which calls
job_status._yield_job_results
https://cs.corp.google.com/android/external/autotest/server/cros/dynamic_suite/job_status.py?q=wait_for_child_results&sq=package:%5E(android)$+file:(/%7C%5E)&dr=CSs&l=175
which calls
statuses = tko.get_job_test_statuses_from_db(job.id)
which runs the query here:
https://cs.corp.google.com/android/external/autotest/server/frontend.py?rcl=8f0c7263db293872c2d6c31e22f12864450864d4&l=145
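The call chain above can be sketched as a pair of generators: each finished child job is expanded into one result per TKO record, so a single job can surface several times. This is a simplified illustration. `FakeTko` and the `Status` tuple are hypothetical stand-ins, and while the function names follow the code path above, the signatures are simplified (the real functions also take an AFE handle and the suite job ID, and wait for jobs to finish).

```python
from collections import namedtuple

# Hypothetical shape of one TKO record.
Status = namedtuple('Status', ['job_id', 'test_name', 'status'])


class FakeTko:
    """Hypothetical stand-in for the TKO frontend."""

    def __init__(self, records):
        self._records = records

    def get_job_test_statuses_from_db(self, job_id):
        # Mirrors the prefix query: every record belonging to this job.
        return [r for r in self._records if r.job_id == job_id]


def _yield_job_results(tko, job_ids):
    # Simplified signature. Yields one result per TKO record, not per job.
    for job_id in job_ids:
        for status in tko.get_job_test_statuses_from_db(job_id):
            yield status


def wait_for_child_results(tko, child_job_ids):
    # Simplified: assumes all child jobs have already finished.
    for result in _yield_job_results(tko, child_job_ids):
        yield result


tko = FakeTko([
    Status(112754081, 'SERVER_JOB', 'FAIL'),
    Status(112754081, 'CLIENT_JOB', 'FAIL'),
    Status(112754081, 'cheets_SELinuxTest', 'GOOD'),
])
results = list(wait_for_child_results(tko, [112754081]))
print(len(results))  # 3
```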
Basically, the TKO query returns 3 records: the server job, the client job, and the test.
The test passed, but the server job and the client job failed. Each failed job_status leads to a retry attempt, which is why there was a second attempt to retry job 112754081.
The suite job uses a _retry_map to keep a record, so 2 failed job_status entries for the same job won't lead to 2 new test jobs. Instead, only 1 test job is created.
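The deduplication described above can be sketched as follows: every failed status asks for a retry of its job, but a map keyed by the original job ID lets only the first request create a new test job. This is a hypothetical illustration; `schedule_retries`, the `schedule_retry` callable, and the record tuples are assumptions. The sketch checks the map before scheduling, whereas the traceback in the original report shows the real check happening inside `add_retry`, which raises `ValueError` on the duplicate.

```python
def schedule_retries(statuses, schedule_retry):
    """Hypothetical sketch: retry each failed job at most once.

    statuses: iterable of (job_id, test_name, status) tuples.
    schedule_retry: callable taking a job ID and returning the new job ID.
    """
    retry_map = {}  # original job ID -> retry job ID
    for job_id, _test_name, status in statuses:
        if status == 'GOOD':
            continue
        if job_id in retry_map:
            # A second failed record for the same job: no new test job.
            continue
        retry_map[job_id] = schedule_retry(job_id)
    return retry_map


created = []


def fake_schedule(job_id):
    # Placeholder scheduler; returns a made-up job ID.
    created.append(job_id)
    return 1000 + len(created)


statuses = [
    (112754081, 'SERVER_JOB', 'FAIL'),
    (112754081, 'CLIENT_JOB', 'FAIL'),
    (112754081, 'cheets_SELinuxTest', 'GOOD'),
]
retry_map = schedule_retries(statuses, fake_schedule)
print(created)  # [112754081]
```

Two failed records for the same job produce exactly one call to the scheduler.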
Comment 1 by nxia@chromium.org, Apr 20 2017