Race condition between staging control_files.tar and starting the first suite job
Reported by jrbarnette@chromium.org, May 23 2018
Issue description
This CQ run failed:
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/18686
One of the causes of the failure was the veyron_mighty-paladin:
https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_mighty-paladin/builds/8873
That failed with this complaint:
[Test-Logs]: Suite job: FAIL: No control file for dummy_Pass
This is ... disconcerting. There's no reasonable way to pin it on a
CL in the run, so we have some sort of builder or autotest problem.
P1, and let's call it test infrastructure, until more can be learned.
May 24 2018
AFAICT, whatever stages "control_files.tar", it doesn't leave any log entries in the suite debug logs. That's true whether the suite passes or fails...
May 24 2018
OK. Skipping over how the file got staged... It _did_ get staged,
because it's now on the devserver. Looking there, you can see this:
chromeos-test@chromeos4-devserver8:~/images/veyron_mighty-paladin/R68-10710.0.0-rc1$ ls -l control_files.tar .control_filesDOTtar
-rw-r--r-- 1 chromeos-test chromeos-test 362195 May 23 15:00 .control_filesDOTtar
-rw-r--r-- 1 chromeos-test chromeos-test 4945920 May 23 14:55 control_files.tar
Here's the time for the failed RPC:
05/23 14:56:03.473 DEBUG| utils:0215| Running 'ssh 100.115.219.136 'curl "http://100.115.219.136:8082/controlfiles?suite_name=&build=veyron_mighty-paladin/R68-10710.0.0-rc1"''
So, the call to read the list of control files predated the completion
of the ".control_filesDOTtar" file.
Moreover, there's this:
chromeos-test@chromeos4-devserver8:~/images/veyron_mighty-paladin/R68-10710.0.0-rc1$ find autotest/ -name dummy_Pass -exec ls -ld {} \;
drwxr-xr-x 2 chromeos-test chromeos-test 4096 May 23 15:00 autotest/client/site_tests/dummy_Pass
So, sure looks like a race condition.
May 24 2018
On team chat, xixuan@ said this:
    we don't have stage call logs because it happens in run_suite.py,
    and we don't keep any such logs at golo_proxy server.
I haven't confirmed, but it seems likely that run_suite.py doesn't
wait for staging to complete...
The obvious fix is that somewhere just before we query for the control
files, there should be a call to wait for staging to complete.
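In rough pseudocode, the intended ordering would be something like the
following; this is a sketch only, where wait_for_staging() and
query_control_files() are hypothetical stand-ins for whatever the
devserver client actually exposes:

# Hypothetical sketch only: both method names below are placeholders,
# not the real devserver client API.
ds = dev_server.ImageServer.resolve(build)
ds.wait_for_staging(build)                # block until control_files.tar is fully staged
controls = ds.query_control_files(build)  # the query that currently races with staging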
This seems not urgent, since the bug has been around forever, and yet
this is the first time we've seen it. OTOH, it caused a CQ failure.
So... P2.
Chase-Pending, since it's caused a CQ failure, and since the fix seems
well-scoped.
Jun 4 2018
I went digging for the staging call that's allegedly in "run_suite.py".
I believe that the code that actually makes the call is in
test_suites/dev_harness.py:
ds = dev_server.ImageServer.resolve(new_globals['build'])
if new_globals['SKIP_IMAGE']:
    # ... this branch not invoked in the failure case ...
else:
    ds.trigger_download(new_globals['build'])
The call to "trigger_download()" starts the download process for
multiple artifacts, including the "control_files" artifact, but
doesn't wait for completion. There's a companion method named
"finish_download()" that's meant to ensure that staging is complete.
So, somehow "finish_download()" doesn't get called.
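For reference, the intended pairing, as I understand it, looks roughly
like this (an unverified sketch; I'm assuming the two methods behave as
their names suggest):

ds = dev_server.ImageServer.resolve(build)
ds.trigger_download(build)   # kick off asynchronous staging of the image and its artifacts
# ... other setup work can proceed here ...
ds.finish_download(build)    # meant to block until everything triggered above is staged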
Jun 4 2018
... And for definitive proof:
$ git grep -l '\.finish_download('
client/common_lib/cros/dev_server_unittest.py
The only call to "finish_download()" is made from the unit tests.
Not much good, that.
Jun 4 2018
> I believe that the code that actually makes the call is in
> test_suites/dev_harness.py:
Ugh. dev_harness.py appears to be dead. The code is evil.
I'm pretty sure we're back to this code:
def _perform_reimage_and_run(spec, afe, tko, suite_job_id=None):
    # ...
    if not spec.run_prod_code:
        _stage_artifacts_for_build(spec.devserver, spec.test_source_build)
The "run_prod_code" flag is false by default, so this _should_ run every
time. One thing I still don't understand is why it doesn't show up in the
logs.
Jun 5 2018
"_perform_reimage_and_run()" is only invoked for regular test suites.
This failure happened with a provision suite. For a provision suite,
we never call "_stage_artifacts_for_build()".
So, we're back to "how did control_files.tar get staged in the first
place?"
Digging through the code and looking for _all_ of the staging calls
that might happen from a provisioning job, it turns out there are none.
However, it turns out that the "create_suite_job" RPC call makes its own
staging call:
if run_prod_code:
    ds = dev_server.resolve(test_source_build, hostname=sample_dut)
    keyvals = {}
else:
    ds, keyvals = suite_common.stage_build_artifacts(
            test_source_build, hostname=sample_dut)
The "stage_build_artifacts()" function does this:
ds.stage_artifacts(image=build, artifacts=['test_suites'])
The devserver code has this to say about the 'test_suites' artifact:
# In general, downloading one artifact usually indicates that the caller will
# want to download other artifacts later. The following map explicitly defines
# this relationship. Specifically:
# If X is requested, all items in Y should also get triggered for download.
CROS_REQUESTED_TO_OPTIONAL_MAP = {
    TEST_SUITES: [CONTROL_FILES, AUTOTEST_PACKAGES],
}
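To make the implicit dependency concrete, here's a rough illustration of
how a map like that gets applied; this is my own sketch, not the
devserver's actual implementation:

# Illustration only; not the devserver's real code.
CROS_REQUESTED_TO_OPTIONAL_MAP = {
    'test_suites': ['control_files', 'autotest_packages'],
}

def artifacts_to_stage(requested):
    """Expand explicitly requested artifacts with their optional companions."""
    expanded = list(requested)
    for artifact in requested:
        expanded.extend(CROS_REQUESTED_TO_OPTIONAL_MAP.get(artifact, []))
    return expanded

# artifacts_to_stage(['test_suites'])
#   -> ['test_suites', 'control_files', 'autotest_packages']
# The optional artifacts get triggered for download, but nothing in the
# provision path ever waits for them -- hence the race.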
Bottom line: the fact that we stage the 'control_files' artifact at all
is _entirely accidental_ (!)
Jun 5 2018
This'll fix it:
https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1086379
Don't know how I'll test it, yet.
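The shape of the fix is roughly the snippet below; this is a sketch
based on the commit description, not the literal diff, reusing the
existing _stage_artifacts_for_build() helper named above:

# Sketch of the fix (see the CL above for the actual change): stage the
# artifacts explicitly in the provision-suite path, instead of relying on
# the accidental, non-blocking staging triggered by create_suite_job.
if not spec.run_prod_code:
    _stage_artifacts_for_build(spec.devserver, spec.test_source_build)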
Jun 6 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/adf058616eb02677d8c7ee18df124268dc7b9244

commit adf058616eb02677d8c7ee18df124268dc7b9244
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Wed Jun 06 04:28:38 2018

[autotest] Stage control_files during provisioning.

When running the provisioning test suite, the code depends on the
'control_files' artifact being staged prior to creating the
provisioning jobs.  The provisioning code has no calls that explicitly
stage the file.  As it happens, the staging call happens implicitly
during the 'create_suite_job' RPC.  However, that call makes no
guarantee that staging 'control_files' is complete when provisioning
starts.

This adds an explicit call to stage the needed artifact in the
provisioning code path, to eliminate the possible race condition.

BUG=chromium:846151
TEST=Run provision suite in a local Autotest instance

Change-Id: Ie08369fb5f125c72595751a54124039c9965141a
Reviewed-on: https://chromium-review.googlesource.com/1086379
Commit-Ready: Richard Barnette <jrbarnette@chromium.org>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/adf058616eb02677d8c7ee18df124268dc7b9244/server/cros/dynamic_suite/suite.py
Jun 6 2018
Waiting on a push to prod.
Comment 1 by jrbarnette@chromium.org, May 24 2018
Status: Assigned (was: Untriaged)

Here are the relevant log lines from the suite job running on the drone:

05/23 14:56:03.473 DEBUG| utils:0215| Running 'ssh 100.115.219.136 'curl "http://100.115.219.136:8082/controlfiles?suite_name=&build=veyron_mighty-paladin/R68-10710.0.0-rc1"''
05/23 14:56:04.579 ERROR| server_job:0813| Exception escaped control file, job aborting:
Traceback (most recent call last):
  File "/usr/local/autotest/server/server_job.py", line 805, in run
    self._execute_code(server_control_file, namespace)
  File "/usr/local/autotest/server/server_job.py", line 1328, in _execute_code
    execfile(code_file, namespace, namespace)
  File "/usr/local/autotest/results/202385114-chromeos-test/hostless/control.srv", line 62, in <module>
    dynamic_suite.run_provision_suite(**args_dict)
  File "/usr/local/autotest/server/cros/dynamic_suite/dynamic_suite.py", line 481, in run_provision_suite
    child_dependencies=spec.child_dependencies,
  File "/usr/local/autotest/server/cros/dynamic_suite/suite.py", line 1602, in __init__
    run_prod_code, test_args, test_source_build)
  File "/usr/local/autotest/server/cros/dynamic_suite/suite.py", line 1645, in _load_dummy_test
    return retriever.retrieve_for_test('dummy_Pass')
  File "/usr/local/autotest/server/cros/dynamic_suite/suite.py", line 529, in retrieve_for_test
    self._cf_getter, test_name)
  File "/usr/local/autotest/server/cros/dynamic_suite/suite_common.py", line 310, in retrieve_control_data_for_test
    path = cf_getter.get_control_file_path(test_name)
  File "/usr/local/autotest/server/cros/dynamic_suite/control_file_getter.py", line 134, in get_control_file_path
    raise error.ControlFileNotFound('No control file for ' + test_name)
ControlFileNotFound: No control file for dummy_Pass

If you ask the devserver _now_ about whether there's a "dummy_Pass" control file, the answer is "yes".  So, my best guess is that there's a race condition: Something, somewhere had to stage the control files on the devserver.  That something wasn't done, or else was done on a different devserver from 100.115.219.136.