Issue 846151

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



Race condition between staging control_files.tar and starting the first suite job

Reported by jrbarnette@chromium.org, May 23 2018

Issue description

This CQ run failed:
    https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/18686

One of the causes of the failure was the veyron_mighty-paladin:
    https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_mighty-paladin/builds/8873

That failed with this complaint:
    [Test-Logs]: Suite job: FAIL: No control file for dummy_Pass

This is ... disconcerting.  There's no reasonable way to pin it on a
CL in the run, so we have some sort of builder or autotest problem.

P1, and let's call it test infrastructure, until more can be learned.

 
Owner: jrbarnette@chromium.org
Status: Assigned (was: Untriaged)
Here are the relevant log lines from the suite job running on the drone:

05/23 14:56:03.473 DEBUG|             utils:0215| Running 'ssh 100.115.219.136 'curl "http://100.115.219.136:8082/controlfiles?suite_name=&build=veyron_mighty-paladin/R68-10710.0.0-rc1"''
05/23 14:56:04.579 ERROR|        server_job:0813| Exception escaped control file, job aborting:
Traceback (most recent call last):
  File "/usr/local/autotest/server/server_job.py", line 805, in run
    self._execute_code(server_control_file, namespace)
  File "/usr/local/autotest/server/server_job.py", line 1328, in _execute_code
    execfile(code_file, namespace, namespace)
  File "/usr/local/autotest/results/202385114-chromeos-test/hostless/control.srv", line 62, in <module>
    dynamic_suite.run_provision_suite(**args_dict)
  File "/usr/local/autotest/server/cros/dynamic_suite/dynamic_suite.py", line 481, in run_provision_suite
    child_dependencies=spec.child_dependencies,
  File "/usr/local/autotest/server/cros/dynamic_suite/suite.py", line 1602, in __init__
    run_prod_code, test_args, test_source_build)
  File "/usr/local/autotest/server/cros/dynamic_suite/suite.py", line 1645, in _load_dummy_test
    return retriever.retrieve_for_test('dummy_Pass')
  File "/usr/local/autotest/server/cros/dynamic_suite/suite.py", line 529, in retrieve_for_test
    self._cf_getter, test_name)
  File "/usr/local/autotest/server/cros/dynamic_suite/suite_common.py", line 310, in retrieve_control_data_for_test
    path = cf_getter.get_control_file_path(test_name)
  File "/usr/local/autotest/server/cros/dynamic_suite/control_file_getter.py", line 134, in get_control_file_path
    raise error.ControlFileNotFound('No control file for ' + test_name)
ControlFileNotFound: No control file for dummy_Pass

If you ask the devserver _now_ about whether there's a "dummy_Pass" control
file, the answer is "yes".

So, my best guess is that there's a race condition:  Something, somewhere
had to stage the control files on the devserver.  Either that staging wasn't
done, or else it was done on a different devserver from 100.115.219.136.

Cc: ayatane@chromium.org
AFAICT, whatever stages "control_files.tar" doesn't leave any log entries
in the suite debug logs.  That's true whether the suite passes or fails...

OK.  Skipping over how the file got staged...  It _did_ get staged,
because it's now on the devserver.  Looking there, you can see this:

chromeos-test@chromeos4-devserver8:~/images/veyron_mighty-paladin/R68-10710.0.0-rc1$ ls -l control_files.tar .control_filesDOTtar 
-rw-r--r-- 1 chromeos-test chromeos-test  362195 May 23 15:00 .control_filesDOTtar
-rw-r--r-- 1 chromeos-test chromeos-test 4945920 May 23 14:55 control_files.tar

Here's the time for the failed RPC:
05/23 14:56:03.473 DEBUG|             utils:0215| Running 'ssh 100.115.219.136 'curl "http://100.115.219.136:8082/controlfiles?suite_name=&build=veyron_mighty-paladin/R68-10710.0.0-rc1"''

So, the call to read the list of control files predated the completion
of the ".control_filesDOTtar" file.

Moreover, there's this:
chromeos-test@chromeos4-devserver8:~/images/veyron_mighty-paladin/R68-10710.0.0-rc1$ find autotest/ -name dummy_Pass -exec ls -ld {} \;
drwxr-xr-x 2 chromeos-test chromeos-test 4096 May 23 15:00 autotest/client/site_tests/dummy_Pass

So, sure looks like a race condition.

Labels: -Pri-1 Chase-Pending Pri-2
Owner: ----
Status: Available (was: Assigned)
On team chat, xixuan@ said this:
    we don't have stage call logs because it happens in run_suite.py,
    and we don't keep any such logs at golo_proxy server.
I haven't confirmed, but it seems likely that run_suite.py doesn't
wait for staging to complete...

The obvious fix is to add a call, somewhere just before we query for the
control files, that waits for staging to complete.
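
Roughly this shape (a sketch only; it assumes dev_server.ImageServer has a
stage_artifacts() call that blocks until the named artifacts are staged,
which I haven't verified here):

    # Sketch, not a patch: make sure 'control_files' is staged (and wait
    # for it) before the suite queries the devserver for control files.
    ds = dev_server.ImageServer.resolve(build)
    ds.stage_artifacts(image=build, artifacts=['control_files'])
    # ... only now hit /controlfiles on the devserver ...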

This seems not urgent, since the bug has been around forever, and yet
this is the first time we've seen it.  OTOH, it caused a CQ failure.
So... P2.

Chase-Pending, since it's caused a CQ failure, and since the fix seems
well-scoped.

Labels: -Pri-2 -Chase-Pending Chase Pri-1
Owner: jrbarnette@chromium.org
Status: Assigned (was: Available)
Summary: Race condition between staging control_files.tar and starting the first suite job (was: Mysterious "No control file for dummy_Pass" on CQ run)
I went digging for the staging call that's allegedly in "run_suite.py".

I believe that the code that actually makes the call is in
test_suites/dev_harness.py:
    ds = dev_server.ImageServer.resolve(new_globals['build'])
    if new_globals['SKIP_IMAGE']:
        pass  # ... this branch not invoked in the failure case ...
    else:
        ds.trigger_download(new_globals['build'])

The call to "trigger_download()" starts the download process for
multiple artifacts, including the "control_files" artifact, but
doesn't wait for completion.  There's a companion method named
"finish_download()" that's meant to ensure that staging is complete.

So, somehow "finish_download()" doesn't get called.
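
Presumably the intended (but missing) pattern was something like this (a
sketch based on the method names above; I'm assuming "finish_download()"
takes the same build argument and blocks until staging is done):

    ds = dev_server.ImageServer.resolve(new_globals['build'])
    ds.trigger_download(new_globals['build'])   # start staging, async
    # ... other setup work can happen here ...
    ds.finish_download(new_globals['build'])    # block until staging completes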


... And for definitive proof:

    $ git grep -l '\.finish_download('
    client/common_lib/cros/dev_server_unittest.py

The only call to "finish_download()" is made from the unit tests.
Not much good, that.

> I believe that the code that actually makes the call is in
> test_suites/dev_harness.py:

Ugh. dev_harness.py appears to be dead.  The code is evil.

I'm pretty sure we're back to this code:
    def _perform_reimage_and_run(spec, afe, tko, suite_job_id=None):
        # ...
        if not spec.run_prod_code:
            _stage_artifacts_for_build(spec.devserver, spec.test_source_build)


The "run_prod_code" flag is false by default, so this _should_ run every
time.  One thing I still don't understand is why it doesn't show up in the
logs.

"_perform_reimage_and_run()" is only invoked for regular test suites.
This failure happened with a provision suite.  For a provision suite,
we never call "_stage_artifacts_for_build()".

So, we're back to "how did control_files.tar get staged in the first
place?"

Digging through the code and looking for _all_ of the staging calls
that might happen from a provisioning job, it turns out there are none.
However, the "create_suite_job" RPC call makes its own staging call:
staging call:
    if run_prod_code:
        ds = dev_server.resolve(test_source_build, hostname=sample_dut)
        keyvals = {}
    else:
        ds, keyvals = suite_common.stage_build_artifacts(
                test_source_build, hostname=sample_dut)

The "stage_build_artifacts()" function does this:
        ds.stage_artifacts(image=build, artifacts=['test_suites'])

The devserver code has this to say about the 'test_suites' artifact:
    # In general, downloading one artifact usually indicates that the caller will
    # want to download other artifacts later. The following map explicitly defines
    # this relationship. Specifically:
    # If X is requested, all items in Y should also get triggered for download.
    CROS_REQUESTED_TO_OPTIONAL_MAP = {
        TEST_SUITES: [CONTROL_FILES, AUTOTEST_PACKAGES],
    }
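
In other words (a toy illustration of the relationship that map encodes;
the function below is made up and is not the devserver's actual code):

    # Requesting 'test_suites' also *triggers* downloads of the optional
    # artifacts, but nothing about the map makes anyone wait for them.
    def artifacts_triggered_by(requested):
        triggered = list(requested)
        for name in requested:
            triggered.extend(CROS_REQUESTED_TO_OPTIONAL_MAP.get(name, []))
        return triggered

    # artifacts_triggered_by([TEST_SUITES])
    #   -> [TEST_SUITES, CONTROL_FILES, AUTOTEST_PACKAGES]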

Bottom line:  The fact that we stage the 'control_files' artifact at all
is _entirely accidental_ (!)

This'll fix it:
    https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1086379

Don't know how I'll test it, yet.
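
For reference, the shape of the fix is roughly this (a sketch of the idea
only, not the actual diff; variable names follow the snippets quoted
earlier, and the real change is in the CL above and the commit below):

    # Sketch: in the provision-suite path, stage the control files
    # explicitly (and wait) before any control-file lookups happen.
    devserver.stage_artifacts(image=test_source_build,
                              artifacts=['control_files'])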

Project Member

Comment 12 by bugdroid1@chromium.org, Jun 6 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/adf058616eb02677d8c7ee18df124268dc7b9244

commit adf058616eb02677d8c7ee18df124268dc7b9244
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Wed Jun 06 04:28:38 2018

[autotest] Stage control_files during provisioning.

When running the provisioning test suite, the code depends on the
'control_files' artifact being staged prior to creating the
provisioning jobs.  The provisioning code has no calls that
explicitly stage the file.

As it happens, the staging call happens implicitly during the
'create_suite_job' RPC.  However, that call makes no guarantee that
staging 'control_files' is complete when provisioning starts.

This adds an explicit call to stage the needed artifact in the
provisioning code path, to eliminate the possible race condition.

BUG= chromium:846151 
TEST=Run provision suite in a local Autotest instance

Change-Id: Ie08369fb5f125c72595751a54124039c9965141a
Reviewed-on: https://chromium-review.googlesource.com/1086379
Commit-Ready: Richard Barnette <jrbarnette@chromium.org>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/adf058616eb02677d8c7ee18df124268dc7b9244/server/cros/dynamic_suite/suite.py

Status: Fixed (was: Assigned)
Waiting on a push to prod.
