
Issue 849829


Issue metadata

Status: WontFix
Owner:
Closed: Oct 25
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug




partial push to cros-full-0008 led to stale chromite .pyc files

Project Member Reported by pprabhu@chromium.org, Jun 5 2018

Issue description

During the lab push at ~10:30 AM PST, cros-full-0008 was temporarily left in a weird state.
It seems that autotest's site-packages/chromite checkout (via build_externals.py) was left with stale .pyc files.

The fallout was:

- temporary failure of scheduler / host_scheduler / shard_client: http://shortn/_Nsecs9a8a4
  - These recovered on their own.
- temporary AFE RPC failures: http://shortn/_fcPpZ0rCNy
  - These were fixed after a manual 'service apache restart'.
- Failed HWTests due to the same import exception (e.g.: http://cautotest-prod/afe/#tab_id=view_job&object_id=205748964)

Observed import error:

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/autotest/venv/lucifer/cmd/job_reporter.py", line 209, in <module>
    sys.exit(main(sys.argv))
  File "/usr/local/autotest/venv/lucifer/cmd/job_reporter.py", line 46, in main
    ret = _main(args)
  File "/usr/local/autotest/venv/lucifer/cmd/job_reporter.py", line 88, in _main
    ts_mon_config = autotest.chromite_load('ts_mon_config')
  File "/usr/local/autotest/venv/lucifer/autotest.py", line 174, in chromite_load
    return _load('chromite.lib.%s' % name)
  File "/usr/local/autotest/venv/lucifer/autotest.py", line 198, in _load
    return importlib.import_module(name)
  File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/usr/local/autotest/venv/autotest_lib/site-packages/chromite/lib/ts_mon_config.py", line 21, in <module>
    from chromite.lib import parallel
  File "/usr/local/autotest/venv/autotest_lib/site-packages/chromite/lib/parallel.py", line 32, in <module>
    from chromite.lib import failures_lib
  File "/usr/local/autotest/venv/autotest_lib/site-packages/chromite/lib/failures_lib.py", line 21, in <module>
    class StepFailure(Exception):
  File "/usr/local/autotest/venv/autotest_lib/site-packages/chromite/lib/failures_lib.py", line 36, in StepFailure
    EXCEPTION_CATEGORY = constants.EXCEPTION_CATEGORY_UNKNOWN
AttributeError: 'module' object has no attribute 'EXCEPTION_CATEGORY_UNKNOWN'
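
(Illustrative, not part of the original report.) One way to confirm the stale-bytecode hypothesis: CPython 2 records the source file's mtime in the .pyc header, so a .pyc whose recorded mtime no longer matches its .py, or whose .py is missing entirely, is a candidate for serving old code. The helper below is a hypothetical sketch; the path in the comment is just the one from the traceback above.

#!/usr/bin/env python2
"""Flag stale or orphaned .pyc files under a directory tree (illustrative sketch)."""
import os
import struct
import sys


def stale_pyc_files(root):
    """Yield (pyc_path, reason) for .pyc files that look out of date."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith('.pyc'):
                continue
            pyc_path = os.path.join(dirpath, name)
            py_path = pyc_path[:-1]  # foo.pyc -> foo.py
            if not os.path.exists(py_path):
                yield pyc_path, 'orphaned (no matching .py)'
                continue
            with open(pyc_path, 'rb') as f:
                header = f.read(8)  # 4-byte magic number + 4-byte source mtime
            if len(header) < 8:
                yield pyc_path, 'truncated header'
                continue
            recorded_mtime = struct.unpack('<I', header[4:8])[0]
            source_mtime = int(os.path.getmtime(py_path)) & 0xFFFFFFFF
            if recorded_mtime != source_mtime:
                yield pyc_path, 'mtime mismatch with %s' % py_path


if __name__ == '__main__':
    # e.g. python2 find_stale_pyc.py \
    #   /usr/local/autotest/venv/autotest_lib/site-packages/chromite
    for path, reason in stale_pyc_files(sys.argv[1]):
        print '%s: %s' % (path, reason)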

 
Cc: cra...@chromium.org jclinton@chromium.org akes...@chromium.org
In the months leading up to the CI handoff, I noticed that pushing code was a recurring pain point. Is there something we can do to make this less error-prone? Is there a tracking bug for that?
My best guess here is that some process was using the constants.pyc file in a way that blocked any new processes from generating a new .pyc file for the duration of this issue.

The problem fixed itself, so I don't have much to go on beyond that.
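
(Illustrative only, not from the report.) Another mechanism that can leave stale bytecode around without anything blocking regeneration: Python 2 will import an orphaned .pyc even after its .py has been removed or replaced, so a partial push that touches the .py files but leaves old .pyc files behind can keep serving old code. A small self-contained demo, with a hypothetical module name:

#!/usr/bin/env python2
"""Show that Python 2 imports an orphaned .pyc after its .py is gone (illustrative)."""
import os
import py_compile
import shutil
import subprocess
import sys
import tempfile

workdir = tempfile.mkdtemp()
try:
    mod = os.path.join(workdir, 'constants_demo.py')
    with open(mod, 'w') as f:
        f.write('EXCEPTION_CATEGORY_UNKNOWN = "unknown"\n')
    py_compile.compile(mod)  # writes constants_demo.pyc next to the .py
    os.remove(mod)           # simulate a push removing/replacing the source
    # A fresh interpreter still imports the old bytecode from the orphaned .pyc.
    subprocess.check_call(
        [sys.executable, '-c',
         'import constants_demo; print(constants_demo.EXCEPTION_CATEGORY_UNKNOWN)'],
        cwd=workdir)
finally:
    shutil.rmtree(workdir)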
Cc: ayatane@chromium.org
Pushing Python packages to locations where they're in active use is not safe.

There are plans to stop doing this as we rework how skylab-drones deploy software. We'll work on that in the Q3/Q4 timeframe, IIRC; ayatane@ will be leading that effort.
Currently, we deploy a variety of Python packages on the lab servers in a variety of ways, so any "quick fix" for this would be non-trivial and would likely remain quite dirty.

The solution is to not deploy shared Python packages that are used by arbitrary processes while they're being updated. ayatane@'s work will address the problem that way.
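
(A minimal sketch, not ayatane@'s actual design.) One common way to avoid mutating a package tree that live processes are importing from is to deploy each push into a fresh versioned directory and atomically flip a 'current' symlink; processes that are already running keep the tree they started with, and new processes pick up the new one. All paths and names below are hypothetical.

#!/usr/bin/env python2
"""Illustrative atomic-symlink deployment sketch (hypothetical paths and names)."""
import os
import shutil


def deploy(new_tree, releases_dir, current_link, version):
    """Copy new_tree into releases_dir/<version>, then atomically repoint current_link."""
    target = os.path.join(releases_dir, version)
    shutil.copytree(new_tree, target)  # never write into the tree processes are using
    tmp_link = current_link + '.tmp'
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target, tmp_link)
    # rename() is atomic on POSIX: readers see either the old tree or the new one, never a mix.
    os.rename(tmp_link, current_link)


# Hypothetical usage: importers put /usr/local/autotest/site-packages-current on sys.path.
# deploy('/tmp/chromite-build', '/usr/local/autotest/releases',
#        '/usr/local/autotest/site-packages-current', '20180605')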
jclinton: Can you elaborate on what you mean by "a recurring pain point"? I suspect you're not referring to one specific bug, but rather something bigger picture.

Trawling back through chromeos-infra-discuss, we see a few instances of:

* Folks pushing, but some kind of problem leaving some machines running old code. This last happened three times in November and doesn't seem to have happened since then.

* Pushes causing a regression, followed by a flurry of emergency pushes/rollbacks. This seems to happen once every month or two.

Owner: ayatane@chromium.org
Status: Assigned (was: Untriaged)
-> ayatane is going to pick up an overhaul of our push/deployment process once he wraps up his current lucifer/skylab work. I leave it to his design to address this and similar concerns.
Status: WontFix (was: Assigned)
Yet another reason to prefer Go tools, a switch which is already in the works.
