partial push to cros-full-0008 led to stale chromite .pyc files
Issue description

In the lab push ~10:30 AM PST, cros-full-0008 was left in a weird state temporarily. It seems that autotest's site-packages/chromite checkout (via build_externals.py) was left with stale .pyc files.

The fallout was:
- temporary failure of scheduler / host_scheduler / shard_client: http://shortn/_Nsecs9a8a4
  - These recovered on their own.
- temporary AFE RPC failures: http://shortn/_fcPpZ0rCNy
  - These were fixed after a manual 'service apache restart'
- Failed HWTests due to the same import exception (e.g.: http://cautotest-prod/afe/#tab_id=view_job&object_id=205748964)

Observed import error:

    Traceback (most recent call last):
      File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
        "__main__", fname, loader, pkg_name)
      File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
        exec code in run_globals
      File "/usr/local/autotest/venv/lucifer/cmd/job_reporter.py", line 209, in <module>
        sys.exit(main(sys.argv))
      File "/usr/local/autotest/venv/lucifer/cmd/job_reporter.py", line 46, in main
        ret = _main(args)
      File "/usr/local/autotest/venv/lucifer/cmd/job_reporter.py", line 88, in _main
        ts_mon_config = autotest.chromite_load('ts_mon_config')
      File "/usr/local/autotest/venv/lucifer/autotest.py", line 174, in chromite_load
        return _load('chromite.lib.%s' % name)
      File "/usr/local/autotest/venv/lucifer/autotest.py", line 198, in _load
        return importlib.import_module(name)
      File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
        __import__(name)
      File "/usr/local/autotest/venv/autotest_lib/site-packages/chromite/lib/ts_mon_config.py", line 21, in <module>
        from chromite.lib import parallel
      File "/usr/local/autotest/venv/autotest_lib/site-packages/chromite/lib/parallel.py", line 32, in <module>
        from chromite.lib import failures_lib
      File "/usr/local/autotest/venv/autotest_lib/site-packages/chromite/lib/failures_lib.py", line 21, in <module>
        class StepFailure(Exception):
      File "/usr/local/autotest/venv/autotest_lib/site-packages/chromite/lib/failures_lib.py", line 36, in StepFailure
        EXCEPTION_CATEGORY = constants.EXCEPTION_CATEGORY_UNKNOWN
    AttributeError: 'module' object has no attribute 'EXCEPTION_CATEGORY_UNKNOWN'
,
Jun 5 2018
My best guess here is that some process was using the constants.pyc file in a way that blocked any new processes from generating a new .pyc file for the duration of this issue. The problem fixed itself, so I don't have much to go on beyond that.
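For anyone who wants to check a host for this kind of leftover, here is a rough sketch (not part of autotest or chromite). It assumes the Python 2.7 layout seen in the traceback, where foo.pyc sits next to foo.py, and it only compares file modification times, so treat it as a heuristic. The function name find_suspect_pyc is made up for illustration.

```python
import os


def find_suspect_pyc(root):
    """Return sorted .pyc paths under root whose sibling .py is missing or newer.

    Heuristic only: this compares file mtimes on disk. It does not read the
    source timestamp embedded in the .pyc header, which is what the
    interpreter itself checks.
    """
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith('.pyc'):
                continue
            pyc_path = os.path.join(dirpath, name)
            src_path = pyc_path[:-1]  # foo.pyc -> foo.py
            if not os.path.exists(src_path):
                # Orphaned bytecode: the source was removed or renamed.
                suspects.append(pyc_path)
            elif os.path.getmtime(src_path) > os.path.getmtime(pyc_path):
                # Source changed after the bytecode was written.
                suspects.append(pyc_path)
    return sorted(suspects)
```

Running it against the site-packages/chromite checkout right after a push would at least show whether constants.pyc was older than constants.py at that moment.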
,
Jun 5 2018
Pushing python packages to locations where they're in active use is not safe. There are plans to stop doing this as we rework how skylab-drones deploy software. We'll work on that in Q3/Q4 timeframe iirc. ayatane@ will be leading that effort.
,
Jun 5 2018
Currently, we deploy a variety of python packages on the lab servers in a variety of ways. So any "quick fix" for this would be non-trivial, and would likely remain quite dirty. The solution is to not deploy shared python packages that are used by arbitrary processes while they're being updated. ayatane@'s work will address the problem that way.
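To illustrate the general idea of not updating a package in place, here is a minimal sketch of a "stage, then switch a symlink" deploy. It is not the planned design and not how the lab servers work today; the function name and directory layout are hypothetical, it needs Python 3 (os.replace), and it assumes a POSIX filesystem where current_link is either absent or already a symlink.

```python
import os
import shutil
import tempfile


def deploy_atomically(src_dir, releases_dir, current_link):
    """Copy src_dir into a fresh release dir, then repoint current_link at it.

    Readers of current_link see either the old tree or the new tree,
    never a half-copied one.
    """
    os.makedirs(releases_dir, exist_ok=True)
    # Each deploy gets its own directory; old releases are left untouched.
    release = tempfile.mkdtemp(prefix='release-', dir=releases_dir)
    staged = os.path.join(release, 'pkg')
    shutil.copytree(src_dir, staged)

    # Build the new symlink under a temporary name, then rename it over the
    # old one. The rename replaces the link itself in a single step.
    tmp_link = current_link + '.tmp'
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(staged, tmp_link)
    os.replace(tmp_link, current_link)
    return staged
```

This only prevents half-written trees. A long-running process that imports more modules after the switch can still end up with a mix of old and new code, so services would still need a restart after the push.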
,
Jun 5 2018
jclinton: Can you elaborate on what you mean by "a recurring pain point"? I suspect you're not referring to one specific bug, but rather something bigger picture.
,
Jun 5 2018
Trolling back through chromeos-infra-discuss, we see a few instances of:
* folks pushing but some kind of problem leaving some machines running old code. This last happened 3 times in November and doesn't seem to have happened since then.
* pushes causing a regression and then a flurry of emergency pushes/rollbacks. Seems to happen once every month or two.
,
Jun 7 2018
-> ayatane is going to pick up an overhaul of our push/deployment process once he wraps up his current lucifer/skylab work. I leave it to his design to address this and similar concerns.
,
Oct 25
Yet another reason to prefer Go tools, which are in the works.
Comment 1 by jclinton@chromium.org
, Jun 5 2018