shards crashloop with AssertionError from _get_consistent_execution_path | lucifer doesn't understand tasks at HQE level
Issue description
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 190, in main_without_exception_handling
    dispatcher.tick()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 483, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 395, in tick
    self._send_to_lucifer()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 303, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 969, in _send_to_lucifer
    self._send_parsing_to_lucifer()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 1022, in _send_parsing_to_lucifer
    pidfile_id=pidfile_id)
  File "/usr/local/autotest/scheduler/luciferlib.py", line 115, in spawn_parsing_job_handler
    results_dir = _results_dir(manager, job)
  File "/usr/local/autotest/scheduler/luciferlib.py", line 213, in _results_dir
    return manager.absolute_path(_working_directory(job))
  File "/usr/local/autotest/scheduler/luciferlib.py", line 217, in _working_directory
    return _get_consistent_execution_path(job.hostqueueentry_set.all())
  File "/usr/local/autotest/scheduler/luciferlib.py", line 227, in _get_consistent_execution_path
    execution_entries[0]))
AssertionError: 176467115-chromeos-test/ (chromeos6-row2-rack8-host5/176467115 (176881199)) != 176467115-chromeos-test/chromeos6-row2-rack7-host16 (chromeos6-row2-rack7-host16/176467115 (176881200))
Feb 13 2018
This is a repeat of Issue 806131
Feb 13 2018
This behavior predates lucifer, so I don't have any special insight into solving it. We can always remove the assertion, but the assertion is catching a data inconsistency. The results directory to be used is stored in the HQE row, and for tests with multiple HQEs it should be the same across all of them. In the worst case, having different results directories means losing the test results and tko/parse probably failing, but since we always use the directory of the first HQE the behavior should be deterministic, so maybe it is safe to remove the assertion.
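For reference, the failing check has roughly this shape (a sketch only, assuming an execution_path() accessor like the one the assertion message implies; this is not the actual luciferlib code):

import logging

def _get_consistent_execution_path(execution_entries):
    # Every HQE in the job is expected to report the same execution path;
    # the first entry's path is what actually gets used downstream.
    first = execution_entries[0]
    for entry in execution_entries[1:]:
        # Today a mismatch raises AssertionError and kills the scheduler tick.
        # A softer variant would just log the inconsistency and fall through
        # to the first entry's path, since that directory is used anyway.
        if entry.execution_path() != first.execution_path():
            logging.warning('inconsistent execution path: %s != %s',
                            entry.execution_path(), first.execution_path())
    return first.execution_path()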
Feb 13 2018
10:36:03 ERROR| Uncaught exception, terminating monitor_db.
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 190, in main_without_exception_handling
    dispatcher.tick()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 483, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 395, in tick
    self._send_to_lucifer()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 303, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 969, in _send_to_lucifer
    self._send_parsing_to_lucifer()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 1022, in _send_parsing_to_lucifer
    pidfile_id=pidfile_id)
  File "/usr/local/autotest/scheduler/luciferlib.py", line 107, in spawn_parsing_job_handler
    drone = manager.get_drone_for_pidfile(pidfile_id)
  File "/usr/local/autotest/scheduler/luciferlib.py", line 170, in get_drone_for_pidfile
    return _wrap_drone(self._manager.get_drone_for_pidfile_id(pidfile_id))
  File "/usr/local/autotest/scheduler/drone_manager.py", line 361, in get_drone_for_pidfile_id
    return self._get_drone_for_pidfile_id(pidfile_id)
  File "/usr/local/autotest/scheduler/drone_manager.py", line 352, in _get_drone_for_pidfile_id
    assert pidfile_contents.process is not None
AssertionError
I aborted the two jobs. Now the scheduler dies with this error ^
Feb 13 2018
In the meantime, I aborted the job to fix the shard. Lucifer may in fact be complicit in making this issue more prevalent. Jobs with multiple HQEs running on different DUTs are never going to have the same execution_dir, I think, since execution_dir is based on the DUT name. This only matters when adopting pidfiles, however, which would only be necessary if the scheduler restarts between the multi-DUT job running and parsing. Probably related to #4, which I will read in a moment.
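To illustrate the path shapes involved (hypothetical helper, not autotest code; the two values come straight from the AssertionError above):

def execution_path(job_tag, execution_subdir):
    # Paths look like <job tag>/<execution_subdir>, matching the two values
    # in the AssertionError in the issue description.
    return '%s/%s' % (job_tag, execution_subdir)

print(execution_path('176467115-chromeos-test', 'chromeos6-row2-rack7-host16'))
# -> 176467115-chromeos-test/chromeos6-row2-rack7-host16 (per-DUT path)
print(execution_path('176467115-chromeos-test', ''))
# -> 176467115-chromeos-test/ (the other HQE, with an empty subdir)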
Feb 13 2018
OK, the scheduler came up after one crash from the AssertionError.
Feb 14 2018
3 more shards are down because of this: cros-full-0004, cros-full-0006, cros-full-0012.
Feb 14 2018
chromeos-test@cros-full-0012:~$ /usr/local/autotest/cli/atest job abort --no-confirmation 176674940
Operation abort_host_queue_entries failed:
ValidationError: {'': u'You cannot abort part of a synchronous job execution (176674940/chromeos6-row1-rack3-host15), 1 included, 2 expected'}
Feb 14 2018
I did more SQL surgery to fix the 3 shards.
Feb 14 2018
Going to take a look at what execution_dir is supposed to be.
Feb 14 2018
Okay, there are a number of issues.

It is possible for a job to have HQEs with different execution_dirs, and lucifer does not handle this. I am not sure this feature is being used; a quick query of the autotest database suggests that it is not. I'm inclined to ignore it for now, since it will be much easier to implement/fix after the next stage of lucifer rolls out.

select hqe.id, j.id, j.synch_count, execution_subdir
from afe_host_queue_entries as hqe
join afe_jobs as j on j.id = hqe.job_id
where j.synch_count > 1
limit 1000;  -- slow

The current issue is a bit different, though. I looked at the two jobs referenced on this bug, and they both have synch_count=2 and 2 HQEs. Those 2 HQEs should therefore be in the same execution group and thus have the same execution_dir. Most jobs with synch_count=2 have both HQEs with execution_dir=group0; the bad ones have one HQE with execution_dir=<hostname> and the other HQE with execution_dir="".

I believe the execution_dir assignment happens during host scheduling, so some kind of edge-case bug in the host scheduler is probably the root cause. Will dig into that next.

The two tests I have seen affected so far are autotest_SyncCount and autotest_P2P.
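Roughly the same check in Python, using the hostqueueentry_set relation from the traceback and the execution_subdir column from the query above (the helper name is made up):

def find_split_jobs(jobs):
    # Flag jobs whose HQEs disagree on execution_subdir, e.g. one HQE with a
    # per-host subdir and the other with '' instead of a shared 'group0'.
    bad = []
    for job in jobs:
        subdirs = {hqe.execution_subdir for hqe in job.hostqueueentry_set.all()}
        if len(subdirs) > 1:
            bad.append((job.id, sorted(subdirs)))
    return bad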
Feb 14 2018
This does need to be fixed in lucifer, sadly. If provision fails, then gathering/parsing will be run on each HQE separately rather than for the whole job/execution group.
Feb 15 2018
cros-full-0006 is crashlooping on this -> P0. Can this be turned into an ignorable exception (e.g. MalformedRecordException) as a short-term mitigation?
Feb 15 2018
Fixed cros-full-0006 with more surgery (update afe_host_queue_entries set status='Aborted', aborted=1 where job_id=176908257;)
Feb 15 2018
#14 Ack. This is affecting provision failures for a subset of jobs; if we can stop the scheduler crashes, that should reduce the severity enough to give me time to fix it properly. Spoke with phobbs about landing a quick CL to catch and ignore the assertion; badly timed scheduled training makes it hard for me to babysit the process.
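The mitigation being discussed would look something like this (a sketch, not the actual CL; the wrapper name and call shape are invented):

import logging

def send_parsing_jobs_safely(send_one_job, jobs):
    # Catch the per-job consistency failure instead of letting the
    # AssertionError escape and kill the whole scheduler tick.
    for job in jobs:
        try:
            send_one_job(job)
        except AssertionError:
            logging.exception(
                'skipping job %s with inconsistent HQE execution paths', job.id)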
Feb 15 2018
crrev.com/922883 should fix it
Feb 16 2018
Issue 806131 has been merged into this issue.
Feb 16 2018
This is crashlooping shards 4 and 9.
Feb 16 2018
Fixed 4 and 9, will update the CL.
Feb 16 2018
crrev is broken; the CL in question is https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/922883
Feb 16 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/6823723485cb7c4e1b5411d1b5059905d53dc2f6

commit 6823723485cb7c4e1b5411d1b5059905d53dc2f6
Author: Allen Li <ayatane@chromium.org>
Date: Fri Feb 16 20:06:19 2018

[autotest] Ignore split PARSING HQEs

The relevant parts of the state machine are:

PROVISIONING------------------------
     |                             |
     V                             V
RUNNING (stuff)                PARSING (1)
     |
     V
PARSING (2)

For multi-DUT jobs, lucifer cannot handle PARSING (1) yet. PARSING (2) can be handled, and PARSING (1) for single-DUT jobs can be handled.

This CL is handing control of multi-DUT PARSING (1) back to the scheduler. This involves two changes: don't send these HQEs to lucifer and don't ignore adding the agents for these HQEs in the scheduler.

These HQEs are identified by the fact that they don't have a consistent execution path for all HQEs in that job. During and after RUNNING, they will all have the same execution path (group0), but during provisioning they will have unique execution paths (the hostname of the DUT).

BUG=chromium:811877
TEST=None

Change-Id: I685c32dd22e0a5e8a1bf1784364551cdcea9d5ca
Reviewed-on: https://chromium-review.googlesource.com/924325
Reviewed-by: Paul Hobbs <phobbs@google.com>
Tested-by: Allen Li <ayatane@chromium.org>

[modify] https://crrev.com/6823723485cb7c4e1b5411d1b5059905d53dc2f6/scheduler/luciferlib.py
[modify] https://crrev.com/6823723485cb7c4e1b5411d1b5059905d53dc2f6/scheduler/monitor_db.py
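The identification described in the commit message amounts to something like this (sketch only; the real change lives in luciferlib.py and monitor_db.py in the CL above):

def job_has_consistent_execution_path(hqes):
    # True once every HQE reports the same execution path (e.g. group0 during
    # and after RUNNING); False while per-DUT provisioning paths still differ,
    # in which case the HQEs stay with the scheduler instead of lucifer.
    paths = {hqe.execution_path() for hqe in hqes}
    return len(paths) == 1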
Feb 17 2018
Should be fixed by the push of the above CL ^