Issue 811877

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Feb 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 0
Type: Bug




shards crashloop with AssertionError from _get_consistent_execution_path | lucifer doesn't understand tasks at HQE level

Project Member Reported by pho...@chromium.org, Feb 13 2018

Issue description

Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 190, in main_without_exception_handling
    dispatcher.tick()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 483, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 395, in tick
    self._send_to_lucifer()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 303, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 969, in _send_to_lucifer
    self._send_parsing_to_lucifer()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 1022, in _send_parsing_to_lucifer
    pidfile_id=pidfile_id)
  File "/usr/local/autotest/scheduler/luciferlib.py", line 115, in spawn_parsing_job_handler
    results_dir = _results_dir(manager, job)
  File "/usr/local/autotest/scheduler/luciferlib.py", line 213, in _results_dir
    return manager.absolute_path(_working_directory(job))
  File "/usr/local/autotest/scheduler/luciferlib.py", line 217, in _working_directory
    return _get_consistent_execution_path(job.hostqueueentry_set.all())
  File "/usr/local/autotest/scheduler/luciferlib.py", line 227, in _get_consistent_execution_path
    execution_entries[0]))
AssertionError: 176467115-chromeos-test/ (chromeos6-row2-rack8-host5/176467115 (176881199)) != 176467115-chromeos-test/chromeos6-row2-rack7-host16 (chromeos6-row2-rack7-host16/176467115 (176881200))
 
Labels: -Pri-1 Pri-0
This is a repeat of Issue 806131
This behavior predates lucifer, so I don't have any special insight into solving it.  We can always remove the assertion, but the assertion is catching a data inconsistency.

The results directory to be used is stored in the HQE row, and for tests with multiple HQEs it should be the same across all of them.  In the worst case, having different results directories means losing the test results and tko/parse probably failing, but since we always use the directory of the first HQE the behavior should be deterministic, so maybe it is safe to remove the assertion.
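
For context, a rough reconstruction of the check that fires (the function name and the per-HQE accessor come from the traceback above; the body is paraphrased rather than copied from luciferlib.py, so treat it as a sketch):

    # Reconstructed sketch of luciferlib._get_consistent_execution_path:
    # every HQE of a job must report the same execution path, and the first
    # HQE's path is the one actually used for the results directory.
    def _get_consistent_execution_path(execution_entries):
        first_path = execution_entries[0].execution_path()
        for entry in execution_entries[1:]:
            assert entry.execution_path() == first_path, (
                '%s (%s) != %s (%s)' % (entry.execution_path(), entry,
                                        first_path, execution_entries[0]))
        return first_path

Dropping the assertion would just mean returning first_path unconditionally, which is only "safe" to the extent that the first HQE's directory is the right one.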

Comment 4 by pho...@chromium.org, Feb 13 2018

10:36:03 ERROR| Uncaught exception, terminating monitor_db.
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 190, in main_without_exception_handling
    dispatcher.tick()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 483, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 395, in tick
    self._send_to_lucifer()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 303, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 969, in _send_to_lucifer
    self._send_parsing_to_lucifer()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 1022, in _send_parsing_to_lucifer
    pidfile_id=pidfile_id)
  File "/usr/local/autotest/scheduler/luciferlib.py", line 107, in spawn_parsing_job_handler
    drone = manager.get_drone_for_pidfile(pidfile_id)
  File "/usr/local/autotest/scheduler/luciferlib.py", line 170, in get_drone_for_pidfile
    return _wrap_drone(self._manager.get_drone_for_pidfile_id(pidfile_id))
  File "/usr/local/autotest/scheduler/drone_manager.py", line 361, in get_drone_for_pidfile_id
    return self._get_drone_for_pidfile_id(pidfile_id)
  File "/usr/local/autotest/scheduler/drone_manager.py", line 352, in _get_drone_for_pidfile_id
    assert pidfile_contents.process is not None
AssertionError


I aborted the two jobs. Now the scheduler dies with this error ^
In the meantime, I aborted the job to fix the shard.

Lucifer may in fact be complicit in making this issue more prevalent.  Jobs with multiple HQEs running on different DUTs are never going to have the same execution_dir, I think, since execution_dir is based on the DUT name.  This only matters when adopting pidfiles, however, which would only be necessary if the scheduler restarts between the multi-DUT job running and parsing.

Probably related to #4 which I will read in a moment.

Comment 6 by pho...@chromium.org, Feb 13 2018

Status: Verified (was: Assigned)
OK, the scheduler came up after one crash from the AssertionError.

Comment 7 by pho...@chromium.org, Feb 14 2018

Status: Assigned (was: Verified)
Summary: Lucifer crashlooping shards (was: Lucifer crashlooping shard chromeos-skunk-5.mtv)
3 more shards are down because of this: 

cros-full-0004
cros-full-0006
cros-full-0012

Comment 8 by pho...@chromium.org, Feb 14 2018

chromeos-test@cros-full-0012:~$ /usr/local/autotest/cli/atest job abort --no-confirmation 176674940
Operation abort_host_queue_entries failed:
    ValidationError: {'': u'You cannot abort part of a synchronous job execution (176674940/chromeos6-row1-rack3-host15), 1 included, 2 expected'}

Comment 9 by pho...@chromium.org, Feb 14 2018

Labels: -Pri-0 Pri-1
I did more SQL surgery to fix the 3 shards.
Status: Started (was: Assigned)
Going to take a look at what execution_dir is supposed to be.
Okay, a number of issues.

It is possible for a job to have HQEs with different execution_dir.  Lucifer does not handle this.

I am not sure whether this feature is being used; a quick query of the autotest database suggests it is not.  Thus, I'm inclined to ignore it, since it will be much easier to implement/fix after the next stage of lucifer rolls out.

select hqe.id, j.id, j.synch_count, execution_subdir from afe_host_queue_entries as hqe join afe_jobs as j on j.id = hqe.job_id where j.synch_count > 1 limit 1000; -- slow
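
For reference, roughly the same data pulled through the AFE Django models the scheduler already uses, grouped to surface jobs whose HQEs disagree.  The traceback in the description shows job.hostqueueentry_set exists; the import path and the synch_count/execution_subdir field names are assumed from the SQL above, so this is a sketch, not a tested script:

    # Sketch: list multi-HQE jobs whose HQEs disagree on execution_subdir.
    # Assumes the AFE Django models; field names mirror the SQL query above
    # and may not match the deployed schema exactly.
    from autotest_lib.frontend.afe import models

    def find_inconsistent_jobs(limit=1000):
        bad = []
        for job in models.Job.objects.filter(synch_count__gt=1)[:limit]:
            subdirs = set(hqe.execution_subdir
                          for hqe in job.hostqueueentry_set.all())
            if len(subdirs) > 1:
                bad.append((job.id, sorted(subdirs)))
        return bad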

The current issue is a bit different, though.  I looked at the two jobs referenced on this bug, and they both have synch_count=2 and 2 HQEs.  Those 2 HQEs should therefore be in the same execution group and have the same execution_dir.

Most jobs that have synch_count=2 have both HQEs with execution_dir=group0.  The bad ones have one HQE with execution_dir=<hostname> and the other with execution_dir="".

I believe the execution_dir assignment happens during host scheduling, so some kind of edge case bug in the host scheduler is the root cause?  Will dig into that next.

The two tests I have seen affected so far are autotest_SyncCount and autotest_P2P.
Labels: Hotlist-Lucifer
This does need to be fixed in lucifer, sadly.

If provision fails, then gathering/parsing will be run on each HQE separately rather than for the whole job/execution group.
Summary: lucifer doesn't understand tasks at HQE level (was: Lucifer crashlooping shards)
Labels: -Pri-1 Pri-0
cros-full-0006 is crashlooping on this. -> P0

Can this be turned into an ignorable exception (e.g. MalformedRecordException) as a short-term mitigation?
Fixed cros-full-0006 with more surgery (update afe_host_queue_entries set status='Aborted', aborted=1 where job_id=176908257;)
#14 Ack.  This is affecting provision failures for a subset of jobs; if we can stop the scheduler crashes, that should reduce the severity enough to give me time to fix it properly.

Spoke with phobbs about landing a quick CL to catch and ignore the assertion; badly timed scheduled training makes it hard for me to babysit the process.
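
A minimal sketch of the "ignorable exception" idea from #14.  The helper below is hypothetical and not part of autotest; the intent is that in monitor_db._send_parsing_to_lucifer the call into luciferlib would go through a guard like this, so one malformed job no longer kills the whole tick.  The actual CL may structure this differently.

    import logging

    def run_ignoring_assertions(fn, *args, **kwargs):
        """Call fn; on AssertionError, log and skip instead of crashing.

        Hypothetical helper illustrating the short-term mitigation discussed
        above, not autotest code.
        """
        try:
            return fn(*args, **kwargs)
        except AssertionError:
            logging.exception('Ignoring data inconsistency in %s', fn.__name__)
            return None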

Comment 17 Deleted

Comment 18 Deleted

Cc: -pho...@chromium.org ayatane@chromium.org
Owner: pho...@chromium.org
crrev.com/922883 should fix it
Issue 806131 has been merged into this issue.
Summary: shards crashloop with AssertionError from _get_consistent_execution_path | lucifer doesn't understand tasks at HQE level (was: lucifer doesn't understand tasks at HQE level)
This is crashlooping shards 4 and 9.
Cc: snanda@chromium.org
Fixed 4 and 9, will update the CL.

Project Member Comment 25 by bugdroid1@chromium.org, Feb 16 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/6823723485cb7c4e1b5411d1b5059905d53dc2f6

commit 6823723485cb7c4e1b5411d1b5059905d53dc2f6
Author: Allen Li <ayatane@chromium.org>
Date: Fri Feb 16 20:06:19 2018

[autotest] Ignore split PARSING HQEs

The relevant parts of the state machine are:

 PROVISIONING------------------------
      |                             |
      V                             V
RUNNING (stuff)                 PARSING (1)
      |
      V
  PARSING (2)

For multi-DUT jobs, lucifer cannot handle PARSING (1) yet.
PARSING (2) can be handled, and PARSING (1) for single DUT jobs can be
handled.

This CL is handing control of multi-DUT PARSING (1) back to the
scheduler.  This involves two changes: don’t send these HQEs to
lucifer and don’t ignore adding the agents for these HQEs in the
scheduler.

These HQEs are identified by the fact that they don’t have a
consistent execution path for all HQEs in that job.  During and after
RUNNING, they will all have the same execution path (group0), but during
provisioning they will have unique execution paths (the hostname
of the DUT).

BUG= chromium:811877 
TEST=None

Change-Id: I685c32dd22e0a5e8a1bf1784364551cdcea9d5ca
Reviewed-on: https://chromium-review.googlesource.com/924325
Reviewed-by: Paul Hobbs <phobbs@google.com>
Tested-by: Allen Li <ayatane@chromium.org>

[modify] https://crrev.com/6823723485cb7c4e1b5411d1b5059905d53dc2f6/scheduler/luciferlib.py
[modify] https://crrev.com/6823723485cb7c4e1b5411d1b5059905d53dc2f6/scheduler/monitor_db.py
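
To make the CL description concrete, a sketch of the kind of check it describes.  The helper name is hypothetical, and it assumes a per-HQE execution_path() accessor like the one in the reconstruction in the description; the landed CL may implement the test differently.

    # Hypothetical helper mirroring the CL description: multi-DUT jobs whose
    # HQEs do not share one execution path (i.e. PARSING reached straight
    # from a failed provision) stay with the scheduler instead of lucifer.
    def _job_is_split_for_parsing(job):
        paths = set(hqe.execution_path()
                    for hqe in job.hostqueueentry_set.all())
        return len(paths) > 1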

Status: Fixed (was: Started)
Should be fixed by the push of the above CL ^
