Issue 832146

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Jul 24
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




DUTs getting stuck in Pending status

Project Member Reported by akes...@chromium.org, Apr 12 2018

Issue description

Related to Issue 829925?
Cc: pho...@chromium.org
+phobbs to comment on whether this is an artifact of the metric/precomputation.
For bug 804625 I have a new inventory script that will find DUTs that
have been stuck like this for at least 24 hours.  I just ran the script;
it's identified only 16 stuck DUTs.  So, either this is a metrics anomaly,
or the problem hasn't yet lasted for 24 hours.

Below is the output identifying the stuck DUTs.

====
Scanning for idle, unlocked DUTs.
Untestable DUT: chromeos4-row6-rack11-host15, model: veyron_mighty, pool: suites
Untestable DUT: chromeos4-row9-rack10-host20, model: veyron_minnie, pool: suites
Untestable DUT: chromeos6-row2-rack20-host5, model: wizpig, pool: suites
Untestable DUT: chromeos6-row4-rack17-host4, model: eve, pool: suites
Untestable DUT: chromeos6-row4-rack17-host8, model: eve, pool: suites
Untestable DUT: chromeos6-row4-rack17-host10, model: eve, pool: suites
Untestable DUT: chromeos6-row4-rack17-host16, model: eve, pool: suites
Untestable DUT: chromeos6-row4-rack11-host20, model: eve, pool: cts
Untestable DUT: chromeos4-row8-rack8-host20, model: setzer, pool: suites
Untestable DUT: chromeos4-row5-rack7-host7, model: gandof, pool: suites
Untestable DUT: chromeos2-row7-rack11-host3, model: elm, pool: bvt
Untestable DUT: chromeos6-row4-rack21-host19, model: nasher360, pool: suites
Untestable DUT: chromeos4-row11-rack9-host5, model: chell, pool: suites
Untestable DUT: chromeos4-row12-rack9-host5, model: chell, pool: suites
Untestable DUT: chromeos6-row4-rack6-host1, model: veyron_jerry, pool: suites
Untestable DUT: chromeos6-row3-rack6-host7, model: pyro, pool: cts

Metrics say it hasn't lasted for 24h yet.

Can you run the same diagnosis with a 12h window?
Owner: ayatane@chromium.org
Status: Assigned (was: Untriaged)
-> ayatane is investigating.
Example stuck DUT: chromeos6-row4-rack24-host13

Last job was at 4pm yesterday, stuck in Pending.

The lucifer logs from the last job contain no abnormalities and claim to have set the host back to Ready.

The scheduler logs do not show the scheduler also trying to run the job, which was the cause of the Running bug.

The last job that ran was not a synch_count > 1 job, and the Pending state only really gets used for synch_count > 1 jobs.

Pending state also precedes Starting, and Lucifer only takes over at Starting.

Frustrating.
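For reference, a minimal sketch of the lifecycle described above, using the status names quoted from the scheduler code later in this bug; this is illustrative only, not the scheduler's actual data model:

    # Illustrative only -- a simplified model of the host queue entry lifecycle
    # discussed above; status names follow the scheduler snippet quoted below.
    LIFECYCLE = ['Queued', 'Pending', 'Starting', 'Running']

    def lucifer_owns(status):
        # Lucifer only takes over at Starting, so an entry stuck in Pending is
        # still the monolithic scheduler's responsibility.
        return LIFECYCLE.index(status) >= LIFECYCLE.index('Starting')

    assert not lucifer_owns('Pending')
    assert lucifer_owns('Starting')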

Comment 7 by pbe...@chromium.org, Apr 12 2018

Cc: pbe...@chromium.org
Components: -Infra>Client>ChromeOS Infra>Client>ChromeOS>Test
> Metrics say it hasn't lasted for 24h yet.
>
> Can you run the same diagnosis with a 12h window?

Running now.  ETA for data is about 11:15.

This isn't an artifact of the precomputation being wrong. Here's both the precomputation and equivalent raw query together: http://shortn/_j6iCThJOGQ
With a 12 hour window, the inventory finds 323 idle DUTs.  The list
is in the attached file.

idle-list.txt (22.4 KB, attached)
Re 11, How accurate/useful is that list?

For example, chromeos2-row4-rack2-host4 on that list is in state Ready.  It's true that it hasn't run a job since 2018-04-06 though.
chromeos2-row4-rack2-host4 has been running verify tasks every day since 2018-04-06, its last job.
I found this in the code. Famous last words:

        # Some debug code here: sends an email if an asynchronous job does not
        # immediately enter Starting.
        # TODO: Remove this once we figure out why asynchronous jobs are getting
        # stuck in Pending.
        self.job.run_if_ready(queue_entry=self)
        if (self.job.synch_count == 1 and
                self.status == models.HostQueueEntry.Status.PENDING):
            subject = 'Job %s (id %s)' % (self.job.name, self.job.id)
            message = 'Asynchronous job stuck in Pending'
            email_manager.manager.enqueue_notify_email(subject, message)

> Re 11, How accurate/useful is that list?

It identifies all DUTs that haven't run a job in the last 12 hours.
We have a cron job that runs every 8 hours and schedules trivial
work on any DUT that's on the verge of being idle more than 24 hours.
So, if a DUT is on that list, attempts to schedule work on the DUT
have likely hung up.

So, in the case of chromeos2-row4-rack2-host4: if it hasn't done anything
since 4/6, then the regularly scheduled verify task isn't ever running.
That means there's something wrong with the scheduler that is preventing
the DUT from being scheduled.

The list does have a caveat:  The cron job that schedules work for idle DUTs
looks for idle periods greater than 20 hours.  The list identifies DUTs that
have been idle for more than 12 hours.  So, some DUTs may be flagged simply
because they hadn't hit the 20 hour threshold as of 4:00 this morning.
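For illustration, a rough sketch of the window logic described above, with hypothetical names; the real inventory script and cron job are not reproduced here:

    # Rough sketch of the idle-window check described above.  Names are
    # hypothetical; the real inventory script reads job history from the AFE.
    import datetime

    REPORT_WINDOW = datetime.timedelta(hours=12)    # window used for the attached list
    CRON_THRESHOLD = datetime.timedelta(hours=20)   # cron schedules trivial work past this

    def flag_idle_duts(last_job_times, now=None):
        """Return hostnames whose last job is older than the report window.

        Some flagged DUTs may simply not have crossed the cron job's 20 hour
        threshold yet -- the caveat noted above.
        """
        now = now or datetime.datetime.utcnow()
        return sorted(host for host, last in last_job_times.items()
                      if now - last > REPORT_WINDOW)

    def cron_would_schedule_work(last_job_time, now=None):
        """The cron job only schedules trivial work past the 20 hour threshold."""
        now = now or datetime.datetime.utcnow()
        return now - last_job_time > CRON_THRESHOLD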

MySQL [chromeos_autotest_db]> select * from afe_hosts where status='Pending';
+------+------------------------------+--------+----------+---------+---------+------------+--------------+-----------+-------+--------+----------+-------------+
| id   | hostname                     | locked | synch_id | status  | invalid | protection | locked_by_id | lock_time | dirty | leased | shard_id | lock_reason |
+------+------------------------------+--------+----------+---------+---------+------------+--------------+-----------+-------+--------+----------+-------------+
| 1912 | chromeos4-row5-rack12-host13 |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      231 |             |
| 1914 | chromeos4-row5-rack12-host15 |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      231 |             |
| 1916 | chromeos4-row5-rack12-host17 |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      231 |             |
| 1918 | chromeos4-row5-rack12-host19 |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      231 |             |
| 2347 | chromeos4-row5-rack13-host9  |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      231 |             |
| 2369 | chromeos4-row13-rack1-host3  |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      232 |             |
| 2370 | chromeos4-row13-rack1-host4  |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      232 |             |
| 2374 | chromeos4-row13-rack1-host8  |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      232 |             |
| 2377 | chromeos4-row13-rack1-host11 |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      232 |             |
| 2551 | chromeos4-row10-rack4-host9  |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      231 |             |
| 2675 | chromeos4-row13-rack8-host10 |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      232 |             |
| 2688 | chromeos4-row13-rack9-host7  |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      232 |             |
| 2691 | chromeos4-row13-rack9-host10 |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      232 |             |
| 2751 | chromeos4-row1-rack3-host15  |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      230 |             |
| 2767 | chromeos4-row1-rack5-host15  |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      230 |             |
| 3195 | chromeos4-row8-rack8-host15  |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      222 |             |
| 3609 | chromeos4-row5-rack7-host7   |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      245 |             |
| 4242 | chromeos4-row12-rack8-host9  |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      225 |             |
| 4668 | chromeos4-row8-rack8-host10  |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      222 |             |
| 4819 | chromeos4-row8-rack8-host20  |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      222 |             |
| 5329 | chromeos2-row11-rack1-host2  |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      229 |             |
| 5662 | chromeos2-row5-rack3-host19  |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      225 |             |
| 6069 | chromeos2-row8-rack6-host1   |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      232 |             |
| 6179 | chromeos6-row2-rack14-host18 |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     1 |      0 |      212 |             |
| 6602 | chromeos6-row1-rack17-host13 |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      245 |             |
| 6760 | chromeos6-row2-rack20-host5  |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      241 |             |
| 7036 | chromeos6-row21-rack4-host15 |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     1 |      0 |      223 |             |
| 7797 | chromeos6-row4-rack11-host17 |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     1 |      0 |      226 |             |
| 8340 | chromeos6-row4-rack24-host13 |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     1 |      0 |      240 |             |
+------+------------------------------+--------+----------+---------+---------+------------+--------------+-----------+-------+--------+----------+-------------+
29 rows in set (0.03 sec)

MySQL [chromeos_autotest_db]> select * from afe_host_queue_entries where status='Pending';
Empty set (0.08 sec)

MySQL [chromeos_autotest_db]> select * from afe_shards where id in (select distinct shard_id from afe_hosts where status='Pending');
+-----+---------------------------------------------+
| id  | hostname                                    |
+-----+---------------------------------------------+
| 212 | chromeos-skunk-5.mtv.corp.google.com        |
| 222 | chromeos-server-tester2.mtv.corp.google.com |
| 225 | cros-full-0003.mtv.corp.google.com          |
| 227 | cros-full-0005.mtv.corp.google.com          |
| 228 | cros-full-0006.mtv.corp.google.com          |
| 229 | cros-full-0007.mtv.corp.google.com          |
| 232 | cros-full-0010.mtv.corp.google.com          |
| 236 | cros-full-0014.mtv.corp.google.com          |
| 237 | cros-full-0015.mtv.corp.google.com          |
| 239 | cros-full-0017.mtv.corp.google.com          |
| 240 | cros-full-0018.mtv.corp.google.com          |
| 241 | cros-full-0019.mtv.corp.google.com          |
| 245 | cros-full-0023.mtv.corp.google.com          |
+-----+---------------------------------------------+
13 rows in set (0.01 sec)
I think there is some shard heartbeat lag that is making this more confusing.  There appear to be only a dozen or so DUTs actually stuck in Pending; I cross-checked a few shards to confirm.

MySQL [chromeos_autotest_db]> select * from afe_hosts where status='Pending';
+------+------------------------------+--------+----------+---------+---------+------------+--------------+-----------+-------+--------+----------+-------------+
| id   | hostname                     | locked | synch_id | status  | invalid | protection | locked_by_id | lock_time | dirty | leased | shard_id | lock_reason |
+------+------------------------------+--------+----------+---------+---------+------------+--------------+-----------+-------+--------+----------+-------------+
| 3195 | chromeos4-row8-rack8-host15  |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      222 |             |
| 3609 | chromeos4-row5-rack7-host7   |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      245 |             |
| 4242 | chromeos4-row12-rack8-host9  |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      225 |             |
| 4248 | chromeos4-row12-rack8-host17 |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      225 |             |
| 4694 | chromeos2-row3-rack2-host13  |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      241 |             |
| 4719 | chromeos2-row3-rack5-host13  |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     1 |      0 |      240 |             |
| 4819 | chromeos4-row8-rack8-host20  |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      222 |             |
| 5348 | chromeos2-row11-rack4-host3  |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      199 |             |
| 5779 | chromeos2-row8-rack9-host13  |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      1 |     NULL |             |
| 6069 | chromeos2-row8-rack6-host1   |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      232 |             |
| 6179 | chromeos6-row2-rack14-host18 |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     1 |      0 |      212 |             |
| 6602 | chromeos6-row1-rack17-host13 |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      245 |             |
| 6760 | chromeos6-row2-rack20-host5  |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      241 |             |
| 7876 | chromeos6-row3-rack16-host15 |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     1 |      0 |      241 |             |
| 8204 | chromeos6-row4-rack19-host3  |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     1 |      0 |      240 |             |
| 8322 | chromeos6-row4-rack24-host4  |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     1 |      0 |      240 |             |
| 8340 | chromeos6-row4-rack24-host13 |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     1 |      0 |      240 |             |
+------+------------------------------+--------+----------+---------+---------+------------+--------------+-----------+-------+--------+----------+-------------+
17 rows in set (0.00 sec)
I believe there are only 13 DUTs stuck in Pending (determined by querying multiple times and removing the DUTs that were only temporarily in Pending):

chromeos2-row8-rack6-host1
chromeos2-row8-rack9-host13
chromeos2-row8-rack9-host6
chromeos4-row12-rack10-host7
chromeos4-row12-rack8-host9
chromeos4-row5-rack7-host7
chromeos4-row8-rack8-host15
chromeos4-row8-rack8-host20
chromeos6-row1-rack17-host13
chromeos6-row2-rack14-host18
chromeos6-row2-rack20-host5
chromeos6-row4-rack24-host13

MySQL [chromeos_autotest_db]> select * from afe_hosts where hostname in ('chromeos2-row8-rack6-host1',
    -> 'chromeos2-row8-rack9-host13',
    -> 'chromeos2-row8-rack9-host6',
    -> 'chromeos4-row12-rack10-host7',
    -> 'chromeos4-row12-rack8-host9',
    -> 'chromeos4-row5-rack7-host7',
    -> 'chromeos4-row8-rack8-host15',
    -> 'chromeos4-row8-rack8-host20',
    -> 'chromeos6-row1-rack17-host13',
    -> 'chromeos6-row2-rack14-host18',
    -> 'chromeos6-row2-rack20-host5',
    -> 'chromeos6-row4-rack24-host13'
    -> );
+------+------------------------------+--------+----------+---------+---------+------------+--------------+-----------+-------+--------+----------+-------------+
| id   | hostname                     | locked | synch_id | status  | invalid | protection | locked_by_id | lock_time | dirty | leased | shard_id | lock_reason |
+------+------------------------------+--------+----------+---------+---------+------------+--------------+-----------+-------+--------+----------+-------------+
| 3195 | chromeos4-row8-rack8-host15  |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      222 |             |
| 3609 | chromeos4-row5-rack7-host7   |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      245 |             |
| 4242 | chromeos4-row12-rack8-host9  |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      225 |             |
| 4587 | chromeos4-row12-rack10-host7 |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     1 |      0 |      235 |             |
| 4819 | chromeos4-row8-rack8-host20  |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      222 |             |
| 5777 | chromeos2-row8-rack9-host6   |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      1 |     NULL |             |
| 5779 | chromeos2-row8-rack9-host13  |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      1 |     NULL |             |
| 6069 | chromeos2-row8-rack6-host1   |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      232 |             |
| 6179 | chromeos6-row2-rack14-host18 |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     1 |      0 |      212 |             |
| 6602 | chromeos6-row1-rack17-host13 |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      245 |             |
| 6760 | chromeos6-row2-rack20-host5  |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     0 |      0 |      241 |             |
| 8340 | chromeos6-row4-rack24-host13 |      0 |     NULL | Pending |       0 |          0 |         NULL | NULL      |     1 |      0 |      240 |             |
+------+------------------------------+--------+----------+---------+---------+------------+--------------+-----------+-------+--------+----------+-------------+
12 rows in set (0.01 sec)
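A minimal sketch of the "query several times and intersect" approach used above; the helper is hypothetical, and the actual checks were run by hand in the MySQL client:

    # Hypothetical sketch of the manual process above: take several snapshots of
    # hosts in Pending and keep only those present in every snapshot, which
    # filters out hosts that were only transiently Pending.
    import time

    def stuck_pending_hosts(query_pending, snapshots=3, interval_s=300):
        # query_pending() is assumed to return a collection of hostnames, e.g.
        # from: SELECT hostname FROM afe_hosts WHERE status='Pending';
        stuck = set(query_pending())
        for _ in range(snapshots - 1):
            time.sleep(interval_s)
            stuck &= set(query_pending())
        return stuck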

Re #3 and #11: chromeos4-row5-rack4-host13 shows up in #11 but not in #3.

The last job run by that DUT was in March, on 2018-03-08.  Are you sure that's right?
As for why that DUT doesn't run _tests_: bug 832228.

There are only three DUTs stuck in Pending?

chromeos6-row1-rack17-host13
chromeos6-row2-rack14-host18
chromeos6-row4-rack24-host13

Labels: -Pri-1 Pri-2
Owner: jkop@chromium.org
Actually, all the DUTs stuck in Pending are just waiting for enough DUTs to run a synch_count > 1 job.

That took me way more time to realize than it should have.

I think this is some sort of host scheduler issue.

It's not a serious issue right now; it may become a serious issue later.  Basically, a job that needs two DUTs to run is holding one DUT in Pending while waiting for another, yet the host scheduler is not supplying that extra DUT to the job.
To clarify, the issue isn't necessarily in the host scheduler service itself, but rather in the host scheduling system as a whole.  It could be emergent behavior from how the host scheduler logic works, how multi-DUT jobs work, and some configuration of DUT statuses that causes the system to deadlock or slow down dramatically.
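To make the mechanism concrete, here is a toy model of the behavior being described; these are assumed semantics, not the host scheduler's actual code:

    # Toy model of the behavior described above, not the real host scheduler.
    # A synch_count > 1 job keeps its assigned hosts in Pending until it has
    # enough hosts; if the scheduler never supplies the rest, those hosts stay
    # parked indefinitely.
    def next_status(synch_count, assigned_hosts):
        if len(assigned_hosts) < synch_count:
            return 'Pending'    # keep waiting, holding the hosts we already have
        return 'Starting'       # enough DUTs gathered; Lucifer takes over here

    assert next_status(2, ['host-a']) == 'Pending'             # stuck until a 2nd DUT shows up
    assert next_status(2, ['host-a', 'host-b']) == 'Starting'
    assert next_status(1, ['host-a']) == 'Starting'            # async jobs never wait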
To clarify #24, are the DUTs stuck in Pending getting released once a regular single-DUT job comes along that wants to run against them, or are they being retained for the multi-DUT job?

If the latter, that still sounds like a serious issue because it will eat up a bunch of capacity.
The latter.  A DUT assigned to a multi-DUT job will wait in Pending until the job gets enough DUTs to run.

Based on my manual querying, there are very few DUTs pending for more than a few minutes.

Yes, it has the potential to become a serious issue; however, the situation is far less severe than the dashboard seems to indicate.

Lucifer is also affecting it somehow, since there's a clear correlation in the dashboard, but I cannot think of a clear mechanism for that, as Lucifer does not touch this part of the job lifetime.
Cc: pwang@chromium.org ihf@chromium.org
Are we running enough multi-DUT jobs that we are using that much of the fleet on them? They used to be a rarity.

+ihf +pwang have we recently turned on multi-DUT CTS?
> The latter.  A DUT assigned to a multi-DUT job will wait in Pending until the job gets enough DUTs to run.

Until it gets enough to run, or until it completes its run?

Comment 30 by ihf@chromium.org, Apr 12 2018

For a few weeks now we have been running 2 DUTs for some CTS jobs (N only and "all"). This is in pool:cts only, not pool:suites as listed above.

Comment 31 by ihf@chromium.org, Apr 12 2018

If this causes real problems, we can drop this back to 1 DUT. No big deal for CTS at the moment (but we want to use it for real at some point in P).

Comment 32 by pwang@chromium.org, Apr 12 2018

We experimented with 4 DUTs for 2-3 days a few weeks ago and switched to 2 DUTs afterwards for CTS. And yes, they are scheduled in pool:cts, not pool:suites.
I'll drop SYNC_COUNT to 1 to see if it resolves the issue.
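For context, the synch_count comes from the test's control file; a hypothetical fragment (not the actual cheets_CTS control file) would look roughly like this:

    # Hypothetical control-file fragment, not the actual cheets_CTS control file.
    # SYNC_COUNT > 1 is what makes the scheduler hold hosts in Pending until
    # enough DUTs are gathered; dropping it back to 1 removes the wait.
    SYNC_COUNT = 1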

Comment 33 by ihf@chromium.org, Apr 12 2018

Just one more remark: we want to use this feature for super-long (multi-day) CTS jobs. Using SYNC_COUNT > 1 will allow them to be completed in roughly a day. If infra assumes that jobs take less than a week, those assumptions should probably be dropped.

Comment 34 by ihf@chromium.org, Apr 12 2018

That said for now CTS can be made not to use this feature if needed:
https://chrome-internal-review.googlesource.com/#/c/chromeos/autotest-cheets/+/608169/

Comment 35 by ihf@chromium.org, Apr 12 2018

Labels: M-67 OS-Chrome
ihf, we should chat about CTS on Skylab if you haven't already talked with pprabhu.  I suspect we can come up with a better solution for CTS on Skylab as we begin migrating soon.
Labels: -M-67
Owner: ayatane@chromium.org
Upon further investigation into the misleading metrics, I discovered that I broke the sending of Running metrics, which causes the dashboard to present misleading and alarming graphs.

I believe that there is no new Pending problem.  If there are issues with synch_count jobs, they are long-standing, and I don't think there's a reason to address them now.

(CTS on Skylab discussion is still needed, because synch_count does not fit well into Skylab.)
Labels: Hotlist-CrOS-Sheriffing
Comment 39 by bugdroid1@chromium.org (Project Member), Apr 23 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/88a3160953edc8ef1bcf90d9e93ddc7758fa7121

commit 88a3160953edc8ef1bcf90d9e93ddc7758fa7121
Author: Allen Li <ayatane@chromium.org>
Date: Mon Apr 23 22:00:30 2018

[autotest] Add host status metrics to Lucifer

BUG= chromium:832146 
TEST=None

Change-Id: I91319808a1e3b4a83588ace21721d7bcf7eb6c3c
Reviewed-on: https://chromium-review.googlesource.com/1012763
Commit-Ready: Allen Li <ayatane@chromium.org>
Tested-by: Allen Li <ayatane@chromium.org>
Reviewed-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/88a3160953edc8ef1bcf90d9e93ddc7758fa7121/venv/lucifer/handlers.py

Status: Fixed (was: Assigned)
