New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 748209 link

Starred by 2 users

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug



Sign in to add a comment

Index table for hqes off by a bit

Project Member Reported by pho...@chromium.org, Jul 24 2017

Issue description

Also curious is the host history page:
    https://ubercautotest.corp.google.com/afe/#tab_id=view_host&object_id=2037

Here's what the database says about that host's history:

mysql> select job_id, started_on, finished_on from afe_host_queue_entries where host_id=2037 and started_on >= '2017-06-30 10:00' order by started_on;
+-----------+---------------------+---------------------+
| job_id    | started_on          | finished_on         |
+-----------+---------------------+---------------------+
| 125957179 | 2017-06-30 10:03:03 | 2017-06-30 10:06:35 |
| 125957208 | 2017-06-30 10:07:18 | 2017-06-30 10:09:25 |
| 125957220 | 2017-06-30 10:10:08 | 2017-06-30 10:13:20 |
| 125957223 | 2017-06-30 10:14:08 | 2017-06-30 10:16:59 |
| 125957238 | 2017-06-30 10:17:22 | 2017-06-30 10:19:21 |
| 125957250 | 2017-06-30 10:19:58 | 2017-06-30 10:24:38 |
| 125957257 | 2017-06-30 10:24:56 | 2017-06-30 10:26:59 |
| 125957263 | 2017-06-30 10:27:35 | 2017-06-30 10:31:19 |
| 125957270 | 2017-06-30 10:31:51 | 2017-06-30 10:33:48 |
| 126027349 | 2017-06-30 10:34:43 | 2017-06-30 10:36:55 |
| 125956935 | 2017-06-30 11:16:42 | 2017-06-30 11:21:15 |
| 125957046 | 2017-06-30 11:21:48 | 2017-06-30 11:23:42 |
| 125956880 | 2017-06-30 11:24:13 | 2017-06-30 12:04:53 |
| 125956961 | 2017-06-30 12:05:28 | 2017-06-30 12:08:16 |
| 125956965 | 2017-06-30 12:09:04 | 2017-06-30 12:12:21 |
| 125956990 | 2017-06-30 12:13:08 | 2017-06-30 12:16:40 |
| 125956998 | 2017-06-30 12:17:23 | 2017-06-30 12:20:41 |
| 125957038 | 2017-06-30 12:21:18 | 2017-06-30 12:25:16 |
| 125957086 | 2017-06-30 12:25:56 | 2017-06-30 12:29:54 |
| 126044361 | 2017-06-30 12:46:18 | 2017-06-30 12:58:04 |
| 126044366 | 2017-06-30 12:58:21 | 2017-06-30 13:00:35 |
| 126044368 | 2017-06-30 13:00:54 | 2017-06-30 13:02:34 |
| 126044370 | 2017-06-30 13:03:06 | 2017-06-30 13:05:17 |
| 126044372 | 2017-06-30 13:05:54 | 2017-06-30 13:07:39 |
+-----------+---------------------+---------------------+

There's a steady stream of jobs from 10:03 onward.  But the
AFE host page for both the master and the shard doesn't shows
no jobs after 6:58, until the job at 10:34, then skips jobs
until the job at 12:46.  Basically, history is incomplete.
 

Comment 1 by pho...@chromium.org, Jul 24 2017

Blocking: 738036
Status: WontFix (was: Assigned)
This will be obsolete with the new index we're adding.
Blocking: -738036
Status: Started (was: WontFix)
Summary: Index table for hqes off by a bit (was: DB anomalies in quawks)
Seeing similar issues with dut_status today:


./dut_status.py -f --since '2017-09-06 05:00:00' --until '2017-09-06 06:00:00' chromeos2-row8-rack1-host12

versus

./dut_status.py -f --since '2017-09-06 04:30:00' --until '2017-09-06 06:00:00' chromeos2-row8-rack1-host12

only the second one shows the test 140140416. But it should be shown in both time ranges.
Looking at server/lib/status_history.py, I see this in
`_TestJobEvent.get_hqes()`:
            hqelist = afe.get_host_queue_entries_by_insert_time(
                host_id=host_id,
                insert_time_after=query_start,
                insert_time_before=query_end,
                started_on__gte=query_start,
                started_on__lte=query_end,
                complete=1)

It seems like somehow, the `get_host_queue_entries_by_insert_time()`
RPC must be doing the wrong thing.

... and looking at the logic for `get_host_queue_entries_by_insert_time()`,
(and thinking about the actual data), I think the place to look for
"the wrong thing" would be how the call handles a large number of
entries with the same insert_time, and different started_on values.

Looking at the data, I'm now less convinced that the problem is
a consequence of multiple insertions at about the same time.
The area that seems worth studying is to look for off-by-one
errors at the boundaries.

In any event, whatever the cause, the explanation can be found
looking the data from these queries:

mysql> select * from afe_host_queue_entry_start_times where insert_time >= '2017-09-06 04:20:00' and insert_time <= '2017-09-06 06:10:00' order by highest_hqe_id;
+-------+---------------------+----------------+
| id    | insert_time         | highest_hqe_id |
+-------+---------------------+----------------+
| 13052 | 2017-09-06 04:21:55 |      140496159 |
| 13053 | 2017-09-06 04:31:55 |      140498977 |
| 13054 | 2017-09-06 04:41:55 |      140502151 |
| 13055 | 2017-09-06 04:51:55 |      140504270 |
| 13056 | 2017-09-06 05:01:55 |      140506908 |
| 13057 | 2017-09-06 05:11:55 |      140509064 |
| 13058 | 2017-09-06 05:21:55 |      140511381 |
| 13059 | 2017-09-06 05:31:55 |      140514980 |
| 13060 | 2017-09-06 05:41:55 |      140516626 |
| 13061 | 2017-09-06 05:51:55 |      140518013 |
| 13062 | 2017-09-06 06:01:55 |      140520366 |
+-------+---------------------+----------------+
11 rows in set (0.00 sec)

mysql> select id, started_on, finished_on from afe_host_queue_entries where host_id=6725 and started_on>='2017-09-06 05:00:00' and finished_on<='2017-09-06 06:00:00' order by id;
+-----------+---------------------+---------------------+
| id        | started_on          | finished_on         |
+-----------+---------------------+---------------------+
| 140497214 | 2017-09-06 05:15:50 | 2017-09-06 05:16:22 |
| 140497222 | 2017-09-06 05:16:33 | 2017-09-06 05:17:25 |
| 140497230 | 2017-09-06 05:17:34 | 2017-09-06 05:18:13 |
| 140497244 | 2017-09-06 05:18:24 | 2017-09-06 05:19:23 |
| 140497255 | 2017-09-06 05:19:36 | 2017-09-06 05:20:22 |
| 140508976 | 2017-09-06 05:13:56 | 2017-09-06 05:15:41 |
+-----------+---------------------+---------------------+
6 rows in set (0.05 sec)


Underscoring the point from c#6, I note that this call
fails to find any of HQEs in the range [140497214, 140497255]:
    dut-status -f --since '2017-09-06 04:31:55' --until '2017-09-06 06:00:00' chromeos2-row8-rack1-host12

This call _does_ find the target HQEs:
    dut-status -f --since '2017-09-06 04:31:54' --until '2017-09-06 06:00:00' chromeos2-row8-rack1-host12

Which points pretty strongly to an off-by-one error, and roughly where
it has to be.

> Which points pretty strongly to an off-by-one error, [ ... ]

Of course, it does no such thing...  It points to "the data
violates hidden assumptions in the code's logic".  Or maybe
just "the logic is flat out wrong."  Whatever.

In this case, it looks like the bug happens when the query
range includes the job time, but fails to include the insertion
time.  For instance, in the case above, the problem entries
were inserted between 04:21:55 and 04:31:55, but the first of them
didn't start execution until 05:15:50, or nearly an hour later.

Comment 9 by pho...@chromium.org, Sep 13 2017

Right, the large difference between inserted_on and started_on violates the assumptions of the query (that those are closely related). Maybe the easiest way to fix this would be to loosen that assumption - say, assume that HQEs will be started within a few hours of when they are created (inserted).  Right now, it assumes that time gap is at most 10 minutes, which is clearly wrong in this case.

As a side note, why does it take so long to start these jobs, after all?
Oh.  The problem is at the call site:
            hqelist = afe.get_host_queue_entries_by_insert_time(
                host_id=host_id,
                insert_time_after=query_start,
                insert_time_before=query_end,
                started_on__gte=query_start,
                started_on__lte=query_end,
                complete=1)

The caller plainly says "look for HQEs that weren't inserted
before they started".  That's just plain wrong, since _every_
HQE must be inserted strictly before it's started.  It only
works because the queries get up to 10 minutes or so of grace.

Really, this is a design bug in the API:  This filtering based
on insert time should be an optimization done on the server side,
not on the client side.

I think the best path forward here is to just remove the extra index table & RPC, because we now have a started_on and finished_on index for HQEs.
Project Member

Comment 12 by bugdroid1@chromium.org, Oct 26 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/e1207af9ea77dfe241c14639a773449f6d07ecf5

commit e1207af9ea77dfe241c14639a773449f6d07ecf5
Author: Paul Hobbs <phobbs@google.com>
Date: Thu Oct 26 22:57:49 2017

[autotest] Fix constraints in hqe_by_insert_time

get_host_queue_entries_by_insert_time was showing some strange behavior because
it was making a wrong assumptions about the gap between insert-time and
start-time being small.

- Renamed to "get_host_queue_entries_by_start_time"
- Changed the API to be based on started_on constraints.
- Changed the function to infer only correct insert-time constraints. This
  means we cannot assume any lower bound on the insert-time, because the gap
  between insert and start times may be arbitrarily large.
- Added more unit test coverage.

BUG=chromium:748209
TEST=unit tests pass.

Change-Id: I60133e7a5f6431f35349c17f0644bc04e04cfcbd
Reviewed-on: https://chromium-review.googlesource.com/685735
Commit-Ready: Paul Hobbs <phobbs@google.com>
Tested-by: Paul Hobbs <phobbs@google.com>
Reviewed-by: Paul Hobbs <phobbs@google.com>

[modify] https://crrev.com/e1207af9ea77dfe241c14639a773449f6d07ecf5/client/common_lib/time_utils.py
[modify] https://crrev.com/e1207af9ea77dfe241c14639a773449f6d07ecf5/frontend/afe/rpc_interface_unittest.py
[modify] https://crrev.com/e1207af9ea77dfe241c14639a773449f6d07ecf5/server/lib/status_history.py
[modify] https://crrev.com/e1207af9ea77dfe241c14639a773449f6d07ecf5/frontend/afe/rpc_interface.py

Project Member

Comment 13 by bugdroid1@chromium.org, Nov 6 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/74a56111af9eccd95ab8dafd1d114a6c04e247ec

commit 74a56111af9eccd95ab8dafd1d114a6c04e247ec
Author: Paul Hobbs <phobbs@google.com>
Date: Mon Nov 06 19:31:42 2017

Revert "[autotest] Fix constraints in hqe_by_insert_time"

This reverts commit e1207af9ea77dfe241c14639a773449f6d07ecf5.

Reason for revert: Speculative fix for crbug.com/781302

Original change's description:
> [autotest] Fix constraints in hqe_by_insert_time
>
> get_host_queue_entries_by_insert_time was showing some strange behavior because
> it was making a wrong assumptions about the gap between insert-time and
> start-time being small.
>
> - Renamed to "get_host_queue_entries_by_start_time"
> - Changed the API to be based on started_on constraints.
> - Changed the function to infer only correct insert-time constraints. This
>   means we cannot assume any lower bound on the insert-time, because the gap
>   between insert and start times may be arbitrarily large.
> - Added more unit test coverage.
>
> BUG=chromium:748209
> TEST=unit tests pass.
>
> Change-Id: I60133e7a5f6431f35349c17f0644bc04e04cfcbd
> Reviewed-on: https://chromium-review.googlesource.com/685735
> Commit-Ready: Paul Hobbs <phobbs@google.com>
> Tested-by: Paul Hobbs <phobbs@google.com>
> Reviewed-by: Paul Hobbs <phobbs@google.com>

Bug: chromium:748209
Change-Id: I6a4e3c030527aafdd43f1218f514fbd3bbe6bbc6
Reviewed-on: https://chromium-review.googlesource.com/753998
Commit-Ready: Paul Hobbs <phobbs@google.com>
Tested-by: Paul Hobbs <phobbs@google.com>
Reviewed-by: Allen Li <ayatane@chromium.org>

[modify] https://crrev.com/74a56111af9eccd95ab8dafd1d114a6c04e247ec/client/common_lib/time_utils.py
[modify] https://crrev.com/74a56111af9eccd95ab8dafd1d114a6c04e247ec/frontend/afe/rpc_interface_unittest.py
[modify] https://crrev.com/74a56111af9eccd95ab8dafd1d114a6c04e247ec/server/lib/status_history.py
[modify] https://crrev.com/74a56111af9eccd95ab8dafd1d114a6c04e247ec/frontend/afe/rpc_interface.py

Cc: pho...@chromium.org akes...@chromium.org xixuan@chromium.org ayatane@chromium.org
 Issue 796628  has been merged into this issue.
What's the status of this?  Are there any imminent plans to have this fixed?
Cc: davidri...@chromium.org
Cc: jkop@chromium.org
Owner: akes...@chromium.org
Status: Untriaged (was: Started)
Chatted with Aviv about this - the way forward for this bug is probably to re-introduce the CL in #12, but with additional heuristic constraint on the maximum time between job creation and job execution - say, 48 hours. This should reduce the search space enough to still be cheap, but still be accurate most of the time.

We discussed offloading the query to the readonly database, but I don't think that works - the query is going to shard databases, not the master database.

Assigning to akeshet@ for prioritization.
> but with additional heuristic constraint on the maximum time between job creation and job execution

... But really, I think all of this can be solved simply by creating
indexes on the "started_on" and "finished_on" columns.  Last I heard,
we suspect that that change isn't impossibly expensive, in which case,
that's the way to go.

> [ ... ] the query is going to shard databases, not the master database

This is the query for the HQE data, not the special task data.
That should be going to the master database.  I also expect that this
is slow precisely because it's going to a single database server, making
a query against one of the largest tables, without an index to make it
faster.

Labels: -Pri-2 Pri-3
Status: avail (was: Untriaged)
Owner: ----
Status: Available (was: Avail)
Not currently a high priority.

Sign in to add a comment