Index table for HQEs off by a bit
Issue description
Also curious is the host history page:
https://ubercautotest.corp.google.com/afe/#tab_id=view_host&object_id=2037
Here's what the database says about that host's history:
mysql> select job_id, started_on, finished_on from afe_host_queue_entries where host_id=2037 and started_on >= '2017-06-30 10:00' order by started_on;
+-----------+---------------------+---------------------+
| job_id | started_on | finished_on |
+-----------+---------------------+---------------------+
| 125957179 | 2017-06-30 10:03:03 | 2017-06-30 10:06:35 |
| 125957208 | 2017-06-30 10:07:18 | 2017-06-30 10:09:25 |
| 125957220 | 2017-06-30 10:10:08 | 2017-06-30 10:13:20 |
| 125957223 | 2017-06-30 10:14:08 | 2017-06-30 10:16:59 |
| 125957238 | 2017-06-30 10:17:22 | 2017-06-30 10:19:21 |
| 125957250 | 2017-06-30 10:19:58 | 2017-06-30 10:24:38 |
| 125957257 | 2017-06-30 10:24:56 | 2017-06-30 10:26:59 |
| 125957263 | 2017-06-30 10:27:35 | 2017-06-30 10:31:19 |
| 125957270 | 2017-06-30 10:31:51 | 2017-06-30 10:33:48 |
| 126027349 | 2017-06-30 10:34:43 | 2017-06-30 10:36:55 |
| 125956935 | 2017-06-30 11:16:42 | 2017-06-30 11:21:15 |
| 125957046 | 2017-06-30 11:21:48 | 2017-06-30 11:23:42 |
| 125956880 | 2017-06-30 11:24:13 | 2017-06-30 12:04:53 |
| 125956961 | 2017-06-30 12:05:28 | 2017-06-30 12:08:16 |
| 125956965 | 2017-06-30 12:09:04 | 2017-06-30 12:12:21 |
| 125956990 | 2017-06-30 12:13:08 | 2017-06-30 12:16:40 |
| 125956998 | 2017-06-30 12:17:23 | 2017-06-30 12:20:41 |
| 125957038 | 2017-06-30 12:21:18 | 2017-06-30 12:25:16 |
| 125957086 | 2017-06-30 12:25:56 | 2017-06-30 12:29:54 |
| 126044361 | 2017-06-30 12:46:18 | 2017-06-30 12:58:04 |
| 126044366 | 2017-06-30 12:58:21 | 2017-06-30 13:00:35 |
| 126044368 | 2017-06-30 13:00:54 | 2017-06-30 13:02:34 |
| 126044370 | 2017-06-30 13:03:06 | 2017-06-30 13:05:17 |
| 126044372 | 2017-06-30 13:05:54 | 2017-06-30 13:07:39 |
+-----------+---------------------+---------------------+
There's a steady stream of jobs from 10:03 onward. But the
AFE host page, on both the master and the shard, shows no jobs
after 6:58 until the job at 10:34, and then skips everything
until the job at 12:46. Basically, the history is incomplete.
Sep 2 2017
This will be obsolete with the new index we're adding.
Sep 7 2017
Seeing similar issues with dut_status today:
./dut_status.py -f --since '2017-09-06 05:00:00' --until '2017-09-06 06:00:00' chromeos2-row8-rack1-host12
versus
./dut_status.py -f --since '2017-09-06 04:30:00' --until '2017-09-06 06:00:00' chromeos2-row8-rack1-host12
Only the second one shows the test 140140416, but it should be shown in both time ranges.
Sep 13 2017
Looking at server/lib/status_history.py, I see this in
`_TestJobEvent.get_hqes()`:
hqelist = afe.get_host_queue_entries_by_insert_time(
host_id=host_id,
insert_time_after=query_start,
insert_time_before=query_end,
started_on__gte=query_start,
started_on__lte=query_end,
complete=1)
It seems that, somehow, the `get_host_queue_entries_by_insert_time()`
RPC must be doing the wrong thing.
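To make the suspected failure mode concrete, here is a minimal sketch with hypothetical data and logic (not the actual RPC implementation): if the server first narrows candidates by insert_time using the same window the caller passes for started_on, then an HQE created before the window but started inside it never even reaches the started_on filter.

from datetime import datetime

# Hypothetical stand-ins for two afe_host_queue_entries rows:
# (id, insert_time, started_on)
HQES = [
    # Created well before the query window, but started inside it.
    (1, datetime(2017, 9, 6, 4, 25), datetime(2017, 9, 6, 5, 15)),
    # Created and started inside the query window.
    (2, datetime(2017, 9, 6, 5, 10), datetime(2017, 9, 6, 5, 14)),
]

def get_hqes_suspected_behavior(query_start, query_end):
    # Stage 1: prefilter candidates by insert_time using the caller's window.
    candidates = [h for h in HQES if query_start <= h[1] <= query_end]
    # Stage 2: apply the started_on constraints to the survivors.
    return [h for h in candidates if query_start <= h[2] <= query_end]

# Only entry 2 comes back; entry 1 started inside the window but was
# inserted before it, so the insert_time prefilter silently drops it.
print(get_hqes_suspected_behavior(datetime(2017, 9, 6, 5, 0),
                                  datetime(2017, 9, 6, 6, 0)))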
Sep 13 2017
... and looking at the logic for `get_host_queue_entries_by_insert_time()`, (and thinking about the actual data), I think the place to look for "the wrong thing" would be how the call handles a large number of entries with the same insert_time, and different started_on values.
Sep 13 2017
Looking at the data, I'm now less convinced that the problem is a consequence of multiple insertions at about the same time. The area that seems worth studying is off-by-one errors at the boundaries. In any event, whatever the cause, the explanation can be found by looking at the data from these queries:

mysql> select * from afe_host_queue_entry_start_times where insert_time >= '2017-09-06 04:20:00' and insert_time <= '2017-09-06 06:10:00' order by highest_hqe_id;
+-------+---------------------+----------------+
| id    | insert_time         | highest_hqe_id |
+-------+---------------------+----------------+
| 13052 | 2017-09-06 04:21:55 |      140496159 |
| 13053 | 2017-09-06 04:31:55 |      140498977 |
| 13054 | 2017-09-06 04:41:55 |      140502151 |
| 13055 | 2017-09-06 04:51:55 |      140504270 |
| 13056 | 2017-09-06 05:01:55 |      140506908 |
| 13057 | 2017-09-06 05:11:55 |      140509064 |
| 13058 | 2017-09-06 05:21:55 |      140511381 |
| 13059 | 2017-09-06 05:31:55 |      140514980 |
| 13060 | 2017-09-06 05:41:55 |      140516626 |
| 13061 | 2017-09-06 05:51:55 |      140518013 |
| 13062 | 2017-09-06 06:01:55 |      140520366 |
+-------+---------------------+----------------+
11 rows in set (0.00 sec)

mysql> select id, started_on, finished_on from afe_host_queue_entries where host_id=6725 and started_on>='2017-09-06 05:00:00' and finished_on<='2017-09-06 06:00:00' order by id;
+-----------+---------------------+---------------------+
| id        | started_on          | finished_on         |
+-----------+---------------------+---------------------+
| 140497214 | 2017-09-06 05:15:50 | 2017-09-06 05:16:22 |
| 140497222 | 2017-09-06 05:16:33 | 2017-09-06 05:17:25 |
| 140497230 | 2017-09-06 05:17:34 | 2017-09-06 05:18:13 |
| 140497244 | 2017-09-06 05:18:24 | 2017-09-06 05:19:23 |
| 140497255 | 2017-09-06 05:19:36 | 2017-09-06 05:20:22 |
| 140508976 | 2017-09-06 05:13:56 | 2017-09-06 05:15:41 |
+-----------+---------------------+---------------------+
6 rows in set (0.05 sec)
Sep 13 2017
Underscoring the point from c#6, I note that this call
fails to find any of the HQEs in the range [140497214, 140497255]:
dut-status -f --since '2017-09-06 04:31:55' --until '2017-09-06 06:00:00' chromeos2-row8-rack1-host12
This call _does_ find the target HQEs:
dut-status -f --since '2017-09-06 04:31:54' --until '2017-09-06 06:00:00' chromeos2-row8-rack1-host12
Which points pretty strongly to an off-by-one error, and roughly where
it has to be.
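To illustrate why that one-second difference in --since matters so much, here is a sketch of the mechanism I suspect (my guess, not the actual RPC code): the afe_host_queue_entry_start_times rows from c#6 translate the insert-time window into an HQE id range, and the row chosen at each boundary decides which entries survive.

START_TIMES = [
    # (insert_time, highest_hqe_id) -- a subset of the rows shown in c#6.
    ("2017-09-06 04:21:55", 140496159),
    ("2017-09-06 04:31:55", 140498977),
    ("2017-09-06 05:01:55", 140506908),
    ("2017-09-06 06:01:55", 140520366),
]

def hqe_id_bounds(insert_time_after, insert_time_before):
    # Take the highest bucket at or before the lower bound and the lowest
    # bucket at or after the upper bound.  These boundary choices are
    # exactly where an off-by-one (or off-by-one-bucket) error would bite.
    lower = [hqe_id for t, hqe_id in START_TIMES if t <= insert_time_after]
    upper = [hqe_id for t, hqe_id in START_TIMES if t >= insert_time_before]
    return (max(lower) if lower else 0,
            min(upper) if upper else None)

# --since '2017-09-06 04:31:55' maps to a lower id bound of 140498977, which
# already excludes HQEs 140497214..140497255; with 04:31:54 the lower bound
# falls back to 140496159 and they are included again.
print(hqe_id_bounds("2017-09-06 04:31:55", "2017-09-06 06:00:00"))
print(hqe_id_bounds("2017-09-06 04:31:54", "2017-09-06 06:00:00"))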
Sep 13 2017
> Which points pretty strongly to an off-by-one error, [ ... ]

Of course, it does no such thing... It points to "the data violates hidden assumptions in the code's logic". Or maybe just "the logic is flat out wrong". Whatever.

In this case, it looks like the bug happens when the query range includes the job time, but fails to include the insertion time. For instance, in the case above, the problem entries were inserted between 04:21:55 and 04:31:55, but the first of them didn't start execution until 05:15:50, or nearly an hour later.
Sep 13 2017
Right, the large gap between insert time and started_on violates the query's assumption that those two are closely related. Maybe the easiest way to fix this would be to loosen that assumption - say, assume that HQEs will be started within a few hours of when they are created (inserted). Right now, the code assumes that the gap is at most 10 minutes, which is clearly wrong in this case. As a side note, why does it take so long to start these jobs in the first place?
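A minimal sketch of that loosened assumption on the caller side (the constant and helper are hypothetical, not existing code): widen only the lower insert-time bound, since an HQE that started inside the window may have been inserted long before it, but never after it.

from datetime import timedelta

# Hypothetical knob: the widest creation-to-start gap we are willing to
# assume.  The current query effectively assumes about 10 minutes.
MAX_CREATE_TO_START_GAP = timedelta(hours=4)

def insert_time_window(query_start, query_end):
    # An HQE that started inside [query_start, query_end] can have been
    # inserted as early as query_start - MAX_CREATE_TO_START_GAP, but it
    # cannot have been inserted after query_end.
    return query_start - MAX_CREATE_TO_START_GAP, query_end

The widened pair would then be passed as insert_time_after/insert_time_before while the started_on constraints stay exactly as they are.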
Sep 13 2017
Oh. The problem is at the call site:
hqelist = afe.get_host_queue_entries_by_insert_time(
host_id=host_id,
insert_time_after=query_start,
insert_time_before=query_end,
started_on__gte=query_start,
started_on__lte=query_end,
complete=1)
The caller plainly says "look for HQEs that weren't inserted
before they started". That's just plain wrong, since _every_
HQE must be inserted strictly before it's started. It only
works because the queries get up to 10 minutes or so of grace.
Really, this is a design bug in the API: This filtering based
on insert time should be an optimization done on the server side,
not on the client side.
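For comparison, here is a sketch of what the call site would look like if it asserted only the constraint that always holds (insert time precedes start time). This assumes the insert-time arguments are optional, which I haven't verified:

hqelist = afe.get_host_queue_entries_by_insert_time(
        host_id=host_id,
        # The only insert-time bound that is always valid: an entry cannot
        # be inserted after the end of the started_on window.
        insert_time_before=query_end,
        started_on__gte=query_start,
        started_on__lte=query_end,
        complete=1)

Any lower bound on insert time is purely a performance heuristic, and belongs on the server side where it can be chosen (and fixed) in one place.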
Sep 13 2017
I think the best path forward here is to just remove the extra index table & RPC, because we now have a started_on and finished_on index for HQEs.
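If the new indexes really do make the direct query cheap, the caller could presumably drop the insert-time plumbing altogether and filter on started_on alone, along these lines (a sketch using the generic get_host_queue_entries RPC with Django-style filters, untested):

hqelist = afe.get_host_queue_entries(
        host_id=host_id,
        started_on__gte=query_start,
        started_on__lte=query_end,
        complete=1)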
Oct 26 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/e1207af9ea77dfe241c14639a773449f6d07ecf5

commit e1207af9ea77dfe241c14639a773449f6d07ecf5
Author: Paul Hobbs <phobbs@google.com>
Date: Thu Oct 26 22:57:49 2017

[autotest] Fix constraints in hqe_by_insert_time

get_host_queue_entries_by_insert_time was showing some strange behavior because it was making a wrong assumptions about the gap between insert-time and start-time being small.

- Renamed to "get_host_queue_entries_by_start_time"
- Changed the API to be based on started_on constraints.
- Changed the function to infer only correct insert-time constraints. This means we cannot assume any lower bound on the insert-time, because the gap between insert and start times may be arbitrarily large.
- Added more unit test coverage.

BUG=chromium:748209
TEST=unit tests pass.

Change-Id: I60133e7a5f6431f35349c17f0644bc04e04cfcbd
Reviewed-on: https://chromium-review.googlesource.com/685735
Commit-Ready: Paul Hobbs <phobbs@google.com>
Tested-by: Paul Hobbs <phobbs@google.com>
Reviewed-by: Paul Hobbs <phobbs@google.com>

[modify] https://crrev.com/e1207af9ea77dfe241c14639a773449f6d07ecf5/client/common_lib/time_utils.py
[modify] https://crrev.com/e1207af9ea77dfe241c14639a773449f6d07ecf5/frontend/afe/rpc_interface_unittest.py
[modify] https://crrev.com/e1207af9ea77dfe241c14639a773449f6d07ecf5/server/lib/status_history.py
[modify] https://crrev.com/e1207af9ea77dfe241c14639a773449f6d07ecf5/frontend/afe/rpc_interface.py
Nov 6 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/74a56111af9eccd95ab8dafd1d114a6c04e247ec

commit 74a56111af9eccd95ab8dafd1d114a6c04e247ec
Author: Paul Hobbs <phobbs@google.com>
Date: Mon Nov 06 19:31:42 2017

Revert "[autotest] Fix constraints in hqe_by_insert_time"

This reverts commit e1207af9ea77dfe241c14639a773449f6d07ecf5.

Reason for revert: Speculative fix for crbug.com/781302

Original change's description:
> [autotest] Fix constraints in hqe_by_insert_time
>
> get_host_queue_entries_by_insert_time was showing some strange behavior because
> it was making a wrong assumptions about the gap between insert-time and
> start-time being small.
>
> - Renamed to "get_host_queue_entries_by_start_time"
> - Changed the API to be based on started_on constraints.
> - Changed the function to infer only correct insert-time constraints. This
>   means we cannot assume any lower bound on the insert-time, because the gap
>   between insert and start times may be arbitrarily large.
> - Added more unit test coverage.
>
> BUG=chromium:748209
> TEST=unit tests pass.
>
> Change-Id: I60133e7a5f6431f35349c17f0644bc04e04cfcbd
> Reviewed-on: https://chromium-review.googlesource.com/685735
> Commit-Ready: Paul Hobbs <phobbs@google.com>
> Tested-by: Paul Hobbs <phobbs@google.com>
> Reviewed-by: Paul Hobbs <phobbs@google.com>

Bug: chromium:748209
Change-Id: I6a4e3c030527aafdd43f1218f514fbd3bbe6bbc6
Reviewed-on: https://chromium-review.googlesource.com/753998
Commit-Ready: Paul Hobbs <phobbs@google.com>
Tested-by: Paul Hobbs <phobbs@google.com>
Reviewed-by: Allen Li <ayatane@chromium.org>

[modify] https://crrev.com/74a56111af9eccd95ab8dafd1d114a6c04e247ec/client/common_lib/time_utils.py
[modify] https://crrev.com/74a56111af9eccd95ab8dafd1d114a6c04e247ec/frontend/afe/rpc_interface_unittest.py
[modify] https://crrev.com/74a56111af9eccd95ab8dafd1d114a6c04e247ec/server/lib/status_history.py
[modify] https://crrev.com/74a56111af9eccd95ab8dafd1d114a6c04e247ec/frontend/afe/rpc_interface.py
Dec 20 2017
Issue 796628 has been merged into this issue.
Dec 20 2017
What's the status of this? Are there any imminent plans to have this fixed?
Apr 20 2018
Chatted with Aviv about this - the way forward for this bug is probably to re-introduce the CL in #12, but with an additional heuristic constraint on the maximum time between job creation and job execution - say, 48 hours. This should reduce the search space enough to keep the query cheap while still being accurate most of the time. We discussed offloading the query to the readonly database, but I don't think that works - the query goes to the shard databases, not the master database. Assigning to akeshet@ for prioritization.
Apr 20 2018
> but with additional heuristic constraint on the maximum time between job creation and job execution ...

But really, I think all of this can be solved simply by creating indexes on the "started_on" and "finished_on" columns. Last I heard, we suspect that that change isn't impossibly expensive, in which case, that's the way to go.

> [ ... ] the query is going to shard databases, not the master database

This is the query for the HQE data, not the special task data. That should be going to the master database. I also expect that this is slow precisely because it's going to a single database server, making a query against one of the largest tables, without an index to make it faster.
Apr 24 2018
Not currently a high priority.