New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 737674 link

Starred by 5 users

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug
okr



Sign in to add a comment

Limit Lifespan of PreCQ test configs in CL Actions Table

Project Member Reported by dgarr...@chromium.org, Jun 28 2017

Issue description

Currently, we store the list of PreCQ builds to test with in CIDB. When we update the list of builds to use, those pre-computed values continue to be used until CLs are rebased.

We should have a tool to manually clear them for cases where a PreCQ config becomes bad. Happened recently with (x86-alex, betty).

It would also be nice if the stored values were invalidated after some period of time, automatically.
 
Can someone point me to where the configs to clear are?
Talked with Aviv, and we have this approach...

Currently, we add a 'reset' action to CL actions after a CL has been approved for more than a week.

Part A)

When retesting a CL after it's 'reset' we will recompute the configs to test with.

Part B)

Create a tool that can insert RESETs for existing CLs that haven't yet passed the PreCQ.


A and B together mean that CLs are retested with current concepts of what to test, and we have a way to respond to bad CL PreCQ configs in a hurry.
If I'm reading the code correctly, A is already true, EXCEPT for the PreCQLauncher run which expires a given CL.

In clactions:

GetPreCQProgressMap uses GetCLPreCQProgress uses FilterPreResetActions when creating "progress_map" used by GetPreCQConfigsToTest, which is where precomputed build configs come from.

So.... A is not true for CLs whose PreCQ verified values expire, only because PreCQLauncherStage loads all CIDB data before adding RESET actions mid-processing.

At least, that's how I think it works.

If that understanding is correct, adding a RESET action for all CLs with CL_STATUS_INFLICT or CL_STATUS_FAILED gives us a way to reset the cached PreCQ test configs, and rerun all CLs that were mistakenly failed.

It will also rerun all CLs that legitimately failed, perhaps for a very long way back, causing a brief but very heavy spike in PreCQ traffic.
Cc: pprabhu@chromium.org
Here is a CIDB query that shows the history for a given CL:

mysql> select action, reason, timestamp from clActionTable where change_number = 545101 and patch_number = 8;
+---------------------------+-------------------------------+---------------------+
| action                    | reason                        | timestamp           |
+---------------------------+-------------------------------+---------------------+
| speculative               | NULL                          | 2017-06-27 20:05:52 |
| validation_pending_pre_cq | daisy_spring-no-vmtest-pre-cq | 2017-06-27 20:05:56 |
| validation_pending_pre_cq | betty-pre-cq                  | 2017-06-27 20:05:56 |
| validation_pending_pre_cq | binhost-pre-cq                | 2017-06-27 20:05:56 |
| validation_pending_pre_cq | nyan_blaze-no-vmtest-pre-cq   | 2017-06-27 20:05:56 |
| validation_pending_pre_cq | lumpy-no-vmtest-pre-cq        | 2017-06-27 20:05:56 |
| validation_pending_pre_cq | whirlwind-no-vmtest-pre-cq    | 2017-06-27 20:05:56 |
| validation_pending_pre_cq | kevin-no-vmtest-pre-cq        | 2017-06-27 20:05:56 |
| validation_pending_pre_cq | cyan-no-vmtest-pre-cq         | 2017-06-27 20:05:56 |
| validation_pending_pre_cq | samus-no-vmtest-pre-cq        | 2017-06-27 20:05:56 |
| validation_pending_pre_cq | reef-no-vmtest-pre-cq         | 2017-06-27 20:05:56 |
| screened_for_pre_cq       | NULL                          | 2017-06-27 20:05:56 |
| trybot_launching          | daisy_spring-no-vmtest-pre-cq | 2017-06-27 20:09:10 |
| trybot_launching          | betty-pre-cq                  | 2017-06-27 20:09:10 |
| trybot_launching          | binhost-pre-cq                | 2017-06-27 20:09:10 |
| trybot_launching          | nyan_blaze-no-vmtest-pre-cq   | 2017-06-27 20:09:10 |
| trybot_launching          | lumpy-no-vmtest-pre-cq        | 2017-06-27 20:09:10 |
| trybot_launching          | reef-no-vmtest-pre-cq         | 2017-06-27 20:09:10 |
| trybot_launching          | whirlwind-no-vmtest-pre-cq    | 2017-06-27 20:09:10 |
| trybot_launching          | kevin-no-vmtest-pre-cq        | 2017-06-27 20:09:10 |
| trybot_launching          | cyan-no-vmtest-pre-cq         | 2017-06-27 20:09:10 |
| trybot_launching          | samus-no-vmtest-pre-cq        | 2017-06-27 20:09:10 |
| trybot_launching          | cyan-pre-cq                   | 2017-06-27 20:09:10 |
| trybot_launching          | caroline-no-vmtest-pre-cq     | 2017-06-27 20:09:10 |
| picked_up                 | NULL                          | 2017-06-27 20:12:50 |
| picked_up                 | NULL                          | 2017-06-27 20:14:20 |
| picked_up                 | NULL                          | 2017-06-27 20:14:22 |
| picked_up                 | NULL                          | 2017-06-27 20:14:28 |
| picked_up                 | NULL                          | 2017-06-27 20:14:31 |
| picked_up                 | NULL                          | 2017-06-27 20:14:32 |
| picked_up                 | NULL                          | 2017-06-27 20:14:33 |
| picked_up                 | NULL                          | 2017-06-27 20:14:55 |
| picked_up                 | NULL                          | 2017-06-27 20:15:01 |
| picked_up                 | NULL                          | 2017-06-27 20:15:11 |
| picked_up                 | NULL                          | 2017-06-27 20:15:17 |
| picked_up                 | NULL                          | 2017-06-27 20:15:20 |
| pre_cq_inflight           | NULL                          | 2017-06-27 20:17:49 |
| verified                  | NULL                          | 2017-06-27 20:39:39 |
| verified                  | NULL                          | 2017-06-27 20:56:51 |
| verified                  | NULL                          | 2017-06-27 21:03:15 |
| verified                  | NULL                          | 2017-06-27 21:12:27 |
| verified                  | NULL                          | 2017-06-27 21:13:50 |
| requeued                  | NULL                          | 2017-06-27 21:15:01 |
| verified                  | NULL                          | 2017-06-27 21:15:21 |
| verified                  | NULL                          | 2017-06-27 21:20:57 |
| verified                  | NULL                          | 2017-06-27 21:25:06 |
| verified                  | NULL                          | 2017-06-27 22:05:56 |
| verified                  | NULL                          | 2017-06-27 22:07:38 |
| verified                  | NULL                          | 2017-06-27 22:07:39 |
| kicked_out                | unknown                       | 2017-06-27 22:37:53 |
| pre_cq_failed             | NULL                          | 2017-06-27 22:37:53 |
| speculative               | NULL                          | 2017-06-27 22:38:36 |
| requeued                  | NULL                          | 2017-06-27 22:58:19 |
| trybot_launching          | betty-pre-cq                  | 2017-06-27 22:59:09 |
| picked_up                 | NULL                          | 2017-06-27 23:03:23 |
| pre_cq_inflight           | NULL                          | 2017-06-27 23:05:38 |
| kicked_out                | unknown                       | 2017-06-28 01:15:43 |
| pre_cq_failed             | NULL                          | 2017-06-28 01:15:43 |
| speculative               | NULL                          | 2017-06-28 01:18:20 |
+---------------------------+-------------------------------+---------------------+
59 rows in set (0.12 sec)


The list of configs to test is defined by the "validation_pending_pre_cq" entries created before "screened_for_pre_cq". Given that this table is treated as "write only", the only way to reset the values to append a "reset" action (everything before the reset is ignored).

Issuing RESETs as I suggested in #3 will rerun all CLs currently in a PreCQ failed state, and screw up CLs currently mid-processing.

A different option is too occasionally recompute the "screened" configs for CLs in failed state to see if the configs set has changed, and to issue a reset, if it has.


Comment 5 by nxia@chromium.org, Jun 29 2017

reset operation will clean up the already passed pre-cq and thus increase the load on the tryserver loads. 

The pre-cq-launcher can instead compute the current pre-cq target set of a CL every time, when all the pre-cqs in the target set have already passed, it should mark the CL as pre_cq_all_verified and pre_cq_passed. So it doesn't need to insert reset actions or re-run the already passed pre-cq. 

Comment 6 by nxia@chromium.org, Jun 29 2017

if I remember correctly, as long as a CL is marked as 'passed', pre-cq-launcher won't trigger the pre-cqs again unless it's reset after it's stale after a couple days.
Only recomputing the builders to run on failure will be lower cost, if recomputing turns out to be very expensive.
Cc: dgarr...@chromium.org
 Issue 737793  has been merged into this issue.
Deduped bug was asking us to do #5 essentially.

Comment 10 by nxia@chromium.org, Jun 29 2017

if "recomputing" in #7 means computing the pre-cq configs to test for a CL, it's not expensive, it's getting special pre-cq configs from Commit-Queue.ini config file and merging them with default pre-cq test configs if needed.
Owner: ----
Summary: Limit Lifespan of PreCQ test configs in CL Actions Table (was: Create tool to clear cached PreCQ test configs.)
This would still be a very good tool to have, but I'm not going to work on it any time soon. Throwing back in the pool to find an owner.
Labels: -Pri-3 OKR Pri-2
Owner: nxia@chromium.org
I think this is a good okr sideproject candidate, towards the goal of reducing precq time. Needs a bit of thought.

Comment 13 by nxia@chromium.org, Jan 9 2018

Cc: ejcaruso@chromium.org jclinton@chromium.org
 Issue 795863  has been merged into this issue.

Comment 14 by nxia@chromium.org, Mar 19 2018

 Issue 710941  has been merged into this issue.
I noticed the other day that the PreCQ annotates the CL with a specific message if it tries to launch a config and the config doesn't exist.

What happens if we write a 'reset' action to the CL Actions table if that happens?
Components: -Infra>Client>ChromeOS Infra>Client>ChromeOS>CI

Comment 17 by nxia@chromium.org, May 18 2018

Owner: ----
Status: Available (was: Untriaged)

Comment 18 by nxia@chromium.org, Jun 8 2018

Cc: -nxia@chromium.org

Sign in to add a comment