Limit Lifespan of PreCQ test configs in CL Actions Table |
|||||||
Issue descriptionCurrently, we store the list of PreCQ builds to test with in CIDB. When we update the list of builds to use, those pre-computed values continue to be used until CLs are rebased. We should have a tool to manually clear them for cases where a PreCQ config becomes bad. Happened recently with (x86-alex, betty). It would also be nice if the stored values were invalidated after some period of time, automatically.
,
Jun 28 2017
Talked with Aviv, and we have this approach... Currently, we add a 'reset' action to CL actions after a CL has been approved for more than a week. Part A) When retesting a CL after it's 'reset' we will recompute the configs to test with. Part B) Create a tool that can insert RESETs for existing CLs that haven't yet passed the PreCQ. A and B together mean that CLs are retested with current concepts of what to test, and we have a way to respond to bad CL PreCQ configs in a hurry.
,
Jun 28 2017
If I'm reading the code correctly, A is already true, EXCEPT for the PreCQLauncher run which expires a given CL. In clactions: GetPreCQProgressMap uses GetCLPreCQProgress uses FilterPreResetActions when creating "progress_map" used by GetPreCQConfigsToTest, which is where precomputed build configs come from. So.... A is not true for CLs whose PreCQ verified values expire, only because PreCQLauncherStage loads all CIDB data before adding RESET actions mid-processing. At least, that's how I think it works. If that understanding is correct, adding a RESET action for all CLs with CL_STATUS_INFLICT or CL_STATUS_FAILED gives us a way to reset the cached PreCQ test configs, and rerun all CLs that were mistakenly failed. It will also rerun all CLs that legitimately failed, perhaps for a very long way back, causing a brief but very heavy spike in PreCQ traffic.
,
Jun 29 2017
Here is a CIDB query that shows the history for a given CL: mysql> select action, reason, timestamp from clActionTable where change_number = 545101 and patch_number = 8; +---------------------------+-------------------------------+---------------------+ | action | reason | timestamp | +---------------------------+-------------------------------+---------------------+ | speculative | NULL | 2017-06-27 20:05:52 | | validation_pending_pre_cq | daisy_spring-no-vmtest-pre-cq | 2017-06-27 20:05:56 | | validation_pending_pre_cq | betty-pre-cq | 2017-06-27 20:05:56 | | validation_pending_pre_cq | binhost-pre-cq | 2017-06-27 20:05:56 | | validation_pending_pre_cq | nyan_blaze-no-vmtest-pre-cq | 2017-06-27 20:05:56 | | validation_pending_pre_cq | lumpy-no-vmtest-pre-cq | 2017-06-27 20:05:56 | | validation_pending_pre_cq | whirlwind-no-vmtest-pre-cq | 2017-06-27 20:05:56 | | validation_pending_pre_cq | kevin-no-vmtest-pre-cq | 2017-06-27 20:05:56 | | validation_pending_pre_cq | cyan-no-vmtest-pre-cq | 2017-06-27 20:05:56 | | validation_pending_pre_cq | samus-no-vmtest-pre-cq | 2017-06-27 20:05:56 | | validation_pending_pre_cq | reef-no-vmtest-pre-cq | 2017-06-27 20:05:56 | | screened_for_pre_cq | NULL | 2017-06-27 20:05:56 | | trybot_launching | daisy_spring-no-vmtest-pre-cq | 2017-06-27 20:09:10 | | trybot_launching | betty-pre-cq | 2017-06-27 20:09:10 | | trybot_launching | binhost-pre-cq | 2017-06-27 20:09:10 | | trybot_launching | nyan_blaze-no-vmtest-pre-cq | 2017-06-27 20:09:10 | | trybot_launching | lumpy-no-vmtest-pre-cq | 2017-06-27 20:09:10 | | trybot_launching | reef-no-vmtest-pre-cq | 2017-06-27 20:09:10 | | trybot_launching | whirlwind-no-vmtest-pre-cq | 2017-06-27 20:09:10 | | trybot_launching | kevin-no-vmtest-pre-cq | 2017-06-27 20:09:10 | | trybot_launching | cyan-no-vmtest-pre-cq | 2017-06-27 20:09:10 | | trybot_launching | samus-no-vmtest-pre-cq | 2017-06-27 20:09:10 | | trybot_launching | cyan-pre-cq | 2017-06-27 20:09:10 | | trybot_launching | caroline-no-vmtest-pre-cq | 2017-06-27 20:09:10 | | picked_up | NULL | 2017-06-27 20:12:50 | | picked_up | NULL | 2017-06-27 20:14:20 | | picked_up | NULL | 2017-06-27 20:14:22 | | picked_up | NULL | 2017-06-27 20:14:28 | | picked_up | NULL | 2017-06-27 20:14:31 | | picked_up | NULL | 2017-06-27 20:14:32 | | picked_up | NULL | 2017-06-27 20:14:33 | | picked_up | NULL | 2017-06-27 20:14:55 | | picked_up | NULL | 2017-06-27 20:15:01 | | picked_up | NULL | 2017-06-27 20:15:11 | | picked_up | NULL | 2017-06-27 20:15:17 | | picked_up | NULL | 2017-06-27 20:15:20 | | pre_cq_inflight | NULL | 2017-06-27 20:17:49 | | verified | NULL | 2017-06-27 20:39:39 | | verified | NULL | 2017-06-27 20:56:51 | | verified | NULL | 2017-06-27 21:03:15 | | verified | NULL | 2017-06-27 21:12:27 | | verified | NULL | 2017-06-27 21:13:50 | | requeued | NULL | 2017-06-27 21:15:01 | | verified | NULL | 2017-06-27 21:15:21 | | verified | NULL | 2017-06-27 21:20:57 | | verified | NULL | 2017-06-27 21:25:06 | | verified | NULL | 2017-06-27 22:05:56 | | verified | NULL | 2017-06-27 22:07:38 | | verified | NULL | 2017-06-27 22:07:39 | | kicked_out | unknown | 2017-06-27 22:37:53 | | pre_cq_failed | NULL | 2017-06-27 22:37:53 | | speculative | NULL | 2017-06-27 22:38:36 | | requeued | NULL | 2017-06-27 22:58:19 | | trybot_launching | betty-pre-cq | 2017-06-27 22:59:09 | | picked_up | NULL | 2017-06-27 23:03:23 | | pre_cq_inflight | NULL | 2017-06-27 23:05:38 | | kicked_out | unknown | 2017-06-28 01:15:43 | | pre_cq_failed | NULL | 2017-06-28 01:15:43 | | speculative | NULL | 2017-06-28 01:18:20 | +---------------------------+-------------------------------+---------------------+ 59 rows in set (0.12 sec) The list of configs to test is defined by the "validation_pending_pre_cq" entries created before "screened_for_pre_cq". Given that this table is treated as "write only", the only way to reset the values to append a "reset" action (everything before the reset is ignored). Issuing RESETs as I suggested in #3 will rerun all CLs currently in a PreCQ failed state, and screw up CLs currently mid-processing. A different option is too occasionally recompute the "screened" configs for CLs in failed state to see if the configs set has changed, and to issue a reset, if it has.
,
Jun 29 2017
reset operation will clean up the already passed pre-cq and thus increase the load on the tryserver loads. The pre-cq-launcher can instead compute the current pre-cq target set of a CL every time, when all the pre-cqs in the target set have already passed, it should mark the CL as pre_cq_all_verified and pre_cq_passed. So it doesn't need to insert reset actions or re-run the already passed pre-cq.
,
Jun 29 2017
if I remember correctly, as long as a CL is marked as 'passed', pre-cq-launcher won't trigger the pre-cqs again unless it's reset after it's stale after a couple days.
,
Jun 29 2017
Only recomputing the builders to run on failure will be lower cost, if recomputing turns out to be very expensive.
,
Jun 29 2017
,
Jun 29 2017
Deduped bug was asking us to do #5 essentially.
,
Jun 29 2017
if "recomputing" in #7 means computing the pre-cq configs to test for a CL, it's not expensive, it's getting special pre-cq configs from Commit-Queue.ini config file and merging them with default pre-cq test configs if needed.
,
Oct 6 2017
This would still be a very good tool to have, but I'm not going to work on it any time soon. Throwing back in the pool to find an owner.
,
Oct 6 2017
I think this is a good okr sideproject candidate, towards the goal of reducing precq time. Needs a bit of thought.
,
Jan 9 2018
,
Mar 19 2018
Issue 710941 has been merged into this issue.
,
Mar 19 2018
I noticed the other day that the PreCQ annotates the CL with a specific message if it tries to launch a config and the config doesn't exist. What happens if we write a 'reset' action to the CL Actions table if that happens?
,
May 14 2018
,
May 18 2018
,
Jun 8 2018
|
|||||||
►
Sign in to add a comment |
|||||||
Comment 1 by dgarr...@chromium.org
, Jun 28 2017