Issue 878068

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Oct 1
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




Audit all TKO tables for AUTO_INCREMENT hitting ceiling

Reported by pprabhu@chromium.org (Project Member), Aug 27

Issue description

We had an outage this week due to an AUTO_INCREMENT column hitting its limit (issue 877917). This bug is a follow-up.

- Audit all TKO tables to figure out if we're close to hitting any other AUTO_INCREMENT limit.
- Consider adding an alert in the prod-TKO instance for this condition on various columns
 
> - Consider adding an alert in the prod-TKO instance for this condition on various columns

If we find any instances of an id that could conceivably run out before we
retire the TKO db, I'd say we should take the time to just grow the id to
a 64-bit value rather than count on an alert.

Labels: -Chase-Pending Chase
Owner: jkop@chromium.org
Status: Assigned (was: Untriaged)
Cc: jkop@chromium.org

Owner: jrbarnette@chromium.org
Status: Started (was: Assigned)
Through an application of magic devised here: http://code.openark.org/blog/mysql/checking-for-auto_increment-capacity-with-single-query, I can safely say that only one table has any chance whatsoever of hitting the ceiling, which is tko_job_keyvals. (A simplified sketch of the approach is included after the attachments below.)

tko_job_keyvals.id is at 43% of id capacity, tko_tests.test_idx is at 20% of capacity, tko_jobs.job_idx and tko_task_references.id are nearing 5% of capacity. All others register as 00.00% of ids taken.

I'm only ~80% certain that this means tko_job_keyvals is in no danger; handing to Richard to confirm. Hand it back to me if there's any work to be done.
Attachments:
magic query.txt (869 bytes)
query results.txt (4.3 KB)
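
For reference, a minimal sketch of the kind of capacity check the linked openark post describes, not the attached magic query.txt verbatim. The schema name below is a placeholder, and the capacity math only handles plain 32-bit INT keys:

-- Simplified stand-in for the attached query; assumes INT AUTO_INCREMENT
-- columns (extend the IF() for BIGINT or other integer widths).
SELECT
  t.TABLE_NAME,
  c.COLUMN_NAME,
  t.AUTO_INCREMENT AS next_id,
  c.COLUMN_TYPE,
  ROUND(100 * t.AUTO_INCREMENT /
        IF(c.COLUMN_TYPE LIKE '%unsigned%', 4294967295, 2147483647), 2)
    AS pct_of_capacity
FROM information_schema.TABLES AS t
JOIN information_schema.COLUMNS AS c
  ON c.TABLE_SCHEMA = t.TABLE_SCHEMA
 AND c.TABLE_NAME = t.TABLE_NAME
 AND c.EXTRA LIKE '%auto_increment%'
WHERE t.TABLE_SCHEMA = 'tko_db'        -- placeholder, not the real schema name
  AND t.TABLE_NAME LIKE 'tko\_%'
ORDER BY pct_of_capacity DESC;

The percentage comes from information_schema.TABLES.AUTO_INCREMENT, i.e. the next id that would be handed out, which is the number that matters for this audit.
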
Owner: jkop@chromium.org
The nominal goal is along the lines of "we'll be dead (or at least retired)
before this is a problem."  A more objective measure would be "good for at
least 50 years, even at a significant rate of growth."

None of these tables is more than 10 years old, so if we assume linear
growth, 40% used in a 10 year space means we have about 15 years to go
(roughly 4% per year, with 60% of the id space left).  But "assume
linear growth" is unrealistic (quadratic or exponential is more like
it).  And you, at least, should still be working 15 years from now.

So, I'd recommend we grow the key size for all four of the columns with
a non-zero "AUTO_INCREMENT_RATIO".

Comment 5 Deleted

Last I heard, the TKO has less than five years to live. (And probably less than *one*.) Expanding TKO tables to cover a 10-year span, let alone 50 years, is a definite YAGNI.
> Last I heard, the TKO has less than five years to live.

? This is news to me.

If it's true, then yes, the current schema probably doesn't need
revisiting.

I may have misunderstood, but I thought we were abandoning it as a later step of the move to Skylab.
Need to know which tables will run out of ids, and how soon
(TKO will be around for > 5 years), so that we can decide the order in which to handle them.
Want a table of expected table death on the bug.
Punt until next week.
Results:

tko_jobs and tko_tests, by started_time with NULLs ignored:
tko_jobs
job_idx 168499490 at Unix time 1521788401
job_idx 211105961 at Unix time 1537829343
2.656 idx per second
estimated time of death: 2067

tko_tests
test_idx 856946024 at Unix time 1537830425
test_idx 700695476 at Unix time 1522220400
10.01 idx per second
estimated time of death: 2029

tko_task_references and tko_job_keyvals, by the two-week span since #c3; much lower confidence estimate
tko_task_references
rate: 3.08 per second
estimated time of death: 2040
tko_job_keyvals
rate: 63.8 per second
estimated time of death: April 2019

So tko_job_keyvals is P1, tko_tests is P2, and tko_jobs and tko_task_references are P3+.
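
For anyone reproducing these numbers, here is a rough sketch of the same estimate done directly in SQL, shown for tko_jobs. It assumes job_idx is an unsigned INT, treats growth between the oldest and newest non-NULL started_time as linear, and is not the exact sampling used for the figures above:

-- Rough linear projection; MIN/MAX of job_idx and started_time are taken
-- independently, which is fine as long as both grow monotonically.
SELECT
  (MAX(job_idx) - MIN(job_idx)) /
  (UNIX_TIMESTAMP(MAX(started_time)) - UNIX_TIMESTAMP(MIN(started_time)))
    AS idx_per_second,
  (4294967295 - MAX(job_idx)) *          -- use 2147483647 for a signed INT key
  (UNIX_TIMESTAMP(MAX(started_time)) - UNIX_TIMESTAMP(MIN(started_time))) /
  (MAX(job_idx) - MIN(job_idx)) / (365.25 * 86400)
    AS years_until_exhaustion
FROM tko_jobs
WHERE started_time IS NOT NULL;

Run against the other tables (swapping in the right id column, or a pair of manually sampled points where there is no usable timestamp column), this should land in the same ballpark as the figures above.
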
Labels: -Pri-1 -Chase Pri-2
Owner: akes...@chromium.org
Status: Assigned (was: Started)
Ok, so the most foreseeable failure is a recurrence of the tko_job_keyvals failure:

tko_job_keyvals
rate: 63.8 per second
estimated time of death: April 2019

The solution there is to make it a 64-bit key, which we can do with no downtime (rough sketch below). Will file a separate chase bug for that and link here.

Removing chase here locally.
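
For concreteness, the 64-bit change mentioned above would amount to something like the sketch below. This assumes tko_job_keyvals.id is currently a 32-bit INT; signedness should match the existing schema, and the no-downtime part comes from however the ALTER is actually executed (e.g. an online schema-change tool), which isn't covered here:

-- Sketch only: widens the AUTO_INCREMENT key to 64 bits.  A plain ALTER TABLE
-- rebuilds the table and can block writes for a long time on a large table,
-- so the real migration would presumably go through an online schema change.
ALTER TABLE tko_job_keyvals
  MODIFY id BIGINT NOT NULL AUTO_INCREMENT;
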
Status: Fixed (was: Assigned)
Follow up action at Issue 890970
