
Issue 871808

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Sep 6
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Task




Rebuild cros-bighd-0002, and restart replication

Reported by jrbarnette@chromium.org, Aug 7

Issue description

Since the failure in bug 869754, the slave DB server cros-bighd-0002
hasn't been replicating the master database.  So, our current config
is this:
  * The master server is cros-bighd-0001, and serves most RPC queries.
  * The server cros-bighd-0003 is a read-only slave replica that serves
    selected read-only RPC queries from the shards, and supports the
    CI archiver process.
  * The server cros-bighd-0002 is configured as a read-only slave, but
    replication has failed, and the server is unused.

We should get out of this state: either we should have a backup
read-only replica (probably a good idea), or we should just decommission
and return bighd-0002 (easier, and a plausible response).

Assuming we want to keep a backup slave, we need to rebuild the database
on bighd-0002 and restart replication.  IIUC, that's a non-trivial project
that includes downtime of several hours.
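
For reference, the generic MySQL flow for such a rebuild is roughly the
following; this is a sketch rather than our documented procedure, and the
flags and credentials are illustrative:

# On the master (or a healthy replica): take a consistent dump that
# records the binlog position as a comment (--master-data=2).
$ mysqldump --user=root --password=... --all-databases \
    --single-transaction --master-data=2 | gzip - > /tmp/autotest_db_dump.gz

# On bighd-0002: restore the dump, then resume replication from the
# coordinates recorded in it.
$ zcat /tmp/autotest_db_dump.gz | mysql --user=root --password=...
$ mysql -e "CHANGE MASTER TO MASTER_HOST='cros-bighd-0001',
      MASTER_LOG_FILE='<from dump>', MASTER_LOG_POS=<from dump>; START SLAVE;"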
 
Owner: dgarr...@chromium.org
Status: Assigned (was: Available)
According to dgarrett@, there are written instructions about how to
rebuild the database and restore replication.  Alas, I breezed through
the content at go/cros-infra and go/chromeos-lab-admin and didn't find
anything.

So, assigning to dgarrett@, who will post the link to the instructions.
Labels: Hotlist-Deputy
Owner: ----
Status: Available (was: Assigned)
Labels: -Pri-2 Pri-1
Owner: jkop@chromium.org
Status: Assigned (was: Available)
Status: Started (was: Assigned)
In progress, currently copying the state of the primary replica.
Labels: Chase-Pending
Status: Assigned (was: Started)
That took over four hours and did not complete; postponing until another time, probably a weekend.
Labels: -Chase-Pending Chase
Jacob will follow up with Don on replica provisioning.
Owner: xixuan@chromium.org
Passing on Hotlist-Deputy
Owner: jkop@chromium.org
Labels: -Hotlist-Deputy
Cc: akes...@chromium.org dgarr...@chromium.org xixuan@chromium.org pprabhu@chromium.org
Synced with dgarrett. The cause of this state-dump taking >3x the expected time is fairly clear.

There is a table to which every test adds multiple entries, none of which are relevant or meaningful after the test completes. It balloons rapidly and comprises most of the DB by raw byte count. At one point dshi kept it to a manageable size personally; since then, improvements have been nixed by akeshet@ because the table will be obsolete in Skylab. A manually-runnable script exists to do this cleanup, but it breaks all active tests.

Given that, scheduled downtime to both run the cleanup and then create the state-dump file seems called for. Alternatives exist: it is, in theory, possible to copy the raw binary files that MySQL uses to store the DB, and migrate those to the new server. This is riskier and poorly understood, so it's not recommended.
The db cleanup script that I know of is for the tko database, not the afe, and its revival is described in Issue 805580.

Can you elaborate on which afe cleanup script you're referring to?
Oh.... wait, yeah.

I lied to jkop@ based on fuzzy memories of what I did last time. Sorry about that. I'm not sure why the dump would take so much longer.
Talking with pprabhu, the table in question is *probably* job dependencies? It would be a script that stored the knowledge of which table is bloated and the SQL to clean it out. I can't find such a script in the codebase, so it may not exist.
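
If such a script were written, its core would presumably be a bounded DELETE against that table; a purely hypothetical sketch, using the afe_jobs_dependency_labels name that shows up later in this bug (the database name and the completed-job condition are assumptions that would need checking against the AFE schema):

# Hypothetical cleanup sketch only; 'chromeos_autotest_db' and the join
# are assumptions, and the WHERE clause is a deliberate no-op placeholder
# for a schema-dependent "job has completed" condition.
$ mysql chromeos_autotest_db -e "
    DELETE d FROM afe_jobs_dependency_labels d
      JOIN afe_jobs j ON j.id = d.job_id
     WHERE /* job j has completed; schema-dependent */ 0;"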

Don also mentioned that he believed xixuan@ had some tool for partially automating the DB bringup process. Xixuan, is that correct?
Or that, OK.
I think I was confusing the gCloud TKO upgrade with setting up AFE replication. Similar events with long runs and lots of confusion.
Just in case the problem looked like that anyway, I did a size check on the various tables of the AFE. Summary stats attached.
Attachment: bighd0003-AFE-DB-Sizes-MB.txt (4.8 KB)
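
For anyone repeating that size check, a standard information_schema query does it; 'chromeos_autotest_db' is assumed as the schema name:

# Per-table sizes in MB, largest first; the schema name is an assumption.
$ mysql -e "
    SELECT table_name,
           ROUND((data_length + index_length) / 1048576) AS size_mb
      FROM information_schema.tables
     WHERE table_schema = 'chromeos_autotest_db'
     ORDER BY size_mb DESC;"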
Re #14, "partially automating the DB bringup process": hmm, I don't exactly know what that means.
I think it meant "DB cleanup process".
The initial creation of the master and replicas on the bighd instances was discussed in Issue 810584, and there is some disagreement between that and the process linked above (https://sites.google.com/a/google.com/chromeos/for-team-members/infrastructure/chromeos-admin/server-management-in-the-lab/database-management)

In particular, Issue 810584 suggests the existence of a backup snapshot on gs, created automatically in the background, that we could restore from rather than needing to pause and dump from the master or a replica. Worth investigating whether that is correct.
Using the backup in GS to restart a slave (as compared to recreating the master from scratch) requires that you know the replication point at which it was taken.
At least, with the mechanism that I used last time, that was true.
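
To make that concrete: the "replication point" is the master's binlog file name and offset at the moment of the dump, and bringing up the slave means feeding those coordinates to CHANGE MASTER TO. A sketch with placeholder coordinates:

# Placeholder coordinates; the real values must come from the dump itself.
$ mysql -e "
    CHANGE MASTER TO
      MASTER_HOST='cros-bighd-0001',
      MASTER_LOG_FILE='mysql-bin.000123',
      MASTER_LOG_POS=4567890;
    START SLAVE;"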
$ gsutil ls gs://chromeos-lab/backup/database/weekly/

<snip>
...
gs://chromeos-lab/backup/database/weekly/autotest-dump.18.08.12.gz

Sounds promising.
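
Whether that dump records a replication point should be checkable without restoring it, e.g. by streaming it and looking for a CHANGE MASTER line near the top (a sketch; it assumes the dump was taken with --master-data or similar):

$ gsutil cat gs://chromeos-lab/backup/database/weekly/autotest-dump.18.08.12.gz \
    | zcat | head -100 | grep -i 'CHANGE MASTER'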
Oh, I didn't see your two comments, dgarrett@.

Ok, needs investigation.
Status: Started (was: Assigned)
Investigating; it seems possible to just run a backup script out of band without disrupting the lab. I ran this today:

$ /usr/local/autotest/site_utils/backup_mysql_db.py --verbose --keep 20 --type daily
11:25:42 INFO | Starting new HTTP connection (1): metadata.google.internal
11:25:42 DEBUG| Start db backup: daily
11:25:42 DEBUG| Dumping mysql database to file /tmp/tmpC0oyYPautotest_db_dump
11:25:42 DEBUG| Running 'set -o pipefail; mysqldump --user=root --password=<REDACTED by JKOP> --ignore-table=performance_schema.cond_instances --ignore-table=performance_schema....<snip>... --all-databases | gzip - > /tmp/tmpC0oyYPautotest_db_dump'
11:25:42 NOTIC| ts_mon was set up.

My hope is that the timestamps here will allow me to judge the replication point and use that to properly update the new replica.
That process is now cancelled: an existing dump turned out not to contain enough information to bring a replica up to the time it was created, so this one presumably would not either.

However, there is functionality for this in mysqldump: https://dev.mysql.com/doc/refman/5.7/en/mysqldump.html#option_mysqldump_dump-slave

So I will use that, and implement an additional type for backup_mysql_db.py that does so automatically.
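
For reference, --dump-slave makes mysqldump record the slave's view of the master's binlog coordinates in the dump, so a dump taken on bighd-0003 could seed a new replica of bighd-0001 directly; roughly (a sketch, flags illustrative):

# Run on the existing slave. --dump-slave stops the slave's SQL thread for
# the duration and writes a CHANGE MASTER TO line (=2: as a comment).
$ mysqldump --user=root --password=... --all-databases --dump-slave=2 \
    | gzip - > /tmp/replication_dump.gz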
Tried that Friday evening; it ended prematurely with `mysqldump: Error 1317: Query execution was interrupted when dumping table `afe_jobs_dependency_labels` at row: 41169401`, which usually indicates a timeout problem.
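
If it is a timeout, the usual suspects are server-side limits; raising them before retrying is cheap to test (an assumption, not a confirmed diagnosis):

# max_execution_time (5.7+, in milliseconds) interrupts long SELECTs with
# exactly this error; 0 disables it. net_write_timeout is the other common
# cause of interrupted dumps.
$ mysql -e "SET GLOBAL max_execution_time = 0;
            SET GLOBAL net_write_timeout  = 7200;"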
The gs backup is missing the replication point, which is needed; but maybe we can just include the replication point in the backup.
CL uploaded at crrev.com/c/1182311
This would make every weekly backup store a replication point, but that inherently necessitates that it write-lock the AFE DB for >3 hours every week. Opinions wanted.
CL replaced with a dump type to be run manually, now here: crrev.com/c/1182319

Comment 31 by bugdroid1@chromium.org, Aug 22

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/28718871689ea3ba7ff147c6762fc0b2de1a0885

commit 28718871689ea3ba7ff147c6762fc0b2de1a0885
Author: Jacob Kopczynski <jkop@google.com>
Date: Wed Aug 22 07:50:38 2018

autotest: Create MySQL dump type for replication

This type, intended to be manually run only for the moment, write-locks the
slave DB for the duration (>3 hours) in order to save a guaranteed
replication point. This allows the dump to be used to spin up a new
replica or restore a broken one.
Alternative to CL:1182311

BUG= chromium:871808 
TEST=untested

Change-Id: I3740f8d930d136469a02bad48d28b689c61d323c
Reviewed-on: https://chromium-review.googlesource.com/1182319
Reviewed-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Commit-Queue: Don Garrett <dgarrett@chromium.org>

[modify] https://crrev.com/28718871689ea3ba7ff147c6762fc0b2de1a0885/site_utils/backup_mysql_db.py

A ReplicationDelay alert fired yesterday; inspecting the history shows it does so every Sunday morning, 3-5 hours after the scheduled backup begins at 01:00 AM. The recorded delay reliably begins between 01:50 AM and 02:10 AM (http://shortn/_KTJliWN2HD; smaller time ranges are needed to pin down the beginning, peak, and end of the delay). The alert fires based on the minimum in a 1-hour time window.

Given the timing, it is nearly certain that these backups impose a write lock on the replica for the duration. Given that, I'm reviving the proposal to make the weekly scheduled backups include a replication point: crrev.com/c/1182311

Additional action item: Make the alert not fire on Sunday mornings/when the backup script is being run, since this behavior is expected.
crrev.com/i/668928 switches the weekly backup to use the new dump type; I intend that this be reverted as soon as it's verified as successful.

Comment 34 by bugdroid1@chromium.org, Aug 29

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/68c840b618123bd16118c236b6bdeb56831e3ddb

commit 68c840b618123bd16118c236b6bdeb56831e3ddb
Author: Jacob Kopczynski <jkop@google.com>
Date: Wed Aug 29 16:31:25 2018

Backup created successfully.

Comment 37 by bugdroid1@chromium.org, Sep 2

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/db1c89fca3128a7328e3f9c19d9015f819af895a

commit db1c89fca3128a7328e3f9c19d9015f819af895a
Author: Jacob Kopczynski <jkop@google.com>
Date: Sun Sep 02 22:42:24 2018

Creation of the replica DB on c-bh-02 is in progress
Status: Fixed (was: Started)
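
For the record, a rebuilt replica's health is typically confirmed with SHOW SLAVE STATUS; a minimal check (standard MySQL, not a step from our docs):

# On the new replica: both threads should say Yes, and lag should be ~0.
$ mysql -e "SHOW SLAVE STATUS\G" | grep -E \
    'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'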

Comment 40 by bugdroid1@chromium.org, Sep 8

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/1a05dc0eb0d7ba2d0289024d09810f7444d6ac27

commit 1a05dc0eb0d7ba2d0289024d09810f7444d6ac27
Author: Jacob Kopczynski <jkop@google.com>
Date: Sat Sep 08 01:37:00 2018

Add replication as allowable backup type

BUG= chromium:871808 
TEST=untested

Change-Id: Ibf1d83f81716b425a263c146a55ec06339789642
Reviewed-on: https://chromium-review.googlesource.com/1201323
Commit-Ready: Jacob Kopczynski <jkop@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/1a05dc0eb0d7ba2d0289024d09810f7444d6ac27/site_utils/backup_mysql_db.py
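
Assuming the type string matches the commit subject, the manual invocation would now be (inferred, not verified against the CL):

# Inferred usage of the new backup type added by the CLs above.
$ /usr/local/autotest/site_utils/backup_mysql_db.py --verbose --type replication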


Comment 41 by bugdroid1@chromium.org, Sep 26

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/93cf7d6646ffb3cf9f4a048fa94be74b5e039a88

commit 93cf7d6646ffb3cf9f4a048fa94be74b5e039a88
Author: Jacob Kopczynski <jkop@google.com>
Date: Wed Sep 26 17:32:38 2018

autotest: Add replication point to weekly backup

Change the weekly scheduled backups to include a replication
point. This allows a replica to be created from them.

Pros: A new replica can be created and brought up at most a week out of
date without unexpected downtime.

Cons: Backups impose a write lock on the DB for their duration,
  which is ~5 hours. This affects the replica only and is outside
  business hours, but it creates downtime.

BUG= chromium:871808 
BUG=chromium:878507
TEST=untested

Change-Id: I8c00a571f0cca6150ded32eefb5c3151cf3eb8ea
Reviewed-on: https://chromium-review.googlesource.com/1182311
Commit-Ready: Jacob Kopczynski <jkop@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>

[modify] https://crrev.com/93cf7d6646ffb3cf9f4a048fa94be74b5e039a88/site_utils/backup_mysql_db.py
