Rebuild cros-bighd-0002 and restart replication
Reported by jrbarnette@chromium.org, Aug 7

Issue description
Since the failure in bug 869754, the slave DB server cros-bighd-0002
hasn't been replicating the master database. So, our current config
is this:
* The master server is cros-bighd-0001, and serves most RPC queries.
* The server cros-bighd-0003 is a read-only slave replica that serves
selected read-only RPC queries from the shards, and supports the
CI archiver process.
* The server cros-bighd-0002 is configured as a read-only slave, but
replication has failed, and the server is unused.
We should get out of this state: either keep a backup
read-only replica (probably a good idea), or just decommission
and return bighd-0002 (easier, and a plausible response).
Assuming we want to keep a backup slave, we need to rebuild the database
on bighd-0002 and restart replication. IIUC, that's a non-trivial project
that includes downtime of several hours.
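For reference, a replica's health can be checked directly on the server. A minimal sketch (MySQL 5.x syntax; the host and credentials are placeholders):

# On a healthy replica both threads report "Yes"; a NULL
# Seconds_Behind_Master means replication is stopped.
$ mysql -h cros-bighd-0002 -u root -p -e 'SHOW SLAVE STATUS\G' \
    | grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Last_Error'
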
Aug 7 (jrbarnette@chromium.org)
Status: Assigned (was: Available)
Aug 9
In progress; currently copying the state of the primary replica.
Aug 9
That took over four hours and did not complete; postponing until another time, probably a weekend.
Aug 13
Jacob will follow up with Don on replica provisioning.
Aug 13
Passing on Hotlist-Deputy
Aug 13
Synced with dgarrett. The cause of this state dump taking >3x the expected time is fairly clear: there is a table to which every test adds multiple entries, and whose data is neither relevant nor meaningful after its test completes. It balloons rapidly and comprises most of the DB by raw byte count. At one point dshi kept it at a manageable size personally; since then, improvements have been nixed by akeshet@ because the table will be obsolete in Skylab. A manually runnable script exists to do this cleanup, but it breaks all active tests. Given that, scheduled downtime to both run the cleanup and then create the state-dump file seems called for. An alternative exists: it is, in theory, possible to copy the raw binary files that MySQL uses to store the DB and migrate those to the new server. That is riskier and poorly understood, so it's not recommended.
Aug 14
The DB cleanup script that I know of is for the TKO database, not the AFE, and its revival is described in Issue 805580. Can you elaborate on which AFE cleanup script you're referring to?
Aug 14
Oh... wait, yeah. I lied to jkop@ based on fuzzy memories of what I did last time. Sorry about that. I'm not sure why the dump would take so much longer.
Aug 14
From talking with pprabhu, the table in question is *probably* job dependencies? The script would be one that stored the knowledge of which table is bloated and the SQL to clean it out. I can't find such a script in the codebase, so it may not exist. Don also mentioned that he believed xixuan@ had some tool for partially automating the DB bringup process. Xixuan, is that correct?
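For concreteness, a hypothetical sketch of what such a cleanup script might do. The table name is the dependency-labels table that shows up later in this thread; the schema name, the created_on cutoff, and the assumption that rows are dead once their job is a month old are all guesses, not the real script (which may not exist):

# Hypothetical: batch-delete dependency rows for old jobs, 10k at a time,
# so no single transaction holds locks for long. Per the comment above,
# any such cleanup breaks active tests, hence the planned downtime.
while :; do
  n=$(mysql -N chromeos_autotest_db -e "
        SET @cutoff = (SELECT MAX(id) FROM afe_jobs
                        WHERE created_on < NOW() - INTERVAL 30 DAY);
        DELETE FROM afe_jobs_dependency_labels
         WHERE job_id <= @cutoff LIMIT 10000;
        SELECT ROW_COUNT();")
  [ "$n" -gt 0 ] || break
done
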
Aug 14
Or that, OK.
Aug 14
I think I was confusing the gCloud TKO upgrade with setting up AFE replication. Similar events with long runs and lots of confusion.
Aug 14
Just in case the problem looked like that anyway, I did a size check on the various tables of the AFE. Summary stats attached.
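The check was presumably something along these lines (a sketch; the schema name is an assumption):

# Rank AFE tables by on-disk size; table_rows is only an estimate for InnoDB.
$ mysql -e "
    SELECT table_name,
           ROUND((data_length + index_length)/1024/1024/1024, 1) AS size_gb,
           table_rows
      FROM information_schema.tables
     WHERE table_schema = 'chromeos_autotest_db'
     ORDER BY data_length + index_length DESC
     LIMIT 15;"
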
Aug 15
Re #14, "partially automating the DB bringup process": hmm, I don't know exactly what that means.
Aug 15
I think it meant "DB cleanup process".
Aug 16
The initial creation of the master and replicas on the bighd instances was discussed in Issue 810584, and there is some disagreement between it and the process linked above (https://sites.google.com/a/google.com/chromeos/for-team-members/infrastructure/chromeos-admin/server-management-in-the-lab/database-management). In particular, Issue 810584 suggests the existence of a backup snapshot on GS, created automatically in the background, that we can restore from rather than needing to pause and dump from the master or a replica. Worth investigating whether that is correct.
Aug 16
Using the backup in GS to restart a slave (as compared to recreating the master from scratch) requires that you know the replication point at which it was taken.
Aug 16
At least, with the mechanism that I used last time, that was true.
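Concretely, the replication point is a binlog file name plus an offset, and bringing up a replica from a dump means pointing the replica at those coordinates. A sketch (MySQL 5.x syntax; host, user, and coordinates are placeholders):

# Run on the new replica after loading the dump. The LOG_FILE/LOG_POS values
# must be the master coordinates captured when the dump was taken; a dump
# without them (like the existing GS backups) can't be used this way.
$ mysql -e "
    CHANGE MASTER TO
      MASTER_HOST='cros-bighd-0001',
      MASTER_USER='repl',
      MASTER_PASSWORD='<redacted>',
      MASTER_LOG_FILE='mysql-bin.000123',
      MASTER_LOG_POS=4;
    START SLAVE;"
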
Aug 16
$ gsutil ls gs://chromeos-lab/backup/database/weekly/
<snip>
...
gs://chromeos-lab/backup/database/weekly/autotest-dump.18.08.12.gz
Sounds promising.
Aug 16
Oh, I didn't see your two comments, dgarrett@. OK, needs investigation.
Aug 17
Investigating, it seems possible to just run a backup script out of band without disrupting the lab. I've run this today:

$ /usr/local/autotest/site_utils/backup_mysql_db.py --verbose --keep 20 --type daily
11:25:42 INFO | Starting new HTTP connection (1): metadata.google.internal
11:25:42 DEBUG| Start db backup: daily
11:25:42 DEBUG| Dumping mysql database to file /tmp/tmpC0oyYPautotest_db_dump
11:25:42 DEBUG| Running 'set -o pipefail; mysqldump --user=root --password=<REDACTED by JKOP> --ignore-table=performance_schema.cond_instances --ignore-table=performance_schema....<snip>... --all-databases | gzip - > /tmp/tmpC0oyYPautotest_db_dump
11:25:42 NOTIC| ts_mon was set up.

My hope is that the timestamps here will allow me to judge the replication point and use that to properly update the new replica.
Aug 17
That process is now cancelled: an existing dump did not have enough information to bring a replica up to the time it was created, so this one presumably would not either. However, mysqldump has functionality for this: https://dev.mysql.com/doc/refman/5.7/en/mysqldump.html#option_mysqldump_dump-slave. I will use that, and implement an additional type for backup_mysql_db.py that does it automatically.
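Per the MySQL 5.7 docs, --dump-slave is run against a replica and embeds a CHANGE MASTER TO statement carrying the replica's master coordinates, stopping the replica's SQL thread for the duration so the dump matches those coordinates exactly. A sketch of the manual equivalent (the output path is a placeholder):

# --dump-slave=2 writes the CHANGE MASTER TO statement commented out, so an
# operator applies it deliberately; =1 (the default) writes it active.
$ mysqldump --user=root --password --all-databases --dump-slave=2 \
    | gzip - > /tmp/autotest_db_dump_with_replication_point.gz
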
Aug 18
Tried that Friday evening; it ended prematurely with `mysqldump: Error 1317: Query execution was interrupted when dumping table `afe_jobs_dependency_labels` at row: 41169401`, which seems to usually be a timeout problem.
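Error 1317 (ER_QUERY_INTERRUPTED) means the server killed the query mid-run; for a long dump the usual suspects are the net_read_timeout/net_write_timeout pair or an external query-killer. A mitigation sketch (that a timeout was the actual cause here is an assumption, and the values are arbitrary):

# Inspect the current timeouts, then raise them before retrying the dump.
$ mysql -e "SHOW GLOBAL VARIABLES LIKE '%timeout%';"
$ mysql -e "SET GLOBAL net_read_timeout = 7200; SET GLOBAL net_write_timeout = 7200;"
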
Aug 20
The GS backup is missing the replication point, which is needed; but maybe we can just include the replication point in the backup.
Aug 20
CL uploaded at crrev.com/c/1182311. This would make every weekly backup store a replication point, but it inherently requires write-locking the AFE DB for >3 hours every week. Opinions wanted.
Aug 21
CL replaced with a dump type to be run manually, now here: crrev.com/c/1182319
Aug 22
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/28718871689ea3ba7ff147c6762fc0b2de1a0885

commit 28718871689ea3ba7ff147c6762fc0b2de1a0885
Author: Jacob Kopczynski <jkop@google.com>
Date: Wed Aug 22 07:50:38 2018

autotest: Create MySQL dump type for replication

This type, intended to be manually run only for the moment, write-locks
the slave DB for the duration (>3 hours) in order to save a guaranteed
replication point. This allows the dump to be used to spin up a new
replica or restore a broken one.

Alternative to CL:1182311

BUG=chromium:871808
TEST=untested

Change-Id: I3740f8d930d136469a02bad48d28b689c61d323c
Reviewed-on: https://chromium-review.googlesource.com/1182319
Reviewed-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Commit-Queue: Don Garrett <dgarrett@chromium.org>

[modify] https://crrev.com/28718871689ea3ba7ff147c6762fc0b2de1a0885/site_utils/backup_mysql_db.py
Aug 28
A ReplicationDelay alert fired yesterday; inspecting the history, it does so every Sunday morning, 3-5 hours after the scheduled backup begins at 01:00 AM. The recorded delay reliably begins between 01:50 and 02:10 AM: http://shortn/_KTJliWN2HD (smaller time ranges are needed to pin down the beginning, peak, and end of the delay). The alert fires based on the minimum over a 1-hour window. Given the timing, it is nearly certain that these backups impose a write lock on the replica for their duration. Given that, I'm reviving the proposal to make the weekly scheduled backups include a replication point: crrev.com/c/1182311

Additional action item: make the alert not fire on Sunday mornings / while the backup script is running, since this behavior is expected.
Aug 29
crrev.com/i/668928 switches the weekly backup to use the new dump type; I intend that this be reverted as soon as it's verified as successful.
Aug 29
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/68c840b618123bd16118c236b6bdeb56831e3ddb

commit 68c840b618123bd16118c236b6bdeb56831e3ddb
Author: Jacob Kopczynski <jkop@google.com>
Date: Wed Aug 29 16:31:25 2018
Sep 2
Backup dump created successfully.
Sep 2
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/db1c89fca3128a7328e3f9c19d9015f819af895a

commit db1c89fca3128a7328e3f9c19d9015f819af895a
Author: Jacob Kopczynski <jkop@google.com>
Date: Sun Sep 02 22:42:24 2018
Sep 5
Creation of the replica DB on c-bh-02 (cros-bighd-0002) is in progress.
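Roughly, bringing up the replica from the new dump type looks like this. A sketch (the file name is a placeholder, and --dump-slave=2 is assumed, so the embedded CHANGE MASTER TO must be applied by hand as in the earlier sketch):

# Load the dump on the new replica, apply the embedded replication
# coordinates, then start replication and watch it catch up.
$ zcat autotest-dump.gz | mysql --user=root --password
$ mysql -e "START SLAVE;"
$ mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Running|Seconds_Behind_Master'
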
Sep 8
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/1a05dc0eb0d7ba2d0289024d09810f7444d6ac27

commit 1a05dc0eb0d7ba2d0289024d09810f7444d6ac27
Author: Jacob Kopczynski <jkop@google.com>
Date: Sat Sep 08 01:37:00 2018

Add replication as allowable backup type

BUG=chromium:871808
TEST=untested

Change-Id: Ibf1d83f81716b425a263c146a55ec06339789642
Reviewed-on: https://chromium-review.googlesource.com/1201323
Commit-Ready: Jacob Kopczynski <jkop@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/1a05dc0eb0d7ba2d0289024d09810f7444d6ac27/site_utils/backup_mysql_db.py
Sep 26
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/93cf7d6646ffb3cf9f4a048fa94be74b5e039a88

commit 93cf7d6646ffb3cf9f4a048fa94be74b5e039a88
Author: Jacob Kopczynski <jkop@google.com>
Date: Wed Sep 26 17:32:38 2018

autotest: Add replication point to weekly backup

Change the weekly scheduled backups to include a replication point.
This allows a replica to be created from them.

Pros: A new replica can be created and brought up at most a week out
of date without unexpected downtime.
Cons: Backups impose a write lock on the DB for their duration, which
is ~5 hours. This affects the replica only and is outside business
hours, but it creates downtime.

BUG=chromium:871808
BUG=chromium:878507
TEST=untested

Change-Id: I8c00a571f0cca6150ded32eefb5c3151cf3eb8ea
Reviewed-on: https://chromium-review.googlesource.com/1182311
Commit-Ready: Jacob Kopczynski <jkop@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>

[modify] https://crrev.com/93cf7d6646ffb3cf9f4a048fa94be74b5e039a88/site_utils/backup_mysql_db.py