mysql_stats.py should determine and record Seconds_behind_master; viceroy dashboard should display it
Issue description: This is a possible blocker for using slave replicas for mission-critical queries, as in Issue 810965. We should instrument Seconds_behind_master so that we can be sure that adding load to these slaves does not put them so far behind that they start serving incorrect results.
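For illustration, here is a minimal sketch of how the lag value could be pulled out of the text that `show slave status\G` prints. The helper name and the regular expression are hypothetical and are not taken from mysql_stats.py or check_slave_db_delay.py. It assumes the usual `Seconds_Behind_Master: <value>` line, where MySQL reports `NULL` when replication is not running.

```python
import re

# Hypothetical helper for illustration; not the code used in autotest.
_SECONDS_BEHIND_RE = re.compile(
    r'^\s*Seconds_Behind_Master:\s*(\S+)\s*$', re.MULTILINE)


def parse_seconds_behind_master(slave_status_output):
    """Return the replication lag in seconds, or None if it is unavailable.

    None is returned when the field is missing or when MySQL reports NULL
    (replication stopped or the slave is not connected to its master).
    """
    match = _SECONDS_BEHIND_RE.search(slave_status_output)
    if match is None or match.group(1) == 'NULL':
        return None
    return int(match.group(1))
```

Returning None rather than 0 for the NULL case keeps a stopped replica from looking like a fully caught-up one, which matters if the value feeds an alert.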
,
Mar 12 2018
,
Apr 17 2018
Bumping to P1 since this is an important correctness guarantee during the rollout of the shard change.
,
Apr 18 2018
A cronjob to collect this is already in place, but it's broken. It should be working again soon.
,
Apr 18 2018
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/cddbfe99db3e0a4e2e26a708161d1c88dffb10ba

commit cddbfe99db3e0a4e2e26a708161d1c88dffb10ba
Author: Jacob Kopczynski <jkop@google.com>
Date: Wed Apr 18 20:11:16 2018
,
Apr 20 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/35fa24c0b35b403ed0a30d36774432672c2a8981

commit 35fa24c0b35b403ed0a30d36774432672c2a8981
Author: Jacob Kopczynski <jkop@google.com>
Date: Fri Apr 20 04:59:06 2018

autotest: change db metric to match other names

Also some whitespace cleanup.

BUG=chromium:810966
TEST=None

Change-Id: I35e02da58ee5dae7834c589433c8db5a8b08228c
Reviewed-on: https://chromium-review.googlesource.com/1016169
Commit-Ready: Jacob Kopczynski <jkop@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>
Reviewed-by: Dan Shi <dshi@google.com>

[modify] https://crrev.com/35fa24c0b35b403ed0a30d36774432672c2a8981/site_utils/stats/mysql_stats.py
[modify] https://crrev.com/35fa24c0b35b403ed0a30d36774432672c2a8981/site_utils/check_slave_db_delay.py
,
May 3 2018
I'm seeing this logging on the cautotest master server. (Is this where we intend to run this daemon from?) Passwords are redacted below, though they are written in plaintext in the log.

chromeos-test@cros-full-0036:/var/log$ less check_slave_db.log
<snip>
ERROR 2005 (HY000): Unknown MySQL server host 'undef' (0)
05/02 17:17:03.092 DEBUG|check_slave_db_del:0054| Seconds_Behind_Master of server cros-bighd-0003.mtv.corp.google.com is 0.
05/02 17:17:03.187 DEBUG|check_slave_db_del:0054| Seconds_Behind_Master of server cros-bighd-0002.mtv.corp.google.com is 0.
05/02 17:17:03.188 INFO |check_slave_db_del:0110| Finished checking.
05/02 18:17:02.572 INFO |check_slave_db_del:0097| Start checking Seconds_Behind_Master of slave databases
05/02 18:17:02.986 ERROR|check_slave_db_del:0063| Failed to get slave status of server undef.
Traceback (most recent call last):
  File "/usr/local/autotest/site_utils/check_slave_db_delay.py", line 47, in check_delay
    result = utils.run_sql_cmd(server, user, password, SLAVE_STATUS_CMD)
  File "/usr/local/autotest/client/bin/utils.py", line 2421, in run_sql_cmd
    return utils.run(cmd, verbose=False).stdout
  File "/usr/local/autotest/client/common_lib/utils.py", line 748, in run
    "Command returned non-zero exit status")
CmdError: Command <mysql -ucros-infra-admin -p<redacted> --host undef -e "show slave status\G"> failed, rc=1, Command returned non-zero exit status
* Command: mysql -ucros-infra-admin -p<redacted> --host undef -e "show slave status\G"
Exit status: 1
Duration: 0.408904075623
stderr:
ERROR 2005 (HY000): Unknown MySQL server host 'undef' (0)
05/02 18:17:03.377 ERROR|check_slave_db_del:0063| Failed to get slave status of server undef.
Traceback (most recent call last):
  File "/usr/local/autotest/site_utils/check_slave_db_delay.py", line 47, in check_delay
    result = utils.run_sql_cmd(server, user, password, SLAVE_STATUS_CMD)
  File "/usr/local/autotest/client/bin/utils.py", line 2421, in run_sql_cmd
    return utils.run(cmd, verbose=False).stdout
  File "/usr/local/autotest/client/common_lib/utils.py", line 748, in run
    "Command returned non-zero exit status")
CmdError: Command <mysql -ucros-infra-admin -p<redacted> --host undef -e "show slave status\G"> failed, rc=1, Command returned non-zero exit status
* Command: mysql -ucros-infra-admin -p<redacted> --host undef -e "show slave status\G"
Exit status: 1
Duration: 0.382756948471
stderr:
ERROR 2005 (HY000): Unknown MySQL server host 'undef' (0)
05/02 18:17:03.614 DEBUG|check_slave_db_del:0054| Seconds_Behind_Master of server cros-bighd-0003.mtv.corp.google.com is 0.
05/02 18:17:03.714 DEBUG|check_slave_db_del:0054| Seconds_Behind_Master of server cros-bighd-0002.mtv.corp.google.com is 0.
05/02 18:17:03.714 INFO |check_slave_db_del:0110| Finished checking.

Also, I'm unsure whether this is tangentially related to Issue 839028; I discovered it while poking around because of that issue.
,
May 3 2018
,
May 3 2018
Yes, that's related. I noticed it today while checking in on why the metric still isn't emitted. crrev.com/i/619496 has the fix.
,
May 3 2018
As for it being the correct location: It's where it was put a year or so ago when the script was first written. Moving it elsewhere (sentinel?) would be a trivial puppet change.
,
May 3 2018
Ok, just wasn't sure if it was designed to run on the slave itself. Sounds like no.
,
May 4 2018
Issue 682489 has been merged into this issue.
,
May 9 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/4300f252c66d2e1b08baffdf08fc3399cd90c87f

commit 4300f252c66d2e1b08baffdf08fc3399cd90c87f
Author: Jacob Kopczynski <jkop@google.com>
Date: Wed May 09 21:40:21 2018

autotest: ts_mon to fix slave delay metrics

Metrics calls have existed for a year or so, but without a ts_mon
invocation in the calling script they can't be emitted. Add one.

BUG=chromium:682489
BUG=chromium:810966
TEST=tried it briefly on live server

Change-Id: I1af80ec71fa41faa7556dba313ab7f85fcbb1339
Reviewed-on: https://chromium-review.googlesource.com/1043060
Commit-Ready: Jacob Kopczynski <jkop@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Allen Li <ayatane@chromium.org>

[modify] https://crrev.com/4300f252c66d2e1b08baffdf08fc3399cd90c87f/site_utils/check_slave_db_delay.py
,
May 18 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/2049b853d0ccf2e2e549fcf83785bba01cb3cd7f

commit 2049b853d0ccf2e2e549fcf83785bba01cb3cd7f
Author: Jacob Kopczynski <jkop@google.com>
Date: Fri May 18 02:34:56 2018

autotest: Fix ts_mon invocation for slave delay

Metric was not being sent properly.

BUG=chromium:810966
TEST=Very briefly tested in prod.

Change-Id: I4040b4a16e5a89c5323d0b22ed42ac9667d15750
Reviewed-on: https://chromium-review.googlesource.com/1058075
Commit-Ready: Jacob Kopczynski <jkop@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/2049b853d0ccf2e2e549fcf83785bba01cb3cd7f/site_utils/check_slave_db_delay.py
,
May 23 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/1a60d5c8cc8c71260589bc728f7f2e4d9dbbc5cf

commit 1a60d5c8cc8c71260589bc728f7f2e4d9dbbc5cf
Author: Jacob Kopczynski <jkop@google.com>
Date: Wed May 23 01:45:49 2018

autotest: Convert slave delay to float

We care about the delay at a specific time, not the cumulative delay
over the course of a period.

BUG=chromium:810966
TEST=None

Change-Id: I11f9545df698d31e321545104e639ca7d1e3ec23
Reviewed-on: https://chromium-review.googlesource.com/1066927
Commit-Ready: Jacob Kopczynski <jkop@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Jacob Kopczynski <jkop@chromium.org>

[modify] https://crrev.com/1a60d5c8cc8c71260589bc728f7f2e4d9dbbc5cf/site_utils/check_slave_db_delay.py
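To illustrate the distinction this commit message draws, here is a small sketch that contrasts a metric which accumulates samples with one that only holds the latest reading. Both classes are hypothetical stand-ins written for this example, and the sample values are made up. They are not the ts_mon metric types used in check_slave_db_delay.py, and the sketch assumes the earlier metric behaved cumulatively, as the message implies.

```python
class CumulativeMetric(object):
    """Hypothetical counter-style metric: every sample is added to a total."""

    def __init__(self):
        self.value = 0

    def record(self, sample):
        self.value += sample


class PointInTimeMetric(object):
    """Hypothetical gauge-style metric: only the latest sample is kept."""

    def __init__(self):
        self.value = None

    def record(self, sample):
        self.value = float(sample)


# Made-up lag readings: the slave falls behind once, then catches up.
samples = [0, 30, 0]

cumulative = CumulativeMetric()
point_in_time = PointInTimeMetric()
for sample in samples:
    cumulative.record(sample)
    point_in_time.record(sample)
```

After the three samples, the cumulative metric still carries the earlier 30-second lag, while the point-in-time metric reflects that the slave has caught up. The latter is the reading a dashboard or alert on current replication delay would want.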
,
May 23 2018
This would also fulfill a postmortem-follow-up for our recent outage, by adding liveness metrics to our slave. To fully fulfill the postmortem's follow-up, we should also add an alert.
,
May 29 2018
Metric is functioning, alert still pending.
,
May 29 2018
Please link to the metric. What I see is http://shortn/_EhCv3Ot3lR, which suggests that we should up the cronjob frequency (getting an update once per hour might not be enough).
,
May 30 2018
,
Jun 4 2018
,
Jun 4 2018
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/478ee2c80286be985b6c907d5fd6e4035e69e62a

commit 478ee2c80286be985b6c907d5fd6e4035e69e62a
Author: Jacob Kopczynski <jkop@google.com>
Date: Mon Jun 04 21:47:06 2018
,
Jun 8 2018
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/be1dd03a8dcf22ed6d2c8c77da7eaa5462b5c0cf

commit be1dd03a8dcf22ed6d2c8c77da7eaa5462b5c0cf
Author: Jacob Kopczynski <jkop@google.com>
Date: Fri Jun 08 18:54:14 2018
,
Jul 12
Comment 1 by akes...@chromium.org
, Feb 10 2018