
Issue 810966 link

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Jul 12
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocking:
issue 810965




mysql_stats.py should determine and record Seconds_Behind_Master; the viceroy dashboard should display it

Project Member Reported by akes...@chromium.org, Feb 10 2018

Issue description

This is a possible blocker for using slave replicas for mission-critical queries, as in Issue 810965. We should instrument Seconds_Behind_Master so that we can be sure that adding load to these slaves does not put them so far behind that they start serving incorrect results.
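
For illustration only, here is a minimal sketch of how the delay could be read out of the vertical output of `show slave status\G`. The function name, the regular expression, and the sample text are assumptions made for this example; this is not the code in mysql_stats.py or check_slave_db_delay.py.

```python
import re

# Matches the "Seconds_Behind_Master" row in vertical (\G) output.
_DELAY_RE = re.compile(r"^\s*Seconds_Behind_Master:\s*(\S+)\s*$", re.MULTILINE)


def parse_seconds_behind_master(status_output):
    """Return the replica delay in seconds, or None if it is unavailable.

    MySQL reports NULL when replication is not running, so None should be
    treated as an unhealthy replica rather than as zero delay.
    """
    match = _DELAY_RE.search(status_output)
    if match is None or match.group(1).upper() == "NULL":
        return None
    return int(match.group(1))


# Hypothetical sample output, trimmed to the relevant rows.
SAMPLE = """\
*************************** 1. row ***************************
             Slave_IO_State: Waiting for master to send event
      Seconds_Behind_Master: 12
"""

if __name__ == "__main__":
    print(parse_seconds_behind_master(SAMPLE))
```

Returning None instead of 0 for a NULL value keeps a stopped replica from looking fully caught up on a dashboard.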
 
Blocking: 810965

Comment 2 by jkop@chromium.org, Mar 12 2018

Owner: jkop@chromium.org
Status: Assigned (was: Untriaged)
Labels: -Pri-2 Pri-1
Bumping to P1 since this is an important correctness guarantee during the rollout of the shard change.

Comment 4 by jkop@chromium.org, Apr 18 2018

Status: Started (was: Assigned)
A cronjob to collect this is already in place, but it's broken. It should be working soon.

Comment 5 by bugdroid1@chromium.org, Apr 18 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/cddbfe99db3e0a4e2e26a708161d1c88dffb10ba

commit cddbfe99db3e0a4e2e26a708161d1c88dffb10ba
Author: Jacob Kopczynski <jkop@google.com>
Date: Wed Apr 18 20:11:16 2018


Comment 6 by bugdroid1@chromium.org, Apr 20 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/35fa24c0b35b403ed0a30d36774432672c2a8981

commit 35fa24c0b35b403ed0a30d36774432672c2a8981
Author: Jacob Kopczynski <jkop@google.com>
Date: Fri Apr 20 04:59:06 2018

autotest: change db metric to match other names

Also some whitespace cleanup.

BUG= chromium:810966 
TEST=None

Change-Id: I35e02da58ee5dae7834c589433c8db5a8b08228c
Reviewed-on: https://chromium-review.googlesource.com/1016169
Commit-Ready: Jacob Kopczynski <jkop@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>
Reviewed-by: Dan Shi <dshi@google.com>

[modify] https://crrev.com/35fa24c0b35b403ed0a30d36774432672c2a8981/site_utils/stats/mysql_stats.py
[modify] https://crrev.com/35fa24c0b35b403ed0a30d36774432672c2a8981/site_utils/check_slave_db_delay.py

I'm seeing this logging on the cautotest master server. (Is this where we intend to run this daemon from?)

Passwords redacted below, though they are written in plaintext in the log.


chromeos-test@cros-full-0036:/var/log$ less check_slave_db.log
<snip>
ERROR 2005 (HY000): Unknown MySQL server host 'undef' (0)
05/02 17:17:03.092 DEBUG|check_slave_db_del:0054| Seconds_Behind_Master of server cros-bighd-0003.mtv.corp.google.com is 0.
05/02 17:17:03.187 DEBUG|check_slave_db_del:0054| Seconds_Behind_Master of server cros-bighd-0002.mtv.corp.google.com is 0.
05/02 17:17:03.188 INFO |check_slave_db_del:0110| Finished checking.
05/02 18:17:02.572 INFO |check_slave_db_del:0097| Start checking Seconds_Behind_Master of slave databases
05/02 18:17:02.986 ERROR|check_slave_db_del:0063| Failed to get slave status of server undef.
Traceback (most recent call last):
  File "/usr/local/autotest/site_utils/check_slave_db_delay.py", line 47, in check_delay
    result = utils.run_sql_cmd(server, user, password, SLAVE_STATUS_CMD)
  File "/usr/local/autotest/client/bin/utils.py", line 2421, in run_sql_cmd
    return utils.run(cmd, verbose=False).stdout
  File "/usr/local/autotest/client/common_lib/utils.py", line 748, in run
    "Command returned non-zero exit status")
CmdError: Command <mysql -ucros-infra-admin -p<redacted> --host undef  -e "show slave status\G"> failed, rc=1, Command returned non-zero exit status
* Command: 
    mysql -ucros-infra-admin -p<redacted> --host undef  -e "show slave
    status\G"
Exit status: 1
Duration: 0.408904075623

stderr:
ERROR 2005 (HY000): Unknown MySQL server host 'undef' (0)
05/02 18:17:03.377 ERROR|check_slave_db_del:0063| Failed to get slave status of server undef.
Traceback (most recent call last):
  File "/usr/local/autotest/site_utils/check_slave_db_delay.py", line 47, in check_delay
    result = utils.run_sql_cmd(server, user, password, SLAVE_STATUS_CMD)
  File "/usr/local/autotest/client/bin/utils.py", line 2421, in run_sql_cmd
    return utils.run(cmd, verbose=False).stdout
  File "/usr/local/autotest/client/common_lib/utils.py", line 748, in run
    "Command returned non-zero exit status")
CmdError: Command <mysql -ucros-infra-admin -p<redacted> --host undef  -e "show slave status\G"> failed, rc=1, Command returned non-zero exit status
* Command: 
    mysql -ucros-infra-admin -p<redacted> --host undef  -e "show slave
    status\G"
Exit status: 1
Duration: 0.382756948471

stderr:
ERROR 2005 (HY000): Unknown MySQL server host 'undef' (0)
05/02 18:17:03.614 DEBUG|check_slave_db_del:0054| Seconds_Behind_Master of server cros-bighd-0003.mtv.corp.google.com is 0.
05/02 18:17:03.714 DEBUG|check_slave_db_del:0054| Seconds_Behind_Master of server cros-bighd-0002.mtv.corp.google.com is 0.
05/02 18:17:03.714 INFO |check_slave_db_del:0110| Finished checking.


I'm also unsure whether this is tangentially related to Issue 839028; I discovered it while poking around because of that issue.
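
The log above shows two separate problems: a placeholder host (`undef`) being queried, and the password appearing in the logged command. The thread does not say where `undef` comes from, and the actual fix is in crrev.com/i/619496, which is not shown here. As a sketch only, with hypothetical helper names and an assumed placeholder list, both problems could be guarded against like this:

```python
import re

# Assumed placeholder values; "undef" is the one seen in the log above.
PLACEHOLDER_HOSTS = ("", "undef", "none")

# Matches the short-form mysql password argument, e.g. " -psecret".
# The long form (--password=...) is not handled in this sketch.
_PASSWORD_ARG = re.compile(r"(\s-p)\S+")


def usable_hosts(hosts, placeholders=PLACEHOLDER_HOSTS):
    """Return the hosts worth querying, dropping unset or placeholder entries."""
    cleaned = []
    for host in hosts:
        name = (host or "").strip()
        if name.lower() in placeholders:
            continue
        cleaned.append(name)
    return cleaned


def redact_password(command):
    """Return the command line with the -p<password> argument masked."""
    return _PASSWORD_ARG.sub(r"\1<redacted>", command)


if __name__ == "__main__":
    print(usable_hosts(["undef", "cros-bighd-0003.mtv.corp.google.com", None]))
    print(redact_password('mysql -uadmin -psecret --host h -e "show slave status\\G"'))
```

Filtering before the query avoids the repeated ERROR 2005 tracebacks, and redacting before logging keeps credentials out of check_slave_db.log.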
Cc: dgarr...@chromium.org

Comment 9 by jkop@chromium.org, May 3 2018

Yes, that's related. I noticed it today while checking in on why the metric still isn't emitted. crrev.com/i/619496 has the fix.

Comment 10 by jkop@chromium.org, May 3 2018

As for it being the correct location: It's where it was put a year or so ago when the script was first written. Moving it elsewhere (sentinel?) would be a trivial puppet change.
Ok, just wasn't sure if it was designed to run on the slave itself. Sounds like no.

Comment 12 by jkop@chromium.org, May 4 2018

Issue 682489 has been merged into this issue.

Comment 13 by bugdroid1@chromium.org, May 9 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/4300f252c66d2e1b08baffdf08fc3399cd90c87f

commit 4300f252c66d2e1b08baffdf08fc3399cd90c87f
Author: Jacob Kopczynski <jkop@google.com>
Date: Wed May 09 21:40:21 2018

autotest: ts_mon to fix slave delay metrics

Metrics calls have existed for a year or so, but without a ts_mon
invocation in the calling script they can't be emitted. Add one.

BUG=chromium:682489
BUG= chromium:810966 
TEST=tried it briefly on live server

Change-Id: I1af80ec71fa41faa7556dba313ab7f85fcbb1339
Reviewed-on: https://chromium-review.googlesource.com/1043060
Commit-Ready: Jacob Kopczynski <jkop@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Allen Li <ayatane@chromium.org>

[modify] https://crrev.com/4300f252c66d2e1b08baffdf08fc3399cd90c87f/site_utils/check_slave_db_delay.py


Comment 14 by bugdroid1@chromium.org, May 18 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/2049b853d0ccf2e2e549fcf83785bba01cb3cd7f

commit 2049b853d0ccf2e2e549fcf83785bba01cb3cd7f
Author: Jacob Kopczynski <jkop@google.com>
Date: Fri May 18 02:34:56 2018

autotest: Fix ts_mon invocation for slave delay

Metric was not being sent properly.

BUG= chromium:810966 
TEST=Very briefly tested in prod.

Change-Id: I4040b4a16e5a89c5323d0b22ed42ac9667d15750
Reviewed-on: https://chromium-review.googlesource.com/1058075
Commit-Ready: Jacob Kopczynski <jkop@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/2049b853d0ccf2e2e549fcf83785bba01cb3cd7f/site_utils/check_slave_db_delay.py


Comment 15 by bugdroid1@chromium.org, May 23 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/1a60d5c8cc8c71260589bc728f7f2e4d9dbbc5cf

commit 1a60d5c8cc8c71260589bc728f7f2e4d9dbbc5cf
Author: Jacob Kopczynski <jkop@google.com>
Date: Wed May 23 01:45:49 2018

autotest: Convert slave delay to float

We care about the delay at a specific time, not the cumulative delay
over the course of a period.

BUG= chromium:810966 
TEST=None

Change-Id: I11f9545df698d31e321545104e639ca7d1e3ec23
Reviewed-on: https://chromium-review.googlesource.com/1066927
Commit-Ready: Jacob Kopczynski <jkop@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Jacob Kopczynski <jkop@chromium.org>

[modify] https://crrev.com/1a60d5c8cc8c71260589bc728f7f2e4d9dbbc5cf/site_utils/check_slave_db_delay.py
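
To illustrate the distinction the commit message above draws, here is a small sketch contrasting a cumulative metric with a point-in-time (gauge) metric. Both classes are hypothetical stand-ins written for this example; they are not the ts_mon API and not the code changed in this commit.

```python
class CumulativeMetric:
    """Adds up every recorded value; suited to totals over a period."""

    def __init__(self):
        self.total = 0.0

    def record(self, value):
        self.total += value

    def read(self):
        return self.total


class GaugeMetric:
    """Keeps only the latest value; suited to 'how far behind right now'."""

    def __init__(self):
        self.value = None

    def record(self, value):
        self.value = float(value)

    def read(self):
        return self.value


if __name__ == "__main__":
    # Hypothetical delay samples, in seconds, from three consecutive checks.
    samples = [0, 30, 5]
    cumulative, gauge = CumulativeMetric(), GaugeMetric()
    for delay in samples:
        cumulative.record(delay)
        gauge.record(delay)
    print(cumulative.read(), gauge.read())
```

With the cumulative form, a replica that has since caught up still shows the sum of its past delays, which is why a point-in-time value is the better fit for replication lag.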

Cc: cra...@chromium.org
This would also fulfill a postmortem follow-up for our recent outage by adding liveness metrics to our slave. To fully satisfy that follow-up, we should also add an alert.

Comment 17 by jkop@chromium.org, May 29 2018

Labels: Chase-Pending
Metric is functioning, alert still pending. 
Please link to metric.

What I see is http://shortn/_EhCv3Ot3lR, which suggests that we should up the cronjob frequency (getting an update once per hour might not be enough).
Labels: cros-infra-pm-2018-05-21

Comment 20 by jkop@chromium.org, Jun 4 2018

Labels: -Chase-Pending

Comment 21 by bugdroid1@chromium.org, Jun 4 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/478ee2c80286be985b6c907d5fd6e4035e69e62a

commit 478ee2c80286be985b6c907d5fd6e4035e69e62a
Author: Jacob Kopczynski <jkop@google.com>
Date: Mon Jun 04 21:47:06 2018


Comment 22 by bugdroid1@chromium.org, Jun 8 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/be1dd03a8dcf22ed6d2c8c77da7eaa5462b5c0cf

commit be1dd03a8dcf22ed6d2c8c77da7eaa5462b5c0cf
Author: Jacob Kopczynski <jkop@google.com>
Date: Fri Jun 08 18:54:14 2018

Status: Fixed (was: Started)
Fixed in cl/202019979
