New issue
Advanced search Search tips
Starred by 1 user

Issue metadata

Status: Started
Owner:
EstDays: ----

Blocked on:
issue chromium:901524

Blocking:
issue 3989


Participants' hotlists:
Monoperf-Q3


Sign in to add a comment
link

Issue 4188: Look into abnormally high metrics at p90, p95, p99

Reported by jeffcarp@chromium.org, Aug 23 Project Member

Issue description

According to GA our Q3 average issue update latency is 1,376.44 seconds. I'm pretty sure this is a bug, otherwise there would be more angry mobs.

Time to Create Issue is also a bit high at 12.18 seconds.
 

Comment 1 by jeffcarp@chromium.org, Aug 28

Project Member
Description: Show this description

Comment 2 by jeffcarp@chromium.org, Aug 31

Project Member
My first guess for the issue update problem is that this is happening:
1. A user submits the issue update form and navigates away before the next page loads
2. The same user (eventually) visits another issue detail page, causing the logEnd event to be recorded.

One way to solve this is to throw out values that are obviously too high, e.g. 1,376.44 seconds.

Comment 3 by jeffcarp@chromium.org, Aug 31

Project Member
Status: Started (was: Accepted)

Comment 4 by jeffcarp@chromium.org, Sep 7

Project Member
Summary: Look into abnormally high Issue Create and Update metrics in GA (was: Look into abnormally high Issue Update metrics in GA)

Comment 5 by bugdroid1@chromium.org, Sep 12

Project Member
The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/b81c9033cdd82ca45d46dd4aefc1792f83f0a8c8

commit b81c9033cdd82ca45d46dd4aefc1792f83f0a8c8
Author: Jeff Carpenter <jeffcarp@chromium.org>
Date: Wed Sep 12 23:14:53 2018

[Monorail] Add maximum threshold for ClientLogger

To guard against outlier data points caused by unexecuted JS
getting into our distribution.

Here's the scenario this prevents:
If a user submits the issue update form, this begins a timer. If they
navigate away from the form submit before the next detail page can load,
this timer never gets stopped. Then next time they visit any issue
detail page, ClientLogger sees that there's an in-progress timer
and records that value, which is huge and skews our metrics. By limiting
these values to 2*GAE timeouts (=2*60 seconds), we can guard against
false outliers making their way into our data.

Bug: monorail:4188
Change-Id: I8e3f6e3b6777ad48af336457c0fa074c93518d50
Reviewed-on: https://chromium-review.googlesource.com/1200152
Reviewed-by: Sean McCullough <seanmccullough@chromium.org>
Commit-Queue: Jeff Carpenter <jeffcarp@chromium.org>
Cr-Commit-Position: refs/heads/master@{#17552}
[modify] https://crrev.com/b81c9033cdd82ca45d46dd4aefc1792f83f0a8c8/appengine/monorail/static/js/framework/ga.js
[modify] https://crrev.com/b81c9033cdd82ca45d46dd4aefc1792f83f0a8c8/appengine/monorail/templates/tracker/issue-detail-page.ezt

Comment 6 by bugdroid1@chromium.org, Sep 20

Project Member
The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/e36b88b7cfe8e3303e4749638b53e33f498ac5c9

commit e36b88b7cfe8e3303e4749638b53e33f498ac5c9
Author: Jeff Carpenter <jeffcarp@chromium.org>
Date: Thu Sep 20 17:54:55 2018

[Monorail] ClientLogger: also apply 120s timeout to new-issue

Bug: monorail:4188
Change-Id: Id8379fc471cedcd25552f6c45e6a4973c5ee1fe7
Reviewed-on: https://chromium-review.googlesource.com/1235147
Auto-Submit: Jeff Carpenter <jeffcarp@chromium.org>
Reviewed-by: Jason Robbins <jrobbins@chromium.org>
Commit-Queue: Jason Robbins <jrobbins@chromium.org>
Cr-Commit-Position: refs/heads/master@{#17761}
[modify] https://crrev.com/e36b88b7cfe8e3303e4749638b53e33f498ac5c9/appengine/monorail/templates/tracker/issue-detail-page.ezt

Comment 7 by jeffcarp@chromium.org, Sep 20

Project Member
Status: FixPending (was: Started)

Comment 8 by jeffcarp@chromium.org, Sep 21

Project Member
Status: Started (was: FixPending)
Unfortunately I don't think the 120s timeout is working, today after the deploy I still see values of 37,329.13s and 7,321.12s alongside values of 2.23s and 2.34s.

Such an order of magnitude difference makes me think it's not a mix-up of seconds vs. milliseconds. It looks like the max-120s enforcing is definitely not happening.

Comment 9 by bugdroid1@chromium.org, Oct 2

Project Member
The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/25d2e8aeef92598cc0275d7e9b58dc73f712b253

commit 25d2e8aeef92598cc0275d7e9b58dc73f712b253
Author: Jeff Carpenter <jeffcarp@chromium.org>
Date: Tue Oct 02 22:33:56 2018

[Monorail] ClientLogger: add max threshold to labeled events too

I missed this when implementing the max threshold.

Bug: monorail:4188
Change-Id: I0c7cb98833c80f8aa12f5dfe1aaa59434eb2606e
Reviewed-on: https://chromium-review.googlesource.com/c/1257266
Reviewed-by: Sean McCullough <seanmccullough@chromium.org>
Commit-Queue: Jeff Carpenter <jeffcarp@chromium.org>
Cr-Commit-Position: refs/heads/master@{#18012}
[modify] https://crrev.com/25d2e8aeef92598cc0275d7e9b58dc73f712b253/appengine/monorail/static/jsm/client-logger.mjs

Comment 10 by jeffcarp@chromium.org, Oct 17

Project Member
Labels: -Priority-Medium Priority-High
Looking at the metrics now coming in from ts_mon, it looks like this is also true for those, so I was mistaken and the outlier data is likely a result of the way the metric is coded in JS. Bumping up priority.

Comment 11 by jeffcarp@google.com, Oct 20

Looks like this is also happening for autocomplete now.

Comment 12 by bugdroid1@chromium.org, Nov 1

Project Member
The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/1be84298810a20aabefa50c9d8e1243399ea38b4

commit 1be84298810a20aabefa50c9d8e1243399ea38b4
Author: Jeff Carpenter <jeffcarp@chromium.org>
Date: Thu Nov 01 20:55:56 2018

[Monorail] Monitoring: make ts_mon.js respect max threshold

Bug: monorail:4188
Change-Id: I97bec1516bf26f96a1c1a28f47b61058bc650b89
Reviewed-on: https://chromium-review.googlesource.com/c/1312397
Commit-Queue: Jeff Carpenter <jeffcarp@chromium.org>
Reviewed-by: Edward Lesmes <ehmaldonado@chromium.org>
Cr-Commit-Position: refs/heads/master@{#18740}
[modify] https://crrev.com/1be84298810a20aabefa50c9d8e1243399ea38b4/appengine/monorail/static/js/monitoring/client-logger.js
[modify] https://crrev.com/1be84298810a20aabefa50c9d8e1243399ea38b4/appengine/monorail/elements/test/client-logger_test.html

Comment 13 by jeffcarp@chromium.org, Nov 6

Project Member
Capping this at 120s seems to help at p50 and p90 but we still see some of this at p95 and p99. An idea for defeating this in a more solid way:

- On every issue create/update form submit, generate a UUID and attach it as a URL param to the form POST. Also store it in sessionStorage.
- Preserve the URL param through POST and redirect
- On the front-end, only accept the timing information if the URL param matches the UUID stored in sessionStorage.

Comment 14 by jrobbins@chromium.org, Nov 8

Project Member
Yeah, I think passing along a UUID to group requests into user-centric operations is a good idea.  However, I'm surprised that discarding measurements over 120s didn't work to approximate the same thing.  Furthermore, autocomplete is not a sequence of POSTs and GETs, it is just API requests in parallel.

Your comment after sitting with Emil was insightful.  When the user is dealing with many tabs (all monorail or maybe a mix of tabs from other sites) there might be long delays in queueing XHRs or executing JS on a non-active tab.

There's also the possibility that uploading attachments takes a lot of time in some of these cases.

Comment 15 by jeffcarp@chromium.org, Nov 19

Project Member
Blockedon: chromium:901524
Adding chromium:901524 as a blocking bug. The idea behind is that currently the volume is metrics is relatively low because we lose a lot from navigations before the 60 second flush timeout. If we can increase the volume of good metrics, we can push the bad ones from p99 up to p99.9 or p99.99.

Comment 16 by jeffcarp@chromium.org, Nov 30

Project Member
Summary: Look into abnormally high Issue Create and Update metrics (was: Look into abnormally high Issue Create and Update metrics in GA)

Comment 17 by jeffcarp@chromium.org, Dec 6

Project Member
It looks like this is still being exhibited at p99 but p90 looks relatively stable now.

Comment 18 by jeffcarp@chromium.org, Dec 19

Project Member
Blocking: 3989

Comment 19 by jeffcarp@chromium.org, Dec 19

Project Member
Summary: Look into abnormally high metrics at p90, p95, p99 (was: Look into abnormally high Issue Create and Update metrics)

Comment 20 by bugdroid1@chromium.org, Jan 2

Project Member
The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/a43818d80e537c25a487fafa394bf8dd2211545f

commit a43818d80e537c25a487fafa394bf8dd2211545f
Author: Jeff Carpenter <jeffcarp@chromium.org>
Date: Wed Jan 02 23:40:14 2019

[Monorail] Add document_visible field to domContentLoaded metric

One hypothesis we have about the metrics inconsistency problem is that it could
be caused by background tabs loading very slowly. This gives us a way to
segment metrics sent from tabs that are currently visible vs. those that are
not.

Note: this will cause a HTTP 400 XHR response on some clients during deployment
due to older clients submitting metrics to servers running updated metric code.

Bug: monorail:4188
Change-Id: Idd4ec914eb58eecae7d8cb03f9b041a364ecc341
Reviewed-on: https://chromium-review.googlesource.com/c/1385680
Reviewed-by: Jason Robbins <jrobbins@chromium.org>
Commit-Queue: Jeff Carpenter <jeffcarp@chromium.org>
Cr-Commit-Position: refs/heads/master@{#19749}
[modify] https://crrev.com/a43818d80e537c25a487fafa394bf8dd2211545f/appengine/monorail/framework/ts_mon_js.py
[modify] https://crrev.com/a43818d80e537c25a487fafa394bf8dd2211545f/appengine/monorail/static/js/monitoring/ts-mon.js

Comment 21 by bugdroid1@chromium.org, Jan 3

Project Member
The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/994cada2bf2e03e39f2a2184cd73029d30c37d35

commit 994cada2bf2e03e39f2a2184cd73029d30c37d35
Author: Jeff Carpenter <jeffcarp@chromium.org>
Date: Thu Jan 03 20:19:46 2019

[Monorail] Monitoring: add maxThresholdMs to page load metrics

Bug: monorail:4188
Change-Id: I21d0e15e07abd76a8dad34ee706fa2f590f1d862
Reviewed-on: https://chromium-review.googlesource.com/c/1366343
Reviewed-by: Edward Lesmes <ehmaldonado@chromium.org>
Commit-Queue: Jeff Carpenter <jeffcarp@chromium.org>
Cr-Commit-Position: refs/heads/master@{#19757}
[modify] https://crrev.com/994cada2bf2e03e39f2a2184cd73029d30c37d35/appengine/monorail/elements/test/ts-mon_test.html
[modify] https://crrev.com/994cada2bf2e03e39f2a2184cd73029d30c37d35/appengine/monorail/static/js/monitoring/ts-mon.js

Comment 22 by bugdroid1@chromium.org, Jan 11

Project Member
The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/42b8778cad9600b4c75a1c7bb04740932753571c

commit 42b8778cad9600b4c75a1c7bb04740932753571c
Author: Jeff Carpenter <jeffcarp@chromium.org>
Date: Fri Jan 11 00:49:30 2019

[Monorail] Monitoring: add document_visible to all FE metrics

Since filtering only metrics recorded while the page is visible helps us
eliminate metric inconsistency dramatically, this CL adds it to all remaining
FE metrics.

Bug: monorail:4188
Change-Id: I8a1fe6e88cec3de7352e4aa6a567e659d3bddafe
Reviewed-on: https://chromium-review.googlesource.com/c/1403890
Commit-Queue: Jeff Carpenter <jeffcarp@chromium.org>
Reviewed-by: Edward Lesmes <ehmaldonado@chromium.org>
Cr-Commit-Position: refs/heads/master@{#19916}
[modify] https://crrev.com/42b8778cad9600b4c75a1c7bb04740932753571c/appengine/monorail/static/js/monitoring/ts-mon_test.html
[modify] https://crrev.com/42b8778cad9600b4c75a1c7bb04740932753571c/appengine/monorail/framework/ts_mon_js.py
[modify] https://crrev.com/42b8778cad9600b4c75a1c7bb04740932753571c/appengine/monorail/static/js/monitoring/ts-mon.js
[modify] https://crrev.com/42b8778cad9600b4c75a1c7bb04740932753571c/appengine/monorail/framework/test/ts_mon_js_test.py

Comment 23 by jeffcarp@chromium.org, Jan 11

Project Member
Update: the document_visible metric field helps reduce inconsistency dramatically. Writing a CL to alert only on visible metrics.

Sign in to add a comment