Regression in Net.HttpTimeToFirstByte on Android Canary
Issue description: We're seeing a pretty big regression in Net.HttpTimeToFirstByte on Android Canary (from ~110 ms to ~145 ms). It's unclear what's going on. We should keep an eye on this and see if it goes back down, or if it appears on other channels. Internal link to histograms: https://uma.googleplex.com/timeline_v2?q=%7B%22day_count%22%3A%2232%22%2C%22end_date%22%3A%222017%2F08%2F27%22%2C%22entries%22%3A%5B%7B%22bucket%22%3A%22%22%2C%22logScale%22%3Afalse%2C%22measure%22%3A%22percentile%22%2C%22percentile%22%3A%2250%22%2C%22showLowVolumeData%22%3Atrue%2C%22zeroBased%22%3Afalse%7D%5D%2C%22filters%22%3A%5B%7B%22fieldId%22%3A%22platform%22%2C%22operator%22%3A%22EQ%22%2C%22selected%22%3A%5B%22A%22%5D%7D%2C%7B%22fieldId%22%3A%22channel%22%2C%22operator%22%3A%22EQ%22%2C%22selected%22%3A%5B%221%22%5D%7D%2C%7B%22fieldId%22%3A%22milestone%22%2C%22operator%22%3A%22GE%22%2C%22selected%22%3A%5B%2261%22%5D%7D%5D%2C%22histograms%22%3A%5B%22Net.HttpTimeToFirstByte%22%5D%2C%22window_size%22%3A3%7D&source=chirp%3ATimeline&email_id=45811
,
Aug 30 2017
+alexilin it looks like https://chromium-review.googlesource.com/c/chromium/src/+/612380 is in the regression range, would you please take a look? Note that PageLoad timing metrics are similarly being (slightly) affected: https://uma.googleplex.com/timeline_v2?sid=fb52c44fb02ed80444c96321f71973a2 in what appears to be the same regression range.
,
Aug 30 2017
Hm, actually looking at the split, I think the regression is between 3194 and 3196 https://uma.googleplex.com/timeline_v2?sid=54daa7a249e61d75cb5e68bb46dc0f7a There is the related CL https://chromium-review.googlesource.com/c/chromium/src/+/628522 which could be a cause but I haven't looked into it.
,
Aug 30 2017
Here's the regression range I think the culprit is in: https://chromium.googlesource.com/chromium/src/+log/62.0.3194.0..62.0.3196.0?pretty=fuller&n=10000
,
Aug 30 2017
Given that this CL also triggers some DCHECKs ( Issue 757458 ), we probably should do a speculative revert and see if the metrics drop.
,
Aug 30 2017
OK I'll go ahead
,
Aug 30 2017
It appears to be dropping back to normal on its own: https://uma.googleplex.com/timeline_v2?sid=6797a8ea8a5c1b0ef8f5e3899139334f
,
Aug 30 2017
It looks like 3198 is showing OK metrics. https://uma.googleplex.com/timeline_v2?sid=0ec6690ef4f3cc0c48d33d861aedddcb The median looks to be returning to its previous level, but I don't see anything in the range between 3196 and 3198 that would explain it, such as a revert of something that landed in 3196. It's possible there was an optimization in 3198 that skewed these metrics. Maybe r497288? +zhongyi
,
Aug 30 2017
BTW in 3196 it looks like all the weight came from the 0ms bucket. It has ~5x fewer counts than in other versions.
,
Aug 31 2017
The change I landed in r497288 was to delay TCP if we are at startup and QUIC requires confirmation, which would lead to more usage of QUIC. I could see this change affecting Android more, as Android restarts more often. If QUIC improves this metric significantly, it's possible that this change led to the drop. However, I didn't see a huge difference between QUIC and non-QUIC in finch experiments: https://uma.googleplex.com/p/chrome/variations/?sid=ca3bf47717059683ee8445e08bddfbfe. I remember Buck mentioning that hanging GETs affect metrics like HttpJob.Totaltime*; could this metric also be affected by that? +ckrasic
,
Aug 31 2017
The first CL to land, https://chromium-review.googlesource.com/c/chromium/src/+/612380, could affect the metrics, but it landed in 3194, which looks OK, so it's unlikely to be the cause. The CL https://chromium-review.googlesource.com/c/chromium/src/+/628522 is also unlikely to be the cause of the regression: the metrics dropped back in 3198, before the revert happened in 3201, and the change isn't really relevant to the problem. I'm planning to reland this CL with a fix for the DCHECK issue.
,
Aug 31 2017
#11, SGTM. I'm going to keep an eye on the metrics to make sure the regression really is going down in 3198. Still, very mysterious :)
,
Sep 1 2017
This spike is also within historical norms: https://uma.googleplex.com/timeline_v2?sid=72477afab914a05ef29d15d69eaadd8d I think it'd be worth having links to 365 day (all versions) charts in chirp reports, as I typically find that most chirp reports don't fall outside of historical norms.
,
Sep 1 2017
It doesn't really look within historical norms to me, especially if you look at 1 day aggregation. The previous highest median was ~m58 at 130ms. The peak of this spike is 158ms. That's over a 20% increase.
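For reference, the percentages quoted in this thread can be checked with a small sketch like the one below. The helper name `relative_increase` is hypothetical and used only for illustration; the inputs are the approximate medians mentioned in the comments (130 ms vs. 158 ms, and ~110 ms vs. ~145 ms from the issue description), not values pulled from UMA.

```python
def relative_increase(baseline_ms: float, observed_ms: float) -> float:
    """Return the fractional increase of observed_ms over baseline_ms."""
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    return (observed_ms - baseline_ms) / baseline_ms


# Approximate medians quoted in this thread, in milliseconds.
spike_vs_previous_peak = relative_increase(130, 158)  # previous high vs. spike peak
report_vs_baseline = relative_increase(110, 145)      # figures from the issue description

print(f"Spike vs. previous highest median: {spike_vs_previous_peak:.1%}")
print(f"Reported regression vs. baseline:  {report_vs_baseline:.1%}")
```

With these rough inputs, the spike is a bit over 20% above the previous highest median, which matches the figure given above.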
,
Sep 3 2017
Ah, I had somehow included canary+dev in my link. I agree, on canary it's abnormal: https://uma.googleplex.com/timeline_v2?sid=2d6e13a994dd15daea31388168ec1466
,
Sep 21 2017
DNS.AttemptSuccessDuration seems to have gone back down: https://uma.googleplex.com/timeline_v2?sid=f2bcafe4d096a5fb6fcbec3f058f6eb1 Net.HttpTimeToFirstByte has certainly recovered, but it's not clear whether it has fully returned to historical norms yet. I suspect it has, but maybe we should wait for a bit more data to be sure. https://uma.googleplex.com/timeline_v2?sid=2b9e65e29d344fbe7720a3d1ae22cd23
,
Sep 25 2017
Both have now recovered, thankfully.
Comment 1 by mmenke@chromium.org
, Aug 30 2017