platform_BootPerfServer/seconds_power_on_to_login regression on Veyron boards
Issue description

We are seeing a 0.2-4 second boot time regression on Veyron boards. This is a different issue than Issue 609235. This regression was introduced in M51 immediately after the M50 branch cut.

Diff: https://crosland.corp.google.com/log/7978.0.0..8061.0.0

Logs:
http://cautotest/tko/retrieve_logs.cgi?job=/results/56621416-haddowk/chromeos4-row4-rack5-host5
http://cautotest/tko/retrieve_logs.cgi?job=/results/56621443-haddowk/chromeos4-row6-rack10-host1
http://cautotest/tko/retrieve_logs.cgi?job=/results/56613135-haddowk/chromeos4-row4-rack6-host3

Regression link: https://cros-goldeneye.corp.google.com/console/listCrosbolt?baselineMilestoneChannel=50%20%3A%20STABLE&comparisonMilestoneChannel=51%20%3A%20BETA&comparisonLimitResults=5&analysisType=BUILD_TO_BUILD&selectedDevices&graphSKU=mighty_rk3288_2Gb&graphTest=platform_BootPerfServer%2Fseconds_power_on_to_login.seconds_power_on_to_login
May 16 2016
Doug knows everything about Veyron, so we could start with him.
May 16 2016
Hrm, lots of 3.14 kernel changes in that range. Luckily I think we can narrow it down, because many of these were merged back to R50. Anything merged back to R50 _can't_ be responsible, since the regression doesn't show on R50.

=== Ones I'm pretty sure we can eliminate:

f234ce9eca1b CHROMIUM: rockchip: dmc: Avoid sleeping while atomic at probe time
-> Too simple and too small of a change.

1ed7bb2ee63b PRIVATE: mac80211: Fix accounting for HT packets in rx_stats histogram
56d6c4e03530 PRIVATE: mac80211: Fix buffer size for rx_stats histogram
4348de2781f2 PRIVATE: ath10k: Ignore survey update during on going scan
8a9dc53f4560 PRIVATE: mesh: export mesh routing stats to debugfs
6969c23cbefa PRIVATE: mesh: populate hop_count in mpath structure
c2f21275c0a1 CHROMIUM: config: ipq806x: enable 80211 debug fs statistics
-> This code isn't compiled for veyron.

0e283781e1ca CHROMIUM: alt-syscall: Whitelist times() for Android
-> Not used in boot, I think.

2b05248f5290 CHROMIUM: clk: rockchip: Add low jitter 356 MHz for better 118.68 HDMI
576c24b359e3 CHROMIUM: clk: rockchip: Add low jitter 273 MHz for better 68.25 HDMI
217ed2975708 CHROMIUM: drm: rockchip/dw_hdmi: Make a slightly better 25.175 MHz
59672dfdba79 CHROMIUM: clk: rockchip: Add low jitter 428 MHz for better 25.175 HDMI
-> No way.

=== These seem a bit unlikely:

bf6dd2b21925 UPSTREAM: mm: add !pte_present() check on existing hugetlb_entry callbacks
-> I don't think we run this code, but not sure.

47f09f118034 UPSTREAM: tracing/filter: Do not allow infix to exceed end of string
-> Seems unlikely.

=== Possible, and would only affect veyron, though it still seems odd it would have that big of an impact. The previous boot regression was USB related, for whatever reason:

e7f01df20ae8 UPSTREAM: usb: dwc2: host: If using uframe scheduler, end splits better
832b4a203d38 UPSTREAM: usb: dwc2: host: Totally redo the microframe scheduler
b618b69491bc UPSTREAM: usb: dwc2: host: Properly set even/odd frame
9eed6b97e08c UPSTREAM: usb: dwc2: host: Add dwc2_hcd_get_future_frame_number() call
196460f8c3d8 UPSTREAM: usb: dwc2: host: Manage frame nums better in scheduler
1a20801d0ee7 UPSTREAM: usb: dwc2: host: Add scheduler logging for missed SOFs
84fbde5bd200 UPSTREAM: usb: dwc2: host: Split code out to make dwc2_do_reserve()
cf6ad1d2afd1 UPSTREAM: usb: dwc2: host: Reorder things in hcd_queue.c
d95b830a3eed UPSTREAM: usb: dwc2: host: Rename some fields in struct dwc2_qh
f992e2f4c0af UPSTREAM: usb: dwc2: host: Use periodic interrupt even with DMA
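For what it's worth, the merged-back-to-R50 elimination above can be checked mechanically rather than by eye. A minimal sketch, assuming a chromeos kernel checkout; the R50 release branch name below is a hypothetical example, not confirmed from this bug:

import subprocess

# Hypothetical ref name for the R50 release branch in a kernel checkout.
R50_BRANCH = "cros/release-R50-7978.B"

def merged_into_r50(sha):
    # "git merge-base --is-ancestor A B" exits 0 iff A is an ancestor of B.
    return subprocess.call(
        ["git", "merge-base", "--is-ancestor", sha, R50_BRANCH]) == 0

# The dwc2 suspects from the list above; anything already in R50 is exonerated.
for sha in ("e7f01df20ae8", "832b4a203d38", "f992e2f4c0af"):
    print(sha, "merged back to R50" if merged_into_r50(sha) else "R51 only")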
May 16 2016
Note: probably a chunk of work to:
* Reproduce the problems.
* Bisect.
* Dig apart whichever USB change caused the problem and figure out why it caused this regression.

I'm already spending more time on veyron than I have available. If this can wait a while, I can put it on my background list. If not, hopefully we can find someone else? I'm happy to walk through any code with that person to help them.
May 20 2016
FYI ... the first two devices, chromeos4-row4-rack5-host5 & chromeos4-row6-rack10-host1, had stale FW (see crosbug.com/534891 for details), which could possibly influence boot time; I can't confirm the 3rd. I'd caution against spending too much time on debugging this until the FW updates can be resolved.
May 23 2016
<putting on my data jockey hat>

OK, I did a whole bunch of analysis and wrote it all up, but then I realized that all I need to do is post a link to:

https://chromeperf.appspot.com/report?sid=b1b6b19a13fac09c6d35a55ec35524cbcda7bb898bf5dd6c8bed0a050ac76efe&start_rev=26240000784400000&end_rev=27390040833700100

...if you can't see the graph, make sure you're logged in with your @google account.

Explaining the graph:
* we have no regression in terms of "seconds_kernel_to_chrome_exec", only in terms of "seconds_kernel_to_chrome_main"
* veyron_jerry, peach_pit, peach_pi, and nyan_kitty (ARM) all regress at the same time.
* auron_yuna, peppy, and samus (x86) don't regress.
* the regression is already fixed on R52.

So this problem was at a chrome and/or graphics level, and someone already found and fixed it. Since the problem affects nyan, peach, and veyron, it's likely _not_ a kernel-3.14 problem, nor even a Mali GPU problem.

Given how late we are in R51, I would be OK with saying that we accept the regression in R51 and move forward. However, if someone happens to know what the fix is and thinks a backport is appropriate, I think we could do it (with PM approval, of course). Adding a bunch of graphics guys, since they would be most likely to be familiar with this.

Removing ReleaseBlock tag. Possibly we could just mark it as R52-Fixed if folks want.
May 23 2016
+Ilja who knows how to read these graphs
May 23 2016
This bug is about the tiny little bump on the left of the graph below. The bump in #6 is the large bump on the right. Overall we are still about half a second slower than we used to be.

https://cros-goldeneye.corp.google.com/console/listCrosbolt?baselineMilestoneChannel=50%20%3A%20STABLE&comparisonMilestoneChannel=52%20%3A%20DEV&analysisType=BUILD_TO_BUILD&selectedDevices&graphSKU=mighty_rk3288_2Gb&graphTest=platform_BootPerfServer%2Fseconds_power_on_to_login.seconds_power_on_to_login
May 24 2016
@8: Dangit, I knew that when I started looking at this bug, but then lost track of it when I saw that huge spike. Sorry for that... :(

Looking at chromeperf graphs is still interesting, though, and still points to a problem that's not in the BIOS or kernel:

https://chromeperf.appspot.com/report?sid=d4fa73159bd9e8b4da269c85d3c4e1e84220059823142f0b16441a6df73d6070&start_rev=26180000782100000&end_rev=27150000823600000

Looking at the above graphs:
* In "seconds_kernel_to_chrome_exec" there is no regression in this time period on veyron.
* In "seconds_kernel_to_login" there IS a regression on veyron. AKA: problems happen after chrome starts.
* Oddly, veyron_minnie shows no regression in this time period. Speedy shows it, though, so it's probably not Broadcom vs. Marvell.
* peach_pi and peach_pit show about a .2 - .3 second regression during this period. They share mali_ddk w/ veyron.
* x86 bots are a bit noisy, but possibly peppy shows a .1 second regression? Unrelated???

I can't explain the minnie difference, but since we have nearly the same regression at the same time on peach devices, and since this is in the blame list:

src/partner_private/mali-ddk 5c31f911 250987 51084 Fri Mar 11 10:02:21 2016 salva.climent@arm.com MALI: r8p0-03dev0 TX011 TX004X06X DDK

I'll make a guess that this is graphics related. What do others think?

In any case, since we think this _isn't_ fixed, adding the blocker back in.
May 25 2016
I looked at this and wasted 40 minutes; I don't know more than before. There is a big gap in the data, and I can't narrow it any more than https://crosland.corp.google.com/log/7978.0.0..8061.0.0. And yes, it is possible that the mali drop regressed more than we know. As for impact, I don't see this as a blocker.
May 25 2016
@10: Right, I think without the data someone would need to run tests manually to get full data.

Matt: is there any way we can get the missing data filled in?

---

OK, but by trawling the existing logs, I can get at least better data. See below.

tl;dr:

A) My assertion that the regression was between "seconds_kernel_to_chrome_exec" and "seconds_kernel_to_chrome_main" is wrong. It's actually between "seconds_kernel_to_chrome_main" and "seconds_kernel_to_signin_start". Should be obvious from this chart:
https://chromeperf.appspot.com/report?sid=e6ce0c4e2534f61906a2923bc4f8a88a8844cafc3a16dd25245467e1b58cb7d9&start_rev=26180000782100000&end_rev=27150000823600000

B) The regression range can be narrowed down to:
<https://crosland.corp.google.com/log/8038.0.0..8061.0.0>
<https://chromium.googlesource.com/chromium/src/+log/50.0.2657.0..51.0.2672.0?pretty=fuller&n=10000>
This has one bump of chrome (obviously this is still a lot of changes) and a week of root filesystem changes. It _does_ include the Mali bump.

---

FULL DETAILS

I seem to see some data here:
https://wmatrix.googleplex.com/platform/unfiltered?hide_missing=True&tests=platform_BootPerfServer&days_back=365&releases=51&platforms=veyron_jerry

It's not complete, but more than we have in the graphs. For instance, the Mali uprev happened at <https://crosland.corp.google.com/log/8053.0.0..8054.0.0>. We don't have those versions, but we do have data for 8038.0.0 and 8064.0.0.

I can follow through and find R51-8038.0.0:
https://wmatrix.googleplex.com/testrun/unfiltered?test_ids=275640386,275799715

Start with 275640386 (per build, 10 iterations). Go to job logs:
http://ubercautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/56342906-chromeos-test

Eventually navigate to the tarball:
https://pantheon.corp.google.com/m/cloudstorage/b/chromeos-autotest-results/o/56342906-chromeos-test/chromeos4-row4-rack5-host15/platform_BootPerfServer.tgz

I can look at /platform_BootPerfServer/results/keyval (missing the .json file, which is maybe why no graphs???):

>>> s = open("/b/tip/tmp/keyval", "r").read()
>>> l = [float(x.partition('=')[-1]) for x in s.splitlines() if "seconds_kernel_to_chrome_exec" in x]
>>> sum(l)/len(l)
3.006
>>> l = [float(x.partition('=')[-1]) for x in s.splitlines() if "seconds_kernel_to_chrome_main" in x]
>>> sum(l)/len(l)
3.2409999999999997
>>> l = [float(x.partition('=')[-1]) for x in s.splitlines() if "seconds_kernel_to_login" in x]
>>> sum(l)/len(l)
5.7299999999999995

We'll look at 275799715 too. Tarball:
https://pantheon.corp.google.com/m/cloudstorage/b/chromeos-autotest-results/o/56377258-chromeos-test/chromeos4-row4-rack5-host14/platform_BootPerfServer.tgz

>>> s = open("/b/tip/tmp/keyval", "r").read()
>>> l = [float(x.partition('=')[-1]) for x in s.splitlines() if "seconds_kernel_to_chrome_exec" in x]
>>> sum(l)/len(l)
2.95141414141414
>>> l = [float(x.partition('=')[-1]) for x in s.splitlines() if "seconds_kernel_to_chrome_main" in x]
>>> sum(l)/len(l)
3.175454545454545
>>> l = [float(x.partition('=')[-1]) for x in s.splitlines() if "seconds_kernel_to_login" in x]
>>> sum(l)/len(l)
5.662020202020202

---

Now 8064.0.0:
https://wmatrix.googleplex.com/testrun/unfiltered?test_ids=277824486,277118826

Start with 277118826, which didn't abort. Tarball:
https://pantheon.corp.google.com/m/cloudstorage/b/chromeos-autotest-results/o/56756034-chromeos-test/chromeos4-row4-rack5-host21/platform_BootPerfServer.tgz

>>> s = open("/b/tip/tmp/keyval", "r").read()
>>> l = [float(x.partition('=')[-1]) for x in s.splitlines() if "seconds_kernel_to_chrome_exec" in x]
>>> sum(l)/len(l)
3.0379999999999994
>>> l = [float(x.partition('=')[-1]) for x in s.splitlines() if "seconds_kernel_to_chrome_main" in x]
>>> sum(l)/len(l)
3.2230000000000003
>>> l = [float(x.partition('=')[-1]) for x in s.splitlines() if "seconds_kernel_to_login" in x]
>>> sum(l)/len(l)
6.011

Check 277824486. The tarball has 22 items:
https://pantheon.corp.google.com/m/cloudstorage/b/chromeos-autotest-results/o/56840333-chromeos-test/chromeos4-row4-rack5-host14/platform_BootPerfServer.tgz

>>> s = open("/b/tip/tmp/keyval", "r").read()
>>> l = [float(x.partition('=')[-1]) for x in s.splitlines() if "seconds_kernel_to_chrome_exec" in x]
>>> sum(l)/len(l)
2.975454545454546
>>> l = [float(x.partition('=')[-1]) for x in s.splitlines() if "seconds_kernel_to_chrome_main" in x]
>>> sum(l)/len(l)
3.148636363636364
>>> l = [float(x.partition('=')[-1]) for x in s.splitlines() if "seconds_kernel_to_login" in x]
>>> sum(l)/len(l)
5.928181818181819

---

You can see from the above that there _is_ a regression in kernel_to_login of about .3 - .4 seconds, but no regression in seconds_kernel_to_chrome_exec / main.

===

Someone could more fully put the blame on a given CL by trying to reproduce these results with and without that CL.
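For anyone repeating this keyval spelunking, the one-liners above fold into a small helper. A minimal sketch; the local path is hypothetical, and the keyval format is assumed to match the files above (one "<key>=<float>" sample per iteration):

def metric_average(path, metric):
    # Average all samples whose key mentions `metric` in an autotest
    # keyval file, exactly as the one-liners above do.
    values = [float(line.partition('=')[-1])
              for line in open(path).read().splitlines()
              if metric in line]
    return sum(values) / len(values)

# Hypothetical local copy of a keyval file pulled from one of the tarballs.
for metric in ("seconds_kernel_to_chrome_exec",
               "seconds_kernel_to_chrome_main",
               "seconds_kernel_to_login"):
    print(metric, metric_average("/b/tip/tmp/keyval", metric))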
Jun 6 2016
Unfortunately, there's no easy way for us to recover the missing data at the moment. Unless we have a small list of CLs that we can try to bisect this against, I think we may need to accept this as a lost cause. Assigning this back to Doug to determine whether there are any CLs that pop out as likely candidates, and if not, to close the bug as WontFix due to insufficient data. Given that we are still within HW requirements and this regression is relatively small, I'd recommend not putting too much time into debugging.
Jun 6 2016
@13: From above, there is a CL that pops out as a likely candidate:

src/partner_private/mali-ddk 5c31f911 250987 51084 Fri Mar 11 10:02:21 2016 salva.climent@arm.com MALI: r8p0-03dev0 TX011 TX004X06X DDK
Jun 6 2016
Thanks, Doug! @Ilja - do you have cycles to test out perf before and after the mali drop?
Jun 8 2016
Gurchetan, could you please do the experiment? Please run platform_BootPerfServer on veyron with the current mali and with mali reverted to the previous drop, and report the numbers.
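One possible way to script the A/B runs, as a minimal sketch: the board name and DUT address are hypothetical, and pinning mali-drivers to each revision (cros_workon plus a local checkout) is assumed to happen before each pass rather than shown here:

import subprocess

BOARD = "veyron_jerry"   # hypothetical board
DUT = "192.168.1.100"    # hypothetical DUT address

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

# The two mali-drivers revisions being compared (hashes from this bug).
for rev in ("9b307af", "5c31f911"):
    # Assumes mali-drivers has already been pinned to `rev` in the local
    # checkout; then rebuild, deploy, and run the boot perf test.
    run(["emerge-%s" % BOARD, "mali-drivers"])
    run(["cros", "deploy", DUT, "mali-drivers"])
    run(["test_that", "--board", BOARD, DUT, "platform_BootPerfServer"])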
Jun 9 2016
I tried installing Veyron Jerry test images from Goldeneye for 8053.0.0 and 8054.0.0. Unfortunately, Chrome doesn't load on those images, explaining the missing test results.
With a developer image, I emerged and deployed various revisions of the mali-drivers package. I got noisy results. For example, with commit 5c31f91 I got two divergent values on the test run (each test has 10 samples which are averaged).
>>> s = open("/tmp/veyron_jerry/BootPerfServer-5c31f91/results-1-BootPerfServer/platform_BootPerfServer/results/keyval", "r").read()
>>> l = [float(x.partition('=')[-1]) for x in s.splitlines() if "seconds_kernel_to_login" in x]
>>> sum(l)/len(l)
9.540000000000001
>>> s = open("/tmp/veyron_jerry/BootPerfServer-5c31f91_2/results-1-BootPerfServer/platform_BootPerfServer/results/keyval", "r").read()
>>> l = [float(x.partition('=')[-1]) for x in s.splitlines() if "seconds_kernel_to_login" in x]
>>> sum(l)/len(l)
9.757000000000001
>>> s = open("/tmp/veyron_jerry/BootPerfServer-9b307af/results-1-BootPerfServer/platform_BootPerfServer/results/keyval", "r").read()
>>> l = [float(x.partition('=')[-1]) for x in s.splitlines() if "seconds_kernel_to_login" in x]
>>> sum(l)/len(l)
9.498
The 3.5 sec difference between these numbers and the tarballs referenced in #11 is probably due to the difference between test and developer images. Since we can't run the official test images, it's not possible to report accurate numbers.
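Given how noisy these runs are, it may help to report spread alongside the mean before comparing driver revisions; two means .2 seconds apart say little if the per-run scatter is comparable. A minimal sketch (Python 3, reusing the result paths from the runs above):

import statistics

def metric_samples(path, metric):
    # All individual samples for `metric` from an autotest keyval file.
    return [float(line.partition('=')[-1])
            for line in open(path).read().splitlines()
            if metric in line]

for run in ("BootPerfServer-5c31f91", "BootPerfServer-5c31f91_2",
            "BootPerfServer-9b307af"):
    path = ("/tmp/veyron_jerry/%s/results-1-BootPerfServer/"
            "platform_BootPerfServer/results/keyval" % run)
    samples = metric_samples(path, "seconds_kernel_to_login")
    print(run, "mean=%.3f stdev=%.3f"
          % (statistics.mean(samples), statistics.stdev(samples)))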
Jun 9 2016
At this stage I don't see low-hanging fruit here. Either P3 or WontFix.
Jun 10 2016
Thanks for trying, all! Let's consider this a WontFix due to insufficient data to debug in a reasonable time.