New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 609239 link

Starred by 1 user

Issue metadata

Status: WontFix
Owner:
Closed: Jun 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug-Regression



Sign in to add a comment

platform_BootPerfServer/seconds_power_on_to_login regression on Veyron boards

Project Member Reported by rohi...@chromium.org, May 4 2016

Issue description

Owner: snanda@chromium.org
Sameer, whom is the right person to look at this?

Comment 2 by snanda@chromium.org, May 16 2016

Cc: olofj@chromium.org
Owner: diand...@chromium.org
Doug knows everything about Veyron so we could start with him
Hrm, lots of 3.14 kernel changes in that range.

Luckily I think we can narrow down because many of these were merged back to R50.  Anything merged back to R50 _can't_ be responsible since the regression doesn't show on R50.

===

Ones I'm pretty sure we can eliminate:

f234ce9eca1b CHROMIUM: rockchip: dmc: Avoid sleeping while atomic at probe time
-> Too simple and too small of a change.

1ed7bb2ee63b PRIVATE: mac80211: Fix accounting for HT packets in rx_stats histogram
56d6c4e03530 PRIVATE: mac80211: Fix buffer size for rx_stats histogram
4348de2781f2 PRIVATE: ath10k: Ignore survey update during on going scan
8a9dc53f4560 PRIVATE: mesh: export mesh routing stats to debugfs
6969c23cbefa PRIVATE: mesh: populate hop_count in mpath structure
c2f21275c0a1 CHROMIUM: config: ipq806x: enable 80211 debug fs statistics
-> This code isn't compiled for veyron.

0e283781e1ca CHROMIUM: alt-syscall: Whitelist times() for Android
-> Not used in boot I think.

2b05248f5290 CHROMIUM: clk: rockchip: Add low jitter 356 MHz for better 118.68 HDMI
576c24b359e3 CHROMIUM: clk: rockchip: Add low jitter 273 MHz for better 68.25 HDMI
217ed2975708 CHROMIUM: drm: rockchip/dw_hdmi: Make a slightly better 25.175 MHz
59672dfdba79 CHROMIUM: clk: rockchip: Add low jitter 428 MHz for better 25.175 HDMI
-> No way.

===

These seem a bit unlikely:

bf6dd2b21925 UPSTREAM: mm: add !pte_present() check on existing hugetlb_entry callbacks
-> I don't think we run this code, but not sure.

47f09f118034 UPSTREAM: tracing/filter: Do not allow infix to exceed end of string
-> Seems unlikely.

===

Possible and would only affect veyron, though still seems odd it would have that big of an impact.  Previous boot regression was USB related for whatever reason.

e7f01df20ae8 UPSTREAM: usb: dwc2: host: If using uframe scheduler, end splits better
832b4a203d38 UPSTREAM: usb: dwc2: host: Totally redo the microframe scheduler
b618b69491bc UPSTREAM: usb: dwc2: host: Properly set even/odd frame
9eed6b97e08c UPSTREAM: usb: dwc2: host: Add dwc2_hcd_get_future_frame_number() call
196460f8c3d8 UPSTREAM: usb: dwc2: host: Manage frame nums better in scheduler
1a20801d0ee7 UPSTREAM: usb: dwc2: host: Add scheduler logging for missed SOFs
84fbde5bd200 UPSTREAM: usb: dwc2: host: Split code out to make dwc2_do_reserve()
cf6ad1d2afd1 UPSTREAM: usb: dwc2: host: Reorder things in hcd_queue.c
d95b830a3eed UPSTREAM: usb: dwc2: host: Rename some fields in struct dwc2_qh
f992e2f4c0af UPSTREAM: usb: dwc2: host: Use periodic interrupt even with DMA



Note: probably a chunk of work to:
* Reproduce the problems.
* Bisect
* Dig apart whichever USB change caused the problem and figure out why it caused this regression.


I'm already spending more time on veyron than I have available.  If this can wait a while I can put it on my background list.  If not, hopefully we can find someone else?  I'm happy to walk through any code with that person to help them.

Comment 5 by tbroch@chromium.org, May 20 2016

Cc: haddowk@chromium.org
FYI ... first two devices,chromeos4-row4-rack5-host5 & chromeos4-row6-rack10-host1 had stale FW (see crosbug.com/534891 for details) which could possibly influence boot time ... can't confirm the 3rd.

I'd caution spending too much time on debugging this until FW updates can be resolved.



Cc: diand...@chromium.org dbehr@chromium.org tfiga@chromium.org marc...@chromium.org h...@chromium.org
Components: -OS>Kernel OS>Kernel>Graphics
Labels: -ReleaseBlock-Stable
Owner: ----
Status: Available (was: Untriaged)
<putting on my data jockey hat>

--

OK, I did a whole bunch of analysis and wrote it all up, but then I realize is that all I need to do is post a link to:

https://chromeperf.appspot.com/report?sid=b1b6b19a13fac09c6d35a55ec35524cbcda7bb898bf5dd6c8bed0a050ac76efe&start_rev=26240000784400000&end_rev=27390040833700100

...if you can't see the graph, make sure you're logged in with your @google account.


Explaining the graph:

* we have no regression in terms of "seconds_kernel_to_chrome_exec", only in terms of "seconds_kernel_to_chrome_main"
* veyron_jerry, peach_pit, peach_pi, and nyan_kitty (ARM) all regress at the same time.
* auron_yuna, peppy, samus (x86) doesn't regress
* regression is already fixed on R52.


So this problem was at a chrome and/or graphics level and someone already found and fixed this problem.  Since the problem affects nyan, peach, and veyron it's likely _not_ a kernel-3.14 problem nor even a Mali GPU problem.

--

Given how late we are in R51, I would be OK with saying that we accept the regression in R51 and move forward.  However, if someone happens to know what the fix is and they think a backport is appropriate, I think we could do it (with PM approval, of course).

Adding a bunch of graphics guys since they would be most likely to be familiar with this.

--

Removing ReleaseBlock tag.  Possible we could just mark it as R52-Fixed if folks want.
Cc: ihf@chromium.org
+Ilja who knows how to read these graphs
Labels: ReleaseBlock-Stable
@8: Dangit, I knew that when I started looking at this bug, but then lost track of it when I saw that huge spike.  Sorry for that...  :(  

--

Looking at chromeperf graphs is still interesting, though, and still points to a problem that's not in the BIOS or kernel.

https://chromeperf.appspot.com/report?sid=d4fa73159bd9e8b4da269c85d3c4e1e84220059823142f0b16441a6df73d6070&start_rev=26180000782100000&end_rev=27150000823600000

--

Looking at the above graphs:

* In "seconds_kernel_to_chrome_exec" there is no regression in this time period on veyron.
* In "seconds_kernel_to_login" there IS a regression in veyron.  AKA: problems happen after chrome starts.
* Oddly, veyron_minnie shows no regression in this time period.  Speedy shows, though, so probably 
  not Broadcom v. Marvell
* peach_pi and peach_pit show about .2 - .3 second regression during this period.  They share mali_ddk w/ veyron
* x86 bots are a bit noisy, but possibly peppy shows a .1 second regression?  Unrelated???

--

I can't explain the minnie difference, but since we have nearly the same regression at the same time on peach devices and since this is in the blame list:

src/partner_private/mali-ddk
5c31f911 250987 51084 Fri Mar 11 10:02:21 2016 salva.climent@arm.com MALI: r8p0-03dev0 TX011 TX004X06X DDK


I'll make a guess that this is graphics related.  What do other's think?

---

In any case since we think this _isn't_ fixed, adding blocker back in.

Comment 10 by ihf@chromium.org, May 25 2016

Labels: -ReleaseBlock-Stable
I looked at this and wasted 40 minutes. I don't know more than before. There is a big gap in the data and I can't narrow it more than
https://crosland.corp.google.com/log/7978.0.0..8061.0.0

And yes, it is possible that the mali drop regressed more than we know.

As for impact, I don't see this as a blocker.
Owner: mshe...@chromium.org
@10: Right, I think without the data someone would need to run tests manually to get full data

Matt: is there any way we can get the missing data filled in?  

---

OK, but by trawling the existing logs, I can get at least better data.  See below.

tl;dr:

A) My assertion that the regression was between "seconds_kernel_to_chrome_exec" to "seconds_kernel_to_chrome_main" is wrong.  It's actually between "seconds_kernel_to_chrome_main" and "seconds_kernel_to_signin_start".  Should be obvious from this chart.

  https://chromeperf.appspot.com/report?sid=e6ce0c4e2534f61906a2923bc4f8a88a8844cafc3a16dd25245467e1b58cb7d9&start_rev=26180000782100000&end_rev=27150000823600000

B) Regression range can be narrowed down to:
   
   <https://crosland.corp.google.com/log/8038.0.0..8061.0.0>
   <https://chromium.googlesource.com/chromium/src/+log/50.0.2657.0..51.0.2672.0?pretty=fuller&n=10000>

This has one bump of chrome (obviously this is still a lot of changes) and a week of root filesystem changes.  It _does_ include the Mali bump.

---

FULL DETAILS

I seem to see some data here:

  https://wmatrix.googleplex.com/platform/unfiltered?hide_missing=True&tests=platform_BootPerfServer&days_back=365&releases=51&platforms=veyron_jerry


It's not complete but more than we have in the graphs.  For instance, the Mali uprev happened at <https://crosland.corp.google.com/log/8053.0.0..8054.0.0>.  We don't have those versions, but we do have data for 8038.0.0 and 8064.0.0.

I can follow through and find R51-8038.0.0: 

  https://wmatrix.googleplex.com/testrun/unfiltered?test_ids=275640386,275799715

Start with 275640386 (per build, 10 iterations).  Go to job logs:
  http://ubercautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/56342906-chromeos-test

Eventually navigate to tarball:

  https://pantheon.corp.google.com/m/cloudstorage/b/chromeos-autotest-results/o/56342906-chromeos-test/chromeos4-row4-rack5-host15/platform_BootPerfServer.tgz

I can look at: /platform_BootPerfServer/results/keyval (missing the .json file, which is maybe why no graphs???)

>>> s = open("/b/tip/tmp/keyval", "r").read()
>>> l = [float(x.partition('=')[-1]) for x in s.splitlines() if "seconds_kernel_to_chrome_exec" in x]
>>> sum(l)/len(l)
3.006
>>> l = [float(x.partition('=')[-1]) for x in s.splitlines() if "seconds_kernel_to_chrome_main" in x]
>>> sum(l)/len(l)
3.2409999999999997
>>> l = [float(x.partition('=')[-1]) for x in s.splitlines() if "seconds_kernel_to_login" in x]
>>> sum(l)/len(l)
5.7299999999999995

We'll look at 275799715 too.  Tarball:
  https://pantheon.corp.google.com/m/cloudstorage/b/chromeos-autotest-results/o/56377258-chromeos-test/chromeos4-row4-rack5-host14/platform_BootPerfServer.tgz

>>> s = open("/b/tip/tmp/keyval", "r").read()
>>> l = [float(x.partition('=')[-1]) for x in s.splitlines() if "seconds_kernel_to_chrome_exec" in x]
>>> sum(l)/len(l)
2.95141414141414
>>> l = [float(x.partition('=')[-1]) for x in s.splitlines() if "seconds_kernel_to_chrome_main" in x]
>>> sum(l)/len(l)
3.175454545454545
>>> l = [float(x.partition('=')[-1]) for x in s.splitlines() if "seconds_kernel_to_login" in x]
>>> sum(l)/len(l)
5.662020202020202

---

Now 8064.0.0
  https://wmatrix.googleplex.com/testrun/unfiltered?test_ids=277824486,277118826

Start with 277118826, which didn't abort.  Tarball:
  https://pantheon.corp.google.com/m/cloudstorage/b/chromeos-autotest-results/o/56756034-chromeos-test/chromeos4-row4-rack5-host21/platform_BootPerfServer.tgz

>>> s = open("/b/tip/tmp/keyval", "r").read()
>>> l = [float(x.partition('=')[-1]) for x in s.splitlines() if "seconds_kernel_to_chrome_exec" in x]
>>> sum(l)/len(l)
3.0379999999999994
>>> l = [float(x.partition('=')[-1]) for x in s.splitlines() if "seconds_kernel_to_chrome_main" in x]
>>> sum(l)/len(l)
3.2230000000000003
>>> l = [float(x.partition('=')[-1]) for x in s.splitlines() if "seconds_kernel_to_login" in x]
>>> sum(l)/len(l)
6.011


Check 277824486.  Tarball has 22 items:
  https://pantheon.corp.google.com/m/cloudstorage/b/chromeos-autotest-results/o/56840333-chromeos-test/chromeos4-row4-rack5-host14/platform_BootPerfServer.tgz

>>> s = open("/b/tip/tmp/keyval", "r").read()
>>> l = [float(x.partition('=')[-1]) for x in s.splitlines() if "seconds_kernel_to_chrome_exec" in x]
>>> sum(l)/len(l)
2.975454545454546
>>> l = [float(x.partition('=')[-1]) for x in s.splitlines() if "seconds_kernel_to_chrome_main" in x]
>>> sum(l)/len(l)
3.148636363636364
>>> l = [float(x.partition('=')[-1]) for x in s.splitlines() if "seconds_kernel_to_login" in x]
>>> sum(l)/len(l)
5.928181818181819

---

You can see from the above that there _is_ a regression in kernel_to_login of about .3 - .4 seconds, but no regression in seconds_kernel_to_chrome_exec / main.

===

Someone could more fully put the blame on a given CL by trying to reproduce these results with and without that CL.



Comment 12 by olofj@chromium.org, May 27 2016

Cc: -olofj@chromium.org
Owner: diand...@chromium.org
Status: Assigned (was: Available)
Unfortunately, there's no easy way for us to recoup the missing data at the moment. Unless we have a small list of CL's that we can try to bisect this against, I think we may need to accept this as a lost cause.

Assigning this back to Doug to determine if there are any CL's that pop out as likely candidates, and if not, to close the bug as Won'tFix due to insufficient data. Given that we are still w/in HWReq's and this regression is relatively small, I'd recommend not putting too much time into debugging
Owner: mshe...@chromium.org
@13: from above, there is a CL that pops out as a likely candidate.

src/partner_private/mali-ddk
5c31f911 250987 51084 Fri Mar 11 10:02:21 2016 salva.climent@arm.com MALI: r8p0-03dev0 TX011 TX004X06X DDK

Owner: ihf@chromium.org
Thank Doug!

@Ilja - Do you have cycles to test out perf before and after the mali drop?

Comment 16 by ihf@chromium.org, Jun 8 2016

Owner: gurcheta...@chromium.org
Gurchetan, could you please do the experiment? Please run platform_BootPerfServer on veyron with current mali and with mali reverted to and report the numbers.
I tried installing Veyron Jerry test images from Goldeneye for 8053.0.0 and 8054.0.0.  Unfortunately, Chrome doesn't load on those images, explaining the missing test results.

With a developer image, I emerged and deployed various revisions of the mali-drivers package.  I got noisy results.  For example, with commit 5c31f91 I got two divergent values on the test run (each test has 10 samples which are averaged).     


>>> s = open("/tmp/veyron_jerry/BootPerfServer-5c31f91/results-1-BootPerfServer/platform_BootPerfServer/results/keyval", "r").read()
>>> l = [float(x.partition('=')[-1]) for x in s.splitlines() if "seconds_kernel_to_login" in x]
>>> sum(l)/len(l)
9.540000000000001
>>> s = open("/tmp/veyron_jerry/BootPerfServer-5c31f91_2/results-1-BootPerfServer/platform_BootPerfServer/results/keyval", "r").read()
>>> l = [float(x.partition('=')[-1]) for x in s.splitlines() if "seconds_kernel_to_login" in x]
>>> sum(l)/len(l)
9.757000000000001
>>> s = open("/tmp/veyron_jerry/BootPerfServer-9b307af/results-1-BootPerfServer/platform_BootPerfServer/results/keyval", "r").read()
>>> l = [float(x.partition('=')[-1]) for x in s.splitlines() if "seconds_kernel_to_login" in x]
>>> sum(l)/len(l)
9.498

The 3.5 sec difference between these numbers and the tarballs referenced in #11 are probably due the difference between test and developer images.  Since we can't run the official test images, it's not possible to report accurate numbers.

Comment 18 by ihf@chromium.org, Jun 9 2016

At this stage I don't see low hanging fruit here. Either P3 or WontFix.
Status: WontFix (was: Assigned)
Thanks for trying all! Let's consider this a WontFix due to insufficient data to debug in reasonable time
Cc: -mshe...@chromium.org

Sign in to add a comment