top_25_smooth performance is slower for USB GigE (vs USB 100Mbps)
Reported by dave.rod...@arm.com, Nov 7 2017
Issue description

UserAgent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.45 Safari/537.36
Platform: 10082.0.2017_10_31_1154 (Test Build - davrod01) developer-build elm

Steps to reproduce the problem:
1. Connect to the network using a USB-ethernet adapter, or WiFi.
2. Run the top_25_smooth benchmark on elm: run_benchmark --browser=cros-chrome smoothness.top_25_smooth --remote=...
3. Benchmark results vary depending on the adapter.

What is the expected behavior?
Results should be constant w.r.t. the choice of network adapter.

What went wrong?
Some network adapters, including the built-in WiFi, cause a performance regression.

Did this work before? N/A
Chrome version: 64.0.3253.0  Channel: dev
OS Version:
Flash Version: Disabled

The overall score for top_25_smooth on elm (i.e., the geometric mean of percentage_smooth for each page) regresses by 16% when using the built-in WiFi, compared to using a specific brand of USB-ethernet adapter. I have tried 5 different USB adapters - 3 showed the same regression as WiFi; 2 performed well. Other benchmarks and Chromebooks may be affected as well, although I don't yet have clear data on this. page_cycler and speedometer appear to show 7-8% variation on cyan which could be attributed to network adapters.

Adapters tested on elm:
Built-in WiFi - Bad
StarTech USB21000S2 - Bad
Samzhe USB-C ??? - Bad
Belkin F2CU040 - Bad
LogiLink UA0025C - OK
StarTech USB2106S - OK

I believe that Google MobLab (https://www.chromium.org/chromium-os/testing/moblab) used to recommend the USB21000S2 - this has since been updated, so maybe they're already aware of this issue?
Comment 2, Nov 8 2017
+Grant -- I think he's been looking at NIC performance on CrOS in general. I'm not surprised that some adapters are bad; this is probably not a bug.
Comment 3, Nov 8 2017
> Results should be constant w.r.t. choice of network adapter.

Am I correct in assuming that this means the test results should not vary based on network performance? E.g., is it measuring how long it takes to run a blob of JS code locally?
Comment 4, Nov 8 2017
Re #3 - no, that's not correct -- some of the tests mentioned pull content over the network.
Comment 5, Nov 8 2017
Yes, network latency can affect Telemetry benchmark performance. That's why we prefer to run the benchmarks directly on the DUT (from /usr/local/telemetry).
Comment 6, Nov 8 2017
Dave, thanks for posting the bug. ARM chipsets in general perform very poorly on USB ethernet. I've had more than one bug open on this in the past, this one in particular: https://bugs.chromium.org/p/chromium/issues/detail?id=317899

If WiFi performance is "broken" for this benchmark, please file a new bug (mostly by cloning this one) and include a few more details about the throughput found vs. the throughput expected (with the benchmark). I want to separate WiFi and USB ethernet performance issues as different problems, since the interconnect is likely not the same: WiFi is often PCIe (Intel) or SDIO on ARM platforms (except Rockchip, which recently switched to PCIe). I'm told the WiFi chip is connected via SDIO on elm (MediaTek MT8173 chipset).

My theory was that USB control data wasn't mapped to a cacheable address and thus setting up USB transactions was very (CPU) expensive - essentially a cache miss for every memory reference. I don't have enough "ARM foo" (using a PMU) to verify how many cache misses each RX or TX packet will see, like I did with Itanium systems in 2002 (https://www.kernel.org/doc/ols/2002/ols2002-pages-183-190.pdf). Perhaps you know how to do this (or know someone who does?).

The adapters are listed as "Bad" or "Good". Could I get more specific info? E.g., does the benchmark post some number? At least one of the listed USB ethernet devices is USB 2.0 and will never get more than about 300 Mbps throughput. Please don't use those unless you can fix the benchmark to accommodate the slower network link. Type-A vs Type-C shouldn't really matter; I want to see a 5000 MHz (USB3) link rate in both cases.

1) I don't have any of the listed devices. Can you provide more details about each device (both "Good" and "Bad")?
a) "lsusb" will list the port, device IDs and description.
b) "lsusb -vt" will list the USB link rate.
c) "ethtool -i ethX" will list the driver used.

2) Can you run netperf as described in crbug.com/317899 for any (or all) of them and post the output in this bug? I suggest looking at CPU utilization (run "top" and type "1") on the elm device (via ssh), since most likely one core will be saturated if it's the same problem. Using the "-T" parameter to netperf will allow one to bind the netperf server/client processes to (different) specific cores; this is described in the bug above. Also look at moving USB (xhci?) interrupts to a different core (typically everything lands on CPU0) by getting the xhci driver IRQ # from /proc/interrupts and then "echo $CPU_bitmask > /proc/irq/$XHCI_IRQ/smp_affinity".
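For concreteness, a minimal sketch of that procedure (the IRQ number, CPU bitmask, and server address are placeholders to fill in from your own system):

  # Find the xhci IRQ and steer it to CPU2 (bitmask 0x4).
  grep xhci /proc/interrupts
  echo 4 > /proc/irq/${XHCI_IRQ}/smp_affinity

  # TCP stream test; -T binds the local,remote netperf to specific cores,
  # -c/-C report local/remote CPU utilization.
  netperf -H ${SERVER_IP} -t TCP_STREAM -T 1,1 -c -C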
Comment 7, Nov 8 2017
@#1: because you ran the control server on a HOST, network performance definitely has an important impact on the final results. That's why we saw a huge difference between network interfaces. If you look at how the Telemetry benchmark runner is implemented, you'll see it actually executes run_benchmark on the DUT directly (i.e., you should use the test_that command on the HOST instead of calling run_benchmark directly). In addition, the prerecorded web pages are stored locally and web traffic is proxied through the controlled network traffic shaper, to better measure Chrome performance in a controlled environment. Please reopen the issue if this is not what we expect.
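As a sketch of that invocation, assuming the telemetry_Benchmarks autotest wrapper (the exact control name and board are assumptions - adjust to whatever control file your checkout provides):

  # Run from the CrOS chroot on the HOST; autotest then executes Telemetry
  # on the DUT from /usr/local/telemetry, against locally stored pages.
  test_that --board=elm ${DUT_IP} telemetry_Benchmarks.top_25_smooth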
Comment 8, Nov 10 2017
Hi Grant,
Some details on the adapters below. The first two listed are the high-performing ones (both are USB 2.0). I'll try to answer some more of your questions next week.
StarTech USB2106S
Bus 001 Device 008: ID 9710:7830 MosChip Semiconductor MCS7830 10/100 Mbps Ethernet adapter
/: Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci-mtk/2p, 480M
|__ Port 2: Dev 3, If 0, Class=Hub, Driver=hub/2p, 480M
|__ Port 1: Dev 8, If 0, Class=Vendor Specific Class, Driver=MOSCHIP usb-ethernet driver, 480M
driver: MOSCHIP usb-ethernet driver
version: 22-Aug-2005
firmware-version: MOSCHIP 7830/7832/7730 usb-NET
bus-info: usb-11270000.usb-2.1
supports-statistics: no
supports-test: no
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: no
LogiLink UA0025C
Bus 001 Device 007: ID 9710:7830 MosChip Semiconductor MCS7830 10/100 Mbps Ethernet adapter
/: Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci-mtk/2p, 480M
|__ Port 2: Dev 3, If 0, Class=Hub, Driver=hub/2p, 480M
|__ Port 1: Dev 7, If 0, Class=Vendor Specific Class, Driver=MOSCHIP usb-ethernet driver, 480M
driver: MOSCHIP usb-ethernet driver
version: 22-Aug-2005
firmware-version: MOSCHIP 7830/7832/7730 usb-NET
bus-info: usb-11270000.usb-2.1
supports-statistics: no
supports-test: no
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: no
StarTech USB21000S2
Bus 001 Device 010: ID 0424:7500 Standard Microsystems Corp. LAN7500 Ethernet 10/100/1000 Adapter
/: Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci-mtk/2p, 480M
|__ Port 2: Dev 3, If 0, Class=Hub, Driver=hub/2p, 480M
|__ Port 1: Dev 10, If 0, Class=Vendor Specific Class, Driver=smsc75xx, 480M
driver: smsc75xx
version: 22-Aug-2005
firmware-version: smsc75xx USB 2.0 Gigabit Ethern
bus-info: usb-11270000.usb-2.1
supports-statistics: no
supports-test: no
supports-eeprom-access: yes
supports-register-dump: no
supports-priv-flags: no
Samzhe USB-C
Bus 002 Device 008: ID 0bda:8153 Realtek Semiconductor Corp.
/: Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci-mtk/1p, 5000M
|__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/2p, 5000M
|__ Port 2: Dev 8, If 0, Class=Vendor Specific Class, Driver=r8152, 5000M
driver: r8152
version: v1.08.3
firmware-version:
bus-info: usb-11270000.usb-1.2
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
Belkin F2CU040
Bus 002 Device 004: ID 0bda:8153 Realtek Semiconductor Corp.
/: Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci-mtk/1p, 5000M
|__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/2p, 5000M
|__ Port 2: Dev 4, If 0, Class=Vendor Specific Class, Driver=r8152, 5000M
driver: r8152
version: v1.08.3
firmware-version:
bus-info: usb-11270000.usb-1.2
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
Comment 9, Nov 10 2017
Dave, thanks for providing more details. If any USB 2.0 dongles are "Good", then we aren't talking about the same problem. And as cywang observed in comment #7, the test can be run locally (no network latency).

If you have time to pursue issues with the RTL8153, please open a new bug - including the same dongle information and dmesg output from the duration of the test run. I am concerned the RTL8153 dongles are _slower_ than USB 2.0. I have to wonder if this is a signal integrity issue (USB 3.0 running at a 5000 MHz clock) similar to what we've seen in http://crbug.com/766217. But we should continue this conversation in a new bug.
Comment 10, Nov 24 2017
Re #4, #5, #7: my findings indicate that network latency and bandwidth are not a factor in top_25_smooth. Should we re-open this issue, as cywang suggests in #7?

I've only analysed one page (booking.com), but for this page, performance varies for different ethernet adapters (by around 13%) yet is unaffected by network performance (bandwidth & latency) with a "known-good" ethernet adapter.

I looked at network transfer during scrolling. In the first 1 second of scrolling, under 3 kB is transferred. Data is transferred (or at least mostly transferred) during page loading, before scrolling starts.

I then tried introducing latency with a known-good ethernet adapter, using netem (https://wiki.linuxfoundation.org/networking/netem). The percentage_smooth score was unaffected, even with up to 1 second of latency.

Finally, I introduced bandwidth limits with a known-good ethernet adapter, using wondershaper. The percentage_smooth score was unaffected even at 100 KB/s.
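A minimal sketch of this kind of shaping (the interface name and rates are placeholders; older wondershaper versions take positional kbps arguments as shown, newer ones use -a/-d/-u flags):

  # Add 1 s of latency with netem, then remove it.
  tc qdisc add dev eth0 root netem delay 1000ms
  tc qdisc del dev eth0 root

  # Cap bandwidth to ~100 KB/s (800 kbps) down/up, then clear.
  wondershaper eth0 800 800
  wondershaper clear eth0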
Comment 11, Nov 25 2017
Dave, sure - let's reopen this, but with a more specific/descriptive title, since it's probably not a "browser" issue. Right now I just want to understand why benchmark performance with a USB 2.0/100Mbps device is higher than with a USB 3.0/GigE device. I've never seen this before with USB NICs.

Can you repeat adding latency or limiting throughput until the test does show a difference? I don't know what this test is actually measuring. It's possible the test was designed to NOT measure network latency or throughput in order to make it easier to compare results between machines. I've also never used either netem or wondershaper, so I'm naturally suspicious that the granularity they can "shape" at is larger than what the test is doing. I've seen tools that work better with multiple streams than with a single source of load.

I should make clear that I appreciate that you've checked 2 of the 3 things I measure when evaluating the network performance of a NIC: latency and throughput. The 3rd metric is CPU utilization. The comparison of "good" vs "bad" seems to boil down to "Fast Ethernet" (100 Mbps) vs "Gigabit Ethernet" (1000 Mbps). I believe CPU utilization is going to be different (more "bursty") because of the higher throughput rates. It's possible the GigE USB adapter drivers (r8152 and smsc75xx) are structured differently and may be holding on to a CPU longer when processing packets. "top" (hit "1" after starting it) or timechart might help see that difference: https://www.chromium.org/chromium-os/how-tos-and-troubleshooting/a-brief-perf-how-to

Lastly, it would also be interesting to know whether ASIX AX88179 devices show this difference as well.
Comment 12, Nov 27 2017
Hi Dave, please include the steps you used to run the benchmark (the steps you ran in #1 are definitely affected by your network environment - adapter, driver, network bandwidth, routing...). If this is network-adapter related, we should rename the title, as the top_25_smooth benchmark is definitely not meant to measure network metrics. Thanks~
Comment 13, Nov 27 2017
I'm happy to be proven wrong, but right now my theory here is the same as my theory in bug #785930, comment 11. Specifically, my theory is that all the problems here are caused by bad interactions between big.LITTLE and the interactive governor.

According to that theory, the LEAST efficient network driver (the one that causes the most interrupts, sends the smallest payloads, and does the most copying of bytes) will cause the best performance. Specifically, my theory is that all the extra calculation will bump up the CPU frequency and thus make the device perform better.

As per the other bug:

> IMHO our solution here is not to waste lots of time on tweaking the Interactive
> governor vs. big.LITTLE, though. We should find a way to get the proper governor
> to elm, either by uprevving elm's kernel or backporting the scheduler patches.

---

In general, if the choice of network adapter doesn't affect the test when the test is using the "userspace" governor, then I'd call this a WontFix, or perhaps a duplicate of the other bug. One could be extra sure by testing the userspace governor over a range of frequencies; if the choice of network adapter doesn't affect the test regardless of the frequency, then we know the only difference is how the governor is behaving.
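A minimal sketch of that experiment, assuming the standard cpufreq sysfs interface (the frequency value is a placeholder, in kHz):

  # Pin all cores to a fixed frequency with the userspace governor.
  for c in /sys/devices/system/cpu/cpu[0-9]*/cpufreq; do
    echo userspace > ${c}/scaling_governor
    echo 507000 > ${c}/scaling_setspeed
  done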
Comment 14, Nov 27 2017
Re #13: I don't think it's due to some drivers bumping up the CPU frequency. The performance difference is also reflected in ping times after running "stop ui", i.e. when the DUT is under almost no load at all. In this case I would be surprised if the CPU ever left the lowest frequency - but the adapters which perform worse in benchmarks have ping times which are about 60% worse.
Comment 15, Nov 27 2017
@14: can you humor me and test with the userspace governor?
Comment 16, Nov 27 2017
Regarding granularity in the network shaping tools: I don't know how the tools work internally, but in any case I've seen (using Wireshark) that only around 3 kB of data is transferred during scrolling (which takes a couple of seconds). So for the network adapter bandwidth to affect the test, the adapter would have to fall below ~1.5 kB/s, i.e. orders of magnitude slower than advertised - I think it would be obvious if this were the case. For latency, it's easy to see using ping that every single ping packet has the appropriate latency applied, so I think granularity isn't an issue there either.

I've taken some CPU frequency measurements during scrolling. Frequencies (on big) are about 7.5% *lower* with the high-performing adapter (0.5% lower on little). These adapters also generate about 60% fewer interrupts (about 10k/s vs 16.5k/s). The stand-out difference in the perf top results is that the high-performing adapter spent about 13% of the time in cpuidle_enter_state, vs. 65-70% for the low-performing adapter.

I could continue to throttle latency and bandwidth to try to find the limit, but this makes running the test *very* slow, as it takes a long time to start the test and send the trace data back after completion. If we trust that the tools are doing the right thing in terms of shaping traffic, I'm not sure this would tell us anything further?

Incidentally, I've created an "infinite scroll" benchmark to assist with investigating this. It behaves similarly to top_25_smooth but will scroll a page, then scroll in the other direction, and repeat forever. This is useful for recording with perf, looking at CPU frequency, etc. (but obviously it will never complete, so it's not useful as a benchmark). I could push a patch to share this if you'd find it useful?
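For reference, a sketch of how these two numbers can be captured during scrolling, assuming the standard /proc/stat and cpuidle sysfs interfaces:

  # Interrupt rate: the first value on the "intr" line of /proc/stat
  # is the running total of all interrupts.
  a=$(awk '/^intr/ {print $2}' /proc/stat); sleep 10
  b=$(awk '/^intr/ {print $2}' /proc/stat)
  echo $(( (b - a) / 10 )) interrupts/s

  # Cumulative time (usec) spent in each cpuidle state, per core.
  grep . /sys/devices/system/cpu/cpu*/cpuidle/state*/time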
Comment 17, Nov 27 2017
Dave, if you are comfortable with netem and wondershaper, then I'd rather you humor Doug - he's usually right - and try the userspace governor to set the CPU frequency policy manually. Posting the "infinite scroll" code would certainly be useful, since this won't be the last time that scroll performance is investigated (I know this has been an issue several times in the past). Feel free to attach it as a patch to this bug, upload it to chromium.org's code review, or email it to chromium-os-dev@chromium.org. Thanks!
Comment 18, Nov 28 2017
Thanks all for the suggestions - I'm very happy to help provide this data. I'll upload a patch for the infinite scrolling benchmark shortly.

I've set the userspace governor and tried scrolling booking.com at a range of fixed frequencies, capturing the percentage_smooth score for two different adaptors:

Freq (little/big)   A      B
507/507            89.8   72.3
702/702            92.1   70.2
1001/1001          95.8   72.1
1105/1209          95.8   71.3
1209/1404          96.4   69.5
1300/1612          97.0   73.7
1508/1807          96.4   74.1
1703/2106          97.0   72.9
Comment 19, Nov 28 2017
@18: OK, fair enough. So adapter "B" causes bad performance on the test even at the highest CPU frequency; thus it makes sense that this particular adapter needs to be debugged. Do all 3 of the bad adapters you listed above behave this way? Certainly the WiFi _didn't_ behave this way (from bug #785930), right? AKA, if you ran the test above and used WiFi as "B", then you would actually get good results, right? ...so it's possible that not all 3 of the Ethernet adapters do.
Comment 20, Dec 6 2017
I've done some more testing. This time I've run many more iterations (25) on all governors, for WiFi and three USB adaptors (two under-performing, one "good"), to try to get more confidence in the numbers. They show two very surprising results:

1. Setting the frequency to minimum (userspace governor) performs significantly better than maximum (but only for the under-performing adaptors). For the "good" adaptor, it's the other way around (as expected).
2. For WiFi, userspace set to min or max frequency behaves differently from the powersave and performance governors.

Numbers show the mean percentage_smooth score for 25 iterations of booking.com, +/- std deviation, on elm.

Governor        Wifi          A             B             C
userspace_min:  77.0 +/- 2.1  76.8 +/- 2.4  77.2 +/- 2.0  90.6 +/- 1.2
userspace_max:  72.1 +/- 2.2  71.2 +/- 2.5  71.9 +/- 2.2  96.1 +/- 0.5
interactive:    78.2 +/- 2.8  77.0 +/- 2.0  76.8 +/- 2.6  91.1 +/- 1.2
powersave:      78.0 +/- 2.0  76.0 +/- 3.0  76.8 +/- 3.0  90.6 +/- 1.3
performance:    77.5 +/- 1.9  72.0 +/- 1.7  72.1 +/- 1.0  96.3 +/- 0.4
ondemand:       76.8 +/- 3.1  83.3 +/- 4.6  83.0 +/- 3.7  95.7 +/- 0.6
Comment 21, Dec 6 2017
@20: That's pretty baffling. I guess the last thing to check is whether big vs. little cores matter. AKA: if you run with all big or all little cores, does that make a difference?
Comment 22, Dec 6 2017
I've uploaded the infinite scroll tool/benchmark here: https://chromium-review.googlesource.com/c/chromium/src/+/810745
Comment 23, Dec 7 2017
Running the test above with the big cores disabled (I can't do this for little, as disabling core 0 crashes elm). Data for all cores is as per #20.

Governor        A (all cores)  A (little only)  C (all cores)  C (little only)
userspace_min:  76.8 +/- 2.4   61.6 +/- 3.1     90.6 +/- 1.2   53.4 +/- 2.5
userspace_max:  71.2 +/- 2.5   92.8 +/- 1.1     96.1 +/- 0.5   94.2 +/- 1.1
interactive:    77.0 +/- 2.0   90.0 +/- 1.8     91.1 +/- 1.2   90.9 +/- 1.5
powersave:      76.0 +/- 3.0   61.9 +/- 3.1     90.6 +/- 1.3   53.0 +/- 2.8
performance:    72.0 +/- 1.7   92.9 +/- 1.2     96.3 +/- 0.4   94.1 +/- 1.4
ondemand:       83.3 +/- 4.6   92.1 +/- 1.5     95.7 +/- 0.6   92.8 +/- 1.2

When frequency is variable or high, disabling the big cores dramatically improves A (~30%), so that A is very nearly as good as C. When frequency is low, disabling the big cores makes A outperform C.
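A sketch of how the big cluster can be taken offline via sysfs (which CPU numbers map to the big cores is an assumption - check /proc/cpuinfo; MT8173 is 2x Cortex-A53 + 2x Cortex-A72):

  # Take the (assumed) big cores offline, run the test, bring them back.
  echo 0 > /sys/devices/system/cpu/cpu2/online
  echo 0 > /sys/devices/system/cpu/cpu3/online
  # ... run the benchmark ...
  echo 1 > /sys/devices/system/cpu/cpu2/online
  echo 1 > /sys/devices/system/cpu/cpu3/online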
Comment 24, Dec 7 2017
Repeat of #18 - this has more (25) repetitions, so I'm more confident in this data:

Freq (little/big)  C (all cores)  C (little only)
507/507            76.2 +/- 3.5   62.6 +/- 2.5
702/702            74.5 +/- 1.9   80.5 +/- 2.2
1001/1001          72.7 +/- 2.4   87.6 +/- 5.6
1105/1209          72.3 +/- 2.0   90.5 +/- 1.5
1209/1404          70.5 +/- 2.4   91.1 +/- 2.1
1300/1612          72.2 +/- 2.3   91.5 +/- 1.1
1508/1807          71.5 +/- 2.4   92.7 +/- 1.5
1703/2106          71.7 +/- 2.1   92.5 +/- 1.1
Comment 25, Dec 7 2017
That's all super interesting data. ...but a bit baffling.

===

What I was trying to figure out by asking for all the experiments above is a general category for why speed was slow. From all the above discussions, I think the theories people have had were these (am I missing any?):

A) Different Ethernet adapters (or their drivers) were able to get data from the network at different speeds. Thus adapters that were slow at getting data from the network resulted in bad test scores. These "slow" Ethernet adapters could be slow because of bad latency/throughput, or because of dropped packets (and thus things would be slow because higher levels would need to retry). If the problem was at the driver level, one theory was that drivers might be dealing with DMA incorrectly, or with weakly ordered memory.

B) Some Ethernet adapters consumed system resources and prevented the rest of the system from running. With 4 cores it seems unlikely that they would be hogging all CPU cycles, but if they caused interrupt storms they could block CPU0, which could impede other things in the system from happening. ...or if they held a shared resource they could block other things in the system from making forward progress.

C) Different Ethernet adapters affected the scheduler in different ways. For instance, they could trigger CPU frequency bumps, or push other processes onto cores that are running more slowly.

D) Running with just slightly different timing affects the test a lot. For instance, if you have something that polls every "jiffy" for data and HZ=100, then you'll poll every 10 ms. If something was ready at 10.001 ms then you won't get it until 20 ms has passed, ...but if it was ready at 9.999 ms then you'll get it at 10 ms. There are other similar cases where having a mismatch in timing can cause a big performance delta.

E) An external factor, like thermal throttling.

===

In general, the previous discussion made me feel that A) was unlikely. It seemed that you had tried to slow down the network and it didn't affect the test much. I guess one other thing you could try would be to run a "ping" in another window and ping the local router while running the test. If you see lots of dropped packets from ping (or lots of really high-latency pings), then that could point to something like A). In theory, for Ethernet you should see near 0 dropped packets on a ping to the local router.

Theory B) also seems a bit unlikely, right? If we're running both big and little clusters at max frequency, it seems a bit unlikely the Ethernet adapter driver could hog too many resources.

Theory C) is somewhat disproved by the fact that running 1703/2106 doesn't show good performance but running just littles at 1703 does. If 1703 littles is enough for good performance, then adding the bigs shouldn't make things worse. EXCEPT: there's one other frequency in the system: the GPU frequency. We should probably try setting the GPU to a constant frequency too. I don't know how the GPU driver scales its frequency, but it's sorta possible the CPU speed could affect things. This is slightly related to D): if the CPU requests X GPU transactions every period, perhaps that's not enough to bump up the GPU frequency. However, if the CPU requests "X + 1" in one period and "X - 1" in the next, then the "X + 1" might be enough to trigger a bump up in GPU frequency (and the "X - 1" isn't enough to cause it to go back down). You still have 2X transactions in 2 periods, but in one case you bump the GPU frequency and in the other you don't.

Theory D) is somewhat plausible (even separate from the discussion above for theory C). Maybe you could try HZ=1000 instead of HZ=250 to see if it helps?

Theory E) is somewhat plausible. If you have the big cores available, it's possible they generate heat and slow the system down (perhaps even cause GPU throttling).

===

So thus, future tests (a sketch of items 0 and 1 follows below):

0. Make sure the GPU speed is constant for the tests, to rule that out.
1. Try running a "ping" to the local gateway in a 2nd window at the same time as the test. Does one adapter show significantly worse ping times or dropped packets?
2. Always monitor the actual frequencies as the test is running, and/or check whether thermal throttling happened.
3. Adjust JIFFIES. Does that matter at all?
4. Make sure that you're not running tests in a consistent order. For instance, if you always test adapter A, then B, then C, it's possible that testing A heats up the system or chews through memory; by the time you run test C, the system may be hot or out of memory. For the heat problem, even a reboot won't help - you just need to wait until things cool off.
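A sketch of items 0 and 1, assuming the GPU is scaled via devfreq (the devfreq node name is a guess - list /sys/class/devfreq to find the real one; the gateway address is a placeholder):

  # 0) Pin the GPU frequency with the devfreq userspace governor.
  GPU=/sys/class/devfreq/13000000.mfgsys-gpu
  echo userspace > ${GPU}/governor
  cat ${GPU}/available_frequencies   # pick one of these values
  echo 455000000 > ${GPU}/userspace/set_freq

  # 1) Watch for drops or latency spikes while the test runs.
  ping -i 0.2 ${GATEWAY_IP}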
Comment 26, Dec 21 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/4bae44ecfc5d0c1d079b920df3cd67cc79948853

commit 4bae44ecfc5d0c1d079b920df3cd67cc79948853
Author: Dave Rodgman <dave.rodgman@arm.com>
Date: Thu Dec 21 13:26:09 2017

    Add --scroll-forever option to top_25_smooth

    This is intended to be used for debugging purposes; it causes the
    benchmark to run forever and not produce a score. It will scroll the
    selected page up and down forever, which is useful for debugging
    scrolling issues (e.g., for use with perf). Use with --story-filter
    to select a page.

    Bug: 782187
    Change-Id: I07497cf11aefc84972493e94b6ee3ae08eebec08
    Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>
    Reviewed-on: https://chromium-review.googlesource.com/810745
    Reviewed-by: Ned Nguyen <nednguyen@google.com>
    Reviewed-by: Victor Miura <vmiura@chromium.org>
    Commit-Queue: Ned Nguyen <nednguyen@google.com>
    Cr-Commit-Position: refs/heads/master@{#525681}

[modify] https://crrev.com/4bae44ecfc5d0c1d079b920df3cd67cc79948853/tools/perf/benchmarks/smoothness.py
[modify] https://crrev.com/4bae44ecfc5d0c1d079b920df3cd67cc79948853/tools/perf/page_sets/top_25_smooth.py
Comment 27, Jan 4 2018
I've done some more testing and have found a solution for this.

My comment earlier about interrupts was inaccurate - I've measured again and found that the high-performing adaptors generate many more interrupts than the low-performing ones (about 800 vs. 70 per second). This interacts with entering the idle state on the little cores, and causes the results we've seen.

The issue is that the firmware specifies values for the time to enter and exit the sleep state, and the minimum expected time in the sleep state to break even on power. For the little cores, these appear to be too low, so the kernel would presumably sleep and then fail to wake up in time for Chrome to meet its deadlines. This explains why performance improved with the big cores disabled or the frequency held low - the increased load meant the system was sleeping less. It also explains why smoothness (which is very sensitive to latency) showed this issue where other benchmarks did not.

I've done some experimentation with different values for the timings related to the idle state and found a sweet spot which: improves scrolling performance by ~20% [1]; improves battery life by ~0.5% [2]; increases page load times by ~1.4% [2].

[1] Based on running the booking.com page in smoothness.top_25_smooth; the score goes from 76.1% to 91.3% (disabling sleep entirely, to get an idea of the theoretical upper limit, gives a score of 93.0%).
[2] Based on running power_LoadTest.1hour.

Obviously there is scope to do more detailed testing and get better numbers - in particular, it would be good to run the full power_LoadTest, but this takes quite a long time. I could submit a patch for this as-is, or alternatively do further testing to fine-tune the values. Fine-tuning will take quite some time though, due to the time needed to run a full power_LoadTest - and based on these numbers I'm not sure we would be able to get much more benefit.
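A sketch of the "disable sleep entirely" configuration used for the theoretical-upper-limit number (assuming each core's idle state is exposed as cpuidle state1, as in the command quoted later in this bug):

  # Keep every core out of its idle (sleep) state, then re-run the benchmark.
  for s in /sys/devices/system/cpu/cpu[0-9]*/cpuidle/state1/disable; do
    echo 1 > ${s}
  done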
Comment 28, Jan 8 2018
Dave, that is excellent news. I just want to warn about attributing cause/effect between the number of interrupts and performance. More packets typically generate more interrupts, and thus faster throughput; handling more packets usually isn't due to the CPU power state transitioning. Also, switching the CPU power state based on the frequency of interrupts seems like a risky strategy to me, since we could have many different timing patterns of packets arriving AND different interrupt mitigation strategies by the USB NICs to reduce their interrupt service time. Penalizing the more efficient USB NICs (or other devices) with a latency penalty when their drivers don't exceed some arbitrary CPU utilization threshold seems wrong to me.
Comment 29, Jan 9 2018
Hi Grant, To clarify, the interrupts from the adapters are inhibiting sleep. This turns out to have a beneficial effect, because the kernel is not well configured in terms of sleep behaviour (on this platform). The adapter issue led us to the root cause (sleep behaviour), but it's really nothing to do with the adapters. So the fix is to adjust the kernel configuration to bring the timings more in line with the observed behaviour of the SoC. There's no particular penalty or special behaviour for any of the adapters. With the kernel configured more optimally, it will schedule sleep/wakeup better (regardless of what the adapters are doing with interrupts), which is where the performance comes from.
Comment 30, Jan 16 2018
Hi Dave! Yes, I understood that the interrupts from the USB host controller "interfered" with the CPU power state transition (going to the "sleep" state). I've seen this sort of behavior before on Itanium (perfmon also keeps the CPU in a high power state) and x86. My point was that interrupt patterns are very fragile behaviors, and the power state transitions aren't taking latency into consideration for IO work. I'm curious what "kernel configuration" you are referring to, since it sounds promising. :)
Comment 31, Jan 16 2018
The patch is to values in the MT8173 dtsi file (so maybe this is considered firmware configuration rather than kernel configuration?). Please see https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/866842 for details.
Comment 32, Jan 16 2018
Todd, is there a chromium.org engineer who should be looking at the proposed change? I see a list of reviewers but I don't know if any of those are familiar with the details of (struct cpu_idle_state) exit_latency.
Comment 33, Jan 16 2018
I think Dan is probably the right owner, based on https://chromium-review.googlesource.com/273995
Comment 34, Jan 20 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/1351e46a14d0811cb8689debdb60a0f4622be1e4

commit 1351e46a14d0811cb8689debdb60a0f4622be1e4
Author: Dave Rodgman <dave.rodgman@arm.com>
Date: Sat Jan 20 01:18:11 2018

    CHROMIUM: ARM: dts: mt8173: improve idle timing

    Adjust idle state timings for entering, exiting and minimum-residency.
    Timings for little cores have been separated from big cores, and
    increased. Previous values for little appear to be too low, causing
    the kernel to wake up later than intended. This has a large effect on
    latency-sensitive workloads, such as smooth scrolling in Chromium.
    Impact on power appears negligible (slightly beneficial).

    BUG=chromium:782187
    TEST=run top_25_smooth on elm. Expect overall improvement of ~10%,
    with improvement of up to ~20% for simpler pages (e.g., booking.com).

    Change-Id: I3b19c9ddb52084dcf0c732d8d472c9b78fabf797
    Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>
    Reviewed-on: https://chromium-review.googlesource.com/866842
    Commit-Ready: Daniel Kurtz <djkurtz@chromium.org>
    Reviewed-by: Daniel Kurtz <djkurtz@chromium.org>

[modify] https://crrev.com/1351e46a14d0811cb8689debdb60a0f4622be1e4/arch/arm64/boot/dts/mediatek/mt8173.dtsi
Comment 35, Jan 22 2018
Is there some sort of test case we can use to make sure we don't end up in this state again? Requiring people to notice that perf gets better with certain types of Ethernet adapters seems very non-ideal.
Comment 36, Jan 23 2018
Issue 785930 has been merged into this issue.
Comment 37, Feb 5 2018
Comment 38, Mar 19 2018
After the patch @34 landed, I tested again and discovered that there was a regression. Apologies for the delay - I've been trying to get to the bottom of what happened and capture more data, so we can have confidence in a fix.

Previous results showed that the earlier patch raised smoothness (for booking.com) from ~76.5% to 91.3%, i.e. a ~20% benefit. Following the regression, I now measure 84.3% with that patch (+10%). I wasn't able to figure out what caused the regression.

Digging deeper into the idle-state behaviour, I came to the conclusion that my earlier comments about under-estimating wake-up delay were inaccurate: most (84%) of Chrome's sleeps are unbounded, i.e. waiting for a message, so they would not be affected by under-estimation. I experimented with some test programs which used multiple threads communicating over a pipe to time the wake-up delay; results showed that the wake-up time after an unbounded sleep is less than 225us, and that the mean wake-up time (which relates to how often we enter the idle state) is somewhat correlated (-0.77) with the benchmark score. I also looked at memory performance after wake-up (i.e., looking for a caching issue) - this showed performance similar to or better than other, non-affected platforms. I tested performance on big after a sleep on little (looking for cache-snooping or memory performance effects); there was no evidence of any impact on big. SoC errata didn't indicate any likely causes. Finally, I looked at the 4.4 kernel: this partially mitigates the issue (smoothness score of 85.7 with the interactive governor, 87.7 with sched); it's not clear why (the idle-state timings are unchanged).

Following this, I looked at adjusting the timings further in order to discourage sleep, both in terms of what's effective and what the power impact is. I concluded that adjusting the minimum residency is the preferred way to inhibit sleep, so I've focused on that. The attached graph shows that smoothness improves up to a minimum residency of ~15ms; after that, there's no further improvement. I propose setting the minimum residency to 17ms (i.e., slightly longer than a frame), and returning the other numbers to their original values, as this set of values gave good, reliable results.

I gathered some power data from power_LoadTest - given the time it takes to run this test, I've only been able to do 13 runs, so the data is somewhat noisy. Discarding a couple of obvious outliers, std dev seemed to be around 1.5% of the mean for most metrics. On that basis, it looks like the power / battery life benefit of this patch is small and probably not significant. The benefit for page loading and smoothness is significant.

Numbers below are the relative improvement over the baseline (higher indicates improvement, lower indicates regression in all cases); patch 1 is the patch that landed previously, patch 2 is the proposed patch. The smoothness number is from running booking.com in the smoothness benchmark separately from PLT (the table shows relative improvement - the absolute value with patch 2 is 93.1%). Disabling sleep entirely (all cores) yields a score of 93.4%, i.e. patch 2 gets very close to the theoretical limit. All numbers are from a build using Chromium 66.

metric         patch 1   patch 2
battery life   -0.2%     0.4%
energy rate    4.1%      3.2%
page loading   4.2%      3.1%
% smooth       10.2%     21.7%

Based on these numbers, I will push patch 2. I would like to get to the bottom of this issue, but I don't think I will have time to investigate further (although I'm happy to answer any questions or share data / ideas etc).

Finally, in answer to Doug's question in @35: yes, I think a simple test which runs smoothness normally, and then with sleep completely disabled (echo 1 > /sys/devices/system/cpu/cpu{0..n}/cpuidle/state1/disable), and looks at the difference in the results could identify this. I'll investigate putting a test together, if that sounds sensible?
Comment 39, Mar 19 2018
I've uploaded a patch - PTAL: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/968509
Comment 40, Mar 19 2018
@38: your suggested test sounds useful.

---

@39: I haven't spent much time with the scheduler or big.LITTLE tuning myself, but I added a few people to the patch. ...I think Dan may know a bit more about it than I do.
Comment 41, Mar 23 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/5aa18df2bd886c70e8516a95d8a2afc6c1e2b121

commit 5aa18df2bd886c70e8516a95d8a2afc6c1e2b121
Author: Dave Rodgman <dave.rodgman@arm.com>
Date: Fri Mar 23 19:10:17 2018

    CHROMIUM: ARM: dts: mt8173: discourage sleep on little

    Adjust idle-state timings for little cores. This discourages the
    little cores from sleeping, which benefits latency-sensitive
    workloads (e.g., smooth scrolling in Chromium). Throughput-sensitive
    workloads (e.g. page loading) benefit slightly. Power / battery life
    is not significantly affected.

    BUG=chromium:782187
    TEST=run top_25_smooth on elm on booking.com page. Score should
    improve from ~84% to 91%

    Change-Id: Ib79d432dd3952399dd443f240925591b95a27b89
    Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>
    Reviewed-on: https://chromium-review.googlesource.com/968509
    Reviewed-by: Douglas Anderson <dianders@chromium.org>
    Reviewed-by: Sonny Rao <sonnyrao@chromium.org>
    Reviewed-by: Daniel Kurtz <djkurtz@chromium.org>

[modify] https://crrev.com/5aa18df2bd886c70e8516a95d8a2afc6c1e2b121/arch/arm64/boot/dts/mediatek/mt8173.dtsi
Comment 42, Apr 12 2018
I've uploaded a regression test here: https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1010124

This test probably needs a bit of discussion to get its sensitivity right, and to decide the right actions when it does spot a regression (e.g., fail the test? just record the perf keyvals? automatically file a bug?). If I understand the autotest system correctly, I'll also need to follow up with a patch to the ebuild - but I'll do this once we're happy with the test itself.
Comment 43, May 31 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/d3bacb1a6b80a1290bb33567083919e7a9be5d01

commit d3bacb1a6b80a1290bb33567083919e7a9be5d01
Author: Dave Rodgman <dave.rodgman@arm.com>
Date: Thu May 31 19:26:19 2018

    autotest: add kernel_IdlePerf test

    Add test to check for performance regressions associated with
    idle-state. This test currently only supports Arm 64-bit platforms.

    BUG=chromium:782187
    TEST=test_that <IP> kernel_IdlePerf --args='local=True'

    Change-Id: I6bdcff1ca8db949db0c578d70845de27c33563e3
    Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>
    Reviewed-on: https://chromium-review.googlesource.com/1010124
    Reviewed-by: Grant Grundler <grundler@chromium.org>

[add] https://crrev.com/d3bacb1a6b80a1290bb33567083919e7a9be5d01/server/site_tests/kernel_IdlePerf/control
[add] https://crrev.com/d3bacb1a6b80a1290bb33567083919e7a9be5d01/server/site_tests/kernel_IdlePerf/kernel_IdlePerf.py
Comment 44, Jun 13 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/030da08a31b0bd453c56d5457a63ea090ee399f6

commit 030da08a31b0bd453c56d5457a63ea090ee399f6
Author: Dave Rodgman <dave.rodgman@arm.com>
Date: Wed Jun 13 19:59:31 2018

    autotest-server-tests: add kernel_IdlePerf

    CQ-DEPEND=CL:1010124
    BUG=chromium:782187
    TEST=TESTS=tests_kernel_IdlePerf emerge-<overlay> autotest-tests
    test_that <IP> kernel_IdlePerf --args='local=True'

    Change-Id: Ic85c74129e9ab4f700ce05188af90ce838bf3225
    Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>
    Reviewed-on: https://chromium-review.googlesource.com/1062027
    Reviewed-by: Sonny Rao <sonnyrao@chromium.org>
    Reviewed-by: Grant Grundler <grundler@chromium.org>

[modify] https://crrev.com/030da08a31b0bd453c56d5457a63ea090ee399f6/chromeos-base/autotest-server-tests/autotest-server-tests-9999.ebuild
Comment 45, Jun 14 2018
Now that the regression test is merged, I think we can close this bug?
Comment 46, Jun 14 2018
Thank you!
Comment 47, Jun 14 2018
No problem, we're very happy to help out with Arm performance :-)
Comment 1 by djkurtz@chromium.org, Nov 7 2017
Components: OS>Performance OS>Kernel OS>Systems>Network
Labels: -Via-Wizard-Other Performance-Network Arch-ARM64 Kernel-3.18
Owner: bccheng@chromium.org
Status: Assigned (was: Unconfirmed)