Compare android runs of memory.top_10_mobile both on and off swarming
Issue description
We need to see if it's feasible to run memory.top_10_mobile on Android single-device swarming.
,
Apr 7 2017
https://chromeperf.appspot.com/report?sid=5ff7e515a400a9c42aa9ea5ab0ddb2288f12d352da6a7c89aa849e35d3dcc321 is a graph which compares swarmed and non-swarmed bots. Looks like we're using less memory on swarming? I don't know the benchmark well enough to know if I'm using the data correctly.
,
Apr 7 2017
+Juan, Hector: can you help us verify whether the memory data on the swarmed Android bots looks sane compared with the existing bots? It looks like the total memory metric on swarmed Android bots is lower than on non-swarmed bots, which is a good thing. But I am not competent enough in this space to have the final say :-)
,
Apr 10 2017
My first suspicion would be device differences.

The non-swarmed bot, from the "device status" step, seems to be using:
bullhead-userdebug 6.0.1 MMB29Q 2480792 dev-keys
https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf%2FAndroid_Nexus5X_Perf__2_%2F3475%2F%2B%2Frecipes%2Fsteps%2Fdevice_status%2F0%2Flogs%2Fjson.output%2F0

The swarmed bot, from the "swarming.summary" (is there a better way to look up this info?), I can see at least:
bullhead MDB08O
https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf.fyi%2FAndroid_Swarming_N5X_Tester%2F486%2F%2B%2Frecipes%2Fsteps%2Fmemory.top_10_mobile_on_Android%2F0%2Flogs%2Fswarming.summary%2F0

So I would start by making both use the same Android build.
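The comparison suggested here can be sketched as follows. This is only an illustration: it assumes the full string follows the usual Android build description layout (`<product>-<variant> <release> <build id> <incremental> <tags>`), and the helper names `BuildInfo`, `parse_build_description` and `same_android_build` are hypothetical, not part of any Chromium tooling. The two build IDs are the ones quoted above.

```python
from typing import NamedTuple, Optional


class BuildInfo(NamedTuple):
    product: str
    build_id: Optional[str]


def parse_build_description(description: str) -> BuildInfo:
    """Parse a full build description string.

    Assumed layout (an assumption, not taken from the bot logs):
    '<product>-<variant> <release> <build id> <incremental> <tags>'.
    """
    fields = description.split()
    product = fields[0].split("-")[0]
    build_id = fields[2] if len(fields) >= 3 else None
    return BuildInfo(product, build_id)


def same_android_build(a: BuildInfo, b: BuildInfo) -> bool:
    """True only when both build IDs are known and product and build ID match."""
    if a.build_id is None or b.build_id is None:
        return False
    return a.product == b.product and a.build_id == b.build_id


non_swarmed = parse_build_description(
    "bullhead-userdebug 6.0.1 MMB29Q 2480792 dev-keys")
# The swarming summary only exposed the product and the build ID.
swarmed = BuildInfo("bullhead", "MDB08O")

print(same_android_build(non_swarmed, swarmed))  # False: MMB29Q vs MDB08O
```

Same device model (bullhead), different build IDs, so the two bots were not yet comparable at this point.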
,
Apr 10 2017
To #4: thanks Juan, I filed issue 709970 about that.
,
Apr 17 2017
The bots are now using the same Android build.

https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf.fyi%2FAndroid_Swarming_N5X_Tester%2F517%2F%2B%2Frecipes%2Fsteps%2Fmemory.top_10_mobile_on_Android%2F0%2Flogs%2Fswarming.summary%2F0 for swarming

https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf%2FAndroid_Nexus5X_Perf__2_%2F3558%2F%2B%2Frecipes%2Fsteps%2Fdevice_status%2F0%2Flogs%2Fjson.output%2F0 for non-swarming.

You can see the graph bump in https://chromeperf.appspot.com/report?sid=5ff7e515a400a9c42aa9ea5ab0ddb2288f12d352da6a7c89aa849e35d3dcc321
,
Apr 17 2017
perezju@, can you take a look at the graph?
,
Apr 18 2017
Great, now the difference is even larger! I'll have a careful look at device provisioning going on at each of the bots, maybe the difference is there. Will post back here when I have any details.
,
Apr 19 2017
bpastene@ knows about how the devices are provisioned on swarming.
,
Apr 20 2017
Swarming device provisioning and pre/post-task cleanup can be found at
https://chrome-internal.googlesource.com/infradata/config/+/master/configs/chromium-swarm/scripts/android.py

It's generally a subset of what provision_devices.py does + some cpu governor logic. That might be a culprit here, but IIUC perf tests do their own governor management on devices, so it would probably just overwrite what swarming does.
,
Apr 21 2017
I had a look comparing logs on two runs:
Swarming: https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf.fyi%2FAndroid_Swarming_N5X_Tester%2F528%2F%2B%2Frecipes%2Fsteps%2Fmemory.top_10_mobile_on_Android%2F0%2Fstdout
Non-swarming: https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf%2FAndroid_Nexus5X_Perf__2_%2F3582%2F%2B%2Frecipes%2Fsteps%2Fmemory.top_10_mobile%2F0%2Fstdout

Things that I checked and that show no significant differences:
- chrome command line flags (including field trials)
- hardware details (as reported by _LogBrowserInfo)
- tracing configs

Generally on the surface those logs look comparable.

I compared a few other memory metrics here:
https://chromeperf.appspot.com/report?sid=c5b3c0ba0329cd9b9787ec595dcac6b45d4cad4aae500867ad21161313710c42

Highlights:
- Flashing the device (in #5 above) did not change how much memory Chrome thinks it's using (by_chrome:effective_size).
- The drop after flashing is only visible on by_os:resident_size; wondering if this has something to do with other processes running on the device?
- Anyway, even within Chrome, there *are* some significant differences in memory usage (e.g. swarming uses 3.5MiB less on the wikipedia page)

I've pulled a couple of traces:
Swarming: https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/trace-file-id_10-2017-04-21_00-17-13-58626.html
Non-swarming: https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/trace-file-id_10-2017-04-20_23-04-31-32035.html

But nothing jumps out at me that would explain the differences.

+primiano in case he has more ideas on things to check?

bpastene@ do we keep logs of what happens during device provisioning in swarming? I couldn't find anything relevant e.g. at:
https://chromium-swarm.appspot.com/task?id=35a688cda4689710&refresh=10&show_raw=1
,
Apr 21 2017
actually +primiano, see #11 above
,
Apr 21 2017
To compare OS activities on the phone, we can use the trace of system_health.common_mobile here:
android-nexus5X: https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/trace-file-id_14-2017-04-17_23-27-58-44384.html
fyi-android-swarming-n5x: https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/trace-file-id_14-2017-04-17_23-29-45-58258.html
,
Apr 21 2017
Err, sorry for my silly comment, I realize that we didn't enable atrace on system_health.common_mobile. We currently only enable atrace on power.idle_platform. Lemme see if we can pull out the trace from there.
,
Apr 21 2017
atrace is failing on swarming nexus-5 due to battor command error:
(https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf.fyi%2FAndroid_Power_Nexus_5X_Perf%2F655%2F%2B%2Frecipes%2Fsteps%2Fpower.idle_platform%2F0%2Fstdout)

There is also some weird warning about the atrace command there:
(WARNING) 2017-04-20 17:49:51,092 adb_wrapper.Shell:499 exit status of shell command 'atrace -z -b 4096 sched freq gfx view dalvik webview input disk am wm rs --async_dump' missing.

However, I was able to grab the trace & recover it manually (in attached file).

And here is the atrace of android-nexus5X (non-swarming):
https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/trace-file-id_0-2017-04-19_02-32-49-98639.html
,
Apr 21 2017
TL;DR: swarming is running Chrome in 64-bit mode instead of 32-bit.

I looked at the traces in #11:
1) The tracing overhead is 2x (60 MB vs 30 MB) in the non-swarming case. This all comes from the TraceBufferVector size. From the trace metadata, in both cases tracing is started with "record-as-much-as-possible", which in turn causes a fixed size of the vector (kTraceEventVectorBigBufferChunks). The only thing that could make a difference that comes to my mind is 32 vs 64 bit.
2) In the trace metadata one is os-arch: "armv8l", the other one is os-arch: "aarch64".
3) If I look at the virtual addresses (select any blue column in memory-infra, and expand the "Stack" section), the swarming trace is definitely running on a 64 bit address space, the non-swarming trace is running in 32 bit mode.

This would explain other things like the v8 heap being bigger.
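The two checks described above (the `os-arch` value in the trace metadata and the width of the virtual addresses) can be sketched roughly as below. This is an illustrative helper only, not part of Telemetry or the trace viewer: the function names and the example addresses are made up, and the mapping covers only the two `os-arch` values seen in this bug.

```python
# Hypothetical mapping, limited to the two values reported in this bug.
OS_ARCH_BITNESS = {
    "armv8l": 32,   # 32-bit ARM userspace on an ARMv8 CPU
    "aarch64": 64,  # 64-bit ARM
}


def bitness_from_os_arch(os_arch: str) -> int:
    """Map a trace metadata os-arch string to a process bitness."""
    try:
        return OS_ARCH_BITNESS[os_arch]
    except KeyError:
        raise ValueError(f"unknown os-arch: {os_arch!r}") from None


def looks_64_bit(addresses) -> bool:
    """True if any address cannot fit in a 32-bit address space.

    A False result is only a hint: a 64-bit process may happen to
    show nothing but low addresses in the sampled data.
    """
    return any(addr > 0xFFFFFFFF for addr in addresses)


print(bitness_from_os_arch("armv8l"))   # 32
print(bitness_from_os_arch("aarch64"))  # 64
print(looks_64_bit([0x7F8A2C4000]))     # True  (made-up address)
print(looks_64_bit([0xB6F00000]))       # False (made-up address)
```

The address check is the weaker of the two signals, so it is best used to confirm what `os-arch` already says rather than on its own.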
,
Apr 21 2017
Thanks for the analysis Primiano! Annie: are the Android perf bots supposed to use 32-bit or 64-bit? Are we using the same bit-mode of Chrome everywhere on the current Android perf waterfall?
,
Apr 21 2017
I think they are meant to try what we ship, which is 32 bit, except for WebView, for which I seem to recall we currently test the 64 bit version.
,
Apr 21 2017
,
Apr 21 2017
Thanks, I filed issue 714110 to make sure we use 32 bit Chrome on Android swarming bot as well. You know the reason why we use 64 bit webview? Is that the version we ship to our users as well?
,
Apr 21 2017
> You know the reason why we use 64 bit webview? Is that the version we ship to our users as well?

Correct. We have to ship WebView in both bitness-es. The fact that we use 32 or 64bit webview is not our decision, but depends on the hosting app (e.g., Gmail/Rss reader/Facebook). You can definitely have a (64-bit) phone where some apps run in 32 bit and some others in 64 bit mode, in which cases they will use different WebView binaries to match the corresponding bitness (yay)

Instead, for chrome, we are the "hosting app" so we decide.

From a "shipping" viewpoint, there is very little actually these days that we use monochrome. Essentially on modern versions of android (I never remember when we started supporting monochrome, something around L-M) we ship one single apk which contains two libmonochrome.so(s), one for 32 and one for 64 bit.

Chrome currently uses 32 bit only*
WebView uses both depending on the hosting app.

* Actually this is not 100% true. The Chrome Canary channel is 64 bit, and this is done to keep some live coverage and ensure that the 64 bit version of chrome doesn't rot if one day we change our mind and want to move chrome to 64 bit.
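The selection rules in this comment can be modelled as a toy sketch. It only restates what is said above (as of this thread, April 2017); the function names are hypothetical and this is not how the Android loader or Chrome's build actually expresses the choice.

```python
def webview_bitness(host_app_bitness: int) -> int:
    """WebView follows the hosting app: a 32-bit app gets the 32-bit
    library, a 64-bit app gets the 64-bit one."""
    if host_app_bitness not in (32, 64):
        raise ValueError(f"unexpected bitness: {host_app_bitness}")
    return host_app_bitness


def chrome_bitness(channel: str) -> int:
    """Chrome is its own hosting app: 32-bit everywhere except the
    Canary channel, which is 64-bit to keep that build exercised."""
    return 64 if channel.lower() == "canary" else 32


print(webview_bitness(64))       # 64
print(chrome_bitness("stable"))  # 32
print(chrome_bitness("canary"))  # 64
```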
,
Apr 23 2017
,
Apr 26 2017
Is there any way to verify that the bot is now running on a 32 bit build? I tried digging around on the two bots on the FYI waterfall, and I wasn't able to see anything obvious. bug 714110 is where I'm tracking the work on this.
,
Apr 26 2017
> Is there any way to verify that the bot is now running on a 32 bit build?

Look at a trace? See my comment in #16 (or just attach a trace and I'll check).
,
Apr 26 2017
From the build of 4/26:

Non swarming: https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/trace-file-id_10-2017-04-26_11-25-56-89651.html
os-arch: "aarch64"

Swarming: https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/trace-file-id_10-2017-04-26_07-04-59-14071.html
os-arch: "armv8l"

So it seems like issue 714110 is not fixed yet.
,
May 1 2017
Now that we have demystified the differences due to bitness between swarming & non-swarming in issue 714110, I am not sure whether we still need to block this bug on issue 714110. Since the swarming bots already have the correct config, can we just verify that their numbers look sane & go ahead with the effort of migrating Android bots to swarming? This issue & issue 705136 are the only two left for us to proceed with migrating Android to swarming.

Juan, Primiano: what do you think?

*Having said that, issue 714110 could still be important to fix, but we should decouple it from this if possible.
,
May 2 2017
Looking at https://chromeperf.appspot.com/report?sid=5ff7e515a400a9c42aa9ea5ab0ddb2288f12d352da6a7c89aa849e35d3dcc321 I think the numbers appear reasonable, also a regression can be seen tracked on both swarmed/non-swarmed versions, which gives me confidence. It seems however that the swarmed version runs less often (on larger CL batches), is that known/expected?
,
May 2 2017
> I think the numbers appear reasonable, also a regression can be seen tracked on both swarmed/non-swarmed versions, which gives me confidence.

+1 Agreed that there doesn't seem to be a bitness problem on swarming.
,
May 2 2017
> It seems however that the swarmed version runs less often (on larger CL batches), is that known/expected?

Yes. This is because we have fewer machines in the swarmed version, hence a longer cycle time between builds :-)

Thanks for the analysis, Juan & Primiano. I will mark this bug & the system health ones as fixed. Feel free to reopen if you disagree.
,
May 2 2017
Issue 679800 has been merged into this issue.
Comment 1 by martiniss@chromium.org, Mar 24 2017