
Issue 705136 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: May 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux
Pri: 2
Type: Bug

Blocked on:
issue 714110
issue 714471

Blocking:
issue 705134




Compare android runs of memory.top_10_mobile both on and off swarming

Project Member Reported by martiniss@chromium.org, Mar 24 2017

Issue description

We need to see if it's feasible to run memory.top_10_mobile on android single device swarming.

 
Blocking: 705134
Cc: -nednguyen@chromium.org nedngu...@google.com
https://chromeperf.appspot.com/report?sid=5ff7e515a400a9c42aa9ea5ab0ddb2288f12d352da6a7c89aa849e35d3dcc321 is a graph which compares swarmed and non-swarmed bots. Looks like we're using less memory on swarming? I don't know the benchmark well enough to know if I'm using the data correctly.
Cc: hjd@chromium.org
Owner: perezju@chromium.org
Status: Assigned (was: Available)
+Juan, Hector: can you help us verify whether the memory data on the swarmed Android bots look sane compared with the existing bots?

Looks like the total memory metric on the swarmed Android bots is lower than on the non-swarmed bots, which is a good thing. But I am not competent enough in this space to make the final call :-)
My first suspicion would be device differences.

The non-swarmed bot, from the "device status" step, seems to be using: bullhead-userdebug 6.0.1 MMB29Q 2480792 dev-keys

https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf%2FAndroid_Nexus5X_Perf__2_%2F3475%2F%2B%2Frecipes%2Fsteps%2Fdevice_status%2F0%2Flogs%2Fjson.output%2F0

For the swarmed bot, from the "swarming.summary" step (is there a better way to look up this info?), I can see at least: bullhead MDB08O

https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf.fyi%2FAndroid_Swarming_N5X_Tester%2F486%2F%2B%2Frecipes%2Fsteps%2Fmemory.top_10_mobile_on_Android%2F0%2Flogs%2Fswarming.summary%2F0

So I would start by making both use the same Android build.
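(For reference, a minimal sketch, not from this thread, of one way to confirm that two attached devices are on the same Android build before comparing their numbers. It assumes direct adb access to the devices rather than the devil/catapult tooling the bots actually use.)

import subprocess

def build_fingerprint(serial):
    # e.g. 'google/bullhead/bullhead:6.0.1/MMB29Q/2480792:userdebug/dev-keys'
    return subprocess.check_output(
        ['adb', '-s', serial, 'shell', 'getprop', 'ro.build.fingerprint'],
        universal_newlines=True).strip()

def attached_serials():
    # Skip the 'List of devices attached' header line.
    lines = subprocess.check_output(
        ['adb', 'devices'], universal_newlines=True).splitlines()[1:]
    return [line.split()[0] for line in lines if line.strip()]

for serial in attached_serials():
    print(serial, build_fingerprint(serial))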
To #4: thanks, Juan. I filed issue 709970 about that.
perezju@, can you take a look at the graph?
Great, now the difference is even larger!

I'll have a careful look at device provisioning going on at each of the bots, maybe the difference is there.

Will post back here when I have any details.
Cc: bpastene@chromium.org
bpastene@ knows about how the devices are provisioned on swarming.
Swarming device provisioning and pre/post-task cleanup can be found at https://chrome-internal.googlesource.com/infradata/config/+/master/configs/chromium-swarm/scripts/android.py

It's generally a subset of what provision_devices.py does + some cpu governor logic. That might be a culprit here, but IIUC perf tests do their own governor management on devices, so it would probably just overwrite what swarming does.
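(To illustrate the governor management mentioned above: a sketch only, not the actual android.py or provision_devices.py logic. The governor is a per-CPU sysfs file, so both swarming and the test harness end up writing to the same files, and whichever writes last wins.)

import subprocess

GOVERNOR = '/sys/devices/system/cpu/cpu{}/cpufreq/scaling_governor'

def adb_shell(serial, cmd):
    return subprocess.check_output(
        ['adb', '-s', serial, 'shell', cmd], universal_newlines=True).strip()

def set_governor(serial, governor, num_cpus=6):
    # bullhead (Nexus 5X) has 6 cores; writing the sysfs file requires root.
    for cpu in range(num_cpus):
        path = GOVERNOR.format(cpu)
        adb_shell(serial, 'echo %s > %s' % (governor, path))
        print('cpu%d -> %s' % (cpu, adb_shell(serial, 'cat %s' % path)))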
I had a look comparing logs on two runs:

Swarming:
https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf.fyi%2FAndroid_Swarming_N5X_Tester%2F528%2F%2B%2Frecipes%2Fsteps%2Fmemory.top_10_mobile_on_Android%2F0%2Fstdout

Non-swarming:
https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf%2FAndroid_Nexus5X_Perf__2_%2F3582%2F%2B%2Frecipes%2Fsteps%2Fmemory.top_10_mobile%2F0%2Fstdout

Things that I did check and that show no significant differences:
- chrome command line flags (including field trials)
- hardware details (as reported by _LogBrowserInfo)
- tracing configs

Generally on the surface those logs look comparable.

I compared a few other memory metrics here:
https://chromeperf.appspot.com/report?sid=c5b3c0ba0329cd9b9787ec595dcac6b45d4cad4aae500867ad21161313710c42

Highlights:
- Flashing the device (in #5 above) did not change how much memory Chrome thinks it's using (by_chrome:effective_size).
- The drop after flashing is only visible on by_os:resident_size; wondering if this has something to do with other processes running on the device?
- Anyway, even within Chrome, there *are* some significant differences in memory usage (e.g. swarming uses 3.5 MiB less on the Wikipedia page)

I've pulled a couple of traces:

Swarming:
https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/trace-file-id_10-2017-04-21_00-17-13-58626.html

Non-swarming:
https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/trace-file-id_10-2017-04-20_23-04-31-32035.html

But nothing jumps out to me that would explain the differences.

+primiano in case he has more ideas on things to check?

bpastene@ do we keep logs of what happens during device provisioning in swarming?

I couldn't find anything relevant e.g. at:
https://chromium-swarm.appspot.com/task?id=35a688cda4689710&refresh=10&show_raw=1
Cc: primiano@chromium.org
actually +primiano, see #11 above
Err, sorry for my silly comment, I realize that we didn't enable atrace on system_health.common_mobile.

We currently only enable atrace on power.idle_platform. Lemme see if we can pull out the trace from there.


atrace is failing on the swarming Nexus 5X due to a BattOr command error:
(https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf.fyi%2FAndroid_Power_Nexus_5X_Perf%2F655%2F%2B%2Frecipes%2Fsteps%2Fpower.idle_platform%2F0%2Fstdout)
There is also a weird warning about the atrace command there:
(WARNING) 2017-04-20 17:49:51,092 adb_wrapper.Shell:499  exit status of shell command 'atrace -z -b 4096 sched freq gfx view dalvik webview input disk am wm rs --async_dump' missing.


However, I was able to grab the trace & recover it manually (see the attached file).

And here is the atrace of android-nexus5X (non-swarming): https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/trace-file-id_0-2017-04-19_02-32-49-98639.html

Attachment: fyi-android-swarming-n5x-trace.html (3.3 MB)
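(In case anyone needs to repeat the manual recovery: a rough sketch of what grabbing a compressed async atrace dump can look like. Assumptions: an async trace was already started with atrace --async_start, and the -z payload is zlib-compressed behind a "TRACE:" marker as atrace documents; this is not necessarily the exact procedure used for the attached file.)

import subprocess
import zlib

ATRACE_DUMP = ('atrace -z -b 4096 sched freq gfx view dalvik webview '
               'input disk am wm rs --async_dump')

# exec-out avoids the CR/LF mangling that plain 'adb shell' applies to
# binary output.
raw = subprocess.check_output(['adb', 'exec-out', ATRACE_DUMP])

# The compressed payload follows a 'TRACE:' marker.
payload = raw[raw.index(b'TRACE:') + len(b'TRACE:'):].lstrip(b'\r\n')

with open('atrace.txt', 'wb') as f:
    f.write(zlib.decompress(payload))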
TL;DR: swarming is running Chrome in 64-bit mode instead of 32-bit.

I looked at the traces in #11
1) The tracing overhead is 2x (60 MB vs 30MB) in the non-swarming case. This all comes from the TraceBufferVector size. From the trace metadata, in both cases tracing is started with "record-as-much-as-possible", which in turn causes a fixed size of the vector (kTraceEventVectorBigBufferChunks). The only thing that could make a difference that comes to my mind is 32 vs 64 bit.

2) In the trace metadata, one is os-arch: "armv8l", the other one is os-arch: "aarch64".

3) If I look at the virtual addresses (select any blue column in memory-infra, and expand the "Stack" section), the swarming trace is definitely running in a 64-bit address space, while the non-swarming trace is running in 32-bit mode.

This would explain other things like the v8 heap being bigger.
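(For future reference, the os-arch check from point 2 is easy to script once you have the trace as JSON; a minimal sketch, assuming the standard {"traceEvents": [...], "metadata": {...}} layout. The .html traces linked above would first need the embedded JSON extracted, or the trace re-saved as .json.)

import json
import sys

with open(sys.argv[1]) as f:
    trace = json.load(f)

# 'armv8l' means a 32-bit userspace process on a 64-bit kernel;
# 'aarch64' means the browser is running as a 64-bit process.
print('os-arch:', trace.get('metadata', {}).get('os-arch'))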
Thanks for the analysis Primiano!

Annie: are the Android perf bots supposed to use 32-bit or 64-bit? Are we using the same bit-mode of Chrome everywhere on the current Android perf waterfall?
I think they are meant to test what we ship, which is 32-bit

% WebView (i.e. except WebView), for which I seem to recall we currently test the 64-bit version
Blockedon: 714110

Comment 20 Deleted

Thanks, I filed issue 714110 to make sure we use 32-bit Chrome on the Android swarming bots as well.

Do you know the reason why we use 64-bit WebView? Is that the version we ship to our users as well?
> Do you know the reason why we use 64-bit WebView? Is that the version we ship to our users as well?
Correct. We have to ship WebView in both bitnesses. Whether a 32- or 64-bit WebView is used is not our decision, but depends on the hosting app (e.g., Gmail / RSS reader / Facebook). You can definitely have a (64-bit) phone where some apps run in 32-bit and others in 64-bit mode, in which case they will use different WebView binaries to match the corresponding bitness (yay).
For Chrome, instead, we are the "hosting app", so we decide.
From a "shipping" viewpoint there is actually very little difference these days, since we use Monochrome. Essentially, on modern versions of Android (I never remember when we started supporting Monochrome, something around L-M) we ship one single APK which contains two libmonochrome.so(s), one for 32-bit and one for 64-bit.
Chrome currently uses 32-bit only*
WebView uses both, depending on the hosting app.

* Actually this is not 100% true. The Chrome Canary channel is 64-bit; this is done to keep some live coverage and ensure that the 64-bit version of Chrome doesn't rot if one day we change our mind and want to move Chrome to 64-bit.


Blockedon: 714471
Is there any way to verify that the bot is now running a 32-bit build? I tried digging around on the two bots on the FYI waterfall, and I wasn't able to see anything obvious.

Bug 714110 is where I'm tracking the work on this.
> Is there any way to verify that the bot is now running a 32-bit build?
Look at a trace? See my comment in #16 (or just attach a trace and I'll check).
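(A quick device-side check is also possible, besides looking at a trace; a hedged sketch below. The package name is an assumption, since the perf bots may install a different Chrome/Chromium APK; substitute whatever 'adb shell pm list packages' shows for the build under test.)

import subprocess

PACKAGE = 'org.chromium.chrome'  # assumption; use the package the bot installs

out = subprocess.check_output(
    ['adb', 'shell', 'dumpsys', 'package', PACKAGE],
    universal_newlines=True)

for line in out.splitlines():
    # primaryCpuAbi=arm64-v8a => 64-bit install, armeabi-v7a => 32-bit.
    if 'primaryCpuAbi' in line or 'secondaryCpuAbi' in line:
        print(line.strip())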
Now that we have demystified the differences due to bitness between swarming & non-swarming in issue 714110, I am not sure whether we still need to block this bug on issue 714110.

Since the swarming bots already have the correct config, can we just verify that their numbers look sane & go ahead with the effort of migrating Android bots to swarming?

This issue & issue 705136 are the only two left for us to proceed with migrating Android to swarming.

Juan, Primiano: what do you think?

*Having said that, issue 714110 could still be important to fix, but we should decouple it from this if possible.
Looking at
https://chromeperf.appspot.com/report?sid=5ff7e515a400a9c42aa9ea5ab0ddb2288f12d352da6a7c89aa849e35d3dcc321

I think the numbers appear reasonable, also a regression can be seen tracked on both swarmed/non-swarmed versions, which gives me confidence.

It seems however that the swarmed version runs less often (on larger CL batches), is that known/expected?
> I think the numbers appear reasonable, also a regression can be seen tracked on both swarmed/non-swarmed versions, which gives me confidence.
+1

Agreed that there doesn't seem to be a bitness problem on swarming.
Status: Fixed (was: Assigned)
> It seems however that the swarmed version runs less often (on larger CL batches), is that known/expected?

Yes. This is because we have fewer machines in the swarmed version, hence a longer cycle time between builds :-)

Thanks for the analysis, Juan & Primiano. I will mark this bug & the system health ones as fixed. Feel free to reopen if you disagree.
Issue 679800 has been merged into this issue.
