New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 600271 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: May 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug-Regression

Blocked on:
issue 592903



Sign in to add a comment

15.9% regression in blink_perf.{canvas|parser} at 384321 - 384621

Project Member Reported by toyoshim@chromium.org, Apr 4 2016

Issue description

See the link to graphs below.
 
All graphs for this bug:
  https://chromeperf.appspot.com/group_report?bug_id=600271

Original alerts at time of bug-filing:
  https://chromeperf.appspot.com/group_report?keys=agxzfmNocm9tZXBlcmZyFAsSB0Fub21hbHkYgICgyM7uvAoM


Bot(s) for this bug's original alert(s):

linux-release
Summary: 15.9% regression in blink_perf.css at 384321 - 384621 (was: 10.5% regression in blink_perf.css at 384640:384665)
Summary: 15.9% regression in blink_perf.{canvas|parser} at 384321 - 384621 (was: 15.9% regression in blink_perf.css at 384321 - 384621)
Cc: esprehn@chromium.org
Owner: esprehn@chromium.org

=== Auto-CCing suspected CL author esprehn@chromium.org ===

Hi esprehn@chromium.org, the bisect results pointed to your CL below as possibly
causing a regression. Please have a look at this info and see whether
your CL be related.


===== BISECT JOB RESULTS =====
Status: completed


===== SUSPECTED CL(s) =====
Subject : WTF::ThreadSpecific should use base::ThreadLocalStorage.
Author  : esprehn
Commit description:
  
Let's use the base abstraction to allocate the
TLS. It's way less code.

Review URL: https://codereview.chromium.org/1849023002

Cr-Commit-Position: refs/heads/master@{#384650}
Commit  : 56641fdb889c6b5d27997465b622d7432c0ba0b8
Date    : Fri Apr 01 19:01:26 2016


===== TESTED REVISIONS =====
Revision                Mean Value  Std. Dev.   Num Values  Good?
chromium@384639         33.417577   0.09856     5           good
chromium@384646         33.452531   0.033532    5           good
chromium@384649         33.292346   0.206809    5           good
chromium@384650         29.75967    0.064093    5           bad         <-
chromium@384651         29.629061   0.227256    4           bad
chromium@384652         29.821315   0.134139    5           bad
chromium@384665         30.038369   0.099122    5           bad

Bisect job ran on: linux_perf_bisect
Bug ID: 600271

Test Command: src/tools/perf/run_benchmark -v --browser=release --output-format=chartjson --also-run-disabled-tests blink_perf.css
Test Metric: SelectorCountScaling/SelectorCountScaling
Relative Change: 10.11%
Score: 99.9

Buildbot stdio: http://build.chromium.org/p/tryserver.chromium.perf/builders/linux_perf_bisect/builds/6368
Job details: https://chromeperf.appspot.com/buildbucket_job_status/9016324266525394080


Not what you expected? We'll investigate and get back to you!
  https://chromeperf.appspot.com/bad_bisect?try_job_id=600271

| O O | Visit http://www.chromium.org/developers/speed-infra/perf-bug-faq
|  X  | for more information addressing perf regression bugs. For feedback,
| / \ | file a bug with component Tests>AutoBisect.  Thank you!

===== BISECT JOB RESULTS =====
Status: completed


=== Bisection aborted ===
The bisect was aborted because The metric values for the initial "good" and "bad" revisions do not represent a clear regression.
Please contact the the team (see below) if you believe this is in error.

=== Warnings ===
The following warnings were raised by the bisect job:

 * Bisect failed to reproduce the regression with enough confidence.

===== TESTED REVISIONS =====
Revision                Mean Value  Std. Dev.   Num Values  Good?
chromium@384311         132.470249  8.926027    18          good
chromium@384648         141.956762  17.609913   18          bad

Bisect job ran on: android_nexus5_perf_bisect
Bug ID: 600271

Test Command: src/tools/perf/run_benchmark -v --browser=android-chromium --output-format=chartjson --also-run-disabled-tests blink_perf.canvas
Test Metric: putImageData/putImageData
Relative Change: 5.38%
Score: 0

Buildbot stdio: http://build.chromium.org/p/tryserver.chromium.perf/builders/android_nexus5_perf_bisect/builds/3555
Job details: https://chromeperf.appspot.com/buildbucket_job_status/9016321568622720480


Not what you expected? We'll investigate and get back to you!
  https://chromeperf.appspot.com/bad_bisect?try_job_id=600271

| O O | Visit http://www.chromium.org/developers/speed-infra/perf-bug-faq
|  X  | for more information addressing perf regression bugs. For feedback,
| / \ | file a bug with component Tests>AutoBisect.  Thank you!
Labels: Needs-Bisect
There's a v8 roll and a skia roll in there, when looking at the regressions that's looking pretty likely. This needs another bisect.
Cc: haraken@chromium.org
Components: Blink>Architecture
I kicked off a few more bisects.

===== BISECT JOB RESULTS =====
Status: completed


===== SUSPECTED CL(s) =====
Subject : WTF::ThreadSpecific should use base::ThreadLocalStorage.
Author  : esprehn
Commit description:
  
Let's use the base abstraction to allocate the
TLS. It's way less code.

Review URL: https://codereview.chromium.org/1849023002

Cr-Commit-Position: refs/heads/master@{#384650}
Commit  : 56641fdb889c6b5d27997465b622d7432c0ba0b8
Date    : Fri Apr 01 19:01:26 2016


===== TESTED REVISIONS =====
Revision                Mean Value  Std. Dev.   Num Values  Good?
chromium@384639         7.5468      0.476513    5           good
chromium@384648         7.6562      0.581595    5           good
chromium@384649         7.3844      0.416415    5           good
chromium@384650         10.9736     0.728737    5           bad         <-
chromium@384651         10.7542     0.883665    5           bad
chromium@384653         10.7615     0.830193    4           bad
chromium@384657         10.4022     0.751494    5           bad
chromium@384674         10.297      0.558605    5           bad

Bisect job ran on: linux_perf_bisect
Bug ID: 600271

Test Command: src/tools/perf/run_benchmark -v --browser=release --output-format=chartjson --also-run-disabled-tests blink_perf.shadow_dom
Test Metric: DistributionWithMultipleShadowRoots/DistributionWithMultipleShadowRoots
Relative Change: 36.44%
Score: 99.9

Buildbot stdio: http://build.chromium.org/p/tryserver.chromium.perf/builders/linux_perf_bisect/builds/6372
Job details: https://chromeperf.appspot.com/buildbucket_job_status/9016275075056859584


Not what you expected? We'll investigate and get back to you!
  https://chromeperf.appspot.com/bad_bisect?try_job_id=600271

| O O | Visit http://www.chromium.org/developers/speed-infra/perf-bug-faq
|  X  | for more information addressing perf regression bugs. For feedback,
| / \ | file a bug with component Tests>AutoBisect.  Thank you!
I reverted my patch in https://codereview.chromium.org/1856793002

Lets see if this recovers. It's hard for me to imagine how using Chromium's TLS abstractions would be slower for linux and android, but not for Mac or Windows, and why that would impact these few benchmarks. Maybe it made Oilpan slower?
sullivan@ How do I know what bots run what tests? The perf dashboard seems to list lots of machines, but the tests only run on a couple, ex. putImageData seems to only run on Android now? :/
The bots seem to be recovering now that my revert has landed. I'm studying the difference between Chromium TLS and Blink TLS:

- Blink uses the pthread API directly on OS X, Android and Linux.
- Blink uses a single TLS value on Windows, with a 256 slot array to store values.
- Chromium always uses a 256 slot array, on every platform.
- Chromium has base::subtle::Atomic32 accesses all over the ThreadLocalStorage code. Both for the slot in each ThreadLocalStorage instance, and for the initialized_ variable which most code, and blink, never uses.

Indeed the Chromium TLS stuff is going to be slower if it needs to do atomic math all the time, and Blink uses this for everything from the AtomicString table, to the Oilpan ThreadState. I'm not sure what to do here.

The atomics were added recently to fix races too, it's a shame Chromium doesn't have micro benchmarks:
https://codereview.chromium.org/1222123002
Cc: michaelbai@chromium.org jsc...@chromium.org robliao@chromium.org jyasskin@chromium.org
Cc: danakj@chromium.org thakis@chromium.org thestig@chromium.org

Comment 15 by ojan@chromium.org, Apr 5 2016

Cc: ojan@chromium.org
Cc: glider@chromium.org amistry@chromium.org
Does perf look better without https://codereview.chromium.org/1222123002 ?
Can someone from the base/ team look into what's going on here?
Release stores and acquire loads would be slower than non-atomic accesses on ARM but equal speed on x86. Is that the pattern you saw?
initialized() appears to only be there for tests, with two possible exceptions? 

1. https://code.google.com/p/chromium/codesearch#chromium/src/components/url_formatter/url_formatter.cc&rcl=1459797464&l=382

2. https://code.google.com/p/chromium/codesearch#chromium/src/base/trace_event/heap_profiler_allocation_context_tracker.cc&rcl=1459797464&l=70

Which both guard calls to Initialize(), maybe those could store the initialize externally?
The good news is that reverting my patch is a consistent 17% difference on my Z620 on ShadowDOM/DistributionWithMultipleShadowRoots. I tried just commenting out the initialize_ code and that made no difference. The performance difference must be somewhere else.

Could someone who works on base look at that benchmark with and without my patch?
Are the Linux perf bots affected by  crbug.com/592903 ?
Unless the android results were a lie, this affected android which uses libc++ also.
It looks like android did not recover when linux did from the revert though: https://chromeperf.appspot.com/report?sid=1c27837a287b6f655d5d36247d96f65f2337184d7fda38b5d03ae00950645ba3&rev=384648

So maybe this is  crbug.com/592903  and the android regression is unrelated.
I'd like to split the (android) canvas regression to another bug, but I don't seem to have permissions on the perf dashboard to do so.
I concur, please break-out the the nexus5 canvas regression into a separate issue, and kick off a bisect job for that specific regression.  You can assign the new issue to me for further triage. Seems I don't have permissions to do this either.
I split the canvas regression into  bug 600786 . Note that there is no ref build for that regression and it only occurred on one bot, so it could be noise. Bisect kicked off.

Also note that after the monorail migration, dashboard switched to having users log in with chromium.org accounts; that might be the reason people are having problems with permissions.
The slowness is from ThreadLocalStorage::StaticSlot::Get(), it does base::subtle::NoBarrier_Load for every single Get(), and blink didn't need to do that.

How can we change base's ThreadLocalStorage to not do atomic ops for every single Get() ?
How fast are mutexes/locks these days?

The alternative approach here is to perform TLS initialization very early in the startup process (e.g. no threads running around), guaranteeing that the TLS key is set and avoiding the need to NoBarrier_Load it. 
Indeed, Blink has initialize/shutdown to avoid having to do things like this at runtime.

Another alternative is to not use the single shared TLS value with 256 slot array except on Windows. Blink did that and used pthread_key_create directly for all other platforms avoiding the need for any globals except on windows.
My understanding is that we also needed this scheme for Android, which is TLS constrained.

For Windows, the minimum guaranteed is 64, but can go up to 1088 (don't know which versions offhand).
Blink didn't special case Android before, it had two modes, Windows and POSIX. http://mobilepearls.com/labs/native-android-api/#pthreads claims Android supports 59 slots, I wonder how close we are to that in practice.

Anyway, it seems like maybe we do want the array, but want separate initialization of the shared slot to avoid the atomics?
In either case, I've been looking at this from the implementation side off and on after I stumbled onto two different implementations of the Chrome TLS system: ThreadLocalStorage and ThreadLocal[Pointer|Boolean]. Oddly enough, ThreadLocal[Pointer|Boolean] uses ThreadLocalStorage only on Android. Native implementations are used for Windows and POSIX. ( http://crbug.com/588824 )

Having the system TLS slot set at process initialization makes a lot of sense to me. 

Cc: jfb@chromium.org
Comment #27 is  issue 592903 , which causes NoBarrier_Load to generate

   0:   0f ae f0                mfence
   3:   8b 05 00 00 00 00       mov    0x0(%rip),%eax        # 9 <test_atomic_func()+0x9>
   9:   0f ae f0                mfence

instead of the

   0:   8b 05 00 00 00 00       mov    0x0(%rip),%eax        # 6 <test_racy_func()+0x6>

it ought to generate. The problem isn't that there are atomic operations, it's that they're implemented incredibly badly in the gcc-4.6 standard library that we're using on Linux.
On x86 (with a functioning compiler and sysroot), NoBarrier_Load results in just a load. No barriers or lock instructions. I haven't looked at the code on ARM to see what it does.
Labels: -Needs-Bisect
Navigated to the Graphs and noticed that the alerts are getting recovered.
Could some on please look into the issue and update.
Note: Removing Needs Bisect label as of now, please feel free to add if required.
Thank You.
Blockedon: 592903
Sounds like this is blocked on  bug 592903  then.
This also regressed two Android bots according to the dashboard:

https://chromeperf.appspot.com/report?sid=92d3e9794bd0b21013448e3ddd1ba11bc74cd20e6908864248c7244f77e64d8c&rev=384621

https://chromeperf.appspot.com/report?sid=8b9d507a960ea2ba3e7e319f44f39cab32f09d06df2028a3aa365a903a2af904&rev=384633

We could fork those and run the bisect again, but I why do we think atomics on ARM are also "free" like x86?

Comment 38 by jfb@chromium.org, Apr 6 2016

#37 *relaxed* load/store are cheap on all platforms.
Details: https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
The first android bot recovered but I don't see the related revert for this bug in that run? The second doesn't look like a regression on the graph.
Status: Fixed (was: Assigned)
This is fixed, but we can't use base::ThreadLocalStorage until the underlying cause is fixed.

Comment 41 by tkent@chromium.org, Jun 23 2016

Components: -Blink>Architecture Blink>Internals
Renaming Blink>Architecture to Blink>Internals

Sign in to add a comment