The ItemParallelJob class in v8 is designed to split work into items that can
be executed in parallel, such as the copying of pages from one semi-space to
the other during the page evacuation phase of garbage collection.
The main renderer thread will distribute the items among various worker
threads, contribute to the parallel processing, and then wait for the worker
threads to complete their items before continuing. The number of extra worker
threads used is currently always the number of available CPU cores minus one.
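As a rough illustration of this pattern (this is not v8's actual ItemParallelJob API; the names below are invented for the sketch), the work-claiming loop looks something like the following, with the main thread both distributing work and participating in it:

#include <atomic>
#include <thread>
#include <vector>

struct Item {
  virtual void Process() = 0;
  virtual ~Item() = default;
};

// Schematic sketch of the claiming pattern: worker threads and the main
// thread all pull items from a shared counter until none remain.
void RunJob(std::vector<Item*>& items, int num_worker_threads) {
  std::atomic<size_t> next{0};
  auto worker = [&] {
    for (size_t i = next.fetch_add(1); i < items.size();
         i = next.fetch_add(1)) {
      items[i]->Process();
    }
  };
  std::vector<std::thread> workers;
  for (int i = 0; i < num_worker_threads; ++i) workers.emplace_back(worker);
  worker();  // The main thread contributes to the parallel processing...
  for (auto& t : workers) t.join();  // ...then waits for the workers.
}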
On heterogeneous systems, this can cause delays: the main renderer thread,
which is generally already running on a 'big' core, must wait for items
claimed by threads running on 'little' cores to finish. Looking only at
situations where, for instance, 3 pages are being evacuated, the mean time
taken to perform this evacuation is roughly 2ms, but it has been seen to take
up to 8ms.
These timescales are too short for the tasks to be considered for migration.
Modifying the scheduler to be overly aggressive in migration may have other
detrimental effects on the system.
In this case, we consider the use of ItemParallelJobs for parallel page
evacuation.
We try to reduce the worst-case and mean evacuation times in two different ways:
1. Reducing the number of available threads when the number of pages is
particularly small.
2. Sorting the items to be handled by the worker threads, so that the main
renderer thread is given priority access to the items that will take longer.
This is done to balance the work done by the different cores.
The first method has been tuned for a Chromebook with two big and four little
cores, and simply uses half as many threads as there are pages to evacuate
when fewer than 12 pages are being evacuated (or, more generally, when the
number of pages being evacuated is less than twice the number of cores).
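A minimal sketch of this heuristic (the function name and exact shape are ours, not the patch's actual code) might look like:

#include <algorithm>

int NumEvacuationThreads(int pages, int cores) {
  // Default: one thread per available core, minus the main thread's core.
  int threads = cores - 1;
  // With fewer pages than twice the core count, use half as many threads as
  // there are pages, so little cores are less likely to hold up the job.
  // Note that pages < 2 * cores implies pages / 2 <= cores - 1.
  if (pages < 2 * cores) threads = std::max(1, pages / 2);
  return threads;
}

On the tuned device (6 cores), this kicks in below 12 pages, matching the description above.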
The second method involves estimating the complexity of any given item. In the
case of page evacuation, we hypothesised that the time taken to evacuate a page
is strongly correlated with the number of live bytes on the page, a metric
that is already tracked for each page. Through experimental analysis on
Speedometer and Speedometer2, it was observed that the Spearman correlation
between the live_bytes count and the time taken to evacuate a page was 0.79.
We hypothesised
that the number of live objects might hold a stronger correlation, but analysis
of this showed that the object count has a Spearman correlation of 0.81 with
the duration. This slightly increased correlation may not be enough to justify
the extra overhead of tracking the live object count as well, but may be an
option for the future.
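As an illustration of the second method, a sorting step along these lines (with invented type and field names, not the patch's actual code) could present the costliest pages first, so that the main thread on a big core claims them before the little-core workers do:

#include <algorithm>
#include <cstddef>
#include <vector>

struct EvacuationItem {
  size_t live_bytes;  // Already tracked per page by the collector.
};

// Order items descending by estimated cost (live bytes, which the analysis
// above found to correlate with evacuation time), so the items claimed
// first are the most expensive ones.
void SortByEstimatedCost(std::vector<EvacuationItem>& items) {
  std::sort(items.begin(), items.end(),
            [](const EvacuationItem& a, const EvacuationItem& b) {
              return a.live_bytes > b.live_bytes;
            });
}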
The patch we propose has been seen to reduce the evacuation time in Speedometer
on "kevin" as follows:
* 3% average reduction across all counts of pages being evacuated
* 11% average reduction when fewer than 7 pages are being evacuated
* the worst case across all page counts is reduced by 29%
Attached to this bug are some graphs showing the individual changes in mean and
maximum evacuation times for each count of pages in Speedometer. We believe
we are seeing an overall positive effect from this patch, particularly when
considering the reduction in worst-case evacuation times.
Furthermore, we have tested this on a Chromebook with two big and two little
cores ("elm"), and also see encouraging results there. Particularly:
* 9% average reduction across all counts of pages being evacuated
* 28% average reduction when fewer than 7 pages are being evacuated
Detailed results across all page evacuation counts are also attached to this
bug.
We hope this notion of complexity can also be extended to the other classes
that use the ItemParallelJob class in v8.
A third way of reducing the average and worst-case evacuation times could
also be explored, but it requires the presence of features like SchedTune.
On most heterogeneous systems, the scheduler subsystem that determines where
tasks should be placed, and what frequency is required to execute them, bases
its decisions primarily on the utilisation history of tasks. The
ItemParallelJob work described in this bug is typically performed by
TaskSchedulerForegroundWorker threads, which are part of the task scheduler
system in the Chrome browser. At the point of use of an ItemParallelJob, it is
likely that these threads have a low utilisation history, and so they are
scheduled onto little cores at low frequencies, making GC times longer.
This can be addressed through the use of systems like SchedTune, which can be
used to artificially boost the utilisation history of certain tasks when they
are being scheduled. This third method will therefore require the
TaskSchedulerForegroundWorker threads to be placed in a separate cgroup, and
given a schedtune.boost value (which inflates their utilisation by a percentage
of the remaining capacity of a big CPU) to ensure they are scheduled
favourably. It should also be possible to adjust this value dynamically so
that the boosting only happens during v8's GC periods, if it is seen to be
detrimental at other times or to impact battery life significantly.
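For illustration only, applying such a boost through the SchedTune cgroup interface might look roughly like the following. The mount point, group name, and helper function are assumptions (SchedTune is commonly mounted at /dev/stune on Android; ChromeOS setups may differ), and error handling is omitted:

#include <fstream>
#include <string>

// Hypothetical helper: place a worker thread in a boosted SchedTune group.
void BoostWorker(int tid, int boost_percent) {
  const std::string group = "/dev/stune/v8_gc_workers";  // assumed group
  // schedtune.boost inflates the task's apparent utilisation by the given
  // percentage of the remaining capacity of a big CPU.
  std::ofstream(group + "/schedtune.boost") << boost_percent;
  // Writing the TID into the group's tasks file applies the boost to it.
  std::ofstream(group + "/tasks") << tid;
}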
This still needs to be tested, but in the meantime the first and second
methods proposed provide benefit to the GC evacuation phase, particularly on
heterogeneous systems where SchedTune is not available.
[Attachments (since deleted): kevin-average-change.png, kevin-99pct-change.png,
kevin-max-change.png, elm-average-change.png, elm-99pct-change.png,
elm-max-change.png]
Comment 1 by stephen....@arm.com, Jun 27 2018
Components: Blink>JavaScript>GC Internals>TaskScheduler