Build *custom* ICU data at build-time
Issue description
Currently, ICU data bundles (icudt[lb].dat) are pre-built and checked in for Android (smaller) and for other platforms (bigger; both little-endian and big-endian).
The assembly source files (for Linux, Mac, Android) used to be checked in as well, but are not any more. In their place, a big-endian data bundle is checked in to support v8 on big-endian CPUs.
To cut down the ICU data and include ONLY what's used by Chrome (locale-wise, data-category-wise), a set of shell scripts along with lists of locales, data ids, etc. (in icu/scripts) is run before generating the ICU data bundles.
third_party/icu/source/data used to have this customized version of ICU data source files ('txt' files).
As a step toward streamlining the process and making it easier for non-Chrome users of ICU to tailor the ICU data, third_party/icu/source/data now has the full source files (+ *local.mk and customized converter specs for HTML5).
This scheme has served Chrome well, but non-Chrome users of ICU (e.g. v8, node.js, embedders of Blink, Opera) need to make changes in the locale list/data category list in icu/scripts on top of the checkout from Chrome's ICU.
In addition, iOS Chrome may have different needs as well.
To support all these use cases as well as to simplify ICU data updates (no need to build ICU data and check it in), we'd better switch to building ICU data at build-time. It also has the side benefit of a smaller git repo.
There was a CL from Opera along this direction, but with that approach, ICU data customization was not as flexible as I want it to be.
Now that the switch to GN is well underway, it may be a good time to think about this (we don't have to deal with two build files any more :-)).
What needs to be done:
1) Build ICU tools (e.g. icupkg and others) at build time. Need to write GN files for them (should be straightforward)
2) Translate the makefiles that build ICU data to GN
3) Come up with a data customization spec file format (maybe json?) that can meet Chrome's requirements for fine-grained customization
4) Write a Python script that generates the customized data source files given a data-customization-spec file and the full ICU data tree. The output of this script will be fed to the ICU data build tools made in step 1 via rules written in GN (step 2).
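A rough sketch of what steps 3 and 4 could look like, just to make the idea concrete. Everything here is an assumption for illustration: the JSON layout ("locales", "keep"), the function name filter_resource, and the representation of a parsed resource file as nested dicts are all hypothetical and are not an existing ICU or Chrome format. Parsing and re-serializing the 'txt' sources is left out entirely.

```python
import json

# Hypothetical customization spec (step 3). "keep" maps a data category
# to the tables that survive and, per table, the record keys to retain.
SPEC_JSON = """
{
  "locales": ["en", "fr"],
  "keep": {
    "lang": {"Languages": ["de", "en", "fr"]}
  }
}
"""


def filter_resource(category, locale, resource, spec):
    """Filter one parsed resource file at the record-key level (step 4).

    `resource` is assumed to be a dict of table name -> {record key: value}.
    Returns None if the locale is excluded, the resource unchanged if the
    category has no rule, and otherwise a filtered copy.
    """
    if locale not in spec["locales"]:
        return None  # whole file is dropped
    rules = spec["keep"].get(category)
    if rules is None:
        return resource  # no rule for this category: keep everything
    filtered = {}
    for table, records in resource.items():
        allowed = rules.get(table)
        if allowed is None:
            continue  # table not listed in the spec: drop it
        filtered[table] = {k: v for k, v in records.items() if k in allowed}
    return filtered


spec = json.loads(SPEC_JSON)
# Stand-in for a parsed lang/fr resource; values are illustrative only.
fr_lang = {
    "Languages": {"ach": "acoli", "de": "allemand", "en": "anglais"},
    "Scripts": {"Latn": "latin"},
}
print(filter_resource("lang", "fr", fr_lang, spec))
```

The point of the sketch is the granularity: the spec can drop a single record such as lang/fr/Languages/ach while keeping the rest of the file, which is the level of control Chrome needs.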
One potential issue with GN: to minimize the final data size, the pre-built ICU data is generated in two passes (the first to generate optimal pool resource bundles, the second to generate the other resource bundles so that they use the pool bundles generated earlier). This might be tricky to translate to GN (or not; I don't know yet).
References:
Node.js has tools for data customization.
https://github.com/nodejs/node/blob/master/tools/icu/README.md
https://github.com/nodejs/node/tree/master/tools/icu
The Node.js tooling operates on the binary ICU data bundle using icupkg and other ICU data manipulation tools. I'm afraid that it does not offer a way to include/exclude records within individual ICU data resource files (e.g. lang/fr/Languages/ach). Instead, it operates at the level of whole ICU data resource files (e.g. lang/fr.res). Chrome needs to customize data at the 'record-key' level, so this tool does not meet Chrome's needs yet (afaict).
However, ICU data manipulation tools in the upstream can be improved to allow fine-grained data customization as required by Chrome.
,
Jul 27 2016
Chromecast and users of Chrome's network library would also benefit from this easier/more flexible data customization.
,
Jul 28 2016
,
Jul 28 2016
As for the pool resource optimization, see http://bugs.icu-project.org/trac/ticket/12069
,
May 16 2018
In addition to the two ICU bugs (comment 1 and comment 4), there's a Google internal bug on this issue ( b/77698473 ).
,
May 16 2018
> In addition, iOS Chrome may need have different needs as well.
We now have pre-built data files for iOS Chrome and Chromecast.
,
Sep 8
One more pre-built data file is checked in for Fuchsia. There's an upstream change under review ( https://github.com/unicode-org/icu/pull/82 ), but at least in its current form (if I understand it correctly), it seems to be different from what I envisioned in this bug.
,
Sep 26
Issue 369218 and issue 882860 are about adding more locales to Android. Android may need a second copy of the ICU data file with more locales. See bug 882860 for details. We definitely need to resolve this issue before long.
Comment 1 by js...@chromium.org
, Jul 27 2016