New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 738566 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Jul 2017
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug



Sign in to add a comment

Investigate pak file duplicate entries

Project Member Reported by agrieve@chromium.org, Jun 30 2017

Issue description

Running:
unzip ChromePublic.apk assets/en-US.pak
tools/grit/pak_util.py print assets/en-US.pak | cut -d: -f2 | sort | uniq -c | sort -rn

Shows:
      8  This server could not prove that it is <strong>$1</strong>; ...
      4  You attempted to reach <strong>$1</strong>, but the server p...
      4  Status
      4  Reload
      4  Other…
      4  Exp
      4  Details
      4  Block
      4  Allow
      3  Unknown error
      3  Times New Roman
      3  Save
      3  OK
      3  Never
      3  Name
      3  Learn more
      3  Hide details
      3  Google Safe Browsing recently <a href="#" id="diagnostic-lin...
      3  Continue
      3  Chromium
      3  <a jsvalues="href
      3  $1, $2
      2  ZIP code
      2  Year
      2  Week
      2  Waiting for $1…
      2  video
      2  User
      2  Update
      2  Today
      2  This ZIP code format is not recognized. Example of a valid Z...
      2  This ZIP code does not appear to match the rest of this addr...
      2  This postal code format is not recognized. Example of a vali...
      2  This postal code does not appear to match the rest of this a...
      2  The certificate chain for this site contains a certificate s...
      2  Submit
      2  <strong jscontent="hostName"></strong> took too long to resp...
      2  State
      2  Show saved copy
      2  Show original
      2  Settings
      2  Secure connection
      2  {SECONDS, plural, =1 {1 second} other {# seconds}}
      2  Refresh
      2  Province
      2  Prefecture
      2  Postal code
      2  Please lengthen this text to $2 characters or more (you are ...
      2  play on remote device
      2  Parish
      2  Not secure
      2  Notifications
      2  None
      2  Network error
      2  Month
      2  {MINUTES, plural, =1 {1 minute} other {# minutes}}
      2  It may have been moved or deleted.
      2  Island
      2  If you understand the risks to your security, you may <a hre...
      2  {HOURS, plural, =1 {1 hour} other {# hours}}
      2  Guest
      2  Further, this page includes other resources which are not se...
      2  exit full screen
      2  Emirate
      2  District
      2  Deprovisioned
      2  Department
      2  County
      2  {COUNT, plural,\n     =0 {None}\n     =1 {1 item}\n     other {...
      2  control remote playback
      2  Close
      2  Chromium Autofill settings…
      2  Chromium Apps
      2  Back to safety
      2  audio
      2  Area
      2  $1 with $2
      2  {1, plural,\n     =1 {This server could not prove that it is ...
      1  yyyy
<snip>

Running:
strings en-US.pak | grep "Times New Roman"

shows that the strings are actually being stored multiple times.

While some of this duplication is probably unnecessary, duplication of strings for translation can also happen for words that are the same in english, but have different translations depending on context / semantics.

Rather than audit the strings, we should change the .pak file format to accommodate duplicates.

Idea #1:
Store two tables at the start of the file. One that maps IDs->offsets, and the other that maps IDs->IDs for duplicates.
* This would mean doing two binary searches for each ID lookup at runtime.

Idea #2:
Store just one table, but have duplicates point to the same offset within the file.
* This would require us to change the runtime logic that calculates: entry_length = next_offset - my_offset. It would instead become: entry_length = next_offset_that_is_larget_than_my_offset - my_offset.

Idea 2 adds in an O(n) search, but in reality, given the small number of duplicates, I suspect it will always be O(less than 3)



 
Status: Started (was: Available)
I implemented this idea. WIP CL here:
https://codereview.chromium.org/2969123002

For ChromePublic.apk, it reduced the uncompressed pak size by 118kb, but the compressed pak size by just 12kb.

For MonochromePublic.apk reduced uncompressed pak by 130kb and compressed by 32kb.

The compressed savings are somewhat small, but still probably worth it since it's not much effort.
Did a brainstorm with flackr about the file format, and we're going to go with:

1. Store an extra 2 bytes in the header for the number of aliases
2. Store the main table of entries as with today
3. Store a second table of id->index into main table
4. Search first table first, then search alias table
5. To be backwards compatible with v4, set the size of alias table to 0 when loading.
Project Member

Comment 3 by bugdroid1@chromium.org, Jul 20 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/ade347f539a378fb37500f6d17e8835edc1d8ec0

commit ade347f539a378fb37500f6d17e8835edc1d8ec0
Author: agrieve <agrieve@chromium.org>
Date: Thu Jul 20 12:48:12 2017

Add deduplication logic to .pak files

Now, when multiple entries contain the same content, multiple
table-of-contents entries will be created, but only one data region.

As of now, en-US.pak has ~3200 entries, and 350 of them are duplicates.

For MonochromePublic.apk, shrinks uncompressed .pak size by ~130kb, and
compressed .pak size by 32kb.

BUG= 738566 

Review-Url: https://codereview.chromium.org/2969123002
Cr-Commit-Position: refs/heads/master@{#488215}

[modify] https://crrev.com/ade347f539a378fb37500f6d17e8835edc1d8ec0/build/android/resource_sizes.py
[modify] https://crrev.com/ade347f539a378fb37500f6d17e8835edc1d8ec0/tools/grit/grit/format/data_pack.py
[modify] https://crrev.com/ade347f539a378fb37500f6d17e8835edc1d8ec0/tools/grit/grit/format/data_pack_unittest.py
[modify] https://crrev.com/ade347f539a378fb37500f6d17e8835edc1d8ec0/tools/grit/pak_util.py
[modify] https://crrev.com/ade347f539a378fb37500f6d17e8835edc1d8ec0/ui/base/BUILD.gn
[modify] https://crrev.com/ade347f539a378fb37500f6d17e8835edc1d8ec0/ui/base/resource/data_pack.cc
[modify] https://crrev.com/ade347f539a378fb37500f6d17e8835edc1d8ec0/ui/base/resource/data_pack.h
[modify] https://crrev.com/ade347f539a378fb37500f6d17e8835edc1d8ec0/ui/base/resource/data_pack_literal.cc
[add] https://crrev.com/ade347f539a378fb37500f6d17e8835edc1d8ec0/ui/base/resource/data_pack_literal.h
[modify] https://crrev.com/ade347f539a378fb37500f6d17e8835edc1d8ec0/ui/base/resource/data_pack_unittest.cc
[modify] https://crrev.com/ade347f539a378fb37500f6d17e8835edc1d8ec0/ui/base/resource/resource_bundle_unittest.cc

Status: Fixed (was: Started)

Comment 5 by digit@chromium.org, Jul 20 2017

Great work, thanks for adding this :) It's 1000x better than having to manage the strings source files!
Project Member

Comment 6 by bugdroid1@chromium.org, Jul 21 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/59f00a7d06b43ae4127637baaf7d8e8f5ee7847a

commit 59f00a7d06b43ae4127637baaf7d8e8f5ee7847a
Author: achuith <achuith@chromium.org>
Date: Fri Jul 21 02:11:35 2017

Revert of Add deduplication logic to .pak files (patchset #10 id:180001 of https://codereview.chromium.org/2969123002/ )

Reason for revert:
https://bugs.chromium.org/p/chromium/issues/detail?id=747171

Original issue's description:
> Add deduplication logic to .pak files
>
> Now, when multiple entries contain the same content, multiple
> table-of-contents entries will be created, but only one data region.
>
> As of now, en-US.pak has ~3200 entries, and 350 of them are duplicates.
>
> For MonochromePublic.apk, shrinks uncompressed .pak size by ~130kb, and
> compressed .pak size by 32kb.
>
> BUG= 738566 
>
> Review-Url: https://codereview.chromium.org/2969123002
> Cr-Commit-Position: refs/heads/master@{#488215}
> Committed: https://chromium.googlesource.com/chromium/src/+/ade347f539a378fb37500f6d17e8835edc1d8ec0

TBR=flackr@chromium.org,sadrul@chromium.org,agrieve@chromium.org
# Skipping CQ checks because original CL landed less than 1 days ago.
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG= 738566 

Review-Url: https://codereview.chromium.org/2989443002
Cr-Commit-Position: refs/heads/master@{#488554}

[modify] https://crrev.com/59f00a7d06b43ae4127637baaf7d8e8f5ee7847a/build/android/resource_sizes.py
[modify] https://crrev.com/59f00a7d06b43ae4127637baaf7d8e8f5ee7847a/tools/grit/grit/format/data_pack.py
[modify] https://crrev.com/59f00a7d06b43ae4127637baaf7d8e8f5ee7847a/tools/grit/grit/format/data_pack_unittest.py
[modify] https://crrev.com/59f00a7d06b43ae4127637baaf7d8e8f5ee7847a/tools/grit/pak_util.py
[modify] https://crrev.com/59f00a7d06b43ae4127637baaf7d8e8f5ee7847a/ui/base/BUILD.gn
[modify] https://crrev.com/59f00a7d06b43ae4127637baaf7d8e8f5ee7847a/ui/base/resource/data_pack.cc
[modify] https://crrev.com/59f00a7d06b43ae4127637baaf7d8e8f5ee7847a/ui/base/resource/data_pack.h
[modify] https://crrev.com/59f00a7d06b43ae4127637baaf7d8e8f5ee7847a/ui/base/resource/data_pack_literal.cc
[delete] https://crrev.com/b1c22e2d5380b32a6379f43fff578fff2c807930/ui/base/resource/data_pack_literal.h
[modify] https://crrev.com/59f00a7d06b43ae4127637baaf7d8e8f5ee7847a/ui/base/resource/data_pack_unittest.cc
[modify] https://crrev.com/59f00a7d06b43ae4127637baaf7d8e8f5ee7847a/ui/base/resource/resource_bundle_unittest.cc

Project Member

Comment 7 by bugdroid1@chromium.org, Jul 24 2017

Labels: merge-merged-3163
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/1595ed49627bfa8af05e54112e27b6c904e160b2

commit 1595ed49627bfa8af05e54112e27b6c904e160b2
Author: achuith <achuith@chromium.org>
Date: Mon Jul 24 19:56:53 2017

Revert of Add deduplication logic to .pak files (patchset #10 id:180001 of https://codereview.chromium.org/2969123002/ )

Reason for revert:
https://bugs.chromium.org/p/chromium/issues/detail?id=747171

Original issue's description:
> Add deduplication logic to .pak files
>
> Now, when multiple entries contain the same content, multiple
> table-of-contents entries will be created, but only one data region.
>
> As of now, en-US.pak has ~3200 entries, and 350 of them are duplicates.
>
> For MonochromePublic.apk, shrinks uncompressed .pak size by ~130kb, and
> compressed .pak size by 32kb.
>
> BUG= 738566 
>
> Review-Url: https://codereview.chromium.org/2969123002
> Cr-Commit-Position: refs/heads/master@{#488215}
> Committed: https://chromium.googlesource.com/chromium/src/+/ade347f539a378fb37500f6d17e8835edc1d8ec0

TBR=flackr@chromium.org,sadrul@chromium.org,agrieve@chromium.org
# Skipping CQ checks because original CL landed less than 1 days ago.
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG= 738566 

Review-Url: https://codereview.chromium.org/2989443002
Cr-Original-Commit-Position: refs/heads/master@{#488554}
Change-Id: I33eff508d4177907d050424bee967d643fe022e4
Reviewed-on: https://chromium-review.googlesource.com/583542
Reviewed-by: Bernie Thompson <bhthompson@chromium.org>
Cr-Commit-Position: refs/branch-heads/3163@{#13}
Cr-Branched-From: ff259bab28b35d242e10186cd63af7ed404fae0d-refs/heads/master@{#488528}
[modify] https://crrev.com/1595ed49627bfa8af05e54112e27b6c904e160b2/build/android/resource_sizes.py
[modify] https://crrev.com/1595ed49627bfa8af05e54112e27b6c904e160b2/tools/grit/grit/format/data_pack.py
[modify] https://crrev.com/1595ed49627bfa8af05e54112e27b6c904e160b2/tools/grit/grit/format/data_pack_unittest.py
[modify] https://crrev.com/1595ed49627bfa8af05e54112e27b6c904e160b2/tools/grit/pak_util.py
[modify] https://crrev.com/1595ed49627bfa8af05e54112e27b6c904e160b2/ui/base/BUILD.gn
[modify] https://crrev.com/1595ed49627bfa8af05e54112e27b6c904e160b2/ui/base/resource/data_pack.cc
[modify] https://crrev.com/1595ed49627bfa8af05e54112e27b6c904e160b2/ui/base/resource/data_pack.h
[modify] https://crrev.com/1595ed49627bfa8af05e54112e27b6c904e160b2/ui/base/resource/data_pack_literal.cc
[delete] https://crrev.com/48aa68957520402a20073d92b6c5f4965479528e/ui/base/resource/data_pack_literal.h
[modify] https://crrev.com/1595ed49627bfa8af05e54112e27b6c904e160b2/ui/base/resource/data_pack_unittest.cc
[modify] https://crrev.com/1595ed49627bfa8af05e54112e27b6c904e160b2/ui/base/resource/resource_bundle_unittest.cc

Project Member

Comment 8 by bugdroid1@chromium.org, Jul 28 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/4735fae6caad85ab08433367108d3b96c0f7d646

commit 4735fae6caad85ab08433367108d3b96c0f7d646
Author: agrieve <agrieve@chromium.org>
Date: Fri Jul 28 20:39:08 2017

Reland of Add deduplication logic to .pak files

Revert was here:
https://codereview.chromium.org/2989443002/

Reason for reland:
No changes from origina. Bot issue fixed by:
https://chromium-review.googlesource.com/c/589888/

Now, when multiple entries contain the same content, multiple
table-of-contents entries will be created, but only one data region.

As of now, en-US.pak has ~3200 entries, and 350 of them are duplicates.

For MonochromePublic.apk, shrinks uncompressed .pak size by ~130kb, and
compressed .pak size by 32kb.

TBR=flackr@chromium.org,sadrul@chromium.org,achuith@chromium.org
BUG= 738566 

Review-Url: https://codereview.chromium.org/2984383002
Cr-Commit-Position: refs/heads/master@{#490501}

[modify] https://crrev.com/4735fae6caad85ab08433367108d3b96c0f7d646/build/android/resource_sizes.py
[modify] https://crrev.com/4735fae6caad85ab08433367108d3b96c0f7d646/tools/grit/grit/format/data_pack.py
[modify] https://crrev.com/4735fae6caad85ab08433367108d3b96c0f7d646/tools/grit/grit/format/data_pack_unittest.py
[modify] https://crrev.com/4735fae6caad85ab08433367108d3b96c0f7d646/tools/grit/pak_util.py
[modify] https://crrev.com/4735fae6caad85ab08433367108d3b96c0f7d646/ui/base/BUILD.gn
[modify] https://crrev.com/4735fae6caad85ab08433367108d3b96c0f7d646/ui/base/resource/data_pack.cc
[modify] https://crrev.com/4735fae6caad85ab08433367108d3b96c0f7d646/ui/base/resource/data_pack.h
[modify] https://crrev.com/4735fae6caad85ab08433367108d3b96c0f7d646/ui/base/resource/data_pack_literal.cc
[add] https://crrev.com/4735fae6caad85ab08433367108d3b96c0f7d646/ui/base/resource/data_pack_literal.h
[modify] https://crrev.com/4735fae6caad85ab08433367108d3b96c0f7d646/ui/base/resource/data_pack_unittest.cc
[modify] https://crrev.com/4735fae6caad85ab08433367108d3b96c0f7d646/ui/base/resource/resource_bundle_unittest.cc

Sign in to add a comment