
Issue 779623


Issue metadata

Status: WontFix
Owner: ----
Closed: Jan 10
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: ----




Do not fetch history of DEPSed repositories with pinned revisions

Project Member Reported by johannko...@google.com, Oct 30 2017

Issue description

When requesting to add a repo to DEPS, one of the issues was the size.

libaom was forked from libvpx and contains git history going back to 2010, encompassing vp8 and vp9 development, vp10 investigation, av1 development, and massive copies, renames and refactoring between those projects. libaom is almost twice the size of libvpx (110mb vs 58mb).

None of this git history is needed to build libaom for chromium. If we add av1 support to chromium, it will put an extra 110mb in every single developer's checkout.

To mitigate this I would like to request that DEPS entries contain only the pinned revision. If a developer desires development history they would still be able to enter the desired directory and 'git fetch' such information.

While I am not very familiar with every other DEPS entry, my understanding is they are intended to provide a snapshot for chromium to depend on, not be an active point of development. This is certainly the case for libaom and libvpx. Development mostly occurs upstream, and we roll updates into chromium periodically.
 

Comment 1 by s...@google.com, Nov 1 2017

Cc: tandrii@chromium.org aga...@chromium.org hinoka@chromium.org
Components: -Infra>Git>Admin Infra>SDK
Summary: Do not fetch history of DEPSed repositories with pinned revisions (was: DEPS checkouts take up a lot of space)
I guess we should change bot_update not to fetch the entire history when it sees a pinned dependency. I think what we want is:

rm -fr src/third_party/libaom/source/libaom
mkdir src/third_party/libaom/source/libaom
cd src/third_party/libaom/source/libaom
git init
git fetch --depth 1 https://aomedia.googlesource.com/aom 7b06dd5dbf11ee1cd65b974a2e46ec33eab65375
git checkout FETCH_HEAD

I just verified that this is quite fast and a quarter the size. Can we do this for each DEPS entry?
Cc: iannucci@chromium.org
Status: WontFix (was: Untriaged)
There are a few problems with doing this:

1) Creating a shallow checkout (with fetch --depth N) is very expensive for the remote git servers. They have pre-made packs of all the files necessary for a full clone, and can serve them quickly. Serving shallow clones (especially many shallow clones, to all of the bots every time they sync) is computationally much harder.
2) Upgrading a shallow clone to a full clone is even more expensive than that. Although we wouldn't expect this to be done often, if you run smut@'s steps above and then run a plain "git fetch", that will be slower than if you had just done a full-depth clone in the first place.
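
To make point 2 concrete, here is a minimal sketch of how a developer might check whether a dependency checkout is shallow and, if so, upgrade it to a full clone. The function names `is_shallow_checkout` and `deepen_checkout` are made up for this illustration and are not part of depot_tools or bot_update. It assumes Git 2.15 or newer (for `--is-shallow-repository`). The second argument is a remote name or URL and defaults to `origin`, which a checkout created with a bare `git init` plus `git fetch <url>` would not have.

```shell
#!/bin/bash
# Hypothetical helpers for illustration only.

# Succeeds if the checkout in directory $1 is a shallow clone.
is_shallow_checkout() {
  [ "$(git -C "$1" rev-parse --is-shallow-repository)" = "true" ]
}

# Converts the shallow checkout in directory $1 into a full clone,
# fetching from remote or URL $2 (default: origin).
# This is the expensive operation described in point 2.
deepen_checkout() {
  if is_shallow_checkout "$1"; then
    git -C "$1" fetch --unshallow "${2:-origin}"
  else
    echo "$1 already has full history"
  fi
}
```

For example, `deepen_checkout src/third_party/libaom/source/libaom https://aomedia.googlesource.com/aom` would fetch the missing history for the checkout created by the steps above.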

Basically, when we first switched from SVN to Git in 2014, we tried to do exactly this. It didn't go well, and so we put other measures (like the git cache which is used by all bots) in place instead.

It may be worth investigating this solution again, but I am not aware of anyone who really has the cycles to do so. In my opinion, a difference of ~0.075GB (since smut says the shallow clone is a quarter the size), which shrinks over time (since each subsequent shallow fetch leaves the old git objects on disk), is not worth fundamentally changing the way we get checkouts.
Status: Unconfirmed (was: WontFix)
Your performance points are quite unfortunate. However, I'm going to disagree strongly wrt size. While I used libaom as an example, I would suggest using this for all the DEPS repositories. I'm also not primarily interested in this for bots, but for developer checkouts.

For libaom:
$ du -hs libaom*
 25M	libaom-7b06dd5dbf11ee1cd65b974a2e46ec33eab65375
104M	libaom-full

There are some much larger gains:
native_client 422mb -> 45mb
skia 347mb -> 94mb
icu 355mb -> 177mb
v8 323mb -> 151mb
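
As a quick sanity check on these figures, the sketch below computes the absolute and relative savings per repository. The helper name `savings_pct` is invented for this example, and the input numbers are simply copied from the list above.

```shell
#!/bin/bash
# Prints the percentage saved when a checkout shrinks from $1 MB to $2 MB.
# Integer arithmetic, rounded down.
savings_pct() {
  echo $(( ($1 - $2) * 100 / $1 ))
}

# Columns: repository, full size in MB, shallow size in MB.
while read -r name full shallow; do
  echo "${name}: saves $((full - shallow)) MB ($(savings_pct "$full" "$shallow")%)"
done <<'EOF'
native_client 422 45
skia 347 94
icu 355 177
v8 323 151
EOF
```

By this arithmetic, these four repositories alone account for roughly 980mb of the total savings.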

I tried to script this but ran into problems with the way buildtools has nested .git directories and so gave up on that one. But generally, it saves about 2gb total. Checkout directory goes from 18gb to 16gb.

#!/bin/bash
# Re-create every nested git checkout as a depth-1 fetch of its current revision.
for dir in $(find . -mindepth 2 -type d -name .git | sort | sed 's/\/\.git$//'); do
  # Nested checkouts vanish when their parent directory is re-created.
  [[ -d "${dir}" ]] || { echo "${dir} disappeared"; exit 1; }
  pushd "${dir}" > /dev/null
  project=$(basename "$(pwd)")
  # Assumes the remote is named "origin".
  repo=$(git config --get remote.origin.url)
  rev=$(git rev-parse HEAD)
  echo "${dir} ${project} ${repo} ${rev}"

  # buildtools has nested .git directories, so skip it.
  if [[ "${project}" = "buildtools" ]]; then
    popd > /dev/null
    continue
  fi

  # Report the size before and after replacing the checkout.
  cd ..
  du -hs "${project}"
  rm -rf "${project}"
  mkdir "${project}"
  cd "${project}"
  git init
  git fetch --depth 1 "${repo}" "${rev}"
  git checkout FETCH_HEAD
  cd ..
  du -hs "${project}"

  popd > /dev/null
done
A simpler script:

fetch --nohooks chromium
du -hs .
printf '#!/bin/bash\nrm -rf .git/objects\nmkdir .git/objects\ngit fetch --depth 1 origin "$(git rev-parse HEAD)"\n' > /tmp/shallow
chmod +x /tmp/shallow
gclient recurse /tmp/shallow
du -hs .

But even so, the savings are minor:
* It looks like a ~10% savings or so
* A full compile produces more than this amount of executables + debug symbols anyway
* The old git objects aren't deleted when the repo rolls forward
* People with existing checkouts won't see these benefits at all

And performing certain kinds of common-ish actions becomes a much worse experience:
* Moving backwards in time (e.g. for a bisect) gets very slow
* Examining history (e.g. for a blame in a dependency) becomes impossible, or requires a sync first
* Running `git log` in a repo which has 5 out of the last 100 commits (e.g. v8 which you sync once a day) breaks entirely
One last point: network traffic. <project> is a full checkout and <project>-shallow is a git fetch --depth 1 HEAD

98M	./libaom/.git
3.6M	./libaom-shallow/.git

384M	./native_client/.git
5.9M	./native_client-shallow/.git

288M	./skia/.git
20M	./skia-shallow/.git

After doing 'git checkout FETCH_HEAD' the on-disk size is much larger, but I believe these accurately represent what is copied over the network.
Cc: -iannucci@chromium.org iannu...@google.com
Labels: Pri-3
Status: WontFix (was: Unconfirmed)
