cbuildbot: cherry-pick changes faster |
|||||||||
Issue descriptioncbuildbot currently cherry-picks changes very slowly and inefficiently. We can do at least an order of magnitude better. An example is guado moblab build 5814. For that build we had 133 changes and 106 of them were kernel changes. We spent about 50 minutes cherry-picking changes. As per bug #725220 , picking a kernel changes from gerrit one-at-a-time takes between 20 and 25 seconds. === So how can we be faster? One trick I personally do as an individual developer is to "FETCH" the top change in the chain, then cherry-pick by git hash. That particular guado moblab had a huge chain of CLs in it. I count 100 kernel changes from mcchou, and it's clear that many of these are in one big chain in gerrit. It turns out that doing a "git fetch" of the highest change in the list actually fetches all the earlier ones too but still only takes between 20 and 25 seconds. Let's take an example w/ just two patches. Let's say I want to cherry-pick these two: https://chromium-review.googlesource.com/c/474435/6 https://chromium-review.googlesource.com/c/474436/6 474435 is the parent of 474436. I will say that these two commands produce equivalent results but one is twice as fast: git fetch https://chromium.googlesource.com/chromiumos/third_party/kernel refs/changes/35/474435/6 git cherry-pick FETCH_HEAD git fetch https://chromium.googlesource.com/chromiumos/third_party/kernel refs/changes/36/474436/6 git cherry-pick FETCH_HEAD vs. git fetch https://chromium.googlesource.com/chromiumos/third_party/kernel refs/changes/36/474436/6 git cherry-pick FETCH_HEAD~ git cherry-pick FETCH_HEAD NOTE: it's important to be careful in the above. If 474435 has been revved then you'll need to fetch it separately. AKA if 474435/6 is a parent of 474436/6 and #6 is the newest patch set for 474436, but there's a patch set #7 for 474435, then we need to do the extra fetch. If it's fast, we could easily validate git hashes. There's probably a better way, but this shows the concept: hash_474436=$(curl -s https://chromium-review.googlesource.com/changes/474436/revisions/6/patch?download | base64 -d | head -1 | awk '{print $2;}') hash_474435=$(curl -s https://chromium-review.googlesource.com/changes/474435/revisions/6/patch?download | base64 -d | head -1 | awk '{print $2;}') if [[ "$(git show ${hash_474436}~)" == "$(git show ${hash_474435})" ]]; then echo "Good"; else echo "Bad, need a separate fetch"; fi === Another possible solution is much easier but is a bigger change. ...that's to just use the "base 64" patch download. That's always super fast. I'm a bit hesitant to just suggest we do this because it's possible (???) it could give some different results, I think. If we do the full "git fetch" we get the full git history for a given change. That should give "git" more context for how the change should be applied. With just the "PATCH" file I don't think we're guaranteed that git will make the same decisions. Can someone who knows git better confirm? In any case, if we can do this, you'd just need to run: curl -s https://chromium-review.googlesource.com/changes/474433/revisions/6/patch?download | base64 -d | git am curl -s https://chromium-review.googlesource.com/changes/474436/revisions/6/patch?download | base64 -d | git am That's pretty darn fast.
,
May 22 2017
OK, I think I have proved that the "git am" isn't as smart as "git fetch / git cherry-pick" and probably can't be used for the builder.
Basically: having full git history and cherry-picking can handle renames. ...and git am can't. In the below, I do this:
1. Make a change in a file (change text "Doug" -> "Frank")
2. Rename the file
3. Make the opposite change in the file (change text "Frank" -> "Doug")
If I cherry-pick the original change ("Doug" -> "Frank") then it actually works and makes the change in the renamed file.
If I do a "git format-patch" and "git am" of the original change, it doesn't work properly and doesn't change the renamed file.
---
dianders@tictac:tmp$ mkdir gittest
dianders@tictac:tmp$ cd gittest
dianders@tictac:gittest$ git init .
Initialized empty Git repository in /b/tip/tmp/gittest/.git/
dianders@tictac:gittest (master)$ echo "My name is Doug" > name
dianders@tictac:gittest (master)$ git add name
dianders@tictac:gittest (master)$ git commit -am "my name is doug"
[master (root-commit) c933037bac73] my name is doug
1 file changed, 1 insertion(+)
create mode 100644 name
dianders@tictac:gittest (master)$ echo "My name is Frank" > name
dianders@tictac:gittest (master)$ git add name
dianders@tictac:gittest (master)$ git commit -am "my name is now frank"
[master 92e041a7545a] my name is now frank
1 file changed, 1 insertion(+), 1 deletion(-)
dianders@tictac:gittest (master)$ git mv name name.txt
dianders@tictac:gittest (master)$ git commit -m "oops, forgot .txt"
[master 1dbd5ec0f2b1] oops, forgot .txt
1 file changed, 0 insertions(+), 0 deletions(-)
rename name => name.txt (100%)
dianders@tictac:gittest (master)$ echo "My name is Doug" > name.txt
dianders@tictac:gittest (master)$ git add -u
dianders@tictac:gittest (master)$ git commit -am "Just kidding, my name is really Doug"
[master b8adb72f1a23] Just kidding, my name is really Doug
1 file changed, 1 insertion(+), 1 deletion(-)
dianders@tictac:gittest (master)$ git log --oneline
b8adb72f1a23 (HEAD -> master) Just kidding, my name is really Doug
1dbd5ec0f2b1 oops, forgot .txt
92e041a7545a my name is now frank
c933037bac73 my name is doug
dianders@tictac:gittest (master)$ git cherry-pick 92e041a7545a
[master 56b089e4e30c] my name is now frank
Date: Mon May 22 14:25:34 2017 -0700
1 file changed, 1 insertion(+), 1 deletion(-)
dianders@tictac:gittest (master)$ git reset --hard HEAD~
HEAD is now at b8adb72f1a23 Just kidding, my name is really Doug
dianders@tictac:gittest (master)$ git format-patch 92e041a7545a~..92e041a7545a
0001-my-name-is-now-frank.patch
dianders@tictac:gittest (master)$ git am 0001-my-name-is-now-frank.patch
Applying: my name is now frank
error: name: does not exist in index
Patch failed at 0001 my name is now frank
The copy of the patch that failed is found in: .git/rebase-apply/patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
,
May 22 2017
I very much like where this is going, but as nxia@ pointed out in email earlier... how do we detect that a new revision has been uploaded for a subset of the git stack? For example, I upload a stack of three CLs: A - Rev 1 B - Rev 1 C - Rev 1 I then create a new branch locally. Cherry pick B to this new branch, amend it, and upload it to Gerrit. This upgrades B to Rev 2 of the CL (which changes the SHA, resets flags, etc). If you fetch change C, you'll get rev 1 of B, not rev 2. We never want to test Rev 1 after 2 has been uploaded.
,
May 22 2017
Yes you can absolutely run cherry-pick an order-of-magnitude faster. The real cost here is the 'git fetch', not the 'cherry-pick'. You can absolutely speed up the 'git fetch' by skipping all-but-the-last 'git fetch' operations. Note that in order for this to work the whole stack of patches needs to be rebased exactly on top of each other, but we already have code in Android TreeHugger to check for this that you could borrow. I'd probably also recommend implementing some multiprocessing/threading for this code, if you haven't already.
,
May 22 2017
@3: I think I was suggesting a workaround for your exact case above. Let me try to summarize it like this: 1. Find out all git hashes of proper revisions of all CLs you intend to commit. If there's no better way to do this, this can be done semi quickly with the command: curl -s https://chromium-review.googlesource.com/changes/${CLNUM}/revisions/${REVNUM}/patch?download | base64 -d 2. Run this for each CL in _reverse_ order of what we have today: if 'git show ${GITHASH}' works: do nothing else: do the same "git fetch" we'd do today, but not the cherry pick. 3. Run this for each CL in _the same_ order we have today: git cherry-pick ${GITHASH} That's a pretty simple way to get started. --- David James is right that threading would be ideal here. I'm not terribly familiar with the above, but one easy thing to do would be to fork off a thread per project. Then you wouldn't need to mess with the above logic. --- Does anyone have the cbuildbot knowhow to do this quickly? It seems like for someone familiar with the code this wouldn't be crazy hard.
,
May 22 2017
I'm NOT super familiar with the code in question, but when selecting CLs to put into the CQ, we make a Gerrit REST API request per CL. Think Gerrit the review tool, not git. This is used to fetch flag information and such, and we store off that information. That includes CL #, revision, and the "Change-ID", and probably includes the git hash as well. I think we just need to verify (on the master, not per-slave) that the git hash we fetched matches that Gerrit information.
,
May 23 2017
A few comments: 1. I think threading will be insufficient. If you have 100 patches in one chain in one project, one thread per project won't help. 2. We're doubly bit by this by doing stuff for the master and the slave builds. The flipside of that is we can do stuff on the master and tell the slave the optimal set of things that need to be done. 3. There's another option that is potentially more work, but maybe it's just a week or two in total: - master: ask gerrit to create a new branch from ToT called VERSION-cq via Gerrit REST API - master: ask gerrit to cherry-pick all of the changes being tested to VERSION-cq via REST - master/slave: sync VERSION-cq - master: upon commitqueuecompletion, if ToT has not moved and all changes are being accepted, do the equivalent of reset --hard/push cros/master to VERSION-cq via REST; otherwise, use REST to cherry-pick submitted changes to cros/master
,
May 23 2017
Here's my alternative proposal: A. Combine fetches per-project (e.g. fetch 200 refs in one run). According to local tests, it takes 6 seconds to fetch 1 ref and 12 seconds to fetch 200 refs, so it scales very well to fetch that way. B. Thread fetches across projects. With A and B, you'll have the whole fetch happening in ~12 seconds, and then just have to contend with cherry-pick time. Testing locally, it takes ~10-12 seconds to cherry-pick 200 patches (2 seconds less if you combine into a single command). So really, with the threading & the combined fetch, we're talking about 24 seconds for the whole fetch + cherry-pick on the master, and 24 seconds on the slave. If we could get there there'd be no reason to create branches (which I think would be more work/complexity and could lead to new kinds of flake as we try to get the logic right)
,
May 23 2017
+1 to c#8. I think in the long run, I still prefer doing this server side once (especially if we can optimize the commit), but I think if we can collapse all the fetches, it's much simpler and easier to do that now.
,
May 23 2017
Talked with David and he's already started on this :)
,
May 24 2017
,
May 25 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/78ea03cd8699bd6fda622fb9f8ea8a09df53dfec commit 78ea03cd8699bd6fda622fb9f8ea8a09df53dfec Author: David Riley <davidriley@chromium.org> Date: Thu May 25 12:55:29 2017 validation_pool: Pre-fetch patches when syncing slaves. BUG= chromium:725220 BUG= chromium:725233 TEST=validation_pool_unittest; run_tests Change-Id: I7be8749aaf018f69c75ad0bd6f9f58108dd585ed Reviewed-on: https://chromium-review.googlesource.com/513397 Tested-by: David Riley <davidriley@chromium.org> Trybot-Ready: David Riley <davidriley@chromium.org> Reviewed-by: David James <davidjames@chromium.org> [modify] https://crrev.com/78ea03cd8699bd6fda622fb9f8ea8a09df53dfec/cbuildbot/patch_series.py [modify] https://crrev.com/78ea03cd8699bd6fda622fb9f8ea8a09df53dfec/cbuildbot/validation_pool.py [modify] https://crrev.com/78ea03cd8699bd6fda622fb9f8ea8a09df53dfec/cbuildbot/validation_pool_unittest.py
,
May 25 2017
Change appears to be good and is doing multiple all the fetches for a repo at one time The builds I've looked at haven't been the best examples since they're submitting the same changes repeatedly due to guado_moblab-paladin failing and don't require a fetch in any case. Still waiting to land is the change to get rid of sleeps between cherry-picks https://chromium-review.googlesource.com/#/c/513421/ which will save 1s per patch. The following build ran into some repo sync errors: https://luci-logdog.appspot.com/v/?s=chromeos%2Fbb%2Fchromeos%2Fbeaglebone-paladin%2F13461%2F%2B%2Frecipes%2Fsteps%2FCommitQueueSync%2F0%2Fstdout
,
May 25 2017
Issue 725220 has been merged into this issue.
,
May 26 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/5c15b33f5f051da0715f3d769692b28ad6bd26c9 commit 5c15b33f5f051da0715f3d769692b28ad6bd26c9 Author: David Riley <davidriley@chromium.org> Date: Fri May 26 00:16:02 2017 Revert "Put sleep between apply_change operations." This reverts commit 250bc41528aa5f90b8aadab998ad4818ddb97499. By pre-fetching the sleep hopefully isn't necessary. BUG= chromium:725233 BUG= chromium:725220 Change-Id: Ibb95ca96daa6fd66d53de7553e60f600b8f4700c Reviewed-on: https://chromium-review.googlesource.com/513421 Commit-Ready: David Riley <davidriley@chromium.org> Tested-by: David Riley <davidriley@chromium.org> Reviewed-by: Aviv Keshet <akeshet@chromium.org> [modify] https://crrev.com/5c15b33f5f051da0715f3d769692b28ad6bd26c9/cbuildbot/validation_pool.py
,
May 27 2017
The changes I landed might be causing some issues: https://luci-logdog.appspot.com/v/?s=chromeos%2Fbb%2Fchromeos%2Fbeaglebone-paladin%2F13480%2F%2B%2Frecipes%2Fsteps%2FCommitQueueSync%2F0%2Fstdout [1;31m17:36:24: ERROR: <class 'chromite.lib.cros_build_lib.RunCommandError'>: return code: 128; command: git fetch -f https://chromium-review.googlesource.com/chromiumos/platform/crosvm refs/changes/20/509920/6 refs/changes/82/510782/5 fatal: unable to access 'https://chromium-review.googlesource.com/chromiumos/platform/crosvm/': Operation too slow. Less than 1000 bytes/sec transferred the last 300 seconds cmd=['git', 'fetch', '-f', u'https://chromium-review.googlesource.com/chromiumos/platform/crosvm', u'refs/changes/20/509920/6', u'refs/changes/82/510782/5'], cwd=/b/cbuild/src/platform/crosvm Traceback (most recent call last): File "/b/cbuild/chromite/lib/parallel.py", line 602, in TaskRunner task(*x, **task_kwargs) File "/b/cbuild/chromite/lib/parallel.py", line 800, in <lambda> fn = lambda idx, task_args: out_queue.put((idx, task(*task_args))) File "/b/cbuild/chromite/cbuildbot/patch_series.py", line 138, in _FetchChangesForRepo git.RunGit(repo, cmd, print_cmd=True) File "/b/cbuild/chromite/lib/git.py", line 817, in RunGit ['git'] + cmd, **kwargs) File "/b/cbuild/chromite/lib/retry_util.py", line 109, in GenericRetry ret = functor(*args, **kwargs) File "/b/cbuild/chromite/lib/cros_build_lib.py", line 645, in RunCommand raise RunCommandError(msg, cmd_result) RunCommandError: return code: 128; command: git fetch -f https://chromium-review.googlesource.com/chromiumos/platform/crosvm refs/changes/20/509920/6 refs/changes/82/510782/5 fatal: unable to access 'https://chromium-review.googlesource.com/chromiumos/platform/crosvm/': Operation too slow. Less than 1000 bytes/sec transferred the last 300 seconds cmd=['git', 'fetch', '-f', u'https://chromium-review.googlesource.com/chromiumos/platform/crosvm', u'refs/changes/20/509920/6', u'refs/changes/82/510782/5'], cwd=/b/cbuild/src/platform/crosvm [0m
,
May 27 2017
Suggest filing a bug against Gerrit on Borg folks for that -- that failure seems like a Gerrit on borg problem rather than a problem with your CL. On our end, suggest we add a retry to lib/git.py on the error message 'Operation too slow' ? (Currently we don't retry on that)
,
May 29 2017
+phobbs: re c#17: Can infra deputy (phobbs/dgarrett) take on dealing with GoB team?
,
Jun 1 2017
Pinging deputies.
,
Jun 2 2017
Re #19: Are you still seeing the slowness issues? - If we've only seen them on one day: - The primary deputy should have gotten involved on the same day, as this is an operational problem. - If we've seen them multiple times intermittently - Let's collect evidence that this is the case (ping this bug say three times this happens) - At that point, the correct response is to add retry / metrics / involve the GoB team in a non-critical issue. - For this case, someone from infra team is on the hook, but it's not the deputies' job. My reading of the issue in #16 so far is that we've only seen this one. I'd even argue that we shouldn't (yet) add the retry requested in #17 so that we don't start ignoring these failures too early (it'll still cost time).
,
Jun 2 2017
Right now, we have a double set of retries on git commands (long story). I wonder if we were going through all/most of them here, or if they were causing issues, and/or saving us from what would have been GoB based failures?
,
Jun 2 2017
FYI: If there are Git wrapper retries, some text indicating this would have been printed to STDERR.
,
Jun 2 2017
Don, I believe we're not doing retries here, because we don't whitelist "Operation too slow" as a known error condition. Seems worth whitelisting.
,
Jun 2 2017
I've no problem with that change, but in the cases above we are already succeeding. How will more retries make us faster?
,
Jun 2 2017
Produced a CL with that change, but I don't think it's the core problem.
,
Jun 2 2017
Re c#19: Here's some other examples of slow CommitQueueSync's in the sync stage. I have more examples, these were just the first few. Here's some other examples from the same day (May 26): https://luci-logdog.appspot.com/v/?s=chromeos%2Fbb%2Fchromeos%2Fbutterfly-paladin%2F22464%2F%2B%2Frecipes%2Fsteps%2FCommitQueueSync%2F0%2Fstdout https://luci-logdog.appspot.com/v/?s=chromeos%2Fbb%2Fchromeos%2Ftricky-paladin%2F8372%2F%2B%2Frecipes%2Fsteps%2FCommitQueueSync%2F0%2Fstdout Here's another from today: https://luci-logdog.appspot.com/v/?s=chromeos%2Fbb%2Fchromeos%2Ftricky-paladin%2F8420%2F%2B%2Frecipes%2Fsteps%2FCommitQueueSync%2F0%2Fstdout
,
Jun 2 2017
This looks like a Gerrit bug -- please file against Gerrit. In those examples, cherry-pick is going very fast so let's file a bug in buganizer against Gerrit team to look at those issues (they look legitimate and the reason why Sync takes 5 minutes is because the timeout for 'too slow' syncs is 5 minutes)
,
Jun 2 2017
Re: c#24: Your fix addresses builds failures like: https://uberchromegw.corp.google.com/i/chromeos/builders/beaglebone-paladin/builds/13480 It doesn't address the slow sync's listed in c#26.
,
Jun 2 2017
Spun off b/62300477.
,
Jun 3 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/d2a44898b7ae8d8f755af8d0c3e0e0e73b2d9510 commit d2a44898b7ae8d8f755af8d0c3e0e0e73b2d9510 Author: Don Garrett <dgarrett@google.com> Date: Sat Jun 03 01:39:39 2017 git: Retry "Operation too slow". Adding another retry condition. BUG= chromium:725233 TEST=None Change-Id: I19c9f605ae6164705e9357579100ce68ab928b88 Reviewed-on: https://chromium-review.googlesource.com/522945 Commit-Ready: Don Garrett <dgarrett@chromium.org> Tested-by: Don Garrett <dgarrett@chromium.org> Reviewed-by: David Riley <davidriley@chromium.org> Reviewed-by: David James <davidjames@chromium.org> [modify] https://crrev.com/d2a44898b7ae8d8f755af8d0c3e0e0e73b2d9510/lib/git.py
,
Jun 3 2017
The following revision refers to this bug: https://chromium.googlesource.com/infra/infra/+/c1ea15d0d447fcdda2c33a3496a18c33694169bd commit c1ea15d0d447fcdda2c33a3496a18c33694169bd Author: Dan Jacques <dnj@chromium.org> Date: Sat Jun 03 14:10:39 2017 [git wrapper] Add "Operation too slow" retry. BUG= chromium:725233 TEST=unit R=dgarrett@chromium.org, estaab@chromium.org, iannucci@chromium.org Change-Id: I4a00636ae95bccc144ecf66d1b38516ec77b7699 Reviewed-on: https://chromium-review.googlesource.com/523385 Reviewed-by: Robbie Iannucci <iannucci@chromium.org> Commit-Queue: Daniel Jacques <dnj@chromium.org> [modify] https://crrev.com/c1ea15d0d447fcdda2c33a3496a18c33694169bd/go/src/infra/tools/git/retry_regexp.go [modify] https://crrev.com/c1ea15d0d447fcdda2c33a3496a18c33694169bd/go/src/infra/tools/git/retry_regexp_test.go |
|||||||||
►
Sign in to add a comment |
|||||||||
Comment 1 by diand...@chromium.org
, May 22 2017