"bot_update" is flaky |
Issue description"bot_update" is flaky. This issue was created automatically by the chromium-try-flakes app. Please find the right owner to fix the respective test/step and assign this issue to them. If the step/test is not infrastructure-related (e.g. flaky test), please add Sheriff-Chromium label and change issue status to Untriaged. When done, please remove the issue from Trooper Bug Queue by removing the Infra-Troopers label. We have detected 3 recent flakes. List of all flakes can be found at https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyFQsSBUZsYWtlIgpib3RfdXBkYXRlDA. Flaky tests should be disabled within 30 minutes unless culprit CL is found and reverted. Please see more details here: https://sites.google.com/a/chromium.org/dev/developers/tree-sheriffs/sheriffing-bug-queues#triaging-auto-filed-flakiness-bugs This flaky test/step was previously tracked in issue 587311 .
,
Mar 17 2016
Both cases are instances of bot_update hanging because of a sub-command hanging; the process trees are printed. Forwarding to hinoka@ for further triage.
,
Mar 17 2016
This looks like Git timeouts. These happen sometimes and can be ignored as long as they're not systemic and sustained.
,
Mar 17 2016
Out of the tens of thousands of jobs we run, I'm going to assert that ~10 a day failing on git fetches is insignificant. But just for discussion: what do you think about wrapping our git calls (particularly git-remote-https) so that they kill themselves early if they detect that GoB is keeping a connection open but not really sending much data (or sending data too slowly)?
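For concreteness, a minimal sketch of that kind of wrapper, assuming we simply lean on git/curl's low-speed abort knobs (GIT_HTTP_LOW_SPEED_LIMIT / GIT_HTTP_LOW_SPEED_TIME) instead of monitoring the connection ourselves; the script, thresholds, and retry count below are made up:

#!/usr/bin/env python
# Hypothetical wrapper around our git invocations: rather than introspecting
# git-remote-https, tell git/curl to abort any transfer that stays below a
# minimum rate for too long, then retry once (fetches are incremental).
import os
import subprocess
import sys

def run_git_with_low_speed_abort(args, min_bytes_per_sec=1000, window_sec=300):
    # Abort if fewer than min_bytes_per_sec flow for window_sec seconds.
    env = os.environ.copy()
    env['GIT_HTTP_LOW_SPEED_LIMIT'] = str(min_bytes_per_sec)
    env['GIT_HTTP_LOW_SPEED_TIME'] = str(window_sec)
    return subprocess.call(['git'] + args, env=env)

if __name__ == '__main__':
    rc = run_git_with_low_speed_abort(sys.argv[1:])
    if rc != 0:
        rc = run_git_with_low_speed_abort(sys.argv[1:])  # one cheap retry
    sys.exit(rc)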
,
Mar 17 2016
Even if it happens 10 times per day, if it causes a user-visible problem we must find a workaround, like a timeout (as suggested by tandrii@) plus retry.
,
Mar 17 2016
So what do you think of my idea?
,
Mar 17 2016
I agree in principle, though I'm not sure adding yet another wrapper is the best implementation.
,
Mar 17 2016
Well, if you have a better idea that doesn't involve killing git processes that could be doing real work, I'd love to hear it. I'm a bit stumped, since I'm convinced you really need to be able to introspect the git-remote-https process to make this determination.
,
Mar 17 2016
I mean, in theory bot_update could run a process list, filter for children whose names match "git-remote-https", and introspect them that way, but IMO that's even more hacky than yet another wrapper.
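For completeness, a sketch of what that process-list approach could look like; it assumes psutil is importable on the bots (it may not be), and the 20-minute deadline is invented:

# Hacky sketch: walk bot_update's own process tree, find git-remote-https
# children, and kill any that have been alive longer than a deadline.
import os
import time
import psutil  # assumption: available on the bot

GIT_REMOTE_DEADLINE_SEC = 20 * 60  # invented threshold

def kill_stuck_git_remotes(root_pid=None):
    root = psutil.Process(root_pid or os.getpid())
    now = time.time()
    for child in root.children(recursive=True):
        try:
            if child.name().startswith('git-remote-https'):
                if now - child.create_time() > GIT_REMOTE_DEADLINE_SEC:
                    child.kill()
        except psutil.NoSuchProcess:
            pass  # the child exited while we were iterating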
,
Mar 17 2016
You have a point here, though it seems to apply to a wrapper as well: how do you monitor the network communication of an external process?
,
Mar 17 2016
On the other hand, something we should do that doesn't require a wrapper is to gather data on git calls from bot_update (timing, timeouts, number of times each command has been called). That would tell us whether hinoka's hunch (#4) is correct or not.
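A minimal sketch of that bookkeeping, assuming bot_update owns the call sites; the structure below is illustrative, not the real instrumentation:

# Illustrative only: time each git invocation, note whether it hit a timeout,
# and keep the records so they can be aggregated/flushed at the end of the run.
import subprocess
import time

git_stats = []  # flushed somewhere useful (log, ts_mon, ...) at the end

def timed_git_call(args, timeout_sec=None):
    start = time.time()
    timed_out = False
    proc = subprocess.Popen(['git'] + args)
    while proc.poll() is None:
        if timeout_sec and time.time() - start > timeout_sec:
            proc.kill()
            timed_out = True
            break
        time.sleep(1)
    proc.wait()
    git_stats.append({
        'cmd': args[0] if args else '',
        'duration_sec': time.time() - start,
        'timed_out': timed_out,
        'returncode': proc.returncode,
    })
    return proc.returncode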
,
Mar 17 2016
I have no clue :). We might be able to hijack and reimplement git-remote-https by injecting a fake one into $PATH.
,
Mar 17 2016
So, just to be pedantic, this is a gclient.py issue.
,
Mar 17 2016
Thanks, Philippe, for the assertiveness :) ~10 failures per day is indeed not significant vs. the 1000s that end up OK, but visibility shouldn't be the reason for fixing this. The real reason is that this seemingly small probability of a single failure propagates to higher abstraction levels with a 50-100x multiplier, ultimately resulting in 10 CLs being delayed by 1 hour because just 1 out of the 50 bot_update jobs run for them failed.
,
Mar 17 2016
But! bot_update can modify $PATH to replace the git implementation with a fake wrapper that collects telemetry data (to ts_mon?).
,
Mar 17 2016
My point is: if this were an easy problem to solve, I would have solved it already. That's why I'm proposing crazier ideas; I firmly believe this requires some crazy and dirty solutions.
,
Mar 17 2016
At the very least we can start by collecting stats, similarly to how Goma does it. The system is in place; it's "just" a matter of making bot_update compute stats and write a serialized proto to a temporary directory. hinoka: happy to help you with that.
,
Mar 17 2016
And I'm OK with yet another wrapper or replacement. However, let's be conservative with this and avoid wide outages. That is, I'm asking that deployment first be contained to, say, the .fyi and infra waterfalls for 2 weeks. And ideally, can we add monitoring that somehow reports when such timeouts are detected?
,
Mar 17 2016
The main difference between us and Goma is that we don't control the entire stack. In particular, the part that fails the most (the git binary) is opaque to us; without some clever hacks we can only treat it as a black box.
,
Mar 17 2016
How is this user-visible? The tryserver transparently retries this. A maybe-less-crazy idea: since git operations are incremental, we could just put an absolute timeout on the git process itself, clear the lock, retry, and hammer it a few times. Getting into monitoring the network operations of a subprocess is... well, platform-specific and ugly.
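A sketch of that hammer-it-a-few-times approach; the timeout, retry count, and lock path (the standard .git/index.lock) are assumptions:

# Sketch: run a git command with a hard deadline; on timeout, kill it, clear
# a stale index.lock if one was left behind, and try again.
import os
import subprocess
import time

def hammer_git(args, cwd, timeout_sec=20 * 60, attempts=3):
    for _ in range(attempts):
        proc = subprocess.Popen(['git'] + args, cwd=cwd)
        deadline = time.time() + timeout_sec
        while proc.poll() is None and time.time() < deadline:
            time.sleep(5)
        if proc.poll() is not None and proc.returncode == 0:
            return 0
        if proc.poll() is None:
            proc.kill()
            proc.wait()
        lock = os.path.join(cwd, '.git', 'index.lock')
        if os.path.exists(lock):
            os.remove(lock)  # our git process is no longer running
    return 1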
,
Mar 17 2016
I'm only talking about things we can collect outside the binaries: which commands we run, the min/max/average running times, the number of timeouts, etc. Even that data would be very useful for identifying sources of failures and flakiness, and bot_update can collect it even without a wrapper. Example: if we see that git-remote-https running time is under 5s (making up numbers) for 99.9% of our calls and jumps to 1 min 0.1% of the time, then we know we can add a timeout in bot_update of around 15s. What I'm describing may not pan out, but there is no way to know until we measure things.
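That is, once the durations are measured, picking the cutoff is trivial; a toy example with invented numbers:

# Toy example: given collected git-remote-https durations (seconds), look at
# a high percentile and add a safety margin to pick a bot_update timeout.
durations = sorted([1.2, 2.0, 3.1, 4.8, 4.9])  # invented sample

def crude_percentile(sorted_vals, p):
    idx = min(len(sorted_vals) - 1, int(p * len(sorted_vals)))
    return sorted_vals[idx]

p999 = crude_percentile(durations, 0.999)
proposed_timeout_sec = max(15.0, 3 * p999)  # arbitrary 3x margin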
,
Mar 17 2016
git-remote-https takes a different amount of time depending on what it's checking out. E.g., if it's doing a large diff between two versions, a fresh clone, or a shallow clone, it could legitimately take 20+ minutes. bot_update actually can't introspect individual git processes because it shells out to gclient to do (most of) that.
,
Mar 17 2016
,
Mar 17 2016
I agree that we shouldn't drop this, but I want to get a better sense of the size of the problem. I see 10/10000, 10/1000, and 1/50 all mentioned as failure rates so far in this discussion. What's the real rate? Philippe, since you're familiar with the existing stat-collection infrastructure and don't see it as costly to set up, that sounds like the right place to start. What specific work would be needed to integrate with it?
,
Mar 18 2016
Detected 3 new flakes for test/step "bot_update". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyFQsSBUZsYWtlIgpib3RfdXBkYXRlDA. This message was posted automatically by the chromium-try-flakes app. Since flakiness is ongoing, the issue was moved back into Trooper Bug Queue (unless already there).
,
Mar 18 2016
estaab: what I said in #17. There is unfortunately no documentation yet; in broad strokes:
- Instrument bot_update/gclient so they measure the time taken by every git invocation, plus other relevant info, and write that data into a serialized proto on disk.
- At the end of the bot_update step, call send_monitoring_event, passing the name of the file that was previously written. Example: http://crrev.com/1477283002, in goma_util.py.
This relies on telling the event-mon server about the proto, which I can tell you about elsewhere (rough shape sketched below).
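Very roughly, the shape of that; JSON stands in for the real serialized proto and upload_fn for the real send_monitoring_event invocation, so treat this only as an outline:

# Outline only: bot_update accumulates per-git-call records, writes them to a
# file in a temp dir, and a final step hands that file to the uploader.
import json
import os
import tempfile

def flush_git_stats(git_stats, upload_fn):
    out_dir = tempfile.mkdtemp(prefix='bot_update_stats_')
    path = os.path.join(out_dir, 'git_stats.json')
    with open(path, 'w') as f:
        json.dump(git_stats, f)
    upload_fn(path)  # e.g. shells out to the monitoring uploader
    return path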
,
Mar 18 2016
I haven't found a good way to replace git-remote-https: replacing it via $PATH doesn't seem to work, since git looks in its own installation directory first. What if bot_update launched a transparent proxy (injected via the HTTP_PROXY env var) that monitors all the git connections? It could kill bot_update early if it detects that any connection is slower than some low threshold (less than 50 kbps for more than 5 minutes, or something).
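A sketch of just the stall-detection piece such a proxy would need; the thresholds mirror the made-up numbers above, and the actual HTTP/CONNECT proxying is omitted entirely:

# Detection logic only: the proxy would feed observed byte counts into one of
# these per connection and trigger a kill once a connection has stayed below
# the rate threshold for the whole window.
import time

class StallDetector(object):
    def __init__(self, min_bytes_per_sec=50 * 1024 // 8, window_sec=5 * 60):
        self.min_rate = min_bytes_per_sec  # ~50 kbps, made up above
        self.window = window_sec           # 5 minutes, made up above
        self.slow_since = None

    def record(self, bytes_transferred, interval_sec):
        rate = bytes_transferred / max(interval_sec, 1e-6)
        if rate >= self.min_rate:
            self.slow_since = None
            return False
        if self.slow_since is None:
            self.slow_since = time.time()
        return time.time() - self.slow_since >= self.window  # True => kill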
,
Mar 19 2016
Detected 3 new flakes for test/step "bot_update". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyFQsSBUZsYWtlIgpib3RfdXBkYXRlDA. This message was posted automatically by the chromium-try-flakes app. Since flakiness is ongoing, the issue was moved back into Trooper Bug Queue (unless already there).
,
Mar 21 2016
Detected 3 new flakes for test/step "bot_update". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyFQsSBUZsYWtlIgpib3RfdXBkYXRlDA. This message was posted automatically by the chromium-try-flakes app. Since flakiness is ongoing, the issue was moved back into Trooper Bug Queue (unless already there).
,
Mar 22 2016
Detected 5 new flakes for test/step "bot_update". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyFQsSBUZsYWtlIgpib3RfdXBkYXRlDA. This message was posted automatically by the chromium-try-flakes app. Since flakiness is ongoing, the issue was moved back into Trooper Bug Queue (unless already there).
,
Mar 28 2016
These are more instances of the same issue: a forever-lingering bot_update process. Some thoughts:
1) Is git fetch incremental? That is, if I run git fetch for 20 minutes and it fetches some of the data, then interrupt it and run it again, will it repeat the work it did in those 20 minutes, or will it build on that work? I think the latter.
2) Why is it randomly taking this long? Network flake?
Proposal: run gclient in a monitoring process, interrupt it every 20 minutes, clean locks, and repeat. This should be fine because of (1); see the sketch below.
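A sketch of that proposal; the 20-minute interval comes from above, while the command line, lock handling, and retry count are assumptions:

# Sketch: run `gclient sync` under a watchdog. If it hasn't finished within
# the interval, kill it, remove stale lock files under .git directories, and
# start it again, relying on fetches being incremental (point 1 above).
import os
import subprocess
import time

def clean_git_locks(checkout_root):
    for dirpath, _, filenames in os.walk(checkout_root):
        if os.sep + '.git' not in dirpath and not dirpath.endswith('.git'):
            continue
        for name in filenames:
            if name.endswith('.lock'):
                os.remove(os.path.join(dirpath, name))

def sync_with_watchdog(checkout_root, interval_sec=20 * 60, max_rounds=3):
    for _ in range(max_rounds):
        proc = subprocess.Popen(['gclient', 'sync'], cwd=checkout_root)
        deadline = time.time() + interval_sec
        while proc.poll() is None and time.time() < deadline:
            time.sleep(10)
        if proc.poll() is not None:
            return proc.returncode
        proc.kill()
        proc.wait()
        clean_git_locks(checkout_root)
    return 1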
,
Mar 29 2016
Detected 3 new flakes for test/step "bot_update". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyFQsSBUZsYWtlIgpib3RfdXBkYXRlDA. This message was posted automatically by the chromium-try-flakes app. Since flakiness is ongoing, the issue was moved back into Trooper Bug Queue (unless already there).
,
Mar 30 2016
Detected 4 new flakes for test/step "bot_update". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyFQsSBUZsYWtlIgpib3RfdXBkYXRlDA. This message was posted automatically by the chromium-try-flakes app. Since flakiness is ongoing, the issue was moved back into Trooper Bug Queue (unless already there).
,
Mar 31 2016
Detected 5 new flakes for test/step "bot_update". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyFQsSBUZsYWtlIgpib3RfdXBkYXRlDA. This message was posted automatically by the chromium-try-flakes app. Since flakiness is ongoing, the issue was moved back into Trooper Bug Queue (unless already there).
,
Apr 1 2016
Detected 3 new flakes for test/step "bot_update". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyFQsSBUZsYWtlIgpib3RfdXBkYXRlDA. This message was posted automatically by the chromium-try-flakes app. Since flakiness is ongoing, the issue was moved back into Trooper Bug Queue (unless already there).
,
Apr 11 2016
Detected 3 new flakes for test/step "bot_update". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyFQsSBUZsYWtlIgpib3RfdXBkYXRlDA. This message was posted automatically by the chromium-try-flakes app. Since flakiness is ongoing, the issue was moved back into Trooper Bug Queue (unless already there).
,
Apr 13 2016
Detected 5 new flakes for test/step "bot_update". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyFQsSBUZsYWtlIgpib3RfdXBkYXRlDA. This message was posted automatically by the chromium-try-flakes app. Since flakiness is ongoing, the issue was moved back into Trooper Bug Queue (unless already there).
,
Apr 13 2016
Removing this ticket from the trooper queue.
,
Apr 22 2016
Detected 3 new flakes for test/step "bot_update". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyFQsSBUZsYWtlIgpib3RfdXBkYXRlDA. This message was posted automatically by the chromium-try-flakes app. Since flakiness is ongoing, the issue was moved back into Trooper Bug Queue (unless already there).
,
Apr 23 2016
Detected 7 new flakes for test/step "bot_update". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyFQsSBUZsYWtlIgpib3RfdXBkYXRlDA. This message was posted automatically by the chromium-try-flakes app. Since flakiness is ongoing, the issue was moved back into Trooper Bug Queue (unless already there).
,
Apr 27 2016
,
Apr 27 2016
Detected 3 new flakes for test/step "bot_update". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyFQsSBUZsYWtlIgpib3RfdXBkYXRlDA. This message was posted automatically by the chromium-try-flakes app. Since flakiness is ongoing, the issue was moved back into Trooper Bug Queue (unless already there).
,
May 3 2016
Detected 3 new flakes for test/step "bot_update". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyFQsSBUZsYWtlIgpib3RfdXBkYXRlDA. This message was posted automatically by the chromium-try-flakes app. Since flakiness is ongoing, the issue was moved back into Trooper Bug Queue (unless already there).
,
May 13 2016
Issue 423342 has been merged into this issue.
,
Jul 28 2016
Any progress on this? Do we have plans to make progress? There seem to have been no more incidents since May 3, so was it actually fixed? I'm removing it from the trooper queue; if it happens again it'll be re-added automatically.
,
Nov 10 2016
,
Nov 16 2016
Ryan, you are set as the owner on this; do you have any updates?
,
Nov 16 2016
I do not
,
Nov 16 2016
That fired early: I do have a related update. I believe this issue is similar to issue 635641, which tandrii tackled for a while before determining that the effort was not worth the impact.
,
Nov 16 2016
tandrii@ - can you comment on hinoka's comment #49? This is listed as a P1, which indicates impact, contradicting that statement.
,
Nov 16 2016
+jparent@ This is a dup of what I tackled and should be merged once we finish this discussion. The bug's priority may indicate impact, but there is nothing in the bug justifying that priority; it seems to have been set to Pri-1 at the time of automatic filing. See my comment https://bugs.chromium.org/p/chromium/issues/detail?id=635641#c23 for my analysis of the potential impact as of mid-August: 10 failures per day over master.tryserver.chromium.*. Sadly, the URL in that comment points to a script which I've since changed, so I'll try to redo the analysis.
,
Nov 16 2016
Things have gotten much worse since then :( There are now ~40 failures per work day (assuming nothing happens 2 out of every 7 days). Very roughly, this means that 4% of CQ attempts (dry or full runs) will have one builder time out in bot_update and be retried. So it now seems more impactful than before, though no less difficult. See data + analysis queries here: https://docs.google.com/a/google.com/spreadsheets/d/1kQz0xLwd3_A5OWPC7xL-EmCfvb2NGl27XGR-irEKmFY/edit?usp=sharing
,
Nov 16 2016
I've been throwing around the idea of instrumenting git with a wrapper: we call the wrapper, and it runs git with the GIT_CURL_TRACE=1 env vars set so that it can detect when something is stuck and kill it. As a bonus, we can get ts_mon metrics for individual git calls.
,
Nov 16 2016
OK, it sounds like fixing this will have a big effect on our infra failure rate and we should treat it as high priority. Ryan, can you own this? You have the most context on bot_update, and it sounds like you know some next steps to try. If not, assign it back to me and I'll find someone else.
,
Nov 17 2016
hinoka@ I like the wrapper idea, so long as it runs on Windows :)
,
Nov 21 2016
,
Nov 23 2016
@hinoka Are you owning this?
,
Nov 23 2016
Yes, I can own this for now (that is, try the wrapper approach).
,
Dec 3 2016
Mr Hinoka, since you agreed to own ... any updates? Owned P1s with no updates in several weeks make me sad ...
,
Dec 5 2016
No progress yet, but I'll take a stab at it today.
,
Dec 6 2016
,
Dec 8 2016
,
Jan 16 2017
@hinoka - any update on this? Is this something you can maybe spend some time on this week?
,
Jan 17 2017
Just a CL: https://chromium-review.googlesource.com/c/417247/ I can pick it back up this week.
,
Jan 18 2017
,
Jan 25 2017
,
Mar 13 2017