Build a tool that allows reproducing and debugging flakes on a set of remote machines
Issue description
Problem: it is hard to reproduce a flake because it doesn't happen every time, and because it may only happen on machines with a specific configuration.
Idea: on demand from a developer, start a build on multiple remote machines running under gdb (with some breakpoints preset by the developer). Once one of the breakpoints is hit, freeze the execution, notify the developer and provide a way to connect to the remote gdb session. Builds on the other machines can then be interrupted. With swarming and dm, we can even temporarily take machines with rare configurations (e.g. bare-metal machines with specific GPUs) out of the tryserver or waterfall pools.
One concern is that, since developers may mess up the machines, we will probably want to reimage them after each debug session. It is also important to detect when developers forget about frozen slaves (e.g. no activity for X hours) and automatically terminate the session to avoid wasting capacity.
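To make the proposed lifecycle concrete, here is a rough sketch of the bookkeeping only: freeze the machine that hit a breakpoint, interrupt the other builds, and expire frozen sessions that have been idle too long. All names here (`DebugSession`, `on_breakpoint_hit`, `expire_idle`, `max_idle`) are made up for illustration and are not part of swarming, dm or gdb; starting gdb, notifying the developer and reimaging are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class State(Enum):
    RUNNING = "running"        # build is executing under the debugger
    FROZEN = "frozen"          # a breakpoint was hit; waiting for the developer
    TERMINATED = "terminated"  # interrupted or expired; machine can be reimaged


@dataclass
class DebugSession:
    # Hypothetical record for one remote machine taking part in a debug run.
    machine: str
    state: State
    last_activity: datetime


def on_breakpoint_hit(sessions, machine, now):
    """Freeze the session on `machine` and interrupt all other running builds."""
    for s in sessions:
        if s.machine == machine:
            s.state = State.FROZEN
            s.last_activity = now
        elif s.state is State.RUNNING:
            s.state = State.TERMINATED


def expire_idle(sessions, now, max_idle):
    """Terminate frozen sessions idle for longer than `max_idle`.

    Returns the names of the machines whose sessions were expired.
    """
    expired = []
    for s in sessions:
        if s.state is State.FROZEN and now - s.last_activity > max_idle:
            s.state = State.TERMINATED
            expired.append(s.machine)
    return expired
```

A periodic job could call `expire_idle` with whatever value ends up being chosen for "X hours"; any developer activity on a frozen session would need to bump `last_activity` to keep it alive.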
,
Apr 8 2016
Interesting article about the rr debugger, which can replay a program's execution deterministically by recording all of its interactions with the outside world (system calls, file system accesses, etc.): http://fitzgeraldnick.com/weblog/64/. Instead of blocking the machine so that the developer can debug things remotely, we could just deliver the recorded dump to the developer's machine and let them debug the program locally. Unfortunately, rr only works on Linux for now, so we would need to add support for Windows, Mac, Android and iOS, which could be rather non-trivial.
,
Jun 25 2016
M-A has a bug about deterministic builds across all of our infrastructure. I think his effort will help you with this.
,
Jun 25 2016
,
Jun 27 2016
I'm assigning all bugs for the flakiness pipeline to myself, since I am the only person working on this project. That does not mean that I intend to work on this any time soon. I usually mark bugs that I'm actively working on as "Started".
,
Jun 29 2016
,
Jul 8 2016
Not planning to work on this soon. Now that it has Infra>Flakiness>Pipeline, no need for it to be assigned to me.
,
Aug 3 2016
,
Jan 24 2017
Nothing internal here.
,
Apr 13 2018
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue. Sorry for the inconvenience if the bug really should have been left as Available. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Apr 13 2018
This was more or less implemented for internal users with the Debug button on the Swarming UI [1]. However, another issue we've discovered is that leased machines lack development tools, so I've proposed [2] setting up a dedicated pool of machines with an image that contains developer tools and that will be re-imaged after each use. Nothing is available for external users yet, so keeping this open. [1]: http://shortn/_LrW5q6x2ru [2]: http://shortn/_PBSFl47qh4
,
May 26 2018
This bug is more CAT than Foundation. This might need some features from Swarming, but it needs its own design first.
Comment 1 by serg...@chromium.org, Apr 8 2016
Status: Available (was: Assigned)