Issue 601840

Starred by 2 users

Issue metadata

Status: Untriaged
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Feature

Blocked on:
issue 314403

Blocking:
issue 545408



Build a tool that allows reproducing and debugging flakes on a set of remote machines

Project Member Reported by serg...@chromium.org, Apr 8 2016

Issue description

Problem: it is hard to reproduce a flake because it does not happen every time, and because it may only happen on machines with a specific configuration.

Idea: at a developer's request, start a build on multiple remote machines running under gdb (with some breakpoints preset by the developer). Once one of the breakpoints is hit, freeze the execution, notify the developer and provide a way to connect to the remote gdb session. Builds on the other machines can then be interrupted. With swarming and dm, we can even temporarily take machines with rare configurations (e.g. bare-metal machines with specific GPUs) out of the tryserver or waterfall pools.
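
A minimal sketch of how the "run under gdb with preset breakpoints" step could be assembled, assuming the test is launched from a plain command line. It only builds the gdb argv; distributing it to machines, notifying the developer and exposing the session remotely are not covered. The helper name `build_gdb_argv` and the test binary, filter and breakpoint in the usage are hypothetical.

```python
import shlex
from typing import List, Sequence


def build_gdb_argv(test_cmd: Sequence[str], breakpoints: Sequence[str]) -> List[str]:
    """Return an argv that starts test_cmd under gdb with breakpoints preset.

    Hypothetical helper: the issue does not specify how gdb would be launched.
    """
    if not test_cmd:
        raise ValueError("test_cmd must not be empty")
    argv = ["gdb"]
    for location in breakpoints:
        # Each -ex runs one gdb command at startup, in the order given.
        argv += ["-ex", f"break {location}"]
    # Start the program; gdb stops at its prompt when a breakpoint is hit.
    argv += ["-ex", "run"]
    # Everything after --args is the program to debug and its arguments.
    argv += ["--args", *test_cmd]
    return argv


if __name__ == "__main__":
    # Hypothetical test binary, filter and breakpoint location.
    argv = build_gdb_argv(
        ["./out/Debug/base_unittests", "--gtest_filter=Foo.Bar"],
        ["foo.cc:42"],
    )
    print(shlex.join(argv))
```

Keeping the command construction separate from the launcher would let the same breakpoint list be sent unchanged to every machine in the set.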

One concern is that developers may leave the machines in a bad state, so we will probably want to reimage them after each debug session. It is also important to detect when developers forget about frozen slaves (e.g. no activity for X hours) and automatically terminate the session to avoid wasting capacity.
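
The "no activity for X hours" check could look roughly like the sketch below. It assumes that each debug session records a timestamp of the last developer activity; the `DebugSession` type, its fields and `find_abandoned` are hypothetical names, and how activity would be measured is left open.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class DebugSession:
    """Hypothetical record of one frozen machine handed to a developer."""
    machine: str
    owner: str
    last_activity: datetime


def find_abandoned(
    sessions: Iterable[DebugSession],
    now: datetime,
    max_idle: timedelta,
) -> List[DebugSession]:
    """Return sessions that have been idle for at least max_idle."""
    return [s for s in sessions if now - s.last_activity >= max_idle]


if __name__ == "__main__":
    now = datetime(2016, 4, 8, 12, 0)
    sessions = [
        DebugSession("bot-1", "dev-a", now - timedelta(hours=5)),
        DebugSession("bot-2", "dev-b", now - timedelta(minutes=10)),
    ]
    for session in find_abandoned(sessions, now, timedelta(hours=4)):
        # A real tool would terminate the session and schedule a reimage here.
        print(f"terminate {session.machine} (owner {session.owner})")
```

Passing `now` in explicitly keeps the check deterministic and easy to test; the threshold X stays a policy parameter.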
 
Owner: ----
Status: Available (was: Assigned)
I'm not actually planning to work on this atm.
Interesting article about the rr debugger, which can replay a program's execution deterministically by recording all of its interactions with the outside world (system calls, file system accesses, etc.): http://fitzgeraldnick.com/weblog/64/. Instead of blocking the machine so that the developer can debug remotely, we could just deliver the recorded dump to the developer's machine and let them debug the program locally. Unfortunately, rr only works on Linux for now, so we would need to add support for Windows, Mac, Android and iOS, which could be rather non-trivial.
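
A rough sketch of the record-then-replay flow, assuming the bot runs `rr record` around the test and the developer later runs `rr replay` on the received trace. The helper names are hypothetical, and using the `_RR_TRACE_DIR` environment variable to choose where traces are stored is an assumption that should be checked against the installed rr version. Transferring the trace between machines is not covered.

```python
from typing import Dict, List, Sequence, Tuple


def rr_record_command(
    test_cmd: Sequence[str], trace_root: str
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env overrides) for recording test_cmd on a bot.

    Hypothetical helper. Assumes rr stores traces under the directory
    named by _RR_TRACE_DIR.
    """
    if not test_cmd:
        raise ValueError("test_cmd must not be empty")
    return ["rr", "record", *test_cmd], {"_RR_TRACE_DIR": trace_root}


def rr_replay_command(trace_dir: str) -> List[str]:
    """Return the argv a developer would run locally on a received trace."""
    return ["rr", "replay", trace_dir]


if __name__ == "__main__":
    # Hypothetical test binary and trace locations.
    argv, env = rr_record_command(["./out/Debug/base_unittests"], "/tmp/rr-traces")
    print(argv, env)
    print(rr_replay_command("/tmp/rr-traces/base_unittests-0"))
```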
Blockedon: 314403
M-A has a bug about deterministic builds across all of our infrastructure. I think his effort will help you with this. 
Components: -Infra Infra>Platform
Owner: serg...@chromium.org
Status: Assigned (was: Available)
I'm assigning all bugs for the flakiness pipeline to myself, since I am the only person working on this project. That does not mean that I intend to work on this any time soon. I usually mark bugs that I'm actively working on as "Started".
Components: Infra>Flakiness>Pipeline
Owner: ----
Status: Available (was: Assigned)
Not planning to work on this soon. Now that it has Infra>Flakiness>Pipeline, no need for it to be assigned to me.
Labels: Type-Bug
Labels: -Restrict-View-Google
Nothing internal here.
Project Member

Comment 10 by sheriffbot@chromium.org, Apr 13 2018

Labels: Hotlist-Recharge-Cold
Status: Untriaged (was: Available)
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue.

Sorry for the inconvenience if the bug really should have been left as Available.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
This was more or less implemented for internal users with the Debug button on the Swarming UI [1]. However, another issue that we've discovered is that leased machines lack development tools. Hence, I've proposed [2] to set up a dedicated pool of machines with an image that contains developer tools and that will be re-imaged after each use.

There is nothing that is available for external users yet, so keeping this open.

[1]: http://shortn/_LrW5q6x2ru
[2]: http://shortn/_PBSFl47qh4

Comment 12 by no...@chromium.org, May 26 2018

Cc: st...@chromium.org
Components: -Infra>Platform
Labels: -Type-Bug Type-Feature
This bug is more CAT than Foundation. This might need some features from Swarming, but it needs its own design first.
