
Issue 846872

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Jul 9
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Task
Design-Doc: https://docs.google.com/document/d/1_JdDR-EISy4e4VFMIe-nepOyXhz93232lOMOobf1pVg/edit#heading=h.12depar5iy3d




Investigate serial access to ganeti lab machines

Project Member Reported by cra...@chromium.org, May 25 2018

Issue description

Summary of action item:

During a recent lab-wide outage, certain critical machines (the AFE database hosts) became unreachable, and debugging them was quite difficult. Much of the effort during the outage went into trying to get serial access to the machines. Your mission, if you choose to accept it, is:
  * Investigate if it's possible to get serial access to ganeti machines. 
  * If so, circulate some documentation for how to do this.
  * If not, circulate your findings.

The folks who are interested are specifically the chromeos test team and the chromeos ci team. More specifically, craigb, mikenichols, and jclinton could probably be addressed directly in any discussion.
 
Labels: -Pri-2 -Chase-Pending Chase Pri-1
Owner: akes...@chromium.org

Comment 2 by cra...@chromium.org, May 30 2018

Labels: cros-infra-pm-2018-05-21

Comment 3 by cra...@chromium.org, May 30 2018

I happened upon the "Inability to SSH into your instance" section in this document:
https://g3doc.corp.google.com/company/teams/ganeti-sre/users/user-documentation.md?cl=head#ganeti-sla

This may or may not be relevant. :)

That does seem relevant. Experimenting with it:

$ cham cros-full-0004.mtv.corp.google.com
...

yields a login screen:

cros-full-0003 login:


Which allows me to log in via my corp username and corp password.

I will update a document somewhere accordingly.
Wait, 0004 gets a 0003 login?

Also, multiple people tried cham during the outage and it didn't work. Maybe it doesn't work when the Ganeti host has this NFS kernel lock problem?

hmm I think I mistyped when writing in the bug.
I think I logged out uncleanly from my cham session, and now new ones to that host do not open. Oops. Also IIRC only one cham session for a given host can be open at a time.

jclinton@ are you able to access cros-full-0003 now via cham?

$ cham cros-full-0004.mtv

If so, I think we can conclude that it is not an ACL problem, and is instead just that the instance was unavailable to serial access as well.
The login prompt says "Remember to log out when done. Use <newline>~. to quit." but none of my attempts to incant that are working. Any clues?
For the record, because I was ssh'd into my desktop machine and running cham through that ssh terminal, I typed the following to close the cham session: the Enter key, the "~" key, the "~" key again, then the "." key.

That worked.

Using only 1 tilde killed my ssh session to desktop rather than the cham session.
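
For reference, the tilde counting above can be sketched in Python. This is a simplified model, not cham's or OpenSSH's actual code. It assumes each outer hop is an OpenSSH client with the default "~" escape character, that cham honors "<newline>~." as its login banner says, and it ignores every other escape command. The names escape_keystrokes and closed_layer are made up for illustration.

```python
from typing import Optional

ESC = "~"  # assumed escape character (the OpenSSH default)


def escape_keystrokes(outer_ssh_layers: int) -> str:
    """Keys to type so that only the innermost session (e.g. cham) closes.

    outer_ssh_layers is the number of ssh hops wrapped around that session.
    Each hop swallows one tilde, so one extra tilde is needed per hop.
    """
    if outer_ssh_layers < 0:
        raise ValueError("outer_ssh_layers must be >= 0")
    return "\n" + ESC * (outer_ssh_layers + 1) + "."


def closed_layer(typed: str, layers: int) -> Optional[int]:
    """Return the index of the layer that disconnects (0 = outermost).

    Returns None if no layer sees a disconnect sequence. Every layer is
    modeled with the same simplified escape handling.
    """
    stream = typed
    for layer in range(layers):
        forwarded = []
        at_line_start = True  # a fresh session counts as start of line
        pending = False       # True right after an escape char at line start
        for ch in stream:
            if pending:
                pending = False
                if ch == ".":
                    return layer  # this layer disconnects
                if ch != ESC:
                    forwarded.append(ESC)  # not a doubled escape: pass it on
                forwarded.append(ch)       # doubled escape forwards one char
                at_line_start = ch in "\r\n"
                continue
            if ch == ESC and at_line_start:
                pending = True
                continue
            at_line_start = ch in "\r\n"
            forwarded.append(ch)
        stream = "".join(forwarded)
    return None


if __name__ == "__main__":
    # Layer 0 = ssh to desktop, layer 1 = cham (the setup described above).
    for keys in ("\n~.", "\n~~."):
        print(repr(keys), "closes layer", closed_layer(keys, 2))
```

Under those assumptions, one tilde closes the outer ssh session and two tildes reach cham, which matches what was observed above. Each additional ssh hop in front of cham would need one more tilde.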
Yep, I'm able to get in now. I'm assuming that when the Ganeti host is affected, cham is affected too.

Ok. Investigation shows that serial access is available if the host is sufficiently healthy, though it would not have helped during the instigating outage.

Holding open for weekly meeting to communicate this fact, and discuss whether we need a document for it and where.
Components: Infra>Client>ChromeOS>Test
I don't know how we missed this in today's chase discussion.
Discussed this week, but there is still a little confusion around this.

- Did we ping the Ganeti team about this, asking them what the "supported" way to get serial access is?
  - If we did this, please ping either way here. Close bug once confirmed how to get serial access / or even if we confirm there is none.

- Need a link to trusted docs for this from https://sites.google.com/a/google.com/chromeos/for-team-members/infrastructure/chromeos-admin

At the time, it was cham: https://g3doc.corp.google.com/company/teams/cham/users/cham-userguide.md?cl=head . That's what I used, and it just hung trying to connect, so it's likely that the Ganeti team doesn't handle the case where the host of the VM locks up.
Will put on sites page.
