Investigate serial access to ganeti lab machines
Issue description

Summary of action item: During a recent lab-wide outage, certain critical machines (AFE database hosts) went unreachable and debugging them was quite difficult. During the course of this time, much effort was spent trying to get serial access to the machines.

Your mission, if you choose to accept it, is:
* Investigate if it's possible to get serial access to ganeti machines.
* If so, circulate some documentation for how to do this.
* If not, circulate your findings.

Folks that are interested are specifically the chromeos test team and the chromeos ci team. More specifically, craigb, mikenichols, jclinton could probably be addressed directly in any discussion.
,
May 30 2018
,
May 30 2018
I happened upon the "Inability to SSH into your instance" section in this document: https://g3doc.corp.google.com/company/teams/ganeti-sre/users/user-documentation.md?cl=head#ganeti-sla This may or may not be relevant. :)
,
Jun 1 2018
That does seem relevant. Experimenting with it:
$ cham cros-full-0004.mtv.corp.google.com ...
yields a login screen:
cros-full-0003 login:
which allows me to log in via my corp username and corp password. I will update a document somewhere accordingly.
,
Jun 1 2018
Wait, 0004 gets a 0003 login? Also, multiple people tried cham during the outage and it didn't work. Maybe it doesn't work when the Ganeti host has this NFS kernel lock problem?
,
Jun 1 2018
hmm I think I mistyped when writing in the bug.
,
Jun 1 2018
I think I logged out uncleanly from my cham session, and now new ones to that host do not open. Oops. Also, IIRC only one cham session for a given host can be open at a time. jclinton@ are you able to access cros-full-0003 now via cham?
$ cham cros-full-0004.mtv
If so, I think we can conclude that it is not an ACL problem, and is instead just that the instance was unavailable to serial access as well.
,
Jun 1 2018
The login prompt says "Remember to log out when done. Use <newline>~. to quit." but none of my attempts to incant that are working. Any clues?
,
Jun 1 2018
,
Jun 1 2018
For the record, because I was ssh'd into my desktop machine and running through this ssh terminal, I typed the following to close the cham session. The enter key. The "~" key. The "~" key again. The "." key. That worked. Using only 1 tilde killed my ssh session to desktop rather than the cham session.
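The keystrokes above can be sketched as a small helper. This is only an illustration: `escape_keys` is a hypothetical function, not part of cham or ssh. It assumes every outer layer is an ssh client using the default "~" escape character (where "~~" passes a single "~" through to the next layer) and that the innermost session honors the "<newline>~." sequence its login prompt advertises. Only the one-layer case was actually tried in this bug.

```shell
# Hypothetical helper: print the keys to type to close the innermost
# session when it is nested inside N outer ssh sessions.
# Assumption: each outer ssh layer consumes one "~" of the sequence.
escape_keys() (
  outer_layers="${1:-0}"
  tildes="~"
  i=0
  while [ "$i" -lt "$outer_layers" ]; do
    tildes="${tildes}~"
    i=$((i + 1))
  done
  printf 'Enter %s .\n' "$tildes"
)

escape_keys 1   # cham run from a desktop reached over one ssh hop
```

With no outer ssh session (cham run directly on the local machine), `escape_keys 0` prints the plain sequence from the login prompt.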
,
Jun 1 2018
Yep, I'm able to get in now. I'm assuming that when the Ganeti host is affected, cham is affected as well.
,
Jun 1 2018
Ok. Investigation shows that serial access is available if the host is sufficiently healthy, though it would not have helped during the instigating outage. Holding open for the weekly meeting to communicate this fact and discuss whether we need a document for it, and where.
,
Jun 4 2018
I don't know how we missed this in today's chase discussion.
,
Jun 18 2018
Discussed this week, but there is still a little confusion around this.
- Did we ping the Ganeti team about this, asking them what the "supported" way to get serial access is?
- If we did this, please ping either way here. Close bug once confirmed how to get serial access / or even if we confirm there is none.
- Need a link to trusted docs for this from https://sites.google.com/a/google.com/chromeos/for-team-members/infrastructure/chromeos-admin
,
Jun 18 2018
At the time, it was cham (https://g3doc.corp.google.com/company/teams/cham/users/cham-userguide.md?cl=head). That's what I used, and it just hung trying to connect, so it is likely that the Ganeti team doesn't handle the case where the host of the VM locks up.
,
Jun 25 2018
Will put on sites page.
,
Jul 9
cham documentation added at https://sites.google.com/a/google.com/chromeos/for-team-members/infrastructure/chromeos-admin/lab-server-access
Comment 1 by akes...@chromium.org, May 29 2018
Owner: akes...@chromium.org