Admin Docs - Need devserver drain instructions.
Issue description

These docs: https://sites.google.com/a/google.com/chromeos/for-team-members/infrastructure/chromeos-admin/devserver-management give instructions on how to remove a devserver, but not how to shut it down without breaking tests. Can we update them with instructions on how to do that?

Remove Devservers

If a devserver is failing or flaky and causes test failures, it needs to be removed from the production environment. Before updating the shadow configs, first stop the devserver from being used in tests by shutting it down: you can disconnect the network, power down the machine, or stop the devserver service running on the machine. Until the shadow config change is merged and deployed to the lab, the devserver should not be put back online, or tests will start using it again.

Here is a sample CL for the changes needed to remove a devserver from production: https://chrome-internal-review.googlesource.com/#/c/374430/

If you plan to add the devserver back to production and its IP address won't change, you don't need to make the change in dhcpd.conf and nagios's hosts.cfg. Add the infra deputy to the code review for approval. After your change is merged, it might take up to 4 hours for the change to be deployed to the lab by puppet. For faster deployment, you can contact the infra deputy to force a puppet update in the lab.
,
Jun 9 2017
I don't know exactly what 'shut it down without breaking tests' is meant to achieve for a devserver. If a devserver stops working during a test, the test will break; that's bad luck. If a devserver stops working after a test, I think subsequent tests won't break, since the devserver will fail its health check and won't be chosen for any test. If a devserver is flaky, that is a problem. We can just remove it from puppet and force a puppet deployment (rather than waiting 4 hours). But usually our first attempt is to fix the devserver within this 4-hour window, since losing one devserver may cause overload issues.
,
Jun 9 2017
No... let's assume a devserver is working well, but we want to shut it down to replace it with a faster machine. How do we do that without breaking any tests that are currently running?
,
Jun 9 2017
PS: I filed this because I have a request for help doing this exact thing. ;>
,
Jun 9 2017
Don's suggestion sounds reasonable. I think what Don means is to shut down the devserver before it is removed from prod, so that it will not be used by any test during the removal process and thus will not break any test. I will update the instructions accordingly.
,
Jun 9 2017
Update the instruction to include steps about how to remove a devserver
,
Jun 9 2017
Re #3: oh, we want to replace a healthy devserver. I misunderstood and thought we wanted to replace a bad devserver :( We don't have a technique to "immediately" shut down a devserver without breaking the tests running with it. The only way to do it now is:
1. DON'T stop any service on the working devserver.
2. Add the new devserver and remove the old one.
3. Force puppet to remove the old devserver from prod immediately, to reduce the chance that it is selected for a test.
4. Wait a while and verify that no test is still using it.
This should be our devserver draining solution; it is actually almost the same as the devserver removal solution.
,
Jun 9 2017
xixuan: Can we update the admin docs with that? dschimmels: Is that enough to resolve your issues?
,
Jun 9 2017
I think comment 7 is not quite correct. Adding a new devserver won't help prevent the old devserver from being selected. And from my understanding, tests don't run on a devserver; it is used for staging images for tests to download. So, if it is a bad devserver, keeping the service running will 1. fail the current test that is requesting an image, and 2. leave a chance for it to be selected for new tests. So I think the original removal process is correct. There is no draining concept, since no tests run on the devserver.
,
Jun 9 2017
I was under the impression that we do have a draining process, but it's not very good. Something like: remove the devserver from the shadow config, wait 12 hours, consider it drained, and shut it down.
,
Jun 9 2017
Re #8: The document is updated. Re #9: 1. I never said that adding a new devserver will help avoid the old devserver being selected. Removing the bad/to-be-replaced devserver from puppet is actually a kind of draining; I agree with #10 that it's not perfect, since we need to wait for some time. 2. Stopping the devserver service WILL cause failed tests when we want to drain a 'GOOD' devserver. 3. Please note that if it's already a bad devserver, it won't be selected for staging images/provisioning even if it's in prod. But to be safe, we can stop the devserver service when draining 'BAD/FLAKY' devservers.
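
To make the GOOD vs. BAD/FLAKY distinction concrete, here is a minimal sketch of the order of operations discussed in this thread. The function name `drain_steps` and the step wording are hypothetical, made up for illustration; this is not an existing lab tool, and each step is still carried out by hand.

```python
def drain_steps(healthy: bool) -> list[str]:
    """Return the ordered, manual steps for taking a devserver out of prod.

    Hypothetical helper: it only encodes the order discussed in this thread.
    """
    if healthy:
        # Stopping the service on a healthy devserver fails in-flight tests.
        first = "keep the devserver service running"
    else:
        # A bad or flaky devserver should stop serving before anything else.
        first = "stop the devserver service"
    return [
        first,
        "remove the devserver from the shadow config / puppet",
        "ask the infra deputy to force a puppet update",
        "wait, then verify that no test still uses the devserver",
        "shut the machine down",
    ]


if __name__ == "__main__":
    for number, step in enumerate(drain_steps(healthy=True), start=1):
        print(f"{number}. {step}")
```

The only difference between the two cases is the first step; everything after it is the same removal flow.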
,
Jun 9 2017
Hi everyone, thanks for the help. Our setup is a little different from the standard Chrome setup. We have a VMware server that hosts 2 to 3 dev servers. One of the dev servers has crashed. We will need to bring down the VMware server to replace the hard drive for the crashed dev server. However, this will impact the working dev server, which creates the need to drain a 'GOOD' server.
,
Jun 9 2017
Hi, also, is there a way to verify after draining that tests are no longer running on this system?
,
Jun 9 2017
Re #13: You can check whether there is network traffic for the specified devserver: https://viceroy.corp.google.com/chromeos/machines?hostname=android1758-infra-devserver5&duration=1d&refresh=-1 We also have apache_log_metrics, but it was recently disabled temporarily.
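
If neither the dashboard nor the log metrics are available, one rough alternative is to look at the devserver's own access log. The sketch below assumes the devserver sits behind Apache writing the standard common/combined log format; the function names and the 12-hour default (taken from the wait suggested in #10) are illustrative, not an existing lab tool. Requests from monitoring probes, if any, would need to be filtered out first.

```python
import re
from datetime import datetime, timedelta, timezone

# Bracketed timestamp of Apache common/combined log format,
# e.g. [09/Jun/2017:13:55:36 -0700]
_TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def last_request_time(log_lines):
    """Return the newest request timestamp found in the lines, or None."""
    latest = None
    for line in log_lines:
        match = _TIMESTAMP.search(line)
        if match is None:
            continue  # skip lines that are not access-log entries
        stamp = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        if latest is None or stamp > latest:
            latest = stamp
    return latest


def looks_drained(log_lines, now, quiet_period=timedelta(hours=12)):
    """True if no request was logged within quiet_period before now."""
    latest = last_request_time(log_lines)
    return latest is None or now - latest >= quiet_period


if __name__ == "__main__":
    sample = ['10.0.0.1 - - [09/Jun/2017:13:55:36 -0700] "GET /x HTTP/1.1" 200 12']
    print(looks_drained(sample, now=datetime.now(timezone.utc)))
```

A quiet log only shows that nothing has fetched from the devserver recently; it does not prove that the scheduler has stopped selecting it, so it complements the config check and does not replace it.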
,
Jun 12 2017
,
Jun 19 2017
Comment 1 by dgarr...@chromium.org
, Jun 9 2017