puppet shadow_config change did not take effect until deleting and recreating shadow_config |
|||||
Issue descriptionContext: Issue 803239 a major lab outage Fix attempted: a revert of a puppet change to take a value (lucifer_level: "GATHERING") out of shadow_config: https://chrome-internal-review.googlesource.com/548278 Problem: The value did not disappear from shadow_config despite multiple puppet runs. It only disappeared after deleting shadow_config and running puppet. Some kind of append-only behavior in puppet handling of shadow_config?
,
Jan 18 2018
Severity note: this extended the outage by at least 16 hours.
,
Jan 18 2018
After landing the fix in #1, I see the following logging in a puppet run. This run succeeded in modifying the lucifer_level value. This certainly adds credence to the theory that puppet is doing some kind of append-only updates to shadow_config. Notice: /Stage[main]/Autotest::Shadow_config/Shadow_config[/usr/local/autotest/shadow_config.ini]/Shadow_config::Setting[/usr/local/autotest/shadow_config.ini LUCIFER lucifer_level]/Ini_setting[shadow_config /usr/local/autotest/shadow_config.ini LUCIFER lucifer_level]/value: value changed '[redacted sensitive information]' to '[redacted sensitive information]'
,
Jan 18 2018
This is the Nth time this has happened. Any ideas on how we can have a reasonable catch all state blocking mistaken git-reverts. Point to note is that a few times now people involved in the revert were well aware of the incorrectness of a revert, but we just forgot in the heat of the moment (you're dealing with outages after all). We need some sort of unit-test / pre-submit hook that forces us to think twice before reverting puppet changes. (also, this unit-test / pre-submit should not be part of the CLs adding the puppet state in the first place, or else it'd also get reverted and not catch anything). I didn't find any answers on the first 5 google search links for a general solution to this.
,
Jan 18 2018
Re c#4: Can you expand? Is the issue with explicitly doing a git revert, or with removing a config line and expecting it to be pushed out?
,
Jan 18 2018
re #4, I think it should be doable with Ruby without undue effort, and I think we have enough evidence to justify that it would be valuable to do (specifically for shadow_config, not a general solution for all cases where a resource could have been removed without being set to absent first; however, shadow_config is the primary offender I think). re #5 If you have a CL that adds "Ensure file exists" and then revert it, that is not telling Puppet to remove the file. You would have to land a new CL that changes it to "Ensure file does not exist". Thus, many Puppet changes cannot be reverted automatically. (If Puppet deleted all files that it wasn't told explicitly about, you would have to enumerate every single file on the system.)
,
Jan 18 2018
Re #6: Definitely in favor of landing something for shadow_config if that is easier, but I disagree that that is the primary offender. We've also removed services without actually disabling them, at least twice iirc.
,
Jan 18 2018
Re c#6: But in this case, the file still does exist, it's generated incorrectly. Or by file, does that extend to the contents/configuration within a given a file? I do not understand how we remove the a key that is not supposed to be in shadow_config.ini -- what is the correct procedure? Can we add some sort of check which catches files being deleted or lines being removed if that is never going to work?
,
Jan 18 2018
Doing it for services in Puppet is harder, because Puppet needs some way to identify the set of things for which it can disable; otherwise it will disable all system services that are not explicitly declared (e.g., logrotate or dhcp or some such). A pre-upload check wouldn't work for Gerrit reverts. Maybe something in the CQ. Only potential path found so far: ./run_puppet --noop --write-catalog-summary It's not very good though.
,
Jan 18 2018
This is turning into a lot of separate problems under discussion. Let's scope this bug back specifically to shadow_config. Is the current 'read and update' semantics we are using for shadow_config necessary, or can we make shadow_config fully re-dumped solely from puppet each time?
,
Jan 19 2018
> can we make shadow_config fully re-dumped solely from puppet each time We can, in the sense that Ruby is Turing complete. No, in the sense that I was unwilling to write too much Ruby for custom Puppet modules. I can't find the bug for it, but we have to be careful changing shadow_config in steps because there is a real race condition between autoserv and scheduler restarts (and the devserver list is generated dynamically in a separate step, so re-dumping the file will cause a race condition unless we move the devserver list generation entirely into Ruby, which see previous point). And being able to manage shadow_config by setting vs by whole file yields a huge improvement in complexity across server roles. I'm working on a change to purge unmarked shadow_config settings, which there is a mechanism for.
,
Jan 22 2018
https://chrome-internal-review.googlesource.com/c/chromeos/chromeos-admin/+/549423
,
Jan 22 2018
,
Jan 25 2018
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/0baea1d135c225d742c8336aad34d708bbca80aa commit 0baea1d135c225d742c8336aad34d708bbca80aa Author: Allen Li <ayatane@chromium.org> Date: Thu Jan 25 19:12:02 2018
,
Jan 25 2018
|
|||||
►
Sign in to add a comment |
|||||
Comment 1 by bugdroid1@chromium.org
, Jan 18 2018