New issue

Issue 803496

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Jan 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




puppet shadow_config change did not take effect until deleting and recreating shadow_config

Project Member Reported by akes...@chromium.org, Jan 18 2018

Issue description

Context: Issue 803239, a major lab outage.

Fix attempted: a revert of a puppet change to take a value (lucifer_level: "GATHERING") out of shadow_config: https://chrome-internal-review.googlesource.com/548278 


Problem:

The value did not disappear from shadow_config despite multiple puppet runs. It only disappeared after deleting shadow_config and running puppet.

Some kind of append-only behavior in puppet handling of shadow_config?
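For reference, a single shadow_config setting managed through the stock puppetlabs/inifile module would look roughly like this (a sketch only; the lab's actual Shadow_config::Setting wrapper seen in later logs may differ):

  # Sketch, assuming the stock puppetlabs/inifile ini_setting type.
  ini_setting { 'lucifer_level':
    ensure  => present,
    path    => '/usr/local/autotest/shadow_config.ini',
    section => 'LUCIFER',
    setting => 'lucifer_level',
    value   => 'GATHERING',
  }
  # Reverting the CL that added this block only deletes the declaration;
  # Puppet then stops managing the key, but nothing tells it to remove
  # the line from the .ini file. That matches the behavior seen here.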
 
Project Member

Comment 1 by bugdroid1@chromium.org, Jan 18 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/d7bfbc148b7af705e33aecbe68063db1aa928492

commit d7bfbc148b7af705e33aecbe68063db1aa928492
Author: Aviv Keshet <akeshet@chromium.org>
Date: Thu Jan 18 16:57:56 2018

Severity note: this extended the outage by at least 16 hours.
After landing the fix in #1, I see the following logging from a puppet run. This run succeeded in modifying the lucifer_level value, which lends credence to the theory that puppet is doing some kind of append-only update to shadow_config.


Notice: /Stage[main]/Autotest::Shadow_config/Shadow_config[/usr/local/autotest/shadow_config.ini]/Shadow_config::Setting[/usr/local/autotest/shadow_config.ini LUCIFER lucifer_level]/Ini_setting[shadow_config /usr/local/autotest/shadow_config.ini LUCIFER lucifer_level]/value: value changed '[redacted sensitive information]' to '[redacted sensitive information]'
This is the Nth time this has happened.

Any ideas on how we can have a reasonable catch-all safeguard against mistaken git reverts? A point to note: a few times now, the people involved were well aware that a plain revert would be incorrect, but we forgot in the heat of the moment (you're dealing with an outage, after all).

We need some sort of unit test / pre-submit hook that forces us to think twice before reverting puppet changes. (Also, this check should not be part of the CLs adding the puppet state in the first place, or else it would get reverted along with them and catch nothing.)

The first five Google results didn't turn up a general solution to this.
Re c#4: Can you expand?  Is the issue with explicitly doing a git revert, or with removing a config line and expecting it to be pushed out? 
Status: Assigned (was: Untriaged)
Re #4: I think this should be doable in Ruby without undue effort, and I think we have enough evidence to justify that it would be valuable (specifically for shadow_config, not a general solution for all cases where a resource could be removed without first being set to absent; shadow_config is the primary offender, I think).

Re #5: If you have a CL that adds "ensure file exists" and then revert it, that does not tell Puppet to remove the file.  You would have to land a new CL that changes it to "ensure file does not exist".  Thus, many Puppet changes cannot be reverted automatically.  (If Puppet deleted every file it wasn't explicitly told about, you would have to enumerate every single file on the system.)
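A minimal illustration of that point, with a hypothetical path:

  # State as landed by the original CL (shown as a comment so the
  # snippet stands alone):
  #
  #   file { '/etc/example.conf':
  #     ensure  => file,
  #     content => "some setting\n",
  #   }
  #
  # A git revert merely deletes that declaration. Puppet stops managing
  # /etc/example.conf but never removes it; deleting the file takes a
  # new CL that declares:
  file { '/etc/example.conf':
    ensure => absent,
  }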
Re #6: Definitely in favor of landing something for shadow_config if that is easier, but I disagree that it is the primary offender. We've also removed services without actually disabling them, at least twice, IIRC.
Re c#6:

But in this case the file still exists; it's just generated incorrectly.  Or by "file", does that extend to the contents/configuration within a given file?

I do not understand how we remove a key that is not supposed to be in shadow_config.ini -- what is the correct procedure?

Can we add some sort of check which catches files being deleted or lines being removed if that is never going to work?
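(Sketching the key-removal procedure asked about above, assuming the stock inifile ini_setting type: the fix would be a new CL declaring the key absent, not a revert.)

  # Sketch: explicitly remove a key that should no longer be present.
  ini_setting { 'remove lucifer_level':
    ensure  => absent,
    path    => '/usr/local/autotest/shadow_config.ini',
    section => 'LUCIFER',
    setting => 'lucifer_level',
  }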
Doing it for services in Puppet is harder, because Puppet needs some way to identify the set of services it is allowed to disable; otherwise it would disable all system services that are not explicitly declared (e.g., logrotate or dhcp or some such).

A pre-upload check wouldn't work for Gerrit reverts.  Maybe something in the CQ.  The only potential path found so far is ./run_puppet --noop --write-catalog-summary, and it's not very good.
This is turning into a discussion of a lot of separate problems.

Let's scope this bug back specifically to shadow_config.

Are the current 'read and update' semantics we are using for shadow_config necessary, or can shadow_config be re-dumped in full from puppet on each run?
> can shadow_config be re-dumped in full from puppet on each run

We can, in the sense that Ruby is Turing complete.  No, in the sense that I was unwilling to write that much Ruby for custom Puppet modules.  I can't find the bug for it, but we have to be careful changing shadow_config in steps because there is a real race condition between autoserv and scheduler restarts.  The devserver list is also generated dynamically in a separate step, so re-dumping the whole file would introduce a race condition unless we moved devserver list generation entirely into Ruby (see the previous point).  And managing shadow_config setting by setting rather than as a whole file is a huge reduction in complexity across server roles.
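As a sketch of the trade-off being described (the template path is hypothetical):

  # Whole-file ("re-dump") management: the file's content is fully
  # determined by Puppet, so removed settings vanish on the next run.
  # The cost: every contributor to the file, including the dynamically
  # generated devserver list, has to be folded into this one template,
  # hence the race-condition and Ruby concerns above.
  file { '/usr/local/autotest/shadow_config.ini':
    ensure  => file,
    content => template('autotest/shadow_config.ini.erb'),
  }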

I'm working on a change to purge unmarked shadow_config settings; there is an existing mechanism for that.
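(For reference, Puppet's generic purge-unmanaged pattern looks like the following, shown with the built-in purgeable host type, since the shadow_config-specific mechanism is internal and may differ.)

  # Any /etc/hosts entry not declared in the catalog is removed on the
  # next run; declared entries survive.
  resources { 'host':
    purge => true,
  }
  host { 'keep-me.example.com':
    ensure => present,
    ip     => '10.0.0.1',
  }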
Labels: -Chase-Pending Chase
Status: Assigned (was: Started)
Project Member

Comment 14 by bugdroid1@chromium.org, Jan 25 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/0baea1d135c225d742c8336aad34d708bbca80aa

commit 0baea1d135c225d742c8336aad34d708bbca80aa
Author: Allen Li <ayatane@chromium.org>
Date: Thu Jan 25 19:12:02 2018

Status: Fixed (was: Assigned)
