Configuration Management #
Configuration management keeps systems consistent, secure, and auditable as environments scale.
Overview #
A strong configuration-management practice provides:
- Repeatable system state across environments.
- Fast, low-risk changes using version-controlled automation.
- Drift detection and remediation workflows.
- Evidence for security and compliance controls.
It also reduces operational dependency on individual operators by codifying system behavior into reusable, reviewed modules.
Core principles #
- Desired state as code: define the intended end state in version control.
- Idempotency: repeated runs converge on the same state and make no further changes once that state is reached.
- Small safe changes: prefer frequent, scoped updates over large batch changes.
- Observability-first: capture execution logs and post-change health signals.
- Rollback readiness: every rollout path should have a tested rollback or replacement strategy.
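The idempotency principle is easiest to see by contrast. The following is a minimal sketch, assuming a hypothetical sysctl drop-in file and setting chosen only for illustration: the first task changes the host on every run, while the second declares the desired line and leaves the file alone once it is present.

```yaml
# Non-idempotent: every run appends another copy of the line and reports "changed".
- name: Append setting with a shell command (avoid)
  ansible.builtin.shell: echo "net.ipv4.ip_forward = 0" >> /etc/sysctl.d/99-baseline.conf

# Idempotent: the line is added once; later runs find it present and report "ok".
# The path and the setting are hypothetical examples, not part of any stated baseline.
- name: Ensure setting is present
  ansible.builtin.lineinfile:
    path: /etc/sysctl.d/99-baseline.conf
    regexp: '^net\.ipv4\.ip_forward\s*='
    line: net.ipv4.ip_forward = 0
    create: true
    mode: "0644"
```

Running the second task twice in a row should show "changed" on the first run and "ok" on the second, which is a quick manual check for idempotency.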
When to use configuration management / decision criteria #
Use configuration management when you need:
- Consistent host or cluster configuration across many nodes.
- Standardized hardening baselines and patching patterns.
- Repeatable bootstrapping for ephemeral or replacement infrastructure.
Choose strategy by operating model:
- Declarative convergence for policy-driven steady-state control.
- Imperative orchestration for sequenced changes and workflows.
- Immutable replacement for services that are rebuilt and replaced instead of patched in place.
As a rule of thumb, converge mutable shared services continuously, and use immutable patterns for stateless or horizontally scaled workloads.
Architecture patterns #
1) Agentless push model #
- Typical with Ansible.
- Good for smaller fleets and controlled change windows.
- Requires secure orchestrator access to managed nodes.
- Works well when teams want explicit deployment control.
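As an illustration of the push model, the play below is a sketch of how a control node might roll a baseline out in small batches. The group name web and the file name site.yml are assumptions; base-hardening stands in for whatever baseline role the repository provides.

```yaml
# site.yml: run from a control node that has SSH access to the hosts in "web".
# Assumption: "web" is a hypothetical inventory group.
- name: Apply baseline to web servers in small batches
  hosts: web
  become: true
  serial: 2               # update two hosts at a time
  max_fail_percentage: 0  # stop the rollout if any host in a batch fails
  roles:
    - base-hardening
```

Because the operator starts each run explicitly, the batch size and failure threshold act as the deployment controls for a change window.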
2) Agent-based pull model #
- Typical with Puppet/Chef/Salt in enterprise fleets.
- Good for continuous convergence and large scale.
- Requires robust PKI, node registration, and agent lifecycle management.
- Best when nodes must self-correct drift on a recurring cadence.
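Puppet, Chef, and Salt each ship their own agent, and their configuration is not shown here. To keep the examples in one tool, the sketch below swaps in Ansible's pull mode instead: a cron entry that makes each node fetch the repository and apply a local playbook on a schedule. The repository URL, branch, and playbook name are placeholders.

```yaml
# Assumption: the repository URL, branch name, and local.yml are placeholders.
# Each node pulls the repository and applies local.yml to itself every 30 minutes,
# so drift is corrected even when nobody triggers a run.
- name: Schedule periodic self-convergence with ansible-pull
  ansible.builtin.cron:
    name: ansible-pull baseline
    user: root
    minute: "*/30"
    job: >-
      ansible-pull --url https://git.example.internal/infra/config.git
      --checkout main local.yml
```

This gives the recurring-cadence behavior of a pull model, but without the PKI and node registration that a full agent platform provides.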
3) Immutable image pattern #
- Build hardened images in CI.
- Replace nodes via rolling updates.
- Pair with startup bootstrap for dynamic runtime values.
- Ideal for reducing configuration drift in elastic environments.
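The startup bootstrap step can be sketched with cloud-init user data, assuming the image runs cloud-init at first boot. The file path, service name, and values are hypothetical; in practice the provisioning tool would usually template them per environment.

```yaml
#cloud-config
# Assumption: /etc/myapp/runtime.env and myapp.service are placeholder names.
# The image stays identical across environments; only these runtime values differ.
write_files:
  - path: /etc/myapp/runtime.env
    owner: root:root
    permissions: "0640"
    content: |
      ENVIRONMENT=staging
      REGION=eu-west-1
runcmd:
  - systemctl restart myapp.service
```

Keeping this bootstrap small matters: anything installed or patched here reintroduces the drift that the immutable image was meant to remove.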
Repository and module design #
A maintainable structure usually includes:
- modules/ or roles/ for reusable components.
- environments/ for stage-specific variables and inventory.
- policies/ for security and compliance checks.
- pipelines/ for lint, test, plan, and apply workflows.
Design guidance:
- Keep modules small and single-purpose.
- Expose clear input variables with defaults and validation.
- Avoid embedding secrets or environment constants in shared modules.
- Version modules and maintain a changelog for breaking changes.
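For Ansible roles, one way to expose validated inputs with defaults is a role argument specification, which recent ansible-core releases check when the role starts. The sketch below assumes a role named base-hardening; both variable names are hypothetical.

```yaml
# roles/base-hardening/meta/argument_specs.yml
# Assumption: both option names are hypothetical inputs for this role.
argument_specs:
  main:
    short_description: Baseline hardening for SSH and core packages
    options:
      base_hardening_ssh_port:
        type: int
        default: 22
        description: TCP port that sshd listens on.
      base_hardening_permit_root_login:
        type: str
        default: "no"
        choices:
          - "no"
          - prohibit-password
        description: Value rendered into the PermitRootLogin directive.
```

With a specification like this, a caller that passes a string for the port or an unlisted login mode fails at the start of the role rather than halfway through a rollout.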
Security and cost guardrails #
Security baseline #
- Keep playbooks/manifests in version control with code review.
- Enforce signed commits/artifacts for critical automation.
- Store secrets in a dedicated secrets manager or in encrypted vault files, never in plaintext in playbooks or variable files.
- Restrict automation credentials to least privilege.
- Log all privileged automation actions for auditability.
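A minimal sketch of the secrets guidance, assuming the credentials live in a variables file encrypted with ansible-vault. The file names, the app group, and the myapp paths are hypothetical.

```yaml
# Assumption: vars/secrets.vault.yml is encrypted with ansible-vault and defines
# the variables that db.env.j2 renders; all names here are placeholders.
- name: Deploy application credentials
  hosts: app
  become: true
  vars_files:
    - vars/secrets.vault.yml
  tasks:
    - name: Render database credentials file
      ansible.builtin.template:
        src: db.env.j2
        dest: /etc/myapp/db.env
        owner: root
        group: myapp
        mode: "0640"
      no_log: true  # keep rendered secret values out of task output and logs
```

The vault password is supplied at run time, for example with --ask-vault-pass or --vault-password-file, so it never needs to be committed alongside the playbook.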
Cost and reliability baseline #
- Standardize reusable roles/modules to reduce maintenance overhead.
- Avoid one-off scripts that cannot be tested or reused.
- Measure change failure rate and mean-time-to-recovery for config rollouts.
- Decommission stale modules and orphaned inventories regularly.
Testing and promotion strategy #
Minimum testing pipeline #
- Lint and syntax check all configuration code.
- Run unit/static policy tests for module logic and security rules.
- Apply in ephemeral test environments.
- Validate service health and compliance assertions.
- Promote to staging and production with controlled approvals.
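The validation step can be expressed as a small verification play that runs after the apply step. The sketch below assumes the baseline disables root login over SSH, which is a hypothetical rule used only to show the shape of a compliance assertion.

```yaml
# Assumption: "root login disabled" is an example rule, not a stated requirement.
- name: Verify SSH baseline after apply
  hosts: all
  become: true
  gather_facts: false
  tasks:
    - name: Read the effective sshd configuration
      ansible.builtin.command: /usr/sbin/sshd -T
      register: sshd_effective
      changed_when: false  # read-only command, never counts as a change

    - name: Assert that root login is disabled
      ansible.builtin.assert:
        that:
          - "'permitrootlogin no' in sshd_effective.stdout"
        fail_msg: sshd still permits root login
        success_msg: root login is disabled

    - name: Check that sshd accepts connections
      ansible.builtin.wait_for:
        port: 22
        timeout: 10
```

A failed assertion fails the play, which gives the pipeline a clear signal to block promotion.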
Drift and rollback operations #
- Run scheduled drift detection and alert on high-severity deviations.
- Auto-remediate low-risk drift where safe.
- Require human approval for high-impact corrective actions.
- Keep known-good artifacts/manifests to support rapid rollback.
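Scheduled drift detection can reuse the same tasks that enforce state by running them in check mode. The sketch below, assuming the sshd template from the baseline role is available to the play, reports a difference without modifying the host.

```yaml
# Runs the enforcing task in check mode: Ansible computes what would change
# but leaves the file on disk untouched.
- name: Detect drift in sshd configuration
  hosts: all
  become: true
  tasks:
    - name: Compare the rendered template with the file on disk
      ansible.builtin.template:
        src: sshd_config.j2
        dest: /etc/ssh/sshd_config
        owner: root
        group: root
        mode: "0600"
      check_mode: true
      diff: true
      register: sshd_drift

    - name: Report drift
      ansible.builtin.debug:
        msg: "Drift detected in /etc/ssh/sshd_config on {{ inventory_hostname }}"
      when: sshd_drift is changed
```

Whether a reported difference is remediated automatically or routed for approval is a policy decision that sits outside this play.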
Implementation examples #
Example Ansible task snippet #
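    # Install the package that provides automatic security updates (Debian/Ubuntu).
    - name: Ensure unattended-upgrades is installed
      ansible.builtin.package:
        name: unattended-upgrades
        state: present

    # Render the hardened sshd_config; the handler runs only when the file changes.
    - name: Enforce sshd baseline
      ansible.builtin.template:
        src: sshd_config.j2
        dest: /etc/ssh/sshd_config
        owner: root
        group: root
        mode: "0600"
        validate: /usr/sbin/sshd -t -f %s  # reject a config that sshd cannot parse
      notify: Restart sshd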
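The snippet notifies a handler named Restart sshd, which must be defined elsewhere in the role or play. A minimal sketch of that handler follows. The unit name is an assumption: it is ssh on Debian and Ubuntu, and sshd on many other distributions.

```yaml
# handlers/main.yml
# Assumption: the service unit is "ssh" (Debian/Ubuntu); adjust to "sshd" where needed.
- name: Restart sshd
  ansible.builtin.service:
    name: ssh
    state: restarted
```

Handlers run once at the end of the play, and only if a notifying task reported a change, so an unchanged sshd_config does not trigger a restart.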
Example Ansible role pattern #
- roles/base-hardening: users, SSH, auditd, baseline packages.
- roles/node-exporter: monitoring agent installation/config.
- Environment variable files for per-stage differences.
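The per-stage variable files might look like the sketch below, which shows two files in one listing. The directory layout and all variable names are hypothetical; the point is that the roles stay identical and only these values differ between stages.

```yaml
# environments/staging/group_vars/all.yml
# Assumption: variable names are placeholders that the roles would consume.
base_hardening_permit_root_login: prohibit-password
node_exporter_listen_address: "0.0.0.0:9100"
---
# environments/production/group_vars/all.yml
base_hardening_permit_root_login: "no"
node_exporter_listen_address: "127.0.0.1:9100"
```

Placing group_vars next to each stage's inventory file lets Ansible load the matching values automatically when that inventory is selected.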
Example pipeline integration #
- Lint and syntax-check automation code.
- Run security/policy tests.
- Apply in ephemeral test environment.
- Promote to staging, then production with approvals.
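The stages above could be wired together as in the following sketch. It assumes GitHub Actions as the CI system, a site.yml playbook, inventories under environments/, and a Molecule scenario inside roles/base-hardening; none of these are prescribed here. Credential setup for the deploy jobs is omitted.

```yaml
# .github/workflows/config.yml
# Assumption: GitHub Actions, site.yml, and the paths below are illustrative choices.
name: config-pipeline
on:
  pull_request:
  push:
    branches: [main]

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install tooling
        run: pip install ansible-core ansible-lint molecule "molecule-plugins[docker]"
      - name: Lint
        run: ansible-lint
      - name: Syntax check
        run: ansible-playbook --syntax-check -i environments/staging/inventory.yml site.yml
      - name: Apply in an ephemeral test environment
        run: molecule test
        working-directory: roles/base-hardening

  deploy-staging:
    needs: lint-and-test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - run: pip install ansible-core
      - run: ansible-playbook -i environments/staging/inventory.yml site.yml

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production  # approval comes from required reviewers on this environment
    steps:
      - uses: actions/checkout@v4
      - run: pip install ansible-core
      - run: ansible-playbook -i environments/production/inventory.yml site.yml
```

Security and policy tests would slot in as an additional step in the first job, using whichever policy tool the team has adopted.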
Example governance checklist #
- Every change linked to an issue or service request.
- Mandatory peer review for production-impacting changes.
- Emergency change path with after-action review.
- Periodic access review for automation accounts.
Maturity roadmap (practical) #
- Level 1 - Scripted: basic automation exists but is inconsistent.
- Level 2 - Standardized: shared module patterns and enforced reviews.
- Level 3 - Verified: testing, policy checks, and controlled promotion gates.
- Level 4 - Autonomous: continuous convergence with risk-based remediation.
This model helps teams prioritize reliability and governance before attempting full automation at scale.
Pitfalls / anti-patterns #
- Editing servers manually and skipping source-of-truth updates.
- Mixing environment-specific values directly into shared roles.
- Non-idempotent scripts that create hidden drift.
- Treating configuration changes as untested operational tasks.
- Allowing module sprawl without ownership and lifecycle standards.