Configuration Management #

Configuration management keeps systems consistent, secure, and auditable as environments scale.

Overview #

A strong configuration-management practice provides:

  • Repeatable system state across environments.
  • Fast, low-risk changes using version-controlled automation.
  • Drift detection and remediation workflows.
  • Evidence for security and compliance controls.

It also reduces operational dependency on individual operators by codifying system behavior into reusable, reviewed modules.

Core principles #

  • Desired state as code: define the intended end state in version control.
  • Idempotency: applying the same configuration repeatedly converges on the same state, and a run against an already-compliant system makes no changes.
  • Small safe changes: prefer frequent, scoped updates over large batch changes.
  • Observability-first: capture execution logs and post-change health signals.
  • Rollback readiness: every rollout path should have a tested rollback or replacement strategy.
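
The idempotency principle can be illustrated with a minimal Python sketch. The helper name `ensure_line` and the example setting are hypothetical and not taken from any particular tool; the point is only that the function inspects current state before acting and reports whether it changed anything.

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file at `path`.

    Returns True if the file was modified, False if it already complied.
    (Hypothetical helper, for illustration only.)
    """
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # already in the desired state: do nothing
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cfg = Path(tmp) / "example.conf"
        first = ensure_line(cfg, "PermitRootLogin no")
        second = ensure_line(cfg, "PermitRootLogin no")
        print(first, second)
```

The first call reports a change and the second reports none, which is the same "changed / ok" distinction that configuration tools surface in their run output.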

When to use configuration management / decision criteria #

Use configuration management when you need:

  • Consistent host or cluster configuration across many nodes.
  • Standardized hardening baselines and patching patterns.
  • Repeatable bootstrapping for ephemeral or replacement infrastructure.

Choose strategy by operating model:

  • Declarative convergence for policy-driven steady-state control.
  • Imperative orchestration for sequenced changes and workflows.
  • Immutable pattern where services are rebuilt/replaced instead of patched in place.

As a rule of thumb, converge mutable shared services continuously, and use immutable patterns for stateless or horizontally scaled workloads.

Architecture patterns #

1) Agentless push model #

  • Typical with Ansible.
  • Good for smaller fleets and controlled change windows.
  • Requires secure access from the control node to every managed node, typically over SSH.
  • Works well when teams want explicit deployment control.

2) Agent-based pull model #

  • Typical with Puppet and Chef in enterprise fleets; Salt also runs an agent (minion), although its master can push changes on demand.
  • Good for continuous convergence and large scale.
  • Requires robust PKI, node registration, and agent lifecycle management.
  • Best when nodes must self-correct drift on a recurring cadence.

3) Immutable image pattern #

  • Build hardened images in CI.
  • Replace nodes via rolling updates.
  • Pair with startup bootstrap for dynamic runtime values.
  • Ideal for reducing configuration drift in elastic environments.
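
A rolling replacement can be sketched as a loop that swaps nodes in small batches and halts as soon as a new node fails its health check. This is a simplified illustration, not a real orchestrator: `rolling_replace`, `replace`, and `is_healthy` are hypothetical names, and a production rollout would also drain traffic and clean up failed replacements.

```python
from typing import Callable


def rolling_replace(
    nodes: list[str],
    batch_size: int,
    replace: Callable[[str], str],
    is_healthy: Callable[[str], bool],
) -> list[str]:
    """Replace nodes batch by batch; stop at the first unhealthy batch.

    Returns the resulting fleet. Nodes in the failed batch and in all later
    batches stay on their original image.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    fleet = list(nodes)
    for start in range(0, len(fleet), batch_size):
        batch = fleet[start:start + batch_size]
        replacements = [replace(node) for node in batch]
        if not all(is_healthy(node) for node in replacements):
            return fleet  # halt the rollout, keep the old nodes in service
        fleet[start:start + batch_size] = replacements
    return fleet
```

Keeping the batch size small limits the blast radius: a bad image is detected after at most one batch has been replaced.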

Repository and module design #

A maintainable structure usually includes:

  • modules/ or roles/ for reusable components.
  • environments/ for stage-specific variables and inventory.
  • policies/ for security and compliance checks.
  • pipelines/ for lint, test, plan, and apply workflows.

Design guidance:

  • Keep modules small and single-purpose.
  • Expose clear input variables with defaults and validation.
  • Avoid embedding secrets or environment constants in shared modules.
  • Version modules and maintain a changelog for breaking changes.
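
As a language-neutral illustration of "clear input variables with defaults and validation", the sketch below models a module's inputs as a Python dataclass. The class and field names (`SshHardeningInputs`, `allowed_groups`, and so on) are assumptions made for the example; most configuration tools offer their own mechanism for declaring and validating inputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SshHardeningInputs:
    """Inputs for a hypothetical SSH-hardening module."""

    port: int = 22                                  # safe default
    permit_root_login: bool = False                 # hardened default
    allowed_groups: tuple[str, ...] = ("ssh-users",)

    def __post_init__(self) -> None:
        # Fail fast with a clear message instead of rendering a bad config.
        if not 1 <= self.port <= 65535:
            raise ValueError(f"port must be in 1-65535, got {self.port}")
        if not self.allowed_groups:
            raise ValueError("allowed_groups must not be empty")
```

Callers override only what differs per environment, for example `SshHardeningInputs(port=2222)`, and an invalid value is rejected before anything touches a host.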

Security and cost guardrails #

Security baseline #

  • Keep playbooks/manifests in version control with code review.
  • Enforce signed commits/artifacts for critical automation.
  • Store secrets in a dedicated secrets manager or encrypted vault, never as plaintext in playbooks or manifests.
  • Restrict automation credentials to least privilege.
  • Log all privileged automation actions for auditability.
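
The "no secrets in playbooks" rule is easier to enforce with an automated check in review or CI. The sketch below is a deliberately naive scanner, assuming secret-like keys can be recognized by name; the pattern and the function `find_plaintext_secrets` are hypothetical, and real scanners use far richer rule sets.

```python
import re

# Assumption: a key whose name contains one of these words holds a secret.
SECRET_KEY = re.compile(
    r"^\s*-?\s*(\w*(password|secret|token|api_key)\w*)\s*:\s*(.+)$",
    re.IGNORECASE,
)


def find_plaintext_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line_number, key) for secret-like keys with literal values."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = SECRET_KEY.match(line)
        if not match:
            continue
        value = match.group(3).strip().strip("\"'")
        # Templated lookups and vault-encrypted values are acceptable.
        if value.startswith("{{") or value.startswith("!vault"):
            continue
        findings.append((lineno, match.group(1)))
    return findings
```

A pipeline step could fail the build whenever this function returns a non-empty list for any changed file.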

Cost and reliability baseline #

  • Standardize reusable roles/modules to reduce maintenance overhead.
  • Avoid one-off scripts that cannot be tested or reused.
  • Measure change failure rate and mean time to recovery (MTTR) for config rollouts.
  • Decommission stale modules and orphaned inventories regularly.
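
Both rollout metrics are simple to compute once each rollout is recorded with its failure and recovery times. The sketch below assumes a hypothetical `Rollout` record; in practice the data would come from the pipeline or incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Rollout:
    """One configuration rollout (hypothetical record shape)."""

    name: str
    failed_at: Optional[datetime] = None      # None means the rollout succeeded
    recovered_at: Optional[datetime] = None   # when service was restored


def change_failure_rate(rollouts: list[Rollout]) -> float:
    """Fraction of rollouts that failed; 0.0 for an empty history."""
    if not rollouts:
        return 0.0
    failures = sum(1 for r in rollouts if r.failed_at is not None)
    return failures / len(rollouts)


def mean_time_to_recovery(rollouts: list[Rollout]) -> Optional[timedelta]:
    """Average failure-to-recovery time, or None if nothing has recovered."""
    durations = [
        r.recovered_at - r.failed_at
        for r in rollouts
        if r.failed_at is not None and r.recovered_at is not None
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)
```

For a history of four rollouts in which one failed and recovered after 30 minutes, the change failure rate is 0.25 and the mean time to recovery is 30 minutes.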

Testing and promotion strategy #

Minimum testing pipeline #

  1. Lint and syntax check all configuration code.
  2. Run unit/static policy tests for module logic and security rules.
  3. Apply in ephemeral test environments.
  4. Validate service health and compliance assertions.
  5. Promote to staging and production with controlled approvals.
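
The ordering matters because each stage gates the next. A minimal sketch of that gating logic, assuming each stage is represented by a callable that returns True on success (the stage names and `run_pipeline` are hypothetical, not a specific CI system's API):

```python
from typing import Callable

Stage = tuple[str, Callable[[], bool]]


def run_pipeline(stages: list[Stage]) -> list[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, check in stages:
        if not check():
            break  # later stages, including promotion, never run
        completed.append(name)
    return completed


if __name__ == "__main__":
    stages = [
        ("lint", lambda: True),
        ("policy-tests", lambda: True),
        ("ephemeral-apply", lambda: False),  # simulated failure
        ("health-validation", lambda: True),
        ("promote", lambda: True),
    ]
    print(run_pipeline(stages))
```

With the simulated failure above, only the lint and policy stages complete, so nothing is promoted.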

Drift and rollback operations #

  • Run scheduled drift detection and alert on high-severity deviations.
  • Auto-remediate low-risk drift where safe.
  • Require human approval for high-impact corrective actions.
  • Keep known-good artifacts/manifests to support rapid rollback.
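
The drift workflow above can be sketched as two steps: compare desired state with observed state, then route each deviation to automatic remediation or to human approval. The setting names and the set of high-impact keys are assumptions for illustration; real tools collect the observed state from the node.

```python
_MISSING = object()  # marks a key that is absent on the node


def detect_drift(desired: dict, actual: dict) -> dict:
    """Return {key: (desired_value, actual_value)} for each deviating key."""
    drift = {}
    for key, want in desired.items():
        have = actual.get(key, _MISSING)
        if have != want:
            drift[key] = (want, None if have is _MISSING else have)
    return drift


def plan_remediation(drift: dict, high_impact_keys: set) -> dict:
    """Split drifted keys into auto-remediable and approval-gated groups."""
    plan = {"auto_remediate": [], "needs_approval": []}
    for key in sorted(drift):
        bucket = "needs_approval" if key in high_impact_keys else "auto_remediate"
        plan[bucket].append(key)
    return plan


if __name__ == "__main__":
    desired = {"PermitRootLogin": "no", "motd": "authorized use only"}
    actual = {"PermitRootLogin": "yes"}
    drift = detect_drift(desired, actual)
    print(plan_remediation(drift, high_impact_keys={"PermitRootLogin"}))
```

Here the missing banner is fixed automatically, while the changed SSH setting waits for approval because it may indicate tampering or an emergency change that was never recorded.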

Implementation examples #

Example Ansible task snippet #

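# Installs the package only; turning automatic upgrades on also needs
# APT periodic configuration, which is not shown here.
- name: Ensure unattended-upgrades is installed
  ansible.builtin.package:
    name: unattended-upgrades
    state: present

# Renders the hardened sshd_config. The "Restart sshd" handler must be
# defined in the role's handlers for the notify to resolve.
- name: Enforce sshd baseline
  ansible.builtin.template:
    src: sshd_config.j2
    dest: /etc/ssh/sshd_config
    owner: root
    group: root
    mode: "0600"
    # Check the rendered file before it replaces the live config
    # (binary path assumed; adjust for the target distribution).
    validate: /usr/sbin/sshd -t -f %s
  notify: Restart sshd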

Example Ansible role pattern #

  • roles/base-hardening: users, SSH, auditd, baseline packages.
  • roles/node-exporter: monitoring agent installation/config.
  • Per-environment variable files (for example group_vars) for stage-specific differences.

Example pipeline integration #

  1. Lint and syntax-check automation code.
  2. Run security/policy tests.
  3. Apply in ephemeral test environment.
  4. Promote to staging, then production with approvals.

Example governance checklist #

  • Every change linked to an issue or service request.
  • Mandatory peer review for production-impacting changes.
  • Emergency change path with after-action review.
  • Periodic access review for automation accounts.

Maturity roadmap (practical) #

  • Level 1 - Scripted: basic automation exists but is inconsistent.
  • Level 2 - Standardized: shared module patterns and enforced reviews.
  • Level 3 - Verified: testing, policy checks, and controlled promotion gates.
  • Level 4 - Autonomous: continuous convergence with risk-based remediation.

This model helps teams prioritize reliability and governance before attempting full automation at scale.

Pitfalls / anti-patterns #

  • Editing servers manually and skipping source-of-truth updates.
  • Mixing environment-specific values directly into shared roles.
  • Non-idempotent scripts that create hidden drift.
  • Treating configuration changes as ad hoc operational tasks that bypass testing.
  • Allowing module sprawl without ownership and lifecycle standards.
