Automating Internal Databases Operations at OVHcloud with Ansible

CfgMgmtCamp 2024

Julien RIOU

February 6, 2024



Speaker


Summary


Who are we?


All products rely on internal databases


Managed infrastructure


Cluster example


Mutualized environment


Management tools


Infrastructure as Code

Using Terraform (Enterprise).

Providers:


Configuration management

Using Puppet.

Operating system security hardening:


One-shot operations

Using Ansible.


Operation examples


Automation


Deep dive into Ansible


Code base

Architecture of a playbook


Reusable tasks


Real-world examples


Schema migrations


sql-migrate

-- +migrate Up
create table author (
    id   bigserial primary key,
    name text not null
);

create table talk (
    id        bigserial primary key,
    title     text not null,
    author_id bigint not null references author(id)
);

-- +migrate Down
drop table author, talk;
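
To make the file layout concrete, here is a minimal sketch of how the `-- +migrate Up` and `-- +migrate Down` markers split a migration file into two sections. This is an illustration only, not how sql-migrate itself parses files; the function name `split_migration` and the shortened sample migration are assumptions made for the example.

```python
def split_migration(sql_text):
    """Split a sql-migrate style file into its 'up' and 'down' SQL sections."""
    sections = {"up": [], "down": []}
    current = None
    for line in sql_text.splitlines():
        marker = line.strip().lower()
        if marker.startswith("-- +migrate up"):
            current = "up"
        elif marker.startswith("-- +migrate down"):
            current = "down"
        elif current is not None:
            # Regular SQL line: attach it to the section opened last.
            sections[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}


# Shortened, hypothetical migration used only for this example.
MIGRATION = """\
-- +migrate Up
create table author (
    id   bigserial primary key,
    name text not null
);

-- +migrate Down
drop table author;
"""

parts = split_migration(MIGRATION)
print(parts["down"])
```

The "up" section is applied when migrating forward, the "down" section when reverting.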

Schema migrations


Playbook overview

- name: check arguments
  hosts: all
  run_once: true
  delegate_to: localhost
  tasks:
    - name: check variable schema_url    # fail fast
    - name: check variable database_name # fail fast
- name: update database to the latest schema migration
  hosts: "{{ database_name }}:&subrole_primary"
  tasks:
    - name: create sql-migrate directories
    - name: create sql-migrate configuration file
    - name: clone schema repository
    - name: run migrations

Playbook tasks

- name: create sql-migrate directories
  ansible.builtin.file:
    path: "{{ item }}"
    state: directory
  loop:
    - /etc/sqlmigrate
    - /var/lib/sqlmigrate
- name: create sql-migrate configuration file
  ansible.builtin.template:
    src: sqlmigrate/database.yml.j2
    dest: "/etc/sqlmigrate/{{ database_name }}.yml"

Playbook tasks

- name: clone schema repository
  ansible.builtin.git:
    repo: "{{ schema_url }}"
    dest: "/var/lib/sqlmigrate/{{ database_name }}"
    version: "{{ branch|default('master') }}" # branch or tag
    force: true
  environment:
    TMPDIR: /run
- name: run migrations
  ansible.builtin.command:
    cmd: sql-migrate up -config /etc/sqlmigrate/{{ database_name }}.yml

Database creation

Just run CREATE DATABASE.

Easy, right?

Well…


Database creation

  1. Check arguments
  2. Select an available cluster
  3. Create git repository
  4. Run CREATE DATABASE (using a module)
  5. Create secrets
  6. Create roles and users (for applications, humans)
  7. Link the database to the git repository
  8. Run schema migrations

Minor upgrades

Ensure software is up to date:


Minor upgrades


Minor upgrade (1/2)


Minor upgrade (2/2)


Database migration


Database migration

Move one or more databases from one cluster to another

  1. Set up logical replication
  2. Promote
    • Check
    • Migrate
    • Rollback
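
The promote step can be sketched as a small control flow. This is a hypothetical illustration, not the actual playbook logic: it assumes that the check gates the migration and that a failed migration triggers the rollback. The three callables stand in for the real operations.

```python
def promote(check, migrate, rollback):
    """Run check, then migrate; call rollback if migrate raises."""
    if not check():
        return "aborted"
    try:
        migrate()
    except Exception:
        # Assumption: any failure during migrate triggers the rollback path.
        rollback()
        return "rolled back"
    return "migrated"


calls = []


def failing_migrate():
    calls.append("migrate")
    raise RuntimeError("simulated failure")


result = promote(
    check=lambda: True,
    migrate=failing_migrate,
    rollback=lambda: calls.append("rollback"),
)
print(result, calls)
```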

Database migration


External collections


Internal collections


Implementation

How we use Ansible


Secure Shell (SSH)

How can we securely connect to remote hosts to perform actions?


The Bastion

Ansible + The Bastion

“Ansible Wrapper”

[ssh_connection]
pipelining = True
private_key_file = ~/.ssh/id_ed25519
ssh_executable = /usr/share/ansible/plugins/bastion/sshwrapper.py
sftp_executable = /usr/share/ansible/plugins/bastion/sftpbastion.sh
transfer_method = sftp
retries = 3

https://github.com/ovh/the-bastion-ansible-wrapper

SCP is deprecated; use SFTP instead.


Inventory

Where can we find our hosts to perform operations?


Consul


Consul

Consul service discovery


Consul


Static configuration


Dynamic configuration


Where is my database?

Consul service


Ansible + Consul


How to use the inventory?

With a limit option

ansible server_type_postgresql -m ping

ansible-playbook -l server_type_postgresql playbook.yml

Group combination

ansible-playbook -l 'test:&subrole_primary' playbook.yml
ansible-playbook -l 'server_type_postgresql:server_type_mysql' playbook.yml
ansible-playbook -l 'server_type_postgresql:!cluster_99' playbook.yml
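
The three patterns above are a union (`a:b`), an intersection (`a:&b`) and an exclusion (`a:!b`). The sketch below models them with Python sets. It is a simplified model, not Ansible's resolver: it ignores wildcards, regexes and host ranges, and the inventory contents are made up for the example.

```python
def resolve(pattern, groups):
    """Resolve a simplified Ansible host pattern against a dict of group -> hosts."""
    include, require, exclude = set(), [], set()
    for term in pattern.split(":"):
        if term.startswith("&"):
            require.append(set(groups.get(term[1:], ())))
        elif term.startswith("!"):
            exclude |= set(groups.get(term[1:], ()))
        else:
            include |= set(groups.get(term, ()))
    # Unions first, then intersections, then exclusions.
    for hosts in require:
        include &= hosts
    return sorted(include - exclude)


# Hypothetical inventory, for illustration only.
groups = {
    "server_type_postgresql": ["pg1", "pg2", "pg3"],
    "server_type_mysql": ["my1"],
    "subrole_primary": ["pg1", "my1"],
    "test": ["pg1", "pg2"],
    "cluster_99": ["pg3"],
}

print(resolve("test:&subrole_primary", groups))
print(resolve("server_type_postgresql:server_type_mysql", groups))
print(resolve("server_type_postgresql:!cluster_99", groups))
```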

Execution environments

Where does Ansible run?


Admin server


AWX


Concepts


AWX UI



AWX CLI

awx -f human job_templates launch --monitor --extra_vars \
    '{"database_name": "***", "branch": "master", "schema_url": "ssh://***.git"}' \
    database-primary-schema-update
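
When the launch is scripted, the JSON passed to `--extra_vars` is easier to build programmatically than to quote by hand. The sketch below assembles the same command line as above. The helper name `build_launch_command` and the variable values are hypothetical, since the real values are masked on the slide.

```python
import json
import shlex


def build_launch_command(template, extra_vars):
    """Build the argv for launching an AWX job template with extra variables."""
    return [
        "awx", "-f", "human", "job_templates", "launch", "--monitor",
        "--extra_vars", json.dumps(extra_vars),
        template,
    ]


# Placeholder values; the real ones are not shown in the talk.
cmd = build_launch_command(
    "database-primary-schema-update",
    {
        "database_name": "example_db",
        "branch": "master",
        "schema_url": "ssh://git.example.org/example_db.git",
    },
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd)` avoids shell quoting altogether; `shlex.join` is only used here to display a copy-pastable command.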


Configuration


Components on Kubernetes


Disclaimer

Some of the issues we encountered are probably related to our internal implementation (internal services, internal Kubernetes).


Quota on pods

Component  Type     cpu    memory  ephemeral-storage  Quantity
web        request  500m   1Gi                        1
           limit    2000m  2Gi
task       request  1000m  2Gi                        1
           limit    1500m  4Gi
ee         request  1000m  256Mi                      n
           limit    2000m  2Gi     1G

Job execution time

  1. Source Control Update
  2. Inventory Sync
  3. Pod scheduling time (quotas, simultaneous jobs)
  4. Containers starting time (init containers)
  5. Playbook execution time

Job execution time

PING

1 min 45 secs


Solutions

  • Enable SCM update cache
    • scm_update_on_launch (bool)
    • scm_update_cache_timeout (int)
  • Enable inventory cache
    • update_on_launch (bool)
    • update_cache_timeout (int)
  • Check quotas on Kubernetes namespace
  • Analyze playbook performance
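
The effect of the two cache settings can be modelled in a few lines. This is a simplified model of the decision, not AWX code: an update on launch is skipped while the previous update is younger than the cache timeout. The function name `needs_update` is an assumption for the example.

```python
from datetime import datetime, timedelta


def needs_update(last_update, now, update_on_launch, cache_timeout):
    """Return True when a project or inventory update should run before the job."""
    if not update_on_launch:
        return False
    if last_update is None:
        # Never updated: nothing cached yet.
        return True
    # Cached result is reused while it is younger than the timeout (in seconds).
    return now - last_update > timedelta(seconds=cache_timeout)


now = datetime(2024, 2, 6, 12, 0, 0)
recent = now - timedelta(seconds=30)
stale = now - timedelta(seconds=120)

print(needs_update(recent, now, True, 60))
print(needs_update(stale, now, True, 60))
```

With a timeout of 60 seconds, a job launched 30 seconds after the last update reuses the cache, while one launched 120 seconds later triggers a new update.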

Fixed


Custom Vault


Custom Vault and database migrations


Custom Vault


Custom Vault with application key


Custom Vault with basic auth


Solution

  • Init Container to pull all secrets locally once
  • Lookup vault_secret to read locally (application key)
  • Lookup vault_secret_with_user to bypass the cache (basic auth)

Fixed

Two plugins to avoid breaking changes on the first one.
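
The split between the two lookups can be sketched as follows. This is a hypothetical illustration and not the real plugins: it assumes the init container writes all secrets to a single local JSON file, and it represents the direct vault call with a caller-supplied function. Only the names `vault_secret` and `vault_secret_with_user` come from the slide.

```python
import json
import tempfile
from pathlib import Path


def vault_secret(cache_file, name):
    """Read a secret from the local cache written once at pod start (assumed JSON)."""
    secrets = json.loads(Path(cache_file).read_text())
    return secrets[name]


def vault_secret_with_user(fetch_remote, name):
    """Bypass the local cache and query the vault through the given callable."""
    return fetch_remote(name)


# Demo with a temporary cache file and a fake remote call.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "secrets.json"
    cache.write_text(json.dumps({"db/password": "s3cr3t"}))
    cached = vault_secret(cache, "db/password")

remote = vault_secret_with_user(lambda name: f"fresh:{name}", "db/password")
print(cached, remote)
```

Keeping the two entry points separate means existing playbooks that call the first lookup keep their behavior.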


Network unreachable on Kubernetes

configstore:
provider '***':
Post "https://***/auth/app":
dial tcp:
lookup *** on ***:53:
read udp ***->***:53:
i/o timeout

Breaking the job

Job failed with no output


Cascading break

List of failed jobs


Solution

Replace iptables with nftables on Kubernetes workers

Fixed


Consul Federation


But

Job failed with Consul Federator

Chat channel with jobs in error due to issues with federation


Solution

Fixed


Database connection issue

  • AWX needs a database to run
  • AWX database is hosted on the databases infrastructure
  • AWX can restart load balancers (HAProxy) in front of its own database
  • Database connection is cut


Save HAProxy state


Save HAProxy state

ExecReload=/path/to/haproxy-create-state-files.sh

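#!/bin/bash
# Dump the server state of every HAProxy backend to one file per backend.
sock=/run/haproxy/admin.sock
base=/var/lib/haproxy/state
# "show backend" prints a header line starting with '#'; filter it out.
backends=$(socat "${sock}" - <<< "show backend" | grep -Fv '#')
for backend in ${backends}
do
    statefile="${base}/${backend}"
    socat "${sock}" - <<< "show servers state ${backend}" > "${statefile}"
done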

Handle database connection failure

/etc/tower/conf.d/credentials.py

DATABASES = {
    'default': {
        "ENGINE": "awx.main.db.profiled_pg",
        ...
        "OPTIONS": {
            ...
            "keepalives": 1,
            "keepalives_idle": 5,
            "keepalives_interval": 5,
            "keepalives_count": 5
        },
    }
}
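
As a rough guide to what these values mean: with TCP keepalives, the first probe is sent after `keepalives_idle` seconds of inactivity, and the connection is declared dead after `keepalives_count` unanswered probes sent `keepalives_interval` seconds apart. The sketch below computes the approximate detection delay for the settings above. Exact timing depends on the operating system's TCP stack, and the helper name is an assumption.

```python
def dead_connection_detection_seconds(idle, interval, count):
    """Approximate time before an idle, broken TCP connection is reported dead."""
    return idle + interval * count


options = {
    "keepalives": 1,
    "keepalives_idle": 5,
    "keepalives_interval": 5,
    "keepalives_count": 5,
}

delay = dead_connection_detection_seconds(
    options["keepalives_idle"],
    options["keepalives_interval"],
    options["keepalives_count"],
)
print(delay)  # 30
```

With these values, a connection cut by a load balancer restart is noticed after about 30 seconds rather than lingering until the system's default keepalive timers expire.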

Fixed


Security

Weekly CVE report on Docker images

With JFrog Xray*

*Proprietary software


Security

Also available on Quay.io for base images



Solutions

Use community-ee-minimal image for execution environment

From 32 to 4 violations

0 critical, 1 high, 2 medium, 1 low

Fixed


Development

How do we work on playbooks?


Environment


ansible@admin.lab ~ $ tree -L 1
├── ansible-jriou
ansible@admin.lab ~ $ cd ansible-jriou/
ansible@admin.lab ~/ansible-jriou $ git branch
* master
ansible@admin.lab ~/ansible-jriou $ vi ping.yml
ansible@admin.lab ~/ansible-jriou $ ansible-playbook ping.yml
ansible@admin.lab ~/ansible-jriou $ git diff > feature.patch

Tests

Trust, but verify.

– Wilfried Roset


Ansible Molecule


Syntax check

molecule/ping
├── molecule.yml    (define the scenario)
└── converge.yml    (run the playbook)

Define the scenario with molecule.yml

driver:
  name: docker

platforms:
  - name: debian11
    image: "docker-registry/debian:bullseye"

scenario:
  test_sequence:
    - lint
    - syntax

lint: |
  set -e
  yamllint ping.yml
  ansible-lint ping.yml

Run the playbook with converge.yml

- name: Include playbook
  ansible.builtin.import_playbook: ../../ping.yml

Result

--> Found config file /path/to/run/.config/molecule/config.yml
--> Test matrix
    
└── ping
    ├── dependency
    └── syntax
    
--> Scenario: 'ping'
--> Action: 'dependency'
--> Scenario: 'ping'
--> Action: 'syntax'
--> Sanity checks: 'docker'
    
    playbook: /path/to/run/molecule/ping/converge.yml

CDS

CDS is an Enterprise-Grade Continuous Delivery & DevOps Automation Open Source Platform.

https://ovh.github.io/cds/


Workflow


CDS UI

Number of tests


What’s next?


Databases on Kubernetes?

Confidence


Thank you


Questions