Automating Internal Database Operations at OVHcloud with Ansible
CfgMgmtCamp 2024
February 6, 2024
Speaker
- Julien RIOU
- Open Source DBA
- https://julien.riou.xyz
- @jriou@hachyderm.io
Summary
- Who are we?
- Managed infrastructure
- Management tools
- Ansible code base
- Real world examples
- Implementation
- Development
- What’s next?
Who are we?
- Major cloud provider in Europe
- Datacenters worldwide
- Baremetal servers, public & private cloud, managed services
All products rely on internal databases
Managed infrastructure
- 3 DBMS (MySQL, MongoDB, PostgreSQL)
- 7 autonomous infrastructures worldwide
- 500+ servers
- 2000+ databases
- 100+ clusters
- Highly secure environments
Cluster example
Shared (mutualized) environment
Management tools
- Infrastructure as code
- Configuration management
- One-shot operations
Infrastructure as Code
- Manage infrastructure lifecycle
- Create, replace, destroy
- Scale up and down
- Providers: OVH, vSphere, phpipam, AWS
- Use standard providers first
Using Terraform (Enterprise).
Providers:
- OVH: baremetal servers, private network, DNS records, load balancers
- vSphere: virtual machines on private cloud
- phpipam: private IP allocation
- AWS: S3 buckets
Configuration management
- Manage operating system security hardening
- Install and configure packages (including DBMS)
- Agent is run manually on internal database servers
Using Puppet.
Operating system security hardening:
- Kernel
- Network flows (iptables)
- Audit logs
One-shot operations
- Requests from users
- Maintenance operations
- Orchestration of multiple tasks
- Acting on external resources
Operation examples
- Bootstrap clusters
- Create/move/delete databases, users, permissions
- Test/apply schema migrations
- Minor/major upgrades
- Reboot and decrypt servers, clusters
- Daily restores
Automation
- Reduce human errors
- Free human time and energy
- Focus on what’s important
Deep dive into Ansible
Code base
Architecture of a playbook
- Playbook
  - Play
    - include task
    - include task
  - Play
    - include task
  - Play
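A minimal sketch of this structure (playbook, play, and task names are hypothetical):

- name: check arguments
  hosts: localhost
  tasks:
    - name: check variables
      ansible.builtin.include_tasks: tasks/common-check.yml

- name: act on the cluster
  hosts: "{{ database_name }}:&subrole_primary"
  tasks:
    - name: restart the DBMS
      ansible.builtin.include_tasks: tasks/postgresql-restart.yml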
Reusable tasks
- No roles, only tasks
- Located under the tasks directory
- One task = one module
- Tasks can be included by one or more playbooks
- Naming convention is scope-action.yml
- Idempotence
- TDD for Task-Driven Development (joke)
- Sometimes, more than one module is used in a task
- We try to keep tasks (very) small
- Sometimes, modules are directly used in playbooks
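For instance, a hypothetical tasks/postgresql-restart.yml following the scope-action.yml convention would contain a single module call (a sketch, not our actual task):

- name: restart postgresql
  ansible.builtin.service:
    name: postgresql
    state: restarted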
Real-world examples
- Schema migrations
- Database creation
- Minor upgrades
- Database migrations
Schema migrations
- Applications evolve all the time
- Database schemas do too
- Reviewed and applied by DBAs
Schema migrations
-- +migrate Up
create table author (
id bigserial primary key,
name text not null
);
create table talk (
id bigserial primary key,
title text not null,
author_id bigint not null references author(id)
);
-- +migrate Down
drop table author, talk;
Schema migrations
- Move forward with sql-migrate up
- Roll back with sql-migrate down
Playbook overview
- name: check arguments
hosts: all
run_once: true
delegate_to: localhost
tasks:
- name: check variable schema_url # fail fast
- name: check variable database_name # fail fast
- name: update database to the latest schema migration
hosts: "{{ database_name }}:&subrole_primary"
tasks:
- name: create sql-migrate directories
- name: create sql-migrate configuration file
- name: clone schema
- name: run migrations
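The argument checks themselves are not shown on the next slides; a minimal sketch, assuming assert-based validation:

- name: check variable schema_url # fail fast
  ansible.builtin.assert:
    that:
      - schema_url is defined
      - schema_url | length > 0
    fail_msg: schema_url is required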
Playbook tasks
- name: create sql-migrate directories
ansible.builtin.file:
path: "{{ item }}"
state: directory
loop:
- /etc/sqlmigrate
- /var/lib/sqlmigrate
- name: create sql-migrate configuration file
ansible.builtin.template:
src: sqlmigrate/database.yml.j2
dest: "/etc/sqlmigrate/{{ database_name }}.yml"
Playbook tasks
- name: clone schema repository
ansible.builtin.git:
repo: "{{ schema_url }}"
dest: "/var/lib/sqlmigrate/{{ database_name }}"
version: "{{ branch|default('master') }}" # branch or tag
force: true
environment:
TMPDIR: /run
- name: run migrations
ansible.builtin.command:
cmd: sql-migrate up -config /etc/sqlmigrate/{{ database_name }}.yml
Database creation
Just run CREATE DATABASE. Easy, right?
Well…
Database creation
- Check arguments
- Select an available cluster
- Create git repository
- Run CREATE DATABASE (using a module; see the sketch after this list)
- Create secrets
- Create roles and users (for applications, humans)
- Link the database to the git repository
- Run schema migrations
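The CREATE DATABASE step typically maps to a collection module; a hedged sketch using community.postgresql (the owner role name is hypothetical):

- name: create database
  community.postgresql.postgresql_db:
    name: "{{ database_name }}"
    owner: "{{ database_name }}_owner" # hypothetical role
    state: present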
Minor upgrades
Ensure software is up to date:
- Security
- Bugs
Minor upgrades
- Upgrade packages (DBMS, system)
- Reboot (if needed)
- Restart DBMS (if needed)
- Order by role criticality (a sketch follows)
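A minimal sketch of this flow for one Debian-based node, assuming apt-managed packages (the host pattern and module choices are illustrative, not our exact playbook):

- name: upgrade packages and reboot if needed
  hosts: "{{ cluster_name }}:&subrole_replica" # hypothetical limit, replicas first
  serial: 1 # one node at a time
  tasks:
    - name: upgrade packages
      ansible.builtin.apt:
        update_cache: true
        upgrade: safe
    - name: check if a reboot is required
      ansible.builtin.stat:
        path: /var/run/reboot-required
      register: reboot_required
    - name: reboot
      ansible.builtin.reboot:
      when: reboot_required.stat.exists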
Database migration
- Cluster is about to reach maximum capacity
- Colocate or spread logical divisions
- Isolate noisy neighbours
- Major upgrades
Database migration
Move one or more databases from one cluster to another
- Setup logical replication
- Promote
- Check
- Migrate
- Rollback
Database migration
- Moved out of a datacenter last year with this method
- 400+ databases
- 16.78TiB
- Under 30 minutes of downtime for the datacenter move
- Big focus on playbook execution time
- Thanks to Ansible
External collections
- community.general
- community.mysql
- community.mongodb
- community.postgresql
Internal collections
- ovhcloud.internal
- ovhcloud.mysqlsh
- ovhcloud.patronictl
- ovhcloud.sqlmigrate
Implementation
How we use Ansible
Secure Shell (SSH)
How can we securely connect to remote hosts to perform actions?
The Bastion
- Fine-grained access to infrastructure
- Sessions can be recorded
- Used in secure environments
- Already used by humans
- https://github.com/ovh/the-bastion
- The Bastion is an SSH gateway
- Secure environments include PCI DSS and SecNumCloud
Ansible + The Bastion
“Ansible Wrapper”
[ssh_connection]
pipelining = True
private_key_file = ~/.ssh/id_ed25519
ssh_executable = /usr/share/ansible/plugins/bastion/sshwrapper.py
sftp_executable = /usr/share/ansible/plugins/bastion/sftpbastion.sh
transfer_method = sftp
retries = 3
https://github.com/ovh/the-bastion-ansible-wrapper
SCP is deprecated, use SFTP instead.
Inventory
Where can we find our hosts to perform operations?
Consul
- Distributed and highly-available data store
- Local agent
- https://github.com/hashicorp/consul
Consul
- Nodes
- name, IP address, meta(data)
- Services
- databases
- Access control list (ACL) with tokens
- Encryption
Static configuration
- Node meta:
  - server_type: postgresql, mysql, filer, …
  - role: node, lb, backup, …
  - cluster: identifier
Dynamic configuration
- Node “subrole”: primary, replica
- Database services: where is my database?
Ansible + Consul
- Dynamic inventory
- https://github.com/wilfriedroset/consul-awx
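Assuming the dynamic inventory derives groups by combining node meta keys and values (consistent with the group names used below), a node registered with server_type=postgresql, subrole=primary and cluster=42 would roughly appear as:

# illustrative inventory view only: group and host names are hypothetical
all:
  children:
    server_type_postgresql:
      hosts:
        db-node-1:
    subrole_primary:
      hosts:
        db-node-1:
    cluster_42:
      hosts:
        db-node-1: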
How to use the inventory?
With a limit option
ansible server_type_postgresql -m ping
ansible-playbook -l server_type_postgresql playbook.yml
Group combination
- & for intersection (AND)
- : for multiple groups (OR)
- ! for exclusion (NOT)
ansible-playbook -l 'test:&subrole_primary' playbook.yml
ansible-playbook -l 'server_type_postgresql:server_type_mysql' playbook.yml
ansible-playbook -l 'server_type_postgresql:!cluster_99' playbook.yml
Execution environments
Where does Ansible run?
Admin server
- Virtual machine
- Access via SSH
- Shared environment
- No API
AWX
- Ansible orchestration
- Running on Kubernetes
- Personal accounts (via SSO/SAML)
- REST API, web interface, CLI
- Notifications (alerting, chat)
- https://github.com/ansible/awx
Concepts
- Organization, projects, teams, users, privileges
- Inventory source
- Source Control (Git) and Machine (SSH) credentials
- Job templates
- Scheduled jobs
- Notification templates
AWX UI
AWX CLI
awx -f human job_templates launch --monitor --extra_vars \
'{"database_name": "***", "branch": "master", "schema_url": "ssh://***.git"}' \
database-primary-schema-update
Configuration
- Ansible playbook on the admin server
- AWX collection
- Terraform AWX provider available
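A hedged sketch of declaring a job template with the AWX collection (organization, project and inventory names are placeholders):

- name: configure AWX
  hosts: localhost
  tasks:
    - name: declare the schema update job template
      awx.awx.job_template:
        name: database-primary-schema-update
        organization: Default # placeholder
        project: ansible # placeholder
        inventory: consul # placeholder
        playbook: database-primary-schema-update.yml
        ask_variables_on_launch: true # accept --extra_vars at launch
        state: present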
Components on Kubernetes
- Web
- Task
- Execution environment (EE)
Disclaimer
Some of the issues we encountered are probably related to our internal implementation (internal services, internal Kubernetes).
Quota on pods
Component | Type | CPU | Memory | Ephemeral storage | Quantity
---|---|---|---|---|---
web | request | 500m | 1Gi | | 1
web | limit | 2000m | 2Gi | |
task | request | 1000m | 2Gi | | 1
task | limit | 1500m | 4Gi | |
ee | request | 1000m | 256Mi | | n
ee | limit | 2000m | 2Gi | 1G |
- Raise resources on “ee” but not too much or the namespace quota will be reached quickly
- Raise resources on “task” to handle simultaneous jobs
- Raise resources on “web” for a smooth UI and HTTPS API experience
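With the awx-operator, these values can be tuned on the AWX custom resource; a sketch assuming the operator's resource requirement fields:

apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  web_resource_requirements:
    requests: {cpu: 500m, memory: 1Gi}
    limits: {cpu: 2000m, memory: 2Gi}
  task_resource_requirements:
    requests: {cpu: 1000m, memory: 2Gi}
    limits: {cpu: 1500m, memory: 4Gi}
  ee_resource_requirements:
    requests: {cpu: 1000m, memory: 256Mi}
    limits: {cpu: 2000m, memory: 2Gi, ephemeral-storage: 1G}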
Job execution time
- Source Control Update
- Inventory Sync
- Pod scheduling time (quotas, simultaneous jobs)
- Containers starting time (init containers)
- Playbook execution time
Job execution time
A simple PING playbook took 1 min 45 secs end to end.
Solutions
- Enable SCM update cache: scm_update_on_launch (bool), scm_update_cache_timeout (int)
- Enable inventory cache: update_on_launch (bool), update_cache_timeout (int)
- Check quotas on the Kubernetes namespace
- Analyze playbook performance
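The cache fields can be set with the AWX collection; a hedged sketch (object names and timeout values are placeholders):

- name: enable caches
  hosts: localhost
  tasks:
    - name: enable the SCM update cache on the project
      awx.awx.project:
        name: ansible # placeholder
        organization: Default # placeholder
        scm_type: git
        scm_update_on_launch: true
        scm_update_cache_timeout: 300 # seconds
    - name: enable the inventory cache
      awx.awx.inventory_source:
        name: consul # placeholder
        inventory: databases # placeholder
        update_on_launch: true
        update_cache_timeout: 300 # seconds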
Custom Vault
- Home-made solution inspired by HashiCorp Vault
- Designed to be managed by humans, read by robots
- Designed to be cached
Custom Vault and database migrations
- 70 databases at once
- Endpoints are included in secrets and need to be updated
- 4 secrets per database
- Every single lookup call took 4 seconds
- 16 seconds per database
- 18 minutes to update secrets
Custom Vault
- No API route to search by name, only by id
- List identifiers with GET /secrets
- Different behaviors based on authentication
- Application key
- Basic auth
Solution
- Init container to pull all secrets locally once
- Lookup vault_secret to read locally (application key)
- Lookup vault_secret_with_user to bypass the cache (basic auth)
Two plugins to avoid breaking changes on the first one.
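Usage of the two lookup plugins could look like this (the secret identifier variable is hypothetical):

- name: read a secret from the local cache (application key)
  ansible.builtin.set_fact:
    app_password: "{{ lookup('vault_secret', secret_id) }}"

- name: read a fresh secret, bypassing the cache (basic auth)
  ansible.builtin.set_fact:
    fresh_password: "{{ lookup('vault_secret_with_user', secret_id) }}"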
Network unreachable on Kubernetes
configstore:
provider '***':
Post "https://***/auth/app":
dial tcp:
lookup *** on ***:53:
read udp ***->***:53:
i/o timeout
- Name could not be resolved
- Secret could not be pulled
- Container is marked as Error
Breaking the job
- A dependent job (“projectupdate”) failed
Cascading break
- Source Control Update job has failed
- Inventory Sync failed because of Source Control Update
- Playbook Runs failed because Inventory Sync failed
Solution
Replace iptables with nftables on Kubernetes workers
Consul Federation
- Single endpoint
- Easy to configure
But…
A chat channel filled with jobs in error due to federation issues.
Solution
- Use local Consul servers
- Connect directly to one server
- Or deploy a Consul agent in Kubernetes
Database connection issue
- AWX needs a database to run
- AWX database is hosted on the databases infrastructure
- AWX can restart the load balancers (HAProxy) in front of its own database
- Database connection is cut
Save HAProxy state
server-state-base /var/lib/haproxy/state
load-server-state-from-file local
Save HAProxy state
ExecReload=/path/to/haproxy-create-state-files.sh
#!/bin/bash
# Dump the state of every HAProxy backend to a state file,
# read back at reload by load-server-state-from-file
sock=/run/haproxy/admin.sock
base=/var/lib/haproxy/state
# List backend names, skipping the comment header
backends=$(socat ${sock} - <<< "show backend" | fgrep -v '#')
for backend in ${backends}
do
statefile=${base}/${backend}
# Save the server states of this backend
socat ${sock} - <<< "show servers state ${backend}" > ${statefile}
done
Handle database connection failure
/etc/tower/conf.d/credentials.py
DATABASES = {
'default': {
"ENGINE": "awx.main.db.profiled_pg",
...
"OPTIONS": {
...
"keepalives": 1,
"keepalives_idle": 5,
"keepalives_interval": 5,
"keepalives_count": 5
},
}
}
Security
Weekly CVE report on Docker images
With JFrog Xray*
*Proprietary software
Security
Also available on Quay.io for base images
Solutions
Use the community-ee-minimal image for the execution environment
From 32 to 4 violations
0 critical, 1 high, 2 medium, 1 low
Development
How do we work on playbooks?
Environment
- Admin server in a LAB environment
- Clone the code base into a dedicated directory
- Edit files
- Run playbooks
- Create a patch
- Create job template on AWX for new playbooks
ansible@admin.lab ~ $ tree -L 1
├── ansible-jriou
ansible@admin.lab ~ $ cd ansible-jriou/
ansible@admin.lab ~/ansible-jriou $ git branch
* master
ansible@admin.lab ~/ansible-jriou $ vi ping.yml
ansible@admin.lab ~/ansible-jriou $ ansible-playbook ping.yml
ansible@admin.lab ~/ansible-jriou $ git diff > feature.patch
Tests
Trust, but verify.
– Wilfried Roset
- Wilfried has been my manager and partner for 6 years now. He is not the author of the proverb, but he uses it all the time.
- History of the proverb
Ansible Molecule
- Run a scenario sequence
- Designed to test playbooks and roles
- Used to test playbooks and tasks
- Mostly used to test syntax
Syntax check
molecule/ping
├── molecule.yml (define the scenario)
└── converge.yml (run the playbook)
Define the scenario with molecule.yml
driver:
name: docker
platforms:
- name: debian11
image: "docker-registry/debian:bullseye"
scenario:
test_sequence:
- lint
- syntax
lint: |
set -e
yamllint ping.yml
ansible-lint ping.yml
Run the playbook with converge.yml
- name: Include playbook
ansible.builtin.import_playbook: ../../ping.yml
Result
--> Found config file /path/to/run/.config/molecule/config.yml
--> Test matrix
└── ping
├── dependency
└── syntax
--> Scenario: 'ping'
--> Action: 'dependency'
--> Scenario: 'ping'
--> Action: 'syntax'
--> Sanity checks: 'docker'
playbook: /path/to/run/molecule/ping/converge.yml
CDS is an enterprise-grade continuous delivery & DevOps automation open-source platform (https://github.com/ovh/cds).
Workflow
What’s next?
- Event-Driven Ansible
- Better molecule tests
- Scheduled unattended minor upgrades
Databases on Kubernetes?
Confidence