Automating Internal Database Operations at OVHcloud with Ansible


CfgMgmtCamp 2024

Julien RIOU

February 6, 2024

Speaker

Summary

  • Who are we?
  • Managed infrastructure
  • Management tools
  • Ansible code base
  • Real world examples
  • Implementation
  • Development
  • What’s next?

Who are we?

  • Major cloud provider in Europe
  • Datacenters worldwide
  • Baremetal servers, public & private cloud, managed services

Managed infrastructure

  • 3 DBMS (MySQL, MongoDB, PostgreSQL)
  • 7 autonomous infrastructures worldwide
  • 500+ servers
  • 2000+ databases
  • 100+ clusters
  • Highly secure environments

Cluster example

Shared (multi-tenant) environment

Management tools

Infrastructure as Code

Terraform logo

  • Manage infrastructure lifecycle
    • Create, replace, destroy
    • Scale up, down
  • Providers: OVH, vSphere, phpipam, AWS
  • Use standard providers first

Configuration management

Puppet logo

  • Manage operating system security hardening
  • Install and configure packages (including DBMS)
  • Agent is run manually on internal databases

One-shot operations

Ansible logo

  • Requests from users
  • Maintenance operations
  • Orchestration of multiple tasks
  • Acting on external resources

Operation examples

  • Bootstrap clusters
  • Create/move/delete databases, users, permissions
  • Test/apply schema migrations
  • Minor/major upgrades
  • Reboot and decrypt servers, clusters
  • Daily restores

Automation

  • Reduce human errors
  • Free human time and energy
  • Focus on what’s important

Deep dive into Ansible

Code base

Architecture of a playbook

  • Playbook
    • Play
      • include task
      • include task
    • Play
      • include task

Reusable tasks

  • No role, only tasks
  • Located under tasks directory
  • One task = one module
  • Tasks can be included by one or more playbooks
  • Naming convention is scope-action.yml
  • Idempotence
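As an illustration of the convention, a hypothetical play pulling in two reusable task files (the file names are invented to match the scope-action.yml pattern, not taken from the actual code base):

```yaml
# Sketch of a play reusing task files from the tasks directory.
# database-create.yml and user-create.yml are hypothetical names
# following the scope-action.yml convention.
- name: provision a database and its user
  hosts: all
  tasks:
    - name: create the database
      ansible.builtin.include_tasks: tasks/database-create.yml
    - name: create the application user
      ansible.builtin.include_tasks: tasks/user-create.yml
```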

Real-world examples

  • Schema migrations
  • Database creation
  • Minor upgrades
  • Database migrations

Schema migrations

  • Applications evolve all the time
  • Database schemas too
  • Reviewed and applied by DBAs

Schema migrations

sql-migrate

-- +migrate Up
create table author (
    id   bigserial primary key,
    name text not null
);

create table talk (
    id        bigserial primary key,
    title     text not null,
    author_id bigint not null references author(id)
);

-- +migrate Down
drop table author, talk;

Schema migrations

  • Move forward with sql-migrate up
  • Rollback with sql-migrate down

Playbook overview

- name: check arguments
  hosts: all
  run_once: true
  delegate_to: localhost
  tasks:
    - name: check variable schema_url    # fail fast
    - name: check variable database_name # fail fast
- name: update database to the latest schema migration
  hosts: "{{ database_name }}:&subrole_primary"
  tasks:
    - name: create sql-migrate directories
    - name: create sql-migrate configuration file
    - name: clone schema
    - name: run migrations

Playbook tasks

- name: create sql-migrate directories
  ansible.builtin.file:
    path: "{{ item }}"
    state: directory
  loop:
    - /etc/sqlmigrate
    - /var/lib/sqlmigrate
- name: create sql-migrate configuration file
  ansible.builtin.template:
    src: sqlmigrate/database.yml.j2
    dest: "/etc/sqlmigrate/{{ database_name }}.yml"

Playbook tasks

- name: clone schema repository
  ansible.builtin.git:
    repo: "{{ schema_url }}"
    dest: "/var/lib/sqlmigrate/{{ database_name }}"
    version: "{{ branch|default('master') }}" # branch or tag
    force: true
  environment:
    TMPDIR: /run
- name: run migrations
  ansible.builtin.command:
    cmd: sql-migrate up -config /etc/sqlmigrate/{{ database_name }}.yml
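ansible.builtin.command always reports "changed"; assuming sql-migrate prints an "Applied N migrations" summary, idempotent change reporting could be restored like this (a sketch, not the production task):

```yaml
- name: run migrations
  ansible.builtin.command:
    cmd: "sql-migrate up -config /etc/sqlmigrate/{{ database_name }}.yml"
  register: migrate_result
  # Assumes sql-migrate reports "Applied 0 migrations" when nothing changed
  changed_when: "'Applied 0 migrations' not in migrate_result.stdout"
```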

Database creation

Just run CREATE DATABASE.

Easy, right?

Well…

Database creation

  1. Check arguments
  2. Select an available cluster
  3. Create git repository
  4. Run CREATE DATABASE (using a module)
  5. Create secrets
  6. Create roles and users (for applications, humans)
  7. Link the database to the git repository
  8. Run schema migrations

Minor upgrades

Ensure software is up to date:

  • Security
  • Bugs

Minor upgrades

  • Upgrade packages (DBMS, system)
  • Reboot (if needed)
  • Restart DBMS (if needed)
  • Order by role criticality

Minor upgrade (1/2)

Minor upgrade (2/2)

Database migration

  • Cluster is about to reach maximum capacity
  • Colocate or spread logical divisions
  • Isolate noisy neighbours
  • Major upgrades

Database migration

Move one or more databases from one cluster to another

  1. Set up logical replication
  2. Promote
    • Check
    • Migrate
    • Rollback
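For PostgreSQL, step 1 boils down to native logical replication; a minimal sketch with invented object names (publication/subscription names, connection string):

```sql
-- On the source cluster: publish all tables of the database to move
CREATE PUBLICATION migration_pub FOR ALL TABLES;

-- On the target cluster: initial copy, then streaming of changes
CREATE SUBSCRIPTION migration_sub
    CONNECTION 'host=source.example dbname=mydb user=replicator'
    PUBLICATION migration_pub;
```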

Database migration

  • Moved out of a datacenter last year with this method
  • 400+ databases
  • 16.78TiB
  • Under 30 minutes of downtime for the datacenter move
    • Big focus on playbook execution time
  • Thanks to Ansible

External collections

  • community.general
  • community.mysql
  • community.mongodb
  • community.postgresql

Internal collections

  • ovhcloud.internal
  • ovhcloud.mysqlsh
  • ovhcloud.patronictl
  • ovhcloud.sqlmigrate

Implementation

How we use Ansible

Secure Shell (SSH)

How can we securely connect to remote hosts to perform actions?

The Bastion

The Bastion

Ansible + The Bastion

“Ansible Wrapper”

[ssh_connection]
pipelining = True
private_key_file = ~/.ssh/id_ed25519
ssh_executable = /usr/share/ansible/plugins/bastion/sshwrapper.py
sftp_executable = /usr/share/ansible/plugins/bastion/sftpbastion.sh
transfer_method = sftp
retries = 3

https://github.com/ovh/the-bastion-ansible-wrapper

Inventory

Where can we find our hosts to perform operations?

Consul

Consul

Consul service discovery

Consul

  • Nodes
    • name, IP address, meta(data)
  • Services
    • databases
  • Access control list (ACL) with tokens
  • Encryption

Static configuration

  • Node meta
    • server_type
      • postgresql, mysql, filer, …
    • role
      • node, lb, backup, …
    • cluster identifier
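Node meta is plain Consul agent configuration; a hypothetical fragment using the keys above (values invented):

```json
{
  "node_meta": {
    "server_type": "postgresql",
    "role": "node",
    "cluster": "42"
  }
}
```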

Dynamic configuration

  • Node “subrole”
    • primary, replica
  • Database services

Where is my database?

Consul service

Ansible + Consul

How to use the inventory?

With a limit option

ansible server_type_postgresql -m ping

ansible-playbook -l server_type_postgresql playbook.yml

Group combination

  • & for intersection (AND)
  • : for multiple groups (OR)
  • ! for exclusion (NOT)
ansible-playbook -l 'test:&subrole_primary' playbook.yml
ansible-playbook -l 'server_type_postgresql:server_type_mysql' playbook.yml
ansible-playbook -l 'server_type_postgresql:!cluster_99' playbook.yml

Execution environments

Where does Ansible run?

Admin server

  • Virtual machine
  • Access via SSH
  • Shared environment
  • No API

AWX

  • Ansible orchestration
  • Running on Kubernetes
  • Personal accounts (via SSO/SAML)
  • REST API, web interface, CLI
  • Notifications (alerting, chat)
  • https://github.com/ansible/awx

Concepts

  • Organization, projects, teams, users, privileges
  • Inventory source
  • Source Control (Git) and Machine (SSH) credentials
  • Job templates
  • Scheduled jobs
  • Notification templates

AWX UI

AWX CLI

awx -f human job_templates launch --monitor --extra_vars \
    '{"database_name": "***", "branch": "master", "schema_url": "ssh://***.git"}' \
    database-primary-schema-update

Configuration

Components on Kubernetes

  • Web
  • Task
  • Execution environment (EE)

Disclaimer

Some of the issues we encountered are probably related to our internal implementation (internal services, internal Kubernetes).

Quota on pods

Component  Type     cpu    memory  ephemeral-storage  Quantity
web        request  500m   1Gi                        1
           limit    2000m  2Gi
task       request  1000m  2Gi                        1
           limit    1500m  4Gi
ee         request  1000m  256Mi                      n
           limit    2000m  2Gi     1G

Job execution time

Job execution time

PING

1 min 45 secs

Solutions

  • Enable SCM update cache
    • scm_update_on_launch (bool)
    • scm_update_cache_timeout (int)
  • Enable inventory cache
    • update_on_launch (bool)
    • update_cache_timeout (int)
  • Check quotas on Kubernetes namespace
  • Analyze playbook performance

Fixed

Custom Vault

  • Home-made solution inspired by HashiCorp Vault
  • Designed to be managed by humans, read by robots
  • Designed to be cached

Custom Vault and database migrations

  • 70 databases at once
  • Endpoints are included in secrets, so secrets need to be updated
  • 4 secrets per database
  • Every single lookup call took 4 seconds
  • 16 seconds per database
  • 18 minutes to update all the secrets

Custom Vault

  • No API route to search by name, only by id
  • List identifiers with GET /secrets
  • Different behaviors based on authentication
    • Application key
    • Basic auth

Custom Vault with application key

Custom Vault with basic auth

Solution

  • Init Container to pull all secrets locally once
  • Lookup vault_secret to read locally (application key)
  • Lookup vault_secret_with_user to bypass the cache (basic auth)

Fixed

Network unreachable on Kubernetes

configstore:
provider '***':
Post "https://***/auth/app":
dial tcp:
lookup *** on ***:53:
read udp ***->***:53:
i/o timeout

Breaking the job

Job failed with no output

Cascading break

List of failed jobs

Solution

Replace iptables with nftables on Kubernetes workers

Fixed

Consul Federation

But

Job failed with Consul Federator

Solution

Fixed

Database connection issue

  • AWX needs a database to run
  • AWX database is hosted on the databases infrastructure
  • AWX can restart the load balancers (HAProxy) in front of its own database
  • Database connection is cut

Save HAProxy state

  • server-state-base /var/lib/haproxy/state
  • load-server-state-from-file local
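In haproxy.cfg these two directives sit in the global and defaults sections; with load-server-state-from-file set to local, each backend reads a state file named after itself:

```
global
    # directory holding one state file per backend
    server-state-base /var/lib/haproxy/state

defaults
    # on reload, restore server states from <server-state-base>/<backend>
    load-server-state-from-file local
```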

Save HAProxy state

ExecReload=/path/to/haproxy-create-state-files.sh

#!/bin/bash
# Dump the state of every backend so that "load-server-state-from-file local"
# can restore it after a reload.
sock=/run/haproxy/admin.sock
base=/var/lib/haproxy/state

# List backend names from the admin socket, skipping the "# name" header line
backends=$(socat ${sock} - <<< "show backend" | grep -vF '#')

for backend in ${backends}
do
    # One state file per backend, named after the backend
    statefile=${base}/${backend}
    socat ${sock} - <<< "show servers state ${backend}" > ${statefile}
done

Handle database connection failure

/etc/tower/conf.d/credentials.py

DATABASES = {
    'default': {
        "ENGINE": "awx.main.db.profiled_pg",
        ...
        "OPTIONS": {
            ...
            "keepalives": 1,
            "keepalives_idle": 5,
            "keepalives_interval": 5,
            "keepalives_count": 5
        },
    }
}
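With these values, a dead database connection is detected after at most keepalives_idle + keepalives_interval × keepalives_count seconds of silence; a quick check of that arithmetic (my calculation, not from the slides):

```python
# TCP keepalive settings from the AWX database configuration above
keepalives_idle = 5      # seconds of inactivity before the first probe
keepalives_interval = 5  # seconds between unanswered probes
keepalives_count = 5     # unanswered probes before the connection is dropped

# Worst-case detection time for a dead peer
worst_case = keepalives_idle + keepalives_interval * keepalives_count
print(worst_case)  # 30 seconds
```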

Fixed

Security

Weekly CVE report on Docker images

With JFrog Xray*

Security

Also available on Quay.io for base images

Security

Also available on Quay.io for base images

Solutions

Use community-ee-minimal image for execution environment

From 32 to 4 violations

0 critical, 1 high, 2 medium, 1 low

Fixed

Development

How do we work on playbooks?

Environment

  • Admin server on a LAB environment
  • Clone the code base into a dedicated directory
  • Edit files
  • Run playbooks
  • Create a patch
ansible@admin.lab ~ $ tree -L 1
├── ansible-jriou
ansible@admin.lab ~ $ cd ansible-jriou/
ansible@admin.lab ~/ansible-jriou $ git branch
* master
ansible@admin.lab ~/ansible-jriou $ vi ping.yml
ansible@admin.lab ~/ansible-jriou $ ansible-playbook ping.yml
ansible@admin.lab ~/ansible-jriou $ git diff > feature.patch

Tests

Trust, but verify.

– Wilfried Roset

Ansible Molecule

  • Run a scenario sequence
  • Designed to test playbooks and roles
  • Used to test playbooks and tasks
  • Mostly used to test syntax

Syntax check

molecule/ping
├── molecule.yml    (define the scenario)
└── converge.yml    (run the playbook)

Define the scenario with molecule.yml

driver:
  name: docker

platforms:
  - name: debian11
    image: "docker-registry/debian:bullseye"

scenario:
  test_sequence:
    - lint
    - syntax

lint: |
  set -e
  yamllint ping.yml
  ansible-lint ping.yml

Run the playbook with converge.yml

- name: Include playbook
  ansible.builtin.import_playbook: ../../ping.yml

Result

--> Found config file /path/to/run/.config/molecule/config.yml
--> Test matrix
    
└── ping
    ├── dependency
    └── syntax
    
--> Scenario: 'ping'
--> Action: 'dependency'
--> Scenario: 'ping'
--> Action: 'syntax'
--> Sanity checks: 'docker'
    
    playbook: /path/to/run/molecule/ping/converge.yml

CDS

CDS is an Enterprise-Grade Continuous Delivery & DevOps Automation Open Source Platform.

https://ovh.github.io/cds/

Workflow

CDS UI

Number of tests

What’s next?

  • Event-Driven Ansible
  • Better molecule tests
  • Scheduled unattended minor upgrades

Databases on Kubernetes?

Confidence

Thank you

Questions
