Automating Internal Database Operations at OVHcloud with Ansible
CfgMgmtCamp 2024
February 6, 2024
Speaker
- Julien RIOU
- Open Source DBA
- https://julien.riou.xyz
- @jriou@hachyderm.io
Summary
- Who are we?
- Managed infrastructure
- Management tools
- Ansible code base
- Real world examples
- Implementation
- Development
- What’s next?
Who are we?
- Major cloud provider in Europe
- Datacenters worldwide
- Baremetal servers, public & private cloud, managed services
All products rely on internal databases
Managed infrastructure
- 3 DBMS (MySQL, MongoDB, PostgreSQL)
- 7 autonomous infrastructures worldwide
- 500+ servers
- 2000+ databases
- 100+ clusters
- Highly secure environments
Cluster example
Shared (mutualized) environment
Management tools
- Infrastructure as code
- Configuration management
- One-shot operations
Infrastructure as Code
- Manage infrastructure lifecycle
- Create, replace, destroy
- Scale up and down
- Providers: OVH, vSphere, phpipam, AWS
- Use standard providers first
Using Terraform (Enterprise).
Providers:
- OVH: baremetal servers, private network, DNS records, load balancers
- vSphere: virtual machines on private cloud
- phpipam: private IP allocation
- AWS: S3 buckets
Configuration management
- Manage operating system security hardening
- Install and configure packages (including DBMS)
- Agent is run manually on internal database servers
Using Puppet.
Operating system security hardening:
- Kernel
- Network flows (iptables)
- Audit logs
One-shot operations
- Requests from users
- Maintenance operations
- Orchestration of multiple tasks
- Acting on external resources
Operation examples
- Bootstrap clusters
- Create/move/delete databases, users, permissions
- Test/apply schema migrations
- Minor/major upgrades
- Reboot and decrypt servers, clusters
- Daily restores
Automation
- Reduce human errors
- Free human time and energy
- Focus on what’s important
Deep dive into Ansible
Code base
Architecture of a playbook
- Playbook
  - Play
    - include task
    - include task
  - Play
    - include task
  - Play
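A minimal sketch of this structure (playbook, play, and task names are hypothetical):

- name: check arguments
  hosts: localhost
  tasks:
    - name: check variables
      ansible.builtin.include_tasks: tasks/common-check.yml

- name: act on the cluster
  hosts: "{{ database_name }}:&subrole_primary"
  tasks:
    - name: restart the DBMS
      ansible.builtin.include_tasks: tasks/postgresql-restart.yml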
Reusable tasks
- No roles, only tasks
- Located under the tasks directory
- One task = one module
- Tasks can be included by one or more playbooks
- Naming convention is scope-action.yml
- Idempotence
- TDD for Task-Driven Development (joke)
- Sometimes, more than one module is used in a task
- We try to keep tasks (very) small
- Sometimes, modules are directly used in playbooks
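For instance, a hypothetical tasks/postgresql-restart.yml following the scope-action.yml convention would contain a single module call (a sketch, not our actual task):

- name: restart postgresql
  ansible.builtin.service:
    name: postgresql
    state: restarted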
Real-world examples
- Schema migrations
- Database creation
- Minor upgrades
- Database migrations
Schema migrations
- Applications evolve all the time
- Database schemas do too
- Reviewed and applied by DBAs
Schema migrations
-- +migrate Up
create table author (
id bigserial primary key,
name text not null
);
create table talk (
id bigserial primary key,
title text not null,
author_id bigint not null references author(id)
);
-- +migrate Down
drop table author, talk;
Schema migrations
- Move forward with sql-migrate up
- Roll back with sql-migrate down
Playbook overview
- name: check arguments
hosts: all
run_once: true
delegate_to: localhost
tasks:
- name: check variable schema_url # fail fast
- name: check variable database_name # fail fast
- name: update database to the latest schema migration
hosts: "{{ database_name }}:&subrole_primary"
tasks:
- name: create sql-migrate directories
- name: create sql-migrate configuration file
- name: clone schema
- name: run migrations
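The argument checks themselves are not shown on the next slides; a minimal sketch, assuming assert-based validation:

- name: check variable schema_url # fail fast
  ansible.builtin.assert:
    that:
      - schema_url is defined
      - schema_url | length > 0
    fail_msg: schema_url is required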
Playbook tasks
- name: create sql-migrate directories
ansible.builtin.file:
path: "{{ item }}"
state: directory
loop:
- /etc/sqlmigrate
- /var/lib/sqlmigrate
- name: create sql-migrate configuration file
ansible.builtin.template:
src: sqlmigrate/database.yml.j2
dest: "/etc/sqlmigrate/{{ database_name }}.yml"
Playbook tasks
- name: clone schema repository
ansible.builtin.git:
repo: "{{ schema_url }}"
dest: "/var/lib/sqlmigrate/{{ database_name }}"
version: "{{ branch|default('master') }}" # branch or tag
force: true
environment:
TMPDIR: /run
- name: run migrations
ansible.builtin.command:
cmd: sql-migrate up -config /etc/sqlmigrate/{{ database_name }}.yml
Database creation
Just run CREATE DATABASE. Easy, right?
Well…
Database creation
- Check arguments
- Select an available cluster
- Create git repository
- Run CREATE DATABASE (using a module; see the sketch after this list)
- Create secrets
- Create roles and users (for applications, humans)
- Link the database to the git repository
- Run schema migrations
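The CREATE DATABASE step typically maps to a collection module; a hedged sketch using community.postgresql (the owner role name is hypothetical):

- name: create database
  community.postgresql.postgresql_db:
    name: "{{ database_name }}"
    owner: "{{ database_name }}_owner" # hypothetical role
    state: present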
Minor upgrades
Ensure software is up to date:
- Security
- Bugs
Minor upgrades
- Upgrade packages (DBMS, system)
- Reboot (if needed)
- Restart DBMS (if needed)
- Order by role criticality (a sketch follows)
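A minimal sketch of this flow for one Debian-based node, assuming apt-managed packages (the host pattern and module choices are illustrative, not our exact playbook):

- name: upgrade packages and reboot if needed
  hosts: "{{ cluster_name }}:&subrole_replica" # hypothetical limit, replicas first
  serial: 1 # one node at a time
  tasks:
    - name: upgrade packages
      ansible.builtin.apt:
        update_cache: true
        upgrade: safe
    - name: check if a reboot is required
      ansible.builtin.stat:
        path: /var/run/reboot-required
      register: reboot_required
    - name: reboot
      ansible.builtin.reboot:
      when: reboot_required.stat.exists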
Database migration
- Cluster is about to reach maximum capacity
- Colocate or spread logical divisions
- Isolate noisy neighbours
- Major upgrades
Database migration
Move one or more databases from one cluster to another
- Setup logical replication
- Promote
- Check
- Migrate
- Rollback
Database migration
- Moved out of a datacenter last year with this method
- 400+ databases
- 16.78TiB
- Under 30 minutes of downtime for the datacenter move
- Big focus on playbook execution time
- Thanks to Ansible
External collections
- community.general
- community.mysql
- community.mongodb
- community.postgresql
Internal collections
- ovhcloud.internal
- ovhcloud.mysqlsh
- ovhcloud.patronictl
- ovhcloud.sqlmigrate
Implementation
How we use Ansible
Secure Shell (SSH)
How can we securely connect to remote hosts to perform actions?
The Bastion
- Fine-grained access to infrastructure
- Sessions can be recorded
- Used in secure environments
- Already used by humans
- https://github.com/ovh/the-bastion
- The Bastion is an SSH gateway
- Secure environments include PCI DSS and SecNumCloud
Ansible + The Bastion
“Ansible Wrapper”
[ssh_connection]
pipelining = True
private_key_file = ~/.ssh/id_ed25519
ssh_executable = /usr/share/ansible/plugins/bastion/sshwrapper.py
sftp_executable = /usr/share/ansible/plugins/bastion/sftpbastion.sh
transfer_method = sftp
retries = 3
https://github.com/ovh/the-bastion-ansible-wrapper
SCP is deprecated, use SFTP instead.
Inventory
Where can we find our hosts to perform operations?
Consul
- Distributed and highly-available data store
- Local agent
- https://github.com/hashicorp/consul
Consul
- Nodes
- name, IP address, meta(data)
- Services
- databases
- Access control list (ACL) with tokens
- Encryption
Static configuration
- Node meta:
  - server_type: postgresql, mysql, filer, …
  - role: node, lb, backup, …
  - cluster: identifier
Dynamic configuration
- Node “subrole”: primary, replica
- Database services: where is my database?
Ansible + Consul
- Dynamic inventory
- https://github.com/wilfriedroset/consul-awx
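Assuming the dynamic inventory derives groups by combining node meta keys and values (consistent with the group names used below), a node registered with server_type=postgresql, subrole=primary and cluster=42 would roughly appear as:

# illustrative inventory view only: group and host names are hypothetical
all:
  children:
    server_type_postgresql:
      hosts:
        db-node-1:
    subrole_primary:
      hosts:
        db-node-1:
    cluster_42:
      hosts:
        db-node-1: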
How to use the inventory?
With a limit option
ansible server_type_postgresql -m ping
ansible-playbook -l server_type_postgresql playbook.yml
Group combination
- & for intersection (AND)
- : for multiple groups (OR)
- ! for exclusion (NOT)
ansible-playbook -l 'test:&subrole_primary' playbook.yml
ansible-playbook -l 'server_type_postgresql:server_type_mysql' playbook.yml
ansible-playbook -l 'server_type_postgresql:!cluster_99' playbook.yml
Execution environments
Where does Ansible run?
Admin server
- Virtual machine
- Access via SSH
- Shared environment
- No API
AWX
- Ansible orchestration
- Running on Kubernetes
- Personal accounts (via SSO/SAML)
- REST API, web interface, CLI
- Notifications (alerting, chat)
- https://github.com/ansible/awx
Concepts
- Organization, projects, teams, users, privileges
- Inventory source
- Source Control (Git) and Machine (SSH) credentials
- Job templates
- Scheduled jobs
- Notification templates
AWX UI
AWX CLI
awx -f human job_templates launch --monitor --extra_vars \
'{"database_name": "***", "branch": "master", "schema_url": "ssh://***.git"}' \
database-primary-schema-update
Configuration
- Ansible playbook on the admin server
- AWX collection
- Terraform AWX provider available
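A hedged sketch of declaring a job template with the AWX collection (organization, project and inventory names are placeholders):

- name: configure AWX
  hosts: localhost
  tasks:
    - name: declare the schema update job template
      awx.awx.job_template:
        name: database-primary-schema-update
        organization: Default # placeholder
        project: ansible # placeholder
        inventory: consul # placeholder
        playbook: database-primary-schema-update.yml
        ask_variables_on_launch: true # accept --extra_vars at launch
        state: present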
Components on Kubernetes
- Web
- Task
- Execution environment (EE)
Disclaimer
Some of the issues we encountered are probably related to our internal implementation (internal services, internal Kubernetes).
Quota on pods
Component | Type | CPU | Memory | Ephemeral storage | Quantity
---|---|---|---|---|---
web | request | 500m | 1Gi | | 1
web | limit | 2000m | 2Gi | |
task | request | 1000m | 2Gi | | 1
task | limit | 1500m | 4Gi | |
ee | request | 1000m | 256Mi | | n
ee | limit | 2000m | 2Gi | 1G |
- Raise resources on “ee” but not too much or the namespace quota will be reached quickly
- Raise resources on “task” to handle simultaneous jobs
- Raise resources on “web” for a smooth UI and HTTPS API experience
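With the awx-operator, these values can be tuned on the AWX custom resource; a sketch assuming the operator's resource requirement fields:

apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  web_resource_requirements:
    requests: {cpu: 500m, memory: 1Gi}
    limits: {cpu: 2000m, memory: 2Gi}
  task_resource_requirements:
    requests: {cpu: 1000m, memory: 2Gi}
    limits: {cpu: 1500m, memory: 4Gi}
  ee_resource_requirements:
    requests: {cpu: 1000m, memory: 256Mi}
    limits: {cpu: 2000m, memory: 2Gi, ephemeral-storage: 1G}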
Job execution time
- Source Control Update
- Inventory Sync
- Pod scheduling time (quotas, simultaneous jobs)
- Containers starting time (init containers)
- Playbook execution time
Job execution time
A simple PING playbook took 1 min 45 secs end to end.
Solutions
- Enable SCM update cache: scm_update_on_launch (bool), scm_update_cache_timeout (int)
- Enable inventory cache: update_on_launch (bool), update_cache_timeout (int)
- Check quotas on the Kubernetes namespace
- Analyze playbook performance
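The cache fields can be set with the AWX collection; a hedged sketch (object names and timeout values are placeholders):

- name: enable caches
  hosts: localhost
  tasks:
    - name: enable the SCM update cache on the project
      awx.awx.project:
        name: ansible # placeholder
        organization: Default # placeholder
        scm_type: git
        scm_update_on_launch: true
        scm_update_cache_timeout: 300 # seconds
    - name: enable the inventory cache
      awx.awx.inventory_source:
        name: consul # placeholder
        inventory: databases # placeholder
        update_on_launch: true
        update_cache_timeout: 300 # seconds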
Custom Vault
- Home-made solution inspired by HashiCorp Vault
- Designed to be managed by humans, read by robots
- Designed to be cached
Custom Vault and database migrations
- 70 databases at once
- Endpoints are included in secrets and need to be updated
- 4 secrets per database
- Every single lookup call took 4 seconds
- 16 seconds per database
- 18 minutes to update secrets
Custom Vault
- No API route to search by name, only by id
- List identifiers with GET /secrets
- Different behaviors based on authentication
- Application key
- Basic auth
Solution
- Init container to pull all secrets locally once
- Lookup vault_secret to read locally (application key)
- Lookup vault_secret_with_user to bypass the cache (basic auth)
Two plugins to avoid breaking changes on the first one.
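Usage of the two lookup plugins could look like this (the secret identifier variable is hypothetical):

- name: read a secret from the local cache (application key)
  ansible.builtin.set_fact:
    app_password: "{{ lookup('vault_secret', secret_id) }}"

- name: read a fresh secret, bypassing the cache (basic auth)
  ansible.builtin.set_fact:
    fresh_password: "{{ lookup('vault_secret_with_user', secret_id) }}"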
Network unreachable on Kubernetes
configstore:
provider '***':
Post "https://***/auth/app":
dial tcp:
lookup *** on ***:53:
read udp ***->***:53:
i/o timeout
- Name could not be resolved
- Secret could not be pulled
- Container is marked as Error
Breaking the job
- A dependent job (“projectupdate”) failed
Cascading break
- Source Control Update job has failed
- Inventory Sync failed because of Source Control Update
- Playbook Runs failed because Inventory Sync failed
Solution
Replace iptables with nftables on Kubernetes workers
Consul Federation
- Single endpoint
- Easy to configure
But…
A chat channel filled with jobs in error due to federation issues.
Solution
- Use local Consul servers
- Connect directly to one server
- Or deploy a Consul agent in Kubernetes
Database connection issue
- AWX needs a database to run
- AWX database is hosted on the databases infrastructure
- AWX can restart the load balancers (HAProxy) in front of its own database
- Database connection is cut
Save HAProxy state
server-state-base /var/lib/haproxy/state
load-server-state-from-file local
Save HAProxy state
ExecReload=/path/to/haproxy-create-state-files.sh
#!/bin/bash
# Dump the state of every HAProxy backend to a state file,
# read back at reload by load-server-state-from-file
sock=/run/haproxy/admin.sock
base=/var/lib/haproxy/state
# List backend names, skipping the comment header
backends=$(socat ${sock} - <<< "show backend" | fgrep -v '#')
for backend in ${backends}
do
statefile=${base}/${backend}
# Save the server states of this backend
socat ${sock} - <<< "show servers state ${backend}" > ${statefile}
done
Handle database connection failure
/etc/tower/conf.d/credentials.py
DATABASES = {
'default': {
"ENGINE": "awx.main.db.profiled_pg",
...
"OPTIONS": {
...
"keepalives": 1,
"keepalives_idle": 5,
"keepalives_interval": 5,
"keepalives_count": 5
},
}
}
Security
Weekly CVE report on Docker images
With JFrog Xray*
*Proprietary software
Security
Also available on Quay.io for base images
Solutions
Use the community-ee-minimal image for the execution environment
From 32 to 4 violations
0 critical, 1 high, 2 medium, 1 low
Development
How do we work on playbooks?
Environment
- Admin server in a LAB environment
- Clone the code base into a dedicated directory
- Edit files
- Run playbooks
- Create a patch
- Create job template on AWX for new playbooks
ansible@admin.lab ~ $ tree -L 1
├── ansible-jriou
ansible@admin.lab ~ $ cd ansible-jriou/
ansible@admin.lab ~/ansible-jriou $ git branch
* master
ansible@admin.lab ~/ansible-jriou $ vi ping.yml
ansible@admin.lab ~/ansible-jriou $ ansible-playbook ping.yml
ansible@admin.lab ~/ansible-jriou $ git diff > feature.patch
Tests
Trust, but verify.
– Wilfried Roset
- Wilfried has been my manager and partner for 6 years now. He is not the author of the proverb, but he uses it all the time.
- History of the proverb
Ansible Molecule
- Run a scenario sequence
- Designed to test playbooks and roles
- Used to test playbooks and tasks
- Mostly used to test syntax
Syntax check
molecule/ping
├── molecule.yml (define the scenario)
└── converge.yml (run the playbook)
Define the scenario with molecule.yml
driver:
name: docker
platforms:
- name: debian11
image: "docker-registry/debian:bullseye"
scenario:
test_sequence:
- lint
- syntax
lint: |
set -e
yamllint ping.yml
ansible-lint ping.yml
Run the playbook with converge.yml
- name: Include playbook
ansible.builtin.import_playbook: ../../ping.yml
Result
--> Found config file /path/to/run/.config/molecule/config.yml
--> Test matrix
└── ping
├── dependency
└── syntax
--> Scenario: 'ping'
--> Action: 'dependency'
--> Scenario: 'ping'
--> Action: 'syntax'
--> Sanity checks: 'docker'
playbook: /path/to/run/molecule/ping/converge.yml
CDS is an enterprise-grade continuous delivery & DevOps automation open-source platform (https://github.com/ovh/cds).
Workflow
What’s next?
- Event-Driven Ansible
- Better molecule tests
- Scheduled unattended minor upgrades
Databases on Kubernetes?
Confidence