Database schema management for lazybones: from chaos to heaven
SCALE 21x
March 15, 2024
Speaker
- Julien RIOU
- Open Source DBA
- https://julien.riou.xyz
- @jriou@hachyderm.io
Summary
- Basics of schema migrations
- Who are we?
- Chaos
- Schema as code, CI/CD, code reviews
- Inventory, automation
- Heaven
- Known issues
- Takeaways
Basics of schema migrations
Schema
Schema changes over time
- New tables
- New columns
- New constraints
- New everything
Schema migration
Data definition language (DDL)
ALTER TABLE customer
ADD COLUMN archived_at TIMESTAMP WITH TIME ZONE;
Who are we?
Internal databases
Statistics
- 7 autonomous infrastructures worldwide
- 100+ clusters
- 500+ servers
- 2000+ databases
- MySQL and PostgreSQL
- Highly secure environments
Cluster example
Chaos
Who can ALTER my database?
Startup mindset
- All employees work in the same open space
- A developer comes to the DBA's desk
- "Can you do X on my database, please?"
- The DBA SSHes to the production database host
psql database
- Translates X into queries and runs them
- Job done
The DBA forgot which host the database is on
The DBA misunderstood what the developer really wanted
The DBA executed the query in Europe but forgot Canada
The DBA forgot to create a transaction and one query failed in the middle
The DBA started a transaction but forgot to commit
The DBA created a query that runs for too long and locks production
The DBA forgot to use SET ROLE and permissions are now broken
Blame the DBA!
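Most of these mistakes boil down to missing transaction and role discipline. A minimal sketch of what a careful manual session looks like (the app_owner role is hypothetical):

BEGIN;
SET ROLE app_owner;  -- hypothetical owner role: objects must not end up owned by the superuser
ALTER TABLE customer
    ADD COLUMN archived_at TIMESTAMP WITH TIME ZONE;
COMMIT;  -- easy to forget: until then, locks are held and nothing is visible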
Ticketing
Let’s write migrations in a ticketing system
OTRS
Jira
But
- Multiple ticketing systems for different teams
- SQL statements are not well formatted
- Bad copy-pastes
- Poor reviewing system
Schema as code
Git
- One repository for one or more databases
- 534 repositories
- Well known by developers
sql-migrate
SQL schema migration tool for Go
https://github.com/rubenv/sql-migrate
Developers create code:
v0001.sql
-- +migrate Up
create table x (...);
-- +migrate Down
drop table x;
- One file = one migration = one transaction
- Transactions can be disabled with the notransaction hint (e.g. for CREATE INDEX CONCURRENTLY; see the sketch below)
- Focus on schema (DDL), not data (DML)
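For example, PostgreSQL's CREATE INDEX CONCURRENTLY cannot run inside a transaction block, hence the notransaction hint. A minimal sketch (index and table names illustrative):

-- +migrate Up notransaction
CREATE INDEX CONCURRENTLY customer_archived_at_idx
    ON customer (archived_at);

-- +migrate Down notransaction
DROP INDEX CONCURRENTLY customer_archived_at_idx;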
DBAs deploy with a command:
sql-migrate-wrapper
- Handles SET ROLE
- Two migration paths (illustrated below)
  - admin: run as superuser to create extensions
  - normal: run with DDL privileges to create objects
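Illustrative migrations for each path (the extension and table are examples, not from the talk):

-- admin path: requires superuser privileges
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- normal path: plain DDL, run under the owner role via SET ROLE
CREATE TABLE invoice (
    id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id BIGINT NOT NULL
);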
# sql-migrate-wrapper status -config /etc/sqlmigrate/test.json
INFO[2024-02-13T10:28:47Z] Processing 'test' database
INFO[2024-02-13T10:28:47Z] Processing normal migration path
+-----------+--------------------------------------+
| MIGRATION | APPLIED |
+-----------+--------------------------------------+
| v0001.sql | 2022-03-03 15:33:39.709067 +0100 CET |
| v0002.sql | 2022-11-30 12:43:34.947511 +0100 CET |
| v0003.sql | 2022-11-30 12:53:27.437182 +0100 CET |
| v0004.sql | 2023-01-24 17:20:22.330472 +0100 CET |
| v0005.sql | 2023-05-17 12:32:39.87181 +0200 CEST |
+-----------+--------------------------------------+
Code reviews
- Git based collaboration tool
- Adopted by the company
- Not open source
Features
- Comments
- Blocking tasks
- Review status (approved, needs work)
- Minimum number of required approvals
- From DBAs
- From database owners
CI/CD
Terminology
- Continuous Integration
- Test locally
- Ensure software is deliverable
- Merge changes to the main branch
- Continuous Delivery
- Build a release
- Continuous Deployment
- Build a release and deploy it
- No human intervention
Tests without data
CDS is an Enterprise-Grade Continuous Delivery & DevOps Automation Open Source Platform.
https://github.com/ovh/cds
Implementation
- Project: all schema tests at the same place
- Application: one git repository (schema)
- Pipeline: run sql-migrate on PostgreSQL 14
- Workflow: link application to pipelines
Web hooks
CDS workflow
CDS pipeline
Tests with data
Backups are based on ZFS snapshots performed and sent every day to an anonymized dev environment:
- No business impact on downtime
- Test the restore process
- Schema migrations can be performed on data very similar to production
Deployment
Workflow
Concurrent sessions
lock_timeout (2000 ms)
"Abort any statement that waits longer than the specified amount of time while attempting to acquire a lock on a table, index, row, or other database object." (PostgreSQL documentation)
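A minimal sketch of the effect, assuming the 2000 ms setting above:

SET lock_timeout = '2s';
-- if another session holds a conflicting lock on customer for more than
-- 2 seconds, this statement aborts instead of queuing and blocking production
ALTER TABLE customer
    ADD COLUMN archived_at TIMESTAMP WITH TIME ZONE;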
Inventory
Where should I connect to run the migrations?
Custom inventory
In a database of databases:
- repositories to databases
- databases to hosts
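A minimal sketch of what the underlying tables could look like, matching the API response below (table and column names are hypothetical):

CREATE TABLE repository (id UUID PRIMARY KEY, url TEXT NOT NULL);
CREATE TABLE cluster (id UUID PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE database (
    id UUID PRIMARY KEY,
    name TEXT NOT NULL,
    schemarepository UUID REFERENCES repository (id), -- repositories to databases
    cluster UUID REFERENCES cluster (id)              -- databases to hosts
);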
REST API
$ curl https://<api>/database?name=test
{
  "code": 200,
  "data": [
    {
      "schemarepository": "<uuid>",
      "name": "test",
      "id": "<uuid>",
      "cluster": "<uuid>"
    }
  ]
}
But
- Every change has to be declared
- Add a cluster, host, database, git repository, …
- Remove a cluster, host, database, git repository, …
- Easy to miss an event
- 100% home-made
Service discovery
- Distributed and highly-available data store
- Local agent
- https://github.com/hashicorp/consul
Consul
- Nodes
- name, IP address, meta(data)
- Services
- databases
- Access control list (ACL) with tokens
- Encryption
Static configuration
- Node meta
  - server_type: postgresql, mysql, filer, …
  - role: node, lb, backup, …
  - cluster: identifier
Dynamic configuration
- Node "subrole": primary, replica
- Database services
Where is my database?
What databases are behind the repository?
A special file in the repository, meta.json:
{
  "databases": {
    "US": [
      "test"
    ]
  }
}
Automation
How to apply migrations safely?
Ansible
Ansible is a radically simple IT automation system.
https://github.com/ansible/ansible
Concepts
- Inventory: combination of
- Hosts: remote machine to manage
- Groups: hosts sharing a common attribute
- Playbook: list of tasks executed in order, on groups
- Tasks: module + arguments
- Modules: smallest unit of code to execute on hosts
- Simple YAML that you should not be afraid of
Playbook overview
- name: Check arguments
  hosts: all
  run_once: true
  delegate_to: localhost
  tasks:
    - name: Check variable schema_url # fail fast
    - name: Check variable database_name # fail fast

- name: Update database to the latest schema migration
  hosts: "{{ database_name }}:&subrole_primary"
  tasks:
    - name: Create sql-migrate directories
    - name: Create sql-migrate configuration file
    - name: Clone schema
    - name: Run migrations
Playbook tasks
- name: Create sql-migrate directories
  ansible.builtin.file:
    path: "{{ item }}"
    state: directory
  loop:
    - /etc/sqlmigrate
    - /var/lib/sqlmigrate

- name: Create sql-migrate configuration file
  ansible.builtin.template:
    src: sqlmigrate/database.json.j2
    dest: "/etc/sqlmigrate/{{ database_name }}.json"

- name: Clone schema repository
  ansible.builtin.git:
    repo: "{{ schema_url }}"
    dest: "/var/lib/sqlmigrate/{{ database_name }}"
    version: "{{ branch|default('master') }}" # branch or tag
    force: true
  environment:
    TMPDIR: /run

- name: Run migrations
  ansible.builtin.command:
    cmd: sql-migrate-wrapper up -config /etc/sqlmigrate/{{ database_name }}.json
Ansible playbook command
$ ansible-playbook \
--extra-vars \
'{"database_name": "test",
"branch": "master",
"schema_url": "ssh://git@hostname:port/project/repository.git"
}' \
database-primary-schema-update.yml
Ansible + Consul
- Dynamic inventory
- https://github.com/wilfriedroset/consul-awx
The Bastion (SSH)
- The Bastion is an SSH gateway
- Fine-grained access to infrastructure
- Sessions can be recorded
- Used in secure environments (PCI DSS, SecNumCloud)
- Already used by humans
- https://github.com/ovh/the-bastion
Ansible + The Bastion
Ansible Wrapper
[ssh_connection]
pipelining = True
private_key_file = ~/.ssh/id_ed25519
ssh_executable = /usr/share/ansible/plugins/bastion/sshwrapper.py
sftp_executable = /usr/share/ansible/plugins/bastion/sftpbastion.sh
transfer_method = sftp
retries = 3
https://github.com/ovh/the-bastion-ansible-wrapper
SCP is deprecated, use SFTP instead.
AWX
AWX provides a web-based user interface, REST API, and task engine built on top of Ansible.
https://github.com/ansible/awx
AWX features
- REST API
- Web interface
- CLI
- Personal accounts (via SSO/SAML)
- Notifications (alerting, chat)
- Scheduled jobs
- Inventory management
- Source Control management
- Machine credentials
AWX Web interface
AWX CLI
awx -f human job_templates launch --monitor --extra_vars \
'{"database_name": "test",
"branch": "master",
"schema_url": "ssh://git@hostname:port/project/repository.git"
}' \
database-primary-schema-update
Event detection
How to deploy without human intervention?
Producer
Consumer
The consumer is 100% home-made today. We plan to replace this block with Event-Driven Ansible.
Current architecture
Migration failures
- AWX notifications
- Chat (Webex)
- On-call alerts (OpsGenie)
- NRPE probes (Icinga)
AWX notification on OpsGenie
NRPE probe notification on OpsGenie
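As an illustration of what such a probe can check, sql-migrate records applied migrations in a tracking table (gorp_migrations by default), which a probe could compare against the files in the repository:

SELECT id, applied_at
FROM gorp_migrations
ORDER BY applied_at DESC
LIMIT 1;  -- latest applied migration, e.g. v0005.sql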
In the end
Known issues
Highly concurrent repositories
- Multiple pull requests at the same time
- Multiple files
- From various developers
- In the same database repository
Same file names: two pull requests can both add v0006.sql
Predicted file names: each author has to guess the next version number before merging
Solution
Use a declarative model: describe the desired schema instead of writing incremental migrations, and let a tool compute the diff.
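A minimal sketch of the declarative idea: the repository stores the desired state and a diff tool derives the migration (the tool itself is not shown here):

-- desired state, stored in the repository
CREATE TABLE customer (
    id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name TEXT NOT NULL,
    archived_at TIMESTAMP WITH TIME ZONE
);

-- the migration a diff tool would generate against a database
-- where archived_at does not exist yet
ALTER TABLE customer
    ADD COLUMN archived_at TIMESTAMP WITH TIME ZONE;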
If anybody has feedback on these or other tools, please feel free to contact me.
Granular deployment
- Changes are deployed everywhere at the same time
- Problems may arise directly in production
Solution
Open questions remain: when to merge? Before or after deployment to non-critical environments? Which git branches? What orchestration around those branches? How to roll back?
Test on real data
- “DEV” environment is not available everywhere
- Anonymization is hard
- This environment will be removed
Solution
Test schema migrations on the backup host directly:
- clone the latest ZFS dataset
- create a PostgreSQL instance on this clone
- run the migrations
- clean up
schema-consumer
- 100% home-made
- Written in Python
- We have to add missing features ourselves
- We have to patch vulnerabilities ourselves
- Not open source
Solution
Event-Driven Ansible
- Free software
- Same features (at least)
- Events from Kafka
- Create jobs on AWX
- Easy to use
- Rulebooks in YAML, similar to playbooks
- Easy to contribute
Links to sources:
Takeaways
- SQL is your friend
- Automation is easy
- Start small then iterate
- Learn from your errors