Database schema management for lazybones: from chaos to heaven
SCALE 21x
March 15, 2024
Speaker
- Julien RIOU
- Open Source DBA
- https://julien.riou.xyz
- @jriou@hachyderm.io
Summary
- Basics of schema migrations
- Who are we?
- Chaos
- Schema as code, CI/CD, code reviews
- Inventory, automation
- Heaven
- Known issues
- Takeaways
Basics of schema migrations
Schema
Schema changes over time
- New tables
- New columns
- New constraints
- New everything
Schema migration
Data definition language (DDL)
ALTER TABLE customer
ADD COLUMN archived_at TIMESTAMP WITH TIME ZONE;
Who are we?
Internal databases
Statistics
- 7 autonomous infrastructures worldwide
- 100+ clusters
- 500+ servers
- 2000+ databases
- MySQL and PostgreSQL
- Highly secure environments
Cluster example
Chaos
Who can ALTER my database?
Startup mindset
- All employees work in the same open space
- A developer comes to the DBA's desk
- "Can you do X on my database, please?"
- The DBA SSHes to the production database host
psql database
- Translates X into queries and runs them
- Job done
The DBA forgot which host the database is on
The DBA misunderstood what the developer really wanted
The DBA executed the query in Europe but forgot Canada
The DBA forgot to create a transaction and one query failed in the middle
The DBA started a transaction but forgot to commit
The DBA created a query that runs for too long and locks production
The DBA forgot to use SET ROLE and permissions are now broken
Blame the DBA!
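Most of these mistakes boil down to missing transaction and role discipline. A minimal sketch of what a careful manual session looks like (the app_owner role is hypothetical):

BEGIN;
SET ROLE app_owner;  -- hypothetical owner role: objects must not end up owned by the superuser
ALTER TABLE customer
    ADD COLUMN archived_at TIMESTAMP WITH TIME ZONE;
COMMIT;  -- easy to forget: until then, locks are held and nothing is visible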
Ticketing
Let’s write migrations in a ticketing system
OTRS
Jira
But
- Multiple ticketing systems for different teams
- SQL statements are not well formatted
- Bad copy-pastes
- Poor reviewing system
Schema as code
Git
- One repository for one or more databases
- 534 repositories
- Well known by developers
sql-migrate
SQL schema migration tool for Go
https://github.com/rubenv/sql-migrate
Developers create code:
v0001.sql
-- +migrate Up
create table x (...);
-- +migrate Down
drop table x;
- One file = one migration = one transaction
- Transactions can be disabled with the notransaction hint (e.g. for CREATE INDEX CONCURRENTLY; see the sketch below)
- Focus on schema (DDL), not data (DML)
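For example, PostgreSQL's CREATE INDEX CONCURRENTLY cannot run inside a transaction block, hence the notransaction hint. A minimal sketch (index and table names illustrative):

-- +migrate Up notransaction
CREATE INDEX CONCURRENTLY customer_archived_at_idx
    ON customer (archived_at);

-- +migrate Down notransaction
DROP INDEX CONCURRENTLY customer_archived_at_idx;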
DBAs deploy with a command:
sql-migrate-wrapper
- Handles SET ROLE
- Two migration paths (illustrated below)
  - admin: run as superuser to create extensions
  - normal: run with DDL privileges to create objects
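Illustrative migrations for each path (the extension and table are examples, not from the talk):

-- admin path: requires superuser privileges
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- normal path: plain DDL, run under the owner role via SET ROLE
CREATE TABLE invoice (
    id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id BIGINT NOT NULL
);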
# sql-migrate-wrapper status -config /etc/sqlmigrate/test.json
INFO[2024-02-13T10:28:47Z] Processing 'test' database
INFO[2024-02-13T10:28:47Z] Processing normal migration path
+-----------+--------------------------------------+
| MIGRATION | APPLIED |
+-----------+--------------------------------------+
| v0001.sql | 2022-03-03 15:33:39.709067 +0100 CET |
| v0002.sql | 2022-11-30 12:43:34.947511 +0100 CET |
| v0003.sql | 2022-11-30 12:53:27.437182 +0100 CET |
| v0004.sql | 2023-01-24 17:20:22.330472 +0100 CET |
| v0005.sql | 2023-05-17 12:32:39.87181 +0200 CEST |
+-----------+--------------------------------------+
Code reviews
- Git based collaboration tool
- Adopted by the company
- Not open source
Features
- Comments
- Blocking tasks
- Review status (approved, needs work)
- Minimum number of required approvals
- From DBAs
- From database owners
CI/CD
Terminology
- Continuous Integration
- Test locally
- Ensure software is deliverable
- Merge changes to the main branch
- Continuous Delivery
- Build a release
- Continuous Deployment
- Build a release and deploy it
- No human intervention
Tests without data
CDS is an Enterprise-Grade Continuous Delivery & DevOps Automation Open Source Platform.
https://github.com/ovh/cds
Implementation
- Project: all schema tests at the same place
- Application: one git repository (schema)
- Pipeline: run sql-migrate on PostgreSQL 14
- Workflow: link application to pipelines
Web hooks
CDS workflow
CDS pipeline
Tests with data
Backups are based on ZFS snapshots performed and sent every day to an anonymized dev environment:
- No business impact on downtime
- Test the restore process
- Schema migrations can be performed on data very similar to production
Deployment
Workflow
Concurrent sessions
lock_timeout (2000 ms)
"Abort any statement that waits longer than the specified amount of time while attempting to acquire a lock on a table, index, row, or other database object." (PostgreSQL documentation)
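A minimal sketch of the effect, assuming the 2000 ms setting above:

SET lock_timeout = '2s';
-- if another session holds a conflicting lock on customer for more than
-- 2 seconds, this statement aborts instead of queuing and blocking production
ALTER TABLE customer
    ADD COLUMN archived_at TIMESTAMP WITH TIME ZONE;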
Inventory
Where should I connect to run the migrations?
Custom inventory
In a database of databases:
- repositories to databases
- databases to hosts
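A minimal sketch of what the underlying tables could look like, matching the API response below (table and column names are hypothetical):

CREATE TABLE repository (id UUID PRIMARY KEY, url TEXT NOT NULL);
CREATE TABLE cluster (id UUID PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE database (
    id UUID PRIMARY KEY,
    name TEXT NOT NULL,
    schemarepository UUID REFERENCES repository (id), -- repositories to databases
    cluster UUID REFERENCES cluster (id)              -- databases to hosts
);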
REST API
$ curl https://<api>/database?name=test
{
  "code": 200,
  "data": [
    {
      "schemarepository": "<uuid>",
      "name": "test",
      "id": "<uuid>",
      "cluster": "<uuid>"
    }
  ]
}
But
- Every change has to be declared
- Add a cluster, host, database, git repository, …
- Remove a cluster, host, database, git repository, …
- Easy to miss an event
- 100% home-made
Service discovery
- Distributed and highly-available data store
- Local agent
- https://github.com/hashicorp/consul
Consul
- Nodes
- name, IP address, meta(data)
- Services
- databases
- Access control list (ACL) with tokens
- Encryption
Static configuration
- Node meta
  - server_type: postgresql, mysql, filer, …
  - role: node, lb, backup, …
  - cluster: identifier
Dynamic configuration
- Node "subrole": primary, replica
- Database services
Where is my database?
What databases are behind the repository?
A special file in the repository, meta.json:
{
  "databases": {
    "US": [
      "test"
    ]
  }
}
Automation
How to apply migrations safely?
Ansible
Ansible is a radically simple IT automation system.
https://github.com/ansible/ansible
Concepts
- Inventory: combination of
- Hosts: remote machine to manage
- Groups: hosts sharing a common attribute
- Playbook: list of tasks executed in order, on groups
- Tasks: module + arguments
- Modules: smallest unit of code to execute on hosts
- Simple YAML that you should not be afraid of
Playbook overview
- name: Check arguments
  hosts: all
  run_once: true
  delegate_to: localhost
  tasks:
    - name: Check variable schema_url # fail fast
    - name: Check variable database_name # fail fast

- name: Update database to the latest schema migration
  hosts: "{{ database_name }}:&subrole_primary"
  tasks:
    - name: Create sql-migrate directories
    - name: Create sql-migrate configuration file
    - name: Clone schema
    - name: Run migrations
Playbook tasks
- name: Create sql-migrate directories
  ansible.builtin.file:
    path: "{{ item }}"
    state: directory
  loop:
    - /etc/sqlmigrate
    - /var/lib/sqlmigrate

- name: Create sql-migrate configuration file
  ansible.builtin.template:
    src: sqlmigrate/database.json.j2
    dest: "/etc/sqlmigrate/{{ database_name }}.json"

- name: Clone schema repository
  ansible.builtin.git:
    repo: "{{ schema_url }}"
    dest: "/var/lib/sqlmigrate/{{ database_name }}"
    version: "{{ branch|default('master') }}" # branch or tag
    force: true
  environment:
    TMPDIR: /run

- name: Run migrations
  ansible.builtin.command:
    cmd: sql-migrate-wrapper up -config /etc/sqlmigrate/{{ database_name }}.json
Ansible playbook command
$ ansible-playbook \
--extra-vars \
'{"database_name": "test",
"branch": "master",
"schema_url": "ssh://git@hostname:port/project/repository.git"
}' \
database-primary-schema-update.yml
Ansible + Consul
- Dynamic inventory
- https://github.com/wilfriedroset/consul-awx
The Bastion (SSH)
- The Bastion is an SSH gateway
- Fine-grained access to infrastructure
- Sessions can be recorded
- Used in secure environments (PCI DSS, SecNumCloud)
- Already used by humans
- https://github.com/ovh/the-bastion
Ansible + The Bastion
Ansible Wrapper
[ssh_connection]
pipelining = True
private_key_file = ~/.ssh/id_ed25519
ssh_executable = /usr/share/ansible/plugins/bastion/sshwrapper.py
sftp_executable = /usr/share/ansible/plugins/bastion/sftpbastion.sh
transfer_method = sftp
retries = 3
https://github.com/ovh/the-bastion-ansible-wrapper
SCP is deprecated, use SFTP instead.
AWX
AWX provides a web-based user interface, REST API, and task engine built on top of Ansible.
https://github.com/ansible/awx
AWX features
- REST API
- Web interface
- CLI
- Personal accounts (via SSO/SAML)
- Notifications (alerting, chat)
- Scheduled jobs
- Inventory management
- Source Control management
- Machine credentials
AWX Web interface
AWX CLI
awx -f human job_templates launch --monitor --extra_vars \
'{"database_name": "test",
"branch": "master",
"schema_url": "ssh://git@hostname:port/project/repository.git"
}' \
database-primary-schema-update
Event detection
How to deploy without human intervention?
Producer
Consumer
The consumer is 100% home-made today. We plan to replace this block with Event-Driven Ansible.
Current architecture
Migration failures
- AWX notifications
- Chat (Webex)
- On-call alerts (OpsGenie)
- NRPE probes (Icinga)
AWX notification on OpsGenie
NRPE probe notification on OpsGenie
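As an illustration of what such a probe can check, sql-migrate records applied migrations in a tracking table (gorp_migrations by default), which a probe could compare against the files in the repository:

SELECT id, applied_at
FROM gorp_migrations
ORDER BY applied_at DESC
LIMIT 1;  -- latest applied migration, e.g. v0005.sql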
In the end
Known issues
Highly concurrent repositories
- Multiple pull requests at the same time
- Multiple files
- From various developers
- In the same database repository
Same file names: two pull requests can both add v0006.sql
Predicted file names: each author has to guess the next version number before merging
Solution
Use a declarative model: describe the desired schema instead of writing incremental migrations, and let a tool compute the diff.
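A minimal sketch of the declarative idea: the repository stores the desired state and a diff tool derives the migration (the tool itself is not shown here):

-- desired state, stored in the repository
CREATE TABLE customer (
    id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name TEXT NOT NULL,
    archived_at TIMESTAMP WITH TIME ZONE
);

-- the migration a diff tool would generate against a database
-- where archived_at does not exist yet
ALTER TABLE customer
    ADD COLUMN archived_at TIMESTAMP WITH TIME ZONE;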
If anybody has feedback on these or other tools, please feel free to contact me.
Granular deployment
- Changes are deployed everywhere at the same time
- Problems may arise directly in production
Solution
Open questions remain: when to merge? Before or after deployment to non-critical environments? Which git branches? What orchestration around those branches? How to roll back?
Test on real data
- “DEV” environment is not available everywhere
- Anonymization is hard
- This environment will be removed
Solution
Test schema migrations on the backup host directly:
- clone the latest ZFS dataset
- create a PostgreSQL instance on this clone
- run the migrations
- clean up
schema-consumer
- 100% home-made
- Written in Python
- We have to add missing features ourselves
- We have to patch vulnerabilities ourselves
- Not open source
Solution
Event-Driven Ansible
- Free software
- Same features (at least)
- Events from Kafka
- Create jobs on AWX
- Easy to use
- Rulebooks in YAML, similar to playbooks
- Easy to contribute
Links to sources:
Takeaways
- SQL is your friend
- Automation is easy
- Start small then iterate
- Learn from your errors