# GMLM Platform — Disaster Recovery Runbook

**Last updated:** Phase 10 — DevOps  
**Applies to:** All GMLM customer deployments

---

## Overview

This runbook covers recovery procedures for the failure scenarios the platform is most likely to encounter.
Each procedure is a numbered sequence of shell commands, run over SSH unless a step says otherwise.
Any engineer with production server access can execute them; no special tooling is required.

**Time-to-recovery targets:**

| Scenario | Target RTO | Target RPO |
|----------|-----------|-----------|
| Application crash / restart | < 2 minutes | 0 (stateless) |
| Failed update (auto rollback) | < 5 minutes | 0 (backup taken before update) |
| Failed update (manual rollback) | < 15 minutes | 0 (backup taken before update) |
| Database corruption | < 30 minutes | < 24 hours (daily backup) |
| Complete server failure | < 2 hours | < 24 hours (daily backup) |
| Redis failure | < 5 minutes | 0 for stored data (Redis holds cache, sessions, and queued jobs only) |

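The 24-hour RPO figures only hold if a recent backup actually exists. The following is a minimal sketch of a freshness check, assuming GNU `stat` (as shipped with Ubuntu) and the backup layout used in the scenarios of this runbook. `backup_is_fresh` is a hypothetical helper, not part of the platform.

```bash
# Sketch: succeed only if FILE was modified within the last MAX_HOURS hours.
# Returns 2 if the file does not exist, 1 if it is too old, 0 if it is fresh.
backup_is_fresh() {
  local file=$1 max_hours=${2:-24}
  local now mtime age_hours
  [ -f "$file" ] || return 2
  now=$(date +%s)
  mtime=$(stat -c %Y "$file")          # GNU stat: modification time in epoch seconds
  age_hours=$(( (now - mtime) / 3600 ))
  [ "$age_hours" -lt "$max_hours" ]
}

# Example (assumed path layout):
# latest=$(ls -t /var/www/gmlm/storage/app/backups/*/database.sql.gz | head -1)
# backup_is_fresh "$latest" 24 || echo "WARNING: newest backup is older than the 24-hour RPO"
```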
---

## Scenario 1: Application Is Down (HTTP 502/503)

**Symptoms:** Users see 502 Bad Gateway or the site is unreachable.

```bash
# 1. SSH to server
ssh -p 22 ubuntu@YOUR_SERVER_IP

# 2. Check container status
cd /var/www/gmlm
docker compose ps

# 3. Check recent logs
docker compose logs --tail=50 app

# 4. Restart the application container
docker compose restart app

# 5. Wait 15 seconds and verify
sleep 15
curl -sf http://localhost/api/v1/health

# 6. If still failing, check Nginx
docker compose exec app nginx -t
docker compose exec app nginx -s reload

# 7. If Nginx is not the issue, recreate the app container
docker compose up -d --force-recreate app

# 8. Verify, then follow the logs (Ctrl+C to stop)
curl -sf http://localhost/api/v1/health
docker compose logs -f app
```
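
A fixed `sleep 15` can be too short on a slow start and needlessly long on a fast one. As a sketch, the wait in step 5 could be replaced with a small polling loop. `wait_until` is a hypothetical helper name, and the attempt count and delay are arbitrary choices.

```bash
# Sketch: run a probe command until it succeeds or the attempts run out.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
  local attempts=$1 delay=$2 i
  shift 2
  for (( i = 1; i <= attempts; i++ )); do
    if "$@" >/dev/null 2>&1; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    # Do not sleep after the final failed attempt
    if (( i < attempts )); then sleep "$delay"; fi
  done
  echo "still failing after $attempts attempt(s)" >&2
  return 1
}

# Example: poll the health endpoint every 3 seconds, up to 10 times.
# wait_until 10 3 curl -sf http://localhost/api/v1/health
```

The probe is passed as a command, so the same loop can wrap other checks in this runbook.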

---

## Scenario 2: Manual Rollback After Failed Update

**When to use:** The update engine's automatic rollback ran but left the platform in maintenance mode,
or you need to manually restore from a backup.

```bash
# 1. List available backups (run from the compose project directory)
cd /var/www/gmlm
ls -lt /var/www/gmlm/storage/app/backups/

# 2. Read the backup manifest to find the right backup
cat /var/www/gmlm/storage/app/backups/BACKUP_ID/backup-manifest.json

# 3. Restore the database from backup
# DB_PASSWORD must be set in your shell; the host shell does not read .env automatically
gunzip < /var/www/gmlm/storage/app/backups/BACKUP_ID/database.sql.gz \
  | docker compose exec -T mysql \
    mysql -u gmlm_user -p"${DB_PASSWORD}" gmlm_db

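# 4. Restore core files
# NOTE: the archive path is resolved inside the app container. If the backups directory
# is mounted at a different path there (for example under /var/www/html), adjust it.
docker compose exec app tar -xzf \
  /var/www/gmlm/storage/app/backups/BACKUP_ID/core-files.tar.gz \
  -C /var/www/html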

# 5. Clear all caches
docker compose exec app php artisan cache:clear
docker compose exec app php artisan config:clear
docker compose exec app php artisan route:clear
docker compose exec app php artisan view:clear

# 6. Restart workers
docker compose exec app php artisan horizon:terminate
docker compose restart horizon

# 7. Take out of maintenance mode
docker compose exec app php artisan up

# 8. Verify
docker compose exec app php artisan gmlm:health
curl -sf https://YOUR_DOMAIN/api/v1/health
```
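
The restore commands expand `${DB_PASSWORD}` in the host shell, so the variable has to be set there first. The following is a minimal sketch for reading one value out of a dotenv-style file without sourcing the whole file. It assumes simple `KEY=value` lines and that the password is stored under the key `DB_PASSWORD` in `/var/www/gmlm/.env`. `env_value` is a hypothetical helper.

```bash
# Sketch: print the value of KEY from a dotenv-style file without sourcing it.
# Assumes plain KEY=value lines; removes one leading and one trailing quote character.
env_value() {
  local file=$1 key=$2 line value
  # If the key appears more than once, the last definition wins
  line=$(grep -E "^${key}=" "$file" | tail -n 1) || true
  [ -n "$line" ] || return 1
  value=${line#*=}                       # everything after the first "="
  value=${value%\"}; value=${value#\"}   # drop double quotes, if any
  value=${value%\'}; value=${value#\'}   # drop single quotes, if any
  printf '%s\n' "$value"
}

# Example (assumed file location and key name):
# DB_PASSWORD=$(env_value /var/www/gmlm/.env DB_PASSWORD)
# export DB_PASSWORD
```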

---

## Scenario 3: Database Corruption or Data Loss

**When to use:** Database tables are corrupted, critical data was accidentally deleted,
or InnoDB crashed without clean shutdown.

```bash
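# STEP 1: Immediately stop web traffic to limit further corruption
# Maintenance mode only blocks HTTP requests; queue workers can still write to the database.
# If needed, pause them with `docker compose stop horizon` and start them again before STEP 10.
docker compose exec app php artisan down --render="errors::503"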

# STEP 2: Take a snapshot of the current (corrupted) state for analysis
# -T disables TTY allocation so the piped dump is not altered
docker compose exec -T mysql mysqldump \
  -u root -p"${DB_ROOT_PASSWORD}" \
  --all-databases \
  | gzip > /tmp/corrupted-snapshot-$(date +%Y%m%d-%H%M%S).sql.gz

# STEP 3: Identify the most recent backup (confirm it predates the corruption)
LATEST_BACKUP=$(ls -t /var/www/gmlm/storage/app/backups/*/database.sql.gz | head -1)
echo "Will restore from: $LATEST_BACKUP"

# STEP 4: Check backup integrity (do NOT continue to STEP 5 if the file is reported corrupt)
gunzip -t "$LATEST_BACKUP" && echo "Backup file is valid" || echo "BACKUP CORRUPT - check older backups"

# STEP 5: Drop and recreate the database
docker compose exec -T mysql mysql -u root -p${DB_ROOT_PASSWORD} << 'EOF'
DROP DATABASE IF EXISTS gmlm_db;
CREATE DATABASE gmlm_db CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
GRANT ALL PRIVILEGES ON gmlm_db.* TO 'gmlm_user'@'%';
FLUSH PRIVILEGES;
EOF

# STEP 6: Restore from backup
gunzip < "$LATEST_BACKUP" \
  | docker compose exec -T mysql \
    mysql -u gmlm_user -p${DB_PASSWORD} gmlm_db

# STEP 7: Verify row counts are reasonable (table_rows is only an estimate for InnoDB tables)
docker compose exec -T mysql mysql -u gmlm_user -p${DB_PASSWORD} gmlm_db << 'EOF'
SELECT table_name, table_rows
FROM information_schema.tables
WHERE table_schema = 'gmlm_db'
ORDER BY table_rows DESC;
EOF

# STEP 8: Run any missed migrations (safe — only applies pending ones)
docker compose exec app php artisan migrate --force

# STEP 9: Verify financial integrity
docker compose exec -T mysql mysql -u gmlm_user -p${DB_PASSWORD} gmlm_db << 'EOF'
SELECT
  COUNT(*) AS discrepancies
FROM wallets w
LEFT JOIN (
  SELECT wallet_id, SUM(CASE WHEN type='credit' THEN amount ELSE -amount END) as calc
  FROM wallet_transactions GROUP BY wallet_id
) t ON t.wallet_id = w.id
WHERE ABS(w.balance - COALESCE(t.calc, 0)) > 0.01;
EOF
# Expected output: discrepancies = 0

# STEP 10: Bring platform back online
docker compose exec app php artisan up
curl -sf https://YOUR_DOMAIN/api/v1/health
```
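
STEP 4 tells you to check older backups when the newest one is corrupt. That search can be sketched as a loop that walks the backups newest first and stops at the first archive that passes `gzip -t`. `latest_valid_backup` is a hypothetical helper. Note that `gzip -t` only validates the compressed stream, so it does not prove the SQL dump inside is complete.

```bash
# Sketch: print the newest database.sql.gz under BACKUP_ROOT that passes `gzip -t`,
# skipping corrupt archives instead of stopping at the first one.
latest_valid_backup() {
  local root=$1 candidate
  # ls -t sorts newest first; assumes backup paths contain no newlines
  while IFS= read -r candidate; do
    if gzip -t "$candidate" 2>/dev/null; then
      printf '%s\n' "$candidate"
      return 0
    fi
    echo "skipping corrupt backup: $candidate" >&2
  done < <(ls -t "$root"/*/database.sql.gz 2>/dev/null)
  return 1   # no valid backup found
}

# Example:
# LATEST_BACKUP=$(latest_valid_backup /var/www/gmlm/storage/app/backups) \
#   || echo "No valid backup found - escalate before dropping the database"
```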

---

## Scenario 4: Redis Failure

**Impact:** Sessions are lost (users logged out), the cache is empty (slower responses), and
queue processing stops. **No stored financial data is at risk**: Redis holds only cache, sessions, and queued jobs.

```bash
# 1. Restart Redis
docker compose restart redis

# 2. Verify Redis is responding
docker compose exec redis redis-cli ping  # Should return PONG

# 3. Only if Redis data is corrupt: flush and restart
# WARNING: FLUSHALL also deletes queued jobs that have not been processed yet
docker compose exec redis redis-cli FLUSHALL
docker compose restart redis

# 4. Restart Horizon (it reconnects to Redis on startup)
docker compose restart horizon

# 5. Warm the cache
docker compose exec app php artisan gmlm:warm-cache

# 6. Users will need to log in again (sessions were in Redis)
# This is expected and correct behaviour.
```
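
Because the flush in step 3 is destructive and easy to run by reflex, one option is to gate it behind a typed confirmation. This is a sketch only; `confirm_phrase` is a hypothetical helper and the phrase to type is an arbitrary choice.

```bash
# Sketch: require the operator to type an exact phrase before a destructive step.
# Reads one line from stdin; returns 0 only on an exact match.
confirm_phrase() {
  local expected=$1 reply=""
  printf 'Type "%s" to continue: ' "$expected" >&2
  IFS= read -r reply || true   # EOF leaves reply empty, which never matches
  [ "$reply" = "$expected" ]
}

# Example: guard the flush in step 3.
# confirm_phrase FLUSHALL && docker compose exec redis redis-cli FLUSHALL
```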

---

## Scenario 5: Complete Server Failure (New Server Recovery)

**When to use:** The production server is unrecoverable (hardware failure, data centre outage,
accidental termination) and you need to rebuild the platform on a new server.

```bash
# ─── On your LOCAL machine ───────────────────────────────────

# STEP 1: Provision a new server (DigitalOcean, AWS EC2, etc.)
# Minimum spec: 4 CPU, 8GB RAM, 80GB SSD, Ubuntu 22.04

# STEP 2: Copy your codebase to the new server
scp -r /path/to/gmlm-codebase ubuntu@NEW_SERVER_IP:/var/www/gmlm

# OR: Clone from your private repository
ssh ubuntu@NEW_SERVER_IP "git clone https://github.com/your-org/gmlm-platform /var/www/gmlm"

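# STEP 3: Fetch the most recent backup, then copy it to the new server
# If you have S3 backups:
aws s3 cp s3://YOUR_BUCKET/backups/latest-backup.tar.gz /tmp/
# If the archive bundles the database dump, extract database.sql.gz into /tmp/ first

# If transferring from old server (if partially accessible):
scp ubuntu@OLD_SERVER:/var/www/gmlm/storage/app/backups/BACKUP_ID/database.sql.gz /tmp/

# Copy the dump to the new server (STEP 5 expects it at /tmp/database.sql.gz)
scp /tmp/database.sql.gz ubuntu@NEW_SERVER_IP:/tmp/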

# ─── On the NEW server ───────────────────────────────────────

# STEP 4: Run the deployment script (installs Docker, configures everything)
cd /var/www/gmlm
sudo bash deploy.sh \
  --domain YOUR_DOMAIN \
  --company "Company Name" \
  --email admin@company.com \
  --license YOUR_LICENSE_KEY \
  --currency USD

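# The deploy.sh script will:
# - Install Docker
# - Generate a new .env with fresh secrets
# - Start containers
# - Ask you to run provisioning
#
# NOTE: values encrypted with the old server's APP_KEY cannot be decrypted with a
# freshly generated key. If you have a copy of the old .env, restore its APP_KEY.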

# STEP 5: Instead of fresh provisioning, restore from backup
# First, let containers start
docker compose up -d mysql redis

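# Wait for MySQL, then confirm it accepts connections before restoring
# DB_ROOT_PASSWORD must be set in your shell, e.g. copied from the generated .env
sleep 30
docker compose exec -T mysql mysqladmin ping -u root -p"${DB_ROOT_PASSWORD}"

# Restore database
gunzip < /tmp/database.sql.gz \
  | docker compose exec -T mysql \
    mysql -u root -p"${DB_ROOT_PASSWORD}" gmlm_db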

# STEP 6: Start the full application
docker compose up -d

# STEP 7: Mark as installed (skip the setup wizard)
sed -i 's/APP_INSTALLED=false/APP_INSTALLED=true/' .env
docker compose exec app php artisan config:cache

# STEP 8: Reset the admin password if needed
docker compose exec app php artisan tinker
# In tinker:
# App\Domains\Identity\Models\User::where('role','admin')->first()->update(['password' => Hash::make('new-password')]);

# STEP 9: Update DNS to point to the new server IP
# Propagation depends on the record's TTL; clients may reach the old IP until it expires

# STEP 10: Verify
docker compose exec app php artisan gmlm:health
curl -sf https://YOUR_DOMAIN/api/v1/health
```
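
A database dump that crosses two or three machines is worth verifying before it is restored. The following sketch compares a SHA-256 digest recorded on the source side with the file that arrived. It assumes `sha256sum` from GNU coreutils. `verify_sha256` is a hypothetical helper, and this runbook does not state whether backups ship with a recorded digest.

```bash
# Sketch: compare a file's SHA-256 digest with an expected value.
# The expected digest would be recorded on the source machine before the transfer,
# for example with: sha256sum /tmp/database.sql.gz
verify_sha256() {
  local file=$1 expected=$2 actual
  actual=$(sha256sum "$file" | awk '{print $1}')
  if [ "$actual" = "$expected" ]; then
    echo "checksum OK: $file"
  else
    echo "checksum MISMATCH: $file" >&2
    return 1
  fi
}

# Example (run on the new server, digest copied from the source machine):
# verify_sha256 /tmp/database.sql.gz EXPECTED_DIGEST || echo "Transfer corrupted - copy again"
```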

---

## Scenario 6: Horizon Workers Stopped Processing

**Symptoms:** Queue jobs are pending but nothing is being processed.
Dashboard shows stale data, commissions not being calculated.

```bash
# 1. Check Horizon status
docker compose exec app php artisan horizon:status

# 2. Check for failed jobs
docker compose exec app php artisan queue:failed

# 3. Restart Horizon
docker compose exec app php artisan horizon:terminate
docker compose restart horizon

# 4. Monitor the restart
docker compose logs -f horizon

# 5. If jobs are stuck in failed state, review and retry
docker compose exec app php artisan queue:failed    # List failed jobs
docker compose exec app php artisan queue:retry all # Retry all failed jobs
# OR retry specific job:
docker compose exec app php artisan queue:retry JOB_ID

# 6. If there are poison pill jobs (fail every retry), delete them
docker compose exec app php artisan queue:forget JOB_ID  # Delete one failed job
docker compose exec app php artisan queue:flush  # WARNING: deletes ALL failed jobs
```

---

## Scenario 7: Disk Space Full

**Symptoms:** Application throws storage errors, logs stop writing, uploads fail.

```bash
# 1. Check disk usage
df -h
du -sh /var/www/gmlm/storage/logs/*
du -sh /var/www/gmlm/storage/app/backups/*/

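# 2. Remove stopped containers, unused networks, dangling images, and build cache
# Do NOT add --volumes: if the MySQL container is stopped, its data volume may be deleted
docker system prune --force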

# 3. Compress logs that have already been rotated
gzip /var/www/gmlm/storage/logs/gmlm-*.log.1

# 4. Remove old backups (keep last 5)
ls -t /var/www/gmlm/storage/app/backups/ | tail -n +6 | \
  xargs -I {} rm -rf "/var/www/gmlm/storage/app/backups/{}"

# 5. Compress old commission logs (financial retention: 7 years)
find /var/www/gmlm/storage/logs -name "commissions-*.log" -mtime +7 -exec gzip {} \;

# 6. If still critical, emergency log cleanup (last resort — DO NOT touch security/commission logs)
find /var/www/gmlm/storage/logs -name "gmlm-*.log" -mtime +3 -delete
```
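
Step 4 deletes backups with `rm -rf` in a single pipeline, which leaves no chance to review the list first. As a sketch, the same "keep the newest N" rule can be wrapped in a function that defaults to a dry run. `prune_backups` and the `DRY_RUN` variable are hypothetical names.

```bash
# Sketch: keep the KEEP newest backup directories under ROOT and remove the rest.
# With DRY_RUN=1 (the default here) it only prints what would be deleted.
prune_backups() {
  local root=$1 keep=${2:-5} dry_run=${DRY_RUN:-1} name
  # ls -1t lists newest first; tail skips the first KEEP entries.
  # Assumes backup directory names contain no newlines.
  ls -1t "$root" | tail -n "+$(( keep + 1 ))" | while IFS= read -r name; do
    [ -d "$root/$name" ] || continue
    if [ "$dry_run" = "1" ]; then
      echo "would remove: $root/$name"
    else
      rm -rf -- "${root:?}/$name"   # :? aborts if root is empty
      echo "removed: $root/$name"
    fi
  done
}

# Example: review first, then delete.
# prune_backups /var/www/gmlm/storage/app/backups 5
# DRY_RUN=0 prune_backups /var/www/gmlm/storage/app/backups 5
```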

---

## Emergency Contacts

| Role | Contact | Escalate When |
|------|---------|---------------|
| Platform Engineer | Configure in your team | Any P1 incident |
| Database Administrator | Configure in your team | Scenarios 3 and 5 |
| Hosting Provider Support | Your provider's support | Scenario 5 or any hardware failure |
| GMLM Platform Support | support@globalmlmsoftware.com | License issues, update failures |

---

## Post-Incident Checklist

After resolving any incident:

- [ ] Document what happened and when in an incident report
- [ ] Verify financial data integrity (wallet balances = sum of transactions)
- [ ] Verify no commission was double-paid or missed
- [ ] Review logs for root cause
- [ ] Check for failed jobs and retry or investigate
- [ ] Run `php artisan gmlm:health` and confirm all checks pass
- [ ] Notify affected customers if downtime exceeded SLA
- [ ] Update this runbook if any step was inaccurate or missing
