Health Checks

PMDaemon's health check system verifies that your processes are not just running but actually functioning correctly. It goes beyond basic process monitoring by probing each application, over HTTP or with a custom script, to confirm that it is healthy and ready to serve traffic.

Overview

Health checks in PMDaemon offer:

🏥 HTTP health checks - Monitor web services via HTTP endpoints
📜 Script-based health checks - Custom validation logic for any application type
⏱️ Configurable parameters - Timeout, interval, and retry settings
🚦 Blocking start command - Wait for processes to be healthy before continuing
🔄 Auto-restart on failure - Automatic restart when health checks fail
📊 Health status integration - Visible in all monitoring interfaces

Health Check Types

HTTP Health Checks

Monitor web services by making HTTP requests to specific endpoints:

# Basic HTTP health check
pmdaemon start "node server.js" \
  --name web-api \
  --port 3000 \
  --health-check-url http://localhost:3000/health

# With custom parameters
pmdaemon start "python api.py" \
  --name python-api \
  --port 8000 \
  --health-check-url http://localhost:8000/status \
  --health-check-timeout 10s \
  --health-check-interval 30s \
  --health-check-retries 3

How it works:

PMDaemon makes HTTP GET requests to the specified URL
Considers 2xx status codes as healthy
Retries on failure according to retry settings
Marks process as unhealthy after max retries exceeded
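
The steps above can be sketched in Python. This is an illustration of the described behavior, not PMDaemon's implementation. The function names are hypothetical, and the sketch assumes that --health-check-retries is the number of consecutive failed attempts tolerated before giving up (matching the "restart after 3 consecutive failures" comment later on this page). PMDaemon may count the first attempt separately.

```python
import http.client
import urllib.request
from typing import Callable


def is_healthy_status(status: int) -> bool:
    """Treat any 2xx response as healthy."""
    return 200 <= status < 300


def probe_http(url: str, timeout_s: float) -> bool:
    """Send one GET request; errors, timeouts and non-2xx responses count as failures."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return is_healthy_status(resp.status)
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError (4xx/5xx) and socket timeouts are all OSError subclasses.
        return False


def check_with_retries(probe: Callable[[], bool], retries: int) -> bool:
    """Run the probe up to `retries` times (at least once); pass on the first success."""
    for _ in range(max(1, retries)):
        if probe():
            return True
    return False
```

For example, `check_with_retries(lambda: probe_http("http://localhost:3000/health", 10), retries=3)` mirrors the second command above, without the waiting between attempts that --health-check-interval adds.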

Script-based Health Checks

Run custom scripts for complex health validation:

# Basic script health check
pmdaemon start "python worker.py" \
  --name background-worker \
  --health-check-script ./health-check.sh

# With custom parameters
pmdaemon start "node processor.js" \
  --name data-processor \
  --health-check-script ./scripts/check-processor.py \
  --health-check-timeout 15s \
  --health-check-interval 60s \
  --health-check-retries 2

How it works:

PMDaemon executes the specified script/command
Exit code 0 indicates healthy, non-zero indicates unhealthy
Script output is captured for debugging
Retries on failure according to retry settings
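
A minimal Python sketch of how such a check can be evaluated: run the command, enforce a timeout, capture its output, and map the exit code to healthy or unhealthy. The function names are hypothetical, and `sample_check` is only a stand-in for a real health script. How PMDaemon itself launches scripts is not shown in this page.

```python
import subprocess
import sys
from typing import List, Tuple


def run_script_check(command: List[str], timeout_s: float) -> Tuple[bool, str]:
    """Run a check command; exit code 0 means healthy, anything else unhealthy."""
    try:
        result = subprocess.run(
            command,
            capture_output=True,  # keep stdout/stderr for debugging
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout_s}s"
    except OSError as exc:
        # The script is missing or not executable.
        return False, f"could not run check: {exc}"
    output = (result.stdout + result.stderr).strip()
    return result.returncode == 0, output


def sample_check(exit_code: int) -> List[str]:
    """Hypothetical stand-in for a health script: prints a line, then exits with `exit_code`."""
    program = f"import sys; print('exit code {exit_code}'); sys.exit({exit_code})"
    return [sys.executable, "-c", program]
```

Calling `run_script_check(["./health-check.sh"], 15)` would apply the same rule to the script from the first example.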

Configuration Parameters

Timeout Settings

Control how long to wait for health check responses:

# Short timeout for fast services
--health-check-timeout 5s

# Longer timeout for complex checks
--health-check-timeout 30s

# Very long timeout for heavy operations
--health-check-timeout 2m

Supported formats:

5s - 5 seconds
30s - 30 seconds
2m - 2 minutes
1h - 1 hour
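
To make the format concrete, here is a small Python sketch that converts these values to seconds. It covers only the s, m and h forms listed above and is not PMDaemon's own parser; `parse_duration` is a hypothetical name.

```python
import re

# Seconds per unit for the suffixes listed above.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Convert a value such as '5s', '2m' or '1h' into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]
```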

Interval Settings

Configure how often health checks run:

# Frequent checks for critical services
--health-check-interval 10s

# Standard interval for most services
--health-check-interval 30s

# Less frequent for stable services
--health-check-interval 5m

Retry Settings

Set how many times to retry failed health checks:

# Strict - fail fast
--health-check-retries 1

# Balanced - allow for temporary issues
--health-check-retries 3

# Lenient - very tolerant of failures
--health-check-retries 5
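
Timeout, interval and retries together determine how long a broken process can go unnoticed. The following Python sketch estimates an upper bound under stated assumptions: checks start once per interval, each check may run up to its timeout, and the process is marked unhealthy after `retries` consecutive failed checks. PMDaemon's actual scheduling may differ, so treat the numbers as a rough guide.

```python
def worst_case_detection_s(interval_s: float, timeout_s: float, retries: int) -> float:
    """Upper bound, in seconds, between a failure starting and the unhealthy mark."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    # Up to one full interval passes before the first failing check starts,
    # then (retries - 1) further intervals separate the remaining attempts,
    # and the final attempt may run until its timeout.
    return interval_s * retries + timeout_s


# The three retry settings above, with a 30s interval and a 5s timeout.
for retries in (1, 3, 5):
    print(retries, worst_case_detection_s(30, 5, retries))
```

Under these assumptions the three settings give 35, 95 and 155 seconds respectively.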

Blocking Start Command

The --wait-ready flag makes the start command wait until health checks pass:

# Wait for HTTP service to be ready
pmdaemon start "node api.js" \
  --name api-service \
  --port 3000 \
  --health-check-url http://localhost:3000/health \
  --wait-ready

# Wait with custom timeout
pmdaemon start "python worker.py" \
  --name worker \
  --health-check-script ./health.sh \
  --wait-ready \
  --wait-timeout 60s

This is especially useful in deployment scripts, where each step should begin only after the previous service is ready:

#!/bin/bash
# Deploy script that waits for services

echo "Starting API service..."
pmdaemon start "node api.js" \
  --name api \
  --health-check-url http://localhost:3000/health \
  --wait-ready

echo "API is ready! Starting worker..."
pmdaemon start "python worker.py" \
  --name worker \
  --health-check-script ./worker-health.sh \
  --wait-ready

echo "All services are healthy and ready!"
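
Conceptually, --wait-ready is a poll-until-deadline loop. The Python sketch below illustrates that idea and is not PMDaemon's implementation: `wait_until_healthy` and the poll interval are assumptions, and `timeout_s` plays the role of --wait-timeout. The clock and sleep functions are injectable so the loop can be exercised without real waiting.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    timeout_s: float,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it passes or `timeout_s` elapses; return whether it passed."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(poll_interval_s, remaining))
```

A deployment script would continue to the next service only when this returns True, and abort otherwise.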

Health Check Examples

Web API Health Check

Application code (Node.js):

// server.js
const express = require('express');
const app = express();

// Health check endpoint
app.get('/health', (req, res) => {
  // Check database connection, external services, etc.
  // checkDatabase() and checkRedis() are placeholders for your own dependency checks.
  const isHealthy = checkDatabase() && checkRedis();
  
  if (isHealthy) {
    res.status(200).json({ status: 'healthy', timestamp: new Date() });
  } else {
    res.status(503).json({ status: 'unhealthy', timestamp: new Date() });
  }
});

app.listen(3000);

PMDaemon configuration:

pmdaemon start "node server.js" \
  --name web-api \
  --port 3000 \
  --health-check-url http://localhost:3000/health \
  --health-check-timeout 5s \
  --health-check-interval 30s \
  --health-check-retries 3 \
  --wait-ready

Database Worker Health Check

Health check script:

#!/bin/bash
# worker-health.sh

# Check if worker process is responding
if ! pgrep -f "python worker.py" > /dev/null; then
    echo "Worker process not found"
    exit 1
fi

# Check if worker can connect to database
if ! python -c "import psycopg2; psycopg2.connect('host=localhost dbname=mydb')" 2>/dev/null; then
    echo "Cannot connect to database"
    exit 1
fi

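# Check if worker queue is not too backed up
QUEUE_SIZE=$(redis-cli llen worker_queue 2>/dev/null)
# An empty or non-numeric result means Redis could not be queried
if ! [[ "$QUEUE_SIZE" =~ ^[0-9]+$ ]]; then
    echo "Cannot read queue size from Redis"
    exit 1
fi
if [ "$QUEUE_SIZE" -gt 1000 ]; then
    echo "Queue too large: $QUEUE_SIZE items"
    exit 1
fi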

echo "Worker is healthy"
exit 0

PMDaemon configuration:

pmdaemon start "python worker.py" \
  --name db-worker \
  --health-check-script ./worker-health.sh \
  --health-check-timeout 10s \
  --health-check-interval 60s \
  --health-check-retries 2
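
Health check scripts can be written in any language that can set an exit code. As an illustration, here is a Python sketch of the queue check from worker-health.sh. It assumes `redis-cli` is on the PATH, and the function names are hypothetical. The parsing is kept separate from the Redis call so the decision logic can be tested without a running server.

```python
import re
import subprocess
from typing import Optional, Tuple

QUEUE_LIMIT = 1000  # same threshold as worker-health.sh


def parse_queue_size(output: str) -> Optional[int]:
    """Extract the queue length from `redis-cli llen` output, or None if it is not a number."""
    match = re.fullmatch(r"(?:\(integer\)\s*)?(\d+)", output.strip())
    return int(match.group(1)) if match else None


def evaluate_queue(output: str, limit: int = QUEUE_LIMIT) -> Tuple[int, str]:
    """Return (exit_code, message): 0 for healthy, 1 for unhealthy."""
    size = parse_queue_size(output)
    if size is None:
        return 1, "Cannot read queue size from Redis"
    if size > limit:
        return 1, f"Queue too large: {size} items"
    return 0, "Worker is healthy"


def main() -> int:
    """Query Redis through redis-cli and translate the result into an exit code."""
    try:
        result = subprocess.run(
            ["redis-cli", "llen", "worker_queue"],
            capture_output=True, text=True, timeout=5,
        )
        output = result.stdout
    except (OSError, subprocess.TimeoutExpired):
        output = ""  # treated as unreadable by evaluate_queue
    code, message = evaluate_queue(output)
    print(message)
    return code
```

To use it as a script, end the file with `raise SystemExit(main())` and pass its path to --health-check-script.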

Microservice Health Check

Python FastAPI with health endpoint:

# main.py
from fastapi import FastAPI, HTTPException
import asyncio
import aioredis

app = FastAPI()

@app.get("/health")
async def health_check():
    try:
        # Check Redis connection
        redis = await aioredis.from_url("redis://localhost")
        await redis.ping()
        await redis.close()
        
        # Check other dependencies...
        
        return {"status": "healthy", "checks": {"redis": "ok"}}
    except Exception as e:
        raise HTTPException(status_code=503, detail=f"Unhealthy: {str(e)}")

PMDaemon configuration:

pmdaemon start "python -m uvicorn main:app --host 0.0.0.0 --port 8000" \
  --name microservice \
  --port 8000 \
  --health-check-url http://localhost:8000/health \
  --health-check-timeout 15s \
  --health-check-interval 45s \
  --wait-ready

Health Status Integration

Health status is visible throughout PMDaemon's interfaces:

Process List

pmdaemon list

┌────┬─────────────┬────────┬─────┬──────┬─────┬────────┬─────────┬──────────┬────────┐
│ ID │ Name        │ Status │ PID │ Port │ CPU │ Memory │ Uptime  │ Restarts │ Health │
├────┼─────────────┼────────┼─────┼──────┼─────┼────────┼─────────┼──────────┼────────┤
│ 0  │ web-api     │ 🟢     │ 123 │ 3000 │ 2%  │ 45MB   │ 2h 15m  │ 0        │ ✅     │
│ 1  │ worker      │ 🟢     │ 124 │ -    │ 1%  │ 32MB   │ 1h 30m  │ 1        │ ⚠️     │
│ 2  │ processor   │ 🔴     │ -   │ -    │ -   │ -      │ -       │ 3        │ ❌     │
└────┴─────────────┴────────┴─────┴──────┴─────┴────────┴─────────┴──────────┴────────┘

Health indicators:

✅ Healthy - All health checks passing
⚠️ Warning - Some health checks failing but within retry limits
❌ Unhealthy - Health checks failed, process may be restarted
❓ Unknown - Health checks not configured or not yet run
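
One way to read these indicators is as a function of the current failure streak. The Python sketch below is an interpretation of the descriptions above, not PMDaemon's code. It assumes the process becomes unhealthy once the number of consecutive failures reaches the configured retries, and that `None` stands for "no check has run or none is configured".

```python
from enum import Enum
from typing import Optional


class Health(Enum):
    HEALTHY = "✅"
    WARNING = "⚠️"
    UNHEALTHY = "❌"
    UNKNOWN = "❓"


def classify(consecutive_failures: Optional[int], retries: int) -> Health:
    """Map a failure streak to the indicator shown in the Health column."""
    if consecutive_failures is None:
        return Health.UNKNOWN
    if consecutive_failures == 0:
        return Health.HEALTHY
    if consecutive_failures < retries:
        # Failing, but still within the retry limit.
        return Health.WARNING
    return Health.UNHEALTHY
```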

Real-time Monitoring

pmdaemon monit

The monitor view shows live health status updates with color-coded indicators.

Process Information

pmdaemon info web-api

Process Information:
  Name: web-api
  Status: Online
  PID: 1234
  Port: 3000
  Health Check:
    Type: HTTP
    URL: http://localhost:3000/health
    Status: Healthy
    Last Check: 2024-01-15 14:30:25
    Success Rate: 98.5% (197/200)
    Timeout: 5s
    Interval: 30s
    Retries: 3

Auto-restart on Health Failure

When health checks fail consistently, PMDaemon can automatically restart the process:

# Enable auto-restart on health failure (default behavior)
pmdaemon start "node api.js" \
  --name api \
  --health-check-url http://localhost:3000/health \
  --health-check-retries 3  # Restart after 3 consecutive failures

Restart behavior:

1. Health check fails
2. PMDaemon retries according to --health-check-retries
3. If all retries fail, process is marked as unhealthy
4. Process is automatically restarted
5. Health checks resume after restart
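
The restart sequence can be modeled as a small loop over check results. This Python sketch is a simulation under one assumption, taken from the comment in the example above: with --health-check-retries 3, the third consecutive failure triggers the restart. The `supervise` function and its event strings are hypothetical.

```python
from typing import Callable, Iterable, List


def supervise(results: Iterable[bool], retries: int, restart: Callable[[], None]) -> List[str]:
    """Replay a sequence of check results and restart once the retry limit is reached."""
    events: List[str] = []
    failures = 0
    for passed in results:
        if passed:
            failures = 0  # any success ends the failure streak
            events.append("healthy")
            continue
        failures += 1
        if failures < retries:
            events.append(f"failed {failures}/{retries}")
        else:
            events.append("unhealthy, restarting")
            restart()
            failures = 0  # checks resume with a clean slate after the restart
    return events
```

Feeding it `[True, False, False, False, True]` with `retries=3` yields one restart, followed by a healthy check.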

Web API Integration

Health status is available via the REST API and WebSocket:

REST API

# Get all processes with health status
curl http://localhost:9615/api/processes

# Get specific process health
curl http://localhost:9615/api/processes/web-api/health
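
The same endpoints can be called from a script. Below is a minimal Python sketch using only the standard library. It reuses the address and paths from the curl examples and makes no assumption about the shape of the JSON that comes back. The helper names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

API_BASE = "http://localhost:9615"  # address used in the curl examples


def process_health_url(name: str, base: str = API_BASE) -> str:
    """Build the per-process health URL, escaping the process name."""
    return f"{base.rstrip('/')}/api/processes/{quote(name, safe='')}/health"


def fetch_json(url: str, timeout_s: float = 5.0):
    """GET a URL and decode the JSON body; raises on network or HTTP errors."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return json.load(resp)
```

For example, `fetch_json(process_health_url("web-api"))` requests the second URL above.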

WebSocket Updates

# Connect to WebSocket for real-time health updates
wscat -c ws://localhost:9615/ws

Health status changes are broadcast in real-time to connected clients.

Best Practices

1. Design Proper Health Endpoints

// Good: Comprehensive health check
app.get('/health', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    redis: await checkRedis(),
    external_api: await checkExternalAPI(),
    disk_space: checkDiskSpace()
  };
  
  const isHealthy = Object.values(checks).every(check => check.healthy);
  
  res.status(isHealthy ? 200 : 503).json({
    status: isHealthy ? 'healthy' : 'unhealthy',
    checks,
    timestamp: new Date()
  });
});

// Avoid: Simple always-healthy endpoint
app.get('/health', (req, res) => {
  res.json({ status: 'ok' });  // Not useful
});

2. Set Appropriate Timeouts

# Fast web APIs
--health-check-timeout 5s --health-check-interval 30s

# Database operations
--health-check-timeout 15s --health-check-interval 60s

# Heavy batch processing
--health-check-timeout 30s --health-check-interval 300s

3. Use Blocking Start for Dependencies

# Start database first and wait
pmdaemon start "postgres" --name db --wait-ready

# Then start API that depends on database
pmdaemon start "node api.js" \
  --name api \
  --health-check-url http://localhost:3000/health \
  --wait-ready

4. Monitor Health Check Performance

# View health check statistics
pmdaemon info process-name

# Monitor for patterns in health failures
pmdaemon logs process-name | grep "health"

Troubleshooting

Health Checks Always Failing

# Check if health endpoint is accessible
curl http://localhost:3000/health

# Verify health check script manually
./health-check.sh
echo $?  # Should be 0 for healthy

# Check PMDaemon logs
pmdaemon logs process-name

Blocking Start Timing Out

# Increase wait timeout
pmdaemon start "node app.js" \
  --health-check-url http://localhost:3000/health \
  --wait-ready \
  --wait-timeout 120s  # Increase from default 30s

# Check what's preventing health checks from passing
curl -v http://localhost:3000/health

False Positive Health Failures

# Increase retry count for flaky services
--health-check-retries 5

# Increase timeout for slow responses
--health-check-timeout 30s

# Reduce check frequency
--health-check-interval 120s

Next Steps

Monitoring - Real-time process monitoring
Web API - Access health status via API
Deployment Examples - Production deployment patterns
Troubleshooting - Common issues and solutions
