Health Checks
PMDaemon's health check system verifies that your processes are not just running but functioning correctly. It goes beyond basic process monitoring by confirming that each application is healthy and ready to serve traffic.
Overview
Health checks in PMDaemon offer:
- 🏥 HTTP health checks - Monitor web services via HTTP endpoints
- 📜 Script-based health checks - Custom validation logic for any application type
- ⏱️ Configurable parameters - Timeout, interval, and retry settings
- 🚦 Blocking start command - Wait for processes to be healthy before continuing
- 🔄 Auto-restart on failure - Automatic restart when health checks fail
- 📊 Health status integration - Visible in all monitoring interfaces
Health Check Types
HTTP Health Checks
Monitor web services by making HTTP requests to specific endpoints:
# Basic HTTP health check
pmdaemon start "node server.js" \
--name web-api \
--port 3000 \
--health-check-url http://localhost:3000/health
# With custom parameters
pmdaemon start "python api.py" \
--name python-api \
--port 8000 \
--health-check-url http://localhost:8000/status \
--health-check-timeout 10s \
--health-check-interval 30s \
--health-check-retries 3
How it works:
- PMDaemon makes HTTP GET requests to the specified URL
- Considers 2xx status codes as healthy
- Retries on failure according to retry settings
- Marks process as unhealthy after max retries exceeded
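The rule above (one GET request, any 2xx response counts as healthy) can be sketched in a few lines of Python. This is only an illustration of the described behavior, not PMDaemon's implementation; the function names `is_healthy_status` and `http_health_check` are hypothetical, and unlike PMDaemon this sketch may follow redirects because `urllib` does so by default.

```python
import urllib.error
import urllib.request


def is_healthy_status(status_code: int) -> bool:
    """Treat any 2xx response as healthy."""
    return 200 <= status_code < 300


def http_health_check(url: str, timeout_seconds: float = 5.0) -> bool:
    """Send one GET request and report whether the endpoint looks healthy.

    Hypothetical helper for illustration only.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return is_healthy_status(response.status)
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses; the code is still available.
        return is_healthy_status(err.code)
    except OSError:
        # Connection refused, DNS failure, or timeout all count as a failed check.
        return False
```

Running `http_health_check("http://localhost:3000/health")` by hand is a quick way to see what a single check would observe before involving retries.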
Script-based Health Checks
Run custom scripts for complex health validation:
# Basic script health check
pmdaemon start "python worker.py" \
--name background-worker \
--health-check-script ./health-check.sh
# With custom parameters
pmdaemon start "node processor.js" \
--name data-processor \
--health-check-script ./scripts/check-processor.py \
--health-check-timeout 15s \
--health-check-interval 60s \
--health-check-retries 2
How it works:
- PMDaemon executes the specified script/command
- Exit code 0 indicates healthy, non-zero indicates unhealthy
- Script output is captured for debugging
- Retries on failure according to retry settings
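The exit-code contract above can be modeled as follows. This is a minimal sketch under the assumption that a timed-out or missing script counts as a failed check; `script_health_check` is a hypothetical helper, not part of PMDaemon.

```python
import subprocess
import sys


def script_health_check(command: list[str], timeout_seconds: float = 10.0) -> tuple[bool, str]:
    """Run a check command; exit code 0 means healthy, anything else unhealthy.

    Returns (healthy, captured_output). Hypothetical helper for illustration.
    """
    try:
        result = subprocess.run(
            command,
            capture_output=True,  # keep stdout/stderr for debugging
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # Assumption: a check that exceeds its timeout is treated as a failure.
        return False, f"timed out after {timeout_seconds}s"
    except OSError as err:
        # Script missing or not executable.
        return False, str(err)
    output = (result.stdout + result.stderr).strip()
    return result.returncode == 0, output


# Usage: any command works; here a trivial Python one-liner stands in for a script.
healthy, output = script_health_check([sys.executable, "-c", "print('ok')"])
print(healthy, output)
```

Returning the captured output alongside the verdict mirrors the point above that script output is kept for debugging.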
Configuration Parameters
Timeout Settings
Control how long to wait for health check responses:
# Short timeout for fast services
--health-check-timeout 5s
# Longer timeout for complex checks
--health-check-timeout 30s
# Very long timeout for heavy operations
--health-check-timeout 2m
Supported formats:
- 5s - 5 seconds
- 30s - 30 seconds
- 2m - 2 minutes
- 1h - 1 hour
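To make the format concrete, here is a small sketch that converts the duration strings listed above into seconds. It only covers the `s`, `m`, and `h` suffixes shown in this section; whether PMDaemon accepts other forms (bare numbers, combined values) is not stated here, so the sketch rejects them. `parse_duration` is a hypothetical name.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}
_DURATION_RE = re.compile(r"^(\d+)([smh])$")


def parse_duration(text: str) -> int:
    """Convert a string such as '30s', '2m', or '1h' into seconds."""
    match = _DURATION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]


print(parse_duration("2m"))  # 120
```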
Interval Settings
Configure how often health checks run:
# Frequent checks for critical services
--health-check-interval 10s
# Standard interval for most services
--health-check-interval 30s
# Less frequent for stable services
--health-check-interval 5m
Retry Settings
Set how many times to retry failed health checks:
# Conservative - fail fast
--health-check-retries 1
# Balanced - allow for temporary issues
--health-check-retries 3
# Lenient - very tolerant of failures
--health-check-retries 5
Blocking Start Command
The --wait-ready flag makes the start command wait until health checks pass:
# Wait for HTTP service to be ready
pmdaemon start "node api.js" \
--name api-service \
--port 3000 \
--health-check-url http://localhost:3000/health \
--wait-ready
# Wait with custom timeout
pmdaemon start "python worker.py" \
--name worker \
--health-check-script ./health.sh \
--wait-ready \
--wait-timeout 60s
Perfect for deployment scripts:
#!/bin/bash
# Deploy script that waits for services
echo "Starting API service..."
pmdaemon start "node api.js" \
--name api \
--health-check-url http://localhost:3000/health \
--wait-ready
echo "API is ready! Starting worker..."
pmdaemon start "python worker.py" \
--name worker \
--health-check-script ./worker-health.sh \
--wait-ready
echo "All services are healthy and ready!"
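Conceptually, a blocking start is a poll-until-deadline loop. The sketch below shows that pattern in isolation; it is not PMDaemon's code. The 30-second default mirrors the default wait timeout mentioned in the troubleshooting section, while `poll_interval` and the function name `wait_until_ready` are assumptions made for illustration.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    wait_timeout: float = 30.0,
    poll_interval: float = 1.0,
) -> bool:
    """Poll `check` until it passes or `wait_timeout` seconds elapse."""
    deadline = time.monotonic() + wait_timeout
    while True:
        if check():
            return True
        if time.monotonic() + poll_interval > deadline:
            # Not enough time left for another attempt.
            return False
        time.sleep(poll_interval)
```

A deployment script can treat a `False` result the same way it would treat a non-zero exit from a timed-out `--wait-ready` start: stop and investigate instead of starting dependent services.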
Health Check Examples
Web API Health Check
Application code (Node.js):
// server.js
const express = require('express');
const app = express();

// Health check endpoint
app.get('/health', (req, res) => {
  // Check database connection, external services, etc.
  const isHealthy = checkDatabase() && checkRedis();
  if (isHealthy) {
    res.status(200).json({ status: 'healthy', timestamp: new Date() });
  } else {
    res.status(503).json({ status: 'unhealthy', timestamp: new Date() });
  }
});

app.listen(3000);
PMDaemon configuration:
pmdaemon start "node server.js" \
--name web-api \
--port 3000 \
--health-check-url http://localhost:3000/health \
--health-check-timeout 5s \
--health-check-interval 30s \
--health-check-retries 3 \
--wait-ready
Database Worker Health Check
Health check script:
#!/bin/bash
# worker-health.sh

# Check if worker process is responding
if ! pgrep -f "python worker.py" > /dev/null; then
    echo "Worker process not found"
    exit 1
fi

# Check if worker can connect to database
if ! python -c "import psycopg2; psycopg2.connect('host=localhost dbname=mydb')" 2>/dev/null; then
    echo "Cannot connect to database"
    exit 1
fi

# Check if worker queue is not too backed up
QUEUE_SIZE=$(redis-cli llen worker_queue 2>/dev/null)
# Treat a missing or non-numeric reply (e.g. Redis unreachable) as a failure
if ! [[ "$QUEUE_SIZE" =~ ^[0-9]+$ ]]; then
    echo "Cannot read queue size from Redis"
    exit 1
fi
if [ "$QUEUE_SIZE" -gt 1000 ]; then
    echo "Queue too large: $QUEUE_SIZE items"
    exit 1
fi

echo "Worker is healthy"
exit 0
PMDaemon configuration:
pmdaemon start "python worker.py" \
--name db-worker \
--health-check-script ./worker-health.sh \
--health-check-timeout 10s \
--health-check-interval 60s \
--health-check-retries 2
Microservice Health Check
Python FastAPI with health endpoint:
# main.py
from fastapi import FastAPI, HTTPException
# redis-py 4.2+ ships the asyncio client that used to live in the aioredis package
from redis import asyncio as aioredis

app = FastAPI()


@app.get("/health")
async def health_check():
    try:
        # Check Redis connection
        redis = aioredis.from_url("redis://localhost")
        await redis.ping()
        await redis.close()
        # Check other dependencies...
        return {"status": "healthy", "checks": {"redis": "ok"}}
    except Exception as e:
        raise HTTPException(status_code=503, detail=f"Unhealthy: {str(e)}")
PMDaemon configuration:
pmdaemon start "python -m uvicorn main:app --host 0.0.0.0 --port 8000" \
--name microservice \
--port 8000 \
--health-check-url http://localhost:8000/health \
--health-check-timeout 15s \
--health-check-interval 45s \
--wait-ready
Health Status Integration
Health status is visible throughout PMDaemon's interfaces:
Process List
pmdaemon list
┌────┬─────────────┬────────┬─────┬──────┬─────┬────────┬─────────┬──────────┬────────┐
│ ID │ Name │ Status │ PID │ Port │ CPU │ Memory │ Uptime │ Restarts │ Health │
├────┼─────────────┼────────┼─────┼──────┼─────┼────────┼─────────┼──────────┼────────┤
│ 0 │ web-api │ 🟢 │ 123 │ 3000 │ 2% │ 45MB │ 2h 15m │ 0 │ ✅ │
│ 1 │ worker │ 🟢 │ 124 │ - │ 1% │ 32MB │ 1h 30m │ 1 │ ⚠️ │
│ 2 │ processor │ 🔴 │ - │ - │ - │ - │ - │ 3 │ ❌ │
└────┴─────────────┴────────┴─────┴──────┴─────┴────────┴─────────┴──────────┴────────┘
Health indicators:
- ✅ Healthy - All health checks passing
- ⚠️ Warning - Some health checks failing but within retry limits
- ❌ Unhealthy - Health checks failed, process may be restarted
- ❓ Unknown - Health checks not configured or not yet run
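One way to read the four indicators is as a function of how many checks have failed in a row. The sketch below encodes that reading; it assumes a process becomes unhealthy once consecutive failures reach the configured retry count, which matches the "restart after 3 consecutive failures" example in the auto-restart section. The `Health` enum and `classify` function are hypothetical names.

```python
from enum import Enum
from typing import Optional


class Health(Enum):
    HEALTHY = "healthy"
    WARNING = "warning"
    UNHEALTHY = "unhealthy"
    UNKNOWN = "unknown"


def classify(consecutive_failures: Optional[int], max_retries: int) -> Health:
    """Map a failure streak to an indicator.

    `consecutive_failures` is None when no check is configured or has run yet.
    """
    if consecutive_failures is None:
        return Health.UNKNOWN
    if consecutive_failures == 0:
        return Health.HEALTHY
    if consecutive_failures < max_retries:
        # Failing, but still within the retry limit.
        return Health.WARNING
    return Health.UNHEALTHY
```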
Real-time Monitoring
pmdaemon monit
Shows live health status updates with color-coded indicators.
Process Information
pmdaemon info web-api
Process Information:
Name: web-api
Status: Online
PID: 1234
Port: 3000
Health Check:
Type: HTTP
URL: http://localhost:3000/health
Status: Healthy
Last Check: 2024-01-15 14:30:25
Success Rate: 98.5% (197/200)
Timeout: 5s
Interval: 30s
Retries: 3
Auto-restart on Health Failure
When health checks fail consistently, PMDaemon can automatically restart the process:
# Enable auto-restart on health failure (default behavior)
pmdaemon start "node api.js" \
--name api \
--health-check-url http://localhost:3000/health \
--health-check-retries 3 # Restart after 3 consecutive failures
Restart behavior:
- Health check fails
- PMDaemon retries according to --health-check-retries
- If all retries fail, process is marked as unhealthy
- Process is automatically restarted
- Health checks resume after restart
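The sequence above can be simulated by feeding a list of check outcomes through the restart rule. This is a sketch of the described behavior, assuming the failure counter resets both on a successful check and after a restart; `supervise` and its `restart` callback are hypothetical and stand in for whatever PMDaemon does internally.

```python
from typing import Callable, Iterable


def supervise(
    check_results: Iterable[bool],
    max_retries: int,
    restart: Callable[[], None],
) -> int:
    """Apply the restart rule to a sequence of check outcomes.

    Returns how many restarts were triggered.
    """
    consecutive_failures = 0
    restarts = 0
    for healthy in check_results:
        if healthy:
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        if consecutive_failures >= max_retries:
            restart()
            restarts += 1
            # Assumption: checks resume with a clean slate after a restart.
            consecutive_failures = 0
    return restarts


# Two isolated failures do not trigger a restart; three in a row do.
print(supervise([True, False, False, True, False, False, False], 3, lambda: None))  # 1
```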
Web API Integration
Health status is available via the REST API and WebSocket:
REST API
# Get all processes with health status
curl http://localhost:9615/api/processes
# Get specific process health
curl http://localhost:9615/api/processes/web-api/health
WebSocket Updates
# Connect to WebSocket for real-time health updates
wscat -c ws://localhost:9615/ws
Health status changes are broadcast in real-time to connected clients.
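For scripted access, the per-process health endpoint shown above can be queried from Python with the standard library. This is a minimal sketch: the URL path comes from the curl example, but the shape of the JSON response is not documented in this section, so the code returns the decoded payload as-is. `health_url` and `fetch_health` are hypothetical helper names.

```python
import json
import urllib.request
from urllib.parse import quote


def health_url(base_url: str, process_name: str) -> str:
    """Build the per-process health URL used in the curl example."""
    return f"{base_url.rstrip('/')}/api/processes/{quote(process_name, safe='')}/health"


def fetch_health(base_url: str, process_name: str, timeout_seconds: float = 5.0):
    """Fetch and decode the health payload; the response schema is not assumed."""
    url = health_url(base_url, process_name)
    with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
        return json.load(response)


print(health_url("http://localhost:9615", "web-api"))
```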
Best Practices
1. Design Proper Health Endpoints
// Good: Comprehensive health check
app.get('/health', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    redis: await checkRedis(),
    external_api: await checkExternalAPI(),
    disk_space: checkDiskSpace()
  };
  const isHealthy = Object.values(checks).every(check => check.healthy);
  res.status(isHealthy ? 200 : 503).json({
    status: isHealthy ? 'healthy' : 'unhealthy',
    checks,
    timestamp: new Date()
  });
});

// Avoid: Simple always-healthy endpoint
app.get('/health', (req, res) => {
  res.json({ status: 'ok' }); // Not useful
});
2. Set Appropriate Timeouts
# Fast web APIs
--health-check-timeout 5s --health-check-interval 30s
# Database operations
--health-check-timeout 15s --health-check-interval 60s
# Heavy batch processing
--health-check-timeout 30s --health-check-interval 300s
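When choosing these values, it helps to estimate how long a dead service can go unnoticed. The sketch below gives a rough upper bound, assuming each failed attempt can take the full timeout, the next attempt starts one interval later, and the process is marked unhealthy after `retries` consecutive failures. PMDaemon's actual scheduling may differ, and `worst_case_detection_seconds` is a hypothetical name.

```python
def worst_case_detection_seconds(interval: int, timeout: int, retries: int) -> int:
    """Rough upper bound on the time from failure to 'unhealthy'."""
    return retries * (interval + timeout)


# Fast web API: 30s interval, 5s timeout, 3 retries
print(worst_case_detection_seconds(30, 5, 3))    # 105
# Heavy batch processing: 300s interval, 30s timeout, 3 retries
print(worst_case_detection_seconds(300, 30, 3))  # 990
```

Under these assumptions the batch settings can leave a failed process undetected for over 16 minutes, which is worth weighing against the cost of running an expensive check more often.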
3. Use Blocking Start for Dependencies
# Start database first and wait
pmdaemon start "postgres" --name db --wait-ready
# Then start API that depends on database
pmdaemon start "node api.js" \
--name api \
--health-check-url http://localhost:3000/health \
--wait-ready
4. Monitor Health Check Performance
# View health check statistics
pmdaemon info process-name
# Monitor for patterns in health failures
pmdaemon logs process-name | grep "health"
Troubleshooting
Health Checks Always Failing
# Check if health endpoint is accessible
curl http://localhost:3000/health
# Verify health check script manually
./health-check.sh
echo $? # Should be 0 for healthy
# Check PMDaemon logs
pmdaemon logs process-name
Blocking Start Timing Out
# Increase wait timeout
pmdaemon start app.js \
--health-check-url http://localhost:3000/health \
--wait-ready \
--wait-timeout 120s # Increase from default 30s
# Check what's preventing health checks from passing
curl -v http://localhost:3000/health
False Positive Health Failures
# Increase retry count for flaky services
--health-check-retries 5
# Increase timeout for slow responses
--health-check-timeout 30s
# Reduce check frequency
--health-check-interval 120s
Next Steps
- Monitoring - Real-time process monitoring
- Web API - Access health status via API
- Deployment Examples - Production deployment patterns
- Troubleshooting - Common issues and solutions