Watcher Health Monitoring System

Overview

The Watcher Health Monitoring System monitors the file watcher service, logs its errors, and restarts it automatically if it crashes or stops unexpectedly.

Features

1. Health Monitoring

  • Continuous Health Checks: The system performs health checks every 30 seconds by default
  • State Change Detection: Automatically detects when the watcher unexpectedly stops
  • Real-time Status: Provides immediate feedback on watcher health status

2. Error Logging

  • Comprehensive Error Tracking: All watcher errors are logged to the database
  • Error History: Maintains the last 100 errors for investigation
  • Timestamp Records: Each error includes a timestamp for analysis
  • Error Messages: Detailed error descriptions help diagnose issues

3. Automatic Recovery

  • Configurable Auto-Recovery: Can be enabled or disabled via settings or API
  • Intelligent Recovery: Attempts to restart the watcher with the last known configuration
  • Recovery Limiting: Prevents infinite recovery loops with configurable attempt limits
  • Recovery Tracking: Logs all recovery attempts for troubleshooting

4. Web Interface Integration

  • Health Dashboard: Displays current watcher health status
  • Error Log Viewer: View recent errors from the dashboard
  • Auto-Recovery Toggle: Enable/disable auto-recovery from the UI
  • Real-time Notifications: WebSocket events notify about health changes

Architecture

Backend Components

WatcherHealthService (watcher-health.service.ts)

The main service responsible for monitoring and recovery:

// Outline of the service's methods; bodies are omitted for brevity.
@Injectable()
export class WatcherHealthService implements OnModuleInit, OnModuleDestroy {
  // Runs health checks every 30 seconds
  @Cron(CronExpression.EVERY_30_SECONDS)
  async healthCheckTask()

  // Attempts to recover a failed watcher
  private async attemptRecovery(lastStatus: any)

  // Logs errors to the database
  private logWatcherError(message: string)

  // Returns the current health status
  getHealthStatus(): WatcherHealthRecord

  // Retrieves recent errors
  getRecentErrors(limit = 20): Array<{ timestamp: string; message: string }>
}

Database Schema

New table created automatically for error logging:

CREATE TABLE watcher_errors (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  timestamp TEXT NOT NULL,
  message TEXT NOT NULL,
  recovery_attempt INTEGER DEFAULT 0,
  created_at TEXT NOT NULL
);

Settings

Two new configuration settings are stored in the database:

  • watcher_auto_recovery (boolean): Enable/disable automatic recovery
  • watcher_health_check_interval (number): Health check interval in milliseconds

Frontend Components

WatcherHealthStatus (WatcherHealthStatus.tsx)

React component displaying:

  • Health status indicator (green for healthy, red for unhealthy)
  • Last error message and timestamp
  • Recent error log viewer
  • Auto-recovery status and toggle
  • Clear error logs button

API Endpoints

Get Watcher Health Status

GET /watcher/health

Returns:

{
  "timestamp": "2026-01-13T12:34:56.789Z",
  "isWatching": true,
  "lastCheckTime": "2026-01-13T12:34:56.789Z",
  "lastErrorTime": null,
  "lastErrorMessage": null,
  "isHealthy": true,
  "recoveryAttempts": 0
}
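
A client that consumes this endpoint may want to validate the payload before using it. The sketch below models the response as a TypeScript type, inferred only from the sample above. The names `WatcherHealthResponse` and `isWatcherHealthResponse` are hypothetical, and the service's own `WatcherHealthRecord` type may differ.

```typescript
// Hypothetical client-side model of the GET /watcher/health payload,
// inferred from the sample response (the service's real type may differ).
interface WatcherHealthResponse {
  timestamp: string;
  isWatching: boolean;
  lastCheckTime: string;
  lastErrorTime: string | null;
  lastErrorMessage: string | null;
  isHealthy: boolean;
  recoveryAttempts: number;
}

// Runtime guard: checks that an unknown JSON value has the expected shape.
function isWatcherHealthResponse(value: unknown): value is WatcherHealthResponse {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  const nullableString = (x: unknown): boolean => x === null || typeof x === 'string';
  return (
    typeof v['timestamp'] === 'string' &&
    typeof v['isWatching'] === 'boolean' &&
    typeof v['lastCheckTime'] === 'string' &&
    nullableString(v['lastErrorTime']) &&
    nullableString(v['lastErrorMessage']) &&
    typeof v['isHealthy'] === 'boolean' &&
    typeof v['recoveryAttempts'] === 'number'
  );
}
```

A guard like this lets the dashboard fall back to an "unknown" state instead of rendering a malformed payload.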

Get Recent Error Logs

GET /watcher/errors?limit=20

Returns:

[
  {
    "timestamp": "2026-01-13T12:30:00.000Z",
    "message": "Watcher error: ENOSPC - no space left on device"
  }
]

Clear Error Logs

DELETE /watcher/errors

Returns:

{
  "cleared": 5
}

Set Auto-Recovery Status

POST /watcher/auto-recovery

Request body:

{
  "enabled": true
}

Get Auto-Recovery Status

GET /watcher/auto-recovery

Returns:

{
  "enabled": true
}
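
The endpoints above can be collected into a small set of request builders. This is a sketch only: `WatcherRequest` and `watcherRequests` are hypothetical names, and the builders produce plain descriptors rather than calling any HTTP library, so they can be passed to whatever client the application already uses.

```typescript
// Hypothetical descriptor for one HTTP call to the watcher API.
interface WatcherRequest {
  method: 'GET' | 'POST' | 'DELETE';
  path: string;
  body?: string; // JSON-encoded request body, when the endpoint takes one
}

// One builder per endpoint documented in this section.
const watcherRequests = {
  health: (): WatcherRequest => ({ method: 'GET', path: '/watcher/health' }),

  // limit defaults to 20, matching the example query string
  recentErrors: (limit: number = 20): WatcherRequest => ({
    method: 'GET',
    path: `/watcher/errors?limit=${limit}`,
  }),

  clearErrors: (): WatcherRequest => ({ method: 'DELETE', path: '/watcher/errors' }),

  setAutoRecovery: (enabled: boolean): WatcherRequest => ({
    method: 'POST',
    path: '/watcher/auto-recovery',
    body: JSON.stringify({ enabled }),
  }),

  getAutoRecovery: (): WatcherRequest => ({ method: 'GET', path: '/watcher/auto-recovery' }),
};
```

For example, `watcherRequests.setAutoRecovery(false)` yields a `POST` to `/watcher/auto-recovery` with the body `{"enabled":false}`.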

How It Works

Normal Operation

  1. WatcherHealthService initializes and starts monitoring
  2. Every 30 seconds, it checks if the watcher is still running
  3. If healthy, no action is taken
  4. Status updates are sent via WebSocket to connected clients when the state changes

Watcher Failure Detection

  1. The health check detects that the watcher is not running when it should be
  2. An error is logged to the database
  3. A WebSocket event is sent to all connected clients
  4. If auto-recovery is enabled, the recovery process begins
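
The decision made on each health check can be sketched as a pure function. This is an illustration under stated assumptions, not the service's code: `WatcherSnapshot`, `HealthAction`, and `evaluateHealth` are hypothetical names, and `shouldBeWatching` stands in for however the service remembers that the watcher was started.

```typescript
// Hypothetical view of the watcher at the moment of a health check.
interface WatcherSnapshot {
  shouldBeWatching: boolean; // the watcher was started and not deliberately stopped
  isWatching: boolean;       // the watcher is actually running right now
}

type HealthAction = 'none' | 'log_and_alert' | 'log_alert_and_recover';

// Maps a snapshot to the action described in the failure-detection steps.
function evaluateHealth(snapshot: WatcherSnapshot, autoRecoveryEnabled: boolean): HealthAction {
  const stoppedUnexpectedly = snapshot.shouldBeWatching && !snapshot.isWatching;
  if (!stoppedUnexpectedly) return 'none';
  return autoRecoveryEnabled ? 'log_alert_and_recover' : 'log_and_alert';
}
```

Keeping the decision separate from its side effects (database write, WebSocket emit, restart) makes the detection logic easy to unit test.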

Automatic Recovery

  1. The recovery attempt counter is incremented
  2. The existing watcher is stopped (if hung)
  3. The watcher is restarted with the last known configuration
  4. Success or failure is logged
  5. A WebSocket event notifies clients of the recovery status
  6. On successful recovery, the attempt counter resets
  7. Recovery is limited to 5 attempts per hour; the counter resets after 1 hour
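
The attempt limit in the last two steps can be sketched as a small fixed-window counter. This is one possible reading of "5 attempts per hour", not necessarily how the service implements it. `RecoveryLimiter` and its members are hypothetical names, and the clock is injectable so the window can be tested without waiting.

```typescript
// Hypothetical limiter: allows at most maxAttempts per window, and resets
// when the window expires or when a recovery succeeds.
class RecoveryLimiter {
  private attempts = 0;
  private windowStart: number | null = null;
  private readonly maxAttempts: number;
  private readonly windowMs: number;
  private readonly now: () => number;

  constructor(maxAttempts: number = 5, windowMs: number = 60 * 60 * 1000, now: () => number = Date.now) {
    this.maxAttempts = maxAttempts;
    this.windowMs = windowMs;
    this.now = now;
  }

  // Returns true and counts the attempt if another recovery is allowed.
  tryAcquire(): boolean {
    const t = this.now();
    if (this.windowStart === null || t - this.windowStart >= this.windowMs) {
      // Start a fresh window: first attempt ever, or the previous window expired.
      this.windowStart = t;
      this.attempts = 0;
    }
    if (this.attempts >= this.maxAttempts) return false;
    this.attempts += 1;
    return true;
  }

  // Called after a successful recovery to reset the counter.
  recordSuccess(): void {
    this.attempts = 0;
    this.windowStart = null;
  }

  get attemptCount(): number {
    return this.attempts;
  }
}
```

A recovery routine would call `tryAcquire()` before restarting the watcher and `recordSuccess()` once the restart is confirmed.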

Error Logging

  • All errors are persisted to the watcher_errors table
  • Automatic cleanup keeps only the last 100 errors
  • Errors can be viewed and cleared via the API or web interface
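
The "keep only the last 100" rule can be illustrated in memory. The sketch below is an assumption about the behavior, not the service's code: `WatcherErrorEntry` and `appendAndPrune` are hypothetical names, and the SQL in the comment is one way the same rule could be expressed against the `watcher_errors` table.

```typescript
// Hypothetical in-memory equivalent of a watcher_errors row.
interface WatcherErrorEntry {
  timestamp: string;
  message: string;
}

// Appends an entry and drops the oldest ones beyond maxEntries.
// An SQL equivalent (assumption, not taken from the service) would be:
//   DELETE FROM watcher_errors
//   WHERE id NOT IN (SELECT id FROM watcher_errors ORDER BY id DESC LIMIT 100);
function appendAndPrune(
  log: readonly WatcherErrorEntry[],
  entry: WatcherErrorEntry,
  maxEntries: number = 100,
): WatcherErrorEntry[] {
  const next = [...log, entry];
  return next.length > maxEntries ? next.slice(next.length - maxEntries) : next;
}
```

The function returns a new array rather than mutating its input, which keeps the pruning rule easy to test in isolation.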

Configuration

Environment Variables

None required; all settings are stored in the database.

Database Settings

Settings can be modified programmatically or via the API:

// Enable auto-recovery
setSettings({ watcher_auto_recovery: true });

// Set health check interval to 60 seconds
setSettings({ watcher_health_check_interval: 60000 });
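
Because the interval is stored in milliseconds, a small helper can prevent unit mistakes. This is a sketch: `WatcherHealthSettings` and `healthCheckIntervalPatch` are hypothetical names, and the only parts taken from this document are the two setting keys and the millisecond unit.

```typescript
// Hypothetical shape of the two settings described in this document.
interface WatcherHealthSettings {
  watcher_auto_recovery: boolean;
  watcher_health_check_interval: number; // milliseconds
}

// Builds a settings patch from an interval given in seconds.
function healthCheckIntervalPatch(
  seconds: number,
): Pick<WatcherHealthSettings, 'watcher_health_check_interval'> {
  if (!Number.isFinite(seconds) || seconds <= 0) {
    throw new RangeError(`interval must be a positive number of seconds, got ${seconds}`);
  }
  return { watcher_health_check_interval: Math.round(seconds * 1000) };
}
```

The result can be passed straight to the settings call shown above, for example `setSettings(healthCheckIntervalPatch(60))`.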

Monitoring & Troubleshooting

Viewing Error Logs

  1. Go to the "Watcher Health & Monitoring" section on the dashboard
  2. Click "View Errors" to see recent errors
  3. Errors are displayed with timestamps
  4. Click "Clear Log" to remove all errors

Enabling/Disabling Auto-Recovery

  1. Find the "Auto-Recovery" section in the health monitoring panel
  2. Click "Enable" or "Disable" as needed
  3. A confirmation toast appears when the setting is updated

Interpreting Health Status

  • Healthy (Green): The watcher is running and has no recent errors
  • Unhealthy (Red): The watcher has stopped or recent errors have occurred
  • The number of recovery attempts is shown if recovery has been attempted
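
The mapping from a health record to what the panel shows can be sketched as follows. The field names `isHealthy` and `recoveryAttempts` come from the health endpoint's sample response. `HealthIndicator` and `toHealthIndicator` are hypothetical, and the real component may derive its display differently.

```typescript
// Subset of the health record needed to render the indicator.
interface HealthSummaryInput {
  isHealthy: boolean;
  recoveryAttempts: number;
}

// Hypothetical display model for the status panel.
interface HealthIndicator {
  label: 'Healthy' | 'Unhealthy';
  color: 'green' | 'red';
  recoveryNote: string | null; // null when no recovery has been attempted
}

function toHealthIndicator(health: HealthSummaryInput): HealthIndicator {
  const recoveryNote =
    health.recoveryAttempts > 0 ? `Recovery attempts: ${health.recoveryAttempts}` : null;
  return health.isHealthy
    ? { label: 'Healthy', color: 'green', recoveryNote }
    : { label: 'Unhealthy', color: 'red', recoveryNote };
}
```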

Common Issues

"Watcher stopped unexpectedly"

  • Check disk space and system resources
  • Review the error logs for the specific error message
  • Check file system permissions for the watched directories
  • Verify that the watcher service has access to the configured paths

Recovery attempts not resetting

  • Recovery attempts reset automatically after 1 hour
  • Alternatively, clear the error logs via the API to reset the counters

Auto-recovery not working

  • Verify auto-recovery is enabled in the UI
  • Check error logs for specific failure reasons
  • Ensure the service has permission to restart the watcher
  • Check system resources (file descriptors, memory)

Integration Points

WebSocket Events

The health monitoring system emits the following events:

// Watcher encountered an error
{
  type: 'health_alert',
  healthy: false,
  reason: 'Error message'
}

// Watcher was successfully recovered
{
  type: 'recovered',
  message: 'Watcher recovered successfully after failure'
}
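
On the client, the two event shapes above fit naturally into a discriminated union keyed on `type`. The sketch below assumes only the fields shown in the examples. `WatcherHealthEvent` and `describeEvent` are hypothetical names, and the transport that delivers the events is left out.

```typescript
// Union of the two event payloads documented above.
type WatcherHealthEvent =
  | { type: 'health_alert'; healthy: boolean; reason: string }
  | { type: 'recovered'; message: string };

// Turns an event into a short message suitable for a notification.
function describeEvent(event: WatcherHealthEvent): string {
  switch (event.type) {
    case 'health_alert':
      // The compiler narrows `event` here, so `reason` is known to exist.
      return `Watcher unhealthy: ${event.reason}`;
    case 'recovered':
      return event.message;
  }
}
```

Because the switch is exhaustive over `type`, adding a third event shape to the union later will surface as a compile error wherever it is not handled (with strict compiler settings).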

Task Queue Integration

The watcher can interact with the task queue when recovering. If the watcher restarts, it resumes with the last known watch configuration.

Performance Considerations

  • Health checks run every 30 seconds (configurable)
  • Each health check is lightweight because it only queries the watcher status
  • Error logs are limited to 100 entries (old entries auto-deleted)
  • WebSocket events are only sent on state changes
  • Minimal database overhead

Future Enhancements

Potential improvements:

  • Configurable recovery retry strategies
  • Advanced pattern matching for specific error types
  • Email/Slack notifications on watcher failures
  • Metrics and analytics dashboard
  • Health check history graphs
  • Customizable recovery delay intervals

Testing

To test the watcher health monitoring:

  1. Start the watcher via the UI
  2. Force stop the watcher (via API or directly kill the process)
  3. Observe automatic recovery (if enabled)
  4. Check error logs for recorded failure
  5. Monitor the health dashboard for status updates
  6. Test error log clearing functionality

Code References

  • Service: apps/service/src/watcher-health.service.ts
  • Module Integration: apps/service/src/app.module.ts
  • App Service: apps/service/src/app.service.ts
  • Controller Endpoints: apps/service/src/app.controller.ts
  • UI Component: apps/web/src/app/components/WatcherHealthStatus.tsx
  • Dashboard Integration: apps/web/src/app/components/StatsSection.tsx