Watcher Health Monitoring System
Overview
The Watcher Health Monitoring System provides monitoring, logging, and automatic recovery for the file watcher service. It ensures that the watcher keeps running as expected and automatically recovers it if it crashes or stops unexpectedly.
Features
1. Health Monitoring
- Continuous Health Checks: The system performs health checks every 30 seconds by default
- State Change Detection: Automatically detects when the watcher unexpectedly stops
- Real-time Status: Provides immediate feedback on watcher health status
2. Error Logging
- Comprehensive Error Tracking: All watcher errors are logged to the database
- Error History: Maintains the last 100 errors for investigation
- Timestamp Records: Each error includes a timestamp for analysis
- Error Messages: Detailed error descriptions help diagnose issues
3. Automatic Recovery
- Configurable Auto-Recovery: Can be enabled or disabled via settings or API
- Intelligent Recovery: Attempts to restart the watcher with the last known configuration
- Recovery Limiting: Prevents infinite recovery loops with configurable attempt limits
- Recovery Tracking: Logs all recovery attempts for troubleshooting
4. Web Interface Integration
- Health Dashboard: Displays current watcher health status
- Error Log Viewer: View recent errors from the dashboard
- Auto-Recovery Toggle: Enable/disable auto-recovery from the UI
- Real-time Notifications: WebSocket events notify about health changes
Architecture
Backend Components
WatcherHealthService (watcher-health.service.ts)
The main service responsible for monitoring and recovery:
import { Injectable, OnModuleInit, OnModuleDestroy } from '@nestjs/common';
import { Cron, CronExpression } from '@nestjs/schedule';

@Injectable()
export class WatcherHealthService implements OnModuleInit, OnModuleDestroy {
  // Runs health checks every 30 seconds
  @Cron(CronExpression.EVERY_30_SECONDS)
  async healthCheckTask(): Promise<void>

  // Attempts to recover a failed watcher
  private async attemptRecovery(lastStatus: any): Promise<void>

  // Logs errors to the database
  private logWatcherError(message: string): void

  // Provides the current health status
  getHealthStatus(): WatcherHealthRecord

  // Retrieves recent errors
  getRecentErrors(limit = 20): Array<{ timestamp: string; message: string }>
}
Database Schema
New table created automatically for error logging:
CREATE TABLE watcher_errors (
id INTEGER PRIMARY KEY AUTOINCREMENT,
timestamp TEXT NOT NULL,
message TEXT NOT NULL,
recovery_attempt INTEGER DEFAULT 0,
created_at TEXT NOT NULL
);
Settings
Two new configuration settings are stored in the database (a read-side sketch follows the list):
- watcher_auto_recovery (boolean): Enable/disable automatic recovery
- watcher_health_check_interval (number): Health check interval in milliseconds
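As an illustration, reading these settings with fallback defaults might look like the sketch below; the database file name, the key/value settings table, and the getSetting helper are assumptions for illustration only, not the actual settings module.
import Database from 'better-sqlite3';

// Illustrative database path and key/value "settings" table.
const db = new Database('watcher.db');

function getSetting<T>(key: string, fallback: T): T {
  const row = db
    .prepare('SELECT value FROM settings WHERE key = ?')
    .get(key) as { value: string } | undefined;
  return row ? (JSON.parse(row.value) as T) : fallback;
}

const autoRecovery = getSetting<boolean>('watcher_auto_recovery', true); // fallback value is illustrative
const healthCheckIntervalMs = getSetting<number>('watcher_health_check_interval', 30_000); // 30-second default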
Frontend Components
WatcherHealthStatus (WatcherHealthStatus.tsx)
React component displaying the following (a simplified sketch appears after the list):
- Health status indicator (green for healthy, red for unhealthy)
- Last error message and timestamp
- Recent error log viewer
- Auto-recovery status and toggle
- Clear error logs button
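The sketch below assumes a fetch to the GET /watcher/health endpoint described later and omits the error log viewer, auto-recovery toggle, and WebSocket updates of the real component; names such as WatcherHealthBadge are illustrative.
import { useEffect, useState } from 'react';

interface WatcherHealth {
  isHealthy: boolean;
  isWatching: boolean;
  lastErrorMessage: string | null;
  lastErrorTime: string | null;
  recoveryAttempts: number;
}

export function WatcherHealthBadge() {
  const [health, setHealth] = useState<WatcherHealth | null>(null);

  useEffect(() => {
    // Poll the health endpoint; the real component also reacts to WebSocket events.
    const load = () => fetch('/watcher/health').then((r) => r.json()).then(setHealth);
    load();
    const id = setInterval(load, 30000);
    return () => clearInterval(id);
  }, []);

  if (!health) return <span>Loading…</span>;

  return (
    <div>
      {/* Green for healthy, red for unhealthy, matching the dashboard convention */}
      <span style={{ color: health.isHealthy ? 'green' : 'red' }}>
        {health.isHealthy ? 'Healthy' : 'Unhealthy'}
      </span>
      {health.lastErrorMessage && <div>Last error: {health.lastErrorMessage}</div>}
      {health.recoveryAttempts > 0 && <div>Recovery attempts: {health.recoveryAttempts}</div>}
    </div>
  );
}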
API Endpoints
Get Watcher Health Status
GET /watcher/health
Returns:
{
"timestamp": "2026-01-13T12:34:56.789Z",
"isWatching": true,
"lastCheckTime": "2026-01-13T12:34:56.789Z",
"lastErrorTime": null,
"lastErrorMessage": null,
"isHealthy": true,
"recoveryAttempts": 0
}
Get Recent Error Logs
GET /watcher/errors?limit=20
Returns:
[
{
"timestamp": "2026-01-13T12:30:00.000Z",
"message": "Watcher error: ENOSPC - no space left on device"
}
]
Clear Error Logs
DELETE /watcher/errors
Returns:
{
"cleared": 5
}
Set Auto-Recovery Status
POST /watcher/auto-recovery
{
"enabled": true
}
Get Auto-Recovery Status
GET /watcher/auto-recovery
Returns:
{
"enabled": true
}
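For reference, these endpoints can be called from a script or another service roughly as sketched below; the base URL is an assumption and the response shapes follow the examples above.
// Assumed base URL of the service; adjust for your deployment.
const BASE_URL = 'http://localhost:3000';

async function getWatcherHealth() {
  const res = await fetch(`${BASE_URL}/watcher/health`);
  return res.json(); // { isHealthy, isWatching, recoveryAttempts, ... }
}

async function getRecentErrors(limit = 20) {
  const res = await fetch(`${BASE_URL}/watcher/errors?limit=${limit}`);
  return res.json(); // [{ timestamp, message }, ...]
}

async function setAutoRecovery(enabled: boolean) {
  const res = await fetch(`${BASE_URL}/watcher/auto-recovery`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ enabled }),
  });
  return res.json();
}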
How It Works
Normal Operation
- WatcherHealthService initializes and starts monitoring
- Every 30 seconds, it checks if the watcher is still running
- If healthy, no action is taken
- Status updates are sent via WebSocket to connected clients
Watcher Failure Detection
- Health check detects watcher is not running when it should be
- An error is logged to the database
- WebSocket event is sent to all connected clients
- If auto-recovery is enabled, recovery process begins
Automatic Recovery
- Recovery attempts are incremented and tracked
- Existing watcher is stopped (if hung)
- Watcher is restarted with last known configuration
- Success/failure is logged
- WebSocket event notifies clients of recovery status
- On successful recovery, attempt counter resets
- Maximum of 5 recovery attempts per hour; the counter resets after 1 hour (see the sketch after this list)
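The attempt-limiting behaviour could be implemented roughly as follows; the constants, variable names, and the restartWatcher stub are illustrative assumptions, not the actual service code.
const MAX_RECOVERY_ATTEMPTS = 5;
const RECOVERY_WINDOW_MS = 60 * 60 * 1000; // 1 hour

let recoveryAttempts = 0;
let firstAttemptAt = 0;

async function restartWatcher(lastConfig: unknown): Promise<boolean> {
  // Stand-in: the real service stops any hung watcher and restarts it
  // with the last known configuration.
  return true;
}

async function attemptRecoverySketch(lastConfig: unknown): Promise<void> {
  const now = Date.now();

  // Reset the counter once the 1-hour window has elapsed.
  if (now - firstAttemptAt > RECOVERY_WINDOW_MS) {
    recoveryAttempts = 0;
    firstAttemptAt = now;
  }

  if (recoveryAttempts >= MAX_RECOVERY_ATTEMPTS) {
    return; // Give up until the window resets, preventing an infinite recovery loop.
  }

  recoveryAttempts += 1;
  const ok = await restartWatcher(lastConfig);
  if (ok) {
    recoveryAttempts = 0; // A successful recovery resets the attempt counter.
  }
}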
Error Logging
- All errors are persisted to the watcher_errors table (a persistence sketch follows this list)
- Automatic cleanup keeps only the last 100 errors
- Errors can be viewed and cleared via the API or web interface
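A minimal sketch of that persistence and cleanup logic, assuming better-sqlite3 as the driver (the schema shown earlier is SQLite) and an illustrative database path; the function name is hypothetical.
import Database from 'better-sqlite3';

const db = new Database('watcher.db'); // Illustrative database path.

function logWatcherErrorSketch(message: string, recoveryAttempt = 0): void {
  const now = new Date().toISOString();
  db.prepare(
    'INSERT INTO watcher_errors (timestamp, message, recovery_attempt, created_at) VALUES (?, ?, ?, ?)'
  ).run(now, message, recoveryAttempt, now);

  // Keep only the most recent 100 errors.
  db.prepare(
    'DELETE FROM watcher_errors WHERE id NOT IN (SELECT id FROM watcher_errors ORDER BY id DESC LIMIT 100)'
  ).run();
}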
Configuration
Environment Variables
None required; all settings are stored in the database.
Database Settings
Settings can be modified programmatically or via the API:
// Enable auto-recovery
setSettings({ watcher_auto_recovery: true });
// Set health check interval to 60 seconds
setSettings({ watcher_health_check_interval: 60000 });
Monitoring & Troubleshooting
Viewing Error Logs
- Go to the "Watcher Health & Monitoring" section on the dashboard
- Click "View Errors" to see recent errors
- Errors are displayed with timestamps
- Click "Clear Log" to remove all errors
Enabling/Disabling Auto-Recovery
- Find the "Auto-Recovery" section in the health monitoring panel
- Click "Enable" or "Disable" as needed
- Confirmation toast appears when setting is updated
Interpreting Health Status
- Healthy (Green): Watcher is running and no recent errors
- Unhealthy (Red): Watcher has stopped or recent errors occurred
- Recovery attempts shown if recovery has been attempted
Common Issues
"Watcher stopped unexpectedly"
- Check disk space and system resources
- Review error logs for specific error message
- Check file system permissions for watched directories
- Verify the watcher service has access to configured paths
Recovery attempts not resetting
- Recovery attempts automatically reset after 1 hour
- Or manually clear error logs via the API to reset counters
Auto-recovery not working
- Verify auto-recovery is enabled in the UI
- Check error logs for specific failure reasons
- Ensure the service has permission to restart the watcher
- Check system resources (file descriptors, memory)
Integration Points
WebSocket Events
The health monitoring system emits the following events:
// Watcher encountered an error
{
type: 'health_alert',
healthy: false,
reason: 'Error message'
}
// Watcher was successfully recovered
{
type: 'recovered',
message: 'Watcher recovered successfully after failure'
}
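On the client side, handling these events might look like the sketch below; the WebSocket URL and the overall message envelope are assumptions based on the payloads above.
// Assumed WebSocket endpoint; adjust to match the service configuration.
const socket = new WebSocket('ws://localhost:3000');

socket.addEventListener('message', (event) => {
  const data = JSON.parse(event.data);

  if (data.type === 'health_alert' && data.healthy === false) {
    console.warn(`Watcher unhealthy: ${data.reason}`);
  }

  if (data.type === 'recovered') {
    console.info(data.message); // e.g. "Watcher recovered successfully after failure"
  }
});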
Task Queue Integration
The watcher can interact with the task queue when recovering. If the watcher restarts, it resumes with the last known watch configuration.
Performance Considerations
- Health checks run every 30 seconds (configurable)
- Each health check is very fast (just status queries)
- Error logs are limited to 100 entries (old entries auto-deleted)
- WebSocket events are only sent on state changes
- Minimal database overhead
Future Enhancements
Potential improvements:
- Configurable recovery retry strategies
- Advanced pattern matching for specific error types
- Email/Slack notifications on watcher failures
- Metrics and analytics dashboard
- Health check history graphs
- Customizable recovery delay intervals
Testing
To test the watcher health monitoring (a polling sketch follows these steps):
- Start the watcher via the UI
- Force-stop the watcher (via the API or by killing the process directly)
- Observe automatic recovery (if enabled)
- Check error logs for recorded failure
- Monitor the health dashboard for status updates
- Test error log clearing functionality
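A simple way to observe the recovery cycle programmatically is to poll the health endpoint while forcing a failure; the sketch below assumes the base URL and leaves the actual stop/kill step to whatever mechanism you use.
const BASE_URL = 'http://localhost:3000'; // Assumed service address.

async function watchRecovery(timeoutMs = 5 * 60 * 1000): Promise<void> {
  const start = Date.now();
  while (Date.now() - start < timeoutMs) {
    const health = await fetch(`${BASE_URL}/watcher/health`).then((r) => r.json());
    console.log(`${health.lastCheckTime} healthy=${health.isHealthy} attempts=${health.recoveryAttempts}`);
    if (health.isHealthy && health.recoveryAttempts === 0) {
      return; // Watcher is back and the attempt counter has reset.
    }
    await new Promise((resolve) => setTimeout(resolve, 10000)); // Poll every 10 seconds.
  }
  console.warn('Watcher did not recover within the timeout.');
}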
Code References
- Service: apps/service/src/watcher-health.service.ts
- Module Integration: apps/service/src/app.module.ts
- App Service: apps/service/src/app.service.ts
- Controller Endpoints: apps/service/src/app.controller.ts
- UI Component: apps/web/src/app/components/WatcherHealthStatus.tsx
- Dashboard Integration: apps/web/src/app/components/StatsSection.tsx