# Watcher Health Monitoring System

## Overview

The Watcher Health Monitoring System provides comprehensive monitoring, logging, and automatic recovery capabilities for the file watcher service. This system ensures that the watcher service continues running as expected and automatically recovers if it unexpectedly crashes or stops.

## Features

### 1. Health Monitoring

- **Continuous Health Checks**: The system performs health checks every 30 seconds by default
- **State Change Detection**: Automatically detects when the watcher unexpectedly stops
- **Real-time Status**: Provides immediate feedback on watcher health status

### 2. Error Logging

- **Comprehensive Error Tracking**: All watcher errors are logged to the database
- **Error History**: Maintains the last 100 errors for investigation
- **Timestamp Records**: Each error includes a timestamp for analysis
- **Error Messages**: Detailed error descriptions help diagnose issues

### 3. Automatic Recovery

- **Configurable Auto-Recovery**: Can be enabled or disabled via settings or API
- **Intelligent Recovery**: Attempts to restart the watcher with the last known configuration
- **Recovery Limiting**: Prevents infinite recovery loops with configurable attempt limits
- **Recovery Tracking**: Logs all recovery attempts for troubleshooting

### 4. Web Interface Integration

- **Health Dashboard**: Displays current watcher health status
- **Error Log Viewer**: View recent errors from the dashboard
- **Auto-Recovery Toggle**: Enable/disable auto-recovery from the UI
- **Real-time Notifications**: WebSocket events notify about health changes

## Architecture

### Backend Components

#### WatcherHealthService (`watcher-health.service.ts`)

The main service responsible for monitoring and recovery:

```typescript
@Injectable()
export class WatcherHealthService implements OnModuleInit, OnModuleDestroy {
  // Runs health checks every 30 seconds
  @Cron(CronExpression.EVERY_30_SECONDS)
  async healthCheckTask()

  // Attempts to recover failed watcher
  private async attemptRecovery(lastStatus: any)

  // Logs errors to database
  private logWatcherError(message: string)

  // Provides health status
  getHealthStatus(): WatcherHealthRecord

  // Retrieves recent errors
  getRecentErrors(limit = 20): Array<{timestamp, message}>
}
```

#### Database Schema

New table created automatically for error logging:

```sql
CREATE TABLE watcher_errors (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  timestamp TEXT NOT NULL,
  message TEXT NOT NULL,
  recovery_attempt INTEGER DEFAULT 0,
  created_at TEXT NOT NULL
);
```

#### Settings

Two new configuration settings are stored in the database:

- `watcher_auto_recovery` (boolean): Enable/disable automatic recovery
- `watcher_health_check_interval` (number): Health check interval in milliseconds

### Frontend Components

#### WatcherHealthStatus (`WatcherHealthStatus.tsx`)

React component displaying:

- Health status indicator (green for healthy, red for unhealthy)
- Last error message and timestamp
- Recent error log viewer
- Auto-recovery status and toggle
- Clear error logs button

## API Endpoints

### Get Watcher Health Status

```
GET /watcher/health
```

Returns:

```json
{
  "timestamp": "2026-01-13T12:34:56.789Z",
  "isWatching": true,
  "lastCheckTime": "2026-01-13T12:34:56.789Z",
  "lastErrorTime": null,
  "lastErrorMessage": null,
  "isHealthy": true,
  "recoveryAttempts": 0
}
```

### Get Recent Error Logs

```
GET /watcher/errors?limit=20
```

Returns:

```json
[
  {
    "timestamp": "2026-01-13T12:30:00.000Z",
    "message": "Watcher error: ENOSPC - no space left on device"
  }
]
```

### Clear Error Logs

```
DELETE /watcher/errors
```

Returns:

```json
{ "cleared": 5 }
```

### Set Auto-Recovery Status

```
POST /watcher/auto-recovery
{ "enabled": true }
```

### Get Auto-Recovery Status

```
GET /watcher/auto-recovery
```

Returns:

```json
{ "enabled": true }
```

## How It Works

### Normal Operation

1. WatcherHealthService initializes and starts monitoring
2. Every 30 seconds, it checks if the watcher is still running
3. If healthy, no action is taken
4. Status updates are sent via WebSocket to connected clients

### Watcher Failure Detection

1. Health check detects watcher is not running when it should be
2. An error is logged to the database
3. WebSocket event is sent to all connected clients
4. If auto-recovery is enabled, recovery process begins

### Automatic Recovery

1. Recovery attempts are incremented and tracked
2. Existing watcher is stopped (if hung)
3. Watcher is restarted with last known configuration
4. Success/failure is logged
5. WebSocket event notifies clients of recovery status
6. On successful recovery, attempt counter resets
7. Maximum 5 recovery attempts per hour (resets after 1 hour)

### Error Logging

- All errors are persisted to the `watcher_errors` table
- Automatic cleanup keeps only the last 100 errors
- Errors can be viewed and cleared via the API or web interface

## Configuration

### Environment Variables

None required - all settings are stored in the database

### Database Settings

Settings can be modified programmatically or via the API:

```typescript
// Enable auto-recovery
setSettings({ watcher_auto_recovery: true });

// Set health check interval to 60 seconds
setSettings({ watcher_health_check_interval: 60000 });
```

## Monitoring & Troubleshooting

### Viewing Error Logs

1. Go to the "Watcher Health & Monitoring" section on the dashboard
2. Click "View Errors" to see recent errors
3. Errors are displayed with timestamps
4. Click "Clear Log" to remove all errors

### Enabling/Disabling Auto-Recovery

1. Find the "Auto-Recovery" section in the health monitoring panel
2. Click "Enable" or "Disable" as needed
3. Confirmation toast appears when setting is updated

### Interpreting Health Status

- **Healthy (Green)**: Watcher is running and no recent errors
- **Unhealthy (Red)**: Watcher has stopped or recent errors occurred
- Recovery attempts shown if recovery has been attempted

### Common Issues

#### "Watcher stopped unexpectedly"

- Check disk space and system resources
- Review error logs for specific error message
- Check file system permissions for watched directories
- Verify the watcher service has access to configured paths

#### Recovery attempts not resetting

- Recovery attempts automatically reset after 1 hour
- Or manually clear error logs via the API to reset counters

#### Auto-recovery not working

- Verify auto-recovery is enabled in the UI
- Check error logs for specific failure reasons
- Ensure the service has permission to restart the watcher
- Check system resources (file descriptors, memory)

## Integration Points

### WebSocket Events

The health monitoring system emits the following events:

```typescript
// Watcher encountered an error
{
  type: 'health_alert',
  healthy: false,
  reason: 'Error message'
}

// Watcher was successfully recovered
{
  type: 'recovered',
  message: 'Watcher recovered successfully after failure'
}
```

### Task Queue Integration

The watcher can interact with the task queue when recovering. If the watcher restarts, it resumes with the last known watch configuration.

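A client consuming the two WebSocket events listed under Integration Points might validate and route them roughly as follows. This is a minimal sketch, not the project's actual client code: the type names and the `parseHealthEvent` and `describeHealthEvent` helpers are hypothetical, and it assumes events arrive as JSON strings with exactly the fields shown in this document.

```typescript
// Hypothetical types mirroring the two documented event payloads.
type HealthAlertEvent = { type: 'health_alert'; healthy: false; reason: string };
type RecoveredEvent = { type: 'recovered'; message: string };
type WatcherHealthEvent = HealthAlertEvent | RecoveredEvent;

// Parses a raw message (assumed to be a JSON string). Returns null for
// anything that does not match one of the two documented shapes.
function parseHealthEvent(raw: string): WatcherHealthEvent | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not valid JSON
  }
  if (typeof data !== 'object' || data === null) {
    return null;
  }
  const record = data as Record<string, unknown>;
  const reason = record['reason'];
  const message = record['message'];
  if (record['type'] === 'health_alert' && record['healthy'] === false && typeof reason === 'string') {
    return { type: 'health_alert', healthy: false, reason };
  }
  if (record['type'] === 'recovered' && typeof message === 'string') {
    return { type: 'recovered', message };
  }
  return null;
}

// Turns a validated event into a short, user-facing line of text.
function describeHealthEvent(event: WatcherHealthEvent): string {
  if (event.type === 'health_alert') {
    return `Watcher unhealthy: ${event.reason}`;
  }
  return event.message;
}
```

Validating the payload before acting on it keeps an unexpected message from flipping the health indicator; a UI could pass the returned string to whatever toast or status component it already uses.
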
## Performance Considerations

- Health checks run every 30 seconds (configurable)
- Each health check is very fast (just status queries)
- Error logs are limited to 100 entries (old entries auto-deleted)
- WebSocket events are only sent on state changes
- Minimal database overhead

## Future Enhancements

Potential improvements:

- Configurable recovery retry strategies
- Advanced pattern matching for specific error types
- Email/Slack notifications on watcher failures
- Metrics and analytics dashboard
- Health check history graphs
- Customizable recovery delay intervals

## Testing

To test the watcher health monitoring:

1. **Start the watcher via the UI**
2. **Force stop the watcher** (via API or directly kill the process)
3. **Observe automatic recovery** (if enabled)
4. **Check error logs** for recorded failure
5. **Monitor the health dashboard** for status updates
6. **Test error log clearing** functionality

## Code References

- Service: `apps/service/src/watcher-health.service.ts`
- Module Integration: `apps/service/src/app.module.ts`
- App Service: `apps/service/src/app.service.ts`
- Controller Endpoints: `apps/service/src/app.controller.ts`
- UI Component: `apps/web/src/app/components/WatcherHealthStatus.tsx`
- Dashboard Integration: `apps/web/src/app/components/StatsSection.tsx`
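
As a supplementary illustration of the recovery limiting described under Automatic Recovery (at most 5 attempts per hour, with the counter reset on success), the bookkeeping could look something like the sketch below. The `RecoveryLimiter` class is hypothetical and is not taken from `watcher-health.service.ts`; it also assumes the hour is a fixed window starting at the first attempt, which the text above does not specify.

```typescript
// Hypothetical helper illustrating "max N attempts per window, reset on success".
class RecoveryLimiter {
  private readonly maxAttempts: number;
  private readonly windowMs: number;
  private attempts = 0;
  private windowStart: number | null = null;

  constructor(maxAttempts = 5, windowMs = 60 * 60 * 1000) {
    this.maxAttempts = maxAttempts;
    this.windowMs = windowMs;
  }

  // Records an attempt and returns true if one is still allowed at time
  // `now` (milliseconds); returns false once the limit is reached.
  tryAcquire(now: number): boolean {
    // Assumption: a fixed window that opens at the first attempt and
    // expires windowMs later, at which point the counter starts over.
    if (this.windowStart === null || now - this.windowStart >= this.windowMs) {
      this.windowStart = now;
      this.attempts = 0;
    }
    if (this.attempts >= this.maxAttempts) {
      return false;
    }
    this.attempts += 1;
    return true;
  }

  // A successful recovery clears the counter and closes the window.
  recordSuccess(): void {
    this.attempts = 0;
    this.windowStart = null;
  }

  get attemptCount(): number {
    return this.attempts;
  }
}
```

Passing `now` in explicitly, rather than calling `Date.now()` inside the class, makes the limit easy to exercise in a test without waiting an hour.
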