Watcher Health Monitoring System
Overview
The Watcher Health Monitoring System provides monitoring, logging, and automatic recovery for the file watcher service. It ensures that the watcher keeps running as expected and automatically recovers it if it crashes or stops unexpectedly.
Features
1. Health Monitoring
- Continuous Health Checks: The system performs health checks every 30 seconds by default
- State Change Detection: Automatically detects when the watcher unexpectedly stops
- Real-time Status: Provides immediate feedback on watcher health status
2. Error Logging
- Comprehensive Error Tracking: All watcher errors are logged to the database
- Error History: Maintains the last 100 errors for investigation
- Timestamp Records: Each error includes a timestamp for analysis
- Error Messages: Detailed error descriptions help diagnose issues
3. Automatic Recovery
- Configurable Auto-Recovery: Can be enabled or disabled via settings or API
- Intelligent Recovery: Attempts to restart the watcher with the last known configuration
- Recovery Limiting: Prevents infinite recovery loops with configurable attempt limits
- Recovery Tracking: Logs all recovery attempts for troubleshooting
4. Web Interface Integration
- Health Dashboard: Displays current watcher health status
- Error Log Viewer: View recent errors from the dashboard
- Auto-Recovery Toggle: Enable/disable auto-recovery from the UI
- Real-time Notifications: WebSocket events notify about health changes
Architecture
Backend Components
WatcherHealthService (watcher-health.service.ts)
The main service responsible for monitoring and recovery:
import { Injectable, OnModuleInit, OnModuleDestroy } from '@nestjs/common';
import { Cron, CronExpression } from '@nestjs/schedule';

@Injectable()
export class WatcherHealthService implements OnModuleInit, OnModuleDestroy {
  // Runs health checks every 30 seconds
  @Cron(CronExpression.EVERY_30_SECONDS)
  async healthCheckTask(): Promise<void>

  // Attempts to recover a failed watcher
  private async attemptRecovery(lastStatus: any): Promise<void>

  // Logs errors to the database
  private logWatcherError(message: string): void

  // Provides the current health status
  getHealthStatus(): WatcherHealthRecord

  // Retrieves recent errors
  getRecentErrors(limit = 20): Array<{ timestamp: string; message: string }>
}
Database Schema
New table created automatically for error logging:
CREATE TABLE watcher_errors (
id INTEGER PRIMARY KEY AUTOINCREMENT,
timestamp TEXT NOT NULL,
message TEXT NOT NULL,
recovery_attempt INTEGER DEFAULT 0,
created_at TEXT NOT NULL
);
Settings
Two new configuration settings are stored in the database (a read-side sketch follows the list):
- watcher_auto_recovery (boolean): Enable/disable automatic recovery
- watcher_health_check_interval (number): Health check interval in milliseconds
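As an illustration, reading these settings with fallback defaults might look like the sketch below; the database file name, the key/value settings table, and the getSetting helper are assumptions for illustration only, not the actual settings module.
import Database from 'better-sqlite3';

// Illustrative database path and key/value "settings" table.
const db = new Database('watcher.db');

function getSetting<T>(key: string, fallback: T): T {
  const row = db
    .prepare('SELECT value FROM settings WHERE key = ?')
    .get(key) as { value: string } | undefined;
  return row ? (JSON.parse(row.value) as T) : fallback;
}

const autoRecovery = getSetting<boolean>('watcher_auto_recovery', true); // fallback value is illustrative
const healthCheckIntervalMs = getSetting<number>('watcher_health_check_interval', 30_000); // 30-second default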
Frontend Components
WatcherHealthStatus (WatcherHealthStatus.tsx)
React component displaying the following (a simplified sketch appears after the list):
- Health status indicator (green for healthy, red for unhealthy)
- Last error message and timestamp
- Recent error log viewer
- Auto-recovery status and toggle
- Clear error logs button
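The sketch below assumes a fetch to the GET /watcher/health endpoint described later and omits the error log viewer, auto-recovery toggle, and WebSocket updates of the real component; names such as WatcherHealthBadge are illustrative.
import { useEffect, useState } from 'react';

interface WatcherHealth {
  isHealthy: boolean;
  isWatching: boolean;
  lastErrorMessage: string | null;
  lastErrorTime: string | null;
  recoveryAttempts: number;
}

export function WatcherHealthBadge() {
  const [health, setHealth] = useState<WatcherHealth | null>(null);

  useEffect(() => {
    // Poll the health endpoint; the real component also reacts to WebSocket events.
    const load = () => fetch('/watcher/health').then((r) => r.json()).then(setHealth);
    load();
    const id = setInterval(load, 30000);
    return () => clearInterval(id);
  }, []);

  if (!health) return <span>Loading…</span>;

  return (
    <div>
      {/* Green for healthy, red for unhealthy, matching the dashboard convention */}
      <span style={{ color: health.isHealthy ? 'green' : 'red' }}>
        {health.isHealthy ? 'Healthy' : 'Unhealthy'}
      </span>
      {health.lastErrorMessage && <div>Last error: {health.lastErrorMessage}</div>}
      {health.recoveryAttempts > 0 && <div>Recovery attempts: {health.recoveryAttempts}</div>}
    </div>
  );
}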
API Endpoints
Get Watcher Health Status
GET /watcher/health
Returns:
{
"timestamp": "2026-01-13T12:34:56.789Z",
"isWatching": true,
"lastCheckTime": "2026-01-13T12:34:56.789Z",
"lastErrorTime": null,
"lastErrorMessage": null,
"isHealthy": true,
"recoveryAttempts": 0
}
Get Recent Error Logs
GET /watcher/errors?limit=20
Returns:
[
{
"timestamp": "2026-01-13T12:30:00.000Z",
"message": "Watcher error: ENOSPC - no space left on device"
}
]
Clear Error Logs
DELETE /watcher/errors
Returns:
{
"cleared": 5
}
Set Auto-Recovery Status
POST /watcher/auto-recovery
{
"enabled": true
}
Get Auto-Recovery Status
GET /watcher/auto-recovery
Returns:
{
"enabled": true
}
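For reference, these endpoints can be called from a script or another service roughly as sketched below; the base URL is an assumption and the response shapes follow the examples above.
// Assumed base URL of the service; adjust for your deployment.
const BASE_URL = 'http://localhost:3000';

async function getWatcherHealth() {
  const res = await fetch(`${BASE_URL}/watcher/health`);
  return res.json(); // { isHealthy, isWatching, recoveryAttempts, ... }
}

async function getRecentErrors(limit = 20) {
  const res = await fetch(`${BASE_URL}/watcher/errors?limit=${limit}`);
  return res.json(); // [{ timestamp, message }, ...]
}

async function setAutoRecovery(enabled: boolean) {
  const res = await fetch(`${BASE_URL}/watcher/auto-recovery`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ enabled }),
  });
  return res.json();
}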
How It Works
Normal Operation
- WatcherHealthService initializes and starts monitoring
- Every 30 seconds, it checks if the watcher is still running
- If healthy, no action is taken
- Status updates are sent via WebSocket to connected clients
Watcher Failure Detection
- Health check detects watcher is not running when it should be
- An error is logged to the database
- WebSocket event is sent to all connected clients
- If auto-recovery is enabled, recovery process begins
Automatic Recovery
- Recovery attempts are incremented and tracked
- Existing watcher is stopped (if hung)
- Watcher is restarted with last known configuration
- Success/failure is logged
- WebSocket event notifies clients of recovery status
- On successful recovery, attempt counter resets
- Maximum of 5 recovery attempts per hour; the counter resets after 1 hour (see the sketch after this list)
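The attempt-limiting behaviour could be implemented roughly as follows; the constants, variable names, and the restartWatcher stub are illustrative assumptions, not the actual service code.
const MAX_RECOVERY_ATTEMPTS = 5;
const RECOVERY_WINDOW_MS = 60 * 60 * 1000; // 1 hour

let recoveryAttempts = 0;
let firstAttemptAt = 0;

async function restartWatcher(lastConfig: unknown): Promise<boolean> {
  // Stand-in: the real service stops any hung watcher and restarts it
  // with the last known configuration.
  return true;
}

async function attemptRecoverySketch(lastConfig: unknown): Promise<void> {
  const now = Date.now();

  // Reset the counter once the 1-hour window has elapsed.
  if (now - firstAttemptAt > RECOVERY_WINDOW_MS) {
    recoveryAttempts = 0;
    firstAttemptAt = now;
  }

  if (recoveryAttempts >= MAX_RECOVERY_ATTEMPTS) {
    return; // Give up until the window resets, preventing an infinite recovery loop.
  }

  recoveryAttempts += 1;
  const ok = await restartWatcher(lastConfig);
  if (ok) {
    recoveryAttempts = 0; // A successful recovery resets the attempt counter.
  }
}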
Error Logging
- All errors are persisted to the watcher_errors table (a persistence sketch follows this list)
- Automatic cleanup keeps only the last 100 errors
- Errors can be viewed and cleared via the API or web interface
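A minimal sketch of that persistence and cleanup logic, assuming better-sqlite3 as the driver (the schema shown earlier is SQLite) and an illustrative database path; the function name is hypothetical.
import Database from 'better-sqlite3';

const db = new Database('watcher.db'); // Illustrative database path.

function logWatcherErrorSketch(message: string, recoveryAttempt = 0): void {
  const now = new Date().toISOString();
  db.prepare(
    'INSERT INTO watcher_errors (timestamp, message, recovery_attempt, created_at) VALUES (?, ?, ?, ?)'
  ).run(now, message, recoveryAttempt, now);

  // Keep only the most recent 100 errors.
  db.prepare(
    'DELETE FROM watcher_errors WHERE id NOT IN (SELECT id FROM watcher_errors ORDER BY id DESC LIMIT 100)'
  ).run();
}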
Configuration
Environment Variables
None required; all settings are stored in the database.
Database Settings
Settings can be modified programmatically or via the API:
// Enable auto-recovery
setSettings({ watcher_auto_recovery: true });
// Set health check interval to 60 seconds
setSettings({ watcher_health_check_interval: 60000 });
Monitoring & Troubleshooting
Viewing Error Logs
- Go to the "Watcher Health & Monitoring" section on the dashboard
- Click "View Errors" to see recent errors
- Errors are displayed with timestamps
- Click "Clear Log" to remove all errors
Enabling/Disabling Auto-Recovery
- Find the "Auto-Recovery" section in the health monitoring panel
- Click "Enable" or "Disable" as needed
- Confirmation toast appears when setting is updated
Interpreting Health Status
- Healthy (Green): Watcher is running and no recent errors
- Unhealthy (Red): Watcher has stopped or recent errors occurred
- Recovery attempts shown if recovery has been attempted
Common Issues
"Watcher stopped unexpectedly"
- Check disk space and system resources
- Review error logs for specific error message
- Check file system permissions for watched directories
- Verify the watcher service has access to configured paths
Recovery attempts not resetting
- Recovery attempts automatically reset after 1 hour
- Or manually clear error logs via the API to reset counters
Auto-recovery not working
- Verify auto-recovery is enabled in the UI
- Check error logs for specific failure reasons
- Ensure the service has permission to restart the watcher
- Check system resources (file descriptors, memory)
Integration Points
WebSocket Events
The health monitoring system emits the following events:
// Watcher encountered an error
{
type: 'health_alert',
healthy: false,
reason: 'Error message'
}
// Watcher was successfully recovered
{
type: 'recovered',
message: 'Watcher recovered successfully after failure'
}
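On the client side, handling these events might look like the sketch below; the WebSocket URL and the overall message envelope are assumptions based on the payloads above.
// Assumed WebSocket endpoint; adjust to match the service configuration.
const socket = new WebSocket('ws://localhost:3000');

socket.addEventListener('message', (event) => {
  const data = JSON.parse(event.data);

  if (data.type === 'health_alert' && data.healthy === false) {
    console.warn(`Watcher unhealthy: ${data.reason}`);
  }

  if (data.type === 'recovered') {
    console.info(data.message); // e.g. "Watcher recovered successfully after failure"
  }
});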
Task Queue Integration
The watcher can interact with the task queue when recovering. If the watcher restarts, it resumes with the last known watch configuration.
Performance Considerations
- Health checks run every 30 seconds (configurable)
- Each health check is very fast (just status queries)
- Error logs are limited to 100 entries (old entries auto-deleted)
- WebSocket events are only sent on state changes
- Minimal database overhead
Future Enhancements
Potential improvements:
- Configurable recovery retry strategies
- Advanced pattern matching for specific error types
- Email/Slack notifications on watcher failures
- Metrics and analytics dashboard
- Health check history graphs
- Customizable recovery delay intervals
Testing
To test the watcher health monitoring (a polling sketch follows these steps):
- Start the watcher via the UI
- Force-stop the watcher (via the API or by killing the process directly)
- Observe automatic recovery (if enabled)
- Check error logs for recorded failure
- Monitor the health dashboard for status updates
- Test error log clearing functionality
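A simple way to observe the recovery cycle programmatically is to poll the health endpoint while forcing a failure; the sketch below assumes the base URL and leaves the actual stop/kill step to whatever mechanism you use.
const BASE_URL = 'http://localhost:3000'; // Assumed service address.

async function watchRecovery(timeoutMs = 5 * 60 * 1000): Promise<void> {
  const start = Date.now();
  while (Date.now() - start < timeoutMs) {
    const health = await fetch(`${BASE_URL}/watcher/health`).then((r) => r.json());
    console.log(`${health.lastCheckTime} healthy=${health.isHealthy} attempts=${health.recoveryAttempts}`);
    if (health.isHealthy && health.recoveryAttempts === 0) {
      return; // Watcher is back and the attempt counter has reset.
    }
    await new Promise((resolve) => setTimeout(resolve, 10000)); // Poll every 10 seconds.
  }
  console.warn('Watcher did not recover within the timeout.');
}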
Code References
- Service: apps/service/src/watcher-health.service.ts
- Module Integration: apps/service/src/app.module.ts
- App Service: apps/service/src/app.service.ts
- Controller Endpoints: apps/service/src/app.controller.ts
- UI Component: apps/web/src/app/components/WatcherHealthStatus.tsx
- Dashboard Integration: apps/web/src/app/components/StatsSection.tsx