# Quick Reference: Database-Optimized Duplicate Detection

## Quick Start

### 1. Index Your Destinations

```bash
# Index movies destination
curl -X POST http://localhost:3000/maintenance/index/destination \
  -H "Content-Type: application/json" \
  -d '{"dataset": "movies", "destination": "/media/movies"}'

# Index TV shows destination
curl -X POST http://localhost:3000/maintenance/index/destination \
  -H "Content-Type: application/json" \
  -d '{"dataset": "tvshows", "destination": "/media/tvshows"}'
```

### 2. Run Duplicate Scan

```bash
# Scan uses the database automatically if indexed
curl -X POST http://localhost:3000/maintenance/duplicates/scan
```

### 3. View Results

```bash
# Get duplicate statistics
curl http://localhost:3000/maintenance/index/stats

# List duplicate groups
curl http://localhost:3000/maintenance/duplicates
```

## API Endpoints

| Method | Endpoint                         | Description                   |
| ------ | -------------------------------- | ----------------------------- |
| POST   | `/maintenance/index/destination` | Index destination files       |
| GET    | `/maintenance/index/stats`       | Get duplicate statistics      |
| GET    | `/maintenance/index/count`       | Get indexed file count        |
| DELETE | `/maintenance/index/:dataset`    | Clear index for dataset       |
| POST   | `/maintenance/duplicates/scan`   | Scan for duplicates (uses DB) |
| GET    | `/maintenance/duplicates`        | List duplicate groups         |

## Request Examples

### Index with Options

```bash
curl -X POST http://localhost:3000/maintenance/index/destination \
  -H "Content-Type: application/json" \
  -d '{
    "dataset": "movies",
    "destination": "/media/movies",
    "reindex": true,
    "batchSize": 200
  }'
```

### Filter Duplicate Stats

```bash
# Get stats for a specific dataset
curl "http://localhost:3000/maintenance/index/stats?dataset=movies"
```

### Check Index Count

```bash
# Count all indexed files
curl "http://localhost:3000/maintenance/index/count?dataset=movies"

# Count for a specific destination
curl "http://localhost:3000/maintenance/index/count?dataset=movies&destination=/media/movies"
```

### Clear and Rebuild Index

```bash
# Clear index
curl -X DELETE "http://localhost:3000/maintenance/index/movies"

# Rebuild
curl -X POST http://localhost:3000/maintenance/index/destination \
  -H "Content-Type: application/json" \
  -d '{"dataset": "movies", "destination": "/media/movies"}'
```

## Common Tasks

### Check if Indexing Is Needed

```bash
# If this returns 0 or a low number, you need to index
curl "http://localhost:3000/maintenance/index/count?dataset=movies"
```

### Re-index After Adding Files

```bash
# Option 1: Full re-index (clears and rebuilds)
curl -X POST http://localhost:3000/maintenance/index/destination \
  -H "Content-Type: application/json" \
  -d '{"dataset": "movies", "destination": "/media/movies", "reindex": true}'

# Option 2: Incremental (only indexes new files)
curl -X POST http://localhost:3000/maintenance/index/destination \
  -H "Content-Type: application/json" \
  -d '{"dataset": "movies", "destination": "/media/movies", "reindex": false}'
```

### Find Duplicates Programmatically

```javascript
// Using Node.js (18+, with built-in fetch)
const response = await fetch(
  "http://localhost:3000/maintenance/index/stats?dataset=movies"
);
const { duplicatesByDataset } = await response.json();

duplicatesByDataset.forEach((dup) => {
  console.log(`Found ${dup.file_count} copies of file with hash ${dup.hash}`);
  console.log("Files:", dup.files);
});
```

## Database Queries (Direct Access)

If you need to query the database directly:

```sql
-- Find all duplicates
SELECT * FROM file_duplicates;

-- Find duplicates for a specific dataset
SELECT * FROM file_duplicates WHERE dataset = 'movies';

-- Find files with a specific hash
SELECT * FROM files WHERE hash = 'abc123...';

-- Count indexed files
SELECT COUNT(*) FROM files WHERE destination_path IS NOT NULL;

-- Find files needing indexing
SELECT * FROM files WHERE destination_path IS NOT NULL AND hash IS NULL;
```

## Maintenance Schedule

Recommended maintenance:

1. **Daily**: Run duplicate scan (fast with DB)
2. **Weekly**: Re-index high-traffic destinations
3. **Monthly**: Full re-index of all destinations

## Troubleshooting

### Scan is slow

- Check if destinations are indexed: `GET /maintenance/index/count`
- If the count is 0, index the destination first

### Duplicates not showing up

- Ensure files are indexed
- Run a fresh scan: `POST /maintenance/duplicates/scan`
- Check duplicate stats: `GET /maintenance/index/stats`

### Need to rebuild index

```bash
curl -X DELETE "http://localhost:3000/maintenance/index/movies"
curl -X POST http://localhost:3000/maintenance/index/destination \
  -H "Content-Type: application/json" \
  -d '{"dataset": "movies", "destination": "/media/movies", "reindex": true}'
```

## Performance Tips

1. **Batch Size**: Adjust based on file size (smaller files = larger batches)
2. **Re-index Strategy**: Use incremental updates unless data is corrupted
3. **Scheduled Indexing**: Run during off-peak hours
4. **Monitor**: Check the index count regularly to ensure it stays up to date
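
## Appendix: How Duplicate Grouping Works

Conceptually, the `file_duplicates` query boils down to a group-by-hash with a count filter: every hash shared by two or more indexed files forms a duplicate group. As a rough in-memory sketch of that logic (the `{ hash, dataset, path }` record shape here is an assumption for illustration, not the actual schema, and `findDuplicates` is a hypothetical helper, not part of the API):

```javascript
// Sketch: group indexed file records by content hash and keep only the
// groups that contain more than one file. Record shape is assumed.
function findDuplicates(files, dataset = null) {
  const byHash = new Map();
  for (const file of files) {
    if (!file.hash) continue; // unhashed files can't be compared
    if (dataset && file.dataset !== dataset) continue;
    if (!byHash.has(file.hash)) byHash.set(file.hash, []);
    byHash.get(file.hash).push(file.path);
  }
  // A "duplicate group" is any hash shared by two or more files.
  return [...byHash.entries()]
    .filter(([, paths]) => paths.length > 1)
    .map(([hash, paths]) => ({ hash, file_count: paths.length, files: paths }));
}
```

This is why indexing matters for scan speed: once every file's hash is stored, finding duplicates is a single grouping pass over the index instead of re-hashing files on disk.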