# Duplicate Detection Optimization

## Overview

The duplicate scanner has been optimized to use database-indexed file hashes instead of walking the file system on every scan. This dramatically improves performance, especially for large destination directories.

## Architecture

### Database Schema

Three new columns have been added to the `files` table:

- `hash` (TEXT): SHA-1 hash of the file content
- `file_size` (INTEGER): Size of the file in bytes
- `destination_path` (TEXT): Path for files in destination directories (as opposed to source files, which are tracked via `input`)

### Indexes

The following indexes were created for fast lookups:

- `idx_files_hash`: Index on the `hash` column
- `idx_files_hash_size`: Composite index on `hash` and `file_size`
- `idx_files_destination`: Index on `destination_path`

### Database View

A `file_duplicates` view provides quick access to duplicate files:

```sql
CREATE VIEW file_duplicates AS
SELECT
  hash,
  file_size,
  dataset,
  COUNT(*) as file_count,
  GROUP_CONCAT(
    CASE WHEN destination_path IS NOT NULL THEN destination_path ELSE input END,
    '|||'
  ) as file_paths
FROM files
WHERE hash IS NOT NULL
GROUP BY hash, file_size, dataset
HAVING COUNT(*) > 1;
```

## How It Works

### 1. Indexing Destination Files

Before running duplicate detection, index the destination directory:

```bash
# Index a destination directory
POST /maintenance/index/destination
{
  "dataset": "movies",
  "destination": "/path/to/destination",
  "reindex": false,  // Set to true to clear and re-index
  "batchSize": 100   // Number of files to process at once
}
```

This will:

1. Walk the destination directory
2. Calculate a SHA-1 hash for each file
3. Store the hash, file size, and path in the database
4. Process files in batches to avoid memory issues

### 2. Database-Based Duplicate Scanning

The duplicate scanner now uses the database by default:

```typescript
// In maintenance.service.ts
private async scanDestinationWithWorker(
  dataset: string,
  destination: string,
  existingMap: Map<...>,
  useDatabase = true, // Database mode enabled by default
)
```

When `useDatabase` is true:

1. The worker queries the database for files with matching hashes
2. Duplicate groups are identified via a SQL query instead of a file system walk
3. Results are returned much faster

### 3. Fallback to File System Scanning

If the database hasn't been indexed, or `useDatabase` is false, the system falls back to the traditional file system scanning approach.

## API Endpoints

### Index Destination Files

**POST** `/maintenance/index/destination`

Request body:

```json
{
  "dataset": "movies",
  "destination": "/path/to/destination",
  "reindex": false,
  "batchSize": 100
}
```

Response:

```json
{
  "indexed": 1234,
  "skipped": 5,
  "errors": 0
}
```

### Get Duplicate Statistics

**GET** `/maintenance/index/stats?dataset=movies`

Response:

```json
{
  "totalDuplicates": 42,
  "duplicatesByDataset": [
    {
      "dataset": "movies",
      "hash": "abc123...",
      "file_size": 1234567890,
      "file_count": 3,
      "files": [
        "/path/to/file1.mp4",
        "/path/to/file2.mp4",
        "/path/to/file3.mp4"
      ]
    }
  ]
}
```

### Get Index Count

**GET** `/maintenance/index/count?dataset=movies&destination=/path/to/destination`

Response:

```json
{
  "count": 1234
}
```

### Clear Index

**DELETE** `/maintenance/index/:dataset?destination=/path/to/destination`

Response:

```json
{
  "cleared": 1234
}
```

## Database Methods

### DbService Methods

#### `storeDestinationFile(dataset, destinationPath, hash, fileSize)`

Store or update a destination file with its hash and size.
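As a usage illustration, the sketch below hashes a destination file and records it via `storeDestinationFile`. This is a minimal example under stated assumptions, not the service implementation; the `hashFile` helper and the `DestinationFileStore` interface are stand-ins introduced here for illustration.

```typescript
import { createHash } from "node:crypto";
import { createReadStream, promises as fs } from "node:fs";

// Structural stand-in for the DbService method documented above (illustrative only).
interface DestinationFileStore {
  storeDestinationFile(
    dataset: string,
    destinationPath: string,
    hash: string,
    fileSize: number,
  ): void;
}

// Stream the file through SHA-1 so large files are never read fully into memory.
function hashFile(path: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const hash = createHash("sha1");
    createReadStream(path)
      .on("data", (chunk) => hash.update(chunk))
      .on("error", reject)
      .on("end", () => resolve(hash.digest("hex")));
  });
}

// Hash one destination file, read its size, and persist both through the store.
async function indexDestinationFile(
  db: DestinationFileStore,
  dataset: string,
  filePath: string,
): Promise<void> {
  const [hash, stats] = await Promise.all([hashFile(filePath), fs.stat(filePath)]);
  db.storeDestinationFile(dataset, filePath, hash, stats.size);
}
```

Streaming the file through the hash keeps memory usage flat regardless of file size, which matters when indexing large media files in batches.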
#### `findDuplicatesByHash(hash, fileSize, dataset?)`

Find all files matching a specific hash and size.

#### `getAllDuplicates(dataset?)`

Get all duplicates from the database view.

#### `updateFileHash(dataset, input, hash, fileSize)`

Update the hash and size for an existing file record.

#### `getDestinationFilesWithoutHash(dataset, destinationPath?)`

Get files that still need hash indexing.

#### `clearDestinationFiles(dataset, destinationPath?)`

Remove destination file entries (for re-indexing).

#### `getDestinationFileCount(dataset, destinationPath?)`

Get the count of indexed destination files.

### MaintenanceService Methods

#### `indexDestinationFiles(dataset, destinationPath, options)`

Index all files in a destination directory. Options:

- `reindex`: Clear existing entries and re-index (default: `false`)
- `batchSize`: Number of files to process at once (default: `100`)

#### `getIndexedDuplicateStats(dataset?)`

Get duplicate statistics from indexed files.

## Performance Comparison

### Traditional File System Scanning

- Walks the entire directory tree
- Reads and hashes every file on each scan
- O(n) complexity, where n = total files
- Slow for large directories (10,000+ files)

### Database-Indexed Scanning

- One-time indexing cost
- Duplicates found with a single SQL query
- O(log n) lookups via indexes
- Fast even for very large directories (100,000+ files)

### Example Performance

For a destination with 10,000 files:

| Method      | Initial Scan             | Subsequent Scans |
| ----------- | ------------------------ | ---------------- |
| File System | ~5-10 minutes            | ~5-10 minutes    |
| Database    | ~5-10 minutes (one-time) | ~5-10 seconds    |

## Usage Workflow

### Initial Setup

1. Index destination directories for all datasets:

   ```bash
   # For each dataset and destination
   curl -X POST http://localhost:3000/maintenance/index/destination \
     -H "Content-Type: application/json" \
     -d '{
       "dataset": "movies",
       "destination": "/media/movies"
     }'
   ```

2. Run a duplicate scan (it will use the database):

   ```bash
   curl -X POST http://localhost:3000/maintenance/duplicates/scan
   ```

### Maintenance

- Re-index when new files are added to destinations
- Use `reindex: true` to completely rebuild the index
- Monitor the index count to ensure it stays up to date

### Incremental Updates

When files are added:

```typescript
// After processing a file
db.setFile(dataset, inputFile, {
  output: outputFile,
  hash: calculatedHash,
  file_size: fileSize,
  status: "completed",
});
```

## Migration

The database migration `2026-01-06T19-47-58_add_hash_and_destination_tracking.sql` is applied automatically on service startup. No manual intervention is needed.

## Notes

- Hashes are calculated using SHA-1 (fast, and sufficient for duplicate detection)
- The `destination_path` field distinguishes destination files from source files
- Files in the `files` table can have either `input` (source) or `destination_path` (destination) set
- The system gracefully falls back to file system scanning if the database isn't indexed (see the sketch below)
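To illustrate the last note, here is a minimal sketch, assuming the `DbService` methods documented above, of how a scanner can prefer the indexed fast path and fall back to a file system walk when the destination hasn't been indexed. The `DuplicateIndex` interface and the `scanFileSystem` callback are stand-ins introduced here for illustration and are not part of the actual codebase.

```typescript
// Structural stand-ins for the documented DbService methods (illustrative only).
interface DuplicateIndex {
  getDestinationFileCount(dataset: string, destinationPath?: string): number;
  getAllDuplicates(dataset?: string): unknown[];
}

// Prefer the indexed fast path when database mode is enabled and the destination
// has been indexed; otherwise fall back to the traditional walk-and-hash scan.
function scanForDuplicates(
  db: DuplicateIndex,
  dataset: string,
  destination: string,
  scanFileSystem: () => unknown[], // placeholder for the file system fallback
  useDatabase = true,
): unknown[] {
  const indexedCount = db.getDestinationFileCount(dataset, destination);
  if (useDatabase && indexedCount > 0) {
    // Fast path: duplicates come straight from the file_duplicates view.
    return db.getAllDuplicates(dataset);
  }
  // Slow path: walk the destination and hash every file.
  return scanFileSystem();
}
```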