Optimized the duplicate scanner to use database-indexed file hashes instead of walking the file system every time. This provides significant performance improvements for large destination directories.
### Database migration (`data/migrations/2026-01-06T19-47-58_add_hash_and_destination_tracking.sql`)

Added three new columns to the `files` table:

- `hash` (TEXT): SHA-1 hash of file content
- `file_size` (INTEGER): File size in bytes
- `destination_path` (TEXT): Path for files in destination directories

Added indexes for performance:

- `idx_files_hash`: Index on `hash`
- `idx_files_hash_size`: Composite index on `hash` and `file_size`
- `idx_files_destination`: Index on `destination_path`

Created a database view `file_duplicates` for easy duplicate queries.
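To make the schema change concrete, here is a minimal sketch of the DDL this migration plausibly contains, wrapped in a TypeScript snippet that applies it with `better-sqlite3`. The driver, the database path, and in particular the `file_duplicates` view definition are assumptions, not taken from the actual migration file.

```ts
import Database from "better-sqlite3";

// Hypothetical database path; the real change is applied from the .sql migration above.
const db = new Database("data/app.db");

db.exec(`
  ALTER TABLE files ADD COLUMN hash TEXT;
  ALTER TABLE files ADD COLUMN file_size INTEGER;
  ALTER TABLE files ADD COLUMN destination_path TEXT;

  CREATE INDEX IF NOT EXISTS idx_files_hash ON files (hash);
  CREATE INDEX IF NOT EXISTS idx_files_hash_size ON files (hash, file_size);
  CREATE INDEX IF NOT EXISTS idx_files_destination ON files (destination_path);

  -- Assumed view shape: group indexed destination files that share hash and size.
  CREATE VIEW IF NOT EXISTS file_duplicates AS
    SELECT hash,
           file_size,
           COUNT(*)                       AS copies,
           GROUP_CONCAT(destination_path) AS paths
    FROM files
    WHERE hash IS NOT NULL
      AND destination_path IS NOT NULL
    GROUP BY hash, file_size
    HAVING COUNT(*) > 1;
`);

db.close();
```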
### Database service (`apps/service/src/db.service.ts`)

Added new methods:

- `storeDestinationFile()`: Store destination file with hash and size
- `findDuplicatesByHash()`: Find files by hash and size
- `getAllDuplicates()`: Get all duplicates from the view
- `updateFileHash()`: Update hash for existing file
- `getDestinationFilesWithoutHash()`: Find files needing indexing
- `clearDestinationFiles()`: Remove destination file entries
- `getDestinationFileCount()`: Count indexed files

Updated `setFile()` to accept `hash` and `file_size` in the payload.
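As an illustration of how these helpers can sit on top of the new columns, here is a minimal sketch of two of them. The `better-sqlite3` handle, the database path, the row shape, and the exact SQL are assumptions rather than the service's actual code.

```ts
import Database from "better-sqlite3";

// Hypothetical row shape, for illustration only.
interface DestinationFileRow {
  hash: string;
  file_size: number;
  destination_path: string;
}

export class DbServiceSketch {
  // Hypothetical database path.
  private readonly db = new Database("data/app.db");

  // Record a destination file together with its hash and size.
  // The real service may upsert on destination_path instead of inserting blindly.
  storeDestinationFile(destinationPath: string, hash: string, fileSize: number): void {
    this.db
      .prepare(`INSERT INTO files (destination_path, hash, file_size) VALUES (?, ?, ?)`)
      .run(destinationPath, hash, fileSize);
  }

  // Candidate lookup for the duplicate scan: match on hash AND size so files
  // of different sizes are filtered out cheaply.
  findDuplicatesByHash(hash: string, fileSize: number): DestinationFileRow[] {
    return this.db
      .prepare(
        `SELECT hash, file_size, destination_path
         FROM files
         WHERE hash = ? AND file_size = ?`
      )
      .all(hash, fileSize) as DestinationFileRow[];
  }
}
```

Matching on both `hash` and `file_size` is exactly the access pattern the composite `idx_files_hash_size` index is there to serve.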
### Maintenance service (`apps/service/src/maintenance.service.ts`)

Added new methods:

- `indexDestinationFiles()`: Index all files in a destination with hashes
- `getIndexedDuplicateStats()`: Get duplicate statistics from the database
- `hashFile()`: Private method to calculate a file hash asynchronously (see the sketch below)

Updated `scanDestinationWithWorker()` to take a `useDatabase` parameter (default: `true`).
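A minimal sketch of what an asynchronous SHA-1 `hashFile()` helper can look like using Node's built-in `crypto` and `fs` modules; the method name matches the list above, but the streaming approach and the signature are assumptions, not the service's actual implementation.

```ts
import { createHash } from "node:crypto";
import { createReadStream } from "node:fs";

// Stream the file through a SHA-1 digest so large files are never
// loaded into memory all at once.
function hashFile(filePath: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const hash = createHash("sha1");
    const stream = createReadStream(filePath);
    stream.on("error", reject);
    stream.on("data", (chunk) => hash.update(chunk));
    stream.on("end", () => resolve(hash.digest("hex")));
  });
}

// Usage: const digest = await hashFile("/media/movies/example.mkv");
```

Streaming keeps memory usage flat regardless of file size, which matters when indexing large media directories.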
### Duplicate worker (`apps/service/src/duplicate-worker.ts`)

Added database-based scanning:

- `scanDestinationWithDatabase()`: Query duplicates from the database instead of the file system
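For orientation, here is a sketch of a worker-thread entry point that answers duplicate queries from the `file_duplicates` view instead of walking the destination tree. The `worker_threads` wiring, the message contract, and the inline query are illustrative assumptions rather than the worker's actual code.

```ts
import { parentPort } from "node:worker_threads";
import Database from "better-sqlite3";

// Hypothetical message contract between maintenance.service and this worker.
interface ScanRequest {
  dbPath: string;
}

// Instead of recursively stat-ing and hashing every file in the destination,
// the database-backed scan reads the pre-built file_duplicates view.
function scanDestinationWithDatabase({ dbPath }: ScanRequest) {
  const db = new Database(dbPath, { readonly: true });
  try {
    return db.prepare("SELECT * FROM file_duplicates").all();
  } finally {
    db.close();
  }
}

parentPort?.on("message", (request: ScanRequest) => {
  parentPort?.postMessage(scanDestinationWithDatabase(request));
});
```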
### App controller (`apps/service/src/app.controller.ts`)

Added new endpoints:

- `POST /maintenance/index/destination`: Index destination files
- `GET /maintenance/index/stats`: Get duplicate statistics
- `GET /maintenance/index/count`: Get index count
- `DELETE /maintenance/index/:dataset`: Clear index
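A sketch of how these routes can be wired up in NestJS, delegating to the service methods listed in the next section. The handlers are shown in a standalone controller class purely for readability (the actual change adds them to `app.controller.ts`), and the DTO fields mirror the curl examples further down; the exact signatures and return types are assumptions.

```ts
import { Body, Controller, Delete, Get, Param, Post, Query } from "@nestjs/common";
import { AppService } from "./app.service";

// Hypothetical request body; fields mirror the curl examples below.
interface IndexDestinationDto {
  dataset: string;
  destination: string;
  batchSize?: number;
  reindex?: boolean;
}

@Controller("maintenance/index")
export class MaintenanceIndexController {
  constructor(private readonly appService: AppService) {}

  @Post("destination")
  indexDestination(@Body() dto: IndexDestinationDto) {
    return this.appService.indexDestinationFiles(dto);
  }

  @Get("stats")
  stats(@Query("dataset") dataset: string) {
    return this.appService.getIndexedDuplicateStats(dataset);
  }

  @Get("count")
  count(@Query("dataset") dataset: string) {
    return this.appService.getDestinationFileCount(dataset);
  }

  @Delete(":dataset")
  clear(@Param("dataset") dataset: string) {
    return this.appService.clearDestinationFiles(dataset);
  }
}
```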
### App service (`apps/service/src/app.service.ts`)

Added methods to expose maintenance functionality:

- `indexDestinationFiles()`
- `getIndexedDuplicateStats()`
- `getDestinationFileCount()`
- `clearDestinationFiles()`

### Example usage

```bash
# 1. Index a destination directory
curl -X POST http://localhost:3000/maintenance/index/destination \
-H "Content-Type: application/json" \
-d '{
"dataset": "movies",
"destination": "/media/movies",
"batchSize": 100
}'
# 2. Check index count
curl http://localhost:3000/maintenance/index/count?dataset=movies
# 3. Get duplicate statistics
curl http://localhost:3000/maintenance/index/stats?dataset=movies
# 4. Run duplicate scan (uses database automatically)
curl -X POST http://localhost:3000/maintenance/duplicates/scan
# 5. Re-index if needed
curl -X POST http://localhost:3000/maintenance/index/destination \
-H "Content-Type: application/json" \
-d '{
"dataset": "movies",
"destination": "/media/movies",
"reindex": true
}'
```
### Files changed

- `data/migrations/2026-01-06T19-47-58_add_hash_and_destination_tracking.sql` (new)
- `apps/service/src/db.service.ts` (enhanced)
- `apps/service/src/maintenance.service.ts` (enhanced)
- `apps/service/src/duplicate-worker.ts` (enhanced)
- `apps/service/src/app.controller.ts` (new endpoints)
- `apps/service/src/app.service.ts` (new methods)
- `docs/DUPLICATE_DETECTION_OPTIMIZATION.md`: Comprehensive documentation
- `scripts/example-duplicate-detection.js`: Usage examples

The changes have been compiled and tested.
### Testing

To test the functionality, start the service in dev mode, then run the example script in a second terminal:

```bash
cd apps/service && pnpm dev
node scripts/example-duplicate-detection.js
```