The duplicate scanner has been optimized to use database-indexed file hashes instead of walking the file system every time. This dramatically improves performance, especially for large destination directories.
Three new columns have been added to the files table:
- hash (TEXT): SHA-1 hash of the file content
- file_size (INTEGER): Size of the file in bytes
- destination_path (TEXT): Path for files in destination directories (vs source files tracked via input)

The following indexes were created for fast lookups:
- idx_files_hash: Index on the hash column
- idx_files_hash_size: Composite index on hash and file_size
- idx_files_destination: Index on destination_path

A file_duplicates view provides quick access to duplicate files:
CREATE VIEW file_duplicates AS
SELECT
  hash,
  file_size,
  dataset,
  COUNT(*) as file_count,
  GROUP_CONCAT(CASE WHEN destination_path IS NOT NULL THEN destination_path ELSE input END, '|||') as file_paths
FROM files
WHERE hash IS NOT NULL
GROUP BY hash, file_size, dataset
HAVING COUNT(*) > 1;
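As a minimal sketch of consuming this view (the better-sqlite3 driver and the database file name are assumptions, not specified in this document), the '|||'-joined paths can be split back into an array:

```typescript
// Sketch: reading the file_duplicates view (better-sqlite3 is an assumed driver).
import Database from "better-sqlite3";

interface DuplicateRow {
  hash: string;
  file_size: number;
  dataset: string;
  file_count: number;
  file_paths: string; // paths joined with '|||' by GROUP_CONCAT
}

const db = new Database("files.db"); // hypothetical database file

const rows = db
  .prepare("SELECT * FROM file_duplicates WHERE dataset = ?")
  .all("movies") as DuplicateRow[];

for (const row of rows) {
  const paths = row.file_paths.split("|||");
  console.log(`${row.file_count} copies of ${row.hash}:`, paths);
}
```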
Before running duplicate detection, you need to index the destination directory:
# Index a destination directory
POST /maintenance/index/destination
{
  "dataset": "movies",
  "destination": "/path/to/destination",
  "reindex": false, // Set to true to clear and re-index
  "batchSize": 100 // Number of files to process at once
}
This walks the destination directory, computes a SHA-1 hash for each file, and stores the hash, size, and destination_path in the files table; files that are already indexed are skipped unless reindex is set.
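A rough sketch of one such pass is shown below. Only storeDestinationFile comes from the method reference later in this document; the other helper names and the db shape are illustrative:

```typescript
// Sketch of one indexing pass; listFiles and the db declaration are illustrative.
import { createHash } from "node:crypto";
import { createReadStream, promises as fs } from "node:fs";
import * as path from "node:path";

// Assumed repository surface (see the method reference below).
declare const db: {
  storeDestinationFile(dataset: string, destinationPath: string, hash: string, fileSize: number): Promise<void>;
};

// Recursively list files under the destination directory.
async function listFiles(dir: string): Promise<string[]> {
  const entries = await fs.readdir(dir, { withFileTypes: true });
  const files: string[] = [];
  for (const entry of entries) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) files.push(...(await listFiles(full)));
    else files.push(full);
  }
  return files;
}

// Stream each file through SHA-1 so large media files are not read into memory at once.
function hashFile(filePath: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const hash = createHash("sha1");
    createReadStream(filePath)
      .on("data", (chunk) => hash.update(chunk))
      .on("error", reject)
      .on("end", () => resolve(hash.digest("hex")));
  });
}

async function indexDestinationDirectory(dataset: string, destination: string): Promise<void> {
  for (const filePath of await listFiles(destination)) {
    const [hash, stat] = await Promise.all([hashFile(filePath), fs.stat(filePath)]);
    await db.storeDestinationFile(dataset, filePath, hash, stat.size);
  }
}
```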
The duplicate scanner now uses the database by default:
// In maintenance.service.ts
private async scanDestinationWithWorker(
  dataset: string,
  destination: string,
  existingMap: Map<...>,
  useDatabase = true, // Database mode enabled by default
)
When useDatabase is true, the scanner resolves duplicates through the indexed hashes in the files table instead of re-hashing the destination directory. If the database hasn't been indexed or useDatabase is false, the system falls back to the traditional file system scanning approach.
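A hedged sketch of that branch follows. The db method names come from the reference below; scanFileSystem is a hypothetical stand-in for the legacy walk:

```typescript
// Sketch: database-first duplicate lookup with a filesystem fallback.
declare const db: {
  getDestinationFileCount(dataset: string, destinationPath?: string): Promise<number>;
  findDuplicatesByHash(
    hash: string,
    fileSize: number,
    dataset?: string,
  ): Promise<Array<{ input: string | null; destination_path: string | null }>>;
};
declare function scanFileSystem(destination: string, hash: string, fileSize: number): Promise<string[]>;

async function findDuplicatePaths(
  dataset: string,
  destination: string,
  hash: string,
  fileSize: number,
  useDatabase = true,
): Promise<string[]> {
  if (useDatabase && (await db.getDestinationFileCount(dataset, destination)) > 0) {
    // Fast path: indexed lookup, served by idx_files_hash_size.
    const matches = await db.findDuplicatesByHash(hash, fileSize, dataset);
    return matches
      .map((m) => m.destination_path ?? m.input)
      .filter((p): p is string => p !== null);
  }
  // Fallback: traditional file system scan.
  return scanFileSystem(destination, hash, fileSize);
}
```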
POST /maintenance/index/destination
Request body:
{
  "dataset": "movies",
  "destination": "/path/to/destination",
  "reindex": false,
  "batchSize": 100
}
Response:
{
  "indexed": 1234,
  "skipped": 5,
  "errors": 0
}
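For programmatic use, a minimal typed client might look like the following; the base URL and error handling are assumptions:

```typescript
// Sketch: typed wrapper for POST /maintenance/index/destination.
interface IndexDestinationRequest {
  dataset: string;
  destination: string;
  reindex?: boolean;   // default false
  batchSize?: number;  // default 100
}

interface IndexDestinationResult {
  indexed: number;
  skipped: number;
  errors: number;
}

async function requestDestinationIndex(
  body: IndexDestinationRequest,
  baseUrl = "http://localhost:3000", // assumed host
): Promise<IndexDestinationResult> {
  const res = await fetch(`${baseUrl}/maintenance/index/destination`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`Indexing request failed: ${res.status}`);
  return (await res.json()) as IndexDestinationResult;
}
```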
GET /maintenance/index/stats?dataset=movies
Response:
{
  "totalDuplicates": 42,
  "duplicatesByDataset": [
    {
      "dataset": "movies",
      "hash": "abc123...",
      "file_size": 1234567890,
      "file_count": 3,
      "files": [
        "/path/to/file1.mp4",
        "/path/to/file2.mp4",
        "/path/to/file3.mp4"
      ]
    }
  ]
}
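The response can be given types inferred from the sample above; this is a sketch, not a published schema:

```typescript
// Sketch: types for GET /maintenance/index/stats, inferred from the sample response.
interface DuplicateGroup {
  dataset: string;
  hash: string;
  file_size: number;
  file_count: number;
  files: string[];
}

interface IndexStats {
  totalDuplicates: number;
  duplicatesByDataset: DuplicateGroup[];
}

async function getIndexStats(dataset: string, baseUrl = "http://localhost:3000"): Promise<IndexStats> {
  const res = await fetch(`${baseUrl}/maintenance/index/stats?dataset=${encodeURIComponent(dataset)}`);
  return (await res.json()) as IndexStats;
}
```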
GET /maintenance/index/count?dataset=movies&destination=/path/to/destination
Response:
{
  "count": 1234
}
DELETE /maintenance/index/:dataset?destination=/path/to/destination
Response:
{
  "cleared": 1234
}
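Hedged helpers for the count and clear endpoints, with the same assumed base URL:

```typescript
// Sketch: count and clear wrappers for the index maintenance endpoints.
async function getIndexedFileCount(dataset: string, destination: string): Promise<number> {
  const params = new URLSearchParams({ dataset, destination });
  const res = await fetch(`http://localhost:3000/maintenance/index/count?${params}`);
  return ((await res.json()) as { count: number }).count;
}

async function clearDestinationIndex(dataset: string, destination: string): Promise<number> {
  const params = new URLSearchParams({ destination });
  const res = await fetch(
    `http://localhost:3000/maintenance/index/${encodeURIComponent(dataset)}?${params}`,
    { method: "DELETE" },
  );
  return ((await res.json()) as { cleared: number }).cleared;
}
```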
- storeDestinationFile(dataset, destinationPath, hash, fileSize): Store or update a destination file with its hash and size.
- findDuplicatesByHash(hash, fileSize, dataset?): Find all files matching a specific hash and size.
- getAllDuplicates(dataset?): Get all duplicates from the database view.
- updateFileHash(dataset, input, hash, fileSize): Update the hash and size for an existing file record.
- getDestinationFilesWithoutHash(dataset, destinationPath?): Get files that still need hash indexing.
- clearDestinationFiles(dataset, destinationPath?): Remove destination file entries (for re-indexing).
- getDestinationFileCount(dataset, destinationPath?): Get the count of indexed destination files.
- indexDestinationFiles(dataset, destinationPath, options): Index all files in a destination directory. Options:
  - reindex: Clear existing entries and re-index (default: false)
  - batchSize: Number of files to process at once (default: 100)
- getIndexedDuplicateStats(dataset?): Get duplicate statistics from indexed files.
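A short sketch of how these methods compose for a rebuild-and-report cycle; the return shapes are assumptions based on the descriptions above:

```typescript
// Sketch: re-index a destination, then summarize duplicates from the index.
declare const db: {
  indexDestinationFiles(
    dataset: string,
    destinationPath: string,
    options: { reindex?: boolean; batchSize?: number },
  ): Promise<{ indexed: number; skipped: number; errors: number }>;
  getIndexedDuplicateStats(dataset?: string): Promise<{ totalDuplicates: number }>;
};

async function reindexAndReport(dataset: string, destination: string): Promise<void> {
  const result = await db.indexDestinationFiles(dataset, destination, {
    reindex: true,  // clear and rebuild
    batchSize: 200, // larger batches for a one-off rebuild
  });
  console.log(`indexed=${result.indexed} skipped=${result.skipped} errors=${result.errors}`);

  const stats = await db.getIndexedDuplicateStats(dataset);
  console.log(`${stats.totalDuplicates} duplicate groups in ${dataset}`);
}
```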
For a destination with 10,000 files:
| Method | Initial Scan | Subsequent Scans |
|---|---|---|
| File System | ~5-10 minutes | ~5-10 minutes |
| Database | ~5-10 minutes (one-time) | ~5-10 seconds |
# For each dataset and destination
curl -X POST http://localhost:3000/maintenance/index/destination \
  -H "Content-Type: application/json" \
  -d '{
    "dataset": "movies",
    "destination": "/media/movies"
  }'
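# Then run the duplicate scan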
curl -X POST http://localhost:3000/maintenance/duplicates/scan
Set reindex: true to completely rebuild the index.

When files are added:
// After processing a file
db.setFile(dataset, inputFile, {
  output: outputFile,
  hash: calculatedHash,
  file_size: fileSize,
  status: "completed",
});
The database migration 2026-01-06T19-47-58_add_hash_and_destination_tracking.sql is automatically applied on service startup. No manual intervention needed.
- The destination_path field distinguishes destination files from source files
- A files row can have either input (source) or destination_path (destination) set
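Assuming nothing beyond the columns mentioned in this document, a destination-aware row might be typed as follows (other columns such as output and status also exist but are omitted here):

```typescript
// Sketch: files-table row shape implied by the columns described above.
interface FileRow {
  dataset: string;
  input: string | null;            // set for source files
  destination_path: string | null; // set for indexed destination files
  hash: string | null;             // SHA-1 of the file content
  file_size: number | null;        // size in bytes
}
```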