
Duplicate Detection Optimization

Overview

The duplicate scanner has been optimized to use database-indexed file hashes instead of walking the file system every time. This dramatically improves performance, especially for large destination directories.

Architecture

Database Schema

Three new columns have been added to the files table:

  • hash (TEXT): SHA-1 hash of the file content
  • file_size (INTEGER): Size of the file in bytes
  • destination_path (TEXT): Path for files indexed from destination directories (as opposed to source files, which are tracked via the input column)
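
For reference, a row in the files table can be modeled in TypeScript roughly as follows (a sketch; the three new columns come from the schema above, and dataset and input are pre-existing columns referenced elsewhere in this document):

// Sketch of a files-table row after the migration.
interface FileRow {
  dataset: string;
  input: string | null;            // source path (null for destination-only entries)
  hash: string | null;             // SHA-1 hex digest of the file content
  file_size: number | null;        // size in bytes
  destination_path: string | null; // set only for indexed destination files
}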

Indexes

The following indexes were created for fast lookups:

  • idx_files_hash: Index on hash column
  • idx_files_hash_size: Composite index on hash and file_size
  • idx_files_destination: Index on destination_path
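
The DDL behind these can be reconstructed from the names and columns above; applied from TypeScript it might look like this (a sketch assuming better-sqlite3 as the driver, which is an assumption, not confirmed by this document):

// Reconstructed index DDL; names and columns match the list above.
db.exec(`
  CREATE INDEX IF NOT EXISTS idx_files_hash ON files (hash);
  CREATE INDEX IF NOT EXISTS idx_files_hash_size ON files (hash, file_size);
  CREATE INDEX IF NOT EXISTS idx_files_destination ON files (destination_path);
`);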

Database View

A file_duplicates view provides quick access to duplicate files:

CREATE VIEW file_duplicates AS
SELECT
  hash,
  file_size,
  dataset,
  COUNT(*) as file_count,
  GROUP_CONCAT(CASE WHEN destination_path IS NOT NULL THEN destination_path ELSE input END, '|||') as file_paths
FROM files
WHERE hash IS NOT NULL
GROUP BY hash, file_size, dataset
HAVING COUNT(*) > 1;
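
As a usage sketch (again assuming better-sqlite3; the service may use a different driver), the view can be queried directly and the '|||'-joined paths split in application code:

import Database from "better-sqlite3";

const db = new Database("app.db"); // database path is illustrative

// Fetch duplicate groups for one dataset and split the joined paths.
const rows = db
  .prepare("SELECT * FROM file_duplicates WHERE dataset = ?")
  .all("movies") as Array<{
    hash: string;
    file_size: number;
    dataset: string;
    file_count: number;
    file_paths: string;
  }>;

for (const row of rows) {
  const paths = row.file_paths.split("|||");
  console.log(`${row.file_count} copies of ${row.hash}:`, paths);
}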

How It Works

1. Indexing Destination Files

Before running duplicate detection, you need to index the destination directory:

# Index a destination directory
POST /maintenance/index/destination
{
  "dataset": "movies",
  "destination": "/path/to/destination",
  "reindex": false,  // Set to true to clear and re-index
  "batchSize": 100   // Number of files to process at once
}

This will:

  1. Walk the destination directory
  2. Calculate SHA-1 hash for each file
  3. Store the hash, file size, and path in the database
  4. Process files in batches to avoid memory issues
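
A minimal sketch of that loop (hypothetical helper functions, not the service's actual code; db stands for the DbService instance, and storeDestinationFile is the method described later):

import { createHash } from "node:crypto";
import { createReadStream } from "node:fs";
import { readdir, stat } from "node:fs/promises";
import { join } from "node:path";

// Stream a file through SHA-1 so large files never sit fully in memory.
function sha1File(path: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const hash = createHash("sha1");
    createReadStream(path)
      .on("data", (chunk) => hash.update(chunk))
      .on("end", () => resolve(hash.digest("hex")))
      .on("error", reject);
  });
}

// Recursively collect every regular file under a directory.
async function walk(dir: string): Promise<string[]> {
  const out: string[] = [];
  for (const entry of await readdir(dir, { withFileTypes: true })) {
    const full = join(dir, entry.name);
    if (entry.isDirectory()) out.push(...(await walk(full)));
    else if (entry.isFile()) out.push(full);
  }
  return out;
}

// Index a destination directory in batches of batchSize files.
async function indexDestination(dataset: string, destination: string, batchSize = 100) {
  const files = await walk(destination);
  for (let i = 0; i < files.length; i += batchSize) {
    for (const path of files.slice(i, i + batchSize)) {
      const [hash, info] = await Promise.all([sha1File(path), stat(path)]);
      await db.storeDestinationFile(dataset, path, hash, info.size);
    }
  }
}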

2. Database-Based Duplicate Scanning

The duplicate scanner now uses the database by default:

// In maintenance.service.ts
private async scanDestinationWithWorker(
  dataset: string,
  destination: string,
  existingMap: Map<...>,
  useDatabase = true,  // Database mode enabled by default
)

When useDatabase is true:

  1. The worker queries the database for files with matching hashes
  2. Groups are identified via SQL query instead of file system walk
  3. Results are returned much faster

3. Fallback to File System Scanning

If the database hasn't been indexed or useDatabase is false, the system falls back to the traditional file system scanning approach.
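
Put together, the decision could look roughly like this (a sketch; getDestinationFileCount and getAllDuplicates are the DbService methods listed below, and scanFileSystem stands in for the legacy scanner):

// Choose between the indexed path and the traditional walk.
async function scanForDuplicates(dataset: string, destination: string, useDatabase = true) {
  const indexed = await db.getDestinationFileCount(dataset, destination);
  if (useDatabase && indexed > 0) {
    // Fast path: duplicate groups come straight from the file_duplicates view.
    return db.getAllDuplicates(dataset);
  }
  // Fallback: walk the directory tree and hash every file.
  return scanFileSystem(dataset, destination); // hypothetical legacy scanner
}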

API Endpoints

Index Destination Files

POST /maintenance/index/destination

Request body:

{
  "dataset": "movies",
  "destination": "/path/to/destination",
  "reindex": false,
  "batchSize": 100
}

Response:

{
  "indexed": 1234,
  "skipped": 5,
  "errors": 0
}

Get Duplicate Statistics

GET /maintenance/index/stats?dataset=movies

Response:

{
  "totalDuplicates": 42,
  "duplicatesByDataset": [
    {
      "dataset": "movies",
      "hash": "abc123...",
      "file_size": 1234567890,
      "file_count": 3,
      "files": [
        "/path/to/file1.mp4",
        "/path/to/file2.mp4",
        "/path/to/file3.mp4"
      ]
    }
  ]
}

Get Index Count

GET /maintenance/index/count?dataset=movies&destination=/path/to/destination

Response:

{
  "count": 1234
}

Clear Index

DELETE /maintenance/index/:dataset?destination=/path/to/destination

Response:

{
  "cleared": 1234
}
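
A small client-side sketch tying the endpoints together (fetch-based; the base URL is taken from the workflow examples below and is illustrative):

const BASE = "http://localhost:3000"; // illustrative

// Index a destination, then pull duplicate statistics for the dataset.
async function refreshAndReport(dataset: string, destination: string) {
  await fetch(`${BASE}/maintenance/index/destination`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ dataset, destination, reindex: false, batchSize: 100 }),
  });

  const res = await fetch(`${BASE}/maintenance/index/stats?dataset=${dataset}`);
  const stats = await res.json();
  console.log(`${stats.totalDuplicates} duplicate groups found`);
}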

Database Methods

DbService Methods

storeDestinationFile(dataset, destinationPath, hash, fileSize)

Store or update a destination file with its hash and size.

findDuplicatesByHash(hash, fileSize, dataset?)

Find all files matching a specific hash and size.

getAllDuplicates(dataset?)

Get all duplicates from the database view.

updateFileHash(dataset, input, hash, fileSize)

Update hash and size for an existing file record.

getDestinationFilesWithoutHash(dataset, destinationPath?)

Get files that need hash indexing.

clearDestinationFiles(dataset, destinationPath?)

Remove destination file entries (for re-indexing).

getDestinationFileCount(dataset, destinationPath?)

Get count of indexed destination files.
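
A common composition of these methods is checking a freshly processed file against the index before moving it into the destination (a sketch; sha1File is the streaming helper from the indexing example above):

import { stat } from "node:fs/promises";

// Returns true if an identical file (same SHA-1 and size) is already
// indexed in the destination for this dataset.
async function alreadyInDestination(dataset: string, path: string): Promise<boolean> {
  const [hash, info] = await Promise.all([sha1File(path), stat(path)]);
  const matches = await db.findDuplicatesByHash(hash, info.size, dataset);
  return matches.length > 0;
}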

MaintenanceService Methods

indexDestinationFiles(dataset, destinationPath, options)

Index all files in a destination directory.

Options:

  • reindex: Clear existing entries and re-index (default: false)
  • batchSize: Number of files to process at once (default: 100)

getIndexedDuplicateStats(dataset?)

Get duplicate statistics from indexed files.
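
For example, a periodic maintenance job might combine the two (a sketch using only the signatures above; maintenanceService is an injected instance, and the return values are assumed to match the corresponding endpoint responses):

// Re-index a destination and summarize the duplicates it now contains.
async function nightlyDuplicateReport(dataset: string, destination: string) {
  const result = await maintenanceService.indexDestinationFiles(dataset, destination, {
    reindex: false,
    batchSize: 100,
  });
  console.log(`indexed=${result.indexed} skipped=${result.skipped} errors=${result.errors}`);

  const stats = await maintenanceService.getIndexedDuplicateStats(dataset);
  console.log(`${stats.totalDuplicates} duplicate groups in ${dataset}`);
}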

Performance Comparison

Traditional File System Scanning

  • Walks entire directory tree
  • Reads and hashes every file on each scan
  • O(n) complexity where n = total files
  • Slow for large directories (10,000+ files)

Database-Indexed Scanning

  • One-time indexing cost
  • SQL query for duplicates
  • O(log n) lookups via indexes
  • Fast even for very large directories (100,000+ files)

Example Performance

For a destination with 10,000 files:

| Method      | Initial Scan             | Subsequent Scans |
| ----------- | ------------------------ | ---------------- |
| File System | ~5-10 minutes            | ~5-10 minutes    |
| Database    | ~5-10 minutes (one-time) | ~5-10 seconds    |

Usage Workflow

Initial Setup

  1. Index destination directories for all datasets:
# For each dataset and destination
curl -X POST http://localhost:3000/maintenance/index/destination \
  -H "Content-Type: application/json" \
  -d '{
    "dataset": "movies",
    "destination": "/media/movies"
  }'
  2. Run duplicate scan (will use database):
curl -X POST http://localhost:3000/maintenance/duplicates/scan

Maintenance

  • Re-index when new files are added to destinations
  • Use reindex: true to completely rebuild the index
  • Monitor the index count to confirm it still matches what is on disk (see the sketch below)
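
One way to monitor freshness is to compare the indexed count against a fresh directory walk (a sketch; walk is the helper from the indexing example, and the base URL is illustrative):

// A mismatch between the indexed count and what is on disk means the
// destination needs re-indexing.
async function indexIsFresh(dataset: string, destination: string): Promise<boolean> {
  const res = await fetch(
    `http://localhost:3000/maintenance/index/count?dataset=${dataset}&destination=${encodeURIComponent(destination)}`,
  );
  const { count } = await res.json();
  const onDisk = (await walk(destination)).length;
  return count === onDisk;
}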

Incremental Updates

When files are added:

// After processing a file
db.setFile(dataset, inputFile, {
  output: outputFile,
  hash: calculatedHash,
  file_size: fileSize,
  status: "completed",
});

Migration

The database migration 2026-01-06T19-47-58_add_hash_and_destination_tracking.sql is automatically applied on service startup. No manual intervention needed.

Notes

  • Hashes are calculated using SHA-1 (fast, sufficient for duplicate detection)
  • The destination_path field distinguishes destination files from source files
  • Files in the files table can have either input (source) or destination_path (destination) set
  • The system gracefully falls back to file system scanning if the database isn't indexed