The duplicate scanner has been optimized to use database-indexed file hashes instead of walking the file system every time. This dramatically improves performance, especially for large destination directories.
Three new columns have been added to the files table:
- hash (TEXT): SHA-1 hash of the file content
- file_size (INTEGER): Size of the file in bytes
- destination_path (TEXT): Path for files in destination directories (vs source files tracked via input)

The following indexes were created for fast lookups:
- idx_files_hash: Index on the hash column
- idx_files_hash_size: Composite index on hash and file_size
- idx_files_destination: Index on destination_path

A file_duplicates view provides quick access to duplicate files:
CREATE VIEW file_duplicates AS
SELECT
  hash,
  file_size,
  dataset,
  COUNT(*) as file_count,
  GROUP_CONCAT(CASE WHEN destination_path IS NOT NULL THEN destination_path ELSE input END, '|||') as file_paths
FROM files
WHERE hash IS NOT NULL
GROUP BY hash, file_size, dataset
HAVING COUNT(*) > 1;
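As a minimal sketch of consuming this view (the better-sqlite3 driver and the database file name are assumptions, not specified in this document), the '|||'-joined paths can be split back into an array:

```typescript
// Sketch: reading the file_duplicates view (better-sqlite3 is an assumed driver).
import Database from "better-sqlite3";

interface DuplicateRow {
  hash: string;
  file_size: number;
  dataset: string;
  file_count: number;
  file_paths: string; // paths joined with '|||' by GROUP_CONCAT
}

const db = new Database("files.db"); // hypothetical database file

const rows = db
  .prepare("SELECT * FROM file_duplicates WHERE dataset = ?")
  .all("movies") as DuplicateRow[];

for (const row of rows) {
  const paths = row.file_paths.split("|||");
  console.log(`${row.file_count} copies of ${row.hash}:`, paths);
}
```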
Before running duplicate detection, you need to index the destination directory:
# Index a destination directory
POST /maintenance/index/destination
{
  "dataset": "movies",
  "destination": "/path/to/destination",
  "reindex": false, // Set to true to clear and re-index
  "batchSize": 100 // Number of files to process at once
}
This walks the destination directory, computes a SHA-1 hash for each file, and stores the hash, size, and destination_path in the files table; files that are already indexed are skipped unless reindex is set.
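A rough sketch of one such pass is shown below. Only storeDestinationFile comes from the method reference later in this document; the other helper names and the db shape are illustrative:

```typescript
// Sketch of one indexing pass; listFiles and the db declaration are illustrative.
import { createHash } from "node:crypto";
import { createReadStream, promises as fs } from "node:fs";
import * as path from "node:path";

// Assumed repository surface (see the method reference below).
declare const db: {
  storeDestinationFile(dataset: string, destinationPath: string, hash: string, fileSize: number): Promise<void>;
};

// Recursively list files under the destination directory.
async function listFiles(dir: string): Promise<string[]> {
  const entries = await fs.readdir(dir, { withFileTypes: true });
  const files: string[] = [];
  for (const entry of entries) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) files.push(...(await listFiles(full)));
    else files.push(full);
  }
  return files;
}

// Stream each file through SHA-1 so large media files are not read into memory at once.
function hashFile(filePath: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const hash = createHash("sha1");
    createReadStream(filePath)
      .on("data", (chunk) => hash.update(chunk))
      .on("error", reject)
      .on("end", () => resolve(hash.digest("hex")));
  });
}

async function indexDestinationDirectory(dataset: string, destination: string): Promise<void> {
  for (const filePath of await listFiles(destination)) {
    const [hash, stat] = await Promise.all([hashFile(filePath), fs.stat(filePath)]);
    await db.storeDestinationFile(dataset, filePath, hash, stat.size);
  }
}
```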
The duplicate scanner now uses the database by default:
// In maintenance.service.ts
private async scanDestinationWithWorker(
  dataset: string,
  destination: string,
  existingMap: Map<...>,
  useDatabase = true, // Database mode enabled by default
)
When useDatabase is true, the scanner resolves duplicates through the indexed hashes in the files table instead of re-hashing the destination directory. If the database hasn't been indexed or useDatabase is false, the system falls back to the traditional file system scanning approach.
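A hedged sketch of that branch follows. The db method names come from the reference below; scanFileSystem is a hypothetical stand-in for the legacy walk:

```typescript
// Sketch: database-first duplicate lookup with a filesystem fallback.
declare const db: {
  getDestinationFileCount(dataset: string, destinationPath?: string): Promise<number>;
  findDuplicatesByHash(
    hash: string,
    fileSize: number,
    dataset?: string,
  ): Promise<Array<{ input: string | null; destination_path: string | null }>>;
};
declare function scanFileSystem(destination: string, hash: string, fileSize: number): Promise<string[]>;

async function findDuplicatePaths(
  dataset: string,
  destination: string,
  hash: string,
  fileSize: number,
  useDatabase = true,
): Promise<string[]> {
  if (useDatabase && (await db.getDestinationFileCount(dataset, destination)) > 0) {
    // Fast path: indexed lookup, served by idx_files_hash_size.
    const matches = await db.findDuplicatesByHash(hash, fileSize, dataset);
    return matches
      .map((m) => m.destination_path ?? m.input)
      .filter((p): p is string => p !== null);
  }
  // Fallback: traditional file system scan.
  return scanFileSystem(destination, hash, fileSize);
}
```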
POST /maintenance/index/destination
Request body:
{
  "dataset": "movies",
  "destination": "/path/to/destination",
  "reindex": false,
  "batchSize": 100
}
Response:
{
  "indexed": 1234,
  "skipped": 5,
  "errors": 0
}
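For programmatic use, a minimal typed client might look like the following; the base URL and error handling are assumptions:

```typescript
// Sketch: typed wrapper for POST /maintenance/index/destination.
interface IndexDestinationRequest {
  dataset: string;
  destination: string;
  reindex?: boolean;   // default false
  batchSize?: number;  // default 100
}

interface IndexDestinationResult {
  indexed: number;
  skipped: number;
  errors: number;
}

async function requestDestinationIndex(
  body: IndexDestinationRequest,
  baseUrl = "http://localhost:3000", // assumed host
): Promise<IndexDestinationResult> {
  const res = await fetch(`${baseUrl}/maintenance/index/destination`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`Indexing request failed: ${res.status}`);
  return (await res.json()) as IndexDestinationResult;
}
```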
GET /maintenance/index/stats?dataset=movies
Response:
{
  "totalDuplicates": 42,
  "duplicatesByDataset": [
    {
      "dataset": "movies",
      "hash": "abc123...",
      "file_size": 1234567890,
      "file_count": 3,
      "files": [
        "/path/to/file1.mp4",
        "/path/to/file2.mp4",
        "/path/to/file3.mp4"
      ]
    }
  ]
}
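The response can be given types inferred from the sample above; this is a sketch, not a published schema:

```typescript
// Sketch: types for GET /maintenance/index/stats, inferred from the sample response.
interface DuplicateGroup {
  dataset: string;
  hash: string;
  file_size: number;
  file_count: number;
  files: string[];
}

interface IndexStats {
  totalDuplicates: number;
  duplicatesByDataset: DuplicateGroup[];
}

async function getIndexStats(dataset: string, baseUrl = "http://localhost:3000"): Promise<IndexStats> {
  const res = await fetch(`${baseUrl}/maintenance/index/stats?dataset=${encodeURIComponent(dataset)}`);
  return (await res.json()) as IndexStats;
}
```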
GET /maintenance/index/count?dataset=movies&destination=/path/to/destination
Response:
{
  "count": 1234
}
DELETE /maintenance/index/:dataset?destination=/path/to/destination
Response:
{
  "cleared": 1234
}
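Hedged helpers for the count and clear endpoints, with the same assumed base URL:

```typescript
// Sketch: count and clear wrappers for the index maintenance endpoints.
async function getIndexedFileCount(dataset: string, destination: string): Promise<number> {
  const params = new URLSearchParams({ dataset, destination });
  const res = await fetch(`http://localhost:3000/maintenance/index/count?${params}`);
  return ((await res.json()) as { count: number }).count;
}

async function clearDestinationIndex(dataset: string, destination: string): Promise<number> {
  const params = new URLSearchParams({ destination });
  const res = await fetch(
    `http://localhost:3000/maintenance/index/${encodeURIComponent(dataset)}?${params}`,
    { method: "DELETE" },
  );
  return ((await res.json()) as { cleared: number }).cleared;
}
```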
- storeDestinationFile(dataset, destinationPath, hash, fileSize): Store or update a destination file with its hash and size.
- findDuplicatesByHash(hash, fileSize, dataset?): Find all files matching a specific hash and size.
- getAllDuplicates(dataset?): Get all duplicates from the database view.
- updateFileHash(dataset, input, hash, fileSize): Update the hash and size for an existing file record.
- getDestinationFilesWithoutHash(dataset, destinationPath?): Get files that still need hash indexing.
- clearDestinationFiles(dataset, destinationPath?): Remove destination file entries (for re-indexing).
- getDestinationFileCount(dataset, destinationPath?): Get the count of indexed destination files.
- indexDestinationFiles(dataset, destinationPath, options): Index all files in a destination directory. Options:
  - reindex: Clear existing entries and re-index (default: false)
  - batchSize: Number of files to process at once (default: 100)
- getIndexedDuplicateStats(dataset?): Get duplicate statistics from indexed files.
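A short sketch of how these methods compose for a rebuild-and-report cycle; the return shapes are assumptions based on the descriptions above:

```typescript
// Sketch: re-index a destination, then summarize duplicates from the index.
declare const db: {
  indexDestinationFiles(
    dataset: string,
    destinationPath: string,
    options: { reindex?: boolean; batchSize?: number },
  ): Promise<{ indexed: number; skipped: number; errors: number }>;
  getIndexedDuplicateStats(dataset?: string): Promise<{ totalDuplicates: number }>;
};

async function reindexAndReport(dataset: string, destination: string): Promise<void> {
  const result = await db.indexDestinationFiles(dataset, destination, {
    reindex: true,  // clear and rebuild
    batchSize: 200, // larger batches for a one-off rebuild
  });
  console.log(`indexed=${result.indexed} skipped=${result.skipped} errors=${result.errors}`);

  const stats = await db.getIndexedDuplicateStats(dataset);
  console.log(`${stats.totalDuplicates} duplicate groups in ${dataset}`);
}
```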
For a destination with 10,000 files:
| Method | Initial Scan | Subsequent Scans |
|---|---|---|
| File System | ~5-10 minutes | ~5-10 minutes |
| Database | ~5-10 minutes (one-time) | ~5-10 seconds |
# For each dataset and destination
curl -X POST http://localhost:3000/maintenance/index/destination \
  -H "Content-Type: application/json" \
  -d '{
    "dataset": "movies",
    "destination": "/media/movies"
  }'
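# Then run the duplicate scan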
curl -X POST http://localhost:3000/maintenance/duplicates/scan
Set reindex: true to completely rebuild the index.

When files are added:
// After processing a file
db.setFile(dataset, inputFile, {
  output: outputFile,
  hash: calculatedHash,
  file_size: fileSize,
  status: "completed",
});
The database migration 2026-01-06T19-47-58_add_hash_and_destination_tracking.sql is automatically applied on service startup. No manual intervention needed.
- The destination_path field distinguishes destination files from source files
- A files row can have either input (source) or destination_path (destination) set
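Assuming nothing beyond the columns mentioned in this document, a destination-aware row might be typed as follows (other columns such as output and status also exist but are omitted here):

```typescript
// Sketch: files-table row shape implied by the columns described above.
interface FileRow {
  dataset: string;
  input: string | null;            // set for source files
  destination_path: string | null; // set for indexed destination files
  hash: string | null;             // SHA-1 of the file content
  file_size: number | null;        // size in bytes
}
```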