# Duplicate Detection Optimization

## Overview

The duplicate scanner has been optimized to use database-indexed file hashes instead of walking the file system on every scan. This dramatically improves performance, especially for large destination directories.

## Architecture

### Database Schema

Three new columns have been added to the `files` table:

- `hash` (TEXT): SHA-1 hash of the file content
- `file_size` (INTEGER): Size of the file in bytes
- `destination_path` (TEXT): Path for files in destination directories (as opposed to source files, which are tracked via `input`)

### Indexes

The following indexes were created for fast lookups:

- `idx_files_hash`: Index on the `hash` column
- `idx_files_hash_size`: Composite index on `hash` and `file_size`
- `idx_files_destination`: Index on `destination_path`

### Database View

A `file_duplicates` view provides quick access to duplicate files:

```sql
CREATE VIEW file_duplicates AS
SELECT
  hash,
  file_size,
  dataset,
  COUNT(*) as file_count,
  GROUP_CONCAT(
    CASE WHEN destination_path IS NOT NULL THEN destination_path ELSE input END,
    '|||'
  ) as file_paths
FROM files
WHERE hash IS NOT NULL
GROUP BY hash, file_size, dataset
HAVING COUNT(*) > 1;
```

## How It Works

### 1. Indexing Destination Files

Before running duplicate detection, index the destination directory:

```bash
# Index a destination directory
POST /maintenance/index/destination
{
  "dataset": "movies",
  "destination": "/path/to/destination",
  "reindex": false,  // Set to true to clear and re-index
  "batchSize": 100   // Number of files to process at once
}
```

This will:

1. Walk the destination directory
2. Calculate a SHA-1 hash for each file
3. Store the hash, file size, and path in the database
4. Process files in batches to avoid memory issues

### 2. Database-Based Duplicate Scanning

The duplicate scanner now uses the database by default:

```typescript
// In maintenance.service.ts
private async scanDestinationWithWorker(
  dataset: string,
  destination: string,
  existingMap: Map<...>,
  useDatabase = true, // Database mode enabled by default
)
```

When `useDatabase` is true:

1. The worker queries the database for files with matching hashes
2. Duplicate groups are identified via a SQL query instead of a file system walk
3. Results are returned much faster

### 3. Fallback to File System Scanning

If the database hasn't been indexed, or `useDatabase` is false, the system falls back to the traditional file system scanning approach.

## API Endpoints

### Index Destination Files

**POST** `/maintenance/index/destination`

Request body:

```json
{
  "dataset": "movies",
  "destination": "/path/to/destination",
  "reindex": false,
  "batchSize": 100
}
```

Response:

```json
{
  "indexed": 1234,
  "skipped": 5,
  "errors": 0
}
```

### Get Duplicate Statistics

**GET** `/maintenance/index/stats?dataset=movies`

Response:

```json
{
  "totalDuplicates": 42,
  "duplicatesByDataset": [
    {
      "dataset": "movies",
      "hash": "abc123...",
      "file_size": 1234567890,
      "file_count": 3,
      "files": [
        "/path/to/file1.mp4",
        "/path/to/file2.mp4",
        "/path/to/file3.mp4"
      ]
    }
  ]
}
```

### Get Index Count

**GET** `/maintenance/index/count?dataset=movies&destination=/path/to/destination`

Response:

```json
{
  "count": 1234
}
```

### Clear Index

**DELETE** `/maintenance/index/:dataset?destination=/path/to/destination`

Response:

```json
{
  "cleared": 1234
}
```

## Database Methods

### DbService Methods

#### `storeDestinationFile(dataset, destinationPath, hash, fileSize)`

Store or update a destination file with its hash and size.
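As a usage illustration, the sketch below hashes a destination file and records it via `storeDestinationFile`. This is a minimal example under stated assumptions, not the service implementation; the `hashFile` helper and the `DestinationFileStore` interface are stand-ins introduced here for illustration.

```typescript
import { createHash } from "node:crypto";
import { createReadStream, promises as fs } from "node:fs";

// Structural stand-in for the DbService method documented above (illustrative only).
interface DestinationFileStore {
  storeDestinationFile(
    dataset: string,
    destinationPath: string,
    hash: string,
    fileSize: number,
  ): void;
}

// Stream the file through SHA-1 so large files are never read fully into memory.
function hashFile(path: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const hash = createHash("sha1");
    createReadStream(path)
      .on("data", (chunk) => hash.update(chunk))
      .on("error", reject)
      .on("end", () => resolve(hash.digest("hex")));
  });
}

// Hash one destination file, read its size, and persist both through the store.
async function indexDestinationFile(
  db: DestinationFileStore,
  dataset: string,
  filePath: string,
): Promise<void> {
  const [hash, stats] = await Promise.all([hashFile(filePath), fs.stat(filePath)]);
  db.storeDestinationFile(dataset, filePath, hash, stats.size);
}
```

Streaming the file through the hash keeps memory usage flat regardless of file size, which matters when indexing large media files in batches.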
#### `findDuplicatesByHash(hash, fileSize, dataset?)`

Find all files matching a specific hash and size.

#### `getAllDuplicates(dataset?)`

Get all duplicates from the database view.

#### `updateFileHash(dataset, input, hash, fileSize)`

Update the hash and size for an existing file record.

#### `getDestinationFilesWithoutHash(dataset, destinationPath?)`

Get files that still need hash indexing.

#### `clearDestinationFiles(dataset, destinationPath?)`

Remove destination file entries (for re-indexing).

#### `getDestinationFileCount(dataset, destinationPath?)`

Get the count of indexed destination files.

### MaintenanceService Methods

#### `indexDestinationFiles(dataset, destinationPath, options)`

Index all files in a destination directory. Options:

- `reindex`: Clear existing entries and re-index (default: `false`)
- `batchSize`: Number of files to process at once (default: `100`)

#### `getIndexedDuplicateStats(dataset?)`

Get duplicate statistics from indexed files.

## Performance Comparison

### Traditional File System Scanning

- Walks the entire directory tree
- Reads and hashes every file on each scan
- O(n) complexity, where n = total files
- Slow for large directories (10,000+ files)

### Database-Indexed Scanning

- One-time indexing cost
- Duplicates found with a single SQL query
- O(log n) lookups via indexes
- Fast even for very large directories (100,000+ files)

### Example Performance

For a destination with 10,000 files:

| Method      | Initial Scan             | Subsequent Scans |
| ----------- | ------------------------ | ---------------- |
| File System | ~5-10 minutes            | ~5-10 minutes    |
| Database    | ~5-10 minutes (one-time) | ~5-10 seconds    |

## Usage Workflow

### Initial Setup

1. Index destination directories for all datasets:

   ```bash
   # For each dataset and destination
   curl -X POST http://localhost:3000/maintenance/index/destination \
     -H "Content-Type: application/json" \
     -d '{
       "dataset": "movies",
       "destination": "/media/movies"
     }'
   ```

2. Run a duplicate scan (it will use the database):

   ```bash
   curl -X POST http://localhost:3000/maintenance/duplicates/scan
   ```

### Maintenance

- Re-index when new files are added to destinations
- Use `reindex: true` to completely rebuild the index
- Monitor the index count to ensure it stays up to date

### Incremental Updates

When files are added:

```typescript
// After processing a file
db.setFile(dataset, inputFile, {
  output: outputFile,
  hash: calculatedHash,
  file_size: fileSize,
  status: "completed",
});
```

## Migration

The database migration `2026-01-06T19-47-58_add_hash_and_destination_tracking.sql` is applied automatically on service startup. No manual intervention is needed.

## Notes

- Hashes are calculated using SHA-1 (fast, and sufficient for duplicate detection)
- The `destination_path` field distinguishes destination files from source files
- Files in the `files` table can have either `input` (source) or `destination_path` (destination) set
- The system gracefully falls back to file system scanning if the database isn't indexed (see the sketch below)
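To illustrate the last note, here is a minimal sketch, assuming the `DbService` methods documented above, of how a scanner can prefer the indexed fast path and fall back to a file system walk when the destination hasn't been indexed. The `DuplicateIndex` interface and the `scanFileSystem` callback are stand-ins introduced here for illustration and are not part of the actual codebase.

```typescript
// Structural stand-ins for the documented DbService methods (illustrative only).
interface DuplicateIndex {
  getDestinationFileCount(dataset: string, destinationPath?: string): number;
  getAllDuplicates(dataset?: string): unknown[];
}

// Prefer the indexed fast path when database mode is enabled and the destination
// has been indexed; otherwise fall back to the traditional walk-and-hash scan.
function scanForDuplicates(
  db: DuplicateIndex,
  dataset: string,
  destination: string,
  scanFileSystem: () => unknown[], // placeholder for the file system fallback
  useDatabase = true,
): unknown[] {
  const indexedCount = db.getDestinationFileCount(dataset, destination);
  if (useDatabase && indexedCount > 0) {
    // Fast path: duplicates come straight from the file_duplicates view.
    return db.getAllDuplicates(dataset);
  }
  // Slow path: walk the destination and hash every file.
  return scanFileSystem();
}
```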