Quick Reference: Database-Optimized Duplicate Detection

Quick Start

1. Index Your Destinations

# Index movies destination
curl -X POST http://localhost:3000/maintenance/index/destination \
  -H "Content-Type: application/json" \
  -d '{"dataset": "movies", "destination": "/media/movies"}'

# Index TV shows destination
curl -X POST http://localhost:3000/maintenance/index/destination \
  -H "Content-Type: application/json" \
  -d '{"dataset": "tvshows", "destination": "/media/tvshows"}'

2. Run Duplicate Scan

# The scan uses the database index automatically once the destinations are indexed
curl -X POST http://localhost:3000/maintenance/duplicates/scan

3. View Results

# Get duplicate statistics
curl http://localhost:3000/maintenance/index/stats

# List duplicate groups
curl http://localhost:3000/maintenance/duplicates

API Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /maintenance/index/destination | Index destination files |
| GET | /maintenance/index/stats | Get duplicate statistics |
| GET | /maintenance/index/count | Get indexed file count |
| DELETE | /maintenance/index/:dataset | Clear index for dataset |
| POST | /maintenance/duplicates/scan | Scan for duplicates (uses DB) |
| GET | /maintenance/duplicates | List duplicate groups |

Request Examples

Index with Options

curl -X POST http://localhost:3000/maintenance/index/destination \
  -H "Content-Type: application/json" \
  -d '{
    "dataset": "movies",
    "destination": "/media/movies",
    "reindex": true,
    "batchSize": 200
  }'
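
The same request can be assembled in code. The following is a minimal sketch, assuming Node.js 18+ (global fetch); `buildIndexRequest` is a hypothetical helper, not part of the service, and it only uses the `dataset`, `destination`, `reindex`, and `batchSize` fields shown in the curl example.

```javascript
// Hypothetical helper: builds the arguments for fetch() from the documented fields.
const BASE_URL = "http://localhost:3000"; // host and port taken from the curl examples

function buildIndexRequest(dataset, destination, options = {}) {
  if (!dataset || !destination) {
    throw new Error("dataset and destination are required");
  }
  const body = { dataset, destination };
  // Include optional fields only when the caller set them to a usable value.
  if (typeof options.reindex === "boolean") body.reindex = options.reindex;
  if (Number.isInteger(options.batchSize) && options.batchSize > 0) {
    body.batchSize = options.batchSize;
  }
  return {
    url: `${BASE_URL}/maintenance/index/destination`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    },
  };
}

// Usage:
// const { url, init } = buildIndexRequest("movies", "/media/movies", {
//   reindex: true,
//   batchSize: 200,
// });
// const response = await fetch(url, init);
```

Keeping the builder separate from the network call makes the payload easy to inspect or log before anything is sent.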

Filter Duplicate Stats

# Get stats for specific dataset
curl "http://localhost:3000/maintenance/index/stats?dataset=movies"

Check Index Count

# Count indexed files for a dataset
curl "http://localhost:3000/maintenance/index/count?dataset=movies"

# Count for specific destination
curl "http://localhost:3000/maintenance/index/count?dataset=movies&destination=/media/movies"

Clear and Rebuild Index

# Clear index
curl -X DELETE "http://localhost:3000/maintenance/index/movies"

# Rebuild
curl -X POST http://localhost:3000/maintenance/index/destination \
  -H "Content-Type: application/json" \
  -d '{"dataset": "movies", "destination": "/media/movies"}'

Common Tasks

Check if Indexing is Needed

# A count of 0, or one well below the number of files on disk, means the destination needs indexing
curl "http://localhost:3000/maintenance/index/count?dataset=movies"
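
What counts as a "low number" depends on how many files the destination actually holds. A minimal sketch of that comparison follows; `needsIndexing` and its `minCoverage` threshold are assumptions for illustration, and the on-disk file count has to come from your own tooling.

```javascript
// Hypothetical helper: decides whether a destination needs (re)indexing.
//   indexedCount: the number reported by GET /maintenance/index/count
//   filesOnDisk:  the number of files you expect in the destination
//   minCoverage:  assumed threshold, not a value defined by the service
function needsIndexing(indexedCount, filesOnDisk, minCoverage = 0.95) {
  if (filesOnDisk <= 0) return false; // nothing on disk, nothing to index
  if (indexedCount <= 0) return true; // files exist but none are indexed
  return indexedCount / filesOnDisk < minCoverage;
}

// Usage:
// needsIndexing(0, 1200);    // no index yet
// needsIndexing(1195, 1200); // index covers almost everything
```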

Re-index After Adding Files

# Option 1: Full re-index (clears and rebuilds)
curl -X POST http://localhost:3000/maintenance/index/destination \
  -H "Content-Type: application/json" \
  -d '{"dataset": "movies", "destination": "/media/movies", "reindex": true}'

# Option 2: Incremental (only indexes new files)
curl -X POST http://localhost:3000/maintenance/index/destination \
  -H "Content-Type: application/json" \
  -d '{"dataset": "movies", "destination": "/media/movies", "reindex": false}'

Find Duplicates Programmatically

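// Using Node.js 18+ (global fetch); run as an ES module for top-level await
const response = await fetch(
  "http://localhost:3000/maintenance/index/stats?dataset=movies"
);
if (!response.ok) {
  throw new Error(`Stats request failed: HTTP ${response.status}`);
}
// Fall back to an empty list if the field is missing
const { duplicatesByDataset = [] } = await response.json();

duplicatesByDataset.forEach((dup) => {
  console.log(`Found ${dup.file_count} copies of file with hash ${dup.hash}`);
  console.log("Files:", dup.files);
});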
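
Once the groups are loaded, a short summary is often more useful than the raw list. The sketch below assumes each group carries the `hash`, `file_count`, and `files` fields used in the example above; `summarizeDuplicates` is a hypothetical helper and the response shape is not confirmed by this document.

```javascript
// Summarizes duplicate groups shaped like { hash, file_count, files }.
function summarizeDuplicates(groups) {
  // Keep only real duplicate groups (two or more copies).
  const dupes = groups.filter((g) => g.file_count > 1);
  // Every copy beyond the first one is redundant.
  const redundantCopies = dupes.reduce((sum, g) => sum + (g.file_count - 1), 0);
  // Sort a copy so the caller's array is left untouched; largest groups first.
  const largest = [...dupes].sort((a, b) => b.file_count - a.file_count);
  return { groupCount: dupes.length, redundantCopies, largest };
}

// Usage:
// const summary = summarizeDuplicates(duplicatesByDataset);
// console.log(`${summary.groupCount} groups, ${summary.redundantCopies} redundant copies`);
```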

Database Queries (Direct Access)

If you need to query the database directly:

-- Find all duplicates
SELECT * FROM file_duplicates;

-- Find duplicates for a specific dataset
SELECT * FROM file_duplicates WHERE dataset = 'movies';

-- Find files with a specific hash
SELECT * FROM files WHERE hash = 'abc123...';

-- Count files with a destination path (includes any that are not yet hashed)
SELECT COUNT(*) FROM files WHERE destination_path IS NOT NULL;

-- Find files needing indexing
SELECT * FROM files
WHERE destination_path IS NOT NULL
  AND hash IS NULL;

Maintenance Schedule

Recommended maintenance:

  1. Daily: Run duplicate scan (fast with DB)
  2. Weekly: Re-index high-traffic destinations
  3. Monthly: Full re-index of all destinations
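
The schedule above can be sketched as a small function that a cron job or task runner calls once a day. The task names are hypothetical labels, and running weekly work on Sundays and monthly work on the 1st is an assumption; adjust both to your setup.

```javascript
// Returns the maintenance tasks due on a given date, in the order to run them.
// Assumptions: weekly work runs on Sundays, monthly work on the 1st of the month.
function tasksDueOn(date) {
  const tasks = [];
  if (date.getDate() === 1) {
    // Monthly full re-index; it also covers the weekly re-index if both fall on this day.
    tasks.push("full-reindex-all");
  } else if (date.getDay() === 0) {
    // Weekly re-index of high-traffic destinations.
    tasks.push("reindex-high-traffic");
  }
  // Daily scan, placed last so it runs against the freshest index.
  tasks.push("duplicate-scan");
  return tasks;
}

// Usage:
// for (const task of tasksDueOn(new Date())) console.log("run:", task);
```

Re-indexing is listed before the scan because the scan reads from the index, so a scan that runs first would miss newly added files.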

Troubleshooting

Scan is slow

  • Check if destinations are indexed: GET /maintenance/index/count
  • If count is 0, index the destination first

Duplicates not showing up

  • Ensure files are indexed
  • Run a fresh scan: POST /maintenance/duplicates/scan
  • Check duplicate stats: GET /maintenance/index/stats

Need to rebuild index

# Clear the existing index for the dataset
curl -X DELETE "http://localhost:3000/maintenance/index/movies"

# Rebuild it from the destination
curl -X POST http://localhost:3000/maintenance/index/destination \
  -H "Content-Type: application/json" \
  -d '{"dataset": "movies", "destination": "/media/movies", "reindex": true}'

Performance Tips

  1. Batch Size: Tune batchSize to file size; use larger batches for small files and smaller batches for large files
  2. Re-index Strategy: Use incremental updates unless data is corrupted
  3. Scheduled Indexing: Run during off-peak hours
  4. Monitor: Check index count regularly to ensure it's up to date
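
The batch size tip can be sketched as a simple heuristic. `suggestBatchSize` is a hypothetical helper and its thresholds are illustrative guesses rather than values defined by the service; measure indexing time on your own data before settling on numbers.

```javascript
// Hypothetical heuristic: picks a batchSize from the average file size in bytes.
// Smaller files get larger batches; the cut-off points are assumptions.
function suggestBatchSize(avgFileSizeBytes) {
  const MB = 1024 * 1024;
  if (avgFileSizeBytes < 10 * MB) return 500;   // many small files
  if (avgFileSizeBytes < 1024 * MB) return 200; // mid-sized files
  return 50;                                    // very large files
}

// Usage:
// const batchSize = suggestBatchSize(700 * 1024 * 1024); // average file around 700 MB
```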