Here is the dilemma: There is a radio program I like to listen to on occasion, and one radio station posts a year of MP3 archives of their content online, so I downloaded the hour-long time slots when this program takes place. They broadcast the program on Saturday nights and Monday mornings; Monday broadcasts are often a repeat of Saturday, but not always. I want to keep a copy of each edition of the show, but remove any duplicates.

If all Mondays were simply repeats of Saturday, I could simply ignore Monday broadcasts, but this is not the case. Additionally, when Monday's and Saturday's content match, the downloads for the time slots are never identical files, because the end of the previous program and commercials carry into the start of the next time slot. The commercials are always different, and after sampling a few broadcasts, the start of the program I want to keep varies between 30 seconds and more than 2 minutes after the beginning of the time slot.

My solution for finding duplicate content is to take an audio fingerprint of the first several minutes of each broadcast, throwing away the first two and a half minutes to cut out the commercials (which are always different) and introduction to the program (which is always the same). Since the start of the program varies by a couple minutes, the audio fingerprints are cross-correlated over a 5 minute span. A peak in the cross-correlation corresponds to a match in content. Below are examples of cross-correlation between two non-identical files, where the left one has matching content (the offset of the peak corresponds to the difference in time at which the matching content starts), and the right one has no matching content.

Matching content No matching content

The Python program which pulls this off is given below, comparing files in a directory with Monday's content to files in another directory with Saturday's content. The audio fingerprinting is done using fpcalc in Chromaprint (http://acoustid.org/chromaprint); fpcalc should be placed in the same directory as the Python script. Chromaprint samples the audio content at 11025 Hz with a frame size of 4096 with 2/3 overlap, so each point in the fingerprint is 4096 / 11025 Hz / 3 = 0.124 s. The fingerprinting algorithm is explained at http://oxygene.sk/2011/01/how-does-chromaprint-work/. Audio files which have a cross-correlation exceeding a threshold are declared matching, and recorded in a file along with their correlation and the offset location of the peak. This file can then be parsed with another program to remove duplicate content (possibly with a stricter threshold), or examined manually. The program is not very fast, but it gets the job done, and certainly beats trying to find duplicates by hand. To compare files within a directory, simply set both directories to the same; the program will avoid correlating files with each other more than once, or with themselves.

fpcalc_compare_dirs.py

#!/usr/bin/env python
# Nick Masluk
# 2013-05-05
# 2014-11-16 Added fpcalc_loc rather than assuming fpcalc is located in working
#            directory.
#            Addition of error checking.
#            Some minor syntax cleanup.

# location of fpcalc (leave blank if found in $PATH)
fpcalc_loc = './'
# directories to compare files between
# set to the same directory to compare files between themselves
dir_a = '/home/nick/monday/'
dir_b = '/home/nick/saturday/'
# file to write matches to
match_file = 'matches.txt'
# seconds to sample audio file for
sample_time = 500
# number of points to crop from beginning of fingerprint
# 4096 samples per frame / 11025 samples per second / 3 points per frame
# = 0.124 seconds per point
crop = 1100
# number of points to scan cross correlation over
span = 1200
# step size (in points) of cross correlation
step = 3
# report match when cross correlation has a peak exceeding threshold
threshold = 0.5
# minimum number of points that must overlap in cross correlation
# exception is raised if this cannot be met
min_overlap = 200

################################################################################
# import modules
################################################################################

import re
import commands
import numpy
import math

################################################################################
# function definitions
################################################################################

# adds escape characters in front of Bash special characters
def esc_bash_chars(string):
    # match any of the following characters between the capital A's
    # A`!$^&*()=[]{}\|;:'",<>? A
    # note that characters ] and ' need escape characters themselves in the
    # regex, and \ requires two escape characters
    specialchars = re.compile('[`!$^&*()=[\]{}\\\|;:\'",<>? ]')
    string_escaped = ""
    for char in string:
        if specialchars.search(char):
            string_escaped += '\\' + char
        else:
            string_escaped += char
    return string_escaped

# returns variance of list
def variance(listx):
    meanx = numpy.mean(listx)
    # get mean of x^2
    meanx_sqr = 0
    for x in listx:
        meanx_sqr += x**2
    meanx_sqr = meanx_sqr / float(len(listx))
    return meanx_sqr - meanx**2

# returns correlation between lists
def correlation(listx, listy):
    if len(listx) == 0 or len(listy) == 0:
        # Error checking in main program should prevent us from ever being
        # able to get here.
        raise Exception('Empty lists cannot be correlated.')
    if len(listx) > len(listy):
        listx = listx[:len(listy)]
    elif len(listx) < len(listy):
        listy = listy[:len(listx)]
    
    meanx = numpy.mean(listx)
    meany = numpy.mean(listy)
    
    covariance = 0
    for i in range(len(listx)):
        covariance += (listx[i] - meanx) * (listy[i] - meany)
    covariance = covariance / float(len(listx))
    
    return covariance / (math.sqrt(variance(listx)) * math.sqrt(variance(listy)))

# return cross correlation, with listy offset from listx
def cross_correlation(listx, listy, offset):
    if offset > 0:
        listx = listx[offset:]
        listy = listy[:len(listx)]
    elif offset < 0:
        offset = -offset
        listy = listy[offset:]
        listx = listx[:len(listy)]
    if min(len(listx), len(listy)) < min_overlap:
        # Error checking in main program should prevent us from ever being
        # able to get here.
        raise Exception('Overlap too small: %i' % min(len(listx), len(listy)))
    return correlation(listx, listy)

# cross correlate listx and listy with offsets from -span to span
def compare(listx, listy, span, step):
    if span > min(len(listx), len(listy)):
        # Error checking in main program should prevent us from ever being
        # able to get here.
        raise Exception('span >= sample size: %i >= %i\n'
                        % (span, min(len(listx), len(listy)))
                        + 'Reduce span, reduce crop or increase sample_time.')
    corr_xy = []
    for offset in numpy.arange(-span, span + 1, step):
        corr_xy.append(cross_correlation(listx, listy, offset))
    return corr_xy

# return index of maximum value in list
def max_index(listx):
    max_index = 0
    max_value = listx[0]
    for i, value in enumerate(listx):
        if value > max_value:
            max_value = value
            max_index = i
    return max_index

# write to a file
def write_string(string, filename):
    file_out = open(filename, 'ab')
    file_out.write(string + '\n')
    file_out.close()

################################################################################
# main code
################################################################################

# escape Bash special characters
dir_a = esc_bash_chars(dir_a)
dir_b = esc_bash_chars(dir_b)
match_file = esc_bash_chars(match_file)

# get list of files to compare from each directory
filelist_a = commands.getoutput('ls ' + dir_a + '*.mp3').split('\n')
filelist_b = commands.getoutput('ls ' + dir_b + '*.mp3').split('\n')

# if cross-correlating between files within a directory, don't correlate files
# twice, or correlate files with themselves
intra_correlating = False
if filelist_a == filelist_b:
    intra_correlating = True

for i, file_a in enumerate(filelist_a):
    # if correlating between files within a directory, set filelist_b such that 
    # cross-correlations are not repeated, and files are not correlated with
    # themselves
    if intra_correlating:
        # remove files already correlated with from filelist_b, along with
        # current file
        filelist_b = filelist_a[i+1:]
        if len(filelist_b) == 0:
            # nothing left to check!
            break

    file_a = esc_bash_chars(file_a)
    # calculate fingerprint
    fpcalc_out = commands.getoutput('%sfpcalc -raw -length %i %s'
                                    % (fpcalc_loc, sample_time, file_a))
    fingerprint_index = fpcalc_out.find('FINGERPRINT=') + 12
    # convert fingerprint to list of integers
    fingerprint_a = map(int, fpcalc_out[fingerprint_index:].split(','))
    # check that fingerprint length meets minimum size
    if len(fingerprint_a) - crop - span < min_overlap:
        raise Exception('Fingerprint length less than required:\n'
                        + 'File: %s\n' % file_a
                        + 'Fingerprint length: %i\n' % len(fingerprint_a)
                        + 'Required length (crop + span + min_overlap): %i\n'
                        % (crop + span + min_overlap)
                        + 'Increase sample_time, reduce span or reduce crop.')

    for file_b in filelist_b:
        file_b = esc_bash_chars(file_b)
        # calculate fingerprint
        fpcalc_out = commands.getoutput('%sfpcalc -raw -length %i %s'
                                        % (fpcalc_loc, sample_time, file_b))
        fingerprint_index = fpcalc_out.find('FINGERPRINT=') + 12
        # convert fingerprint to list of integers
        fingerprint_b = map(int, fpcalc_out[fingerprint_index:].split(','))
        # check that fingerprint length meets minimum size
        if len(fingerprint_b) - crop - span < min_overlap:
            raise Exception('Fingerprint length less than required:\n'
                            + 'File: %s\n' % file_b
                            + 'Fingerprint length: %i\n' % len(fingerprint_b)
                            + 'Required length (crop + span + min_overlap): %i\n'
                            % (crop + span + min_overlap)
                            + 'Increase sample_time, reduce span or reduce crop.')

        # cross correlation between fingerprints
        corr_ab = compare(fingerprint_a[crop:], fingerprint_b[crop:], span, step)
        max_corr_index = max_index(corr_ab)
        max_corr_offset = -span + max_corr_index * step

        # report matches
        if corr_ab[max_corr_index] > threshold:
            print('%s and %s match with correlation of %.4f at offset %i'
                  % (file_a, file_b, corr_ab[max_corr_index], max_corr_offset))
            write_string('%s\t%s\t%.6f\t%i'
                         % (file_a, file_b, corr_ab[max_corr_index],
                            max_corr_offset),
                         match_file)