GNU/Linux Desktop Survival Guide
by Graham Williams

Duplicate Files

20191229 A common challenge is to find duplicate files, such as photos, music, or documents. When available disk space becomes tight, it is also a good time for a clean up.

A simple trick for finding duplicates is to calculate an MD5 signature for each file and then use matching signatures to identify candidate duplicates. In practice different files almost always have different signatures, although the mapping from contents to signature is not strictly unique, so a byte-by-byte comparison is still needed to confirm a match.
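The trick can be sketched with standard tools: hash every file with md5sum, sort by signature, and print only the groups whose signatures repeat. This is a minimal sketch assuming GNU coreutils (md5sum and uniq's --all-repeated option); the file names and contents are hypothetical, created in a throw-away directory purely for illustration:

```shell
# Throw-away directory with two duplicates and one unique file
# (hypothetical names, purely for illustration).
dir=$(mktemp -d)
echo "same content" > "$dir/a.txt"
echo "same content" > "$dir/b.txt"
echo "different"    > "$dir/c.txt"

# Hash every file, sort by signature, and keep only the lines whose
# first 32 characters (the MD5 digest) are repeated.
dups=$(md5sum "$dir"/* | sort | uniq --all-repeated=separate -w32)
printf '%s\n' "$dups"    # lists a.txt and b.txt, not c.txt

rm -rf "$dir"
```

Here uniq compares only the first 32 characters of each line, which is exactly the hexadecimal MD5 digest that md5sum prints before the file name.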

The fdupes package provides the fdupes command, which incorporates the MD5 signature within a more thorough pipeline to guarantee that the files are duplicates. The pipeline for checking for duplicate files proceeds through a file size comparison, a partial MD5 signature comparison, a full MD5 signature comparison, and finally a byte-by-byte comparison.

A summary, as obtained with the --summarize or -m option, is often a useful place to begin:

$ fdupes --summarize .
13567 duplicate files (in 6407 sets), occupying 16996.0 megabytes

fdupes requires at least one command line argument (a path to a directory). In the above a period (.) is used to indicate the current directory.

With no options fdupes lists groups of duplicated files in the specified directory:

$ fdupes .
./20180323_thesis_02.pdf
./20180323_thesis_01.pdf
./20180323_thesis.pdf

./20030102_pakdd01_03.pdf
./20031012_pakdd01.pdf

./20200531_siunits_01.pdf
./20200531_siunits.pdf

Use the --recurse or -r option to recurse into subdirectories.
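When fdupes is not to hand, the recursive search can be approximated with find and the MD5 trick above. This is a sketch assuming GNU coreutils, again using a throw-away directory tree with hypothetical file names:

```shell
# Throw-away directory tree with a duplicate in a subdirectory.
dir=$(mktemp -d)
mkdir -p "$dir/sub"
echo "same content" > "$dir/top.txt"
echo "same content" > "$dir/sub/nested.txt"
echo "different"    > "$dir/unique.txt"

# Recurse with find, hash every regular file, then keep only the
# lines whose first 32 characters (the MD5 digest) repeat.
dups=$(find "$dir" -type f -exec md5sum {} + | sort |
       uniq --all-repeated=separate -w32)
printf '%s\n' "$dups"    # top.txt and sub/nested.txt form a group

rm -rf "$dir"
```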

fdupes can delete duplicates, retaining the first file listed in each group. Since the files in a group contain exactly the same content, a general heuristic is to keep the original rather than the files with versioned file names. Ordering the list by name and then reversing the order achieves this:

$ fdupes --order='name' --reverse .
./20180323_thesis.pdf
./20180323_thesis_01.pdf
./20180323_thesis_02.pdf

./20031012_pakdd01.pdf
./20030102_pakdd01_03.pdf

./20200531_siunits.pdf
./20200531_siunits_01.pdf

The following command deletes duplicates, keeping the first file in each group, with the list ordered in reverse by file name:

$ fdupes --delete --noprompt --order='name' --reverse .

The --omitfirst or -f option generates a list of the duplicate files excluding the first in each group. This list can be saved to a file and used to generate a script for manually deleting the duplicates if desired.
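For example, each line of the --omitfirst listing can be wrapped in an rm command with sed, dropping the blank lines that separate groups. This is a sketch only: the generated script should be reviewed before running, and file names containing double quotes would need extra care. The sample listing below stands in for actual fdupes output:

```shell
# A sample listing such as fdupes --omitfirst might print
# (blank lines separate groups of duplicates).
listing='./20180323_thesis_01.pdf
./20180323_thesis_02.pdf

./20030102_pakdd01_03.pdf'

# Drop blank lines and wrap each name in a quoted rm command.
script=$(printf '%s\n' "$listing" | sed '/^$/d; s/^/rm -v "/; s/$/"/')
printf '%s\n' "$script"
```

The resulting script can then be inspected and, once satisfied, run with sh.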


Copyright © 1995-2020 Togaware Pty Ltd. Creative Commons ShareAlike V4.