Deleting Duplicate Files on Linux

Author: Tyan
Blog: noahsnail.com  |  CSDN  |  Jianshu

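1. Introduction

When processing data on Linux, you often need to remove duplicate files. For example, in an image classification task you may want to remove duplicate images from the training data. On Linux, the fdupes command can find and delete duplicate files.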

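2. About fdupes

fdupes is a Linux utility written in C by Adrian Lopez. It finds duplicate files in a given set of directories and, when the -r option is given, in their subdirectories. fdupes identifies duplicates by comparing MD5 signatures and then confirming each match with a byte-by-byte comparison. The checks run from cheapest to most expensive:

size comparison > partial MD5 signature comparison > full MD5 signature comparison > byte-by-byte comparison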
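
The staged order above can be sketched for a pair of files with standard coreutils. This is only an illustration of the idea, not fdupes' actual implementation: the function names is_duplicate and md5_of_stdin and the PARTIAL_BYTES value are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Sketch of a staged duplicate check for two files (not fdupes' own code).
# Cheap tests run first so most non-duplicates are rejected early.

# Assumed value: how many leading bytes the "partial" hash covers.
PARTIAL_BYTES=4096

# Print the MD5 hex digest of whatever arrives on stdin.
md5_of_stdin() {
    md5sum | cut -d ' ' -f 1
}

# Return 0 if the two files are byte-identical, 1 otherwise.
is_duplicate() {
    local a=$1 b=$2

    # 1. Size comparison: files of different sizes cannot be duplicates.
    [ "$(wc -c < "$a")" -eq "$(wc -c < "$b")" ] || return 1

    # 2. Partial MD5: hash only the first PARTIAL_BYTES bytes of each file.
    [ "$(head -c "$PARTIAL_BYTES" "$a" | md5_of_stdin)" = \
      "$(head -c "$PARTIAL_BYTES" "$b" | md5_of_stdin)" ] || return 1

    # 3. Full MD5: hash the complete contents.
    [ "$(md5_of_stdin < "$a")" = "$(md5_of_stdin < "$b")" ] || return 1

    # 4. Byte-by-byte comparison guards against hash collisions.
    cmp -s "$a" "$b"
}

# Usage: is_duplicate img1.jpg img2.jpg && echo "duplicate"
```

The final cmp step means the MD5 stages act only as filters, so a hash collision cannot cause two different files to be reported as duplicates.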

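3. Installing fdupes

On CentOS, for example, fdupes can be installed with yum (the package usually comes from the EPEL repository, which may need to be enabled first):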

sudo yum install -y fdupes

4. Using fdupes

To delete duplicate files without prompting the user:

$ fdupes -dN [folder_name]

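Here, -d deletes duplicates and, on its own, prompts for which file in each set to keep. When -N is used together with -d, fdupes keeps the first file in each set of duplicates and deletes the rest without prompting.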
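
Because -dN deletes without asking, it can help to list the duplicates first and delete only after reviewing them. The wrapper below is a minimal sketch of that workflow. The function name dedupe_dir and its --delete argument are hypothetical, and the -r flag (recurse into subdirectories) is an addition that the command above does not use.

```shell
#!/usr/bin/env bash
# Sketch: list duplicates first, delete only on explicit request.
# dedupe_dir and its "--delete" argument are hypothetical names.

dedupe_dir() {
    local dir=$1 mode=${2:-}

    # Bail out early if fdupes is not installed.
    command -v fdupes >/dev/null 2>&1 || {
        echo "fdupes not found" >&2
        return 127
    }

    # Dry run: -r recurses into subdirectories, -S shows duplicate sizes.
    fdupes -r -S "$dir"

    # Destructive step: -d deletes, -N keeps the first file in each set
    # without prompting.
    if [ "$mode" = "--delete" ]; then
        fdupes -r -d -N "$dir"
    fi
}

# Usage:
#   dedupe_dir ./train_images            # list duplicates only
#   dedupe_dir ./train_images --delete   # list, then delete
```

Running the listing step on its own first gives a chance to check which file fdupes treats as the first in each set before anything is removed.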

Help output:

$ fdupes -h
Usage: fdupes [options] DIRECTORY...

 -r --recurse      for every directory given follow subdirectories
                   encountered within
 -R --recurse:     for each directory given after this option follow
                   subdirectories encountered within (note the ':' at
                   the end of the option, manpage for more details)
 -s --symlinks     follow symlinks
 -H --hardlinks    normally, when two or more files point to the same
                   disk area they are treated as non-duplicates; this
                   option will change this behavior
 -n --noempty      exclude zero-length files from consideration
 -A --nohidden     exclude hidden files from consideration
 -f --omitfirst    omit the first file in each set of matches
 -1 --sameline     list each set of matches on a single line
 -S --size         show size of duplicate files
 -m --summarize    summarize dupe information
 -q --quiet        hide progress indicator
 -d --delete       prompt user for files to preserve and delete all
                   others; important: under particular circumstances,
                   data may be lost when using this option together
                   with -s or --symlinks, or when specifying a
                   particular directory more than once; refer to the
                   fdupes documentation for additional information
 -N --noprompt     together with --delete, preserve the first file in
                   each set of duplicates and delete the rest without
                   prompting the user
 -I --immediate    delete duplicates as they are encountered, without
                   grouping into sets; implies --noprompt
 -p --permissions  don't consider files with different owner/group or
                   permission bits as duplicates
 -o --order=BY     select sort order for output and deleting; by file
                   modification time (BY='time'; default), status
                   change time (BY='ctime'), or filename (BY='name')
 -i --reverse      reverse order while sorting
 -v --version      display fdupes version
 -h --help         display this help message
