
Identifying Duplicate Files in the Current Directory

Here are some dedicated tools for finding duplicate files:

  • duff: Quickly find duplicate files
  • fdupes: Finds duplicate files in a given set of directories

But we can also cobble a solution together with a few commonly used CLI tools:

$ find -type f | xargs -n 1 -I {} md5sum '{}' | sort | \
> awk '{if(k==$1){printf("%s\n%s\n",v,$2)}else{print("")};k=$1;v=$2}' | uniq

The command finds all files in the current directory, computes the MD5 checksum of each file, and sorts the output so that identical checksums end up on adjacent lines. The awk script compares each checksum ($1) with the previous one (k): on a match it prints the previous file name (v) followed by the current one ($2); otherwise it prints a blank line as a group separator. Finally, uniq collapses the file names that get printed twice when a checksum occurs more than two times.
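
As a side note, GNU uniq can do the grouping by itself: -w32 compares only the first 32 characters (the checksum), and --all-repeated=separate prints every member of each duplicate cluster, with clusters separated by blank lines. A shorter variant, assuming GNU coreutils:

$ find -type f | xargs -n 1 -I {} md5sum '{}' | sort | \
> uniq -w32 --all-repeated=separate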

Let's put this into a shell script (find-duplicates.sh), along with some dummy files, to test both the command and the tools mentioned above:

#!/bin/sh
apt-get update && apt-get install -y duff fdupes
cd /tmp && seq 3 | xargs -I {} bash -c 'touch file{}; date > date{}; echo $RANDOM > rand{}'
set -x
find -type f | xargs -n 1 -I {} md5sum '{}' | sort | \
awk '{if(k==$1){printf("%s\n%s\n",v,$2)}else{print("")};k=$1;v=$2}' | uniq
duff -r .
fdupes -r .

Execute the script in a disposable environment:

$ cat find-duplicates.sh | docker run -i --rm debian:9.6

Output:

Ign:1 http://cdn-fastly.deb.debian.org/debian stretch InRelease
Get:2 http://security-cdn.debian.org/debian-security stretch/updates InRelease [94.3 kB]
Get:3 http://cdn-fastly.deb.debian.org/debian stretch-updates InRelease [91.0 kB]
Get:4 http://cdn-fastly.deb.debian.org/debian stretch Release [118 kB]
Get:5 http://security-cdn.debian.org/debian-security stretch/updates/main amd64 Packages [467 kB]
Get:6 http://cdn-fastly.deb.debian.org/debian stretch Release.gpg [2434 B]
Get:7 http://cdn-fastly.deb.debian.org/debian stretch-updates/main amd64 Packages [5152 B]
Get:8 http://cdn-fastly.deb.debian.org/debian stretch/main amd64 Packages [7089 kB]
Fetched 7868 kB in 1s (4119 kB/s)
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
The following NEW packages will be installed:
duff fdupes
0 upgraded, 2 newly installed, 0 to remove and 3 not upgraded.
Need to get 52.4 kB of archives.
After this operation, 146 kB of additional disk space will be used.
Get:1 http://cdn-fastly.deb.debian.org/debian stretch/main amd64 fdupes amd64 1:1.6.1-1+b1 [21.2 kB]
Get:2 http://cdn-fastly.deb.debian.org/debian stretch/main amd64 duff amd64 0.5.2-1.1+b2 [31.2 kB]
debconf: delaying package configuration, since apt-utils is not installed
Fetched 52.4 kB in 0s (110 kB/s)
Selecting previously unselected package fdupes.
(Reading database ... 6498 files and directories currently installed.)
Preparing to unpack .../fdupes_1%3a1.6.1-1+b1_amd64.deb ...
Unpacking fdupes (1:1.6.1-1+b1) ...
Selecting previously unselected package duff.
Preparing to unpack .../duff_0.5.2-1.1+b2_amd64.deb ...
Unpacking duff (0.5.2-1.1+b2) ...
Setting up duff (0.5.2-1.1+b2) ...
Setting up fdupes (1:1.6.1-1+b1) ...
+ xargs -n 1 -I '{}' md5sum '{}'
+ sort
+ find -type f
+ awk '{if(k==$1){printf("%s\n%s\n",v,$2)}else{print("")};k=$1;v=$2}'
+ uniq
./file1
./file2
./file3
./date1
./date2
./date3
+ duff -r .
3 files in cluster 1 (0 bytes, digest da39a3ee5e6b4b0d3255bfef95601890afd80709)
./file3
./file1
./file2
3 files in cluster 2 (29 bytes, digest c20aeceea6a6b4bb1903d9124ea69223da08c69c)
./date2
./date1
./date3
+ fdupes -r .
./date2
./date1
./date3
./file3
./file1
./file2

The output formats differ slightly, but all three approaches report the same duplicates.

However, the dedicated tools provide additional options, such as following symbolic links. They are worth installing.

Splitting a JSON File into Multiple Parts

mongoexport allows us to export documents from MongoDB into a JSON file:

$ mongoexport -d mydb -c mycollection -o myfile.json --jsonArray

The --jsonArray option writes the entire export as a single JSON array. Sometimes an individual file is quite large and we need to break it down into smaller pieces, because most web servers limit how much data can be submitted at once. With this option, however, the entire array sits on a single line, so line-oriented commands cannot easily process it. (If you need to go that route, jq can split a large JSON file into smaller pieces.) Instead, we can omit the option (the default behavior) and have the export utility dump one document per line. The exported file as a whole is then technically not valid JSON, but each line, representing a single MongoDB document, is, and that makes command line processing straightforward.
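
As an aside, if we already have an array-style export, jq can turn it back into one document per line (a sketch, assuming jq is installed; -c prints each array element compactly on its own line):

$ jq -c '.[]' myfile.json > data.json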

To break a large file into many smaller pieces, we can use the split command:

$ split -l 10 data.json

The -l or --lines option limits each output file to at most 10 lines.
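
By default, the pieces are named xaa, xab, xac, and so on, so we can quickly confirm that no piece exceeds the limit:

$ split -l 10 data.json && wc -l x*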

Alternatively, the -C or --line-bytes option puts as many complete lines as will fit, up to 1 kB, into each output file:

$ split -d -a 3 -C 1k data.json

(The -d flag switches to numeric suffixes and -a 3 makes them three digits long, yielding x000, x001, and so on.) One thing to make sure of is that no single line exceeds the maximum size given to the option; otherwise, split will emit partial lines.
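
GNU wc makes that easy to verify: -L reports the length of the longest line, which must stay below the 1k limit used here:

$ wc -L data.json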

It is a good idea to keep all the parts in their own directory:

$ mkdir pieces && cd pieces && split -d -a 3 -C 1k ../data.json && cd ..

Unless the file was split one line per piece (in which case each piece is already a valid JSON document), we need to convert each piece back into valid JSON:

$ find pieces/* -exec sh -c \
> "awk 'BEGIN{l=\"[\"}{print l;l=\$0\",\"}END{print\$0\"\n]\"}' \
> {} > {}.json && rm {}" \;

Each output JSON file will contain an array of MongoDB documents. The main idea of the AWK script is to print the previous line while reading the current one, so that every line except the last can be given a trailing comma. BEGIN { l = "[" } seeds the first printed line with the opening square bracket, and END { print $0"\n]" } prints the last line (without a comma) followed by the closing square bracket.
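
Spelled out with comments, the script reads like this (a sketch equivalent to the one-liner above; piece stands for any one of the split output files):

awk '
BEGIN { l = "[" }        # the first line to print is the opening bracket
{
    print l              # print the held line (or "[" on the first record)
    l = $0 ","           # hold the current line, with a trailing comma
}
END { print $0 "\n]" }   # print the last line without a comma, then close the array
' piece > piece.json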

There is bound to be a better way; I just need to keep looking.