
Split a Large JSON file into Smaller Pieces

In the previous post, I wrote about how to split a large JSON file into multiple parts, but that was limited to the default behavior of mongoexport, where each line in the output file represents a JSON string. If you have to deal with a large JSON file, such as one generated with the --jsonArray option in mongoexport, you have to parse the file incrementally, or stream it.

I have downloaded a large JSON data set (about 144MB) from Data.gov. If you try to read the entire data set into memory:

> var json = require('./data.json')
Killed

The process is not able to handle it; streaming is necessary. And luckily, our command-line JSON processing tool, jq, supports streaming.

The parts we are interested in are encapsulated in an array under the data property of the data set. We are going to split each element of the array into its own file.
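
The top level of the file looks roughly like this (only the data array is given; the meta key and the element contents are made-up placeholders):

{
  "meta": { "view": { "name": "Some Data Set" } },
  "data": [
    { "id": 1, "value": "first element" },
    { "id": 2, "value": "second element" }
  ]
}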

Don’t try to use the -f option in jq to read the file from the command line; it will read everything into memory. Instead, do cat data.json | jq.

$ mkdir parts
$ cat data.json | jq -c -M '.data[]' | \
  while read line; do echo $line > parts/$(date +%s%N).json; done

The entire data set is piped into jq, which filters the array elements and prints each one in compact form on its own line (that is what -c does; -M disables colored output). Each line is then saved into its own JSON file, using the UNIX timestamp plus nanoseconds (date +%s%N) as the filename. All pieces are saved into the parts/ directory.

But there is one problem with embedded JSON strings, which has to do with echo stripping backslashes. For example, if we echo the following string:

{"name":"{\"first\":\"Foo\",\"last\":\"Foo\"}","username":"foo","id":1}

It will be printed as invalid JSON:

{"name":"{"first":"Foo","last":"Foo"}","username":"foo","id":1}

Backslashes are stripped. To fix this problem, we can simply double the backslashes:

$ cat data.json | jq -c -M '.data[]' | sed 's/\\"/\\\\"/g' | \
  while read line; do echo $line > parts/$(date +%s%N).json; done
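
As a side note, another way to sidestep the escaping issue (a sketch, not the pipeline used above) is to stop the shell from interpreting backslashes at all, using read -r and printf instead of echo:

$ cat data.json | jq -c -M '.data[]' | \
  while IFS= read -r line; do printf '%s\n' "$line" > parts/$(date +%s%N).json; done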

You can even curl the remote JSON file instead of using cat on the downloaded file. But you might want to try with a smaller file first, because, on my slow machine, it took me nearly an hour to finish splitting it into 678,733 parts:

real    49m35.780s
user    2m42.888s
sys     6m48.048s
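
For reference, the curl variant would look something like this (the URL is just a placeholder):

$ curl -s https://data.example.gov/some-data-set.json | jq -c -M '.data[]' | sed 's/\\"/\\\\"/g' | \
  while read line; do echo $line > parts/$(date +%s%N).json; done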

To take it a little bit further, the next step is to decide how many lines or array elements to write into a single file.
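
One possible approach (just a sketch, with 100 as an arbitrary batch size) is to pipe the jq output into the split command covered in the next section, instead of the while loop, and let it group the lines:

$ cat data.json | jq -c -M '.data[]' | sed 's/\\"/\\\\"/g' | \
  split -l 100 -d -a 3 - parts/part-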

Split JSON File into Multiple Parts

mongoexport allows us to export documents from MongoDB into a JSON file (-d is the database, -c the collection, and -o the output file):

$ mongoexport -d mydb -c mycollection -o myfile.json --jsonArray

The --jsonArray option writes the entire content of the export as a single JSON array. Sometimes the exported file is quite large, and we need to break it down into smaller pieces, because most web servers have a limit on how much data can be submitted at once. However, with this option, the entire document is a single line, so line-oriented processing commands cannot easily be used. (If you are looking to do that, you can use jq to split a large JSON file into smaller pieces.)

We can omit the option (the default behavior) and have the export utility dump one document at a time. The exported file as a whole is then technically not valid JSON, but each line, which represents a MongoDB document, is valid JSON and can be used for command-line processing.
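
To make the difference concrete, with two made-up documents, a --jsonArray export is a single line:

[{"_id":1,"name":"foo"},{"_id":2,"name":"bar"}]

while the default export puts one document per line:

{"_id":1,"name":"foo"}
{"_id":2,"name":"bar"}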

To break a large file into many smaller pieces, we can use split command:

$ split -l 10 data.json

The -l or --lines option limits each output file to a maximum of 10 lines.
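
For example, assuming a 25-line data.json, this produces three files named with split's default prefix and suffixes:

$ split -l 10 data.json
$ ls
data.json  xaa  xab  xac

xaa and xab get 10 lines each, and xac gets the remaining 5.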

Another way is to use the -C or --line-bytes option to put at most 1 KB of lines into each output file:

$ split -d -a 3 -C 1k data.json

The -d option uses numeric suffixes, and -a 3 makes each suffix three characters long. One thing to make sure of is that the size of each line is no more than the maximum specified by the option; otherwise, partial lines will be generated.

It is good to keep all those parts in their own directory:

$ mkdir pieces && cd pieces && split -d -a 3 -C 1k ../data.json && cd ..

Unless we break the file into one line per piece, we need to convert each individual piece into valid JSON:

$ find pieces/* -exec sh -c \
> "awk 'BEGIN{l=\"[\"}{print l;l=\$0\",\"}END{print\$0\"\n]\"}' \
> {} > {}.json && rm {}" \;

The output JSON file will contain an array of MongoDB documents. The main idea of the AWK script is to print out the previous line as it reads the current line. BEGIN { l = "[" } defines the first line as the opening square bracket, and END { print $0"\n]" } prints the last line of the file followed by the closing square bracket.
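
For instance, a two-line piece with made-up documents:

{"_id":1,"name":"foo"}
{"_id":2,"name":"bar"}

becomes a valid JSON array:

[
{"_id":1,"name":"foo"},
{"_id":2,"name":"bar"}
]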

There is bound to be a better way. Just need to keep looking.