json

Wrapping Line by Line JSON to an Array

Some utilities print out JSON data (object) line by line. Line by line JSON is not really a valid JSON format, need to wrap it in bracket as an array.

For example, here are a few lines of JSON objects:

1
2
3
4
$ for i in $(seq 3); do echo '{"id":'$i'}'; done
{"id":1}
{"id":2}
{"id":3}

To combine them into a single array, we can use jq‘s -s, —slurp option:

1
2
$ for i in $(seq 3); do echo '{"id":'$i'}'; done | jq -cM -s
[{"id":1},{"id":2},{"id":3}]

The option reads the entire input stream into a large array. This is good until the input stream is too large to process, because jq is not producing output while slurping up the input. If the input is line by line JSON, and all you want to do is wrapping it into an array (with or even without the trailing newline), we can simply do the following:

1
2
3
$ for i in $(seq 3); do echo '{"id":'$i'}'; done |\
awk 'BEGIN { printf "["; getline; printf "%s", $0 } { printf ",%s", $0 } END { print "]" }'
[{"id":1},{"id":2},{"id":3}]

This is with the newline printed. And it will solve the problem of a very large input. Data will simply stream out as it comes it. Try it with a forever while loop.

There is another option in jq named —stream, and it seems to be doing the same. But option —slurp overrides —stream, and the option itself is already so complicated.

In conclusion, you can use jq -s to wrap line by line JSON into an array. If you want to be safe, just use the awk example above.

Escaping in JSON with Backslash

Escape characters are part of the syntax for many programming languages, data formats, and communication protocols. For a given alphabet an escape character’s purpose is to start character sequences (so named escape sequences), which have to be interpreted differently from the same characters occurring without the prefixed escape character.[^2]

JSON or JavaScript Object Notation is a data interchange format. It has an escape character as well.

In many programming languages such as C, Perl, and PHP and in Unix scripting languages, the backslash is an escape character, used to indicate that the character following it should be treated specially (if it would otherwise be treated normally), or normally (if it would otherwise be treated specially).[^3]

JavaScript also uses backslash as an escape character. JSON is based on a subset of the JavaScript Programming Language, therefore, JSON also uses backslash as the escape character:

A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes.[^1]

A character can be:

  • Any Unicode character except " or \ or control character
  • \"
  • \\
  • \/
  • \b
  • \f
  • \n
  • \r
  • \t
  • \u + four-hex-digits

Only a few characters can be escaped in JSON. If the character is not one of the listed:

1
2
$ cat data.json
"\a"

it returns a SyntaxError[^4]:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
$ node -e 'console.log(require("./data.json"))'
module.js:561
throw err;
^
SyntaxError: /home/chao/tmp/js/data.json: Unexpected token a in JSON at position 2
at Object.parse (native)
at Object.Module._extensions..json (module.js:558:27)
at Module.load (module.js:458:32)
at tryModuleLoad (module.js:417:12)
at Function.Module._load (module.js:409:3)
at Module.require (module.js:468:17)
at require (internal/module.js:20:19)
at [eval]:1:13
at ContextifyScript.Script.runInThisContext (vm.js:25:33)
at Object.exports.runInThisContext (vm.js:77:17)

Avoid Assigning undefined to an Object Property

A property value should be any JavaScript value except for undefined. If you do something like this (albeit it is legal):

1
var foo = { bar: undefined };

will leads to confusing code. Because when accessing the property value:

1
console.log(foo.bar); // undefined

It will returns undefined. But you are not sure if it means the property exists or not or the value of the property is set to undefined. Therefore, you should do:

1
var foo = { bar: null };

This indicates that the property is expected, and with the value of null.

You do be able to check the existence of a property by:

1
2
Object.keys(foo); // ['bar']
foo.hasOwnProperty('bar'); // true

But more importantly, if you serialize the object with JSON.stringify, properties with undefined will be omitted:

1
JSON.stringify({ bar: undefined }); // '{}'

According to JSON specification, a value can be a string in double quotes, or a number, or true or false or null, or an object or an array. undefined is not a valid JSON value.

null is fine:

1
JSON.stringify({ bar: null }); // '{"bar":null}'

So, for the best practice, avoid assigning undefined to a property of an object. Use null to indicate the expected property without a value. This will increase portability when using JSON to serialize.

Split a Large JSON file into Smaller Pieces

In the previous post, I have written about how to split a large JSON file into multiple parts, but that was limited to the default behavior of mongoexport, where each line in the output file represents a JSON string. If you have to deal with a large JSON file, such as the one generated with --jsonArray option in mongoexport, you can to parse the file incrementally or streaming.

I have downloaded a large JOSN data set (about 144MB) from Data.gov. If you try to read the entire data set into memory:

> var json = require('./data.json')
Killed

The process is not able to handle it. Use streaming is necessary. And luckily, our command line JSON processing tool, jq, supports streaming.

The parts we are interested are encapsulated in an array under data property of the data set. We are going to split each element of the array into its own file.

Don’t try to use -f option in jq to read file from the command line, it will read everything into a memory. Instead, do cat data.json | jq.

$ mkdir parts
$ cat data.json | jq -c -M '.data[]' | \
  while read line; do echo $line > parts/$(date +%s%N).json; done

The entire data set is piped into jq to filter and compress each array element. Each element is printed in one line, and each line is saved into its own JSON file by using UNIX timestamp plus nanosecond as the filename. All pieces are saved into parts/ directory.

But there is one problem with embedded JSON string, which has to do with echo, due to backslash. For example, if echoing the following string:

{"name":"{\"first\":\"Foo\",\"last\":\"Foo\"}","username":"foo","id":1}

It will be printed as an invalid JSON:

{"name":"{"first":"Foo","last":"Foo"}","username":"foo","id":1}

Backslashes are stripped. To fix this problem, we can simply double backslash:

$ cat data.json | jq -c -M '.data[]' | sed 's/\\"/\\\\"/g' | \
  while read line; do echo $line > parts/$(date +%s%N).json; done

You can even try curl the remote JSON file instead using cat from the downloaded file. But you might want to try with a smaller file first, because, with my slow machine, it took me nearly an hour to finish splitting into 678,733 parts:

real    49m35.780s
user    2m42.888s
sys     6m48.048s

To take it a little bit further, the next step is to decide how many lines or array elements to write into a single file.

Command Line JSON Processing

What is the best command line tool to process JSON?

Hmm… Okay, let’s try different command line JSON processing tools with the following use case to decide which one is the best to use.

Here is the use case: JSON | filter | shell. A program outputs JSON data, pipes into a JSON command line processing tool to filter data, and then send to a shell command to do more work.

Here is a snippet of sample JSON data:

1
2
3
4
5
6
7
8
9
10
11
12
[
{
"id": 1,
"username": "foo",
"name": "Foo Foo"
},
{
"id": 2,
"username": "bar",
"name": "Bar Bar"
}
]

Or in one-liner data.json:

1
[{"id":1,"username":"foo","name":"Foo Foo"},{"id":2,"username":"bar","name":"Bar Bar"}]

The command line JSON processor should filter each element of the array and convert it into its own line:

1
2
{"name":"Foo Foo"}
{"name":"Bar Bar"}

The result will be piped as the input line by line into a shell script echo.bash:

1
2
3
4
5
#!/usr/bin/env bash
while read line; do
echo "ECHO: '"$line"'"
done

The final output should be:

1
2
ECHO: '{"name":"Foo Foo"}'
ECHO: '{"name":"Bar Bar"}'

Custom Solution

Before start looking for existing tools, let’s see how difficult it is to write a custom solution.

1
2
3
4
5
6
7
8
9
10
11
12
13
// Filter and convert array element into its own line.
var rl = require('readline').createInterface({
input : process.stdin,
output: process.stdout,
});
rl.on('line', function (line) {
JSON.parse(line).forEach(function (item) {
console.log('{"name":"' + item.name + '"}');
});
}).on('close', function () {
// Shh...
});

Perform a test run:

1
2
3
$ cat data.json | node filter.js | bash echo.bash
ECHO: '{"name":"Foo Foo"}'
ECHO: '{"name":"Bar Bar"}'

Well, it works. In essence, we are writing a simple JSON parser. Unless you want to keep the footprint small, you don’t want to write another JSON parser. And why bother to reinvent the wheel? Let’s start look at the existing solutions.

Node Modules

Let’s start with the tools from NPM registry:

$ npm search json command

Here are a few candidates that appears to be matching from the description:

  • jku - Jku is a command-line tool to filter and/or modifiy a JSON stream. It is heavily inspired by jq. (2 stars and not active, last update 8 months ago).
  • json or json-command - JSON command line procesing toolkit. (122 stars and 14 forks, last update 9 months ago)
  • jutil - Command-line utilities for manipulating JSON. (88 stars and 2 forks, last update more than 2 years ago)

Not a lot of choice, and modules are not active. This might be that because there is already a really good solution, jq, which has 2493 stars and 145 forks, and the last update was 6 days ago.

jq

jq is like sed for JSON data - you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text. - jq

Instead of NPM install, do:

$ sudo apt-get -y install jq

Since we don’t need color or prettified, just line by line. So, here is the command chain:

1
2
3
$ cat data.json | jq -c -M '.[] | {name}' | bash echo.bash
ECHO: '{"name":"Foo Foo"}'
ECHO: '{"name":"Bar Bar"}'

jq can do much more than just the example just shown. It has zero runtime dependencies, and flexible to deal with not just array but object as well.

Conclusion

jq is clearly the winner here, with the least dependency, the most functionality and more popularity, as well as a comprehensive documentation.

Split JSON File into Multiple Parts

mongoexport allows us to export documents from MongoDB into a JSON file:

$ mongoexport -d mydb -c mycollection -o myfile.json --jsonArray

The --jsonArray option writes the entire content of the export as a single JSON array. Sometimes, the size of individual file is quite large, we need to break it down into small pieces, because most web servers will have a limit on how much data can be submitted at once. However, with the option, the entire document is a single line. Any line processing command cannot be easily used. (If you are looking to do so, you can use jq to split a large JSON file into smaller pieces.) We can omit the option (default behavior) and have the export utility to dump it one document at a time. The entire exported JSON file is technically not in correct JSON format, but each line, which represents a MongoDB document, is valid JSON, and can be used to do some command line processing.

To break a large file into many smaller pieces, we can use split command:

$ split -l 10 data.json

The -l or --lines option limits each file with a maximum of 10 lines.

Another way we can use -C or --line-bytes option to put at most 1k bytes of lines per output file:

$ split -d -a 3 -C 1k data.json

One thing needs to make sure is that the size of each line is no more than the maximum size specified by the option, otherwise, partial lines will be generated.

It is good to keep all those parts in their own directory:

$ mkdir pieces && cd pieces && split -d -a 3 -C 1k ../data.json && cd ..

Unless breaking the file into one line at a time, otherwise, we need to convert individual JSON file into correct JSON format:

$ find pieces/* -exec sh -c \
> "awk 'BEGIN{l=\"[\"}{print l;l=\$0\",\"}END{print\$0\"\n]\"}' \
> {} > {}.json && rm {}" \;

The output JSON file will contains an array of MongoDB documents. The main idea of the AWK script is to print out previous line as it reads the current line. BEGIN { l = "[" } defines the first line as the opening square bracket, and the END { print $0"\n" } prints the last line of the file and the closing square bracket.

There bounds to be a better way. Just need to keep looking.