jq

Wrapping Line by Line JSON to an Array

Some utilities print out JSON data (object) line by line. Line by line JSON is not really a valid JSON format, need to wrap it in bracket as an array.

For example, here are a few lines of JSON objects:

1
2
3
4
$ for i in $(seq 3); do echo '{"id":'$i'}'; done
{"id":1}
{"id":2}
{"id":3}

To combine them into a single array, we can use jq‘s -s, —slurp option:

1
2
$ for i in $(seq 3); do echo '{"id":'$i'}'; done | jq -cM -s
[{"id":1},{"id":2},{"id":3}]

The option reads the entire input stream into a large array. This is good until the input stream is too large to process, because jq is not producing output while slurping up the input. If the input is line by line JSON, and all you want to do is wrapping it into an array (with or even without the trailing newline), we can simply do the following:

1
2
3
$ for i in $(seq 3); do echo '{"id":'$i'}'; done |\
awk 'BEGIN { printf "["; getline; printf "%s", $0 } { printf ",%s", $0 } END { print "]" }'
[{"id":1},{"id":2},{"id":3}]

This is with the newline printed. And it will solve the problem of a very large input. Data will simply stream out as it comes it. Try it with a forever while loop.

There is another option in jq named —stream, and it seems to be doing the same. But option —slurp overrides —stream, and the option itself is already so complicated.

In conclusion, you can use jq -s to wrap line by line JSON into an array. If you want to be safe, just use the awk example above.

Registering a Domain with Amazon Route 53 via CLI

Register a domain with Amazon Route 53 via CLI is very straightforward, the only complicate thing is to generate the acceptable JSON structure, then somehow enter the JSON string in the command line without meddling from the shell.

The command is:

1
$ aws route53domains register-domain

The documentation can be found from the AWS CLI Command Reference or issuing:

1
$ aws route53domains register-domain help

Most the command line options require simple string values, but some options like --admin-contact requires a structure data, which is complex to enter in the command line. The easiest way is to forget all other options and use --cli-input-json to get everything in JSON.

First let’s generate the skeleton or template to use:

1
$ aws route53domains register-domain --generate-cli-skeleton

That should output something like:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
{
"DomainName": "",
"IdnLangCode": "",
"DurationInYears": 0,
"AutoRenew": true,
"AdminContact": {
"FirstName": "",
"LastName": "",
"ContactType": "",
"OrganizationName": "",
"AddressLine1": "",
"AddressLine2": "",
"City": "",
"State": "",
"CountryCode": "",
"ZipCode": "",
"PhoneNumber": "",
"Email": "",
"Fax": "",
"ExtraParams": [
{
"Name": "",
"Value": ""
}
]
},
"RegistrantContact": {
"FirstName": "",
"LastName": "",
"ContactType": "",
"OrganizationName": "",
"AddressLine1": "",
"AddressLine2": "",
"City": "",
"State": "",
"CountryCode": "",
"ZipCode": "",
"PhoneNumber": "",
"Email": "",
"Fax": "",
"ExtraParams": [
{
"Name": "",
"Value": ""
}
]
},
"TechContact": {
"FirstName": "",
"LastName": "",
"ContactType": "",
"OrganizationName": "",
"AddressLine1": "",
"AddressLine2": "",
"City": "",
"State": "",
"CountryCode": "",
"ZipCode": "",
"PhoneNumber": "",
"Email": "",
"Fax": "",
"ExtraParams": [
{
"Name": "",
"Value": ""
}
]
},
"PrivacyProtectAdminContact": true,
"PrivacyProtectRegistrantContact": true,
"PrivacyProtectTechContact": true
}

Fill in the fields, leave out the ones not needed:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
{
"DomainName": "example.com",
"DurationInYears": 1,
"AutoRenew": true,
"AdminContact": {
"FirstName": "Joe",
"LastName": "Doe",
"ContactType": "PERSON",
"AddressLine1": "101 Main St",
"City": "New York",
"State": "NY",
"CountryCode": "US",
"ZipCode": "10001",
"PhoneNumber": "+1.8888888888",
"Email": "[email protected]"
},
"RegistrantContact": {
"FirstName": "Joe",
"LastName": "Doe",
"ContactType": "PERSON",
"AddressLine1": "101 Main St",
"City": "New York",
"State": "NY",
"CountryCode": "US",
"ZipCode": "10001",
"PhoneNumber": "+1.8888888888",
"Email": "[email protected]"
},
"TechContact": {
"FirstName": "Joe",
"LastName": "Doe",
"ContactType": "PERSON",
"AddressLine1": "101 Main St",
"City": "New York",
"State": "NY",
"CountryCode": "US",
"ZipCode": "10001",
"PhoneNumber": "+1.8888888888",
"Email": "[email protected]"
},
"PrivacyProtectAdminContact": true,
"PrivacyProtectRegistrantContact": true,
"PrivacyProtectTechContact": true
}

Now let’s register by running through jq to clean up extra blanks and end of line characters, and then pipe to xargs command by using new line character as the delimiter instead of a blank by default:

1
2
3
4
5
$ cat input.json | jq -cM . | \
xargs -d '\n' aws route53domains register-domain --cli-input-json
{
"OperationId": "00000000-0000-0000-0000-00000000000"
}

Installing jq from Source

Packages built in both Ubuntu and Debian packages lack behind, therefore, to get the latest version of jq, build from source.

There are a few prerequisites to install:

  • GCC
  • Make
  • Autotools

Both GCC and Make are usually installed if you do development, but not Autotools. Luckily, this is easy to fulfill:

1
$ sudo apt-get install automake

Install from source:

1
2
3
4
$ sudo git clone https://github.com/stedolan/jq.git
$ cd jq
$ sudo git checkout jq-1.5
$ sudo ./configure && sudo make && sudo make install

The installed path is at:

1
2
$ which jq
jq is /usr/local/bin/jq

However, this gives me an unexpected tag:

1
2
$ jq --version
jq-1.5-dirty

Delete All Messages in an Amazon SQS Queue via AWS CLI

Amazon SQS or Simple Queue Service is a fast, reliable, scalable, fully managed message queuing service. There is also AWS CLI or Command Line Interface available to use with the service.

If you have a lot of messages in a queue, this command will show the approximate number:

1
2
3
$ aws sqs get-queue-attributes \
--queue-url $url \
--attribute-names ApproximateNumberOfMessages

Where $url is the URL to the Amazon SQS queue.

There is no command to delete all messages yet, but you can chain a few commands together to make it work:

Create a jq Docker Image with Automated Build

I have created a jq Docker image based on BusyBox with automated builds. BusyBox is really really small in size, so the jq image I have created is also very small, just a little over 6 MB.

Here is the source code and image repositories:

Building Docker image is pretty straight-forward, just follow the instruction:

https://docs.docker.com/userguide/dockerrepos/#automated-builds

I have also created a tag v1.4 in my GitHub repository to match the release of jq binary. This should also be reflected in Docker registry. After couple tries, here is the build details for adding both latest and 1.4 tags:

1
2
3
4
Type Name Dockerfile Location Tag Name
------------------------------------------------
Tag v1.4 / 1.4
Branch master / latest

The first two columns match git branch and tag, and the last column reflects Docker tags. See https://github.com/realguess/docker-jq/tags and https://registry.hub.docker.com/u/realguess/jq/tags/manage/.

This is done by starting an automated build with type of tag instead of branch. Everything is done via the Docker Hub website.

If I pull down this repository:

1
$ docker pull realguess/jq

It should give me two image layers with the two different image IDs, which makes sense, as latest commit usually is not the same as the tagged one.

After a while, the index should be built, and I can search it via:

1
2
3
$ docker search jq
NAME DESCRIPTION STARS OFFICIAL AUTOMATED
realguess/jq 1 [OK]

The image is listed as AUTOMATED, but the description is missing here in the search command.

There are two types of descriptions:

  • Short description
  • Full description

Full description will be generated automatically via README.md file. I thought the short description can also be generated via README-short.txt, however, this is not the case. You can add it in the settings page for example:

https://registry.hub.docker.com/u/realguess/jq/settings/

Automated builds are triggered automatically with GitHub and BitBucket repository. Once any commit is pushed to either repository, a trigger will be sent to Docker Hub, and automated build will start.

Split a Large JSON file into Smaller Pieces

In the previous post, I have written about how to split a large JSON file into multiple parts, but that was limited to the default behavior of mongoexport, where each line in the output file represents a JSON string. If you have to deal with a large JSON file, such as the one generated with --jsonArray option in mongoexport, you can to parse the file incrementally or streaming.

I have downloaded a large JOSN data set (about 144MB) from Data.gov. If you try to read the entire data set into memory:

> var json = require('./data.json')
Killed

The process is not able to handle it. Use streaming is necessary. And luckily, our command line JSON processing tool, jq, supports streaming.

The parts we are interested are encapsulated in an array under data property of the data set. We are going to split each element of the array into its own file.

Don’t try to use -f option in jq to read file from the command line, it will read everything into a memory. Instead, do cat data.json | jq.

$ mkdir parts
$ cat data.json | jq -c -M '.data[]' | \
  while read line; do echo $line > parts/$(date +%s%N).json; done

The entire data set is piped into jq to filter and compress each array element. Each element is printed in one line, and each line is saved into its own JSON file by using UNIX timestamp plus nanosecond as the filename. All pieces are saved into parts/ directory.

But there is one problem with embedded JSON string, which has to do with echo, due to backslash. For example, if echoing the following string:

{"name":"{\"first\":\"Foo\",\"last\":\"Foo\"}","username":"foo","id":1}

It will be printed as an invalid JSON:

{"name":"{"first":"Foo","last":"Foo"}","username":"foo","id":1}

Backslashes are stripped. To fix this problem, we can simply double backslash:

$ cat data.json | jq -c -M '.data[]' | sed 's/\\"/\\\\"/g' | \
  while read line; do echo $line > parts/$(date +%s%N).json; done

You can even try curl the remote JSON file instead using cat from the downloaded file. But you might want to try with a smaller file first, because, with my slow machine, it took me nearly an hour to finish splitting into 678,733 parts:

real    49m35.780s
user    2m42.888s
sys     6m48.048s

To take it a little bit further, the next step is to decide how many lines or array elements to write into a single file.

Command Line JSON Processing

What is the best command line tool to process JSON?

Hmm… Okay, let’s try different command line JSON processing tools with the following use case to decide which one is the best to use.

Here is the use case: JSON | filter | shell. A program outputs JSON data, pipes into a JSON command line processing tool to filter data, and then send to a shell command to do more work.

Here is a snippet of sample JSON data:

1
2
3
4
5
6
7
8
9
10
11
12
[
{
"id": 1,
"username": "foo",
"name": "Foo Foo"
},
{
"id": 2,
"username": "bar",
"name": "Bar Bar"
}
]

Or in one-liner data.json:

1
[{"id":1,"username":"foo","name":"Foo Foo"},{"id":2,"username":"bar","name":"Bar Bar"}]

The command line JSON processor should filter each element of the array and convert it into its own line:

1
2
{"name":"Foo Foo"}
{"name":"Bar Bar"}

The result will be piped as the input line by line into a shell script echo.bash:

1
2
3
4
5
#!/usr/bin/env bash
while read line; do
echo "ECHO: '"$line"'"
done

The final output should be:

1
2
ECHO: '{"name":"Foo Foo"}'
ECHO: '{"name":"Bar Bar"}'

Custom Solution

Before start looking for existing tools, let’s see how difficult it is to write a custom solution.

1
2
3
4
5
6
7
8
9
10
11
12
13
// Filter and convert array element into its own line.
var rl = require('readline').createInterface({
input : process.stdin,
output: process.stdout,
});
rl.on('line', function (line) {
JSON.parse(line).forEach(function (item) {
console.log('{"name":"' + item.name + '"}');
});
}).on('close', function () {
// Shh...
});

Perform a test run:

1
2
3
$ cat data.json | node filter.js | bash echo.bash
ECHO: '{"name":"Foo Foo"}'
ECHO: '{"name":"Bar Bar"}'

Well, it works. In essence, we are writing a simple JSON parser. Unless you want to keep the footprint small, you don’t want to write another JSON parser. And why bother to reinvent the wheel? Let’s start look at the existing solutions.

Node Modules

Let’s start with the tools from NPM registry:

$ npm search json command

Here are a few candidates that appears to be matching from the description:

  • jku - Jku is a command-line tool to filter and/or modifiy a JSON stream. It is heavily inspired by jq. (2 stars and not active, last update 8 months ago).
  • json or json-command - JSON command line procesing toolkit. (122 stars and 14 forks, last update 9 months ago)
  • jutil - Command-line utilities for manipulating JSON. (88 stars and 2 forks, last update more than 2 years ago)

Not a lot of choice, and modules are not active. This might be that because there is already a really good solution, jq, which has 2493 stars and 145 forks, and the last update was 6 days ago.

jq

jq is like sed for JSON data - you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text. - jq

Instead of NPM install, do:

$ sudo apt-get -y install jq

Since we don’t need color or prettified, just line by line. So, here is the command chain:

1
2
3
$ cat data.json | jq -c -M '.[] | {name}' | bash echo.bash
ECHO: '{"name":"Foo Foo"}'
ECHO: '{"name":"Bar Bar"}'

jq can do much more than just the example just shown. It has zero runtime dependencies, and flexible to deal with not just array but object as well.

Conclusion

jq is clearly the winner here, with the least dependency, the most functionality and more popularity, as well as a comprehensive documentation.