
Wrapping Line by Line JSON to an Array

Some utilities print JSON objects line by line. Line-by-line JSON is not a valid JSON document by itself; we need to wrap the lines in brackets to form an array.

For example, here are a few lines of JSON objects:

$ for i in $(seq 3); do echo '{"id":'$i'}'; done
{"id":1}
{"id":2}
{"id":3}

To combine them into a single array, we can use jq's -s, --slurp option:

$ for i in $(seq 3); do echo '{"id":'$i'}'; done | jq -cM -s
[{"id":1},{"id":2},{"id":3}]

The option reads the entire input stream into a single large array. This works until the input stream becomes too large to handle, because jq produces no output while it is slurping up the input. If the input is line-by-line JSON, and all you want to do is wrap it into an array (with or without a trailing newline), we can simply do the following:

$ for i in $(seq 3); do echo '{"id":'$i'}'; done |\
awk 'BEGIN { printf "["; getline; printf "%s", $0 } { printf ",%s", $0 } END { print "]" }'
[{"id":1},{"id":2},{"id":3}]

This version prints the trailing newline. It also solves the problem of a very large input: data simply streams out as it comes in. Try it with an infinite while loop.
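
For example, here is a sketch (not from the original post) with a never-ending generator: output keeps arriving in buffered bursts instead of being withheld until the end, as it would be with jq -s. Interrupt it with Ctrl-C:

$ while true; do echo '{"id":1}'; done |\
awk 'BEGIN { printf "["; getline; printf "%s", $0 } { printf ",%s", $0 } END { print "]" }'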

There is another option in jq named --stream, and it seems to do the same thing. But --slurp overrides --stream, and the --stream option itself is already quite complicated.
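
To get a sense of why (a quick sketch, not from the original post): --stream does not emit whole objects at all, but [path, value] events that you would have to reassemble yourself. The output should look something like this:

$ echo '{"id":1}' | jq -cM --stream '.'
[["id"],1]
[["id"]]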

In conclusion, you can use jq -s to wrap line-by-line JSON into an array. If the input might be too large to slurp, just use the awk example above.

Streaming HTTP Request Directly to Response in Node.js

This is a Node.js starting script that streams the HTTP request directly into the response:

require('http').createServer((req, res) => {
  req.pipe(res); // Pipe request directly to response
}).listen(3000);

It behaves almost like an echo: you get back whatever you send. For example, use HTTPie to make a request to the above server:

$ echo foo | http --verbose --stream :3000 Content-Type:text/plain
POST / HTTP/1.1
Accept: application/json, */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Content-Length: 4
Host: localhost:3000
User-Agent: HTTPie/0.9.6
Content-Type: text/plain

foo

HTTP/1.1 200 OK
Connection: keep-alive
Transfer-Encoding: chunked

foo

We can also set the Content-Type response header to mirror the request's media type, so the client knows what the assembled chunks represent:

require('http').createServer((req, res) => {
  // Set the response header before the piped body starts flowing
  if (req.headers['content-type']) {
    res.setHeader('Content-Type', req.headers['content-type']);
  }
  req.pipe(res); // Pipe request directly to response
}).listen(3000);

The response should have the Content-Type field as below:

HTTP/1.1 200 OK
Connection: keep-alive
Content-Type: text/plain
Transfer-Encoding: chunked

foo

Notice that instead of the usual Content-Length header in the response, we have Transfer-Encoding: chunked. Chunked is the default transfer encoding for Node.js HTTP responses:

Sending a ‘Content-length’ header will disable the default chunked encoding.[^1]
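
For comparison, here is a small variation (a sketch, not part of the original post) that copies the request's Content-Length onto the response; with the length known up front, Node.js does not use chunked encoding:

require('http').createServer((req, res) => {
  // With a known Content-Length, Node.js skips chunked transfer encoding.
  // Assumes the client sent a Content-Length header (HTTPie and curl do).
  if (req.headers['content-length']) {
    res.setHeader('Content-Length', req.headers['content-length']);
  }
  req.pipe(res);
}).listen(3000);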

About transfer encoding:

Chunked transfer encoding is a data transfer mechanism in version 1.1 of the Hypertext Transfer Protocol (HTTP) in which data is sent in a series of “chunks”. It uses the Transfer-Encoding HTTP header in place of the Content-Length header, which the earlier version of the protocol would otherwise require. Because the Content-Length header is not used, the sender does not need to know the length of the content before it starts transmitting a response to the receiver. Senders can begin transmitting dynamically-generated content before knowing the total size of that content. … The size of each chunk is sent right before the chunk itself so that the receiver can tell when it has finished receiving data for that chunk. The data transfer is terminated by a final chunk of length zero.[^2]
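
You can see this framing on the wire by asking curl not to decode the transfer encoding (a quick check against the echo server above, not from the original post): each chunk arrives preceded by its size in hex, and a final zero-length chunk terminates the body.

$ echo foo | curl -s --raw --data-binary @- http://localhost:3000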

With the above starting script, you can now attach transform streams to manipulate the request and stream the result back in a chunked response.
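
As a minimal sketch of that idea (not part of the original script; the upcase helper is made up for illustration), here is a transform stream that uppercases the request body on its way back out:

const { Transform } = require('stream');

// Build a fresh transform per request that uppercases each chunk
const upcase = () => new Transform({
  transform(chunk, encoding, callback) {
    callback(null, chunk.toString().toUpperCase());
  }
});

require('http').createServer((req, res) => {
  req.pipe(upcase()).pipe(res);
}).listen(3000);

Sending echo foo through it as before should come back as FOO, still with Transfer-Encoding: chunked.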

Settings:

$ node --version
v6.3.1
$ http --version
0.9.6

[^1]: HTTP, Node.js API Docs

[^2]: Chunked transfer encoding, Wikipedia

Split a Large JSON File into Smaller Pieces

In the previous post, I wrote about how to split a large JSON file into multiple parts, but that was limited to the default behavior of mongoexport, where each line in the output file is a JSON string. If you have to deal with a single large JSON document, such as one generated with the --jsonArray option in mongoexport, you need to parse it incrementally, as a stream.

I have downloaded a large JSON data set (about 144 MB) from Data.gov. If you try to read the entire data set into memory:

> var json = require('./data.json')
Killed

The process is not able to handle it; streaming is necessary. Luckily, our command-line JSON processing tool, jq, supports streaming.

The parts we are interested in are encapsulated in an array under the data property of the data set. We are going to split each element of that array into its own file.

Don't use jq's -f option to try to read the data file (that option reads the filter from a file, not the data). Just pipe the data in: cat data.json | jq.

$ mkdir parts
$ cat data.json | jq -c -M '.data[]' | \
  while read line; do echo $line > parts/$(date +%s%N).json; done

The entire data set is piped into jq to filter and compact each array element. Each element is printed on its own line, and each line is saved into its own JSON file, named with the Unix timestamp in nanoseconds. All pieces are saved into the parts/ directory.

But there is one problem with embedded JSON strings: the shell strips the backslashes (read without -r interprets them, and so does echo in some shells). For example, when echoing the following string:

{"name":"{\"first\":\"Foo\",\"last\":\"Foo\"}","username":"foo","id":1}

It will be printed as invalid JSON:

{"name":"{"first":"Foo","last":"Foo"}","username":"foo","id":1}

The backslashes are stripped. To fix this problem, we can simply double the backslashes before they reach the loop:

$ cat data.json | jq -c -M '.data[]' | sed 's/\\"/\\\\"/g' | \
  while read line; do echo $line > parts/$(date +%s%N).json; done
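
Alternatively (a sketch, not from the original post), you can keep the backslashes intact without the sed step by telling read not to interpret them and by using printf instead of echo:

$ cat data.json | jq -c -M '.data[]' | \
  while IFS= read -r line; do printf '%s\n' "$line" > parts/$(date +%s%N).json; done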

You can even curl the remote JSON file instead of using cat on the downloaded copy. But you might want to try a smaller file first, because, on my slow machine, it took nearly an hour to finish splitting into 678,733 parts:

real    49m35.780s
user    2m42.888s
sys     6m48.048s

To take it a little bit further, the next step is to decide how many lines or array elements to write into a single file.
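
One way to do that (a sketch, not from the original post, with 1,000 elements per file as an arbitrary choice) is to let split group the lines instead of writing one file per line; each resulting part is then line-by-line JSON, which can be wrapped into an array with the awk one-liner from earlier if needed:

$ cat data.json | jq -c -M '.data[]' | split -l 1000 - parts/part_

Because split copies each line verbatim, the backslash problem above does not come up here.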