Get Started with the Cut Command of GNU Coreutils

Part of GNU Coreutils, the cut command, like paste and join, operates on fields. It prints selected parts (fields or sections) of each input line.

Things to keep in mind:

  • Works on a text stream, line by line
  • Prints selected sections/parts/fields of each line
  • One of the -c, -b, or -f options must be used
  • The default delimiter is TAB
  • The delimiter must be a single character
  • Consecutive delimiters are not merged; consolidate them beforehand
  • Can cut multiple fields or a range of fields

Generate sample data and save it in the file tab.txt:

$ file=tab.txt && \
  for i in $(seq 1 9); do \
    for j in $(seq 1 9); do \
      if [ "$j" == "9" ]; then \
        echo -e "$i$j"; \
      else \
        echo -en "$i$j\t"; \
      fi; \
    done; \
  done > $file && cat $file
11 12 13 14 15 16 17 18 19
21 22 23 24 25 26 27 28 29
31 32 33 34 35 36 37 38 39
41 42 43 44 45 46 47 48 49
51 52 53 54 55 56 57 58 59
61 62 63 64 65 66 67 68 69
71 72 73 74 75 76 77 78 79
81 82 83 84 85 86 87 88 89
91 92 93 94 95 96 97 98 99

Fields on each line are separated by a tab character.

There is a required option:

$ cut tab.txt
cut: you must specify a list of bytes, characters, or fields
Try 'cut --help' for more information.

These options are:

  • bytes: -b or --bytes=LIST
  • characters: -c or --characters=LIST
  • fields: -f or --fields=LIST

This is a bit odd: an option is supposed to be optional, so a default might be expected. But let’s focus on the most commonly used one: fields.

Cut the first and the ninth fields with the default delimiter TAB:

$ cut -f 1 tab.txt
11
21
31
41
51
61
71
81
91
$ cut -f 9 tab.txt
19
29
39
49
59
69
79
89
99

Use space as the delimiter:

$ cp tab.txt space.txt && sed -i 's/\t/ /g' $_ && cat $_
11 12 13 14 15 16 17 18 19
21 22 23 24 25 26 27 28 29
31 32 33 34 35 36 37 38 39
41 42 43 44 45 46 47 48 49
51 52 53 54 55 56 57 58 59
61 62 63 64 65 66 67 68 69
71 72 73 74 75 76 77 78 79
81 82 83 84 85 86 87 88 89
91 92 93 94 95 96 97 98 99

We must specify a different delimiter via the -d or --delimiter=DELIM option:

$ cut -f 1 $_
11 12 13 14 15 16 17 18 19
21 22 23 24 25 26 27 28 29
31 32 33 34 35 36 37 38 39
41 42 43 44 45 46 47 48 49
51 52 53 54 55 56 57 58 59
61 62 63 64 65 66 67 68 69
71 72 73 74 75 76 77 78 79
81 82 83 84 85 86 87 88 89
91 92 93 94 95 96 97 98 99
$ cut -f 1 -d ' ' $_
11
21
31
41
51
61
71
81
91

The delimiter must be a single character:

$ cut -f 1 -d '\s' $_
cut: the delimiter must be a single character
Try 'cut --help' for more information.
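
The delimiter cannot be a regular expression either. If a multi-character or regex delimiter is needed, cut is the wrong tool; awk works, since its -F option accepts a regular expression. A sketch against space.txt:

$ awk -F'[[:space:]]+' '{ print $1 }' space.txt
11
21
31
41
51
61
71
81
91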

A file with mixed delimiters (tab and space):

$ cp tab.txt mixed.txt && sed -i 's/\(9.\)\t/\1 /g' $_ && cat $_
11 12 13 14 15 16 17 18 19
21 22 23 24 25 26 27 28 29
31 32 33 34 35 36 37 38 39
41 42 43 44 45 46 47 48 49
51 52 53 54 55 56 57 58 59
61 62 63 64 65 66 67 68 69
71 72 73 74 75 76 77 78 79
81 82 83 84 85 86 87 88 89
91 92 93 94 95 96 97 98 99

Cut will print any line that contains no delimiter character in its entirety:

$ cut -f 1 mixed.txt
11
21
31
41
51
61
71
81
91 92 93 94 95 96 97 98 99

Use the -s or --only-delimited option to omit those lines:

$ cut -sf 1 mixed.txt
11
21
31
41
51
61
71
81

But the better approach is to cleanse the data beforehand.
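
For instance, normalize the delimiters before cutting. A sketch for this mixed file, converting the stray spaces to tabs on the fly:

$ sed 's/ /\t/g' mixed.txt | cut -f 1
11
21
31
41
51
61
71
81
91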

What about consecutive TAB characters in the file?

$ sed -i 's/\(1.\)\t/\1\t\t/' mixed.txt && cat $_
11 12 13 14 15 16 17 18 19
21 22 23 24 25 26 27 28 29
31 32 33 34 35 36 37 38 39
41 42 43 44 45 46 47 48 49
51 52 53 54 55 56 57 58 59
61 62 63 64 65 66 67 68 69
71 72 73 74 75 76 77 78 79
81 82 83 84 85 86 87 88 89
91 92 93 94 95 96 97 98 99

An empty field is still a field. The first row’s second field is now empty, so cut prints an empty line for it (and -s drops the space-delimited last row):

$ cut -sf 2 mixed.txt

22
32
42
52
62
72
82

The drawback, therefore, is that cut cannot handle multiple delimiters sticking together. Data cleansing must be performed to reduce consecutive delimiters into a single one:

$ sed -i 's/\t\+/\t/g' mixed.txt && cat $_
11 12 13 14 15 16 17 18 19
21 22 23 24 25 26 27 28 29
31 32 33 34 35 36 37 38 39
41 42 43 44 45 46 47 48 49
51 52 53 54 55 56 57 58 59
61 62 63 64 65 66 67 68 69
71 72 73 74 75 76 77 78 79
81 82 83 84 85 86 87 88 89
91 92 93 94 95 96 97 98 99
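
As an aside, tr -s (squeeze) achieves the same consolidation by replacing each run of repeated tabs with a single tab; the cleaned.txt name below is just an example:

$ tr -s '\t' < mixed.txt > cleaned.txt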

Multiple fields can be cut:

$ cut -f 1,3,5,7,9 tab.txt
11 13 15 17 19
21 23 25 27 29
31 33 35 37 39
41 43 45 47 49
51 53 55 57 59
61 63 65 67 69
71 73 75 77 79
81 83 85 87 89
91 93 95 97 99

Cut a range:

$ cut -f 3-5 tab.txt
13 14 15
23 24 25
33 34 35
43 44 45
53 54 55
63 64 65
73 74 75
83 84 85
93 94 95

Cut up to or from a field:

$ cut -f -3 tab.txt
11 12 13
21 22 23
31 32 33
41 42 43
51 52 53
61 62 63
71 72 73
81 82 83
91 92 93
$ cut -f 7- tab.txt
17 18 19
27 28 29
37 38 39
47 48 49
57 58 59
67 68 69
77 78 79
87 88 89
97 98 99

When cutting multiple fields, the output fields are separated by the same delimiter used for input (given by -d, or TAB by default). One way to change the output delimiter is to pipe to another program:

$ cut -f 3-5 tab.txt | sed 's/\t/ /g'
13 14 15
23 24 25
33 34 35
43 44 45
53 54 55
63 64 65
73 74 75
83 84 85
93 94 95
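
GNU cut can also rewrite the delimiter by itself via the --output-delimiter=STRING option, which avoids the extra pipe:

$ cut -f 3-5 --output-delimiter=' ' tab.txt
13 14 15
23 24 25
33 34 35
43 44 45
53 54 55
63 64 65
73 74 75
83 84 85
93 94 95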

Node.js Crypto Starters

Starters for the most common use cases.

Create an MD5 hex hash:

require('crypto').createHash('md5').update(data, 'utf8').digest('hex');

Create a SHA256 base64 hash:

require('crypto').createHash('sha256').update(data, 'utf8').digest('base64');

The digest representation varies with the output encoding:

> data = 'secret'
'secret'
> require('crypto').createHash('sha256').update(data, 'utf8').digest('hex');
'2bb80d537b1da3e38bd30361aa855686bde0eacd7162fef6a25fe97bf527a25b'
> require('crypto').createHash('sha256').update(data, 'utf8').digest('base64');
'K7gNU3sdo+OL0wNhqoVWhr3g6s1xYv72ol/pe/Unols='
> require('crypto').createHash('sha256').update(data, 'utf8').digest('binary');
'+¸\rS{\u001d£ãÓ\u0003aªV½àêÍqbþö¢_é{õ\'¢['
> require('crypto').createHash('sha256').update(data, 'utf8').digest();
<Buffer 2b b8 0d 53 7b 1d a3 e3 8b d3 03 61 aa 85 56 86 bd e0 ea cd 71 62 fe f6 a2 5f e9 7b f5 27 a2 5b>
> require('crypto').createHash('sha256').update(data, 'utf8').digest().toString();
'+�\rS{\u001d����\u0003a��V�����qb���_�{�\'�['

Both hex and base64 are commonly used.

Specify an input encoding when updating with string data; otherwise, the default may not be what you expect:

> data = '秘密'
'秘密'
> require('crypto').createHash('sha256').update(data, 'ascii').digest('hex');
'7588bc51fd0ff10db9c66549e5cb7969b9f6b7cf3ccee080137a8eb1aa06e718'
> require('crypto').createHash('sha256').update(data, 'utf8').digest('hex');
'062a2931da683a9897c5a0b597113b1e9fd0d5bfb63e2a5d7c88b724f7f55c02'
> require('crypto').createHash('sha256').update(data, 'binary').digest('hex');
'7588bc51fd0ff10db9c66549e5cb7969b9f6b7cf3ccee080137a8eb1aa06e718'
> require('crypto').createHash('sha256').update(data).digest('hex');
'7588bc51fd0ff10db9c66549e5cb7969b9f6b7cf3ccee080137a8eb1aa06e718'

Don’t assume ASCII.

The data parameter must be a string or a buffer; otherwise, an error is thrown:

TypeError: Not a string or buffer
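
A buffer works without an input encoding, since there is nothing to decode. A quick REPL check, reusing the 'secret' example from above (the second call shows the error for a number):

> require('crypto').createHash('sha256').update(new Buffer('secret')).digest('hex');
'2bb80d537b1da3e38bd30361aa855686bde0eacd7162fef6a25fe97bf527a25b'
> require('crypto').createHash('sha256').update(42).digest('hex');
TypeError: Not a string or buffer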

Add a Custom Domain to the Specific Module of Google App Engine Managed VMs

I would like to add a custom domain to a module other than the default one in Google App Engine Managed VMs. For example, I have the following sub-domains:

Custom domain names | SSL support | Record type | Data | Alias
------------------- | ----------- | ----------- | -------------------- | -----
api.example.com | none | CNAME | ghs.googlehosted.com | api
admin.example.com | none | CNAME | ghs.googlehosted.com | admin

listed in the application console:

https://console.developers.google.com/project/[project]/appengine/settings/domains

The domain api.example.com points to the default module, and I would like the other domain, admin.example.com, to point to the admin module. However, by merely adding the custom domains in the settings, both api.example.com and admin.example.com point to the same module: the default one. How, then, to point the custom domain and route its traffic to the admin module? The answer is the dispatch file. But first, we need to understand the concept of modules in Google App Engine.

Google App Engine Module Hierarchy
Source: Google App Engine Docs

The above chart illustrates the architecture of a Google App Engine application:

  • Application: “An App Engine application is made up of one or more modules”. [1]
  • Module: “Each module consists of source code and configuration files. The files used by a module represents a version of the module. When you deploy a module, you always deploy a specific version of the module.” [1] “All module-less apps have been converted to contain a single default module.” [1] Every application has a single default module.
  • Version: “A particular module/version will have one or more instances.”
  • Instance: “Each instance runs its own separate executable.”

Another concept is resource sharing:

  • “App Engine Modules let developers factor large applications into logical components that can share stateful services and communicate in a secure fashion.” [1]
  • “Stateful services (such as Memcache, Datastore, and Task Queues) are shared by all modules in an application.” [1]

Remember that Google Cloud Platform is project based. If there were a web application and a mobile application both making requests to a third application, the API, there should be three projects. In this case, however, api and admin share the same stateful services, such as the same database and file storage, so we should put them in one application as separate modules.

With that in mind, how do we route requests to a specific module? “Every module, version, and instance has its own unique URI (for example, v1.my-module.my-app.appspot.com). Incoming user requests are routed to an instance of a particular module/version according to URL addressing conventions and an optional customized dispatch file.” [1] Since a custom domain is used, we have to use the customized dispatch file: create dispatch.yml to route requests based on URL patterns.

# Dispatch
# ========
---
dispatch:
  - url: '*/favicon.ico'
    module: default
  - url: 'admin.example.com/*'
    module: admin
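
For the routing to work, the admin module must declare its name in its own configuration file. A sketch of the relevant part (runtime and other fields depend on your setup):

# admin module configuration (sketch)
module: admin
runtime: nodejs
vm: true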

The application field is not necessary; if specified, you will get:

WARNING: The [application] field is specified in file [/home/chao/example/dispatch.yml].
This field is not used by gcloud and should be removed.

Glob characters (such as the asterisk) can be used, and up to 10 routing rules are supported.

Deploying the dispatch file is simple:

$ gcloud preview app deploy dispatch.yml

And it should take almost no time. Once it is ready, requests sent to admin.example.com will be routed properly to the admin module.

The First Release Version

When you have a release for the first time, how do you name your release version based on SemVer? 1.0.0, 0.1.0 or 0.0.1?

I like to start off with 0.1.0, because version 0.0.0 means there is nothing. Going from nothing to something is a breaking change, hence version 0.1.0.

Use a Single Dockerfile

I was planning to build two different Docker images: one for production and one for test. However, the more I coded, the more frustrated I became. I want something simple, and more images mean more complexity:

  1. Having multiple images means more configuration files to manage and update.
  2. More things to do when working with third-party tools, for example, Google App Engine Managed VMs deployment will only recognize the application root level Dockerfile out of the box.
  3. Steeper learning curve for new developers.

Keeping the application structure simple is important. If you need multiple Dockerfiles, your application is probably too complex. The difference between npm install --production and npm install isn’t a lot, but a single Dockerfile saves the time and effort of managing several. There is no reason to have more than one Dockerfile. Just use a different command, such as npm test, when running tests.
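
A minimal sketch of such a single Dockerfile for a Node.js application (base image tag and port are assumptions):

# One image for both production and test
FROM node:0.12
COPY . /app
WORKDIR /app
# Install all dependencies; tests need devDependencies too
RUN npm install
EXPOSE 8080
# Default command runs the app; override with `npm test` to run tests
CMD ["npm", "start"]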

Wait and then Grab the Lock and Update

When provisioning a new machine in Google Compute Engine with Vagrant, I spotted the following error messages when updating the system:

Could not get lock /var/lib/apt/lists/lock - open (11: Resource temporarily unavailable)
E: Unable to lock directory /var/lib/apt/lists/

It appears that I had simply lost the race to grab the lock for the APT update. Without the latest index, I am not able to upgrade the packages. A simple solution is to wait it out:

#!/usr/bin/env bash
#
# System Update
# =============
apt-get update
while [ "$?" != "0" ]; do
sleep 5
apt-get update
done
apt-get -y dist-upgrade
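
The same retry reads more idiomatically with an until loop, which keeps trying until apt-get update succeeds:

#!/usr/bin/env bash
#
# System Update (until variant)
# =============================
until apt-get update; do
  sleep 5
done
apt-get -y dist-upgrade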

Scaffold an Explicit Document to Avoid Error When Creating Indexes in Google Cloud Datastore

When working with Google Cloud Datastore, I would like to design each kind (collection or table) with its own schema and index files:

$ tree
.
├── index.yml
└── schema.yml

But not every kind has to have something in the index.yml file. So, what to do? Will the index creation command accept an empty, zero-byte index.yml file? Let’s give it a try:

$ gcloud preview datastore create-indexes index.yml
ERROR: (gcloud.preview.datastore.create-indexes) An error occurred while parsing file: [/home/chao/kind/index.yml]
The file is empty

No, it does not like that. Then what is the minimum required? The answer is an explicit document with empty document content:

---

The three dashes form a directives end marker line. YAML uses three dashes to separate directives from document content.

Creating indexes with this file proceeds without errors or warnings:

$ gcloud preview datastore create-indexes index.yml
You are about to update the following configurations:
- myproj/index From: [/home/chao/kind/index.yml]
Do you want to continue (Y/n)?

Therefore, to avoid the error when scaffolding an index.yml file, use an explicit document with empty document content.
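
Later, when the kind does need a composite index, entries simply go under an indexes block in the same file. A sketch with hypothetical kind and property names:

---
indexes:
- kind: Task
  properties:
  - name: done
  - name: priority
    direction: desc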

Google Cloud Datastore Starter: Dataset

Google Cloud Datastore dataset starter, part of a starter collection:

// Google Cloud Datastore Starter: Dataset
// =======================================
'use strict';

// Define a dataset.
var params = {
  projectId: 'MY_PROJECT',
  keyFilename: '/path/to/key/file.json',
  namespace: 'MY_NAMESPACE'
};
var dataset = require('gcloud').datastore.dataset(params);

// Define an entity (including both key and data).
var kind = 'MY_KIND';
var key = dataset.key(kind);
var data = [
  {
    name: 'title',
    value: 'Google Cloud Datastore Starter',
    excludeFromIndexes: false
  }
];
// var data = { title: 'Google Cloud Datastore Starter' }; // Simple version
var entity = {
  key: key,
  data: data
};

// Save a single entity.
dataset.save(entity, function (err) {
  if (err) {
    throw err;
  }
});
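
To read the entity back, the dataset also exposes a get method; a sketch, assuming the gcloud-node API of that era and that the key has gained a generated id after the save:

// Fetch a single entity by key (sketch).
dataset.get(key, function (err, entity) {
  if (err) {
    throw err;
  }
  console.log(entity.data.title);
});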

Understand the Limitation of the String Property Type of Google Cloud Datastore

Google Cloud Datastore supports a variety of data types for property values:

  • Integers
  • Floating-point numbers
  • Strings
  • Dates
  • Binary data

Strings are likely the most frequently used. When working with strings, the most important question to ask is how many bytes can be inserted into the property/field. According to the table listed in properties and value types:

Up to 1500 bytes if property is indexed, up to 1MB otherwise

Let’s give it a try:

// String Property Type
// ====================
//
// Investigate the string property type of Google Cloud Datastore.
//
// Dependencies tested:
//
// - `gcloud@v0.21.0`
// - `lorem-ipsum@1.0.3`
'use strict';

// Define dependencies.
var params = {
  projectId: 'myproj',
  keyFilename: '/path/to/key/file.json',
  namespace: 'test'
};
var dataset = require('gcloud').datastore.dataset(params);
var kind = 'String';
var lorem = require('lorem-ipsum');

// Save an entity to the datastore.
function run(params) {
  var size = params.size;
  var index = params.index;
  var str = lorem({ count: size }).substr(0, size);
  var key = dataset.key(kind);
  var data = [
    {
      name: 'title',
      value: str,
      excludeFromIndexes: !index
    }
  ];
  var entity = {
    key: key,
    data: data
  };
  dataset.save(entity, function (err) {
    console.log(size, new Buffer(str).length, index,
      err ? err.code + ' ' + err.message : '200 OK');
  });
}

// Explanation of fields:
//
// - `size`: Total number of bytes to produce
// - `index`: Whether to index the string field
//
// In-line comment indicates the expected result.
[
  { size: 1500, index: true }, // OK
  { size: 1501, index: false }, // OK
  { size: 1501, index: true }, // Error
  { size: 1024 * 1024 - 92, index: false }, // OK
  { size: 1024 * 1024 - 91, index: false }, // Error
].forEach(run);

Don’t ask me where the 91 or 92 comes from; apparently the limit is somewhere close to 1 MB, but not exact. The expected result of each case is noted in the script’s in-line comments.

Install Docker on Ubuntu Trusty 14.04

Installing Docker on Ubuntu Trusty 14.04 is fairly straightforward:

$ curl -sSL https://get.docker.com/ | sh
$ sudo usermod -aG docker ${USER}
$ exit

Log out, then log back in to verify the installation by running a sample container:

$ docker run hello-world

Done!