Archives

Get Started with the Cut Command of GNU Coreutils

Part of GNU Coreutils, the cut command, like paste and join, operates on fields. It prints selected parts (fields or sections) of lines.

Things to keep in mind:

  • Works on a text stream line by line
  • Prints sections/parts/fields
  • One of the -b, -c, or -f options must be used
  • The default delimiter is TAB
  • The delimiter must be a single character
  • Consecutive delimiters are not merged and need to be consolidated into one beforehand
  • Multiple fields or a range of fields can be cut

Generate sample data and save it in the file tab.txt:

$ file=tab.txt && \
for i in $(seq 1 9); do \
  for j in $(seq 1 9); do \
    if [ "$j" == "9" ]; then \
      echo -e "$i$j"; \
    else \
      echo -en "$i$j\t"; \
    fi; \
  done; \
done > $file && cat $file
11 12 13 14 15 16 17 18 19
21 22 23 24 25 26 27 28 29
31 32 33 34 35 36 37 38 39
41 42 43 44 45 46 47 48 49
51 52 53 54 55 56 57 58 59
61 62 63 64 65 66 67 68 69
71 72 73 74 75 76 77 78 79
81 82 83 84 85 86 87 88 89
91 92 93 94 95 96 97 98 99

Fields on each line are separated by a tab character.
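
To verify, cat -A from GNU coreutils marks each TAB as ^I and each line end with a $ sign; a quick check on the first line:

$ head -1 tab.txt | cat -A
11^I12^I13^I14^I15^I16^I17^I18^I19$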

There is a required option; running cut without one produces an error:

$ cut tab.txt
cut: you must specify a list of bytes, characters, or fields
Try 'cut --help' for more information.

These options are:

  • bytes: -b or --bytes=LIST
  • characters: -c or --characters=LIST
  • fields: -f or --fields=LIST
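
Before moving on to fields, a quick sketch of the byte and character variants on the same data (for pure-ASCII input like ours, -b and -c behave identically); only the first three lines are shown:

$ cut -c 1-2 tab.txt | head -3
11
21
31
$ cut -b 4-5 tab.txt | head -3
12
22
32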

This is a bit odd: an option is supposed to be optional, so a default could have been provided. But let’s focus on the most commonly used one: fields.

Cut the first and the ninth fields with the default delimiter TAB:

$ cut -f 1 tab.txt
11
21
31
41
51
61
71
81
91
$ cut -f 9 tab.txt
19
29
39
49
59
69
79
89
99

Use space as the delimiter:

$ cp tab.txt space.txt && sed -i 's/\t/ /g' $_ && cat $_
11 12 13 14 15 16 17 18 19
21 22 23 24 25 26 27 28 29
31 32 33 34 35 36 37 38 39
41 42 43 44 45 46 47 48 49
51 52 53 54 55 56 57 58 59
61 62 63 64 65 66 67 68 69
71 72 73 74 75 76 77 78 79
81 82 83 84 85 86 87 88 89
91 92 93 94 95 96 97 98 99

We must specify a different delimiter via the -d or --delimiter=DELIM option:

$ cut -f 1 $_
11 12 13 14 15 16 17 18 19
21 22 23 24 25 26 27 28 29
31 32 33 34 35 36 37 38 39
41 42 43 44 45 46 47 48 49
51 52 53 54 55 56 57 58 59
61 62 63 64 65 66 67 68 69
71 72 73 74 75 76 77 78 79
81 82 83 84 85 86 87 88 89
91 92 93 94 95 96 97 98 99
$ cut -f 1 -d ' ' $_
11
21
31
41
51
61
71
81
91

The delimiter must be a single character:

$ cut -f 1 -d '\s' $_
cut: the delimiter must be a single character
Try 'cut --help' for more information.

A file containing mixed delimiters (tab and space):

$ cp tab.txt mixed.txt && sed -i 's/\(9.\)\t/\1 /g' $_ && cat $_
11 12 13 14 15 16 17 18 19
21 22 23 24 25 26 27 28 29
31 32 33 34 35 36 37 38 39
41 42 43 44 45 46 47 48 49
51 52 53 54 55 56 57 58 59
61 62 63 64 65 66 67 68 69
71 72 73 74 75 76 77 78 79
81 82 83 84 85 86 87 88 89
91 92 93 94 95 96 97 98 99

By default, cut will print in full any line that contains no delimiter character:

$ cut -f 1 mixed.txt
11
21
31
41
51
61
71
81
91 92 93 94 95 96 97 98 99

Or use the -s or --only-delimited option to omit those lines:

$ cut -sf 1 mixed.txt
11
21
31
41
51
61
71
81

But the better approach is to cleanse the data beforehand.
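
For example, a minimal sketch (assuming GNU sed, which understands \t in the replacement) that normalizes every space to a TAB before cutting:

$ sed 's/ /\t/g' mixed.txt | cut -f 1
11
21
31
41
51
61
71
81
91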

What about multiple consecutive TAB characters in the file:

$ sed -i 's/\(1.\)\t/\1\t\t/' mixed.txt && cat $_
11 12 13 14 15 16 17 18 19
21 22 23 24 25 26 27 28 29
31 32 33 34 35 36 37 38 39
41 42 43 44 45 46 47 48 49
51 52 53 54 55 56 57 58 59
61 62 63 64 65 66 67 68 69
71 72 73 74 75 76 77 78 79
81 82 83 84 85 86 87 88 89
91 92 93 94 95 96 97 98 99

An empty field is still a field:

$ cut -sf 2 mixed.txt

22
32
42
52
62
72
82

Therefore, the drawback here is that consecutive delimiters are not treated as one. We must perform data cleansing to squeeze consecutive delimiters into a single one:

$ sed -i 's/\t\+/\t/g' mixed.txt && cat $_
11 12 13 14 15 16 17 18 19
21 22 23 24 25 26 27 28 29
31 32 33 34 35 36 37 38 39
41 42 43 44 45 46 47 48 49
51 52 53 54 55 56 57 58 59
61 62 63 64 65 66 67 68 69
71 72 73 74 75 76 77 78 79
81 82 83 84 85 86 87 88 89
91 92 93 94 95 96 97 98 99

Multiple fields can be cut:

$ cut -f 1,3,5,7,9 tab.txt
11 13 15 17 19
21 23 25 27 29
31 33 35 37 39
41 43 45 47 49
51 53 55 57 59
61 63 65 67 69
71 73 75 77 79
81 83 85 87 89
91 93 95 97 99

Cut a range:

$ cut -f 3-5 tab.txt
13 14 15
23 24 25
33 34 35
43 44 45
53 54 55
63 64 65
73 74 75
83 84 85
93 94 95

Cut up to or from a field:

$ cut -f -3 tab.txt
11 12 13
21 22 23
31 32 33
41 42 43
51 52 53
61 62 63
71 72 73
81 82 83
91 92 93
$ cut -f 7- tab.txt
17 18 19
27 28 29
37 38 39
47 48 49
57 58 59
67 68 69
77 78 79
87 88 89
97 98 99

When cutting multiple fields, the output fields are separated by the same delimiter used for input (specified by -d, or TAB by default). Changing the output delimiter is arguably not the job of cut; pipe the output to another program:

$ cut -f 3-5 tab.txt | sed 's/\t/ /g'
13 14 15
23 24 25
33 34 35
43 44 45
53 54 55
63 64 65
73 74 75
83 84 85
93 94 95
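
That said, GNU cut does offer an --output-delimiter=STRING option, so the same result can be obtained without a pipe:

$ cut -f 3-5 --output-delimiter=' ' tab.txt
13 14 15
23 24 25
33 34 35
43 44 45
53 54 55
63 64 65
73 74 75
83 84 85
93 94 95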

Node.js Crypto Starters

Starters for the most common usages.

Create an MD5 hex hash:

require('crypto').createHash('md5').update(data, 'utf8').digest('hex');

Create a SHA256 base64 hash:

require('crypto').createHash('sha256').update(data, 'utf8').digest('base64');

The calculated digest varies with the output encoding:

> data = 'secret'
'secret'
> require('crypto').createHash('sha256').update(data, 'utf8').digest('hex');
'2bb80d537b1da3e38bd30361aa855686bde0eacd7162fef6a25fe97bf527a25b'
> require('crypto').createHash('sha256').update(data, 'utf8').digest('base64');
'K7gNU3sdo+OL0wNhqoVWhr3g6s1xYv72ol/pe/Unols='
> require('crypto').createHash('sha256').update(data, 'utf8').digest('binary');
'+¸\rS{\u001d£ãÓ\u0003aªV½àêÍqbþö¢_é{õ\'¢['
> require('crypto').createHash('sha256').update(data, 'utf8').digest();
<Buffer 2b b8 0d 53 7b 1d a3 e3 8b d3 03 61 aa 85 56 86 bd e0 ea cd 71 62 fe f6 a2 5f e9 7b f5 27 a2 5b>
> require('crypto').createHash('sha256').update(data, 'utf8').digest().toString();
'+�\rS{\u001d����\u0003a��V�����qb���_�{�\'�['

Both hex and base64 are commonly used.

An input_encoding is needed when updating data; otherwise, the results will differ:

> data = '秘密'
'秘密'
> require('crypto').createHash('sha256').update(data, 'ascii').digest('hex');
'7588bc51fd0ff10db9c66549e5cb7969b9f6b7cf3ccee080137a8eb1aa06e718'
> require('crypto').createHash('sha256').update(data, 'utf8').digest('hex');
'062a2931da683a9897c5a0b597113b1e9fd0d5bfb63e2a5d7c88b724f7f55c02'
> require('crypto').createHash('sha256').update(data, 'binary').digest('hex');
'7588bc51fd0ff10db9c66549e5cb7969b9f6b7cf3ccee080137a8eb1aa06e718'
> require('crypto').createHash('sha256').update(data).digest('hex');
'7588bc51fd0ff10db9c66549e5cb7969b9f6b7cf3ccee080137a8eb1aa06e718'

Don’t assume ASCII.

The data parameter must be a string or buffer; otherwise, there will be an error:

TypeError: Not a string or buffer
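
A Buffer works without any input encoding; a minimal sketch as a node one-liner (the digest matches the earlier hex result for 'secret'):

$ node -e "console.log(require('crypto').createHash('sha256').update(new Buffer('secret')).digest('hex'))"
2bb80d537b1da3e38bd30361aa855686bde0eacd7162fef6a25fe97bf527a25b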

Add a Custom Domain to the Specific Module of Google App Engine Managed VMs

I would like to add a custom domain to a module other than the default one in Google App Engine Managed VMs. For example, I have the following sub-domains:

Custom domain names | SSL support | Record type | Data | Alias
------------------- | ----------- | ----------- | -------------------- | -----
api.example.com | none | CNAME | ghs.googlehosted.com | api
admin.example.com | none | CNAME | ghs.googlehosted.com | admin

listed in the application console:

https://console.developers.google.com/project/[project]/appengine/settings/domains

The domain api.example.com points to the default module, and I would like the other domain, admin.example.com, to point to the admin module. However, by merely adding the custom domains in the settings, both api.example.com and admin.example.com point to the same module: the default one. So how do we point the custom domain at the admin module and route the traffic there? The answer is the dispatch file. But first, we need to understand the concept of modules in Google App Engine.

Google App Engine Module Hierarchy
Source: Google App Engine Docs

The above chart illustrates the architecture of a Google App Engine application:

  • Application: “An App Engine application is made up of one or more modules”. [1]
  • Module: “Each module consists of source code and configuration files. The files used by a module represents a version of the module. When you deploy a module, you always deploy a specific version of the module.” [1] “All module-less apps have been converted to contain a single default module.” [1] Every application has a single default module.
  • Version: “A particular module/version will have one or more instances.”
  • Instance: “Each instance runs its own separate executable.”

Another concept is resource sharing:

  • “App Engine Modules let developers factor large applications into logical components that can share stateful services and communicate in a secure fashion.” [1]
  • “Stateful services (such as Memcache, Datastore, and Task Queues) are shared by all modules in an application.” [1]

Remember that Google Cloud Platform is project-based. If there were a web application and a mobile application, both making requests to a third application, the API, then there should be three projects. In this case, however, both api and admin share the same services, such as the same database and file storage, so we should put them together in the same application as separate modules.

With that in mind, how do we route requests to a specific module? “Every module, version, and instance has its own unique URI (for example, v1.my-module.my-app.appspot.com). Incoming user requests are routed to an instance of a particular module/version according to URL addressing conventions and an optional customized dispatch file.” [1] Since a custom domain is used, we have to use the customized dispatch file: create a dispatch file, dispatch.yml, to route requests based on URL patterns.

# Dispatch
# ========
---
dispatch:
  - url: '*/favicon.ico'
    module: default
  - url: 'admin.example.com/*'
    module: admin

The application field is not necessary; including it produces a warning:

WARNING: The [application] field is specified in file [/home/chao/example/dispatch.yml].
This field is not used by gcloud and should be removed.

Glob characters (such as the asterisk) can be used, and up to 10 routing rules are supported.

Deploying the dispatch file is simple:

$ gcloud preview app deploy dispatch.yml

And it should take almost no time. Once it is ready, requests sent to admin.example.com will be routed properly to the admin module.

The First Release Version

When you have a release for the first time, how do you name your release version based on SemVer? 1.0.0, 0.1.0 or 0.0.1?

I like to start off with 0.1.0, because version 0.0.0 means there is nothing. Going from nothing to something is a breaking change, hence version 0.1.0.

Use a Single Dockerfile

I was planning to build two different Docker images: one for production and one for test. However, the more I coded, the more frustrated I became. I want something simple, and more means complexity:

  1. Having multiple images means more configuration files to manage and update.
  2. More things to do when working with third-party tools; for example, Google App Engine Managed VMs deployment will only recognize the application root level Dockerfile out of the box.
  3. Steeper learning curve for new developers.

Keeping our application structure simple is important. If you need multiple Dockerfiles, then your application is too complex. The difference between npm install --production and npm install isn’t a lot, but a single Dockerfile will save you the time and effort of managing several. There is no reason to have more than one Dockerfile. Just use a different command, such as npm test, when running tests.
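
A minimal sketch of that workflow (the image name myapp is illustrative, and the Dockerfile's default command is assumed to start the application):

$ docker build -t myapp .
$ docker run myapp            # production: run the default command
$ docker run myapp npm test   # test: override the command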

Wait and then Grab the Lock and Update

When provisioning a new machine in Google Compute Engine with Vagrant, I spotted the following error messages when updating the system:

Could not get lock /var/lib/apt/lists/lock - open (11: Resource temporarily unavailable)
E: Unable to lock directory /var/lib/apt/lists/

It appears that I was simply behind in the race to grab the lock for the APT update. Without getting the latest index, I am not able to upgrade the packages. A simple solution is to wait it out:

#!/usr/bin/env bash
#
# System Update
# =============
apt-get update
while [ "$?" != "0" ]; do
  sleep 5
  apt-get update
done
apt-get -y dist-upgrade

Scaffold an Explicit Document to Avoid Error When Creating Indexes in Google Cloud Datastore

When working with Google Cloud Datastore, I would like to design each kind (collection or table) with its own schema and index files:

$ tree
.
├── index.yml
└── schema.yml

But not every kind has something to put in the index.yml file. So, what to do? Will the index creation command accept an empty, zero-byte index.yml file? Let’s give it a try:

$ gcloud preview datastore create-indexes index.yml
ERROR: (gcloud.preview.datastore.create-indexes) An error occurred while parsing file: [/home/chao/kind/index.yml]
The file is empty

No, it does not like that. Then, what is the minimum required? The answer is an explicit document with empty document content:

---

The three dashes form a directives end marker line. YAML uses three dashes to separate directives from document content.
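
For illustration, a document that actually contains a directive might look like the following sketch, where the %YAML directive declares the YAML version and the three dashes end the directives section:

%YAML 1.2
---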

When creating indexes with this file, the command will proceed without errors or warnings:

$ gcloud preview datastore create-indexes index.yml
You are about to update the following configurations:
- myproj/index From: [/home/chao/kind/index.yml]
Do you want to continue (Y/n)?

Therefore, to avoid errors when scaffolding an index.yml file for indexing, use an explicit document with empty document content.

Google Cloud Datastore Starter: Dataset

Google Cloud Datastore dataset starter, part of a starter collection:

// Google Cloud Datastore Starter: Dataset
// =======================================
'use strict';

// Define a dataset.
var params = {
  projectId: 'MY_PROJECT',
  keyFilename: '/path/to/key/file.json',
  namespace: 'MY_NAMESPACE'
};
var dataset = require('gcloud').datastore.dataset(params);

// Define an entity (including both key and data).
var kind = 'MY_KIND';
var key = dataset.key(kind);
var data = [
  {
    name: 'title',
    value: 'Google Cloud Datastore Starter',
    excludeFromIndexes: false
  }
];
// var data = { title: 'Google Cloud Datastore Starter' }; // Simple version
var entity = {
  key: key,
  data: data
};

// Save a single entity.
dataset.save(entity, function (err) {
  if (err) {
    throw err;
  }
});

Understand the Limitation of the String Property Type of Google Cloud Datastore

Google Cloud Datastore supports a variety of data types for property values:

  • Integers
  • Floating-point numbers
  • Strings
  • Dates
  • Binary data

Strings are likely the most frequently used type. When working with strings, the most important question to ask is how many bytes can be stored in the property/field. According to the table listed in properties and value types:

Up to 1500 bytes if property is indexed, up to 1MB otherwise

Let’s give it a try:

// String Property Type
// ====================
//
// Investigate the string property type of Google Cloud Datastore.
//
// Dependencies tested:
//
// - `gcloud@v0.21.0`
// - `lorem-ipsum@1.0.3`
'use strict';

// Define dependencies.
var params = {
  projectId: 'myproj',
  keyFilename: '/path/to/key/file.json',
  namespace: 'test'
};
var dataset = require('gcloud').datastore.dataset(params);
var kind = 'String';
var lorem = require('lorem-ipsum');

// Save an entity to the datastore.
function run(params) {
  var size = params.size;
  var index = params.index;
  var str = lorem({ count: size }).substr(0, size);
  var key = dataset.key(kind);
  var data = [
    {
      name: 'title',
      value: str,
      excludeFromIndexes: !index
    }
  ];
  var entity = {
    key: key,
    data: data
  };
  dataset.save(entity, function (err) {
    console.log(size, new Buffer(str).length, index,
      err ? err.code + ' ' + err.message : '200 OK');
  });
}

// Explanation of fields:
//
// - `size`: Total number of bytes to produce
// - `index`: Whether to index the string field
//
// In-line comment indicates the expected result.
[
  { size: 1500, index: true },              // OK
  { size: 1501, index: false },             // OK
  { size: 1501, index: true },              // Error
  { size: 1024 * 1024 - 92, index: false }, // OK
  { size: 1024 * 1024 - 91, index: false }  // Error
].forEach(run);

Don’t ask me where 91 or 92 comes from; apparently, the actual limit is somewhere close to 1 MB, but not exactly. The expected result of each case is noted in the in-line comments of the test script.

Install Docker on Ubuntu Trusty 14.04

Installing Docker on Ubuntu Trusty 14.04 is fairly straightforward:

$ curl -sSL https://get.docker.com/ | sh
$ sudo usermod -aG docker ${USER}
$ exit

Log out, then log back in to verify the installation by running a sample container:

$ docker run hello-world

Done!