Accessing Upwork JSON Data without the API

Upwork, formerly Elance-oDesk, is the world’s largest freelancing marketplace. I’m interested in knowing what types of jobs there are on the platform, and how many. For a lazy programmer, browsing each job category, clicking on each link, and copying those numbers is not the way to go. I need to automate this. There is an API. But before diving into the API documentation, let’s see if there is another way (“Rule of Diversity”).

Before continuing, a word of warning. This is prohibited:

Using any robot, spider, scraper, or other automated means to access the Site for any purpose without our express written permission or collecting or harvesting any personally identifiable information, including Account names, from the Site;[^2]

Poking around the web app reveals that it communicates with its backend by exchanging JSON via the URL https://www.upwork.com/o/jobs/browse/url. However, accessing the URL directly returns a 404 (page does not exist) error. Something is missing.

The web app is able to make the request successfully, so this is not difficult to tackle. Start from the working request and strip it down by process of elimination until the required information reveals itself.

After a couple of tries, it turns out that only one request header needs to be added: X-Requested-With: XMLHttpRequest. With it, the JSON response is returned with status code 200:

$ http --verbose https://www.upwork.com/o/jobs/browse/url \
X-Requested-With:XMLHttpRequest
GET /o/jobs/browse/url HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: www.upwork.com
User-Agent: HTTPie/0.9.6
X-Requested-With: XMLHttpRequest

The default sort is by creation time in descending order, so there is no need to add the query parameter sort==create_time+desc (HTTPie syntax).

Let’s load the response data into Node.js and perform a quick analysis:

$ node
> data = require('./upwork.json')
{ url: '/o/jobs/browse/',
searchResults:
{ q: '',
paging: { total: 87654, offset: 0 },
spellcheck: { corrected_queries: [] },
jobs:
[ [Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object] ],
smartSearch: { downloadTeamApplication: false },
facets:
{ jobType: [Object],
workload: [Object],
duration: [Object],
clientHires: [Object],
contractorTier: [Object],
categories: [Object],
previousClients: [Object],
subcategories: [] },
isSearchWithEmptyParams: true,
subcategories: [],
currentQuery: {},
rssLink: '/ab/feed/jobs/rss?api_params=1&q=',
atomLink: '/ab/feed/jobs/atom?api_params=1&q=',
queryParsedParams: [],
pageTitle: 'Freelance Jobs - Upwork' } }

The property searchResults.paging.total is the total number of jobs available:

> data.searchResults.paging
{ total: 87654, offset: 0 }

But the number differs from the one shown in the web app: it is about 50% lower. Is that because the request is not recognized as coming from a logged-in user? Let’s find out.

Installing jq from Source

The packages in both Ubuntu and Debian lag behind. Therefore, to get the latest version of jq, build it from source.

There are a few prerequisites to install:

  • GCC
  • Make
  • Autotools

Both GCC and Make are usually installed if you do development, but not Autotools. Luckily, this is easy to fulfill:

$ sudo apt-get install autoconf automake libtool

Install from source:

$ git clone https://github.com/stedolan/jq.git
$ cd jq
$ git checkout jq-1.5
$ autoreconf -i
$ ./configure && make && sudo make install

The installed path is at:

$ which jq
jq is /usr/local/bin/jq

However, this gives me an unexpected tag:

$ jq --version
jq-1.5-dirty

Will Docker Container Restart Pick Up Updated Image?

When a Docker image has been updated, will restarting the running container via docker restart pick up the change? An educated guess would be no: as with restarting a process, the container keeps the state it already has. The best way to find out is to give it a try.

Let’s start with a Dockerfile:

# Version Foo
FROM debian:8.5
CMD while true; do echo foo; sleep 5; done

The command will keep printing foo every 5 seconds.

Create the image:

$ docker build -t example .
Sending build context to Docker daemon 2.048 kB
Step 1 : FROM debian:8.5
---> 1b088884749b
Step 2 : CMD while true; do echo foo; sleep 5; done
---> Running in 38fdeb15f629
---> 6a56a50ef254
Removing intermediate container 38fdeb15f629
Successfully built 6a56a50ef254

Notice the image ID starting with 6a56.

Start the container:

$ docker run -d --name example example
dac42e7194e4ec2bdca8e24db29a3333ae2f422d316e341c5cb1499034a4357b

Check the log:

$ docker logs example
foo
foo

This is the expected output.

Inspect the container:

$ docker inspect example

The important field is the corresponding image, which matches the previously built image:

{
...
"Image": "sha256:6a56a50ef254bb1d07117b0a0750ef81fafe9735ab3b0f2b0a14511f38d5b83d"
...
}

Now update the Dockerfile:

# Version Bar
FROM debian:8.5
CMD while true; do echo bar; sleep 5; done

This time it prints bar instead of foo.

Rebuild the image:

$ docker build -t example .
Sending build context to Docker daemon 2.048 kB
Step 1 : FROM debian:8.5
---> 1b088884749b
Step 2 : CMD while true; do echo bar; sleep 5; done
---> Running in 7fc297e12005
---> a6c04345afb9
Removing intermediate container 7fc297e12005
Successfully built a6c04345afb9

Now we have a different image. The image ID is different: a6c0. But the old image is still there:

$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
example latest a6c04345afb9 24 seconds ago 125.1 MB
<none> <none> 6a56a50ef254 3 minutes ago 125.1 MB

Restart the container:

$ docker restart example
example

Got bar? No, the log still shows foo all the way. And when you inspect the container, it still uses the old image.

So docker restart does not pick up the changes from an updated image; the container keeps using the image it was created from. The correct way is therefore to remove the container entirely and run it again:

$ docker stop example && docker rm example && docker run -d --name example example
example
example
55cec9110fed0257060673a085a08f143003336b1720894f43c6ac5a22104935

The log shows the correct message:

$ docker logs example
bar
bar

Inspecting the container shows that it now has the correct image:

$ docker inspect example
{
...
"Image": "sha256:a6c04345afb953ab392241f56c04f72110c772a6ee3a36e248c1ffd03f81b7d6"
...
}

And don’t forget to delete the old image.
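To check whether a container has fallen behind its tag without reading the full inspect output, the two image IDs can be compared directly. The following is a minimal sketch and not part of the experiment above: the function name container_is_stale is made up for illustration, and it assumes the --type and --format options of docker inspect are available in the installed Docker version.

```shell
# Hypothetical helper: exits 0 when the container was created from an image
# other than the one the tag currently points to, 1 when they match,
# and 2 when either lookup fails.
container_is_stale() {
  local container image running current
  container=$1
  image=$2
  # Image ID the container was created from (the "Image" field shown above).
  running=$(docker inspect --type=container --format '{{.Image}}' "$container") || return 2
  # Image ID the tag resolves to right now.
  current=$(docker inspect --type=image --format '{{.Id}}' "$image") || return 2
  [ "$running" != "$current" ]
}

# Usage, with the names from this post:
#   if container_is_stale example example; then
#     docker stop example && docker rm example && docker run -d --name example example
#   fi
```

The explicit --type matters here because the container and the image share the name example, so a plain docker inspect example would be ambiguous.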

Settings:

$ docker --version
Docker version 1.12.0, build 8eab29e

Fixing Authorization Failure in AWS CLI by Synchronizing the Clock

I ran into an error when executing an AWS command:

$ aws ec2 describe-instances
An error occurred (AuthFailure) when calling the DescribeInstances operation: AWS
was not able to validate the provided access credentials

From the error message, it appears to be a problem with the access credentials. But after switching to a new credential, and even updating the AWS package, the error persisted. After trying out other commands, one of them returned an error message containing “signature not yet current” along with timestamps. So the actual problem was an inaccurate local clock. Hence, the solution is to sync the local date and time by querying a Network Time Protocol (NTP) server:

$ sudo ntpdate pool.ntp.org

ntpdate can be run manually as necessary to set the host clock, or it can be run from the host startup script to set the clock at boot time. This is useful in some cases to set the clock initially before starting the NTP daemon ntpd. It is also possible to run ntpdate from a cron script. However, it is important to note that ntpdate with contrived cron scripts is no substitute for the NTP daemon, which uses sophisticated algorithms to maximize accuracy and reliability while minimizing resource use. Finally, since ntpdate does not discipline the host clock frequency as does ntpd, the accuracy using ntpdate is limited.[^1]

From the description, we learn that we can make things even easier by installing the NTP package:

$ sudo apt-get install -y ntp

Network Time Protocol daemon and utility programs. NTP, the Network Time Protocol, is used to keep computer clocks accurate by synchronizing them over the Internet or a local network, or by following an accurate hardware receiver that interprets GPS, DCF-77, NIST or similar time signals.[^2]

Verify the installation and execution:

$ ps -e | grep ntpd
4964 ? 00:00:00 ntpd

with the environment:

$ aws --version
aws-cli/1.10.53 Python/2.7.6 Linux/3.13.0-92-generic botocore/1.4.43
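To confirm that clock drift is the culprit, the local clock can be compared against the Date header of any HTTPS response. This is a rough sketch under a few assumptions: clock_skew_seconds is a hypothetical helper, it relies on GNU date (the -d option), and the endpoint in the usage comment is only an example.

```shell
# Hypothetical helper: prints the local-minus-remote clock difference in seconds.
#   $1: local time as epoch seconds
#   $2: remote time as an HTTP Date header value
clock_skew_seconds() {
  local remote_epoch
  # GNU date parses the RFC 1123 timestamp used in HTTP headers.
  remote_epoch=$(date -u -d "$2" +%s) || return 1
  echo $(( $1 - remote_epoch ))
}

# Usage (needs network access; any server that sends a Date header would do):
#   remote=$(curl -sI https://aws.amazon.com | tr -d '\r' |
#     awk 'tolower($1) == "date:" { sub(/^[^ ]+ /, ""); print; exit }')
#   clock_skew_seconds "$(date +%s)" "$remote"
```

A result far from zero, in either direction, suggests the local clock is off and is worth syncing before digging further into the credentials.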

[^1]: $ man ntpdate
[^2]: $ apt-cache show ntp

Creating a Data Volume Container in Dockerfile

Creating a Docker data volume container with a Dockerfile is unbelievably simple: just use the VOLUME instruction:

FROM debian:8.5
VOLUME ["/data"]

The instruction creates a mount point and marks it as holding externally mounted volumes from the native host or other containers.

Build the data container:

$ docker build -t data .
Sending build context to Docker daemon 2.048 kB
Step 1 : FROM debian:8.5
---> 1b088884749b
Step 2 : VOLUME /data
---> Running in 5511f34a489c
---> 7b723b2b3d13
Removing intermediate container 5511f34a489c
Successfully built 7b723b2b3d13

The built image is about 125.1 MB. Create the container from it:

$ docker create --name data data

The first data is the name of the container; the second data is the name of the Docker image.

To attach the data volume container to another container, use the --volumes-from option:

$ docker run -it --rm --name foo --volumes-from=data debian:8.5 /bin/bash
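A quick way to see the sharing in action is to write a file through one container and read it back through another. This is a sketch only: volume_roundtrip and the file name /data/greeting are made up for illustration, and it assumes the data container created above exists.

```shell
# Hypothetical round-trip check: both throwaway containers attach to the
# volumes of the data volume container (default name: data).
volume_roundtrip() {
  local data_container
  data_container=${1:-data}
  # Write a file into the shared volume from the first container...
  docker run --rm --volumes-from="$data_container" debian:8.5 \
    sh -c 'echo hello > /data/greeting' &&
  # ...and read it back from a second, unrelated container.
  docker run --rm --volumes-from="$data_container" debian:8.5 \
    cat /data/greeting
}

# Usage:
#   volume_roundtrip data
```

Because both containers are started with --rm, only the data container and its volume remain afterwards.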

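If there is initial data to copy, add a COPY instruction. Place it before VOLUME, because the Dockerfile reference warns that build steps changing the data within a volume after it has been declared are discarded:

FROM debian:8.5
COPY . /data
VOLUME ["/data"]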

Settings:

$ docker --version
Docker version 1.11.2, build b9f10c9

Escaping in JSON with Backslash

Escape characters are part of the syntax for many programming languages, data formats, and communication protocols. For a given alphabet an escape character’s purpose is to start character sequences (so named escape sequences), which have to be interpreted differently from the same characters occurring without the prefixed escape character.[^2]

JSON or JavaScript Object Notation is a data interchange format. It has an escape character as well.

In many programming languages such as C, Perl, and PHP and in Unix scripting languages, the backslash is an escape character, used to indicate that the character following it should be treated specially (if it would otherwise be treated normally), or normally (if it would otherwise be treated specially).[^3]

JavaScript also uses the backslash as an escape character. JSON is based on a subset of the JavaScript programming language, so it uses the backslash as its escape character too:

A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes.[^1]

A character can be:

  • Any Unicode character except " or \ or control character
  • \"
  • \\
  • \/
  • \b
  • \f
  • \n
  • \r
  • \t
  • \u + four-hex-digits

Only a few escape sequences are valid in JSON. If the character after the backslash is not one of those listed:

$ cat data.json
"\a"

parsing it throws a SyntaxError[^4]:

$ node -e 'console.log(require("./data.json"))'
module.js:561
throw err;
^
SyntaxError: /home/chao/tmp/js/data.json: Unexpected token a in JSON at position 2
at Object.parse (native)
at Object.Module._extensions..json (module.js:558:27)
at Module.load (module.js:458:32)
at tryModuleLoad (module.js:417:12)
at Function.Module._load (module.js:409:3)
at Module.require (module.js:468:17)
at require (internal/module.js:20:19)
at [eval]:1:13
at ContextifyScript.Script.runInThisContext (vm.js:25:33)
at Object.exports.runInThisContext (vm.js:77:17)
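The same rule can be checked from the shell before handing a string to a JSON parser. This is a minimal sketch, assuming a sed that supports -E; has_invalid_json_escape is a hypothetical name, and the function only looks at backslash escapes, it does not validate the rest of the JSON syntax.

```shell
# Hypothetical helper: exits 0 if the given text contains a backslash escape
# that is not in the list above, 1 otherwise.
has_invalid_json_escape() {
  local rest
  # Remove every valid escape: \" \\ \/ \b \f \n \r \t and \u + four hex digits.
  rest=$(printf '%s' "$1" | sed -E 's#\\(["\\/bfnrt]|u[0-9A-Fa-f]{4})##g')
  # Any backslash that survives must start an invalid escape.
  case $rest in
    *\\*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   has_invalid_json_escape "$(cat data.json)" && echo "invalid escape found"
```

Stripping the valid sequences from left to right keeps an escaped backslash followed by a letter, such as \\a, from being mistaken for the invalid \a.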

Getting the Version of the Latest Release

What’s the latest release of Docker?

Its homepage doesn’t tell you. You have to poke around and click on a few links, which may or may not get you what you want. A quick way, or even better a CLI method, would be great.

There are a couple of things we can do. First, when installing Docker, we use the URL https://get.docker.com/. It has a path that returns installation instructions containing the version number:

$ curl https://get.docker.com/builds/
# To install, run the following command as root:
curl -sSL -O https://get.docker.com/builds/Linux/x86_64/docker-1.11.2.tgz && sudo tar zxf docker-1.11.2.tgz -C /
# Then start docker in daemon mode:
sudo /usr/local/bin/docker daemon

There is another way. Well, there is always another way. The Docker project is hosted on GitHub, so we can use this URL:

https://github.com/docker/docker/releases/latest

which will be redirected to the latest release:

https://github.com/docker/docker/releases/tag/v1.11.2

Since it’s a redirect, we can use the HTTP HEAD method and avoid downloading the entire response body:

$ curl --silent --head https://github.com/docker/docker/releases/latest
HTTP/1.1 302 Found
Server: GitHub.com
Content-Type: text/html; charset=utf-8
Status: 302 Found
Cache-Control: no-cache
Vary: X-PJAX
Location: https://github.com/docker/docker/releases/tag/v1.11.2
Vary: Accept-Encoding

Extracting and processing the value of the Location field gets us what we are looking for.

Let’s construct a simple command to obtain this information:

$ curl \
--silent \
--head \
--url https://github.com/docker/docker/releases/latest | \
grep \
--regexp=^Location | \
cut \
--delimiter=/ \
--fields=8

or:

$ curl -sI https://github.com/docker/docker/releases/latest | \
grep ^Location | \
cut -d / -f 8

Both commands will return v1.11.2.

By using GitHub, we can get the latest stable release version not only of Docker but of other projects as well. In fact, if a project is hosted on GitHub and its releases are tagged properly, you can use this method to obtain the version. However, if it’s not properly tagged, as with Node.js, you need to find another way.
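Two details may trip up the curl pipelines shown earlier, depending on the curl version and the protocol negotiated: header names can arrive in lowercase (location:), and each header line ends with a carriage return that cut leaves attached to the version. A slightly more defensive sketch follows, with latest_tag_from_headers as a hypothetical helper name:

```shell
# Hypothetical helper: reads HTTP response headers on stdin and prints the
# last path segment of the Location header.
latest_tag_from_headers() {
  # Drop carriage returns, match the header name case-insensitively,
  # then split the URL on "/" and keep the final piece.
  tr -d '\r' |
    awk 'tolower($1) == "location:" { n = split($2, parts, "/"); print parts[n]; exit }'
}

# Usage:
#   curl -sI https://github.com/docker/docker/releases/latest | latest_tag_from_headers
```

Taking the last segment rather than a fixed field number also keeps the helper working if the redirect URL gains or loses a path component.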

Node.js Installation Methods July 2016 Edition

Following up on the previous post about the various methods to install Node.js, here is an update covering Node.js v4.x and v6.x.

Install Node.js v4.x via NodeSource[^1] setup script:

$ curl -sL https://deb.nodesource.com/setup_4.x | sudo -E bash -
$ sudo apt-get install -y nodejs

The sudo -E option indicates to the security policy that the user wishes to preserve their existing environment variables[^2].

In a Dockerfile, sudo should be removed, because the build commands already run as root:

RUN curl -sL https://deb.nodesource.com/setup_4.x | bash - && \
apt-get install -y nodejs

Install Node.js v6.x via the same method:

$ curl -sL https://deb.nodesource.com/setup_6.x | sudo -E bash -
$ sudo apt-get install -y nodejs

The above installation method has been tested on:

  • Debian 8.5
  • Node.js v4.4.7 and v6.2.2

In summary, the methods to install Node.js covered are:

  • Package manager
  • NodeSource script
  • From the source

[^1]: NodeSource provides binary distribution setup and support scripts.
[^2]: See man sudo.

Resetting GitLab User Password with a Simple Shell Script

Problem

Resetting a password is one of the most common requests in virtually any system. In GitLab, a user’s password can be updated by visiting the /admin/users page. But if you have forgotten the password of the root or admin user, you need another method to reset it.

Objective

The goal is to simplify the process of resetting a GitLab user password by using the CLI, so that the next time the same problem comes up, fixing it will be quick and easy.

Settings

Self-hosted GitLab, installed in a Docker container.

Solutions

A step-by-step procedure to reset the root password is already provided by the GitLab documentation. By converting it into a Bash shell script and placing it in the user’s home bin directory as an executable ~/bin/gitlab-password-reset file, we get a simple command that can be run repeatedly:

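#!/usr/bin/env bash
# Prompt for the account email and the new password.
# -r keeps backslashes literal; -s hides the typed password.
echo -n "Email: "; read -r MAIL
echo -n "Password: "; read -r -s PASS
echo # Ensure a new line
# The values are interpolated into Ruby single-quoted strings,
# so an email or password containing a single quote will break the command.
docker exec gitlab gitlab-rails runner -e production " \
user = User.find_by(email: '$MAIL'); \
user.password = user.password_confirmation = '$PASS'; \
user.save!"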

We could simply run the long docker command instead of a shell script. But since we’re dealing with a password, it’s good practice to keep sensitive information out of the command-line history.

No trailing spaces are allowed on the password field, by the way.

Another solution is to turn the Ruby evaluation into a script and save it somewhere under the /srv/gitlab/config directory (here, in a scripts subdirectory). Then, we can just run:

$ docker exec gitlab gitlab-rails runner -e production /etc/gitlab/scripts/password-reset.rb

We are using Docker to run GitLab, and the following directories are mapped from the host to the container:

/srv/gitlab/config:/etc/gitlab
/srv/gitlab/logs:/var/log/gitlab
/srv/gitlab/data:/var/opt/gitlab

Therefore, when executing the Ruby script, the path starts with /etc/gitlab instead of /srv/gitlab. However, you will need to figure out how to get the email and password into the script. That’s for you to answer.
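One possible answer, sketched under assumptions: keep the prompt on the host and pipe the two values into the container on standard input. This assumes the Ruby script (not shown here) reads the email and then the password from STDIN, one per line, and that gitlab-rails runner passes standard input through to it; reset_gitlab_password is a hypothetical helper name.

```shell
# Hypothetical helper: feeds the email ($1) and the password ($2) to the Ruby
# script on stdin, one per line, so neither value appears in the docker
# command line.
reset_gitlab_password() {
  printf '%s\n%s\n' "$1" "$2" |
    docker exec -i gitlab gitlab-rails runner -e production \
      /etc/gitlab/scripts/password-reset.rb
}

# Usage from an interactive Bash session (read -s is a Bash feature):
#   read -r -p "Email: " MAIL
#   read -r -s -p "Password: " PASS; echo
#   reset_gitlab_password "$MAIL" "$PASS"
```

The -i flag on docker exec is what keeps standard input open, so the piped lines reach the process inside the container.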