Data Science

Accessing Upwork JSON Data without the API

Upwork, formerly Elance-oDesk, is the world’s largest freelancing marketplace. I’m interested to know what types of jobs they are in the platform, and how many. For a lazy programmer, browsing each job category and clicking on each link, and copying those numbers is not the way to go. I need to automate this. There is an API. But before diving into the API documentation, let’s see if there is another way (“Rule of Diversity”).

Before continuing, a word of warning, this is prohibited:

Using any robot, spider, scraper, or other automated means to access the Site for any purpose without our express written permission or collecting or harvesting any personally identifiable information, including Account names, from the Site;[^2]

After poking around the web app, it communicates with its backend by using JSON data exchange format via the URL: https://www.upwork.com/o/jobs/browse/url. However, if accessing the URL directly, it will respond with 404 page not exist error. Something is missing.

Well, the web app is able to successfully make the request, so this is not difficult to tackle. Just use the process of elimination from the working request, it will reveal the required information.

After a couple tries, just need to add the request header: X-Requested-With: XMLHttpRequest, then the JSON response with the status code 200 will be returned:

1
2
3
4
5
6
7
8
9
$ http --verbose https://www.upwork.com/o/jobs/browse/url \
X-Requested-With:XMLHttpRequest
GET /o/jobs/browse/url HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: www.upwork.com
User-Agent: HTTPie/0.9.6
X-Requested-With: XMLHttpRequest

The default sort is by creation time in descending order, so you don’t need to add the query parameters: sort==create_time+desc (HTTPie).

Let’s load the response data into Node.js and perform a quick analysis:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
$ node
> data = require('./upwork.json')
{ url: '/o/jobs/browse/',
searchResults:
{ q: '',
paging: { total: 87654, offset: 0 },
spellcheck: { corrected_queries: [] },
jobs:
[ [Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object] ],
smartSearch: { downloadTeamApplication: false },
facets:
{ jobType: [Object],
workload: [Object],
duration: [Object],
clientHires: [Object],
contractorTier: [Object],
categories: [Object],
previousClients: [Object],
subcategories: [] },
isSearchWithEmptyParams: true,
subcategories: [],
currentQuery: {},
rssLink: '/ab/feed/jobs/rss?api_params=1&q=',
atomLink: '/ab/feed/jobs/atom?api_params=1&q=',
queryParsedParams: [],
pageTitle: 'Freelance Jobs - Upwork' } }

The property searchResults.paging.total is the total number of jobs available:

1
2
> data.searchResults.paging
{ total: 87654, offset: 0 }

But, the number is different from the web app, a lot less, 50% less jobs found. Is that because the request is not recognized as a logged-in user? Let’s find out.

Getting the Version of the Latest Release

What’s the latest release of Docker?

Its homepage doesn’t tell you anything. Have to poke around, click on a few links, may or may not get you what you want. If there’s a quick way, even better a CLI method, that will be great.

Couple things we can do. First, when installing docker, we use the URL https://get.docker.com/. It has a path that will return an installation instruction with the version number:

1
2
3
4
5
$ curl https://get.docker.com/builds/
# To install, run the following command as root:
curl -sSL -O https://get.docker.com/builds/Linux/x86_64/docker-1.11.2.tgz && sudo tar zxf docker-1.11.2.tgz -C /
# Then start docker in daemon mode:
sudo /usr/local/bin/docker daemon

There is another way. Well, there is always another way. Docker project is hosted in GitHub, we can use this URL:

1
https://github.com/docker/docker/releases/latest

which will be redirected to the latest release:

1
https://github.com/docker/docker/releases/tag/v1.11.2

Since it’s a redirect, we can use HTTP HEAD method without download the entire response body:

1
2
3
4
5
6
7
8
9
$ curl --silent --head https://github.com/docker/docker/releases/latest
HTTP/1.1 302 Found
Server: GitHub.com
Content-Type: text/html; charset=utf-8
Status: 302 Found
Cache-Control: no-cache
Vary: X-PJAX
Location: https://github.com/docker/docker/releases/tag/v1.11.2
Vary: Accept-Encoding

Extract and process the value of the Location field will get us what we are looking for.

Let’s construct a simple command to obtain such an information:

1
2
3
4
5
6
7
8
9
$ curl \
--silent \
--head \
--url https://github.com/docker/docker/releases/latest | \
grep \
--regexp=^Location | \
cut \
--delimiter=/ \
--fields=8

or:

1
2
3
$ curl -sI https://github.com/docker/docker/releases/latest | \
grep ^Location | \
cut -d / -f 8

Both commands will return v1.11.2.

By using GitHub, not only we can get the latest stable release version of Docker, we can also obtain other projects. In fact, if the project was hosted in GitHub, and it was tagged properly with the releases, you can use this method to obtain the version. However, if it’s not properly tagged, such as Node.js, you need to find another way.