Graph-backed Web Application

Overview

While the focus of several previous posts has been slanted towards the back end of graph-based information storage, in this post we will connect with the user, albeit in a crude, proof-of-concept manner. We’ll show a particular back-end web application infrastructure choice as well as one for the front end, but will stop short of providing a complete sample application.

In order to keep this post concise and focused on the key concepts, many shortcuts have been taken that unquestionably disqualify the code herein from being used verbatim in a production setting. The complete absence of security and of even rudimentary error handling, not to mention the glaring omission of basic DevOps concerns, should make that more than clear.

Architecture

With the disclaimer out of the way, what are we trying to accomplish then? The idea is that we have a graph store containing information and knowledge that can be inferred from reasoning over it. Suppose we need a zero-client means of interacting with said graph for the purposes of traversal, property augmentation or even structural change.

Server-side web applications that spit HTML at the browser, while lean on the client side, are ruled out for this task owing to the flicker of wholesale page refreshes. AJAX is the more logical option as it retains client state and excels at in-place page updates. Specifically, we have chosen to write the application in AngularJS, backed by RESTful services running in Node.js using the koa.js framework.

The simplified schematic below outlines the web application components involved:

Graph Web App Arch

From the client’s perspective, AngularJS is loaded via CDN and the rest of the application is served as static files (.js, .html and .css) from the Node.js server. The very same Node.js server (this detail is quite important) exposes the APIs through RESTful web services that the AngularJS-based application calls.

As this architecture is based on koa.js, common logic is factored out as middleware. The system relies on existing middleware such as koa-router to route RESTful requests and koa-static to serve static content, i.e. the client side of the application that executes in the browser. The APIs do, however, also rely on custom middleware, specifically code that facilitates communication with the Gremlin server. More on that next.

Graph middleware

Koa.js (from the authors of Express.js) is an exceedingly clever, very contemporary framework that promotes cohesive, modular design and takes advantage of the powerful features in ES6+ JavaScript.

The code below is the complete definition of the Gremlin middleware for parameterized queries.

var gremlin = require('gremlin');

module.exports = function (opts) {
    "use strict";
    opts = opts || {};
    var port = opts.port || 8182;
    var host = opts.host || 'localhost';

    return function *(next) {
        var client = gremlin.createClient(port, host);

        this.gr = {
            call: function (query, params) {
                return new Promise(function (resolve, reject) {

                    // Perform the call to the Gremlin Server
                    client.execute(query, params, function (err, result) {
                        if (!err)
                            resolve(result);
                        else {
                            console.log('Execute error. Message: ' + err.message + '\n.Query: ' + query);
                            reject(err.message);
                        }
                    });
                });
            }
        };

        try {
            yield next;
        }
        catch (err) {
            console.log('Returning error to the caller: ' + err);
            this.status = 500;
            this.body = err;
        }
    };
};

The important parts here are that the middleware returns a generator that must yield to its ‘next’ object in order to play nicely with other middleware providers, and that the ‘gr’ object exposes a call function returning a Promise of a response from the Gremlin server. In order to maximize the Gremlin server’s benefit from query compilation, the call wrapper supports query parameterization.

Server logic

The server definition is rather plain in its declarative approach.

var app = require('koa')();
var gremlin = require('./mw-gremlin');
var router = require('koa-router')();
var serve = require('koa-static');

app.use(gremlin({port: 8183, host: '10.22.66.1'}))
    // API routes
    .use(router.routes())
    .use(router.allowedMethods())
    // AngularJS is served as static files
    .use(serve('./public'))
    .listen(9000, 'localhost');

It’s crude and it’s hard-coded but it shows the broad strokes of server declaration:

  • Dependencies
  • Middleware configuration
  • Server network configuration

RESTful API implementation

With a sensibly designed middleware-based application architecture, the RESTful API implementations are focused on the logic that justifies their existence and not on repetitive overhead.

/// GET a node's neighbors
router.get('/api/neighbors/:p0', function *() {
    "use strict";
    let m = this.req.url.match('[0-9]+$');
    if (m !== null) {
        let p0 = Number(m[0]);
        this.body = yield this.gr.call("g.V().hasId(p0).union(outE().as('e').inV(), __.inE().as('e').outV()).as('node').select('e', 'node')", { p0: p0 });
    }
});

After some basic URL element parsing (which of course could and should be broken out into a function), we simply yield on the Promise returned by the Gremlin middleware’s call method. This achieves synchronous-style calling in the API, which makes for easier-to-read code and more robust error handling.

AngularJS Service implementation

Without going into too much AngularJS detail, suffice it to say that AngularJS applications communicate with web services through AngularJS services defined using the factory method. These services, through the magic of dependency injection, can then be made available in controllers, directives or other services for that matter.

Below is the definition of a graph service that facilitates traversal through neighbor nodes. The injected service dependency of ‘traverseState’ is to control the type of traversal, a concept that is specific to the implementation on which this post is based and is not material to the overall gist.

    /// Service communicating with underlying Gremlin graph
    ///
    .factory('graph', ['$resource', 'traverseState', function ($resource, traverseState) {
        "use strict";
        return {
            getNode: function (actorId, f) {
                return $resource('/api/node/:nodeId', { nodeId: actorId }).query()
                    .$promise.then(response => f(response));
            },

            getResponsibilities: function (actorId, f) {
                return $resource('/api/responsibilities/:nodeId', { nodeId: actorId }).query()
                    .$promise.then(response => f(response));
            },

            // Neighbors can be restricted to traverse outcome-related edges only
            getNeighbors: function (nodeId, f) {
                let apiName = '/api/' + (traverseState.isOutcomeTraversal() ? 'outcomes' : 'neighbors') + '/:nodeId';
                return $resource(apiName, { nodeId: nodeId }).query()
                    .$promise.then(response => f(response));
            }
        }
    }])
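
For completeness, a controller might consume the service along the lines of the sketch below. The controller name, scope properties and callback are illustrative only and not part of the original application:

    /// Hypothetical controller consuming the graph service
    ///
    .controller('neighborsCtrl', ['$scope', 'graph', function ($scope, graph) {
        "use strict";
        $scope.neighbors = [];

        // Load the neighbors of a given node and expose them to the view
        $scope.loadNeighbors = function (nodeId) {
            graph.getNeighbors(nodeId, function (response) {
                $scope.neighbors = response;
            });
        };
    }])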

Conclusion

We’ve stepped top-down through a contemporary web application design that leverages modern framework and language features to create a robust application foundation connecting our end users with a Gremlin-interfaced graph database.

Graph DB Undo stack

Overview

This post outlines a relatively simple means of implementing an undo/restore stack to handle accidental deletions of vertices and edges from a graph database. While the ability to restore incorrectly deleted elements gives peace of mind even in single-user contexts, it’s when multiple users collaborate on the same graph that this feature becomes really valuable.

Scope

In order to manage the complexity inherent in such operations, we’ll start with a maximal set of restrictions which we will loosen incrementally in subsequent posts.

In sum:

  • Only permit deletion of one vertex or edge at a time.
  • Only permit restoration of one element at a time.
  • Always restore the top (most recently pushed) element off the stack, i.e. all more recently deleted elements must be restored before an older element can be restored.
  • Only permit deletion of orphaned vertices, i.e. those with no in- or outbound edges (degree zero). This has the effect that edges will always be deleted prior to the vertices they are bound to, and we’re thus spared the complexity otherwise stemming from restoring an edge to a deleted vertex.

Environment

Our graph will be hosted in a Neo4j-backed Gremlin Server 3.1.1 and will thus be accessed through Gremlin queries.

While we only have one ‘physical’ graph (named ‘graph’) with one traverser (named ‘g’), we will for the purposes of showing contextual partitioning introduce the concept of model identifier, which serves as a logical grouping or context designation for each vertex. In our example this designation is manifest through a property named ‘model’, but could just as well (and perhaps better) be implemented as a dedicated edge between the context vertex and each constituent vertex.
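
As a minimal illustration (the label and property values below are made up purely for this example), tagging a vertex with its model context could be as simple as:

g.addV('actor').property('model', 'ModelB').property('name', 'Some actor')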

We’ll be using the words node and vertex interchangeably.

Conceptual design

The following schematic outlines the undo graph concept. Elements deleted are highlighted in green and with a 5pt line weight.

Undo Concept graph1


In this design there is one and only one Undo root vertex linked with one Undo model vertex for each distinct model instance in the graph.

The Undo stack itself is defined by the blue vertices. These vertices are meta-nodes and link to the ‘payload’, i.e. the element that has been deleted: either directly, as is the case with deleted vertices, or as a label and property copy, as is the case with deleted edges.

Operations

Setup

Prior to being able to push deleted elements onto the undo stack, the model instance-centric heads need to be set up. This can be accomplished with a query similar to the following (line breaks added for readability):

root = g.V().hasLabel('undoRoot').tryNext()
.orElseGet({g.addV('undoRoot').next()}); 

g.V(root).out('model').and(hasLabel('undoModel'), has('name', 'ModelB'))
.tryNext()
.orElseGet({g.V(root).as('r').addV('undoModel')
.property('name', 'ModelB').as('n')
.addE('model').from('r').inV()
.addV('undoNull').addE('next').from('n').outV().next()})

In order to simplify subsequent queries, we’ll be assuming that the ‘undoModel’ labelled vertex has ID 4678 and will reference it directly.
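
In a real session the ID would of course be looked up rather than memorized; a query along these lines returns it:

g.V().hasLabel('undoModel').has('name', 'ModelB').id().next()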

Vertex deletion

With a strict zero-degree constraint on vertex deletion, deletion is effectively a linkage operation, where we create a meta-node that links to the deleted object and push said meta-node onto the restore stack.

If the orphaned vertex we wished to delete had the ID of 94, our Gremlin query could look something like this:

g.V(4678).as('r').addV('undoVertex').as('uv')
.addE('deleted').from(V(94))
.select('uv').addE('next').from('r')
.outV().outE('next')
.match(__.as('e').inV().as('t')).where('t', neq('uv'))
.select('t').addE('next').from('uv')
.select('e').drop()

Let’s break this query down to understand what is going on.

  1. Declare our model root as ‘r’, from which we add an undoVertex node that we call ‘uv’.
  2. The ‘uv’ vertex links to the node that we wanted to delete, i.e. the one with ID 94.
  3. Add a ‘next’ edge from said root to our undoVertex.
  4. Follow all outbound ‘next’ edges from the newly created edge’s source vertex, i.e. root’s ‘next’ edges.
  5. Match all nodes that are different from the one we just added. There should only ever be one here, which is the previous top vertex that we’re pushing one level deeper into the stack.
  6. Add a new ‘next’ edge from our still new ‘uv’ vertex to this previous node.
  7. Drop the edge that linked the root ‘r’ to the previous vertex that ‘uv’ is now responsible for keeping track of.

The following image illustrates our resultant graph assuming no vertices had previously been deleted:

Edge deletion

Edge deletion differs from vertex deletion in that edges are immutable, i.e. Gremlin won’t let you reassign the vertices an edge connects. As a result, our approach is to copy the edge label and properties into a new edge that connects dummies, or proxy vertices, that serve no other purpose than to assist the restoration process in identifying the true vertices that the logical edge was originally connected to.

In the example below, the Undo Model root ID is 4678 and the ID of the edge to delete is 8341, as illustrated by the following sub-graph. These two IDs are the only IDs the delete operation needs in order to complete.

Single Edge Deletion

ue = g.addV('undoEdge').as('ue').addE('next').from(V(4678))
.select('ue').next();

oe = g.E().hasId(8341).next();

ne = g.V(ue).as('ue')
.addV('undoProxy').property('proxyID', g.E(oe).outV().id()).as('pt')
.select('ue').addE('tail').from('pt')
.addV('undoProxy').property('proxyID', g.E(oe).inV().id()).as('ph')
.select('ue').addE('head').from('ph')
.select('ph').addE(oe.label()).from('pt').next();

ElementHelper.propertyValueMap(oe).each(ne.&property);

g.V(ue).as('ue').V(4678).as('r').outE('next').union(__.match(__.as('e')
.inV().as('t')).where('t', neq('ue')).select('t')
.addE('next').from('ue').select('e'),g.E(oe)).drop()

The query is split into the following five sub-sections:

  1. Create the undo edge meta node and link it to the root.
  2. Get a reference to the old (to be deleted) edge.
  3. Add proxies linked to the undo edge meta node and add an edge between the two proxies that carries the label of the edge to delete and return said edge.
  4. Copy the property values from the old edge to the new
  5. Complete the insertion of the meta node and drop any old next edges plus the old edge itself.

An empty Undo stack would look like this once the compound operation had completed:

Vertex restoration

When the vertex was deleted, it was an orphan. After it has been restored it will be an orphan anew and subsequent operations beyond the scope of restoration are required to lift its orphan status.

Abiding by the constraint to always pop at the top, we only need the ID of the undoModel vertex (4678 in our examples) to execute the restore.

g.V().hasId(4678).as('r').out('next').as('ou').out('next')
.addE('next').from('r').select('ou').drop()
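
To convince ourselves the deletion was undone, a quick check along these lines (94 being the vertex ID from the deletion example) should show the vertex present and once again an orphan:

g.V(94).count()
g.V(94).bothE().count()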

Edge restoration

The complexity of edge deletion is also reflected in edge restoration.

oe = g.V().hasId(4678).out('next').in('head').inE().next();
tid = g.E(oe).inV().values('proxyID').next();
hid = g.E(oe).outV().values('proxyID').next(); 

ne = g.V().hasId(hid).as('h').V().hasId(tid)
.addE(oe.label()).from('h').next(); 

ElementHelper.propertyValueMap(oe).each(ne.&property);

g.V().hasId(4678).as('r').out('next').as('ou').out('next')
.addE('next').from('r').select('ou')
.union(__.as('s'), __.in('tail'), __.in('head')).drop()

The following operations are performed:

  1. Find the old edge, i.e. the clone that resulted from the delete operation
  2. Extract the vertex IDs from the two proxy nodes; t means tail and h means head.
  3. Create a new edge between the nodes in question, setting its label from the clone
  4. Copy all edge properties
  5. Remove the undo edge from the stack and delete the constituent objects.

In future posts, we’ll look at more complex dependency control that will afford us better restoration flexibility and also the ability to expire deleted objects from the undo stack.

Graph Database Cloud Deployment

Overview

In this post, we’ll cover the first few steps towards deploying a graph database in the cloud to facilitate centralized persistence and collaboration. To keep costs down (zero, in fact, if you haven’t yet exhausted your AWS Free Tier), we’ll deploy a rather scaled-down version of the Titan Graph DB that can run on a t2.micro instance. However, just because it can run on such an instance, this is by no means a recommendation to run a production Titan server in this manner. Proper production sizing should prevail, and this post should be seen as a rough template that can be scaled to suit bigger needs.

The key takeaways for the deployment are:

  1. Scalable storage and computation
  2. Secure, with end-user processes that are meaningful and can be enforced

Architecture

The server deployment outlined herein is a very classic one in the sense that it relies on a full-fledged Linux installation as an EC2 node and as such does not leverage containers or lambdas or any other light-weight mechanisms to achieve elastic scalability.

Titan Deployment

Storage

Currently, AWS is less than stellar in that it doesn’t allow a launch profile to directly attach an instance to an existing block device. This would be desirable because, with a detached EBS volume backing our database, we could very easily take advantage of Spot instances using an AMI-based boot snapshot and a standalone EBS volume: upon instance termination, our database backing store would remain, ready to be mounted the next time we launched a spot instance. Even so, to prepare for this eventuality, we’ll create an EBS volume that we’ll attach to the instance.

In our case, on a RHEL 7 image, an EBS device name of /dev/sdf showed up in Linux as /dev/xvdf. While completely optional, we’ll here create a single LVM partition that will give us much flexibility later on when deciding how to divvy up the storage space.

Use fdisk to create a single primary partition that spans the device and has partition type 8e (Linux LVM). Optionally, you can create a swap partition (type 82) as well, since your image by default won’t have one. If you do, that partition of course needs to be formatted with mkswap, activated with swapon and added to /etc/fstab.

# fdisk /dev/xvdf

In our LVM, we want one really small (e.g. 64MB) Logical Volume (LV) for the Titan data directory, as that’s where we’ll keep our credentials graph. The second LV is the bigger one (e.g. 2GB) where the database files will be stored. The reason these are so small, despite the free tier covering up to 30GB, is that the free tier eventually runs out and you don’t want to allocate more than you need. LVM makes it easy to grow these as needed. The LV creation follows these steps, presuming /dev/xvdf1 is your LVM partition:

# pvcreate /dev/xvdf1
# vgcreate vg /dev/xvdf1
# lvcreate -L 64M -n tdata vg
# lvcreate -L 2G -n dbdata vg
# mkfs.xfs /dev/mapper/vg-tdata
# mkfs.xfs /dev/mapper/vg-dbdata
# lsblk

The last command lists the device hierarchy.
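
Should either volume run low later on, growing it is a two-step affair. The size below is purely illustrative, and since XFS grows online the filesystem can be extended while mounted (the mount points are set up further down in this post):

# lvextend -L +2G /dev/mapper/vg-dbdata
# xfs_growfs /var/lib/titan/db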

OpenVPN

This post is not an in-depth elaboration of how to setup an OpenVPN server. There are maintained references for that. Instead, we’ll cover specific choices made when setting it up.

Specifically, we installed OpenVPN from the official RPM, put keys under /etc/openvpn/keys, i.e. the server’s key and certificate, the CA’s certificate and the DH .pem file. The configuration file is /etc/openvpn/server.conf in which we kept the 1194/udp port and protocol. For its simplicity and performance, we use a routed configuration (as opposed to bridged).

Choose a particular subnet and netmask with the server directive, e.g.

server 10.33.44.0 255.255.255.0

Your OpenVPN server will assign itself IP address 10.33.44.1, which is the address your client applications, i.e. those connecting to Titan once the VPN tunnel is up and running, will be using.

From a security point of view, TLS-Auth and running under an unprivileged account are highly recommended, since we need to enable world-wide (0.0.0.0/0) access to the VPN server in our AWS Security Group and as such need to arm ourselves against DoS attacks.
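
Pieced together, the relevant parts of server.conf end up resembling the excerpt below. Treat it as a sketch rather than a complete configuration; the file names under keys/ reflect whatever you generated:

port 1194
proto udp
dev tun
ca keys/ca.crt
cert keys/server.crt
key keys/server.key
dh keys/dh2048.pem
server 10.33.44.0 255.255.255.0
tls-auth keys/ta.key 0
user nobody
group nobody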

Open the requisite firewall port:

# firewall-cmd --zone=public --add-port=1194/udp
# firewall-cmd --zone=public --add-port=1194/udp --permanent

Generate client certificates either directly (including the .key file that then must be transferred over a secure line) or from certificate requests (.csr files).

Once you’ve confirmed that your setup is working, the final piece is to enable the systemd service file that the OpenVPN RPM provides. The .service file is a template unit, so when you enable it, make sure the instance name matches the configuration file, in our case server.conf. Hence:

# systemctl enable openvpn@server.service
# systemctl start openvpn@server.service

Titan

We have covered Titan deployments in previous posts and won’t be going into the same level of detail here. In sum, though, we want Titan installed under /var/lib/titan, running under a dedicated account (also named titan). We will also be using the titan account to run the console when necessary, so it’s important that it’s given a login shell (bash) and that it has a home directory; the Gremlin Console insists on writing user preferences and will throw an exception if there is no home directory. Alternatively, create the titan user without a shell and home directory and add the titan group to the user account you’ll be running the Gremlin Console with.
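
Creating such an account could be as simple as the following (adjust to your own conventions):

# useradd -m -d /home/titan -s /bin/bash titan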

Next, we’ll make titan use the LVs we have created. Basically:

# cd /var/lib/titan
# mkdir db
# mount /dev/mapper/vg-dbdata db
# mkdir db/berkeley
# mkdir db/es
# mv data data.old
# mkdir data
# mount /dev/mapper/vg-tdata data
# mv data.old/* data/
# rmdir data.old

Now the key Titan directories are using the EBS partitions. Update /etc/fstab to mount these on boot as e.g.

/dev/mapper/vg-dbdata /var/lib/titan/db xfs defaults,noexec 0 2
/dev/mapper/vg-tdata /var/lib/titan/data xfs defaults,noexec 0 2

Next, we’ll copy and modify the following files and make changes such that Titan will use the BerkeleyDB location we specified.

  1. Copy conf/remote.yaml and change the IP to the one Titan listens on.
  2. Copy conf/titan-berkeleyje-es.properties and change the paths to /var/lib/titan/db/berkeley and /var/lib/titan/db/es. Also add the following entry at the bottom:
    gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
  3. Copy conf/gremlin-server/gremlin-server.yaml and change the IP address and the path to the .properties file from the previous step.
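
To make the third step concrete, the relevant lines of the copied gremlin-server.yaml could end up resembling the following, where the IP is the OpenVPN tunnel address from earlier and the properties file name is whatever you chose in step 2:

host: 10.33.44.1
port: 8182
graphs: {
  graph: conf/gremlin-server/my-titan-berkeleyje-es.properties}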

Since we’re done with the file config, make the titan account own the install.

# cd /var/lib
# chown -R titan:titan titan

At this point it makes sense to ensure that Titan starts without errors and that as user titan, you can launch the Gremlin Console and connect to the Titan server.

Once that’s complete, it’s time for the final piece: creating a .service file so that systemd launches Titan on system boot. Start by changing to the system directory for systemd:

# cd /usr/lib/systemd/system

In there, create a file, named e.g. titan.service and tailor the contents to match the name of the server configuration file created above. E.g.

# /usr/lib/systemd/system/titan.service

[Unit]
Description=Titan
After=openvpn@server.service

[Service]
User=titan
Group=titan
WorkingDirectory=/var/lib/titan
ExecStart=/var/lib/titan/bin/gremlin-server.sh conf/gremlin-server/my-server.yaml
Restart=always

[Install]
WantedBy=multi-user.target

Note the dependence on the OpenVPN service launched earlier. This is required because the IP address the Titan server listens on belongs to the tun0 network device created by OpenVPN, and it would not do to start Titan before the OpenVPN service has completely started. Enable and start the Titan service:

# systemctl enable titan.service
# systemctl start titan.service

As we’re also running firewalld, open up the port to allow our clients access to the Titan server:

# firewall-cmd --zone=public --add-port=8182/tcp
# firewall-cmd --zone=public --add-port=8182/tcp --permanent

Process

Our server is now self-sustaining in the sense that its storage is organized and its services are set up. You can reboot it, stop and start it, and even create AMIs from it that can be launched, and it will come up properly.

In order to manage new users we need a few simple processes:

  1. Sign certificates from certificate requests. This is simple to do securely as it does not involve the transference of private keys. However, if you have opted for the TLS-Auth option (and it’s recommended that you do), that pre-shared key needs to be distributed to each client in a secure fashion.
  2. Distribute a pre-configured client.conf (or client.ovpn) configuration file to facilitate a hassle-free connection experience.
  3. Once the clients’ VPN tunnels are running, they should connect to Titan via the OpenVPN assigned address, which in our case was 10.33.44.1.
  4. If your Titan server requires authentication, define these credentials in the credentials graph and restart Titan.


Distributed Graph Database – Part 2

Overview

In the previous post, we deployed a Cassandra-backed Titan graph database. In this post we will continue to build on this by introducing basic means of securing access to the server. Specifically, we’re going to be defining username and password credentials as well as generating a CA-signed certificate for SSL.

Credentials

Gremlin-Server out of the box supports a simple authentication scheme based on username and password. These pairs are unsurprisingly managed using a graph, albeit a special one that is not manipulated through regular Gremlin syntax but rather through a Credentials wrapper that ensures passwords are not persisted in clear text.

By virtue of the credentials being managed through a distinct graph that is only ingested into Gremlin-Server upon startup, credential changes necessitate a Gremlin-Server restart. Over the next couple of sections we will create credentials and configure Gremlin-Server to require authentication defined by said credentials graph.

Defining the Graph

We manage the credentials graph, and the distinct file in which it will be persisted, through the Gremlin-Console. The one pre-requisite for the Gremlin-Console is that the tinkerpop.credentials plugin must be activated.

gremlin> :plugin use tinkerpop.credentials

Create the empty, in-memory graph and wrap it in the CredentialGraph construct:

gremlin> graph = TinkerGraph.open()
gremlin> creds = credentials(graph)

Add a couple of username and password pairs:

gremlin> creds.createUser('demoUser', 'bad-password')
gremlin> creds.createUser('otherUser', 'simple-password')

In principle, this is sufficient to authenticate two distinct principals and we can therefore go ahead and persist this to a file to be loaded into Gremlin-Server. We use the compact and lossless format Gryo.

gremlin> graph.io(IoCore.gryo()).writeGraph('data/my-creds.kryo')

Even though we have successfully persisted the graph at this stage, there are a number of operations that are relevant to the maintenance of the credentials graph.

Loading the Gryo file into an empty TinkerGraph as constructed above:

gremlin> graph.io(IoCore.gryo()).readGraph('data/my-creds.kryo')

Querying the number of username and password pairs in the graph:

gremlin> creds.countUsers()

Querying the usernames in the graph:

gremlin> g = graph.traversal()
gremlin> g.V().values('username')

Remove a username and password pair from the credentials graph:

gremlin> creds.removeUser('otherUser')

Configuring the Server

In line with the approach in the previous post, we won’t be writing many configuration files from scratch but rather copy and modify the ones that most closely resemble our requirements. On the Gremlin-Server side, our objective is to ensure the server starts with the credentials we defined in the Gremlin-Console above.

Start by copying the credentials graph file into the server’s data directory, e.g.:

cp ../gremlin-console/data/my-creds.kryo data/

Copy and modify the server configuration file that most closely resembles our requirements, i.e. conf/gremlin-server-secure.yaml:

cp conf/gremlin-server-secure.yaml conf/my-secure-server.yaml

Open the new config file in a text editor of your choice, search for the credentialsDbLocation key and change its value to reflect the name of the credentials file; in our case: data/my-creds.kryo
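
For orientation, the authentication section of the copied file then looks roughly like the abridged excerpt below; the class name is whatever the sample file already ships with, and only the location value is ours:

authentication: {
  className: org.apache.tinkerpop.gremlin.server.auth.SimpleAuthenticator,
  config: {
    credentialsDbLocation: data/my-creds.kryo}}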

Save, exit and start the Gremlin-Server:

$ bin/gremlin-server.sh conf/my-secure-server.yaml

Test the Credentials

The simplest test is performed using the Gremlin-Console, and we’ll start by creating the client configuration to connect. We will use conf/remote-secure.yaml as the source of inspiration and thus copy it to something we can modify, e.g.:

cp conf/remote-secure.yaml conf/my-secure-remote.yaml

Edit the new configuration file to update the username and password keys to reflect the values defined in the credentials graph above. Save and exit.

Start the Gremlin-Console and perform a connection test. Note that simply connecting is insufficient to test the credentials; the Gremlin-Console won’t attempt to log on until you ask it to do something involving resources on the Gremlin-Server:

gremlin> :remote connect tinkerpop.server conf/my-secure-remote.yaml
gremlin> :> g.V().values('name')

Other than getting a warning about a self-signed certificate which is not suitable for production, the test should return the expected results without errors.

SSL Certificate

While it’s encouraging to have authentication correctly set up, passwords are transmitted in clear text between client and server unless SSL is configured, and SSL is currently reporting (by warning about the self-signed certificate) that it’s not in a production-worthy state. That the certificate is self-signed means it was not issued by a trusted authority that would require proof of authenticity; hence, while the certificate can be used for encrypting a channel, as a client you have no way of knowing that the party on the other side is actually who you think it is.

Setup

Without going into great depth on PKI (Public Key Infrastructure) considerations, we can intuitively infer that we need a trusted entity to sign certificates, so that if we trust A, and A says that B is B, we can trust that B is B. A in this simple logic is a Certificate Authority, and while many such true authorities exist on the Internet, we don’t need one for our simple purpose here of illustrating micro-scale trust.

In terms of PKI to enable this scenario we’re going to be using the OpenSSL-backed easy-rsa. For those on RPM-based distributions, add the epel-release repository and install the easy-rsa package through yum. Before starting in earnest, copy the entire easy-rsa directory away from its /usr/share location to protect your vars update from being overwritten when easy-rsa is updated.

Certificate Generation

Readers unfamiliar with easy-rsa are encouraged to at least get a basic understanding of this simple yet highly useful wrapper that raises the abstraction level for PKI-relevant operations on top of OpenSSL.

It is assumed that the vars file has been updated and sourced into the current shell before the commands below are executed. Start by generating the key-pair etc. for the Certificate Authority itself:

./build-ca

The CA generated is capable of understanding certificate requests, but since we’re on the same host as the CA we can use a convenience function that bypasses the explicit generation and approval of the certificate request and thus can go on to directly generating the certificate itself.

./build-key-server gremlincert

After answers to the questions have been provided, including an empty password, this command outputs the key, certificate request and certificate files into the keys directory. The certificate is stored in keys/gremlincert.crt.

Configuration

After copying the gremlincert.key and gremlincert.crt files to the data area of the Gremlin-Server directory structure, edit the ssl area of the my-secure-server.yaml file and add the following after the opening curly brace and before the enabled: true:

keyFile: data/gremlincert.key,
keyCertChainFile: data/gremlincert.crt,
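
The resulting ssl block then reads approximately:

ssl: {
  keyFile: data/gremlincert.key,
  keyCertChainFile: data/gremlincert.crt,
  enabled: true}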

Gremlin-Server should now start cleanly using the certificate provided.

Distributed Graph DB – Part 1

Overview

While traditional relational databases remain an integral part of the solution space for information storage and retrieval, there can be no denying that NoSQL options present viable alternatives in spaces outside the RDBMS sweet spot. One such space, where relational storage and query is suboptimal, is the handling of graph-based structures. In previous posts, we’ve looked very briefly at the currently most widely adopted graph DB, Neo4j. Neo4j is a robust, commercially maintained and enterprise-worthy graph database. Neo4j exposes a query language named Cypher that allows query developers to author queries declaratively.

Cypher is relatively easy to learn for those with a database background; however, it suffers from one potentially serious shortcoming: Cypher is Neo4j-centric, and hence any application that relies on a graph DB backend will be tightly coupled with Neo4j. In this post we will instead be looking at Tinkerpop3 as an abstraction layer that allows us to decouple clients from the graph storage backend. Specifically, we will be deploying a development-grade server with a Titan graph database backed by an instance of Apache Cassandra.

Architecture

The diagram below illustrates a single server node architecture, suitable for small production deployments and non-production, such as test and development scenarios.

Titan Dev Server

Titan, our graph database abstraction of choice, supports many deployment scenarios and here we’ll focus on a remote, non-embedded architecture. With the addition of a load balancer, this model will scale very well horizontally owing to Cassandra’s intrinsic clustering and Titan’s effective statelessness. We will however not address that option in this post.

Gremlin

Gremlin is part of the Tinkerpop suite and can be thought of as a Domain Specific Language for graph traversal (and also structure). Gremlin-Server is a wrapper that, through WebSockets or REST APIs, receives Gremlin scripts, executes them through a pluggable driver architecture that abstracts the actual graph storage implementation, and marshals the results back to the client. Gremlin-Server was named Rexster in the no longer supported version 2 of Tinkerpop.

Gremlin doesn’t require a server, however. In principle, the Gremlin-Shell can be configured to leverage any supported graph database by loading the appropriate plug-in/driver. This is not how we’ll be using the Gremlin-Shell in this post, though; we’ll be using it to submit scripts to the Gremlin-Server, and thus the graphs will not be loaded into the Gremlin-Shell’s memory.

Setup

As our server image, we will be using a minimal CentOS7 image running in VMware Fusion 7 Pro. The image is light-weight with 2 virtual cores and a modest 2GB of RAM. The guest image NIC is NAT’d.

Components

Before starting, download the necessary components:

  1. Java8 JVM. For our CentOS image we need a 64bit Linux RPM.
  2. Most recent version of Cassandra. While version 3.0.0 is released at the time of this writing, 2.2.3 is the most stable, production-worthy version so we’re going with that.
  3. The recommended version of Titan. At the time of writing, this was Titan 1.0.0 with Hadoop 1.0.

To keep the sudo-ing to a minimum, we’ll install said server components under our $HOME directory. This is clearly inappropriate for production deployments, however small, but it suits our needs for a development-centric environment.

Steps

Java

Install the JVM, systemwide:

$ sudo rpm -i jdk-8u66-linux-x64.rpm

Verify that the Java version is the correct one:

$ java -version

If it isn’t, run the following interactive utility and select the appropriate JVM version:

$ sudo alternatives --config java

It may be necessary/appropriate to declare $JAVA_HOME to point to the JVM as well.
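
For example, with the Oracle RPM installed as above, something along these lines (e.g. in your shell profile) does the trick:

$ export JAVA_HOME=/usr/java/latest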

Apache Cassandra

Because we’re only deploying a development-grade server, we will at this point not worry about tuning or even changing the network interface on which the server listens. Recall from the diagram above that Titan will be running on the same host as Cassandra, so it’s entirely appropriate and even desirable for Cassandra to only accept non-networked requests. This is more secure and also performs better by virtue of not having to involve the network stack when Cassandra and Titan communicate. Uncompress the Cassandra tar ball:

$ tar -xzf apache-cassandra-2.2.3-bin.tar.gz

cd into the top-level directory of Cassandra and start the server on the console in the background.

$ bin/cassandra -f &

Once it has settled down, you need to enable the Thrift API. This API is required by Titan when using Cassandra as the storage backend.

$ bin/nodetool enablethrift

By checking the opened ports you should see that 9160/tcp is now active. In addition, with the following netstat parameters, you’ll also see all other ports associated with the Cassandra PID:

$ sudo netstat -tulpn

Now that Cassandra is running, you can move Cassandra to the foreground and our work continues in another session.

$ fg

In a different terminal session, cd into the Cassandra directory, start the Cassandra Query Language Shell and list the keyspaces, just to confirm that you can communicate with the server and to get a baseline of the keyspaces available:

$ bin/cqlsh
cqlsh> describe keyspaces;

Titan

The Titan package is effectively Gremlin (Shell and Server) with the Titan add-on pre-installed. Nonetheless, our exact scenario is not covered by the convenience files provided so we have to make some minor additions. Start by unpacking Titan:

$ unzip -q titan-1.0.0-hadoop1.zip

Next we want to create a server configuration where a standalone, local Cassandra is the storage backend. We do this by making a copy of the Cassandra properties file used by the Gremlin-shell and adding one optional entry:

$ cd titan-1.0.0-hadoop1/conf/gremlin-server
$ cp ../titan-cassandra.properties ./titan-cassandra-server.properties

Using your editor of choice, open the newly copied file and under the storage.hostname assignment, add the following:

# Define custom keyspace
storage.cassandra.keyspace=titan_test

This is not necessary in order to get the configuration up and running. It is however useful to know when co-locating many different Titan servers on the same Cassandra cluster.

Next we move on to creating the Gremlin-Server configuration, i.e. the file that we pass to Gremlin-Server telling it about our Cassandra storage particulars. Gremlin-Server configuration is in YAML format. Here, too, we copy and modify:

$ cp gremlin-server.yaml gremlin-cassandra-server.yaml

At this point our changes are limited to making sure Gremlin-Server listens on a network adapter (or all of them, as with the change below):

host: 0.0.0.0

With static IP servers, said IP address could be entered instead. We also have to point Gremlin-Server to the Cassandra properties file we just configured:

graph: conf/gremlin-server/titan-cassandra-server.properties

That is it for Gremlin-Server configuration. Go back to the Titan root and start the server:

$ cd ../..
$ bin/gremlin-server.sh conf/gremlin-server/titan-cassandra-server.yaml

It will spit out a staggering amount of information that we shouldn’t have to care too much about. Be sure to wait for the message that says it’s listening on port 8182 though.

Open up the firewall to enable communication on the Gremlin-Server port:

$ sudo firewall-cmd --zone=public --add-port=8182/tcp --permanent

To access the Gremlin-Server from the host machine (and beyond), don’t forget to add the port mapping in VMware’s nat.conf as outlined here.

Test

At this stage, we have our server image running Titan with a standalone, locally accessible instance of Cassandra as storage backend. To test this, we will use the Gremlin-Shell as it’s by far the easiest. In addition, we will go back to the Cassandra Query Shell to confirm that things are happening in that instance. Start there by confirming that the new keyspace (defined in the .properties file above) is available and that empty tables have been added.

cqlsh> describe keyspaces;
cqlsh> use titan_test;
cqlsh:titan_test> describe tables;
cqlsh:titan_test> select * from graphindex;

To continue the test we’re going to run the Gremlin-Shell. Even though we have configured Gremlin-Server to listen for requests over the network, we’re going to run the Gremlin-Shell locally for the test. Note that remote Gremlin-Shells require the hostname of the server to be defined in conf/remote.yaml.
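
For reference, the lines that matter in conf/remote.yaml are essentially the following, where the hostname is a placeholder for your own server:

hosts: [titan-server-hostname]
port: 8182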

$ bin/gremlin.sh

To connect to the Gremlin-Server inside the Gremlin-Shell, issue the following command:

gremlin> :remote connect tinkerpop.server conf/remote.yaml

We’ll create a small graph to prove that we persist it to the Cassandra system and that we can query it using Gremlin syntax. Note that :> is shorthand for :submit, i.e. it tells the Gremlin-Shell to send it to the active server connection instead of trying to execute it itself.

gremlin> script = """
cs = graph.addVertex('name', 'Cassandra Server')
ts = graph.addVertex('name', 'Titan Server')
gs = graph.addVertex('name', 'Gremlin Server')

ts.addEdge('queries', cs)
gs.addEdge('hosts', ts)
"""
gremlin> :> @script

Traverse the graph using Gremlin to determine the name of the entity that ‘Titan Server’ queries:

gremlin> :> g.V().has('name', 'Titan Server').out('queries').values('name')

It should return ==> Cassandra Server

We can now go back to the Cassandra Query Shell to verify that something has been added to the data store. As the Titan graph schema is non-relational, we will simply see BLOB entries in the tables. The important part, and what we’re testing here, is that the graph was indeed added to the database we spun up outside the Titan install. As humans interested in our graph content, we issue Gremlin scripts against Gremlin-Server, not queries directly against Cassandra.

Read GraphML

Not all graphs will be created from scratch using structural Gremlin, however. There are a number of supported file formats for serialized graphs that we can read from the file system and load into our graph of choice. In this test, we will load the Gremlin-shipped sample graph Grateful Dead choosing the XML-based format GraphML. All samples are located in the data directory. Loading into our existing graph is as simple as:

:> graph.io(IoCore.graphml()).readGraph('data/grateful-dead.xml')

This will load some 808 vertices from the provided file. Count the number of server-side vertices to verify the load (g is already the traverser variable):

:> g.V().count()

Lastly, to convince ourselves that the graph was indeed loaded into the Cassandra system, we query the edgestore table in the titan keyspace:

cqlsh:titan_test> select count(*) as edges from edgestore;

While we’ll get a warning that we’re aggregating (i.e. row count) without specifying a partition key, the point here was to illustrate that thousands of new edges have successfully been added, which concludes the tests.

Where to next?

Next up, we’ll be securing our Gremlin-Server with authentication and SSL.

Centralizing GIT for a small team

Overview

In this post, we’ll examine how a small distributed team can collaborate using a centralized, i.e. effectively hosted, GIT server. GIT is inherently a distributed version control system that offers an incredibly low barrier to entry for achieving local version control. In fact, we’ll address directly the scenario where a locally started code base gets promoted to said central repository.

Although even the smallest, local-only, repositories benefit from taking advantage of GIT’s powerful and efficient branching paradigm to e.g. perform experimental or otherwise feature-specific development in a topic branch, these practices have even greater impact when dealing with centralized repositories.

In addition, and related to topic branching, we’ll take a side in the sometimes rather heated exchange between those who squash and those who don’t. More on that below.

Architecture

The following simple schematic depicts the architecture from the originating developer’s point of view. Before centralization, said developer uses a local GIT repository whose origin points to a repository hosted on a local NAS-type device.

Target SCM Architecture

Access

While we generally recommend using SSH for accessing remote GIT repositories, this is not always possible. For illustrative purposes, therefore, we’ll access the central repository via HTTPS. HTTPS has one rather significant drawback in that it’s stateless from a connection point of view, and in order for users not to get prompted for username and password on every operation involving the remote repo, the GIT client offers various means of credential caching. Because our developer is on OSX, we’ll leverage the OS-level Keychain to retain the credentials necessary to access the central repo.

Start by enabling the credential helper:

$ git config --global credential.helper osxkeychain

This simply updates the ~/.gitconfig file, and after being prompted the first time for a new remote, the credentials will be stored in the keychain.
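
The corresponding entry in ~/.gitconfig simply reads:

[credential]
        helper = osxkeychain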

First push

Before doing anything related to our central repository, we have to confirm the remotes we already have defined.

$ git remote -v

In our developer’s case, it will show two lines detailing the URLs to fetch from and push to the remote named origin. While not special, origin is the default remote name, and the new, central remote must be referred to by a different name. Because the developer will start collaborating, we’ll call the new remote collab.

On the command prompt while in the repository directory structure, we add the link to the central remote like so (the URL is completely fictitious):

$ git remote add collab https://myname@mycompany.git.somegithost.com/myproject.git

Running a verbose remote listing again will yield two additional lines with the fetch and push URLs for the remote named collab.

Now for the actual push itself. Much like all GIT commands, at first blush it appears deceptively simple. This ‘default is simple’ paradigm is one of GIT’s many strengths. Don’t be fooled, though: there is real complexity under the hood, and complex scenarios will require similarly complex commands and a thorough understanding of the underpinnings in order not to make a hash of things. That being said, GIT is a robust version control system, and you’d have to take some rather extreme steps to obliterate committed (and pushed) work; it would likely not be the result of an ignorant accident. About that push command, however…

$ git push collab master

That is it. Use the --dry-run switch to try the command out first and ensure it’s all going to work properly, while stopping short of sending anything to the central repository.
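
For instance, a rehearsal of the push above looks like:

$ git push --dry-run collab master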

Collaborative Topic Branches

GIT’s branching paradigm, through its efficiency and rich, robust operations, encourages branches to be an integral part of everyday version control. While this is helpful for individual developers looking to compartmentalize feature development, retain a stable master or, as we shall see, gain an environment that permits (and perhaps encourages) frequent commits, the branch concept really shines when multiple developers collaborate not only on the same code base but on the same features.

There is a plethora of branching strategies out there and we don’t profess to using a superior one, but rather one that fits our purpose for a small development team and scales reasonably without any change.

Central GIT branches

In a nutshell, the master branch (often called mainline in other version management systems) is kept as stable as possible and does not see direct, incremental feature development. Instead, feature development takes place in topic branches where one or more developers implement the feature specific to that branch. Once the feature is complete, including being caught up with master, the code is reviewed before being merged back into master. At this point, even though the schematic doesn’t show it, the topic branch can and should be deleted. Therefore, we call topic branches transient.

Topic branches

Local development branches are created most easily with the checkout -b command.

$ git checkout -b topic_branch

At this stage it is (of course) a local-only branch. We can also say it’s an un-tracked branch, which is to say it’s currently not tracking a remote branch, which is also to say there is no upstream branch defined for this (again, local) branch.

The following command pushes said branch to the remote of choice (here, collab) and makes it into its upstream branch, all in one go:

$ git push --set-upstream collab topic_branch

Execute the following branch command to see which branches exist and which are tracking branches in which remotes:

$ git branch -vv

But wait, what about our remote named origin? You can still push to (and fetch from) it, but an operation involving a remote branch that the local branch isn’t currently tracking needs to explicitly state the remote (and the branch, if named differently). E.g.

$ git push origin topic_branch

Release Branches

As for official releases, these are tagged in GIT with so called Annotated Tags to enable full check-in notes and user lineage:

$ git tag -a tag_name

This opens the editor for the tag annotation. -m can be used but is discouraged for official release tags. Branches don’t, strictly speaking, need to be created at this point. Instead, branch at the point of the appropriate release tag when it’s necessary to do so, typically to create a release-specific fix.
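
For instance, tagging a release and later branching off that tag (the tag and branch names are illustrative) is as simple as:

$ git tag -a v1.2.0
$ git checkout -b release_1.2_fixes v1.2.0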

Squashing

The topic of squashing has seen a lot of colorful commentary. To understand why, know that squashing refers to taking a number of commits and collapsing, i.e. squashing, them into one. This can be beneficial in local topic branches as it promotes frequent commits and thus helps mitigate the risk of lost work. Furthermore, as there is less emotional attachment to local topic commits, the commit notes need not be elaborate nor go beyond a simple -m on the commit command line.

Squashing occurs once the feature is development-complete in the topic branch but before master (or whichever source branch the topic was created off of) is merged in. Squashing starts off as an interactive rebase from within the topic branch:

$ git rebase -i master

This opens the editor with all commits that have happened since the branch was synchronized from master. Replace the leading pick word with squash on all lines except the first one. Save and quit. You get another editor where the official commit note is to be entered.

Again, only squash on local topic branches before merging with tracked branches. For example, do not rebase central, collaborative topic branches.

Merging with master

We mentioned that in our configuration, topic branches are transient. This means we at some point want to merge the enhancements and fixes in our topic branch into master, once the former is deemed stable. When you only have one remote, this is fairly straightforward and well documented elsewhere. With multiple remote repositories we have some extra considerations to make.

To begin with, our remote merges take place in a distinct clone of the remote. I.e. we clone our centralized repository into a distinct local repository that we nonetheless name consistently to ensure we avoid confusing remotes by calling it origin in one clone and collab in another. As we called the central repo collab in our development area, we’ll call it exactly the same in our merge area:

$ git clone -o collab https://...

Should you already have cloned the repository into origin, simply rename it:

$ git remote rename origin collab

Now, our objective is to track our remote’s master and topic branch, merge topic into master and delete the topic branch on the server and from all local repositories (which will number two in our case; one for development and one for merge). To make things a little more interesting we’ll name our client-side topic branch client_topic to distinguish it from the remote’s topic_branch.

Our starting point after the clone is in the tracked master branch.

$ git branch client_topic
$ git branch --set-upstream-to=collab/topic_branch client_topic
$ git checkout client_topic
$ git pull
$ git checkout master
$ git merge client_topic
$ git push collab master :topic_branch

This is not the most compact manner in which these operations can be performed, but it illustrates the main points. Particularly noteworthy is the combined push and remote branch deletion in the last command. It can be split up into a plain push followed by a second push carrying the delete request. That approach may be desirable as it gives you an opportunity to verify that the target branch looks good before deleting the topic branch.
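
Split up, the final step would read:

$ git push collab master
$ git push collab --delete topic_branch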

The remote branch is now gone from the list of known remote branches:

$ git branch -r

Interestingly, perhaps, when inspecting tracked branches, our client_topic’s tracking status is now “gone”:

$ git branch -vv

Regardless, we’re now well positioned to delete the local branch

$ git branch -d client_topic

As for the development repository, the operations are fairly intuitive: pull in the changes from collab and delete the local topic branch.

$ git checkout master
$ git pull collab master
$ git branch -d topic_branch

The topic branch has now been merged and removed.

Spark on Hadoop

Overview

Data engineers are required to have a functional understanding of a broad range of systems and tools that operate somewhere in the space between source system and end users. Said tools and systems typically store, transform, visualize (or some combination thereof) the information as it makes its way to the consumers. One such system has been surrounded by sufficient mystique that people ostensibly have been hired purely by virtue of claiming some unquantified degree of knowledge thereof. That system is Apache Hadoop: the open source 800-pound gorilla that, in addition to providing fodder for oohs and aahs at certain cocktail parties, enables distributed storage and processing of data that doesn’t have to have inherent structure.

While many great packaged distributions of Hadoop, along with truly awe-inspiring enhancements, are readily available (often in Community-style editions), there is benefit to spending a little time with the base components on their own before rushing to complex integrated solutions, in order to develop an intuition and feel for how they work in isolation. The purpose of this post is to outline the steps performed to first and foremost get a working Hadoop system. Once this foundation is in place, we’ll add an in-memory processing engine to the mix: Apache Spark.

Arguably, the most sincere effort to understand the software we’re using would require us to work with the source, where building it is an absolute minimum. However, in this post we won’t be working with the source as that’s not congruent with our goal to get something up and running quickly. After all, this is about being able to use Spark on Hadoop, not setting up build environments.

Environment

Speaking of environment; we’ll be setting up Spark 1.4.1 on Hadoop 2.7.1 on a CentOS 7 guest image running in VMware Fusion 7 Professional on a MacBook Pro running Yosemite.

A degree of care needs to be taken before rushing off to download the software. On the Spark Download Page, be sure to select “Pre-built for Hadoop 2.6 and later”. As for Hadoop, simply pick the appropriate binary package from the Hadoop Release Page.

Depending on the objectives of the installation, it may or may not be acceptable or desirable to install under $HOME. As we in this case will be eliminating as many complexities as possible, we will disregard the fact that we can sudo and will thus proceed with a user-local installation, which consequently also runs under the current user. Later, we’ll explore what it takes to promote the installation to /usr/local.

As we’ll be performing these activities in a guest virtual image, it’s beneficial to have performed the SSH key exchange with the host to facilitate simple host access for file copy operations. The guests on my setup have access to the host through the host name hostos. That means I can access files on the host system very easily without needing to provide any password, provided that the usernames on host and guest match.

guest$ scp hostos:Downloads/hadoop-2.7.1.tar.gz .

Or, perhaps even better, if you have mounted a networked directory, e.g. on a NAS which is the case on our network:

guest$ scp hostos:/Volumes/Server/Common/hadoop-2.7.1.tar.gz .

Getting started

Untar the binary distribution to a location of your choice:

guest$ tar -xf hadoop-2.7.1.tar.gz

Before you’ll be able to run bin/hadoop, you need to tell it which JVM to use. On CentOS 7, the latest JVM can be found at /usr/java/latest and you can either point your JAVA_HOME environment variable to that location or edit the corresponding entry in your local etc/hadoop/hadoop-env.sh configuration file. One reason to consider the latter alternative is that non-volatile configuration, such as the JVM location, stays within the Hadoop file structure, which removes moving parts when making other changes such as which user the server should run under.
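As a concrete example of the latter alternative, and assuming the CentOS JVM location mentioned above, the relevant line in etc/hadoop/hadoop-env.sh would look something like:

export JAVA_HOME=/usr/java/latest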

We’ll be configuring a so-called Pseudo-Distributed Cluster. The steps to configure that, taken from the Hadoop Quickstart Guide, are summarized below. All console operations assume the working directory is the hadoop-2.7.1 root.

Define site properties in the following two files as per:

etc/hadoop/core-site.xml:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

etc/hadoop/hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

In order for the distributed file system operations to work, ssh to localhost needs to work without prompting for a password. Some information about that was provided as part of an earlier post.
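If that isn’t already in place, a minimal sketch of the key setup, assuming RSA keys and a single-user guest (the earlier post covers the details):

guest$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
guest$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
guest$ chmod 0600 ~/.ssh/authorized_keys

Once ssh localhost works without a prompt, we can proceed with formatting the file system.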

guest$ bin/hdfs namenode -format

With a formatted file system, we’re ready to start the file system daemon:

guest$ sbin/start-dfs.sh

After the namenode and datanode have been started, the server is listening on several new TCP ports: 9000, where worker requests are received, and a number of ports in the 50,000 range for various forms of access. The two most interesting ports from a networking point of view at this stage are 50070 (administrative webapp) and 50075 (HTTP file access).
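To see from within the guest which ports the Hadoop daemons actually picked up, something like the following should do (filtering on java is simply a crude but convenient way to single out the Hadoop processes):

guest$ sudo netstat -tulpn | grep java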

When the CentOS 7 guest image is headless, it’s very helpful to have access to those ports from the host. The steps for enabling that with VMware Fusion 7 and NAT’d CentOS 7 guests are outlined in a previous post and briefly summarized again:

Open the TCP port 50070 on the guest OS’ firewall:

guest$ sudo firewall-cmd --zone=public --add-port=50070/tcp --permanent

Add the port mapping in your host’s /Library/Preferences/VMware Fusion/vmnet8/nat.conf and restart VMware Fusion. Yes, this means stopping the guest; once it’s restarted, start the DFS as per above.
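For reference, the mapping itself is a single line under the [incomingtcp] section of nat.conf; the guest IP below is purely illustrative and should be replaced with your own NAT’d guest’s address:

50070 = 172.16.126.143:50070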

Validating DFS

At this point we should have a completely functioning Hadoop single-node pseudo cluster. Again referencing the quick start guide, this is the time to confirm that map-reduce jobs can be run.

The general approach is to use the bin/hdfs command with the dfs argument to create directories or put files. E.g.:

guest$ bin/hdfs dfs -mkdir /user
guest$ bin/hdfs dfs -mkdir /user/<your user name>
guest$ bin/hdfs dfs -put etc/hadoop input

This sequence creates a three-level folder hierarchy and puts all Hadoop configuration files in the input folder.

As for the actual tests themselves, Hadoop ships with example jars, including jars containing their source, which is really handy. Let’s then run a word count example, the bread and butter of Hadoop:

guest$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount input output

This command starts a Hadoop job defined in a specific example jar. To the main function of said jar, the first argument indicates which class to direct the remaining arguments to: the folder names input and output. When Hadoop runs these jars, the working directory will be /user/<your user name>, which is why it can resolve input as the name of the folder that resides at this level. The last argument, output, is the name of the folder that the class behind wordcount will create, and the job will fail if it’s there already. To re-run the example without having to create a growing number of uniquely named output folders, delete the output folder, e.g. like so:

guest$ bin/hdfs dfs -rm  -R output

Run the jar through bin/hadoop without any additional arguments to get a list of all the example programs that have been mapped in the jar (in file ExampleDriver.java).

Back to the example though: if TCP port 50075 has not been opened in the firewall and NAT’d in the virtualization engine, simply download the output folder to the local file system with the DFS -get command:

guest$ bin/hdfs dfs -get output

Pop into the local output folder and the word count results will be in a file named part-r-00000.
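A quick peek at the results, assuming the folder was downloaded to the current directory as above:

guest$ head output/part-r-00000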

Adding Spark

Because we’re interested in running Spark jobs that consume content residing in a Hadoop cluster (or pseudo-cluster, which is the case above), it’s imperative that the Spark binaries are built with the proper Hadoop version in mind. For a 2.7.1 Hadoop, we’ll get Spark 1.4.1 built for Hadoop 2.6 and newer.

We untar it right next to our Hadoop install, but before we do anything else, we need to tell Spark about Hadoop and we need to reduce the verbosity of Hadoop’s logging.

For the first one, we define environment variable HADOOP_CONF_DIR to point to Hadoop’s configuration directory. We put it in our ~/.bash_profile for future sessions:

export HADOOP_CONF_DIR=$HOME/hadoop-2.7.1/etc/hadoop

Second, change Hadoop’s root log-level from INFO to ERROR by updating the corresponding entry in $HADOOP_CONF_DIR/log4j.properties:

hadoop.root.logger=ERROR,console

We’re now ready to go into our Spark root and run the Python Spark shell, pyspark:

guest$ bin/pyspark

Perform a very basic hdfs:// connectivity test by verifying that Spark can read one of the files that were uploaded earlier into the input folder. Note though that the full path needs to be provided:

>>> testRDD = sc.textFile("hdfs:///user/testuser/input/hadoop-env.sh")
>>> testRDD.count()
99
>>>

While the number (99) may differ between deployments, it should be possible to run the above commands (substituting the testuser as required) without getting any errors.
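As a slightly more substantial smoke test, a rough pyspark equivalent of the earlier Hadoop word count can be run against the same RDD; the snippet below is a minimal sketch using standard RDD operations:

>>> words = testRDD.flatMap(lambda line: line.split())
>>> counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
>>> counts.takeOrdered(5, key=lambda kv: -kv[1])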

The shell, be it Python or Scala (or even a slimmed down version of R as of Spark 1.4), is useful for performing simple tests. In order to run more complex logic, use bin/spark-submit to launch the program.
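As a hedged illustration, assuming the binary distribution ships its Python examples under examples/src/main/python (which the 1.4.1 package does in my experience), the bundled pi estimator can be submitted like so:

guest$ bin/spark-submit examples/src/main/python/pi.py 10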

This concludes the very simple basic setup of a single-node Hadoop pseudo-cluster and standalone Spark. Much can be done with this simple environment from a functional point of view, such as testing genuine Big Data applications, albeit against very modest data sets. Said applications can then be deployed unchanged onto production-grade clusters.

What about YARN?

Production cluster deployments of Hadoop and Spark will include YARN, the Hadoop version 2 resource manager. In a nutshell, Spark is, in addition to a standalone entity, also a YARN application. YARN is so complex it’s bordering on the fantastical and will be discussed in a future post.


Basic NAT for VMs

Objective

As builders of analytical landscapes, we have to pay attention to a myriad of details. The topic of this post is carved out of a small portion of a huge subject area: Networking. Specifically, we’ll be focusing on the system builder’s machine setup to facilitate access to virtualized nodes running on a developer-type machine. To illustrate this setup we’ll use a very specific configuration. Different configurations will have different strengths and weaknesses but the example below will show the basic concepts.

The overall objective will be to run the graph database Neo4j inside a virtual machine on the host machine. Said database should be accessible to those who can access the host machine, directly or via the network.

Environment Setup

The host machine in this case is a 13.3″ MacBook Pro running Yosemite with VMware Fusion 7 Professional. Neo4j will be installed on a 64-bit CentOS 7 guest (the firewall configuration on earlier versions is completely different). To keep this topic focused, we will not cover the installation of any software, but rather stay purely on the network configuration side of things.

Network Options

Virtualization solutions such as VMware provide network access to the guest images through a software switch, or a set thereof as is the case in VMware specifically. They are:

  • Host-only network. As the name implies, only the host itself can see and communicate with guests. Useful e.g. to limit the impact of untrusted images. This network is provided through a virtual switch named vmnet1.
  • NAT, or Network Address Translation, which is the dominant configuration when dealing with firewalls; images behind the firewall appear as one to those outside the firewall. The firewall in turn makes sure packets are routed back to the correct host for returning/response traffic or forwarded in the case of new in-bound traffic. VMware’s switch for this is called vmnet8.
  • Bridged network turns the guests into ‘topological siblings’ of sorts with the host. This means guests route like the host, are on the same subnet and consequently get their IP address (in the case of DHCP) from the same server as the host, etc. This isn’t always exactly the case, but for the purposes of this argument, it is.

Which option to choose? Well, the requirements disqualify the host-only option, so it stands between NAT and Bridged. Unless there are very specific requirements that call for Bridged, err on the side of NAT. This keeps control with you, as you have one more layer between your guests and the network you’re on. This works both ways: if you’re on an otherwise untrusted network, you don’t have to harden the firewall of your development guest images that need to communicate, which saves you from chasing down bugs that turn out to be an overzealous firewall configuration. Going the other way, you have one more way of making sure your guest images won’t embarrass you by causing your client’s network administrator to run screaming that they’re under attack. Lastly, on the topic of performance: why force host-to-guest and guest-to-guest traffic to route outside your host? Unless you need to sniff that traffic on e.g. your router, it’s just a drag on performance for you and those using your network.

We can now reformulate the problem to: with a NAT’d guest, how do we access it from the outside, i.e. on the host and beyond?

Three-tiered Solution

The answer to that question was already alluded to above: port forwarding. The concept is familiar to anybody who’s been playing around in their router’s administration app. Most, if not all, modern routers allow you to open a particular port (or range of ports) on the WAN side and connect with, i.e. forward to, a host and port (range) combination on the LAN side. Basically the exact same principle applies to the ‘router’ that is vmnet8. This forward rule is specified in the minimalist syntax required by the following file:

/Library/Preferences/VMware Fusion/vmnet8/nat.conf

Really old versions of VMware Fusion will have a slightly different path, but the idea remains the same. VMware is kind enough to put in examples of such forwarding rules; you just have to be mindful of the section header for TCP or UDP. E.g. for TCP, the following (commented-out) example is provided:

[incomingtcp]

# Use these with care - anyone can enter into your VM through these...
# The format and example are as follows:
#<external port number> = <VM's IP address>:<VM's port number>
#8080 = 172.16.3.128:80

If uncommented and VMware restarted (not just the guest image, but Fusion proper), http://localhost:8080 in the browser on the host would be resolved and routed to the NAT’d guest with the IP address in question. In our case with Neo4j, we’re interested in accessing our guest’s port 7474/tcp, so we add the following line under the commented-out line above:

7474 = 172.16.126.143:7474

A few things to note: We specified our guest by IP address. In the VMware case this doesn’t necessitate a static IP for the guest; VMware’s internal DHCP server for vmnet1 and vmnet8 has address affinity and will give your guest a consistent address. Note also that we didn’t have to choose port 7474 on our host; it could in principle be any available port above 1024, e.g. 17474. In many cases this is actually recommended, especially when forwarding to a guest’s web server that often listens on ports 80 and 443, which very likely are already taken on your host (or by another guest on your host).
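Using the same illustrative guest IP as above, such an alternative mapping would look like this, after which the guest’s Neo4j would be reachable from the host at http://localhost:17474:

17474 = 172.16.126.143:7474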

But even so, we’re not necessarily done. Why not? Well, all we’ve really done is put up a sign and made a path, so to speak. There may still be an uncompromising firewall erected by the guest that prevents our access. In our case, since we’re using CentOS 7 (again it’s important it’s 7 as 6.x behaves differently) we’ll be using the rather pleasant firewall-cmd tool. Specifically, we have to do the following:

  • Determine the zone for the NIC in question (see below)
sudo firewall-cmd --get-active-zones
  • Poke a hole in the firewall for the port in the appropriate zone, in this case: public.
sudo firewall-cmd --zone=public --add-port=7474/tcp
  • Make changes permanent once confirmed to be working
sudo firewall-cmd --zone=public --add-port=7474/tcp --permanent
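To confirm the hole is actually there, the open ports in the zone can be listed:

sudo firewall-cmd --zone=public --list-ports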

While following the conceptual flow of information, this still is somewhat backwards. The reason being, for us to establish the zone, we need to know the NIC, and if we knew which NIC, we would quickly realize that Neo4j out of the box does not use that NIC. Instead, very sensibly, Neo4j, unless you tell it otherwise, will only listen on the loop-back interface to eliminate the risk that external marauders get access to your freshly minted graph database. So, even though we’ve set up the port forward and opened the firewall, we’ve merely configured a route to a NIC where there’s no Neo4j.

Enter the third and final configuration step: make Neo4j listen on the NIC you’ve opened 7474/tcp access to. Thankfully, we don’t have to say exactly which IP this is; by telling Neo4j to listen on 0.0.0.0, we’re effectively telling it that we want a ‘real’ (as real as it gets in the realm of virtual servers) NIC and not loop-back. To convince yourself of this, all you need to run is this:

sudo netstat -tulpn

It will show that 7474/tcp is listening on 127.0.0.1, i.e. loopback. Again, your objective is to tell Neo4j to listen on 0.0.0.0. This listen setting is defined in Neo4j’s neo4j-server.properties file. Locate and uncomment the following line:

org.neo4j.server.webserver.address=0.0.0.0

Save and restart Neo4j. Now we’re ready to access our guest’s 7474/tcp port, which in Neo4j’s case we do in the browser: http://localhost:7474
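To double-check, the same netstat command as before should now show 7474/tcp bound to 0.0.0.0 rather than 127.0.0.1:

sudo netstat -tulpn | grep 7474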

Version Control

Version control is a huge topic and I won’t even attempt to do the rationale for it justice. Suffice it to say, a state-of-the-art source control repository is an essential piece of partner software that will keep information, configuration, code etc. properly versioned and, if so desired, open to collaboration.

The version control system of choice here is git. Why? It is incredibly efficient, heavily used throughout the Open Source Software community, and it’s architected at its core to function while detached from remote repositories.

The Pro Git book is published online, freely available and, in my mind, the git resource; I rarely go elsewhere for git info.

So, what are we trying to do here? As I have many VMware images, typically running various versions of Linux, and running purely local git repositories on my trusty laptop is not sufficiently redundant to suit my tastes, I need a server that hosts my git remote instance. For this purpose, I’m using my Synology NAS 713+ machine. Not a spectacular piece of hardware, but it’s got redundant drives and performs well enough for this purpose.

The net result I’m after here is that I have a remote repository that I can clone in any of my ‘worker’ images so that I can get reliable access to configuration commands, code etc. without having to resort to shared folder access.

As I have VPN access to my own network when outside of it, I strictly speaking won’t need to forward any ports when working with the remote repository outside my firewalls, but because I enable SSH with certificate authentication only for accessing said repositories, it presents a light-weight option for working truly remotely with the repository.

Basic Configuration

First off, the NAS needs to be configured with a share, optionally a dedicated account and an appropriate security definition. To keep things simple, as my client account name matches that on my NAS, I’ve opted not to set up a distinct account for accessing the remote git repository. Again, this is the trivial alternative; people looking for configurations more conducive to robust collaboration will find plenty of resources outlining their options in detail.

In the steps below, I’ve artificially altered the shell prompt to indicate where the commands are executed, client or NAS: client$ or nas$.

Authorize your client account to log on to the NAS via SSH without having to specify a password. You do this by adding your client account’s public key to the NAS account’s authorized_keys file. Generate a key pair with ssh-keygen only if you don’t have one already. The NAS’s hostname is set to ‘syno’ in /etc/hosts.

client$ ssh-keygen
client$ scp ~/.ssh/id_rsa.pub syno:
client$ ssh syno
nas$ cat id_rsa.pub >> ~/.ssh/authorized_keys
nas$ rm id_rsa.pub

While you should have been prompted for your NAS account’s password the first time, with the key now ‘authorized’, an attempt to ssh syno should provide password-less access.

Once we’re set up with password-less access to remote storage, we’re ready to actually push an empty repository. To do that, you need to create the repository locally. I’ll create a repository called ‘own’:

client$ mkdir -p git/own && cd git/own
client$ git init

Now that we have an empty repository, we need to clone a ‘bare’ representation of it:

client$ cd ..
client$ git clone --bare own own.git

Before pushing this repository to the NAS, and to save on typing when cloning and pushing, I set up a server-side symbolic link pointing to the ‘storage root’ where all the git remote repositories will be added. This is a trivial operation that impacts only the account for which SSH access has been set up. E.g. in my case:

client$ ssh syno
nas$ ln -s /volume1/depot git

Now we can copy this repository to the server using standard SSH vernacular:

client$ scp -r own.git syno:git/

That’s it. Now you can remove the local own and own.git and fetch the server one. Presuming you have git installed and configured in your VM images, cloning your repository is as simple as:

client$ git clone syno:git/own.git
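From here, the day-to-day flow is standard git. A small sketch, using a hypothetical README.md as a stand-in for real content:

client$ cd own
client$ echo "own repository" > README.md
client$ git add README.md
client$ git commit -m "Add README"
client$ git push origin master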

When working with virtual machine images in a development or lab capacity, it’s recommended to snapshot base images with SSH and git not only installed but also configured. It can then be really efficient to spin up new images or revert to something that was last known to be working without having to go through the setup steps every time. Also, with a base configuration as a snapshot, you avoid polluting your NAS-side authorized_keys file with a high number of distinct public keys, some of which may belong to long-ago discarded VM images.

Reboot

This site has been dormant for years while I’ve been pursuing traditional, corporate ventures. Its previous incarnation was, as was relevant at the time, clearly slanted towards business intelligence consulting.

While consulting itself was a means to an end, the end itself did not signify any termination of the pursuit of making sense of data. Instead, you could say that with everyday distractions having moved off to the side, the focus on creating designs that are ‘right’, as opposed to the bare minimum, is all of a sudden within the realm of the possible and eminently achievable.

‘Right’ in this sense is right from a business-strategic point of view; it’s right presuming the majority of other decisions in an organization are made the right way, too. Just like ‘good enough’ is open to interpretation and as a result has often received a bad rap, ‘right’ when spoken through the mouth of an unabashed software engineer is easily and often construed as overly elaborate, complex and, what’s worse, out of touch with business objectives.

Getting back to actually doing what’s right: this site will be a collection of observations and interactions, detailing the thought, rationale and very real obstacles and decision points one is faced with when building analytical systems from the ground up.

While not everybody needs to, or indeed should, know everything, it’s my experience that projects, and in some cases entire companies, fail by virtue of people’s information isolation levels being too high; the silos are too entrenched for a decision-maker to get an unbiased perspective in order to effectively manage the entity in question.

To combat this modern-day malady, I’m advocating an approach that requires at least a few key people to have a seemingly impressive insight into a substantial portion of the domain. In the realm of analytical systems, that means that a hypothetical ‘star resource’ (please, let’s find a better moniker for this person) would keenly understand the motivation and business drivers behind why people are even talking about putting ‘something’ in place to drive operations, provide insight or whatever it is that needs to be done. In addition, however, said person is capable of translating these business problems (if that’s what they are) into a technical domain and can there articulate the problems and potential solutions in the relevant vernacular.

This means that the business-savvy individual understands analytical software systems from the infrastructure level to the applications. E.g. they are familiar with Linux as a deployment architecture on a deep level; they are very fluent in database technologies and query constructs, both as they relate to traditional RDBMSs and to those in the NoSQL space; they understand messaging infrastructure, integrate data from disparate sources and can tell when it works properly and when it doesn’t. They understand mathematical and statistical modeling and know the relevant languages such as R and Python well enough to not only prototype model implementations, but in fact directly contribute to the production deployment.

To round it all off, a solid understanding of, and rather sincere respect for, project management as a discipline is a significant asset that will keep progress up.

Subsequent posts will show how well (or not) this skill set is articulated using real analytical landscape patterns.