Graph Database Cloud Deployment


In this post, we’ll cover the first few steps towards deploying a graph database in the cloud to facilitate centralized persistence and collaboration. To keep costs down (zero, in fact, if you haven’t yet exhausted your AWS Free Tier), we’ll deploy a rather scaled-down version of the Titan Graph DB that can run on a t2.micro instance. However, just because it can run on such an instance, it is by no means recommended to run a production Titan server in this manner. Proper production sizing should prevail, and this post should be seen as a rough template that can be scaled to suit bigger needs.

The key takeaways for the deployment are:

  1. Scalable storage and computation
  2. Secure, with end-user processes that are meaningful and can be enforced


The server deployment outlined herein is a very classic one in the sense that it relies on a full-fledged Linux installation as an EC2 node and as such does not leverage containers or lambdas or any other light-weight mechanisms to achieve elastic scalability.

Titan Deployment


Currently, AWS is less than stellar in that it doesn’t allow a launch profile to directly attach an instance to an existing block device. This is desirable because, with a detached EBS volume backing our database, we could very easily take advantage of Spot instances: with an AMI-based boot snapshot and a standalone EBS volume, our database backing store would survive instance termination, ready to be mounted the next time we launched a Spot instance. Even so, to prepare for this eventuality, we’ll create an EBS volume that we’ll attach to the instance.

In our case, on a RHEL 7 image, an EBS device name of /dev/sdf showed up in Linux as /dev/xvdf. While completely optional, we’ll here create a single LVM partition that will give us much flexibility later on when deciding how to divvy up the storage space.

Use fdisk to create a single, primary partition that spans the device and has partition type 8e (Linux LVM). Optionally, you can create a swap partition (type 82) as well, since your image won’t have one by default. If you do, this partition of course needs to be formatted with mkswap, activated with swapon and registered in /etc/fstab.

# fdisk /dev/xvdf

In our LVM, we want one really small (e.g. 64MB) Logical Volume (LV) for the Titan data directory, as that’s where we’ll keep our credentials graph. The second LV is the bigger one (e.g. 2GB) where the database files will be stored. The reason these are so small, even though the free tier covers up to 30GB, is that the free tier eventually runs out and you don’t want to allocate more than you need. LVM makes it easy to grow these as needed. The LV creation follows these steps, presuming /dev/xvdf1 is your LVM partition:

# pvcreate /dev/xvdf1
# vgcreate vg /dev/xvdf1
# lvcreate -L 64M -n tdata vg
# lvcreate -L 2G -n dbdata vg
# mkfs.xfs /dev/mapper/vg-tdata
# mkfs.xfs /dev/mapper/vg-dbdata
# lsblk

The last command lists the device hierarchy.


This post is not an in-depth elaboration of how to setup an OpenVPN server. There are maintained references for that. Instead, we’ll cover specific choices made when setting it up.

Specifically, we installed OpenVPN from the official RPM, put keys under /etc/openvpn/keys, i.e. the server’s key and certificate, the CA’s certificate and the DH .pem file. The configuration file is /etc/openvpn/server.conf in which we kept the 1194/udp port and protocol. For its simplicity and performance, we use a routed configuration (as opposed to bridged).

Choose a particular subnet and netmask with the server directive. E.g.
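A typical choice (an assumption here; is the OpenVPN documentation default, and any private subnet that doesn’t clash with your VPC will do) looks like this in server.conf:

```
server
```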


Your OpenVPN server will assign IP addresses from this subnet; these are the addresses your client applications, i.e. those connecting to Titan once the VPN tunnel is up and running, will be using.

From a security point of view, TLS-Auth and running under an unprivileged account are highly recommended, as we need to enable world-wide ( access to the VPN server in our AWS Security Group and as such need to arm ourselves against DoS attacks.
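As a sketch, assuming the pre-shared TLS-Auth key was generated with openvpn --genkey and stored as keys/ta.key, the relevant server.conf entries would be along these lines:

```
# HMAC-sign the TLS handshake; unsigned packets are dropped cheaply
tls-auth keys/ta.key 0
# Drop privileges after startup
user nobody
group nobody
persist-key
persist-tun
```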

Open the requisite firewall port:

# firewall-cmd --zone=public --add-port=1194/udp
# firewall-cmd --zone=public --add-port=1194/udp --permanent

Generate client certificates either directly (including the .key file that then must be transferred over a secure line) or from certificate requests (.csr files).

Once you’ve confirmed that your setup is working, the final piece is to enable the systemd service file that the OpenVPN RPM provided. The .service file provided is a template, so when you enable it, make sure that the instance name matches the configuration file, in our case server.conf. Hence:

# systemctl enable openvpn@server.service
# systemctl start openvpn@server.service


We have covered Titan deployments in previous posts and won’t be going into the same level of detail here. In sum, though, we want Titan installed under /var/lib/titan, running under a dedicated account (also named titan). We will be using the titan account to also run the console when necessary, so it’s important that it’s given a login shell (bash) and that it has a home directory. The Gremlin Console insists on writing user preferences and will throw an exception if it’s not there. Alternatively, create the titan user without a shell and home directory and add the titan group to the user account you’ll be running the Gremlin Console with.

Next, we’ll make titan use the LVs we have created. Basically:

# cd /var/lib/titan
# mkdir db
# mount /dev/mapper/vg-dbdata db
# mkdir db/berkeley
# mkdir db/es
# mv data data.old
# mkdir data
# mount /dev/mapper/vg-tdata data
# mv data.old/* data/
# rmdir data.old

Now the key Titan directories are using the EBS partitions. Update /etc/fstab to mount these on boot as e.g.

/dev/mapper/vg-dbdata /var/lib/titan/db xfs defaults,noexec 0 2
/dev/mapper/vg-tdata /var/lib/titan/data xfs defaults,noexec 0 2

Next, we’ll copy and modify the following files so that Titan will use the BerkeleyDB location we specified.

  1. Copy conf/remote.yaml and change the IP to the one Titan listens on.
  2. Copy conf/ and change the paths to /var/lib/titan/db/berkeley and /var/lib/titan/db/es. Also add the following entry at the bottom:
  3. Copy conf/gremlin-server/gremlin-server.yaml and change the IP address and the path to the .properties file from the previous step.
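Assuming the copy was named, the edited entries could look as follows; the gremlin.graph line is our assumption of the entry that needs adding so that Gremlin-Server can instantiate a Titan graph:

```
# conf/gremlin-server/ (excerpt)
```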

Now that we’re done with the file configuration, make the titan account own the install.

# cd /var/lib
# chown -R titan:titan titan

At this point it makes sense to ensure that Titan starts without errors and that as user titan, you can launch the Gremlin Console and connect to the Titan server.

Once that’s complete, it’s time for the final piece: to create a .service file so that systemd launches Titan on system boot. Start by changing to the system directory for systemd:

# cd /usr/lib/systemd/system

In there, create a file named e.g. titan.service and tailor the contents to match the name of the server configuration file created above. A minimal unit could look like this (the names reflect our setup):

# /usr/lib/systemd/system/titan.service
openvpn@server.service openvpn@server.service

ExecStart=/var/lib/titan/bin/ conf/gremlin-server/my-server.yaml



Note the dependence on the OpenVPN service launched earlier. This is required as the IP address the Titan server listens on lives on the tun0 network device created by OpenVPN, and it would not do to start Titan before the OpenVPN service has completely started. Enable and start the Titan service:

# systemctl enable titan.service
# systemctl start titan.service

As we’re also running firewalld, open up the port to allow our clients access to the Titan server:

# firewall-cmd --zone=public --add-port=8182/tcp
# firewall-cmd --zone=public --add-port=8182/tcp --permanent


Our server is now self-sustaining in the sense that its storage is organized and its services are set up. You can reboot, stop and start, and even create AMIs from it that can be launched, and it will come up properly.

In order to manage new users we need a few simple processes:

  1. Sign certificates from certificate requests. This is simple to do securely as it does not involve the transference of private keys. However, if you have opted for the TLS-Auth option (and it’s recommended that you do), that pre-shared key needs to be distributed to each client in a secure fashion.
  2. Distribute a pre-configured client.conf (or client.ovpn) configuration file to facilitate a hassle-free connection experience.
  3. Once the clients’ VPN tunnels are running, they should connect to Titan via the OpenVPN-assigned server address.
  4. If your Titan server requires authentication, define these credentials in the credentials graph and restart Titan.
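A pre-configured client file, as per item 2, could be sketched as follows; is a placeholder for your server’s public hostname, and the client.* files are the per-client key material:

```
# client.ovpn (sketch)
client
dev tun
proto udp
remote 1194
resolv-retry infinite
nobind
ca ca.crt
cert client.crt
key client.key
tls-auth ta.key 1
```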


Distributed Graph Database – Part 2


In the previous post, we deployed a Cassandra-backed Titan graph database. In this post we will continue to build on this by introducing basic means of securing access to the server. Specifically, we’re going to be defining username and password credentials as well as generating a CA-signed certificate for SSL.


Gremlin-Server out of the box supports a simple authentication scheme based on username and password. These pairs are unsurprisingly managed using a graph, albeit a special one that is not manipulated through regular Gremlin syntax but rather through a Credentials wrapper that ensures passwords are not persisted in clear text.

By virtue of the credentials being managed through a distinct graph that is only ingested into Gremlin-Server upon startup, credential changes necessitate a Gremlin-Server restart. Over the next couple of sections we will create credentials and configure Gremlin-Server to require authentication defined by said credentials graph.

Defining the Graph

We manage the credentials graph, and the distinct file in which it will be persisted, through the Gremlin-Console. The one prerequisite for the Gremlin-Console is that the tinkerpop.credentials plugin must be activated.

gremlin> :plugin use tinkerpop.credentials

Create the empty, in-memory graph and wrap it in the CredentialGraph construct:

gremlin> graph =
gremlin> creds = credentials(graph)

Add a couple of username and password pairs:

gremlin> creds.createUser('demoUser', 'bad-password')
gremlin> creds.createUser('otherUser', 'simple-password')

In principle, this is sufficient to authenticate two distinct principals and we can therefore go ahead and persist this to a file to be loaded into Gremlin-Server. We use the compact and lossless format Gryo.
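With TinkerPop 3.0, one way to write the graph out in Gryo format is via the Graph.io() API; the path data/my-creds.kryo is our choice and is referenced again when configuring the server:

```
gremlin> graph.io(IoCore.gryo()).writeGraph('data/my-creds.kryo')
```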


Even though we have successfully persisted the graph at this stage, there are a number of operations that are relevant to the maintenance of the credentials graph.

Loading the Gryo file into an empty TinkerGraph as constructed above:
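Again assuming the Graph.io() API and our chosen filename:

```
gremlin> graph =
gremlin> graph.io(IoCore.gryo()).readGraph('data/my-creds.kryo')
gremlin> creds = credentials(graph)
```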


Querying the number of username and password pairs in the graph:

gremlin> creds.countUsers()

Querying the usernames in the graph:

gremlin> g = graph.traversal()
gremlin> g.V().values('username')

Remove a username and password pair from the credentials graph:

gremlin> creds.removeUser('otherUser')

Configuring the Server

In line with the approach in the previous post, we won’t be writing many configuration files from scratch but rather copy and modify the ones that most closely resemble our requirements. On the Gremlin-Server side, our objective is to ensure the server starts with the credentials we have defined in the Gremlin-Console above.

Start by copying the credentials graph file into the server’s data directory, e.g.:

cp ../gremlin-console/data/my-creds.kryo data/

Copy and modify the server configuration file that most closely resembles our requirements, i.e. conf/gremlin-server-secure.yaml:

cp conf/gremlin-server-secure.yaml conf/my-secure-server.yaml

Open the new config file in a text editor of your choice, search for the credentialsDbLocation key and change its value to reflect the name of the credentials file; in our case: data/my-creds.kryo
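For orientation, the relevant section of the copied file looks roughly like this (class names as shipped with TinkerPop 3.0; only the credentialsDbLocation value should need changing):

```
authentication: {
  className:,
  config: {
    credentialsDb: conf/,
    credentialsDbLocation: data/my-creds.kryo}}
```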

Save, exit and start the Gremlin-Server:

$ bin/ conf/my-secure-server.yaml

Test the Credentials

The simplest test is performed using the Gremlin-Console, and we’ll start with creating the client configuration to connect. We will use conf/remote-secure.yaml as the source of inspiration and thus copy it to something we can modify, e.g.:

cp conf/remote-secure.yaml conf/my-secure-remote.yaml

Edit the new configuration file to update the username and password keys to reflect the values defined in the credentials graph above. Save and exit.

Start the Gremlin-Console and perform a connection test. Note that simply connecting is insufficient to test the credentials; Gremlin-Console won’t attempt to log on until you ask it to do something involving resources on the Gremlin-Server:

gremlin> :remote connect tinkerpop.server conf/my-secure-remote.yaml
gremlin> :> g.V().values('name')

Other than a warning that a self-signed certificate is not suitable for production, the test should return the expected results without errors.

SSL Certificate

While it’s encouraging to have authentication correctly set up, passwords are transmitted in clear text between client and server unless SSL is configured, and the SSL setup is currently reporting that, by virtue of using a self-signed certificate, it’s not in a production-worthy state. That the certificate is self-signed means it was not issued by a trusted authority that would require proof of authenticity; hence, while the certificate can be used for encrypting a channel, as a client you have no way of knowing that the party on the other side is actually who you think it is.


Without going into PKI (Public Key Infrastructure) considerations in great depth, we can intuitively infer that we need a trusted entity to sign certificates, so that we can reason: if we trust A, and A says that B is B, we can trust that B is B. A in this simple logic is a Certificate Authority, and while many such true authorities exist on the Internet, we don’t need one for our simple purpose here of illustrating micro-scale trust.

In terms of PKI to enable this scenario we’re going to be using the OpenSSL-backed easy-rsa. For those on RPM-based distributions, add the epel-release repository and install the easy-rsa package through yum. Before starting in earnest, copy the entire easy-rsa directory away from its /usr/share location to protect your vars update from being overwritten when easy-rsa is updated.

Certificate Generation

Readers unfamiliar with easy-rsa are encouraged to at least get a basic understanding of this simple yet highly useful wrapper that raises the abstraction level for PKI-relevant operations on top of OpenSSL.

It is assumed that the vars file has been updated and sourced into the current shell before the commands below are executed. Start by generating the key-pair etc. for the Certificate Authority itself:
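With easy-rsa 2.x, the CA generation is typically along these lines (run from the copied easy-rsa directory):

```
$ source ./vars
$ ./clean-all   # wipes the keys directory; first run only
$ ./build-ca
```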


The CA generated is capable of understanding certificate requests, but since we’re on the same host as the CA, we can use a convenience function that bypasses the explicit generation and approval of the certificate request and goes on directly to generating the certificate itself.

./build-key-server gremlincert

After answers to the questions have been provided, including an empty password, this command outputs the key, certificate request and certificate files into the keys directory. The certificate is stored in keys/gremlincert.crt.


After copying the gremlincert.key and gremlincert.crt files to the data area of the Gremlin-Server directory structure, edit the ssl area of the my-secure-server.yaml file to add the following after the opening curly brace and before the enabled: true:

keyFile: data/gremlincert.key,
keyCertChainFile: data/gremlincert.crt,

Gremlin-Server should now start cleanly using the certificate provided.

Distributed Graph DB – Part 1


While traditional relational databases remain an integral part of the solution space for information storage and retrieval, there can be no denying that NoSQL options are presenting viable alternatives in spaces outside the RDBMS sweet-spot. One such space where relational storage and query is suboptimal is in the handling of graph-based structures. In previous posts, we’ve looked very briefly at the currently most widely adopted graph DB, Neo4j. Neo4j is a robust, commercially maintained and enterprise-worthy graph database. Neo4j exposes a query language named Cypher that allows query developers to author queries declaratively.

Cypher is relatively easy to learn for those with a database background, however it suffers from one potentially serious shortcoming; Cypher is Neo4j-centric and hence any application that relies on a graph DB backend will be tightly coupled with Neo4j. We will instead in this post be looking at Tinkerpop3 as an abstraction layer that allows us to decouple clients from the graph storage backend. Specifically, we will be deploying a development-grade server with a Titan graph database backed by an instance of Apache Cassandra.


The diagram below illustrates a single server node architecture, suitable for small production deployments and non-production, such as test and development scenarios.

Titan Dev Server

Titan, our graph database abstraction of choice, supports many deployment scenarios and here we’ll focus on a remote, non-embedded architecture. With the addition of a load balancer, this model will scale very well horizontally owing to Cassandra’s intrinsic clustering and Titan’s effective statelessness. We will however not address that option in this post.


Gremlin is part of the Tinkerpop suite and can be thought of as a Domain Specific Language for graph traversal (and also structure). Gremlin-Server is a wrapper that, through Web Sockets or REST APIs, receives Gremlin scripts, executes them through a pluggable driver architecture that abstracts the actual graph storage implementations, and marshals the results back to the client. Gremlin-Server was named Rexster in the no longer supported version 2 of Tinkerpop.

Gremlin doesn’t require a server, however. In principle, the Gremlin-Shell can be configured to leverage any supported graph database by loading the appropriate plug-in/driver. This is not how we’ll be using the Gremlin-Shell in this post, though; we’ll be using it to submit scripts to the Gremlin-Server, and thus the graphs will not be loaded in the Gremlin-Shell memory.


As our server image, we will be using a minimal CentOS 7 image running in VMware Fusion 7 Pro. The image is light-weight with 2 virtual cores and a modest 2GB of RAM. The guest image NIC is NAT’d.


Before starting, download the necessary components:

  1. Java 8 JVM. For our CentOS image we need a 64-bit Linux RPM.
  2. The most recent version of Cassandra. While version 3.0.0 has been released at the time of this writing, 2.2.3 is the most stable, production-worthy version, so we’re going with that.
  3. The recommended version of Titan. At the time of writing, this was Titan 1.0.0 with Hadoop 1.

To keep the sudo-ing to a minimum, we’ll install said server components under our $HOME directory. This is clearly inappropriate for production deployments, however small, but it suits our needs for a development-centric environment.



Install the JVM, systemwide:

$ sudo rpm -i jdk-8u66-linux-x64.rpm

Verify that the Java version is the correct one:

$ java -version

If it isn’t, run the following interactive utility and select the appropriate JVM version:

$ sudo alternatives --config java

It may be necessary/appropriate to declare $JAVA_HOME to point to the JVM as well.

Apache Cassandra

Because we’re only deploying a development-grade server, we will at this point not worry about any tuning or even changing the network interface on which the server will be listening. Recall from the diagram above that Titan will be running on the same host as Cassandra, so it’s entirely appropriate and even desirable for Cassandra to only accept non-networked requests. This is more secure and also better performing, by virtue of not having to involve the network stack when Cassandra and Titan communicate. Uncompress the Cassandra tar ball:

$ tar -xzf apache-cassandra-2.2.3-bin.tar.gz

cd into the top-level directory of Cassandra and start the server on the console in the background.

$ bin/cassandra -f &

Once it has settled down, you need to enable the Thrift API. This API is required by Titan when using Cassandra as the storage backend.

$ bin/nodetool enablethrift

By checking the opened ports, you should see that 9160/tcp is now active. In addition, with the following netstat parameters you’ll also see all other ports associated with the Cassandra PID:

$ sudo netstat -tulpn

Now that Cassandra is running, you can move it to the foreground; our work continues in another session.

$ fg

In a different terminal session, cd into the Cassandra directory and start the Cassandra Query Language Shell and list the keyspaces available just to confirm that you can communicate with the server and also get a baseline of the keyspaces available:

$ bin/cqlsh
cqlsh> describe keyspaces;


The Titan package is effectively Gremlin (Shell and Server) with the Titan add-on pre-installed. Nonetheless, our exact scenario is not covered by the convenience files provided, so we have to make some minor additions. Start by unpacking Titan:

$ unzip -q

Next, we want to create a server configuration where a standalone, local Cassandra is the storage backend. We do this by making a copy of the Cassandra properties file used by the Gremlin-Shell, plus an optional addition:

$ cd titan-1.0.0-hadoop1/conf/gremlin-server
$ cp ../ ./

Using your editor of choice, open the newly copied file and under the storage.hostname assignment, add the following:

# Define custom keyspace

This is not necessary in order to get the configuration up and running. It is however useful to know when co-locating many different Titan servers on the same Cassandra cluster.

Next we move on to creating the Gremlin-Server configuration, i.e. the file that we pass to Gremlin-Server, telling Gremlin-Server about our Cassandra storage particulars. Gremlin configuration is in YAML format. Here, too, we copy and modify:

$ cp gremlin-server.yaml gremlin-cassandra-server.yaml

At this point our changes are limited to making sure Gremlin-Server listens on a network adapter (or all of them, as it were, with the change below):
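In the copied YAML file, binding to all adapters means changing the host entry near the top of the file:

```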


With static IP servers, said IP address could be entered instead. We also have to point Gremlin-Server to the Cassandra properties file we just configured:

graph: conf/gremlin-server/

That is it for Gremlin-Server configuration. Go back to the Titan root and start the server:

$ cd ../..
$ bin/ conf/gremlin-server/gremlin-cassandra-server.yaml

It will spit out a staggering amount of information that we shouldn’t have to care too much about. Be sure to wait for the message that says it’s listening on port 8182 though.

Open up the firewall to enable communication on the Gremlin-Server port:

$ sudo firewall-cmd --zone=public --add-port=8182/tcp --permanent

To access the Gremlin-Server from the host machine (and beyond), don’t forget to add the port mapping in VMware’s nat.conf as outlined here.


At this stage, we have our server image running Titan with a standalone, locally accessible instance of Cassandra as storage backend. To test this, we will use the Gremlin-Shell as it’s by far the easiest. In addition, we will go back to the Cassandra Query Shell to confirm that things are happening in that instance. Start there by confirming that the new keyspace (defined in the .properties file above) is available and that empty tables have been added.

cqlsh> describe keyspaces;
cqlsh> use titan_test;
cqlsh:titan_test> describe tables;
cqlsh:titan_test> select * from graphindex;

To continue the test we’re going to run the Gremlin-Shell. Even though we have configured Gremlin-Server to listen for requests over the network, we’re going to run the Gremlin-Shell locally for the test. Note that remote Gremlin-Shells require the hostname of the server to be defined in conf/remote.yaml.

$ bin/

To connect to the Gremlin-Server inside the Gremlin-Shell, issue the following command:

gremlin> :remote connect tinkerpop.server conf/remote.yaml

We’ll create a small graph to prove that we persist it to the Cassandra system and that we can query it using Gremlin syntax. Note that :> is shorthand for :submit, i.e. it tells the Gremlin-Shell to send it to the active server connection instead of trying to execute it itself.

gremlin> script = """
cs = graph.addVertex('name', 'Cassandra Server')
ts = graph.addVertex('name', 'Titan Server')
gs = graph.addVertex('name', 'Gremlin Server')

ts.addEdge('queries', cs)
gs.addEdge('hosts', ts)
gremlin> :> @script

Traverse the graph using Gremlin to determine the name of the entity that ‘Titan Server’ queries:

gremlin> :> g.V().has('name', 'Titan Server').out('queries').values('name')

It should return ==> Cassandra Server

We can now go back to the Cassandra Query Shell to verify that something has been added to the data store. As the Titan graph schema is non-relational, we will simply see BLOB entries in the tables. The important part, and what we’re testing here, is that the graph was indeed added to the database we spun up outside the Titan install. As humans interested in our graph content, we issue Gremlin scripts against Gremlin-Server, not queries directly against Cassandra.

Read GraphML

Not all graphs will be created from scratch using structural Gremlin, however. There are a number of supported file formats for serialized graphs that we can read from the file system and load into our graph of choice. In this test, we will load the Gremlin-shipped sample graph Grateful Dead choosing the XML-based format GraphML. All samples are located in the data directory. Loading into our existing graph is as simple as:
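Assuming the Graph.io() API from TinkerPop 3.0 and the sample file shipped in the server’s data directory, submitting the load to the server looks like:

```
gremlin> :> graph.io(IoCore.graphml()).readGraph('data/grateful-dead.xml')
```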


This will load some 808 vertices from the provided file. Count the number of server-side vertices to verify the load (g is already the traversal variable):

:> g.V().count()

Lastly, to convince ourselves that the graph was indeed loaded into the Cassandra system, we query the edgestore table in the titan_test keyspace:

cqlsh:titan_test> select count(*) as edges from edgestore;

While we’ll get a warning that we’re aggregating (i.e. row count) without specifying a partition key, the point here was to illustrate that thousands of new edges have successfully been added, which concludes the tests.

Where to next?

Next up, we’ll be securing our Gremlin-Server with authentication and SSL.

Centralizing GIT for a small team


In this post, we’ll examine how a small distributed team can collaborate using a centralized, i.e. effectively hosted, GIT server. GIT is inherently a distributed version control system that offers an incredibly low barrier to entry to achieving local version control. In fact, we’ll address directly the scenario where we have a locally started code base that gets promoted to said central repository.

Although even the smallest, local-only, repositories benefit from taking advantage of GIT’s powerful and efficient branching paradigm to e.g. perform experimental or otherwise feature-specific development in a topic branch, these practices have even greater impact when dealing with centralized repositories.

In addition and related to topic branching, we’ll take a side in the sometimes rather heated exchange between those who squash and who don’t. More on that below.


The following simple schematic depicts the architecture from the originating developer’s point of view. Before centralization, said developer uses a local GIT repository whose origin points to a repository hosted on a local NAS-type device.

Target SCM Architecture


While we generally recommend using SSH for accessing remote GIT repositories, this is not always possible. For illustrative purposes, therefore, we’ll access the central repository via HTTPS. HTTPS has one rather significant drawback in that it’s stateless from a connection point of view, and in order for users not to get prompted for username and password on every operation involving the remote repo, the GIT client offers various means of caching. Because our developer is on OSX, we’ll leverage the OS-level Keychain to retain the credentials necessary to access the central repo.

Start by enabling the credential helper:

$ git config --global credential.helper osxkeychain

This simply updates the ~/.gitconfig file and after being prompted the first time for a new remote, the credentials will be stored in the keychain.

First push

Before doing anything related to our central repository, we have to confirm the remotes we already have defined.

$ git remote -v

In our developer’s case, it will show two lines detailing the URLs to fetch from and push to the remote named origin. While not special, origin is simply the default remote name, and the new, central remote must be referred to by a different name. Because the developer will start collaborating, we’ll call the new remote collab.

On the command prompt while in the repository directory structure, we add the link to the central remote like so (the URL is completely fictitious):

$ git remote add collab

Running a verbose remote listing again will yield two additional lines with the fetch and push URLs for the remote named collab.

Now for the actual push itself. Much like all GIT commands, at first blush it appears deceivingly simple. This ‘default is simple’ paradigm is one of GIT’s many strengths. Don’t be fooled though: there is real complexity under the hood, and complex scenarios will require similarly complex commands and a thorough understanding of the underpinnings in order not to make a hash of things. That being said, GIT is a robust version control system, and you’d have to take some rather extreme steps to obliterate committed (and pushed) work, to the point where it’d probably not be the result of an ignorant accident. About that push command, however…

$ git push collab master

That is it. Use the --dry-run switch to try the command out first and ensure it’s all going to work properly while stopping short of sending anything to the central repository.

Collaborative Topic Branches

GIT’s branching paradigm, through its efficiency and rich and robust operations, encourages branching to be an integral part of version control with GIT. While this is helpful for individual developers looking to compartmentalize feature development, retain a robust master, or as we shall see, be provided with an environment that permits (and perhaps encourages) frequent commits, the branch concept really shines when multiple developers are collaborating not only on the same code base but on the same features.

There are a plethora of branching strategies out there, and we don’t profess to using a superior one, but rather one that fits our purpose: a small development team that can scale reasonably without any change.

Central GIT branches

In a nutshell, the master branch (called mainline or trunk in other version management systems) is as stable as possible and does not see direct/incremental feature development by developers. Instead, feature development takes place in topic branches where one or more developers implement the feature specific to that branch. Once the feature is complete, including being caught up with master, the code is reviewed before being merged back into master. At this point, even though the schematic doesn’t show it, the topic branch can and should be deleted. Therefore, we call topic branches transient.

Topic branches

Local development branches are created most easily with the checkout -b command.

$ git checkout -b topic_branch

At this stage it is (of course) a local-only branch. We can also say it’s an untracked branch, which is to say it’s currently not tracking a remote branch, i.e. there is no upstream branch defined for this (again, local) branch.

The following command pushes said branch to the remote of choice (here, collab) and makes it into its upstream branch, all in one go:

$ git push --set-upstream collab topic_branch

Execute the following branch command to see which branches exist and which are tracking branches in which remotes:

$ git branch -vv

But wait. What about our remote named origin? You can still push to (and fetch from) it, but an operation involving a remote branch that the local branch isn't currently tracking needs to state the remote (and the branch, if named differently) explicitly. E.g.

$ git push origin topic_branch

Release Branches

As for official releases, these are tagged in Git with so-called annotated tags to enable full check-in notes and user lineage:

$ git tag -a tag_name

This opens the editor for the commit notes. -m can be used but is discouraged for official release tags. Branches don't, strictly speaking, need to be created at this point. Instead, branch at the point of the appropriate release tag when it's necessary to do so, typically to create a release-specific fix.
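Branching at a tag can be sketched in a throwaway repository (tag and branch names are illustrative; -m is used on the tag only to keep the demo non-interactive):

```shell
# scratch repo: tag a release, keep developing, then branch back at the tag
set -e
repo=$(mktemp -d); cd "$repo"; git init -q
git config user.email demo@example.com
git config user.name demo
echo v1 > file.txt; git add file.txt; git commit -qm "release 1.0"
git tag -a v1.0 -m "Release 1.0"
echo v2 > file.txt; git commit -qam "work towards 2.0"
# a fix is needed against the 1.0 release, so branch at the tag:
git checkout -q -b v1.0_fixes v1.0
```

HEAD now sits at the tagged commit, with master's newer work untouched.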


The topic of squashing has seen a lot of colorful commentary. To understand why, know that squashing refers to taking a number of commits and collapsing, i.e. squashing, them into one. This can be beneficial in local topic branches as it promotes frequent commits and thus helps mitigate the risk of lost work. Furthermore, as there is less emotional attachment to local topic commits, the commit notes need not be elaborate and thus need not go beyond a simple -m on the commit command line.

Squashing occurs once the feature is development-complete in the topic branch but before master (or whichever source branch the topic was created off of) is merged in. Squashing starts off as an interactive rebase from within the topic branch:

$ git rebase -i master

This opens the editor with all commits that have happened since the branch was synchronized with master. Replace the leading word pick with squash on all lines except the first one. Save and quit. You then get another editor where the official commit note is to be entered.
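The todo list presented in the editor looks roughly like this (hashes and messages are hypothetical); keep the first pick and change the rest:

```
pick a1b2c3d save work in progress
squash d4e5f6a more WIP
squash 9f8e7d6 feature complete, tests passing
```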

Again, only squash on local topic branches before merging with tracked branches. Do not, for example, rebase central, collaborative topic branches.

Merging with master

We mentioned that in our configuration, topic branches are transient. This means we at some point want to merge the enhancements and fixes in our topic branch into master once the former is deemed stable. When you only have one remote, this is fairly straightforward and well-documented elsewhere. With multiple remote repositories, we have some extra considerations to make.

To begin with, our remote merges take place in a distinct clone of the remote. That is, we clone our centralized repository into a distinct local repository, taking care to name the remote consistently so we don't end up confusing remotes by calling it origin in one clone and collab in another. As we called the central repo collab in our development area, we'll call it exactly the same in our merge area:

$ git clone -o collab https://...

Should you already have cloned the repository into origin, simply rename it:

$ git remote rename origin collab

Now, our objective is to track our remote’s master and topic branch, merge topic into master and delete the topic branch on the server and from all local repositories (which will number two in our case; one for development and one for merge). To make things a little more interesting we’ll name our client-side topic branch client_topic to distinguish it from the remote’s topic_branch.

Our starting point after the clone is in the tracked master branch.

$ git branch client_topic
$ git branch --set-upstream-to=collab/topic_branch client_topic
$ git checkout client_topic
$ git pull
$ git checkout master
$ git merge client_topic
$ git push collab master :topic_branch

This is not the most compact manner in which these operations can be performed, but it illustrates the main points. Particularly noteworthy is the combined push and remote branch deletion. This command can be split into a push of master followed by a second push carrying the delete request. That approach may be desirable as it gives you an opportunity to verify that the target branch looks good before deleting the topic branch.
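The split variant can be sketched end-to-end with throwaway repositories (central.git stands in for the hosted repository; all names are illustrative):

```shell
# scratch setup: a bare 'central' repo cloned under the remote name collab
set -e
work=$(mktemp -d); cd "$work"
git init -q --bare central.git
git clone -q -o collab central.git merge_area
cd merge_area
git config user.email demo@example.com
git config user.name demo
echo base > f; git add f; git commit -qm base
git branch -M master                       # pin the branch name for the demo
git push -q collab master
git checkout -q -b topic_branch
echo feature >> f; git commit -qam feature
git push -q collab topic_branch
git checkout -q master
git merge -q topic_branch
git push -q collab master                  # first push: publish merged master
# ...inspect the central master here before continuing...
git push -q collab --delete topic_branch   # second push: the delete request
```

After the second push, the central repository retains the merged master while topic_branch is gone.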

The remote branch is now gone from the list of known remote branches:

$ git branch -r

Interestingly perhaps, when inspecting tracked branches, our client_topic's tracking status is now "gone":

$ git branch -vv

Regardless, we're now well positioned to delete the local branch:

$ git branch -d client_topic

As for the development repository, the operations are fairly intuitive: pull in the changes from collab and delete the local branch.

$ git checkout master
$ git pull collab
$ git branch -d topic_branch

The topic branch has now been merged and removed.

Spark on Hadoop


Data engineers are required to have a functional understanding of a broad range of systems and tools that operate somewhere in the space between source systems and end users. Said tools and systems typically store, transform, or visualize the information (or some combination thereof) as it makes its way to the consumers. One such system has been surrounded by sufficient mystique that people ostensibly have been hired purely by virtue of claiming some unquantified degree of knowledge of it. This system is Apache Hadoop: the open-source 800-pound gorilla, knowledge of which, in addition to providing fodder for oohs and aahs at certain cocktail parties, enables distributed storage and processing of data that doesn't have to have inherent structure.

While many great packaged distributions of Hadoop, along with truly awe-inspiring enhancements, are readily available (often in Community-style editions), there is benefit in spending a little time with the base components on their own before rushing to complex integrated solutions, so as to develop an intuition for how they work in isolation. The purpose of this post is to outline the steps performed to first and foremost get a working Hadoop system. Once this foundation is in place, we'll add an in-memory processing engine to the mix: Apache Spark.

Arguably, the most sincere effort to understand the software we're using would require us to work with the source, where building it is an absolute minimum. However, in this post we won't be working with the source, as that's not congruent with our goal of getting something up and running quickly. After all, this is about being able to use Spark on Hadoop, not setting up build environments.


Speaking of environment; we’ll be setting up Spark 1.4.1 on Hadoop 2.7.1 on a CentOS 7 guest image running in VMware Fusion 7 Professional on a MacBook Pro running Yosemite.

A degree of care needs to be taken before rushing off to download the software. On the Spark Download Page, be sure to select "Pre-built for Hadoop 2.6 and later". As for Hadoop, simply pick the appropriate binary package from the Hadoop Release Page.

Depending on the objectives of the installation, it may or may not be acceptable or desirable to install under $HOME. As we in this case will be eliminating as many complexities as possible, we will disregard the fact that we can sudo and proceed with a user-local installation, which consequently also runs under the current user. Later, we'll explore what it takes to promote the installation to /usr/local.

As we'll be performing these activities in a guest virtual image, it's beneficial to have performed the SSH key exchange with the host to facilitate simple host access for file copy operations. The guests in my setup have access to the host through the host name hostos. That means I can access files on the host system very easily, without needing to provide any password, provided that the usernames on host and guest match.

guest$ scp hostos:Downloads/hadoop-2.7.1.tar.gz .

Or, perhaps even better, if you have mounted a networked directory, e.g. on a NAS which is the case on our network:

guest$ scp hostos:/Volumes/Server/Common/hadoop-2.7.1.tar.gz .

Getting started

Untar the binary distribution to a location of your choice:

guest$ tar -xf hadoop-2.7.1.tar.gz

Before you'll be able to run bin/hadoop, you need to tell it which JVM to use. In CentOS 7, the latest JVM can be found at /usr/java/latest; you can either point your JAVA_HOME environment variable to that location or edit the corresponding line in the etc/hadoop/hadoop-env.sh configuration file of your Hadoop installation. One reason to consider the latter alternative is that non-volatile configuration such as the JVM location stays within the Hadoop file structure, which tends to remove moving parts when making other changes, such as which user the server should run under.
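For the environment-variable route, a single export suffices for the current shell session (the path is the CentOS 7 location mentioned above; adjust for your distribution):

```shell
# make the JVM location visible to bin/hadoop for this shell session
export JAVA_HOME=/usr/java/latest
echo "JAVA_HOME is $JAVA_HOME"
```

Add the export to ~/.bash_profile to make it survive future sessions.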

We'll be configuring a so-called Pseudo-Distributed Cluster. The steps to configure it, from the Hadoop Quickstart Guide, are summarized below. All console operations assume the working directory is the hadoop-2.7.1 root.

Define site properties in the following two files: etc/hadoop/core-site.xml and etc/hadoop/hdfs-site.xml. Per the quickstart guide's stock pseudo-distributed settings, the former points the default file system at the local namenode and the latter sets the replication factor to one:

<!-- etc/hadoop/core-site.xml -->
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

<!-- etc/hadoop/hdfs-site.xml -->
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

In order for the distributed file system operations to work, ssh to localhost needs to work without prompting for a password. Some information about that has been provided as part of an earlier post. Once that's set up, we can proceed with formatting the file system.

guest$ bin/hdfs namenode -format

With a formatted file system, we’re ready to start the file system daemon:

guest$ sbin/start-dfs.sh

After the namenode and datanode have been started, the server is listening on several new TCP ports: 9000, plus a whole bunch in the 50,000 range, where the former is where worker requests are received and the latter provide various forms of access. The two most interesting ports from a networking point of view at this stage are 50070 (administrative webapp) and 50075 (HTTP file access).

When the CentOS 7 guest image is headless, it's very helpful to have access to those ports from the host. The steps for enabling that with VMware Fusion 7 and NAT'd CentOS 7 guests are outlined in a previous post and briefly summarized again:

Open the TCP port 50070 on the guest OS’ firewall:

guest$ sudo firewall-cmd --zone=public --add-port=50070/tcp --permanent

Add the port mapping in your host's /Library/Preferences/VMware Fusion/vmnet8/nat.conf and restart VMware Fusion. Yes, this means stopping the guest, and once Fusion has restarted, starting the DFS as per above.

Validating DFS

At this point we should have a completely functioning Hadoop single-node pseudo cluster. Again referencing the quick start guide, this is the time to confirm that map-reduce jobs can be run.

The general approach is to use the bin/hdfs command with the dfs argument to create directories or put files. E.g.:

guest$ bin/hdfs dfs -mkdir /user
guest$ bin/hdfs dfs -mkdir /user/<your user name>
guest$ bin/hdfs dfs -put etc/hadoop input

This sequence creates a three-level folder hierarchy and puts all the Hadoop configuration files in the input folder.

As for the actual tests themselves, Hadoop ships with example jars, including jars containing their source, which is really handy. Let's run a word count example, the bread and butter of Hadoop:

guest$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount input output

This command starts a Hadoop job defined in a specific example jar. For the main function of said jar, the first argument indicates which class to direct the remaining arguments to; here, the folder names input and output. When Hadoop runs these jars, the working directory is /user/<your user name>, which is why it can resolve input as the name of the folder that resides at this level. The last argument, output, is the name of the folder that the class behind wordcount will create, failing if it's already there. To re-run the example without having to create a growing number of uniquely named output folders, delete the output folder, e.g. like so:

guest$ bin/hdfs dfs -rm -R output

Run the jar through bin/hadoop without any additional arguments to get a list of all the command mappings that have been defined in the jar.

Back to the example though: if TCP port 50075 has not been opened in the firewall and NAT'd in the virtualization engine, simply download the output folder to the local file system with the DFS get command:

guest$ bin/hdfs dfs -get output

Pop into the local output folder and the word count results will be in a file named part-r-00000.

Adding Spark

Because we're interested in running Spark jobs that consume content residing in a Hadoop cluster (or pseudo-cluster, as is the case above), it's imperative that the Spark binaries are built with the proper Hadoop version in mind. For our 2.7.1 Hadoop, we'll get Spark 1.4.1 built for Hadoop 2.6 and newer.

We untar it right next to our Hadoop install, but before we do anything else, we need to tell Spark about Hadoop, and we'll want to reduce the verbosity of Hadoop's logging.

For the first one, we define environment variable HADOOP_CONF_DIR to point to Hadoop’s configuration directory. We put it in our ~/.bash_profile for future sessions:

export HADOOP_CONF_DIR=$HOME/hadoop-2.7.1/etc/hadoop

Second, change Hadoop's root log level from INFO to ERROR by updating the corresponding entry in $HADOOP_CONF_DIR/log4j.properties:

hadoop.root.logger=ERROR,console

We’re now ready to go into our Spark root and run the Python Spark shell, pyspark:

guest$ bin/pyspark

Perform a very basic hdfs:// connectivity test by verifying that Spark can read the files that were uploaded earlier into the input folder. Note, though, that the full path needs to be provided:

>>> testRDD = sc.textFile("hdfs:///user/testuser/input/")
>>> testRDD.count()

While the number (99) may differ between deployments, it should be possible to run the above commands (substituting the testuser as required) without getting any errors.

The shell, be it Python or Scala (or even a slimmed down version of R as of Spark 1.4), is useful for performing simple tests. In order to run more complex logic, use bin/spark-submit to launch the program.
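As a sketch of the spark-submit route, the shell test above could be wrapped in a small script (wc_job.py is a hypothetical file name; substitute your own user in the HDFS path):

```shell
# write a minimal PySpark job; the HDFS path mirrors the shell example above
cat > wc_job.py <<'EOF'
from pyspark import SparkContext

sc = SparkContext(appName="LineCount")
# substitute your user name in the path below
count = sc.textFile("hdfs:///user/testuser/input/").count()
print("line count: %d" % count)
sc.stop()
EOF
# then, from the Spark root: bin/spark-submit wc_job.py
```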

This concludes the very simple, basic setup of a single pseudo-cluster node of Hadoop and standalone Spark. Much can be done with this simple environment from a functional point of view, such as testing genuine Big Data applications, albeit against very modest data sets. Said applications can then be deployed unchanged onto production-grade clusters.

What about YARN?

Production cluster deployments of Hadoop and Spark will include YARN, the Hadoop version 2 resource manager. In a nutshell, Spark is, in addition to being a standalone entity, also a YARN application. YARN is so complex it's bordering on the fantastical and will be discussed in a future post.


Basic NAT for VMs


As builders of analytical landscapes, we have to pay attention to a myriad of details. The topic of this post is carved out of a small portion of a huge subject area: Networking. Specifically, we’ll be focusing on the system builder’s machine setup to facilitate access to virtualized nodes running on a developer-type machine. To illustrate this setup we’ll use a very specific configuration. Different configurations will have different strengths and weaknesses but the example below will show the basic concepts.

The overall objective will be to run the graph database Neo4j inside a virtual machine on the host machine. Said database should be accessible to those who can access the host machine, directly or via the network.

Environment Setup

The host machine in this case is a 13.3″ MacBook Pro running Yosemite with VMware Fusion 7 Professional. Neo4j will be installed on a 64bit CentOS 7 (the firewall configuration on earlier versions is completely different). To keep this topic focused, we will not cover the installation of any software, but rather stay purely on the network configuration side of things.

Network Options

Virtualization solutions such as VMware provide network access to the guest images through a software switch, or a set thereof, as is the case with VMware specifically. They are:

  • Host-only network. As the name implies, only the host itself can see and communicate with the guests. Useful, e.g., to limit the impact of untrusted images. This network is provided through a virtual switch named vmnet1.
  • NAT, or Network Address Translation, which is the dominant configuration when dealing with firewalls; images behind the firewall appear as one to those outside the firewall. The firewall in turn makes sure packets are routed back to the correct host for returning/response traffic or forwarded in the case of new in-bound traffic. VMware’s switch for this is called vmnet8.
  • Bridged networking turns the guests into 'topological siblings' of sorts with the host. This means guests route like the host, are on the same subnet, and consequently get their IP address (in the case of DHCP) from the same server as the host, etc. This isn't always exactly the case, but for the purposes of this argument, it is.

Which option to choose? Well, the requirements disqualify the host-only option, so it stands between NAT and Bridged. Unless there are very specific requirements that demand Bridged, err on the side of NAT. This keeps control with you, as you have one more layer between your guests and the network you're on. This works both ways: if you're on an otherwise untrusted network, you don't have to harden the firewall of your development guest images that need to communicate, which stops you from chasing down 'bugs' that turn out to be an overzealous firewall configuration. Going the other way, you have one more way of making sure your guest images won't embarrass you, say when your dev images cause your client's network administrator to come running, screaming that they're under attack. Lastly, on the topic of performance: why force host-to-guest and guest-to-guest traffic to route outside your host? Unless you need to sniff that traffic on, e.g., your router, it's just a drag on performance for you and those using your network.

We can now reformulate the problem to: with a NAT’d guest, how do we access it from the outside, i.e. on the host and beyond?

Three-tiered Solution

The answer to that question was already alluded to above: port forwarding. The concept is familiar to anybody who's been playing around in their router's administration app. Most, if not all, modern routers allow you to open a particular port (or range of ports) on the WAN side and connect with, i.e. forward to, a host and port (range) combination on the LAN side. Basically the exact same principle applies to the 'router' that is vmnet8. Forwarding rules are specified in the minimalist syntax required by the following file:

/Library/Preferences/VMware Fusion/vmnet8/nat.conf

Really old versions of VMware Fusion will have a slightly different path, but the idea remains the same. VMware is kind enough to include examples of such forwarding rules; you just have to be mindful of the section header for TCP or UDP. E.g., for TCP, the following (commented-out) example is provided:


# Use these with care - anyone can enter into your VM through these...
# The format and example are as follows:
#<external port number> = <VM's IP address>:<VM's port number>
#8080 =

If uncommented, and VMware restarted (not just the guest image, but Fusion proper), http://localhost:8080 in the browser on the host would be resolved and routed to the NAT'd guest with the IP address in question. In our case with Neo4j, we're interested in accessing our guest's port 7474/tcp, so we add the following line under the commented-out line above:

7474 =

A few things to note: we specified our guest by IP address. In the VMware case this doesn't necessitate a static IP for the guest; VMware's internal DHCP server for vmnet1 and vmnet8 has address affinity and will give your guest a consistent address. Note also that we didn't have to choose port 7474 on our host; it could in principle be any available port above 1024, e.g. 17474, and in many cases this is actually recommended, especially when forwarding to a guest's web server that often listens on ports 80 and 443, which very likely are already taken on your host (or by another guest on your host).
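Putting it together, a hypothetical [incomingtcp] excerpt of nat.conf, assuming the guest was handed on the vmnet8 subnet and using the 17474 host port suggested above, could read:

```
[incomingtcp]
# forward host port 17474 to the guest's Neo4j port
17474 =
```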

But even so, we're not necessarily done. Why not? Well, all we've really done is put up a sign and cleared a path, so to speak. There may still be an uncompromising firewall erected by the guest that prevents our access. In our case, since we're using CentOS 7 (again, it's important that it's 7, as 6.x behaves differently), we'll be using the rather pleasant firewall-cmd tool. Specifically, we have to do the following:

  • Determine the zone for the NIC in question (see below)
sudo firewall-cmd --get-active-zones
  • Poke a hole in the firewall for the port in the appropriate zone, in this case: public.
sudo firewall-cmd --zone=public --add-port=7474/tcp
  • Make changes permanent once confirmed to be working
sudo firewall-cmd --zone=public --add-port=7474/tcp --permanent

While following the conceptual flow of information, this is still somewhat backwards. The reason being: for us to establish the zone, we need to know the NIC, and if we knew which NIC, we would quickly realize that Neo4j out of the box does not use that NIC. Instead, very sensibly, Neo4j, unless you tell it otherwise, will only listen on the loopback interface to eliminate the risk that external marauders get access to your freshly minted graph database. So, even though we've set up the port forward and opened the firewall, we've merely configured a route to a NIC where there's no Neo4j.

Enter the third and final configuration step: make Neo4j listen on the NIC you've opened 7474/tcp access to. Thankfully, we don't have to say exactly which IP this is; by telling Neo4j to listen on, we're effectively telling it that we want a 'real' (as real as it gets in the realm of virtual servers) NIC and not loopback. To convince yourself of this, all you need to run is this:

sudo netstat -tulpn

It will show that 7474/tcp is listening on, i.e. loopback. Again, your objective is to tell Neo4j to listen on This listen setting is defined in Neo4j's file. Locate and uncomment the following line:

org.neo4j.server.webserver.address=

Save and restart Neo4j. Now we're ready to access our guest's 7474/tcp port, which in Neo4j's case we do in the browser: http://localhost:7474