Distributed Graph DB – Part 1

Overview

While traditional relational databases remain an integral part of the solution space for information storage and retrieval, there is no denying that NoSQL options present viable alternatives outside the RDBMS sweet spot. One such space, where relational storage and querying are suboptimal, is the handling of graph-based structures. In previous posts, we looked briefly at the currently most widely adopted graph DB, Neo4j. Neo4j is a robust, commercially maintained and enterprise-worthy graph database, and it exposes a query language named Cypher that allows query developers to author queries declaratively.

Cypher is relatively easy to learn for those with a database background, but it has one potentially serious shortcoming: Cypher is Neo4j-centric, so any application that relies on it as a graph DB backend will be tightly coupled to Neo4j. In this post we will instead look at TinkerPop 3 as an abstraction layer that decouples clients from the graph storage backend. Specifically, we will deploy a development-grade server with a Titan graph database backed by an instance of Apache Cassandra.

Architecture

The diagram below illustrates a single-node server architecture, suitable for small production deployments and for non-production scenarios such as test and development.

Titan Dev Server

Titan, our graph database abstraction of choice, supports many deployment scenarios and here we’ll focus on a remote, non-embedded architecture. With the addition of a load balancer, this model will scale very well horizontally owing to Cassandra’s intrinsic clustering and Titan’s effective statelessness. We will however not address that option in this post.

Gremlin

Gremlin is part of the TinkerPop suite and can be thought of as a Domain Specific Language for graph traversal (and also structure). Gremlin-Server is a wrapper that receives Gremlin scripts over WebSockets or a REST API, executes them through a pluggable driver architecture that abstracts the actual graph storage implementation, and marshals the results back to the client. Gremlin-Server's predecessor in the no-longer-supported version 2 of TinkerPop was named Rexster.

Gremlin doesn’t require a server, however. In principle, the Gremlin-Shell can be configured to leverage any supported graph database by loading the appropriate plug-in/driver. This is not how we’ll be using the Gremlin-Shell in this post, though; we’ll be using it to submit scripts to the Gremlin-Server, so the graphs will not be loaded into the Gremlin-Shell’s memory.

Setup

As our server image, we will be using a minimal CentOS 7 image running in VMware Fusion 7 Pro. The image is lightweight, with 2 virtual cores and a modest 2 GB of RAM. The guest image’s NIC is NAT’d.

Components

Before starting, download the necessary components:

  1. Java 8 JVM. For our CentOS image we need the 64-bit Linux RPM.
  2. The most recent version of Cassandra. While version 3.0.0 had been released at the time of this writing, 2.2.3 is the most stable, production-worthy version, so we’re going with that.
  3. The recommended version of Titan. At the time of writing, this was Titan 1.0.0 with Hadoop 1.0.

To keep the sudo-ing to a minimum, we’ll install said server components under our $HOME directory. This is clearly inappropriate for production deployments, however small, but it suits our needs for a development-centric environment.
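
As a rough sketch, assuming you want to pull the archives straight onto the server, the Cassandra and Titan packages can be fetched with wget; the URLs below are illustrative and may have moved since the time of writing, and the Oracle JDK RPM typically has to be downloaded through a browser after accepting the license:

$ wget http://archive.apache.org/dist/cassandra/2.2.3/apache-cassandra-2.2.3-bin.tar.gz
$ wget http://s3.thinkaurelius.com/downloads/titan/titan-1.0.0-hadoop1.zip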

Steps

Java

Install the JVM, systemwide:

$ sudo rpm -i jdk-8u66-linux-x64.rpm

Verify that the Java version is the correct one:

$ java -version
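
If the install went well, the output should look roughly like the following (the exact build number will vary):

java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)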

If it isn’t, run the following interactive utility and select the appropriate JVM version:

$ sudo alternatives --config java

It may also be necessary or appropriate to declare $JAVA_HOME so that it points to the installed JVM.
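
A minimal sketch of what that could look like, assuming the RPM installed the JDK under the default /usr/java/default symlink (adjust the path to match your install):

$ echo 'export JAVA_HOME=/usr/java/default' >> ~/.bash_profile
$ source ~/.bash_profile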

Apache Cassandra

Because we’re only deploying a development-grade server, we will at this point not worry about any tuning, or even about changing the network interface on which the server will be listening. Recall from the diagram above that Titan will be running on the same host as Cassandra, so it’s entirely appropriate, and even desirable, for Cassandra to accept only non-networked requests. This is more secure and also performs better, by virtue of not having to involve the network stack when Cassandra and Titan communicate. Uncompress the Cassandra tarball:

$ tar -xzf apache-cassandra-2.2.3-bin.tar.gz
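
For reference, and assuming stock 2.2.3 defaults, the relevant settings in conf/cassandra.yaml inside the unpacked directory already keep Cassandra local-only with the Thrift API disabled, which is exactly what we want here:

listen_address: localhost
rpc_address: localhost
start_rpc: false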

cd into the top-level Cassandra directory and start the server on the console; the -f flag keeps Cassandra attached to the console, and the trailing & backgrounds it in the shell so we can keep using the same session:

$ bin/cassandra -f &

Once it has settled down, you need to enable the Thrift API. This API is required by Titan when using Cassandra as the storage backend.

$ bin/nodetool enablethrift

By checking the opened ports you should see that 9160/tcp is now active. In addition with the following netstat parameters you’ll also see all other ports associated with the Cassandra PID:

$ sudo netstat -tulpn
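
The Thrift line in the output should look something like this (the PID and column spacing will differ):

tcp        0      0 127.0.0.1:9160          0.0.0.0:*               LISTEN      2831/java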

Now that Cassandra is running, you can move it back to the foreground; our work continues in another session.

$ fg

In a different terminal session, cd into the Cassandra directory, start the Cassandra Query Language Shell and list the available keyspaces, both to confirm that you can communicate with the server and to establish a baseline of the keyspaces present:

$ bin/cqlsh
cqlsh> describe keyspaces;
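
On a fresh 2.2.3 install, the baseline should be limited to the system keyspaces, roughly:

system_traces  system  system_distributed  system_auth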

Titan

The Titan package is effectively Gremlin (Shell and Server) with the Titan add-on pre-installed. Nonetheless, our exact scenario is not covered by the convenience files provided so we have to make some minor additions. Start by unpacking Titan:

$ unzip -q titan-1.0.0-hadoop1.zip

Next we want to create a server configuration where a standalone, local Cassandra is the storage backend. We do this by making a copy of the Cassandra properties file used by the Gremlin-Shell and then applying an optional addition:

$ cd titan-1.0.0-hadoop1/conf/gremlin-server
$ cp ../titan-cassandra.properties ./titan-cassandra-server.properties

Using your editor of choice, open the newly copied file and under the storage.hostname assignment, add the following:

# Define custom keyspace
storage.cassandra.keyspace=titan_test

This is not necessary in order to get the configuration up and running. It is however useful to know when co-locating many different Titan servers on the same Cassandra cluster.
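
For orientation, after the edit the titan-cassandra-server.properties file should contain settings roughly along these lines (the shipped file may carry additional cache-related entries that we leave untouched):

storage.backend=cassandrathrift
storage.hostname=127.0.0.1

# Define custom keyspace
storage.cassandra.keyspace=titan_test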

Next we move on to creating the Gremlin-Server configuration, i.e. the file that we pass to Gremlin-Server to tell it about our Cassandra storage particulars. Gremlin-Server configuration is in YAML format. Here, too, we copy and modify:

$ cp gremlin-server.yaml gremlin-cassandra-server.yaml

At this point our changes are limited to making sure Gremlin-Server listens on a network interface (or on all of them, as is the case with the change below):

host: 0.0.0.0

With static IP servers, said IP address could be entered instead. We also have to point Gremlin-Server to the Cassandra properties file we just configured:

graph: conf/gremlin-server/titan-cassandra-server.properties
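
Taken together, and assuming the rest of the copied file stays at its shipped defaults, the relevant top of gremlin-cassandra-server.yaml should look roughly like this; note that the graph entry sits inside the graphs map:

host: 0.0.0.0
port: 8182
graphs: {
  graph: conf/gremlin-server/titan-cassandra-server.properties}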

That is it for Gremlin-Server configuration. Go back to the Titan root and start the server:

$ cd ../..
$ bin/gremlin-server.sh conf/gremlin-server/titan-cassandra-server.yaml

It will spit out a staggering amount of information that we shouldn’t have to care too much about. Be sure to wait for the message that says it’s listening on port 8182 though.
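
The line to look for reads roughly as follows (timestamps and thread names omitted):

INFO  GremlinServer  - Channel started at port 8182.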

Open up the firewall to enable communication on the Gremlin-Server port:

$ sudo firewall-cmd --zone=public --add-port=8182/tcp --permanent
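
Note that with --permanent the rule is only written to the permanent configuration; reload the firewall (or additionally add the rule without --permanent) for it to take effect in the running instance:

$ sudo firewall-cmd --reload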

To access the Gremlin-Server from the host machine (and beyond), don’t forget to add the port mapping in VMware’s nat.conf as outlined here.

Test

At this stage, we have our server image running Titan with a standalone, locally accessible instance of Cassandra as storage backend. To test this, we will use the Gremlin-Shell as it’s by far the easiest. In addition, we will go back to the Cassandra Query Shell to confirm that things are happening in that instance. Start there by confirming that the new keyspace (defined in the .properties file above) is available and that empty tables have been added.

cqlsh> describe keyspaces;
cqlsh> use titan_test;
cqlsh:titan_test> describe tables;
cqlsh:titan_test> select * from graphindex;

To continue the test we’re going to run the Gremlin-Shell. Even though we have configured Gremlin-Server to listen for requests over the network, we’re going to run the Gremlin-Shell locally for the test. Note that remote Gremlin-Shells require the hostname of the server to be defined in conf/remote.yaml.
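
For reference, the shipped conf/remote.yaml should already contain entries roughly like the ones below, which is all we need for a local test; a remote shell would substitute the server’s hostname or IP for localhost:

hosts: [localhost]
port: 8182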

$ bin/gremlin.sh

To connect to the Gremlin-Server inside the Gremlin-Shell, issue the following command:

gremlin> :remote connect tinkerpop.server conf/remote.yaml

We’ll create a small graph to prove that we persist it to the Cassandra system and that we can query it using Gremlin syntax. Note that :> is shorthand for :submit, i.e. it tells the Gremlin-Shell to send it to the active server connection instead of trying to execute it itself.

gremlin> script = """
cs = graph.addVertex('name', 'Cassandra Server')
ts = graph.addVertex('name', 'Titan Server')
gs = graph.addVertex('name', 'Gremlin Server')

ts.addEdge('queries', cs)
gs.addEdge('hosts', ts)
"""
gremlin> :> @script

Traverse the graph using Gremlin to determine the name of the entity that ‘Titan Server’ queries:

gremlin> :> g.V().has('name', 'Titan Server').out('queries').values('name')

It should return ==> Cassandra Server

We can now go back to the Cassandra Query Shell to verify that something has been added to the data store. As the Titan graph schema is non-relational, we will simply see BLOB entries in the tables. The important part, and what we’re testing here, is that the graph was indeed added to the database we spun up outside the Titan install. As humans interested in the graph content, we issue Gremlin scripts against Gremlin-Server, not queries directly against Cassandra.
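
If you want to see those BLOB entries for yourself, a quick, purely illustrative peek at the edgestore table will do (the output is not meant to be human-readable):

cqlsh:titan_test> select * from edgestore limit 3;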

Read GraphML

Not all graphs will be created from scratch using structural Gremlin, however. There are a number of supported file formats for serialized graphs that we can read from the file system and load into our graph of choice. In this test, we will load the Gremlin-shipped sample graph, Grateful Dead, in the XML-based GraphML format. All samples are located in the data directory. Loading into our existing graph is as simple as:

:> graph.io(IoCore.graphml()).readGraph('data/grateful-dead.xml')

This will load some 808 vertices from the provided file. Count the number of server-side vertices to verify the load (g is already bound to the graph traversal source on the server):

:> g.V().count()
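
Assuming the three vertices from the earlier test are still in place, the count should come back in the neighbourhood of 811: the 808 loaded Grateful Dead vertices plus our three server vertices.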

Lastly, to convince ourselves that the graph was indeed loaded into the Cassandra system, we query the edgestore table in the titan_test keyspace:

cqlsh:titan_test> select count(*) as edges from edgestore;

While we’ll get a warning that we’re aggregating (i.e. row count) without specifying a partition key, the point here was to illustrate that thousands of new edges have successfully been added, which concludes the tests.

Where to next?

Next up, we’ll be securing our Gremlin-Server with authentication and SSL.
