Build Your Own Cassandra Cluster

All you need

  • 2N+1 Linux machines (possibly virtual), with N>=1
  • a command line
  • a sheet of paper

Install the build system

Apache Cassandra needs a Java SDK (JDK) installed. The latest supported version is Java 11. I read that support for newer Java versions is coming soon, but we stick with Java 11, which is supported today.

To run Apache Cassandra on Java 11, it must be compiled with Java 11; otherwise you will have to run it with Java 8.

In order to install Java, you might want to do it like this, possibly from a root account:

mkdir -p /opt/java /opt/shared/java
curl -LO https://download.java.net/openjdk/jdk11/ri/openjdk-11+28_linux-x64_bin.tar.gz

tar vxzf openjdk-11+28_linux-x64_bin.tar.gz
mv jdk-11 /opt/java
cat > /opt/shared/java/load-jdk-11.sh <<'EOF'
#! /bin/bash

export JAVA_HOME=/opt/java/jdk-11
export PATH=${JAVA_HOME}/bin:${PATH}
EOF

If you source the script from your ~/.bashrc like this:

source /opt/shared/java/load-jdk-11.sh

then a new shell should have the java command from JDK 11 available. If it does not, install Java 11 on your Linux machine in whatever way you are comfortable with.
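A quick way to confirm which JDK a new shell picks up is to parse the banner printed by java -version. The helper below is only a sketch: the name java_major is made up for illustration, and it assumes the banner contains the usual version "..." string.

```shell
#!/bin/bash
# Hypothetical helper: extract the major Java version from a `java -version` banner.
# Assumes the banner contains: version "<number>..." (e.g. "11", "11.0.2" or the legacy "1.8.0_292").
java_major() {
  local banner="$1" ver
  ver=$(printf '%s\n' "$banner" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
  case "$ver" in
    1.*) ver=${ver#1.} ;;          # legacy scheme: 1.8.0_292 -> 8.0_292
  esac
  printf '%s\n' "${ver%%[._-]*}"   # keep only the leading number
}

# Usage: java -version prints to stderr, hence the redirection.
if command -v java >/dev/null 2>&1; then
  echo "Java major version: $(java_major "$(java -version 2>&1)")"
fi
```

If the printed number is not 11, the load-jdk-11.sh script is probably not being sourced by your shell.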

Cassandra is built with Apache Ant. I know it sounds old school, but I presume there are good reasons behind that, possibly that the build can be done without a network connection.

On Debian and its derivatives it should not be harder to install than:

apt install ant -y

and I presume there is something similar on Red Hat and other rpm/yum based distributions:

yum install ant
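Before moving on to the build, it can help to check that every tool it relies on is on the PATH. This is a minimal sketch; missing_tools is a hypothetical name, and the list of tools is simply what the build script in this post invokes.

```shell
#!/bin/bash
# Hypothetical helper: print the name of every given command that is not found on the PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Usage: these are the commands the build script needs.
missing=$(missing_tools java ant git tar)
if [ -n "$missing" ]; then
  echo "Please install first:" $missing
fi
```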

Build Cassandra with Java 11 support enabled

I will keep this short and just share the script that I created for that:

#! /bin/bash -x

git clone https://github.com/apache/cassandra.git
cd cassandra
stabletag="cassandra-4.0.7"
git stash
git checkout ${stabletag}
gitversion=$(git rev-parse --short HEAD)
ant -Duse.jdk11=true

cd -

tar cvfz "cassandra-${gitversion}.tar.gz" cassandra

The bottom line is: clone the git repo, check out the last stable tagged version (at this point that seems to be cassandra-4.0.7) and compile it with ant and the switch:

-Duse.jdk11=true

Give the script a try; it should complete in a minute or so.

Once it is done, you should end up with these binaries:

$ ls -l cassandra/bin
total 152
-rwxr-xr-x 1 userx userx 10730 Nov  1 21:47 cassandra
-rw-r--r-- 1 userx userx  6093 Nov  1 21:47 cassandra.in.sh
-rwxr-xr-x 1 userx userx  3060 Nov  1 21:47 cqlsh
-rwxr-xr-x 1 userx userx 95397 Nov  1 21:47 cqlsh.py
-rwxr-xr-x 1 userx userx  1894 Nov  1 21:47 debug-cql
-rwxr-xr-x 1 userx userx  3491 Nov  1 21:47 nodetool
-rwxr-xr-x 1 userx userx  1770 Nov  1 21:47 sstableloader
-rwxr-xr-x 1 userx userx  1778 Nov  1 21:47 sstablescrub
-rwxr-xr-x 1 userx userx  1778 Nov  1 21:47 sstableupgrade
-rwxr-xr-x 1 userx userx  1781 Nov  1 21:47 sstableutil
-rwxr-xr-x 1 userx userx  1778 Nov  1 21:47 sstableverify
-rwxr-xr-x 1 userx userx  1175 Nov  1 21:47 stop-server
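Instead of eyeballing the listing, the same check can be scripted. The sketch below only looks for the three launchers used later in this post; check_cassandra_bins is a made-up name.

```shell
#!/bin/bash
# Hypothetical helper: verify that the launch scripts used later in this guide
# exist and are executable inside a Cassandra build tree.
check_cassandra_bins() {
  local bindir="$1/bin" name missing=0
  for name in cassandra cqlsh nodetool; do
    if [ ! -x "$bindir/$name" ]; then
      echo "missing or not executable: $bindir/$name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage, from the directory where the build script was run:
# check_cassandra_bins ./cassandra && echo "build looks complete"
```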

Installing the nodes

Once Cassandra is compiled, the full directory (or the tar.gz created above) needs to be copied to every machine you intend to use, for example into /home/userx/ on each machine if you intend to run it as user userx.

I will not spend many words to explain why you would not want it to run as a root user: do not do it.

Take the sheet of paper and note down the IP address of each of the machines that will be part of the cluster. In my example:

192.168.1.2, 192.168.1.3, 192.168.1.4
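With the addresses written down, copying the tarball around can be scripted. The sketch below only prints the scp commands rather than running them, so you can review them first. NODES, REMOTE_USER and the tarball name are assumptions taken from the example in this post.

```shell
#!/bin/bash
# Sketch: print (not run) the scp commands that would copy the tarball to every node.
# NODES, REMOTE_USER and the tarball name are assumptions for illustration.
NODES="192.168.1.2 192.168.1.3 192.168.1.4"
REMOTE_USER="userx"

print_copy_commands() {
  local tarball="$1" ip
  for ip in $NODES; do
    echo "scp $tarball ${REMOTE_USER}@${ip}:/home/${REMOTE_USER}/"
  done
}

# Replace the placeholder with the name of the tarball produced by the build script.
print_copy_commands "cassandra-<gitversion>.tar.gz"
```

Once the printed commands look right, pipe the output to bash to actually run them.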

All the configuration is done in the file:

./cassandra/conf/cassandra.yaml

on every one of the nodes.

The configurations needed are the following:

  • A name for the service in: cluster_name
  • The IP of the machine in: rpc_address and listen_address
  • The list of ip:port for all the machines in the cluster in: seeds. In the example above it will look like this:
seeds: "192.168.1.2:7000,192.168.1.3:7000,192.168.1.4:7000"
  • Then, since the nodes are members in a cluster, you will need to append at the end of the config file the following:
auto_bootstrap: false
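The seeds value is easy to mistype by hand, so here is a small sketch that builds it from the list of addresses. make_seeds is a made-up name, and 7000 is the port used in the example above.

```shell
#!/bin/bash
# Sketch: build the comma-separated seeds value from a port and a list of node IPs.
make_seeds() {
  local port="$1" ip out=""
  shift
  for ip in "$@"; do
    out="${out:+$out,}${ip}:${port}"   # add a comma only between entries
  done
  printf '%s\n' "$out"
}

make_seeds 7000 192.168.1.2 192.168.1.3 192.168.1.4
# prints 192.168.1.2:7000,192.168.1.3:7000,192.168.1.4:7000
```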

Getting this 100% correct the first time on every one of the 2N+1 instances is tedious and error-prone, so here is a script to relieve the pain.

Copy it to each of the nodes, together with the *.tar.gz that we built, and name it ca-node-install:

#! /bin/bash -x

# The node IP is the only argument; check it first, because the settings below depend on it.
if [ $# != 1 ]; then
  echo "Usage: $0 <node-ip>"
  exit 1
fi

NODE_IP="$1"
# Directory containing this script; the tarball is expected next to it.
CURRDIR="$(cd "$(dirname "$0")" && pwd)"

SERVICE_NAME="data-service"
# Adjust to the name of the tarball you built, e.g. cassandra-<gitversion>.tar.gz
CASSANDRA_TGZ="$CURRDIR/apache-cassandra-4.0.7-bin.tar.gz"
RAM_USAGE_MB=2048
CASSANDRA_HOME=~/cassandra

CFG_SEEDS="192.168.1.2:7000,192.168.1.3:7000,192.168.1.4:7000"
CFG_LISTEN_ADDRESS=${NODE_IP}
CFG_RPC_ADDRESS=${NODE_IP}
CFG_ENDPOINT_SNITCH="GossipingPropertyFileSnitch"   # defined but not applied below
CFG_AUTHENTICATOR="PasswordAuthenticator"
#VERY BAD: change to certs
CFG_AUTO_BOOTSTRAP="false"

function install_cassandra()
{
  # Refuse to run twice: the .orig backup only exists after a previous install.
  if [ -f "$CASSANDRA_HOME/conf/cassandra.yaml.orig" ]; then
    echo "$CASSANDRA_HOME/conf/cassandra.yaml.orig exists, already installed?"
    exit 1
  fi

  mkdir -p "$CASSANDRA_HOME"
  tar vxzf "$CASSANDRA_TGZ" -C "$CASSANDRA_HOME" --strip-components=1
  # \$JVM_OPTS is escaped so that it is expanded when cassandra-env.sh runs, not now.
  echo "JVM_OPTS=\"\$JVM_OPTS -Xms${RAM_USAGE_MB}M\"" >> "$CASSANDRA_HOME/conf/cassandra-env.sh"

  cp "$CASSANDRA_HOME/conf/cassandra.yaml" "$CASSANDRA_HOME/conf/cassandra.yaml.orig"
  # Strip comment lines and blank lines to get a compact template.
  sed -e 's/^\s*#.*//g' "$CASSANDRA_HOME/conf/cassandra.yaml.orig" | grep -E '\S' > "$CASSANDRA_HOME/conf/cassandra.yaml.template"

  # Fill in the per-node and per-cluster values.
  sed -e "s/cluster_name:.*/cluster_name: '$SERVICE_NAME'/" \
      -e "s/seeds:.*/seeds: \"$CFG_SEEDS\"/" \
      -e "s/listen_address:.*/listen_address: $CFG_LISTEN_ADDRESS/" \
      -e "s/rpc_address:.*/rpc_address: $CFG_RPC_ADDRESS/" \
      -e "s/authenticator:.*/authenticator: $CFG_AUTHENTICATOR/" \
      "$CASSANDRA_HOME/conf/cassandra.yaml.template" > "$CASSANDRA_HOME/conf/cassandra.yaml"
  echo "auto_bootstrap: $CFG_AUTO_BOOTSTRAP" >> "$CASSANDRA_HOME/conf/cassandra.yaml"
}

install_cassandra
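After running the install script on a node, it is worth looking at what actually ended up in the configuration. The sketch below prints only the settings the script is meant to touch; show_cluster_settings is a hypothetical name, and the path in the usage line assumes the default CASSANDRA_HOME from the script.

```shell
#!/bin/bash
# Sketch: print the settings the install script is expected to have changed.
# Leading spaces and list dashes are allowed, so the nested "- seeds:" line matches too.
show_cluster_settings() {
  local yaml="$1"
  grep -E '^[[:space:]-]*(cluster_name|seeds|listen_address|rpc_address|authenticator|auto_bootstrap):' "$yaml"
}

# Usage:
# show_cluster_settings ~/cassandra/conf/cassandra.yaml
```

On a correctly configured node, listen_address and rpc_address should both show that node's own IP.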

Please note: this is only to get you off the ground. Before you start adding real data to the cluster, you will need to think about proper security with TLS, possibly mTLS.

But before you start fighting with the CA and certificates, you have a cluster that you can use straight away in your LAN.

To get the ball rolling, check again, in the shell of every machine that will run the cluster, that Java 11 is available.

Once that passes, issue the following command on each of the machines:

./cassandra/bin/cassandra

Once the 3 nodes have started to gossip with each other, you can connect to one of them with:

./cassandra/bin/cqlsh 192.168.1.2 -u cassandra -p cassandra
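To see whether all nodes have joined, nodetool status lists one line per node, prefixed with UN when the node is Up and Normal. A minimal sketch for counting those lines, where count_up_nodes is a made-up name:

```shell
#!/bin/bash
# Sketch: count nodes reported as Up/Normal ("UN") in `nodetool status` output read from stdin.
count_up_nodes() {
  grep -c '^UN[[:space:]]' || true   # grep -c still prints 0 when nothing matches
}

# Usage:
# ./cassandra/bin/nodetool status | count_up_nodes
```

For the three-node example in this post, the count should reach 3 once every node has started.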

