How to install the Scala Spark (Apache Toree) Jupyter kernel with GeoMesa support

This page covers using GeoMesa 1.2.x with an older version of Jupyter.  Updated documentation lives at https://www.geomesa.org/documentation/user/spark/jupyter.html or elsewhere on https://www.geomesa.org/.

Install Jupyter

Assumptions:

  1. Spark is installed and the environment variable SPARK_HOME is set. See general instructions here.
  2. Python 2.7 is installed with pip available. Instructions
  3. Either compile GeoMesa for Scala 2.10 (instructions) or download a GeoMesa compute 'fat jar' here:
    1. version 1.2.2.
    2. version 1.2.3.
    3. version 1.2.4.
    4. version 1.2.5.
    5. version 1.2.6.

Note: On some systems, in particular CentOS 6, Python 2.7 is not the default.

Ubuntu: Prepare for Jupyter install (apt-get example assuming python 2.7 is default)
sudo apt-get install build-essential python-dev
CentOS: Prepare for Jupyter install (yum example assuming python 2.7 is default)
sudo yum groupinstall 'Development Tools'
sudo yum install python-devel
Install Jupyter Notebook
sudo -H pip install --upgrade jupyter 

Install and configure Toree Jupyter kernel

Optionally, you can build and install Toree from source; otherwise, install the pre-release package with pip.

Install Apache Toree
sudo -H pip install --upgrade --pre toree

In the next block, we'll use the geomesa-compute jar from step 3 above.  You can download it or build it.


Configure the Apache Toree kernel
export GEOMESA_LIB=${GEOMESA_HOME}/lib
${GEOMESA_HOME}/bin/install-jai.sh      

jupyter toree install\
 --replace\
 --user\
 --kernel_name="Spark GeoMesa"\
 --spark_home=${SPARK_HOME}\
 --spark_opts="--master yarn --jars file://${GEOMESA_LIB}/common/jai_core-1.1.3.jar,file://${GEOMESA_LIB}/common/jai_codec-1.1.3.jar,file://${GEOMESA_LIB}/common/jai_imageio-1.1.jar,file://${GEOMESA_SRC}/geomesa-compute_2.10-1.2.5-shaded.jar"

The exact name of the GeoMesa compute jar will likely vary if you built it yourself; these instructions assume the linked jar is used.

Note also that the kernel name can be changed with each Toree install, which can be used to version your kernels. For example, naming one kernel "Spark GeoMesa 1.2.4" and another "Spark GeoMesa 1.2.5" lets you switch between the two kernels in Jupyter.
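
For example, a second install pointing at a different compute jar registers a second kernel alongside the first. The 1.2.4 jar name below is illustrative; keep the same jai --jars entries shown in the install command above.

Install a second, versioned kernel (illustrative)
jupyter toree install\
 --replace\
 --user\
 --kernel_name="Spark GeoMesa 1.2.4"\
 --spark_home=${SPARK_HOME}\
 --spark_opts="--master yarn --jars file://${GEOMESA_SRC}/geomesa-compute_2.10-1.2.4-shaded.jar"

Running jupyter kernelspec list afterwards should show both kernels.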

Jupyter can perform syntax highlighting of your Scala code, but you may need to change the default language spec set by Toree in the kernel.json config file, located in either /usr/local/share/jupyter/kernels/apache_toree_scala/ or ~/.local/share/jupyter/kernels/apache_toree_scala/. If syntax highlighting isn't working, the "language" entry may need to be set to "scala211" rather than "scala". The contents of kernel.json will then look something like:

Example kernel.json
{
  "argv": [
    "/usr/local/share/jupyter/kernels/apache_toree_scala/bin/run.sh",
    "--profile",
    "{connection_file}"
  ],
  "language": "scala211",
  "env": {
    "PYTHON_EXEC": "python",
    "__TOREE_OPTS__": "",
    "PYTHONPATH": "/opt/spark-1.6.1-bin-hadoop2.6//python:/opt/spark-1.6.1-bin-hadoop2.6//python/lib/py4j-0.9-src.zip",
    "DEFAULT_INTERPRETER": "Scala",
    "SPARK_HOME": "/opt/spark-1.6.1-bin-hadoop2.6/",
    "__TOREE_SPARK_OPTS__": ""
  },
  "display_name": "Apache Toree - Scala"
}

Installing Vegas for graphing

To download the JARs for Vegas:

$ curl -L -O 'http://search.maven.org/remotecontent?filepath=org/apache/ivy/ivy/2.3.0/ivy-2.3.0.jar'
$ java -jar ivy-2.3.0.jar -dependency org.vegas-viz vegas-spark_2.11 0.3.6 -retrieve "lib/[artifact]-[revision](-[classifier]).[ext]"
$ sudo cp lib/* $SPARK_HOME/jars
This should be done with caution: Spark 2.1.0 itself is a dependency of vegas-spark_2.11, and the Ivy command above will download about 130 JARs, including Spark's transitive dependencies such as Hadoop and ZooKeeper. If you are using a different version of Spark, or a Spark distribution bundled with a particular version of Hadoop, following these instructions blindly will leave multiple versions of the same JARs in the Spark jars directory.
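
One hypothetical way to spot such collisions after copying is to strip the version suffixes from the jar names and look for artifacts that now appear more than once:

Check for duplicate jars (hypothetical sanity check)
ls $SPARK_HOME/jars | sed 's/-[0-9][0-9.]*\.jar$//' | sort | uniq -d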

Then within the Jupyter notebook, add the following imports:

import vegas._
import vegas.render.HTMLRenderer._
import vegas.sparkExt._
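
Once the imports resolve, plots are built with the Vegas DSL. The block below is a minimal sketch rather than part of the original instructions: the DataFrame contents and column names are invented for illustration, sqlContext is the SQLContext provided by the Spark kernel, and withDataFrame comes from the vegas.sparkExt import above.

Example Vegas plot (illustrative)
// Build a small DataFrame to plot; in practice this would come from your GeoMesa query results.
val detectionsPerDay = sqlContext.createDataFrame(Seq(
  ("2015-09-01", 120),
  ("2015-09-02", 151),
  ("2015-09-03", 89)
)).toDF("day", "detections")

val plot = Vegas("Detections per day").
  withDataFrame(detectionsPerDay).
  encodeX("day", Nom).
  encodeY("detections", Quant).
  mark(Bar)

// Rendering inline may also require an implicit HTML displayer for your kernel; see the Vegas docs.
plot.show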

Configure Jupyter

Prepare the notebook server for public use: set a password and bind the server to an appropriate IP address.

For scripted installs you can generate the password hash using this one-liner.

Generate Jupyter Password one-liner
password=$(python -c "from notebook.auth import passwd; print(passwd('yourpassword'))")
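
For a scripted setup, the generated hash can then be written into the notebook server config along with a bind address. This is a minimal sketch assuming the $password variable from the one-liner above; the 0.0.0.0 bind address and port 8888 are placeholder values to adjust for your network.

Write the notebook server config (illustrative)
cat >> ~/.jupyter/jupyter_notebook_config.py <<EOF
c.NotebookApp.ip = '0.0.0.0'
c.NotebookApp.port = 8888
c.NotebookApp.open_browser = False
c.NotebookApp.password = u'${password}'
EOF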

Run Jupyter

The notebook server can be launched from the command line. Long-lived processes should be hosted under screen, systemd, or supervisord.

Start the notebook server
 jupyter notebook
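
For example, a detached screen session keeps the server running after you log out (the session name here is arbitrary); systemd or supervisord units work similarly.

Start the notebook server in a detached screen session
screen -dmS jupyter jupyter notebook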

Access GeoMesa DataStores within Jupyter Notebook

The credentials used to connect to the DataStore can be read from $SPARK_HOME/conf/spark-defaults.conf under the keys shown below.

Example $SPARK_HOME/conf/spark-defaults.conf
spark.driver.host                    jupyter
spark.master                         yarn
spark.serializer                     org.apache.spark.serializer.KryoSerializer             
spark.kryo.registrator               org.locationtech.geomesa.compute.spark.GeoMesaSparkKryoRegistrator
...
...
spark.credentials.ds.username        accumuloUser
spark.credentials.ds.password        accumuloPassword

Example RDD creation
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SparkConf, SparkContext}
import org.geotools.data.{DataStoreFinder, Query}
import org.geotools.filter.text.ecql.ECQL
import org.locationtech.geomesa.accumulo.data.AccumuloDataStore
import org.locationtech.geomesa.compute.spark.GeoMesaSpark
import scala.collection.JavaConversions._
 
val params = Map(
    "instanceId" -> "geomesa",
    "zookeepers" -> "worker1:2181,worker2:2181,worker3:2181",
    "user"       -> sc.getConf.get("spark.credentials.ds.username"),
    "password"   -> sc.getConf.get("spark.credentials.ds.password"),
    "tableName"  -> "geomesa122.gbdx")

val ds = DataStoreFinder.getDataStore(params).asInstanceOf[AccumuloDataStore]
 
// For GeoMesa 1.2.4 and higher:
GeoMesaSpark.register(ds)
 
val filter = "BBOX(geom, -180, -90, 180, 90) AND item_date AFTER 2015-09-01T00:00:00.000Z"
val query = new Query("ObjectDetection", ECQL.toFilter(filter))
val rdd = GeoMesaSpark.rdd(new Configuration, sc, params, query)
 
// do something
rdd.count
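
As a small illustrative follow-on (not part of the original example), the features in the RDD can be aggregated by attribute; this sketch assumes item_date, referenced in the filter above, is a Date-typed attribute.

Example per-day count (illustrative)
val perDay = rdd.map { f =>
  // format the item_date attribute as a day string and count features per day
  val date = f.getAttribute("item_date").asInstanceOf[java.util.Date]
  new java.text.SimpleDateFormat("yyyy-MM-dd").format(date) -> 1
}.reduceByKey(_ + _)

perDay.sortByKey().take(10).foreach(println)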


The above register call registers the SimpleFeatureTypes of the provided data store directly into the Kryo Registrator. The caveat, however, is that before serialization, the SimpleFeatureType encodings must be sent to the executors via a Spark Broadcast and then used to create the corresponding types in each executor's registrator. The following demonstrates doing all of this.

Example SFT Registration
import org.locationtech.geomesa.compute.spark.GeoMesaSparkKryoRegistrator
import org.locationtech.geomesa.utils.geotools.SimpleFeatureTypes

// Register the sfts of a given data store
GeoMesaSpark.register(dataStore)
// Broadcast sft encodings (type name -> spec string) to executors
val sfts = dataStore.getTypeNames.map(dataStore.getSchema)
val broadcastedSfts = sc.broadcast(sfts.map(sft => (sft.getTypeName, SimpleFeatureTypes.encodeType(sft))))
// Populate the type cache on each partition
someRdd.foreachPartition { iter =>
    broadcastedSfts.value.foreach { case (name, spec) =>
        val sft = SimpleFeatureTypes.createType(name, spec)
        GeoMesaSparkKryoRegistrator.putType(sft)
    }
}

Add extensions

If JavaScript extensions such as Leaflet or Turf are desired, they can be installed through Jupyter's nbextension command.

With a directory containing the desired library .js files, run the following commands:

Installing Extensions
jupyter nbextension install <extension-directory>
jupyter nbextension enable <extension-name>

This will place the needed files in the proper Jupyter data directory, likely /usr/local/share/jupyter/nbextensions.

Lastly, either run the following in the JavaScript console of the notebook

Enabling an extension
Jupyter.notebook.config.update({"load_extensions": {"extension-name":true}})

or update the notebook.json file in ~/.jupyter/nbconfig with the above JSON.

This will automatically load the extension when a notebook is loaded in the browser.


With the extension installed, enabled, and loaded, use RequireJS within a JavaScript cell to use its functionality. The benefit of this over a script tag is that it persists across reloads of the notebook without having to re-run cells.

Using an extension
%%javascript
require(["nbextensions/extension-name"], function(l) {
	// Your JS code using the extension here
})

Troubleshooting

Ensure the following environment variables are set (and included in .bashrc). Modify paths as needed.

Environment Variables
JAVA_HOME=/usr/lib/jvm/java-7-oracle
SCALA_HOME=/usr/share/scala
SPARK_HADOOP_VERSION=2.6.0   # only needed when building Spark from source
SPARK_YARN=true              # only needed when building Spark from source (sbt/sbt assembly)
SPARK_HOME=~/spark-1.6.2-bin-hadoop2.6
PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
GEOMESA_HOME=~/geomesa/geomesa-tools/target/geomesa-tools-1.2.5/
GEOMESA_LIB=${GEOMESA_HOME}/lib
GEOMESA_SRC=~/geomesa


PATH=$SCALA_HOME/bin:${GEOMESA_HOME}/bin:$PATH