How to install the Scala Spark (Apache Toree) Jupyter kernel with GeoMesa support
This page covers using GeoMesa 1.2.x with an older version of Jupyter. Updated documentation lives at https://www.geomesa.org/documentation/user/spark/jupyter.html or elsewhere on https://www.geomesa.org/.
Install Jupyter
Assumptions:
Spark is installed and the environment variable SPARK_HOME is set. See general instructions here.
Python 2.7 is installed with pip available.
Either compile GeoMesa for Scala 2.10 or download a GeoMesa compute 'fat jar'.
Note: On some systems, in particular CentOS 6, Python 2.7 is not the default.
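Because of that, it is worth confirming the interpreter version before installing anything. A minimal sketch of such a check; the function only inspects the version string you pass it (on Python 2.x, python --version prints to stderr, hence the redirect in the usage note):

```shell
# Return "yes" if the given version string reports Python 2.7, else "no".
is_py27() {
  case "$1" in
    "Python 2.7"*) echo yes ;;
    *) echo no ;;
  esac
}

# Usage: is_py27 "$(python --version 2>&1)"
```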
Ubuntu: Prepare for Jupyter install (apt-get example assuming python 2.7 is default)
sudo apt-get install build-essential python-dev
CentOS: Prepare for Jupyter install (yum example assuming python 2.7 is default)
sudo yum groupinstall 'Development Tools'
sudo yum install python-devel
Install Jupyter Notebook
sudo -H pip install --upgrade jupyter
Install and configure the Toree Jupyter kernel
Optionally, you can build and install Toree yourself.
Install Apache Toree
sudo -H pip install --upgrade --pre toree
In the next block, we'll use the geomesa-compute jar from step 3 above. You can download it or build it.
Configure the Apache Toree kernel
export GEOMESA_LIB=${GEOMESA_HOME}/lib
${GEOMESA_HOME}/bin/install-jai.sh
jupyter toree install \
--replace \
--user \
--kernel_name="Spark GeoMesa" \
--spark_home=${SPARK_HOME} \
--spark_opts="--master yarn --jars file://${GEOMESA_LIB}/common/jai_core-1.1.3.jar,file://${GEOMESA_LIB}/common/jai_codec-1.1.3.jar,file://${GEOMESA_LIB}/common/jai_imageio-1.1.jar,file://${GEOMESA_SRC}/geomesa-compute_2.10-1.2.5-shaded.jar"
The exact name of the GeoMesa compute jar will likely vary if you built it yourself; these instructions assume the linked jar is used.
Note also that the kernel name can be changed between Toree installs; this can be used to version your kernels. For example, naming one kernel "Spark GeoMesa 1.2.4" and another "Spark GeoMesa 1.2.5" lets you switch between the two in Jupyter.
Jupyter can perform syntax highlighting of your Scala code, but you may need to change the default language spec set by Toree in the kernel.json config file, located in either /usr/local/share/jupyter/kernels/apache_toree_scala/ or ~/.local/share/jupyter/kernels/apache_toree_scala. If syntax highlighting isn't working, the "language" entry may need to be set to "scala211" rather than "scala". The contents of kernel.json will then look something like:
Example kernel.json
{
"argv": [
"/usr/local/share/jupyter/kernels/apache_toree_scala/bin/run.sh",
"--profile",
"{connection_file}"
],
"language": "scala211",
"env": {
"PYTHON_EXEC": "python",
"__TOREE_OPTS__": "",
"PYTHONPATH": "/opt/spark-1.6.1-bin-hadoop2.6//python:/opt/spark-1.6.1-bin-hadoop2.6//python/lib/py4j-0.9-src.zip",
"DEFAULT_INTERPRETER": "Scala",
"SPARK_HOME": "/opt/spark-1.6.1-bin-hadoop2.6/",
"__TOREE_SPARK_OPTS__": ""
},
"display_name": "Apache Toree - Scala"
}
Installing Vegas for graphing
To download the JARs for Vegas:
$ curl -L -O 'http://search.maven.org/remotecontent?filepath=org/apache/ivy/ivy/2.3.0/ivy-2.3.0.jar'
$ java -jar ivy-2.3.0.jar -dependency org.vegas-viz vegas-spark_2.11 0.3.6 -retrieve "lib/[artifact]-[revision](-[classifier]).[ext]"
$ sudo cp lib/* $SPARK_HOME/jars
This should be done with caution: Spark 2.1.0 itself is a dependency of vegas-spark_2.11. The Ivy command above will download about 130 JARs, including Spark's transitive dependencies such as Hadoop and ZooKeeper. If you are using a different version of Spark, or a Spark distribution bundled with a particular Hadoop version, blindly following these instructions will leave multiple versions of the same JARs in the Spark jars directory.
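After copying, you can spot such collisions by listing artifact names that now appear with more than one version. A rough sketch; the version-parsing pattern assumes the common artifact-version.jar naming convention and is an approximation:

```shell
# Print jar artifact names that appear with more than one version in a directory.
find_dup_jars() {
  ls "$1"/*.jar 2>/dev/null \
    | sed -E 's#.*/(.+)-[0-9][0-9a-zA-Z.]*\.jar$#\1#' \
    | sort | uniq -d
}

# Usage: find_dup_jars "$SPARK_HOME/jars"
```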
Then within the Jupyter notebook, add the following imports:
import vegas._
import vegas.render.HTMLRenderer._
import vegas.sparkExt._
Configure Jupyter
Prepare the notebook server for public use: set a password and configure which IP addresses it binds to.
For scripted installs you can generate the password using this one-liner.
Generate Jupyter Password one-liner
password=$((python -c "from notebook.auth import passwd; exit(passwd(\"yourpassword\"))") 2>&1)
Run Jupyter
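If you generated a password hash with the one-liner above, it can be written into Jupyter's configuration before the server starts, so the notebook requires the password and listens on all interfaces. A minimal sketch, assuming the default config path ~/.jupyter/jupyter_notebook_config.py; c.NotebookApp.ip and c.NotebookApp.password are standard notebook server options:

```shell
# Append password and bind settings to a Jupyter notebook config file.
# $1 = path to jupyter_notebook_config.py, $2 = sha1 hash from passwd()
write_notebook_config() {
  cat >> "$1" <<EOF
c.NotebookApp.ip = '0.0.0.0'
c.NotebookApp.password = u'$2'
EOF
}

# Usage: write_notebook_config ~/.jupyter/jupyter_notebook_config.py "$password"
```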
The notebook server can be launched from the command line. Long-lived processes should be hosted in screen, systemd, or supervisord.
Start the notebook server
jupyter notebook
Access GeoMesa DataStores within Jupyter Notebook
The credentials used to connect to the DataStore can be read from the config file $SPARK_HOME/conf/spark-defaults.conf under the given keys.
Example $SPARK_HOME/conf/spark-defaults.conf
spark.driver.host jupyter
spark.master yarn
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator org.locationtech.geomesa.compute.spark.GeoMesaSparkKryoRegistrator
...
...
spark.credentials.ds.username accumuloUser
spark.credentials.ds.password accumuloPassword
Example RDD creation
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SparkConf, SparkContext}
import org.geotools.data.{DataStoreFinder, Query}
import org.geotools.filter.text.ecql.ECQL
import org.locationtech.geomesa.accumulo.data.AccumuloDataStore
import org.locationtech.geomesa.compute.spark.GeoMesaSpark
import scala.collection.JavaConversions._
val params = Map(
"instanceId" -> "geomesa",
"zookeepers" -> "worker1:2181,worker2:2181,worker3:2181",
"user" -> sc.getConf.get("spark.credentials.ds.username"),
"password" -> sc.getConf.get("spark.credentials.ds.password"),
"tableName" -> "geomesa122.gbdx")
val ds = DataStoreFinder.getDataStore(params).asInstanceOf[AccumuloDataStore]
// For GeoMesa 1.2.4 and higher:
GeoMesaSpark.register(ds)
val filter = "BBOX(geom, -180, -90, 180, 90) AND item_date AFTER 2015-09-01T00:00:00.000Z"
val query = new Query("ObjectDetection", ECQL.toFilter(filter))
val rdd = GeoMesaSpark.rdd(new Configuration, sc, params, query)
// do something
rdd.count
The register call above registers the SimpleFeatureTypes of the provided data store directly with the Kryo registrator. The caveat, however, is that before serialization the SimpleFeatureType encodings must be sent to the executors via a Spark broadcast and then used to recreate the corresponding types in each executor's registrator. The following demonstrates doing all of this.
Example SFT Registration
// Register the sfts of a given data store
GeoMesaSpark.register(dataStore)
// Broadcast the sft encodings to the executors
val sfts = dataStore.getTypeNames.map(name => dataStore.getSchema(name))
val broadcastedSfts = sc.broadcast(sfts.map(sft => (sft.getTypeName, SimpleFeatureTypes.encodeType(sft))))
// Populate the type cache on each partition
someRdd.foreachPartition { iter =>
broadcastedSfts.value.foreach { case (name, spec) =>
val sft = SimpleFeatureTypes.createType(name, spec)
GeoMesaSparkKryoRegistrator.putType(sft)
}
}
Add extensions
If JavaScript extensions such as Leaflet or Turf are desired, they can be installed through Jupyter's nbextension command.
With a directory containing the desired library .js files, run the following commands:
Installing Extensions
jupyter nbextension install <extension-directory>
jupyter nbextension enable <extension-name>
This will place the needed files in the proper Jupyter data directory, likely /usr/local/share/jupyter/nbextensions.
Lastly, either run the following in the JavaScript console of the notebook
Enabling an extension
Jupyter.notebook.config.update({"load_extensions": {"extension-name": true}})
or update the notebook.json file in ~/.jupyter/nbconfig with the above JSON.
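For scripted setups, a minimal notebook.json enabling the extension can be generated from the shell. A sketch; the path and extension name are placeholders, and note that it overwrites any existing file rather than merging into it:

```shell
# Write a minimal notebook.json that loads the named extension.
# $1 = path to notebook.json, $2 = extension name
# Caution: overwrites the file; merge by hand if you already have settings there.
write_nbconfig() {
  printf '{\n  "load_extensions": {\n    "%s": true\n  }\n}\n' "$2" > "$1"
}

# Usage: write_nbconfig ~/.jupyter/nbconfig/notebook.json extension-name
```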
This will automatically load the extension when a notebook is loaded in the browser.
With the extension installed, enabled, and loaded, use RequireJS within a JavaScript cell to access its functionality. The benefit of this over a script tag is that it persists across reloads of the notebook without having to re-run cells.
Using an extension
%%javascript
require(["nbextensions/extension-name"], function(l) {
// Your JS code using the extension here
})
Troubleshooting
Ensure the following environment variables are set (and included in .bashrc). Modify paths as needed.
Environment Variables
JAVA_HOME=/usr/lib/jvm/java-7-oracle
SCALA_HOME=/usr/share/scala
SPARK_HADOOP_VERSION=2.6.0
SPARK_YARN=true
SPARK_HOME=~/spark-1.6.2-bin-hadoop2.6
PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
GEOMESA_HOME=~/geomesa/geomesa-tools/target/geomesa-tools-1.2.5/
GEOMESA_LIB=${GEOMESA_HOME}/lib
GEOMESA_SRC=~/geomesa
PATH=$SCALA_HOME/bin:${GEOMESA_HOME}/bin:$PATH
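A missing or empty variable is a common cause of kernel startup failures, so it helps to check them all in one pass rather than chasing individual errors. A minimal sketch:

```shell
# Print any of the named environment variables that are unset or empty.
# Returns non-zero if anything is missing.
check_vars() {
  missing=0
  for v in "$@"; do
    eval "val=\${$v:-}"
    if [ -z "$val" ]; then
      echo "missing: $v"
      missing=1
    fi
  done
  return $missing
}

# Usage: check_vars JAVA_HOME SCALA_HOME SPARK_HOME GEOMESA_HOME GEOMESA_LIB
```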