How to install the Scala Spark (Apache Toree) Jupyter kernel with GeoMesa support
This page covers using GeoMesa 1.2.x with an older version of Jupyter. Updated documentation lives at https://www.geomesa.org/documentation/user/spark/jupyter.html or elsewhere on https://www.geomesa.org/.
Install Jupyter
Assumptions:
- Spark is installed and the environment variable SPARK_HOME is set. See general instructions here.
- Python 2.7 is installed with pip available (instructions).
- Either compile GeoMesa for Scala 2.10 (instructions) or download a prebuilt GeoMesa compute 'fat jar'.
Note: On some systems, in particular CentOS 6, Python 2.7 is not the default.
On Debian/Ubuntu:
sudo apt-get install build-essential python-dev
On RHEL/CentOS:
sudo yum groupinstall 'Development Tools'
sudo yum install python-devel
sudo -H pip install --upgrade jupyter
Install and configure Toree Jupyter kernel
Optionally, you can build and install Toree yourself.
sudo -H pip install --upgrade --pre toree
In the next block, we'll use the geomesa-compute jar from the third assumption above. You can download it or build it yourself.
export GEOMESA_LIB=${GEOMESA_HOME}/lib
${GEOMESA_HOME}/bin/install-jai.sh
jupyter toree install \
    --replace \
    --user \
    --kernel_name="Spark GeoMesa" \
    --spark_home=${SPARK_HOME} \
    --spark_opts="--master yarn --jars file://${GEOMESA_LIB}/common/jai_core-1.1.3.jar,file://${GEOMESA_LIB}/common/jai_codec-1.1.3.jar,file://${GEOMESA_LIB}/common/jai_imageio-1.1.jar,file://${GEOMESA_SRC}/geomesa-compute_2.10-1.2.5-shaded.jar"
The exact name of the GeoMesa compute jar will likely vary if you built it yourself; these instructions assume the linked jar is used.
Note also that the kernel name can be changed with each Toree install, which can be used to version your kernels. For example, naming one kernel "Spark GeoMesa 1.2.4" and another "Spark GeoMesa 1.2.5" allows you to switch between the two kernels in Jupyter.
Jupyter can perform syntax highlighting of your Scala code, but you may need to change the default language spec set by Toree in the kernel.json config file, located in either /usr/local/share/jupyter/kernels/apache_toree_scala/ or ~/.local/share/jupyter/kernels/apache_toree_scala. If syntax highlighting isn't working, the "language" entry may need to be set to "scala211" rather than "scala". The contents of kernel.json will then look something like:
{
  "argv": [
    "/usr/local/share/jupyter/kernels/apache_toree_scala/bin/run.sh",
    "--profile",
    "{connection_file}"
  ],
  "language": "scala211",
  "env": {
    "PYTHON_EXEC": "python",
    "__TOREE_OPTS__": "",
    "PYTHONPATH": "/opt/spark-1.6.1-bin-hadoop2.6//python:/opt/spark-1.6.1-bin-hadoop2.6//python/lib/py4j-0.9-src.zip",
    "DEFAULT_INTERPRETER": "Scala",
    "SPARK_HOME": "/opt/spark-1.6.1-bin-hadoop2.6/",
    "__TOREE_SPARK_OPTS__": ""
  },
  "display_name": "Apache Toree - Scala"
}
Installing Vegas for graphing
To download the JARs for Vegas:
$ curl -L -O 'http://search.maven.org/remotecontent?filepath=org/apache/ivy/ivy/2.3.0/ivy-2.3.0.jar'
$ java -jar ivy-2.3.0.jar -dependency org.vegas-viz vegas-spark_2.11 0.3.6 -retrieve "lib/[artifact]-[revision](-[classifier]).[ext]"
$ sudo cp lib/* $SPARK_HOME/jars
This places the Vegas JARs in the $SPARK_HOME/jars directory. Then, within the Jupyter notebook, add the following imports:
import vegas._
import vegas.render.HTMLRenderer._
import vegas.sparkExt._
Configure Jupyter
Prepare the notebook server for public use: set a password and bind to the appropriate IP address.
For scripted installs, you can generate the password hash using this one-liner.
password=$((python -c "from notebook.auth import passwd; exit(passwd(\"yourpassword\"))") 2>&1)
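The resulting hash can then be set in the notebook server's configuration. A minimal sketch of ~/.jupyter/jupyter_notebook_config.py (create the file with `jupyter notebook --generate-config`; the bind address, port, and hash below are placeholders, not values from this guide):

```python
# ~/.jupyter/jupyter_notebook_config.py
# NOTE: all values below are placeholders -- adjust for your deployment.
c.NotebookApp.ip = '0.0.0.0'            # bind address; prefer a specific interface if possible
c.NotebookApp.port = 8888               # port the notebook server listens on
c.NotebookApp.password = u'sha1:...'    # hash produced by notebook.auth.passwd above
c.NotebookApp.open_browser = False      # headless server: don't try to launch a browser
```

With this in place, `jupyter notebook` will prompt browsers for the password instead of using a token URL.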
Run Jupyter
The notebook server can be launched from the command line. Long-lived processes should be hosted in screen, systemd, or supervisord.
jupyter notebook
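As an illustration of the systemd option, a minimal unit file sketch (the service name, user, working directory, and jupyter binary path are all assumptions; adapt them to your install):

```
# /etc/systemd/system/jupyter.service (hypothetical paths and user)
[Unit]
Description=Jupyter notebook server
After=network.target

[Service]
User=jupyter
WorkingDirectory=/home/jupyter/notebooks
ExecStart=/usr/local/bin/jupyter notebook
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Reload and start with `sudo systemctl daemon-reload`, then `sudo systemctl start jupyter` and `sudo systemctl enable jupyter`.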
Access GeoMesa DataStores within Jupyter Notebook
The credentials used to connect to the DataStore can be read from the config file SPARK_HOME/conf/spark-defaults.conf under the given keys.
spark.driver.host               jupyter
spark.master                    yarn
spark.serializer                org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator          org.locationtech.geomesa.compute.spark.GeoMesaSparkKryoRegistrator
...
spark.credentials.ds.username   accumuloUser
spark.credentials.ds.password   accumuloPassword
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SparkConf, SparkContext}
import org.geotools.data.{DataStoreFinder, Query}
import org.geotools.filter.text.ecql.ECQL
import org.locationtech.geomesa.accumulo.data.AccumuloDataStore
import org.locationtech.geomesa.compute.spark.GeoMesaSpark
import scala.collection.JavaConversions._

val params = Map(
  "instanceId" -> "geomesa",
  "zookeepers" -> "worker1:2181,worker2:2181,worker3:2181",
  "user"       -> sc.getConf.get("spark.credentials.ds.username"),
  "password"   -> sc.getConf.get("spark.credentials.ds.password"),
  "tableName"  -> "geomesa122.gbdx")

val ds = DataStoreFinder.getDataStore(params).asInstanceOf[AccumuloDataStore]

// For GeoMesa 1.2.4 and higher:
GeoMesaSpark.register(ds)

val filter = "BBOX(geom, -180, -90, 180, 90) AND item_date AFTER 2015-09-01T00:00:00.000Z"
val query = new Query("ObjectDetection", ECQL.toFilter(filter))
val rdd = GeoMesaSpark.rdd(new Configuration, sc, params, query)

// do something
rdd.count
The register call above registers the SimpleFeatureTypes of the provided data store directly with the Kryo registrator. The caveat, however, is that before serialization, the SimpleFeatureType encodings must be sent to the executors via a Spark broadcast and then used to create the corresponding types in each executor's registrator. The following demonstrates doing all of this.
import org.locationtech.geomesa.compute.spark.GeoMesaSparkKryoRegistrator
import org.locationtech.geomesa.utils.geotools.SimpleFeatureTypes

// Register the sfts of a given data store
GeoMesaSpark.register(dataStore)

// Broadcast sft encodings to the executors
// (sfts: the SimpleFeatureTypes registered above)
val broadcastedSfts = sc.broadcast(sfts.map { sft =>
  (sft.getTypeName, SimpleFeatureTypes.encodeType(sft))
})

// Populate the type cache on each partition
someRdd.foreachPartition { iter =>
  broadcastedSfts.value.foreach { case (name, spec) =>
    val sft = SimpleFeatureTypes.createType(name, spec)
    GeoMesaSparkKryoRegistrator.putType(sft)
  }
}
Add extensions
If JavaScript extensions such as Leaflet or Turf are desired, they can be installed through Jupyter's nbextension command.
With a directory containing the desired library's .js files, run the following commands:
jupyter nbextension install <extension-directory>
jupyter nbextension enable <extension-name>
This will place the needed files in the proper Jupyter data directory, likely /usr/local/share/jupyter/nbextensions.
Lastly, either run the following in the JavaScript console of the notebook
Jupyter.notebook.config.update({"load_extensions": {"extension-name":true}})
or update the notebook.json file in ~/.jupyter/nbconfig with the above JSON.
This will automatically load the extension when a notebook is loaded in the browser.
With the extension installed, enabled, and loaded, use RequireJS within a JavaScript cell to access its functionality. The benefit of this over a script tag is its persistence across reloads of the notebook, without having to re-run cells.
%%javascript
require(["nbextensions/extension-name"], function(l) {
  // Your JS code using the extension here
})
Troubleshooting
Ensure the following environment variables are set (and included in .bashrc). Modify paths as needed.
JAVA_HOME=/usr/lib/jvm/java-7-oracle
SCALA_HOME=/usr/share/scala
SPARK_HADOOP_VERSION=2.6.0 SPARK_YARN=true sbt/sbt assembly
SPARK_HOME=~/spark-1.6.2-bin-hadoop2.6
PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
GEOMESA_HOME=~/geomesa/geomesa-tools/target/geomesa-tools-1.2.5/
GEOMESA_LIB=${GEOMESA_HOME}/lib
GEOMESA_SRC=~/geomesa
PATH=$SCALA_HOME/bin:${GEOMESA_HOME}/bin:$PATH