Note: On some systems, in particular CentOS 6, Python 2.7 is not the default.
On Debian/Ubuntu:

    sudo apt-get install build-essential python-dev

On RHEL/CentOS:

    sudo yum groupinstall 'Development Tools'
    sudo yum install python-devel
    sudo -H pip install --upgrade jupyter
Optionally, you can build and install Toree:

    sudo -H pip install --upgrade --pre toree
In the next block, we'll use the geomesa-compute JAR from step 3 above; you can either download it or build it yourself.
    export GEOMESA_LIB=${GEOMESA_HOME}/lib
    ${GEOMESA_HOME}/bin/install-jai.sh
    jupyter toree install \
      --replace \
      --user \
      --kernel_name="Spark GeoMesa" \
      --spark_home=${SPARK_HOME} \
      --spark_opts="--master yarn --jars file://${GEOMESA_LIB}/common/jai_core-1.1.3.jar,file://${GEOMESA_LIB}/common/jai_codec-1.1.3.jar,file://${GEOMESA_LIB}/common/jai_imageio-1.1.jar,file://${GEOMESA_SRC}/geomesa-compute_2.10-1.2.5-shaded.jar"
The exact name of the GeoMesa compute JAR will likely vary if you built it yourself; the instructions here assume the linked JAR is used.
Note also that the kernel name can be changed with each Toree install, which can be used to version your kernels. For example, naming one kernel "Spark GeoMesa 1.2.4" and another "Spark GeoMesa 1.2.5" will allow you to switch between those two kernels in Jupyter.
Jupyter can perform syntax highlighting of your Scala code, but you may need to change the default language spec set by Toree in the kernel.json config file, located in either /usr/local/share/jupyter/kernels/apache_toree_scala/ or ~/.local/share/jupyter/kernels/apache_toree_scala. If syntax highlighting isn't working, the "language" entry may need to be set to "scala211" rather than "scala". The contents of kernel.json will then look something like:
{ "argv": [ "/usr/local/share/jupyter/kernels/apache_toree_scala/bin/run.sh", "--profile", "{connection_file}" ], "language": "scala211", "env": { "PYTHON_EXEC": "python", "__TOREE_OPTS__": "", "PYTHONPATH": "/opt/spark-1.6.1-bin-hadoop2.6//python:/opt/spark-1.6.1-bin-hadoop2.6//python/lib/py4j-0.9-src.zip", "DEFAULT_INTERPRETER": "Scala", "SPARK_HOME": "/opt/spark-1.6.1-bin-hadoop2.6/", "__TOREE_SPARK_OPTS__": "" }, "display_name": "Apache Toree - Scala" } |
To download the JARs for Vegas:
    $ curl -L -O 'http://search.maven.org/remotecontent?filepath=org/apache/ivy/ivy/2.3.0/ivy-2.3.0.jar'
    $ java -jar ivy-2.3.0.jar -dependency org.vegas-viz vegas-spark_2.11 0.3.6 -retrieve "lib/[artifact]-[revision](-[classifier]).[ext]"
    $ sudo cp lib/* $SPARK_HOME/jars
This should be done with caution, as Spark 2.1.0 itself is a dependency of vegas-spark_2.11. The Ivy command listed above will download about 130 JARs, including Spark's transitive dependencies such as Hadoop and ZooKeeper. If you are using a different version of Spark, or have a Spark distribution bundled with a particular version of Hadoop, blindly following the instructions above will leave multiple versions of the same JARs in the Spark jars directory.
Then within the Jupyter notebook, add the following imports:
    import vegas._
    import vegas.render.HTMLRenderer._
    import vegas.sparkExt._
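With these imports in place, charts can be rendered inline from a Spark DataFrame. The following is only an illustrative sketch, assuming Vegas 0.3.x (where vegas.sparkExt provides withDataFrame) and a hypothetical DataFrame df with columns "category" and "count":

    // Assumes a hypothetical DataFrame `df` with a string column "category"
    // and a numeric column "count"; adjust the column names to your schema.
    Vegas("Features per category").
      withDataFrame(df).
      encodeX("category", Nom).   // nominal (categorical) axis
      encodeY("count", Quant).    // quantitative axis
      mark(Bar).
      show

The HTMLRenderer import above is what allows the resulting chart to display directly in the notebook cell output.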
Prepare the notebook server for public use: set a password and bind it to a public IP.
For scripted installs, you can generate the hashed password using this one-liner:

    password=$( (python -c "from notebook.auth import passwd; exit(passwd(\"yourpassword\"))") 2>&1 )
The notebook server can be launched from the command line. Long-lived processes should be hosted in screen, systemd, or supervisord.

    jupyter notebook
The credentials used to connect to the DataStore can be parsed from the config file $SPARK_HOME/conf/spark-defaults.conf under the given keys:
    spark.driver.host                jupyter
    spark.master                     yarn
    spark.serializer                 org.apache.spark.serializer.KryoSerializer
    spark.kryo.registrator           org.locationtech.geomesa.compute.spark.GeoMesaSparkKryoRegistrator
    ...
    spark.credentials.ds.username    accumuloUser
    spark.credentials.ds.password    accumuloPassword
    import org.apache.hadoop.conf.Configuration
    import org.apache.spark.{SparkConf, SparkContext}
    import org.geotools.data.{DataStoreFinder, Query}
    import org.geotools.filter.text.ecql.ECQL
    import org.locationtech.geomesa.accumulo.data.AccumuloDataStore
    import org.locationtech.geomesa.compute.spark.GeoMesaSpark

    import scala.collection.JavaConversions._

    val params = Map(
      "instanceId" -> "geomesa",
      "zookeepers" -> "worker1:2181,worker2:2181,worker3:2181",
      "user"       -> sc.getConf.get("spark.credentials.ds.username"),
      "password"   -> sc.getConf.get("spark.credentials.ds.password"),
      "tableName"  -> "geomesa122.gbdx")

    val ds = DataStoreFinder.getDataStore(params).asInstanceOf[AccumuloDataStore]

    // For GeoMesa 1.2.4 and higher:
    GeoMesaSpark.register(ds)

    val filter = "BBOX(geom, -180, -90, 180, 90) AND item_date AFTER 2015-09-01T00:00:00.000Z"
    val query = new Query("ObjectDetection", ECQL.toFilter(filter))
    val rdd = GeoMesaSpark.rdd(new Configuration, sc, params, query)

    // do something
    rdd.count
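As one example of what "do something" might look like, the sketch below (an assumption for illustration, not part of the original instructions) counts matching features per day using the item_date attribute from the query above, assuming it is stored as a java.util.Date:

    import java.text.SimpleDateFormat
    import java.util.Date

    // Count matching features per day. Mapping to small (String, Int) pairs
    // keeps whole SimpleFeatures out of the shuffle.
    val countsByDay = rdd.map { sf =>
      val fmt = new SimpleDateFormat("yyyy-MM-dd")
      (fmt.format(sf.getAttribute("item_date").asInstanceOf[Date]), 1)
    }.reduceByKey(_ + _)

    countsByDay.collect().sortBy(_._1).foreach(println)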
The GeoMesaSpark.register call above registers the SimpleFeatureTypes of the provided data store directly into the Kryo registrator. The caveat, however, is that before serialization, the SimpleFeatureType encodings must be sent to the executors via a Spark broadcast and then used to create the corresponding types in each executor's registrator. The following demonstrates doing all of this.
    // Register the sfts of a given data store
    GeoMesaSpark.register(dataStore)

    // Broadcast sft encodings (type name -> spec) to the executors;
    // sfts is the sequence of SimpleFeatureTypes registered above
    val broadcastedSfts = sc.broadcast(sfts.map { sft => (sft.getTypeName, SimpleFeatureTypes.encodeType(sft)) })

    // Populate the type cache in each executor's registrator
    someRdd.foreachPartition { iter =>
      broadcastedSfts.value.foreach { case (name, spec) =>
        val sft = SimpleFeatureTypes.createType(name, spec)
        GeoMesaSparkKryoRegistrator.putType(sft)
      }
    }
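To see when this matters, consider an operation that shuffles whole SimpleFeatures between executors: Kryo must serialize and deserialize them using the types registered above. A minimal sketch (not from the original tutorial), assuming rdd is the RDD[SimpleFeature] obtained earlier:

    // A groupBy shuffles entire SimpleFeatures between executors, so they are
    // Kryo-serialized and deserialized using the per-executor type cache.
    val byDay = rdd.groupBy(sf => String.valueOf(sf.getAttribute("item_date")))
    byDay.mapValues(_.size).take(10).foreach(println)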
If JavaScript extensions such as Leaflet or Turf are desired, they can be installed through Jupyter's nbextension command.
With a directory containing the desired library's .js files, run the following commands:

    jupyter nbextension install <extension-directory>
    jupyter nbextension enable <extension-name>
This will place the needed files in the proper Jupyter data directory, likely /usr/local/share/jupyter/nbextensions.
Lastly, either run the following in the JavaScript console of the notebook:

    Jupyter.notebook.config.update({"load_extensions": {"extension-name": true}})

or update the notebook.json file in ~/.jupyter/nbconfig with the above JSON.
This will automatically load the extension when a notebook is loaded in the browser.
With the extension installed, enabled, and loaded, use RequireJS within a JavaScript cell to access its functionality. The benefit of this over a script tag is that it persists across reloads of the notebook without having to re-run cells.
    %%javascript
    require(["nbextensions/extension-name"], function(l) {
      // Your JS code using the extension here
    })
Ensure the following environment variables are set (and included in .bashrc). Modify paths as needed.
    JAVA_HOME=/usr/lib/jvm/java-7-oracle
    SCALA_HOME=/usr/share/scala
    SPARK_HADOOP_VERSION=2.6.0 SPARK_YARN=true sbt/sbt assembly
    SPARK_HOME=~/spark-1.6.2-bin-hadoop2.6
    PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
    GEOMESA_HOME=~/geomesa/geomesa-tools/target/geomesa-tools-1.2.5/
    GEOMESA_LIB=${GEOMESA_HOME}/lib
    GEOMESA_SRC=~/geomesa
    PATH=$SCALA_HOME/bin:${GEOMESA_HOME}/bin:$PATH