The EMR (Elastic Map Reduce) service on Amazon has some nice packages that come pre-installed, and one of them is Apache Zeppelin, which is a Jupyter Notebook interface for Spark.
Zeppelin has interpreters for
spark-sql and others,
but if you want to run
spark-sql code on a PostgreSQL database, you need first
to install the JDBC interpreter and add some extra configuration to Zeppelin.
The JDBC adapter supports a wide variety of database engines, and it allows you to configure multiple database connections, which makes data exploration much easier.
The generic JDBC interpreter supports these databases:
- Apache Hive
In the next section, I’ll explain how to install and configure the JDBC interpreter on Apache Zeppelin.
Installing the JDBC Interpreter on Zeppelin
When you’re creating your EMR cluster on AWS, on the
step, make sure to include these applications:
Zeppelin 0.8.1, Spark 2.4.0.
You can also go to
Advanced Options and check all the applications you need.
Also, make sure to add an EC2 key pair so you can connect to it with ssh.
Then, after your cluster is running, connect to it via ssh:
$ ssh -i [path-to-your-ec2-keypair].pem hadoop@[some-address].compute.amazonaws.com
Then run this command to install the jdbc interpreter:
$ sudo /usr/lib/zeppelin/bin/install-interpreter.sh --name jdbc > Interpreter jdbc installed under /usr/lib/zeppelin/interpreter/jdbc.
Restart Zeppelin by running these commands:
$ sudo stop zeppelin > zeppelin stop/waiting $ sudo start zeppelin > zeppelin start/running, process 24434
PROTIP: this step can also be run as a
for convenience. EMR allows you to define a bash script that is run when
creating your EMR cluster and it will be executed when each node is started,
once on the master node and once on all the slave nodes. You can define
bootstrap actions before creating the EMR cluster.
Now, in order to run a query, you need to configure the Zeppelin interpreter. To access Zeppelin notebook, first you need to open an ssh tunnel to your EMR cluster and configure a proxy that will safely access the cluster through the ssh tunnel.
Connecting to EMR with FoxyProxy 6.x
In order to access the web tools on your EMR cluster, you need to configure the
Web Connection. Go to
Amazon EMR >
Clusters > click on the cluster you just
created, then click on
Enable Web Connection. This will explain how to set up
an ssh tunnel and configure foxyproxy.
To set up the tunnel on port 8157, run:
ssh -i ~/emr-key.pem -ND 8157 hadoop@[master-public-dns-of-your-cluster]
Then you need to configure a proxy management tool. The tutorial description on
amazon explains how to set up foxyproxy, but this is not up-to-date because
foxyproxy doesn’t support XML configurations anymore,
so you’re gonna need to use a json config. To do that, copy this configuration
and save it as
Then, follow these steps:
- Install the FoxyProxy add-on, then click on the FoxyProxy icon on the top right corner of firefox
- click on Import
- On the first section
Import Settings from FoxyProxy 6.x (current version), click on Browse
- You’re gonna see a new proxy configuration on the foxyproxy list called EMR SOCKS Proxy.
- Enable the proxy by clicking on the foxyproxy icon on the top right corner
of your browser and select
Use proxy EMR SOCKS Proxy for all URLs (ignore patterns)
Now you should be able to access any of the web application interfaces running on your EMR cluster, like Zeppelin, Ganglia, and YARN. You can see a list of them here.
To access Zeppelin, go to
Configure Database Connections
- On Zeppelin, go to Interpreters
- Click on
+ Createto configure a new Interpreter
- Give it a name (like
jdbc) to be used on your notes as
- Choose JDBC as the interpreter group
- Configure the connection properties, like
default.driver(if needed) to access your database.
The interesting thing about the JDBC interpreter is that you can define multiple connections to any different type of database by using a prefix instead of the default one and defining an appropriate driver, as described here. As an example, I wanted to connect to three different PostgreSQL databases, so I configured my connections as following:
db1.url -> jdbc:postgresql://db1-url:5432/db-name1 (string) db1.password -> ******* (password) db1.user -> db1user (string) db1.driver -> org.postgresql.Driver (string)
db2.url -> jdbc:postgresql://db2-url:5432/db-name2 (string) db2.password -> ******* (password) db2.user -> db2user (string) db2.driver -> org.postgresql.Driver (string)
And so on. The JDBC interpreter uses the driver and the connection details to access the database in a generic way, so it’s pretty easy to connect to any kind of database and use the same querying interface. In order to use a different connection on your notebook, you define the interpreter and add a prefix, like so:
%jdbc(db1) SHOW TABLES; %jdbc(db2) SHOW TABLES;
To use the generic JDBC interpreter, create a new zeppelin note. Then click on
Settings (Interpreter Binding) and enable/bind the JDBC interpreter to your note.
You just need to enable
Then, to use the interpreter you just created, create a new paragraph, and define the interpreter and the connection prefix you want to use:
%jdbc(prefix) SELECT * FROM tableA;
In my case:
%jdbc(db1) SELECT * FROM client;
Now you have a simple way to connect and explore different databases in a very simple way. Hope you find it useful!