Generic JDBC Queries on EMR Zeppelin

The EMR (Elastic Map Reduce) service on Amazon has some nice packages that come pre-installed, and one of them is Apache Zeppelin, which is a Jupyter Notebook interface for Spark. Zeppelin has interpreters for spark, pyspark, spark-sql and others, but if you want to run spark-sql code on a PostgreSQL database, you need first to install the JDBC interpreter and add some extra configuration to Zeppelin. The JDBC adapter supports a wide variety of database engines, and it allows you to configure multiple database connections, which makes data exploration much easier....

April 3, 2019 · 5 min · Thiago Araujo

Fast ElasticSearch Indexing with Apache Spark on EMR (overview)

I’ve been building the data infrastructure for a project and I needed to efficiently query, merge, process and clean terabytes of structured data and then index hundreds of millions of documents on elasticsearch. The problem is that querying and joining data on a RDBMS like Postgres is very painful when you have more than low terabytes of data. You’re going to spend a huge amount of time tuning your database, reading query plans, adding indexes, sharding, and slowly moving data around until you have something decent that take hours, days, maybe weeks to run....

September 6, 2018 · 4 min · Thiago Araujo