I’ve been building the data infrastructure for a project, and I needed to efficiently query, merge, process, and clean terabytes of structured data, then index hundreds of millions of documents into Elasticsearch.

The problem is that querying and joining data on an RDBMS like Postgres is very painful once you have more than a few terabytes of data. You’re going to spend a huge amount of time tuning your database, reading query plans, adding indexes, sharding, and slowly moving data around until you have something decent that takes hours, days, maybe weeks to run. Trust me, I’ve been there.

Besides, indexing millions of documents into Elasticsearch is also complicated: you have to spin up lots of new nodes even when you don’t really need to scale your cluster right away. In my case, I just needed fast and efficient bulk indexing; searching and filtering don’t need to be super fast, because data velocity is not a problem yet, but volume is. Tuning bulk indexing and optimizing memory usage in Elasticsearch is not fun either.
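To give a sense of what that tuning looks like, here’s a minimal sketch using the Python elasticsearch client. The host, index name, and document generator are all placeholders, and the exact settings and client calls depend on your cluster and client version; the idea is just to trade search freshness and redundancy for raw write throughput while the load runs, then put the settings back.

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

INDEX = "documents"  # hypothetical index name

# Trade search freshness and redundancy for indexing throughput:
# disable refresh and replicas while the bulk load runs.
es.indices.put_settings(
    index=INDEX,
    body={"index": {"refresh_interval": "-1", "number_of_replicas": 0}},
)

def actions(docs):
    """Wrap plain dicts into bulk actions targeting the index."""
    for doc in docs:
        yield {"_index": INDEX, "_source": doc}

# In real life these come from your data pipeline; this is just a stand-in.
docs = ({"id": i, "body": "document %d" % i} for i in range(1_000_000))

# parallel_bulk streams documents in chunks across several threads.
for ok, result in helpers.parallel_bulk(es, actions(docs), chunk_size=5000, thread_count=4):
    if not ok:
        print("failed:", result)

# Restore normal settings once the load is done.
es.indices.put_settings(
    index=INDEX,
    body={"index": {"refresh_interval": "1s", "number_of_replicas": 1}},
)
```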

But I’m an impatient developer, and I need my stuff right now. The only way to do that is by liberating the data and moving it to a large Hadoop cluster.

Nowadays, with Amazon EMR (Elastic MapReduce) and spot instances, it’s really, really cheap to just spin up a cluster with hundreds of nodes and run your big data processing job in a very cost-efficient way. And you don’t even have to manage it: EMR has templates with pre-installed tools like Apache Sqoop, Spark, and Hadoop.
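For reference, spinning up a cluster like that from code looks roughly like this with boto3. The release label, region, key pair, roles, log bucket, and bid price are all placeholders you’d adapt to your own account; this is a sketch, not a ready-to-run setup.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # placeholder region

response = emr.run_job_flow(
    Name="data-crunching",                    # hypothetical cluster name
    ReleaseLabel="emr-5.8.0",                 # pick a current EMR release
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}, {"Name": "Sqoop"}],
    Instances={
        "InstanceGroups": [
            {
                "Name": "master",
                "InstanceRole": "MASTER",
                "InstanceType": "r4.2xlarge",
                "InstanceCount": 1,
                "Market": "SPOT",
                "BidPrice": "0.14",           # max spot price in USD/hour
            },
            {
                "Name": "workers",
                "InstanceRole": "CORE",
                "InstanceType": "r4.2xlarge",
                "InstanceCount": 40,
                "Market": "SPOT",
                "BidPrice": "0.14",
            },
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # cluster shuts down when the job is done
        "Ec2KeyName": "my-key-pair",           # placeholder key pair
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://my-bucket/emr-logs/",         # placeholder bucket
)

print(response["JobFlowId"])
```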

Ludicrous Speed

light speed is too slow, we’re gonna have to go right to Ludicrous Speed!

Why should you consider this approach?

It’s really cheap! I have a couple of jobs that are memory-heavy, and I run them on Spark. One of them runs on a cluster with 41 r4.2xlarge instances; each one has 8 vCPUs and 61GB of memory, so that’s a total of 328 vCPUs and about 2.5TB of memory. Neat, huh?

They’re all spot instances that I can rent for $0.14 per hour, so running the cluster costs about $6 per hour 💰💰💰. It may look like a lot, but here’s the cool thing about it: my job runs in about 30 minutes. After that, I just kill the cluster.
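If you want to check the back-of-the-envelope math:

```python
instances = 41
vcpus_per_instance = 8
gb_per_instance = 61
spot_price = 0.14  # USD per instance-hour

print(instances * vcpus_per_instance)       # 328 vCPUs
print(instances * gb_per_instance / 1000)   # ~2.5TB of memory
print(instances * spot_price)               # ~$5.74 per hour, call it $6
```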

It’s really fast! Compared to the process we had before (SQL, Logstash, some Elixir code, and some Bash scripts), I calculated it would take more than a month to run the whole thing and index all the data (yeah, it’s very slow). And that’s assuming a single server could even handle it (I bet it couldn’t), not counting the time it would take me to fix and optimize the whole process just to make it run properly and consistently.

Amazon even has a service now called AWS Glue that abstracts away the whole cluster thing, so you just pay by the hour to run your Spark jobs, pretty much like Lambda. And if you prefer Google Cloud, it has similar options too.

In my case, I can spin up the EMR cluster and let it crunch some terabytes of data while I brew my coffee, and spend just $6. That’s a huge improvement for an impatient dev like myself. After all, time is a very precious thing.

Sometimes running Hadoop and Spark is overkill, but for this problem it was a no-brainer: one person can build the whole MapReduce job and manage the cluster easily, then just move the final data to S3 when finished. So if you have a problem like that, you should seriously consider whether a proper big data tool could be a better solution than your current one.
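To give a feel for what that job looks like, here’s a PySpark sketch of the join-clean-export step. The S3 paths, table names, columns, and join key are all made up; in my setup the raw tables had already been exported from Postgres to S3 (Sqoop or Spark’s JDBC reader both work for that part).

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("crunch-and-export").getOrCreate()

# Raw tables previously exported from the RDBMS to S3 (placeholder paths).
users = spark.read.parquet("s3://my-bucket/raw/users/")
orders = spark.read.parquet("s3://my-bucket/raw/orders/")

# The kind of join/clean step that is painful on Postgres at this scale
# but trivial to parallelize across the cluster.
enriched = (
    orders.join(users, on="user_id", how="left")
          .filter(F.col("status") == "completed")
          .withColumn("total", F.col("quantity") * F.col("unit_price"))
          .dropDuplicates(["order_id"])
)

# Write the final, cleaned data back to S3; from there it can be
# bulk-indexed into Elasticsearch.
enriched.write.mode("overwrite").parquet("s3://my-bucket/processed/orders_enriched/")

spark.stop()
```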
