Paper, Semantic Scholar Link, DOI Link

Authors: Michael Wawrzoniak, Gianluca Moro, Rodrigo Bruno, Ana Klimovic, and Gustavo Alonso

Serverless is not well suited to data analytics because of the limitations of current platforms: short function execution times, no persistent state, no direct communication between functions, and a higher cost per second than VMs.

Two approaches are usually taken to address these limitations:

  • Extend existing serverless platforms, often with VM-based services, to better support data analytics workloads.
  • Build new engines designed to work around the limitations.

The main problem with these approaches is that they implicitly give up on existing distributed data processing platforms (e.g., Spark, Flink, Drill).

In this paper, the authors explore and propose an alternative solution for data analytics on serverless: running off-the-shelf distributed data processing platforms (e.g., Spark or Drill) on top of existing commercial serverless platforms (AWS Lambda).

To achieve this, they use Boxer to handle the unmodified processes of the data processing engine.