What is Apex and why is it important?

Apache Apex (http://apex.apache.org/) is a stream processing platform and framework that can process data in motion with low latency in a way that is highly scalable, highly performant, fault tolerant, stateful, secure, distributed, and easily operable. Apex is written in Java, and Java is also the primary application development language.

In a typical streaming data pipeline, events from sources are stored and transported through a system such as Apache Kafka. The events are then processed by a stream processor, and the results are delivered to sinks, which are frequently databases, distributed file systems, or message buses that link to downstream pipelines or services.

The following figure illustrates this:

In the end-to-end scenario depicted in this illustration, we see Apex as the processing component. The processing can be complex logic, with operations performed in sequence or in parallel in a distributed environment.

Apex runs on cluster infrastructure and currently supports and depends on Apache Hadoop, for which it was originally written. Support for Apache Mesos and other Docker-based infrastructure is on the roadmap.

Apex supports integration with many external systems out of the box, with connectors that are maintained and released by the project, including but not limited to the systems shown in the preceding diagram. The most frequently used connectors are Kafka consumers and file readers. Frequently used sinks for the computed results are files and databases, though results can also be delivered directly to frontend systems for purposes such as real-time reporting from the Apex application, a use case that we will look at later.
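To make the pipeline shape concrete, the following sketch outlines how such a source-to-sink application might be assembled with Apex's Java API. It assumes the Apex core and Apache Apex Malhar connector libraries are on the classpath; the application name and operator names used here are illustrative and should be checked against the Malhar version in use. It is a structural sketch, not a deployable application: in a real deployment the Kafka topic and broker addresses would be supplied through configuration, and a file or database output operator from Malhar would replace the console sink.

```java
import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.annotation.ApplicationAnnotation;

// Connectors from the Apache Apex Malhar library (assumed on the classpath).
import org.apache.apex.malhar.kafka.KafkaSinglePortInputOperator;
import com.datatorrent.lib.io.ConsoleOutputOperator;

/**
 * A minimal pipeline sketch: read events from a Kafka topic and deliver
 * them to a sink. The console operator stands in for a file or database
 * writer here to keep the example small.
 */
@ApplicationAnnotation(name = "KafkaToSinkDemo")
public class KafkaToSinkDemo implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // Source: consumes events from Kafka (topic and brokers are set
    // through the application configuration, not hard-coded here).
    KafkaSinglePortInputOperator input =
        dag.addOperator("kafkaInput", new KafkaSinglePortInputOperator());

    // Sink: emits the results; a Malhar file or JDBC output operator
    // would take its place in a production pipeline.
    ConsoleOutputOperator output =
        dag.addOperator("console", new ConsoleOutputOperator());

    // Wire source to sink; any processing operators would be inserted
    // between these two endpoints as additional streams.
    dag.addStream("events", input.outputPort, output.input);
  }
}
```

The DAG (directed acyclic graph) of operators and streams defined in populateDAG is what the Apex engine distributes across the cluster at launch time, which is how the sequential and parallel processing described above is expressed.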

Origin of Apex

The development of the Apex project started in 2012, with the original vision of enabling fast, performant, and scalable real-time processing on Hadoop. At that time, batch processing and MapReduce-based frameworks such as Apache Pig, Hive, or Cascading were still the standard options for processing data. Hadoop 2.x with YARN (Yet Another Resource Negotiator) was about to be released, paving the way for a number of new processing frameworks and paradigms to become available as alternatives to MapReduce. Due to its roots in the Hadoop ecosystem, Apex is very well integrated with YARN, and since its earliest days has offered features such as dynamic resource allocation for scaling and efficient recovery. High performance with low latency, scalability, and operability were focus areas from the very beginning and remain strengths of the platform.

The technology was donated to the Apache Software Foundation (ASF) in 2015, at which time it entered the Apache incubation program; it graduated after only eight months, achieving Top-Level Project status in April 2016.

Apex had its first production deployments in 2014 and today is used in mission-critical deployments in various industries for processing at scale. Use cases range from very low-latency processing in the real-time category to large-scale batch processing; a few examples will be discussed in the next section. Some of the organizations that use Apex can be found on the Powered by Apache Apex page on the Apex project website at https://apex.apache.org/powered-by-apex.html.