PDF slides: Data Streams
A data stream is a real-time, continuous, ordered (implicitly by arrival time of explicitly by timestamp) sequence of items. It is impossible to control the order in which items arrive, nor it is feasible to locally store a stream in its entirety. Stream data management is important in telecommunications, real-time facilities monitoring, stock monitoring and trading, preventing fraudulent transactions and click stream analysis.
To process streams more easily, Data Stream Management Systems (DSMS) have been proposed like StreamBase. In DSMS, many things are just opposite from a Database Management System (DBMS): in a DBMS the data is persistent and the query volatile, while in a DSMS the query is persistent and gives continuous answers, while the data is volatile. A DSMS processes queries over a stream of data, by partitioning that stream in windows and evaluating the query for every new window, producing a never ending stream of results. The windows can be time-limited, size-limited, or punctuated by specific kinds of events.
Twitter has built an open-source data stream management system called Storm. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. With Storm, one creates event workflows, called "Topologies" where events stem from stream sources called "Spouts" (e.g. Twitter feeds, stock tickers, RSS) and are processed by so-called "Bolts". Topologies determine how many of each of such elements are used (each becomes a process on a compute node) and how data is distributed and split between them, allowing all these elements to run in parallel on a large cluster.
The term Lambda Architecture was coined for systems that consist of two Big Data pipelines: one batch-oriented (e.g. MapReduce) that computes static views on most data, and a second pipeline that computes dynamic data views on the last few hours of data using a stream engine (e.g. Storm). This allows to compute current query answers over all data relatively cheaply -- however the architecture duplicates logic in two pipelines. The name Lambda Architecture refers to computing a function over the data; this ability to recompute everything over independent subsets of data makes the architecture possible,
Finally, we discuss the new Google DataFlow which will enter open source life as Apache Beam. This is Google' own successor to MapReduce. It has a data processing framework with operators similar to Map (ParDo) and Reduce (GroupByKey) - but also many operator such as Filter and Join. These operators can be stitched together in ``pipelines'' and the framework was initially called FlumeJava. Dataflow adds to this a stream extension that moves from (Key,Value) pairs used in MapReduce to (Key,Value,EventTime,Window) tuples, where each key-value pair also carries a time when the event occurred, and a session window. The DataFlow framework allows many times of window specification and allows to explicitly define computation policies based on the EventTime and Processing Time, which are independent concepts. As such, it provides a richer expressive power in to specify temporal behavior than mot other streaming engines.
For technical background material, there are three papers,
Slidedeck on Spark Streaming:
The introduction of Google DataFlow:
Finally, something on Storm; although it is mostly superseded by now: