Haskeller & creator of the Repa library
Ben Lippmeier is building an analytic database based on Repa Flow for a recently founded startup. In previous incarnations he worked with Hadoop at the Commonwealth Bank of Australia (CBA), worked on Data Parallel Haskell at the University of New South Wales (UNSW), and taught programming language theory at the University of Sydney (USyd). Other projects include the Gloss graphics library for Haskell, and the Disciplined Disciple research compiler. Ben’s mind is empty, believing nothing in particular, least of all anyone’s favourite paradigm.
YOW! Lambda Jam 2015 Brisbane
Data Parallel Data Flow
Repa Flow is a library for data parallel data flow programming in Haskell. A flow is a bundle of independent streams, and the library provides operators such as map, fold and filter that apply to all streams in a bundle. Data parallelism is introduced by evaluating each stream in a separate thread on a multi-core machine. Like a souped-up version of the Haskell conduit or pipes library, Repa Flow adds support for efficient chunked streams of unboxed data; bucketed files; and analytic operators such as the SQL-like groupBy and the Hadoop-like shuffle. Repa Flow uses three separate array fusion methods to gain good numeric performance, all while maintaining a pleasant user-facing API.
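To make the idea of a bundle of streams concrete, here is a minimal sketch that uses only the base library. It is not Repa Flow's API: the names Bundle, mapB, filterB and foldB are hypothetical, and plain lists stand in for the chunked streams of unboxed data described above. It only shows operators applied to every stream in a bundle, with each stream folded in its own thread.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Data.List (foldl')

-- | Hypothetical stand-in for a flow: a bundle of independent streams.
-- Each stream is a plain list here, which is an assumption made for
-- brevity and not how Repa Flow represents its data.
newtype Bundle a = Bundle [[a]]

-- | Apply a function to every element of every stream in the bundle.
mapB :: (a -> b) -> Bundle a -> Bundle b
mapB f (Bundle ss) = Bundle (map (map f) ss)

-- | Keep only the elements that satisfy the predicate, stream by stream.
filterB :: (a -> Bool) -> Bundle a -> Bundle a
filterB p (Bundle ss) = Bundle (map (filter p) ss)

-- | Fold each stream in its own thread and return one result per stream,
-- in the same order as the streams appear in the bundle.
foldB :: (b -> a -> b) -> b -> Bundle a -> IO [b]
foldB f z (Bundle ss) = do
  vars <- mapM spawn ss
  mapM takeMVar vars
  where
    spawn s = do
      var <- newEmptyMVar
      -- ($!) forces the fold result (to weak head normal form) inside
      -- the worker thread rather than in the thread that reads the MVar.
      _ <- forkIO (putMVar var $! foldl' f z s)
      return var
```

For example, foldB (+) 0 (mapB (* 2) (filterB even (Bundle [[1 .. 10], [11 .. 20 :: Int]]))) sums the doubled even numbers of each stream separately. The threads only run on separate cores when the program is built with GHC's -threaded flag and started with +RTS -N.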
I’ll use Repa to introduce the Data Parallel Data Flow paradigm, and discuss the general concept of flow polarity, meaning that data is pulled from stream sources but pushed to stream sinks. If a given flow operator (like map or zip) has multiple input or output ports then polarities can be assigned to these ports in various ways. However, only some assignments permit the operator to run in constant space. This fact is intrinsic to the data-flow model, rather than being specific to any particular implementation. In Repa Flow, only operators that run in constant space are exported by the library, which ensures that all programs written with the library also run in constant space by construction.
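Polarity can be sketched in a few lines of Haskell using only the base library. The types and names below (Source, Sink, pull, push, eject, zipSources and so on) are hypothetical and chosen for illustration; they are not taken from Repa Flow. The sketch shows a map on a pulled stream, a map on a pushed stream, and a zip in which both inputs and the output are pulled, so it needs no buffer.

```haskell
import Data.IORef (modifyIORef', newIORef, readIORef, writeIORef)

-- | A pulled stream: the consumer asks for the next element.
-- Nothing signals that the stream has ended.
newtype Source a = Source { pull :: IO (Maybe a) }

-- | A pushed stream: the producer hands over elements one at a time,
-- then calls eject to signal the end of the stream.
data Sink a = Sink { push :: a -> IO (), eject :: IO () }

-- | Build a source that yields the elements of a list in order.
sourceList :: [a] -> IO (Source a)
sourceList xs0 = do
  ref <- newIORef xs0
  return $ Source $ do
    xs <- readIORef ref
    case xs of
      []       -> return Nothing
      (y : ys) -> writeIORef ref ys >> return (Just y)

-- | Build a sink that collects what is pushed into it, together with
-- an action that reads back the collected elements in arrival order.
sinkList :: IO (Sink a, IO [a])
sinkList = do
  ref <- newIORef []
  let snk = Sink { push = \x -> modifyIORef' ref (x :), eject = return () }
  return (snk, fmap reverse (readIORef ref))

-- | Map over a pulled stream: transform each element as it is pulled.
mapSource :: (a -> b) -> Source a -> Source b
mapSource f (Source p) = Source (fmap (fmap f) p)

-- | Map over a pushed stream: transform each element before passing it on.
mapSink :: (a -> b) -> Sink b -> Sink a
mapSink f (Sink ps ej) = Sink (ps . f) ej

-- | Zip with both inputs pulled and the output pulled. The operator decides
-- when to read each input, so it holds at most one element at a time.
-- If both inputs were pushed instead, the producers would decide when
-- elements arrive, and unmatched elements would have to be buffered.
zipSources :: Source a -> Source b -> Source (a, b)
zipSources (Source pa) (Source pb) = Source $ do
  ma <- pa
  case ma of
    Nothing -> return Nothing
    Just a  -> do
      mb <- pb
      return (fmap (\b -> (a, b)) mb)

-- | Connect a source to a sink: pull until the source ends, push each
-- element into the sink, then eject the sink.
drain :: Source a -> Sink a -> IO ()
drain src snk = loop
  where
    loop = do
      m <- pull src
      case m of
        Nothing -> eject snk
        Just x  -> push snk x >> loop
```

The drain function is where the two polarities meet: it is the only place in the sketch that both pulls and pushes. Draining zipSources of two list sources into a sinkList collects pairs until the shorter input runs out.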