It’s a batch data processing framework. It’s distributed, it uses Cassandra as a data store and RabbitMQ for task scheduling, it implements basic relational algebra, it allows very flexible row-based data transforms.
It is somewhat similar to Apache Spark, but there are some key features that make it stand out:
-
written in Go, uses Go expressions for simple one-liner data transforms, Python for complex calculations;
-
declarative, formalized data transform process configuration, explicit DAG as part of the configuration;
-
each transform produces a persistent Cassandra table that is used a source for further transforms, or for troubleshooting;
-
the focus is on delivering production-quality data, so intermediate results can be signed-off or re-calculated by an operator if needed.
Github readme has instructions how to run a 100% Docker-based demo and see Capillaries in action within minutes.
Version 1.1.12 adds minor improvements that allows to obtain tangible performance results and cost estimates for a ~800-core AWS-based test environment: Capillaries: ARK portfolio performance calculation at scale.