Monoids for analytics

January 16, 2016

This is a short post on the elegance of using abstract algebra for analytics in Scala.

A monoid is a set TT that is closed under an associative binary operation appendappend with an identity element zerozero such that append(a,zero)=aappend(a, zero) = a. In other words, the following 3 properties apply:

  • Closure - the result of combining two elements of the set is also an elment of the set:

    a,bT:append(a,b)T\forall a, b \in T: append(a, b) \in T
  • Associativity - when combining more than two elements of a set the order of the pairwise combinations doesn’t matter, which makes a monoid well suited for parallel computations:

    a,b,cT:append(append(a,b),c)=append(a,append(b,c))\forall a, b, c \in T: append(append(a, b), c) = append(a, append(b, c))
  • Identity - there is a special element of the set that when combined with any other element of the set yields the same element:

    zeroT:aT:append(zero,a)=append(a,zero)=a\exists zero \in T: \forall a \in T: append(zero, a) = append(a, zero) = a

One way to define a trait for a monoid in Scala is the following:

trait Monoid[T] {
  def zero: T
  def append(a: T, b: T): T
}

Monoids are everywhere; think of the set of natural numbers and addition or the set of strings and concatenation. Also note that the same set can have multiple “monoidal forms”; for example the set of natural numbers can have both an additive and a multiplicative monoid.

Monoids compose well; for example a tuple of monoids is itself a monoid, as such it’s simple to define a monoid for a complex type once monoids for its constituents types exists. Scalaz and Algebird are two Scala libraries that provide monoids for data types such as List, Set, Option, Map and others. Algebird in particular, which is targeted at building aggregation systems, comes with a set of monoids useful for counting such as DecayedValue for exponential decay, AveragedValue for averaging and HyperLogLog for approximate cardinality counting.

As a concrete example on how monoids provided by Scalaz are used in practice, suppose we have a server that receives sparse histograms from its clients:

import scalaz._
import Scalaz._

case class Histogram(var values: Map[Long, Long], sum: Int, count: Int)

We would like to aggregate histograms over a time window. It’s simple enough to write some code to do that, yet monoids offer an elegant solution. Since Scalaz provides by default an additive monoid for Map and Long, we can easily define one for Histogram as well:

implicit def histogramMonoid: Monoid[Histogram] = new Monoid[Histogram] {
  def zero = Histogram(Map(), 0, 0)
  def append(a: Histogram, b: => Histogram) = Histogram(a.values |+| b.values,
                                                        a.sum    |+| b.sum,
                                                        a.count  |+| b.count)
}

Where |+| invokes the binary operator of the monoid defined over the type of its operators. We can aggregate histograms now that we have a monoid, e.g.:

val h1 = Histogram(Map(10 -> 12, 100 -> 41), 53, 3)
val h2 = Histogram(Map(10 -> 21, 80 -> 14), 35, 17)
h1 |+| h2

yields:

Histogram(Map(10 -> 33, 80 -> 14, 100 -> 41), 88, 20)

Let’s say that we have different histograms for different measurements which are stored in a mapping from strings to histograms. Since there is a monoid defined over Histogram then we can easily aggregate multiple Map[String, Histogram] as well:

val m1 = Map("metric1" -> h1, "metric2" -> h2)
val m2 = Map("metric1" -> h2, "metric3" -> h1)
m1 |+| m2

yields:

Map(metric1 -> Histogram(Map(10 -> 33, 80 -> 14, 100 -> 41), 88, 20),
    metric2 -> Histogram(Map(10 -> 21, 80 -> 14), 35, 17),
    metric3 -> Histogram(Map(10 -> 12, 100 -> 41), 53, 3))

This is just the tip of the iceberg of the elegance provided by abstract algebra. If this short article caught your interest grab a copy of Functional Programming in Scala, which is without any doubt the best functional programming book I have ever read. The book introduces a variety of abstract algebra concepts with their relative implementations.


Written by Roberto Vitillo

Want to learn how to build scalable and fault-tolerant cloud applications?

My book explains the core principles of distributed systems that will help you design, build, and maintain cloud applications that scale and don't fall over.

Sign up for the book's newsletter to get the first two chapters delivered straight to your inbox.

    I respect your privacy. Unsubscribe at any time.