Calculate Pi via MapReduce

Another example from the Apache Spark git repo. This one uses a MapReduce paradigm to calculate (approximately) the value of Pi.

Imagine a circle drawn inside a square. This example calculates Pi by randomly throwing darts (well, creating random data points) and seeing whether these fall inside a pre-defined circle. Pi is then estimated from the ratio of points that land inside the circle to the total number of points thrown. The sizes of the circle and ‘dartboard’ (actually a square) are intentionally chosen to simplify the arithmetic: the circle has radius 1 and the square has sides of length 2.

This shows one way of approximating Pi. The more darts you throw (by increasing ‘n’), the more accurate the calculation becomes. But the main objective of the example is to show:

  1. how to solve problems using a MapReduce paradigm
  2. how to spread a large volume of computation across partitions for parallel execution
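
Before looking at the Spark version, here is a minimal plain-Python sketch of the same dart-throwing idea (the function and variable names are illustrative, not part of the original example):

from random import random

def estimate_pi(num_darts):
    # Throw 'num_darts' random points at the square spanning (-1,-1) to (1,1)
    # and count how many land inside the unit circle.
    hits = 0
    for _ in range(num_darts):
        x = random() * 2 - 1
        y = random() * 2 - 1
        if x ** 2 + y ** 2 <= 1:
            hits += 1
    # The fraction of hits approximates pi/4, so scale up by 4.
    return 4.0 * hits / num_darts

print(estimate_pi(1_000_000))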

MapReduce

MapReduce is an algorithm that breaks a problem into two steps:

  1. Apply a mapping function to transform the data. In our case, the mapping function throws the dart and checks whether it hit the circle. It does not actually look at the number that was provided in the original data set.
  2. Apply a reducing function to collapse the mapped values into a single value. In this case, the reducer just adds the values together to count how many times the dart hit the circle. A plain-Python sketch of both steps follows this list.
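
The same two steps can be sketched with Python's built-in map and functools.reduce, with no Spark involved (names here are just for illustration):

from functools import reduce
from operator import add
from random import random

def throw_dart(_):
    # Map step: ignore the input value, throw one dart, return 1 for a hit, 0 for a miss.
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0

darts = range(10_000)
hits = map(throw_dart, darts)      # map: transform each element into a 0 or 1
total_hits = reduce(add, hits)     # reduce: collapse all the 0s and 1s into a single count
print(4.0 * total_hits / len(darts))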

Parallel Execution

The example makes use of the ‘parallelize’ method. This splits the data set (in this case, our pack of darts to throw) into slices. Each slice is a partition that will be processed in parallel with all the others. The actual level of parallelism depends on the cluster that executes our code, but the example shows how the Spark framework hides all of the complexity involved in distributing work packages to nodes and combining the execution results.
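
As a rough sketch of what ‘parallelize’ does (assuming a SparkSession called ‘spark’ is available, as created in the code below), you can inspect how the data is sliced into partitions:

rdd = spark.sparkContext.parallelize(range(10), numSlices=4)
print(rdd.getNumPartitions())   # number of partitions, i.e. 4
print(rdd.glom().collect())     # the elements grouped by the partition they landed in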

import sys
from random import random
from operator import add
from pyspark.sql import SparkSession

# create a SparkSession. If running in Databricks, this is already done for you
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

partitions = 10
# n is the number of random numbers we will generate (split over 'partitions' partitions)
n = 100 * partitions

# Function to 'throw a dart at the circle' and decide whether it lands inside it.
# Assume the circle has radius 1 and is centred at the origin, so co-ordinates are generated between (-1,-1) and (1,1).
#
# A data point is in the circle if its distance from the centre is less than the radius (i.e. < 1).
# Distance from the centre is sqrt(x^2 + y^2), so rather than taking a square root we can square both
# sides of the inequality and simply check x^2 + y^2 <= 1.
def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0

# The whole pipeline can also be written as a single chained expression (equivalent to the two steps below):
#count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)

# Generate the data points in parallel. Level of partitioning is determined by the 'partitions' variable.
# Data points are generated by applying a map function (f) to throw the dart and record if it was in or not
mapped = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f)

# Once we have thrown the darts (and recorded if they were on target), add up the total that hit the circle.
# This is done using a reduce function (simply adding up all those that hit the target)
reduced = mapped.reduce(add)

# We now have the total of all darts that hit the circle.
# Now what?
# If the darts are uniformly random, the probability of a dart landing in the circle = area(circle) / area(square)
# By choosing a radius of 1 (and therefore a square of side 2), this simplifies down...
# = pi * r^2 / (width * height) = pi * 1^2 / (2 * 2) = pi / 4
# In other words, count / n is approximately pi / 4, so pi is approximately 4 * count / n

print("Pi is roughly %f" % (4.0 * reduced / n))

Pi is roughly 3.136000
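
To see the accuracy improve as ‘n’ grows, one possible convergence check (a sketch that reuses the ‘spark’ session, ‘f’, ‘partitions’ and ‘add’ already defined above) is:

for scale in (100, 10_000, 1_000_000):
    trials = scale * partitions
    # Same map/reduce pipeline as before, just with more darts per partition.
    hits = spark.sparkContext.parallelize(range(trials), partitions).map(f).reduce(add)
    print("n = %d -> Pi is roughly %f" % (trials, 4.0 * hits / trials))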