Find lines with the most words

This is an example based on the Getting Started examples (https://spark.apache.org/docs/latest/quick-start.html) from the Apache Spark site. The sample finds the line(s) in a text file with the greatest number of words.

The example does not use the MapReduce model; instead it uses the basic DataFrame API. The DataFrame is the main data abstraction in Spark.

Start by reading a file into a DataFrame (via the SparkSession). Then return a line count and show the first row.

from pyspark.sql import SparkSession

# create a SparkSession. If running in Databricks, one is already available as spark
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

filename = "FileStore/tables/README.md"
filename = "README.md"

textFile = spark.read.text(filename)   # one Row per line, in a single column named "value"
textFile.count()
textFile.first()
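
In a notebook the result of the last expression in a cell is displayed automatically. If this is run as a standalone script instead, wrap the calls in print() to see the output (a minimal sketch):

# when running as a plain script rather than in a notebook, print the results explicitly
print(textFile.count())   # number of lines in the file
print(textFile.first())   # the first Row of the DataFrame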

Now split each line into an array of words using the split function. The size of this array gives the number of words on each line.

from pyspark.sql.functions import *

# create a Column of word arrays (split is a spark.sql function that splits on a regexp)
words = split(textFile.value, r"\s+").name("words")

# create a DF containing a word-count column (size counts the number of elements in each array)
wcDf = textFile.select(size(words).name("numWords"), words)
#wcDf.show()

# aggregate to find the max value in the column (agg returns a one-row DataFrame)
wcDf.agg(max(col("numWords"))).show()

This outputs:

+-------------+
|max(numWords)|
+-------------+
|           94|
+-------------+

We can use this maximum to extract the actual rows with that word count. This uses the filter method to find the matching records.

# show rows with the maximum word count (this requires 2 passes over the DF - not efficient). Note that filter uses column notation.
maxValue = wcDf.agg(max(col("numWords"))).collect()[0][0]
wcDf.filter(wcDf.numWords == maxValue).show()

This outputs:

+--------+--------------------+
|numWords|               words|
+--------+--------------------+
|      94|[Spark, and, the,...|
+--------+--------------------+
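
The two-pass approach above is fine for a small file. As a sketch of a single-pass alternative, a window function can attach the maximum to every row and filter in the same query; note that an unpartitioned window moves all rows to a single partition, which is only reasonable for small data:

# single-pass alternative (sketch): compute the max over an unpartitioned window,
# then keep only the rows that match it. Spark warns that no partition is defined,
# which is acceptable for a small file such as README.md.
from pyspark.sql import Window

w = Window.partitionBy()  # one window covering the whole DataFrame
wcDf.withColumn("maxWords", max("numWords").over(w)) \
    .filter(col("numWords") == col("maxWords")) \
    .drop("maxWords") \
    .show()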