Find lines with the most words
==============================

This is an example based on the Getting Started examples (https://spark.apache.org/docs/latest/quick-start.html) from the Apache Spark site. The sample finds the line(s) in a text file with the greatest number of words. The example does not use a MapReduce model; instead it shows the basic DataFrame API, which is the main data abstraction in Spark.

Start by reading a file into a DataFrame (via the SparkSession). Return a line count and show the first row.

::

    from pyspark.sql import SparkSession

    # create a SparkSession. If running in Databricks, this is already done for you
    spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

    filename = "FileStore/tables/README.md"  # path when running on Databricks
    filename = "README.md"                   # path when running locally
    textFile = spark.read.text(filename)

    textFile.count()
    textFile.first()

Now split each line into an array of words using the split function. The size of this array gives the number of words on each line.

::

    from pyspark.sql.functions import *

    # create a Column of arrays (split is a spark.sql function that splits on a regexp)
    words = split(textFile.value, r"\s+").name("words")

    # create a DataFrame containing a single column of word counts
    # (size is a built-in that counts the number of elements in each array)
    wcDf = textFile.select(size(words).name("numWords"), words)
    # wcDf.show()

    # find the max value in the column and convert to a DataFrame
    wcDf.agg(max(col("numWords"))).show()

This outputs:

::

    +-------------+
    |max(numWords)|
    +-------------+
    |           94|
    +-------------+

We can use this result to extract the actual rows that match this word count. This uses the filter method to find matching records.

.. code:: python

    # show rows with maxWords (this requires 2 passes over the DataFrame - not efficient).
    # Note the filter syntax uses column notation.
    maxValue = wcDf.agg(max(col("numWords"))).collect()[0][0]
    wcDf.filter(wcDf.numWords == maxValue).show()

This outputs:

::

    +--------+--------------------+
    |numWords|               words|
    +--------+--------------------+
    |      94|[Spark, and, the,...|
    +--------+--------------------+
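
As noted in the comment above, computing the maximum and then filtering requires two passes over the DataFrame. A minimal single-pass sketch, assuming only the top row is needed (ties would still require the two-pass filter or a window function), is to sort by the word count and keep the first row:

.. code:: python

    from pyspark.sql.functions import desc

    # order by word count descending and keep only the first row
    # (sketch only: if several lines share the maximum, this shows just one of them)
    wcDf.orderBy(desc("numWords")).limit(1).show()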