Usage

Build Classifier

Builds text classifiers.

Before you can classify anything, you first need to train the classifiers. Some pre-built models are stored in the models folder, but any change to the classifier configuration, the feature definition, or the training data requires the classifiers to be rebuilt. So let's start with this step.

Classifiers are built with the build_classifiers.py program:

$ ./twitter_ml/classify/build_classifiers.py

2019-11-12 18:48:51,480 - __main__ - INFO - Loading feature sets and training data...
Review extraction: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:01<00:00, 804.53it/s]
Review extraction: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:01<00:00, 746.02it/s]
Feature encoding: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:00<00:00, 2347.35it/s]
2019-11-12 18:48:54,961 - __main__ - INFO - Creating classifiers...
2019-11-12 18:48:54,961 - twitter_ml.classify.sentiment - INFO - Training voting classifier
2019-11-12 18:48:55,519 - twitter_ml.classify.sentiment - INFO - Saving classifier to models/voting.pickle...
2019-11-12 18:48:55,520 - __main__ - INFO - Done.

This downloads a set of training data (the default configuration uses the Movie Reviews corpus that ships with the NLTK toolkit). The program extracts a set of features from the data and uses them to build a classifier, which is saved to the `models` folder.
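
Under the hood this follows the classic NLTK workflow: label the corpus documents, pick the most frequent words as features, encode each document as a bag of words, train, and pickle the result. A minimal sketch of the idea (not the project's actual code; the feature count and output path are assumptions):

import pickle

import nltk
from nltk.corpus import movie_reviews

# Fetch the default training corpus (NLTK movie reviews)
nltk.download("movie_reviews")

# Pair each review with its label ("pos" or "neg")
documents = [(list(movie_reviews.words(fid)), label)
             for label in movie_reviews.categories()
             for fid in movie_reviews.fileids(label)]

# Use the 2,000 most frequent words in the corpus as features
freq = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for w, _ in freq.most_common(2000)]

def encode(words):
    # Bag-of-words encoding: which feature words appear in the document?
    present = set(words)
    return {w: (w in present) for w in word_features}

feature_sets = [(encode(words), label) for words, label in documents]

# Train a simple classifier and pickle it for later use
classifier = nltk.NaiveBayesClassifier.train(feature_sets)
with open("models/example.pickle", "wb") as f:
    pickle.dump(classifier, f)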

The classifier is now ready to use. If you want to try it straight away, jump ahead to the Text Classifier section.

build_classifiers.py also supports other command-line arguments:

$ ./twitter_ml/classify/build_classifiers.py -h

usage: build_classifiers.py [-h] [--features] [--report] [--graphs] [--learning]

Builds scikit-learn/nltk classifiers based on training data.

optional arguments:
  -h, --help  show this help message and exit
  --features  list features and exit
  --report    print classifier metrics and exit
  --graphs    print classifier graphs and exit
  --learning  print classifier learning curves

The arguments are summarised below:

  • features
    Lists the words that are extracted from the training data and exits. These words are the ‘features’ used to classify documents.
  • report
    Builds the classifier, then runs it against some test data and prints a textual summary of its performance. This is useful for evaluating how well the classifier performs (a sketch of how such a report is produced follows this list).
Metrics:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       1.00      1.00      1.00        50

    accuracy                           1.00       100
   macro avg       1.00      1.00      1.00       100
weighted avg       1.00      1.00      1.00       100

Confusion matrix:
[[50  0]
 [ 0 50]]
  • graphs
    Prints a series of graphs summarising the behaviour of the classifier.
[Figure: confusion matrix plot (_images/confusion.png)]
  • learning
    Calculates learning curves for each of the sub-classifiers and plots the results. This shows how the performance of each classifier changes with the number of training samples.
[Figure: learning curves (_images/learning_curve.png)]
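
The Metrics block shown under the report option is a standard scikit-learn classification report. For reference, a report and confusion matrix like this can be produced as follows (the label arrays here are illustrative placeholders, not the project's test data):

from sklearn.metrics import classification_report, confusion_matrix

# True labels vs. labels predicted by the classifier (placeholders)
y_test = [0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0]

print("Metrics:")
print(classification_report(y_test, y_pred))
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))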

Text Classifier

A standalone program for classifying the sentiment of text using NLTK.

For example:

$ python twitter_ml/classify/classify_text.py --text "This is some negative text"
2019-10-18 13:34:33,791 - twitter_ml.classify.sentiment - INFO - Naive Bayes classifier from NLTK: neg
2019-10-18 13:34:33,808 - twitter_ml.classify.sentiment - INFO - Multinomial NB classifier from SciKit: neg
2019-10-18 13:34:33,826 - twitter_ml.classify.sentiment - INFO - Bernouilli NB classifier from SciKit: neg
2019-10-18 13:34:33,842 - twitter_ml.classify.sentiment - INFO - Logistic Regression classifier from SciKit: neg
2019-10-18 13:34:33,859 - twitter_ml.classify.sentiment - INFO - SGD classifier from SciKit: neg
2019-10-18 13:34:33,874 - twitter_ml.classify.sentiment - INFO - Linear SVC classifier from SciKit: neg
2019-10-18 13:34:34,076 - twitter_ml.classify.sentiment - INFO - Nu SVC classifier from SciKit: neg
2019-10-18 13:34:34,077 - twitter_ml.classify.sentiment - INFO - Voting Classifier: neg
Classification: neg; Confidence: 1.000000
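
If you want to do the same from your own code, classification amounts to unpickling a saved model and handing it an encoded document. A hedged sketch, assuming a classifier pickled as in the build step above and an encode() helper that mirrors the training-time feature encoding (both are assumptions, not the project's actual API):

import pickle

# Load a classifier saved by build_classifiers.py (path is an assumption)
with open("models/example.pickle", "rb") as f:
    classifier = pickle.load(f)

def encode(text, word_features):
    # Hypothetical helper: must match the encoding used during training
    present = set(text.lower().split())
    return {w: (w in present) for w in word_features}

word_features = ["bad", "great"]  # illustrative feature list
features = encode("This is some negative text", word_features)
print(classifier.classify(features))  # e.g. "neg"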

or:

$ python twitter_ml/classify/classify_text.py --waffle --text "This is bad" "This is great" "And this is great as well"

will generate a waffle diagram summarising the results (here one negative and two positive classifications).

[Figure: sample waffle diagram (_images/sample_waffle.png)]
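
Charts like this can be drawn with the pywaffle package. A minimal sketch, assuming pywaffle and matplotlib are installed, using the counts from the example above:

import matplotlib.pyplot as plt
from pywaffle import Waffle

# One negative and two positive classifications from the example above
plt.figure(
    FigureClass=Waffle,
    rows=5,
    values={"pos": 2, "neg": 1},
    legend={"loc": "upper left", "bbox_to_anchor": (1, 1)},
)
plt.show()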

or:

$ python twitter_ml/classify/classify_text.py --wordcloud --files tests/sample-text.txt

will classify the input files and then generate a word cloud summarising the most frequent words.

[Figure: word cloud (_images/wordcloud.png)]
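
The cloud itself is the kind of output the wordcloud package produces. A minimal sketch, assuming the input file has already been read into a string:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

with open("tests/sample-text.txt") as f:
    text = f.read()

# Build the cloud from the word frequencies in the text
cloud = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()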

Document Scanner

Start the analysis job (SPARK_ROOT is the folder where you installed Spark; path-to-this-git-repo is the directory where you cloned this repository):

cd $SPARK_ROOT
bin/spark-submit path-to-this-git-repo/doc-scanner/scan-doc.py some-file-to-analyse

The program supports a number of command line arguments:

usage: scan-doc.py [-h] [-v] [-s] [-p] file

Spark program to process text files and analyse contents

positional arguments:
  file        file to process

optional arguments:
  -h, --help  show this help message and exit
  -v          verbose logging
  -s          strip stopwords
  -p          plot figure
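
To give an idea of what such a job looks like, here is a minimal word-count-style PySpark sketch (a simplification, not the repository's scan-doc.py; the stopword list is illustrative):

import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("doc-scanner-sketch").getOrCreate()
sc = spark.sparkContext

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}  # illustrative only

# Split the file into words, strip stopwords, and count occurrences
counts = (sc.textFile(sys.argv[1])
          .flatMap(lambda line: line.lower().split())
          .filter(lambda word: word not in STOPWORDS)
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))

# Print the ten most frequent words
for word, n in counts.takeOrdered(10, key=lambda kv: -kv[1]):
    print(word, n)

spark.stop()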

Twitter-Kafka Publisher

The Twitter client needs API keys to read from Twitter. Sign up on the Twitter developer platform to get your own keys, then insert them into the code. The Kafka commands below are run from your Kafka installation directory.

  • Start by running Zookeeper:
bin/zookeeper-server-start.sh config/zookeeper.properties
  • Start the Kafka server:
bin/kafka-server-start.sh config/server.properties
  • Create a Kafka topic (we only need to do this once):
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic brexit
bin/kafka-topics.sh --list --bootstrap-server localhost:9092
  • Start the console listener (this is just to check Kafka is receiving tweets):
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic brexit --from-beginning
  • Start the Twitter producer:
python twitter-to-kafka.py

This will read tweets from Twitter and pump them into Kafka. It will also print the tweets to the console.
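
For orientation, a producer along these lines can be written with tweepy (3.x) and kafka-python. This is a hedged sketch of the idea, not the repository's twitter-to-kafka.py; the credential strings are placeholders:

import tweepy
from kafka import KafkaProducer

# Placeholder credentials: insert your own Twitter API keys
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

producer = KafkaProducer(bootstrap_servers="localhost:9092")

class KafkaForwarder(tweepy.StreamListener):
    def on_data(self, data):
        # Forward the raw tweet JSON to the Kafka topic and echo it
        producer.send("brexit", data.encode("utf-8"))
        print(data)
        return True

stream = tweepy.Stream(auth, KafkaForwarder())
stream.filter(track=["brexit"])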

The Twitter Analyser

I had to set an environment variable to enable multi-threaded applications on a Mac (apparently due to security changes):

export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES

  • Start the analysis job (SPARK_ROOT is the folder where you installed Spark; path-to-this-git-repo is the directory where you cloned this repository):
cd $SPARK_ROOT
bin/spark-submit path-to-this-git-repo/twitter-stream-analyser/read-tweets-kafka.py

This will launch the Spark platform in standalone mode and submit the Python job, which reads tweets from Kafka.
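
The consumer side can be as small as a Spark Streaming job built on the spark-streaming-kafka-0-8 API mentioned in the PyCharm section below. A simplified sketch, not the actual read-tweets-kafka.py:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="tweet-reader-sketch")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

# Subscribe to the "brexit" topic on the local Kafka broker
stream = KafkaUtils.createDirectStream(
    ssc, ["brexit"], {"metadata.broker.list": "localhost:9092"})

# Each record is a (key, value) pair; print the tweet payloads
stream.map(lambda kv: kv[1]).pprint()

ssc.start()
ssc.awaitTermination()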

Running from PyCharm

This blog has some useful information on running Spark jobs from PyCharm.

In summary:

  • Edit your .profile (or .bash_profile, or whatever) to add the SPARK_HOME and PYTHONPATH settings (example values follow this list)
  • Add the Hadoop Python libraries to the PyCharm project interpreter settings
  • Edit $SPARK_HOME/conf/spark-default.conf to include the line:
spark.jars.packages org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.4.0
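
For reference, the environment settings usually look something like the lines below; the exact paths and the py4j version depend on your installation (0.10.7 is the version bundled with Spark 2.4.0):

export SPARK_HOME=/path/to/spark
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH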

Note: the version numbers in the spark.jars.packages line depend on your versions of Spark (here 2.4.0), Scala (2.11) and Kafka. If you run your Spark program with the wrong setting, it will print an error message telling you which version to add. The package is used to download the relevant JARs from Maven the first time you run the code.