Introduction

This report outlines the methodology for building a word prediction application using Natural Language Processing techniques as part of the Coursera Data Science Specialization.

The essence of the Capstone project is to create an application that uses NLP techniques and predictive analytics to take in a phrase and return the predicted next word, as SwiftKey's applications do.

The project is developed in partnership with SwiftKey, a company well known for its predictive text analytics. As of March 1, 2016, SwiftKey became part of the Microsoft family of products. SwiftKey applications are used on Android and iOS, where they use Natural Language Processing (NLP) techniques to anticipate and offer next-word choices while the user types. Microsoft's Word Flow technology is another example of NLP in action.

Additionally, the work presented in this project follows the tenets of reproducible research and all code is available in an open-source repository to enable readers to review the approach, reproduce the results, and collaborate to enhance the model.

The milestone report outlines the initial approach to building a series of ngram models from a range of text documents.

Overview of the application

The application was developed in R using a number of packages and the Shiny web framework.
The sections below outline the methodology used to build the model, make predictions and evaluate the application.

NGram model

A sample of text was taken from the SwiftKey corpus data (15% of the original).
The text was cleaned of non-ASCII characters, derogatory language, punctuation, non-words and extra whitespace.
The corpus was processed to produce 5 ngram models, which were combined into a data table together with the total frequency of each ngram (bag of words).
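
The cleaning and counting steps above can be sketched as follows. This is a minimal illustration in Python, not the application's R code: only the name clean_text comes from this report, while BLOCKED_WORDS, ngram_counts, the lower-casing step and the exact regular expression are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical placeholder; the report does not say which word list was used.
BLOCKED_WORDS = {"darn"}

def clean_text(text):
    """Drop non-ASCII characters, punctuation, non-words, blocked words
    and extra whitespace. Lower-casing is an assumption of this sketch."""
    text = text.encode("ascii", "ignore").decode("ascii").lower()
    # Replace anything that is not a letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    words = [w for w in text.split() if w not in BLOCKED_WORDS]
    return " ".join(words)

def ngram_counts(sentences, n):
    """Count every run of n consecutive words across the cleaned sentences."""
    counts = Counter()
    for sentence in sentences:
        words = clean_text(sentence).split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# One frequency table per ngram size, 1 to 5.
sample = ["The cat sat.", "The cat ran!"]
models = {n: ngram_counts(sample, n) for n in range(1, 6)}
```

Keeping one table per ngram size makes it straightforward to look up the count of both a matched ngram and its shorter context later on.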

Prediction

A prediction function then takes a sentence as input and executes the following steps:

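Validates and cleans the input sentence (using the same clean_text function that was used to build the ngram models)
For each ngram model, takes the last N words of the input text, where N is the size of the ngram model minus 1. For example, the 5 gram model uses the last 4 words of the input and the 2 gram model uses only the last word.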
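
The context selection for each ngram model can be sketched as below. This is an illustrative Python sketch rather than the application's R code; the function name context_for_order and the choice to return None for inputs that are too short are assumptions.

```python
def context_for_order(words, n):
    """Return the last n-1 words of the input as a tuple.

    Returns an empty tuple for the 1 gram model (no context needed) and
    None when the input has fewer than n-1 words (an assumed convention).
    """
    k = n - 1
    if k == 0:
        return ()
    if len(words) < k:
        return None
    return tuple(words[-k:])

words = "i would like to thank".split()
# Context for each model size from 5 down to 2.
contexts = {n: context_for_order(words, n) for n in range(5, 1, -1)}
```

The explicit check for k == 0 matters in Python because words[-0:] would return the whole list rather than an empty one.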

Description of the algorithm used to make the prediction

With a data table containing the ngram model, sentence, frequency and predicted word, the top 3 most probable words are predicted using a Stupid Backoff smoothing strategy.
A pseudo code description to calculate the score for each word follows:

if the row's ngram model is 5
    score = matched 5 gram count / input 4 gram count
else if the row's ngram model is 4
    score = 0.4 * matched 4 gram count / input 3 gram count
else if the row's ngram model is 3
    score = 0.4 * 0.4 * matched 3 gram count / input 2 gram count
else if the row's ngram model is 2
    score = 0.4 * 0.4 * 0.4 * matched 2 gram count / input 1 gram count
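
The pseudocode above applies one factor of 0.4 for every step the model backs off from the 5 gram model, so it can be written as a single expression. The following Python sketch shows this; the function and parameter names are assumptions, and returning 0 for an unseen context is a choice made for the example.

```python
BACKOFF = 0.4   # discount applied once per backoff step, as in the pseudocode
MAX_ORDER = 5   # largest ngram model used

def stupid_backoff_score(order, matched_count, context_count,
                         max_order=MAX_ORDER, backoff=BACKOFF):
    """Score one row: the relative frequency of the matched ngram,
    discounted by backoff ** (number of steps below max_order)."""
    if context_count <= 0:
        return 0.0  # assumed behaviour when the context was never seen
    return backoff ** (max_order - order) * matched_count / context_count

# A 4 gram seen 2 times whose 3 gram context was seen 4 times.
score = stupid_backoff_score(4, 2, 4)
```

Note that these scores are not probabilities: Stupid Backoff does not normalise them, which is what makes it cheap to compute.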

Finally, the scores of identical predicted words are grouped and summed.
For example, if the predicted word "you" was found by the 4 gram model (and thus also by the 3 gram and 2 gram models), the results may look like this:

ngram	predicted	score
4	you	0.2
3	you	0.1
2	you	0.05

The total score for the predicted word "you" is 0.2 + 0.1 + 0.05 = 0.35. The scores are summed in this way for each word, and the 3 words with the highest totals are selected. When no results are found, the 3 most common words in the English language ('the', 'be', 'to') are returned instead.
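
The grouping, ranking and fallback described above can be sketched as follows. This is an illustrative Python sketch, not the application's R code; the function name top_predictions and the row layout (ngram size, predicted word, score) are assumptions.

```python
from collections import defaultdict

# Fallback words named in this report for the case of no matches.
FALLBACK_WORDS = ["the", "be", "to"]

def top_predictions(rows, k=3):
    """Sum the scores per predicted word and return the k best words.

    rows is an iterable of (ngram_size, predicted_word, score) tuples.
    """
    totals = defaultdict(float)
    for _size, word, score in rows:
        totals[word] += score
    if not totals:
        return FALLBACK_WORDS[:k]
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _total in ranked[:k]]

rows = [(4, "you", 0.2), (3, "you", 0.1), (2, "you", 0.05), (3, "me", 0.3)]
best = top_predictions(rows)
```

Here "you" totals 0.35 and outranks "me" at 0.3, even though "me" has the highest single-row score.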

Evaluation

The prediction model was evaluated using the Benchmark.R tool (see references for source).
Initial prediction accuracy was quite high, but predictions were slow. Restricting the search to the 1 to 3 gram models halved the search time and reduced accuracy by only 10%.
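
The trade-off between accuracy and speed can be measured with a loop like the one below. This Python sketch is not the Benchmark.R tool and does not claim to reproduce its method: the evaluate function, the top-3 hit criterion and the dummy predictor are all assumptions made for illustration.

```python
import time

def evaluate(predict, test_cases):
    """Return (top-3 accuracy, elapsed seconds) for a prediction function.

    predict takes an input text and returns a ranked list of words;
    test_cases is a list of (input_text, expected_next_word) pairs.
    """
    hits = 0
    start = time.perf_counter()
    for text, expected in test_cases:
        if expected in predict(text)[:3]:
            hits += 1
    elapsed = time.perf_counter() - start
    accuracy = hits / len(test_cases) if test_cases else 0.0
    return accuracy, elapsed

# Hypothetical stand-in predictor that always returns the fallback words.
def dummy_predict(text):
    return ["the", "be", "to"]

cases = [("i want", "to"), ("the black", "cat")]
accuracy, elapsed = evaluate(dummy_predict, cases)
```

Running the same test cases against two model configurations gives directly comparable accuracy and timing figures.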

Instructions

To use the application, navigate to the following URL:
https://chrismckelt.shinyapps.io/datascience-capstone/
Then start typing text.

References

Speech and Language Processing, by D. Jurafsky et al., Chapter 4, draft of January 9, 2015
JHU DS Capstone SwiftKey Dataset
Large Language Models in Machine Translation, by T. Brants et al., in EMNLP/CoNLL 2007
Next word prediction benchmark

Word prediction with Natural Language Processing