Final Report for Data Science Capstone
Word prediction with Natural Language Processing
This report outlines the methodology for building a word prediction application using Natural Language Processing techniques as part of the Coursera Data Science Specialization.
The essence of the Capstone project is to create an application that uses NLP techniques and predictive analytics and, like SwiftKey’s applications, takes in a phrase and returns the predicted next word.
The project is developed in partnership with SwiftKey, a company well known for its predictive text analytics. As of March 1, 2016, SwiftKey became part of the Microsoft family of products. SwiftKey applications are used on Android and iOS, where they apply Natural Language Processing (NLP) techniques to anticipate and offer next-word choices while the user types. Microsoft’s Word Flow Technology is another example of NLP in action.
Additionally, the work presented in this project follows the tenets of reproducible research and all code is available in an open-source repository to enable readers to review the approach, reproduce the results, and collaborate to enhance the model.
Overview of the application
The application was developed in R using a number of packages and the Shiny web framework.
The steps below outline the methodology used to build the model, make predictions and evaluate the application.
- A sample of text was taken from the SwiftKey corpus data (15% of the original).
- The text was cleaned of non-ASCII characters, derogatory language, punctuation, non-words and extra whitespace.
- The corpus was processed to produce 5 ngram models, which were combined into a single data table together with the total frequency of each ngram (bag of words).
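The cleaning and counting steps above can be sketched as follows. This is a minimal illustration in Python, not the author's R code: the function names, the regular expressions and the reading of "5 ngram models" as sizes 1 to 5 are all assumptions, and the banned-word list is a placeholder for whatever derogatory-language list the application actually uses.

```python
import re
from collections import Counter

def clean_text(text, banned=frozenset()):
    """Return a list of cleaned words (hypothetical stand-in for 'clean_text')."""
    # Drop non-ASCII characters and lowercase the rest.
    text = text.encode("ascii", "ignore").decode("ascii").lower()
    # Delete apostrophes so contractions stay as one token.
    text = text.replace("'", "")
    # Replace anything that is not a letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() also collapses runs of whitespace.
    return [w for w in text.split() if w not in banned]

def ngram_counts(lines, max_n=5):
    """Return {n: Counter of n-word tuples} for n = 1..max_n (assumed sizes)."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        words = clean_text(line)
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[n][tuple(words[i:i + n])] += 1
    return counts

counts = ngram_counts(["Thank you, thank you!"])
print(counts[2].most_common(1))
```

Each counter plays the role of one ngram model; stacking them with a column for the ngram size would give a single table like the one described above.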
A prediction function then takes a sentence as input and executes the following steps:
- Validates and cleans the input sentence (using the same ‘clean_text’ function that was used to build the ngram models)
- For each ngram model, takes the last N words of the input text, where N is the size of the ngram model minus 1. For example, the ngram model of size 4 is matched against the last 3 words of the input.
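As a rough illustration of the lookup step, the sketch below (Python, with a hypothetical function name) extracts the trailing words that would be matched against a model of a given size. It assumes the input has already been cleaned and split into words, and that a model is skipped when the input is too short to fill its context.

```python
def context_for_model(words, n):
    """Return the last n-1 words as a tuple, or None if the input is too short."""
    k = n - 1
    if k == 0:
        return ()          # a size-1 model needs no context
    if len(words) < k:
        return None        # not enough input words for this model
    return tuple(words[-k:])

words = ["thanks", "for", "the", "follow"]
for n in range(5, 1, -1):  # query the largest model first
    print(n, context_for_model(words, n))
```

The explicit check for `k == 0` matters in Python because `words[-0:]` would return the whole list rather than an empty one.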
Description of the algorithm used to make the prediction
With a data table containing the ngram model, sentence, frequency and predicted word, the top 3 most probable words are predicted using a Stupid Backoff smoothing strategy.
A pseudocode description of how the ‘score’ for each word is calculated follows:
    if the row's ngram model was 5
        score = matched 5 gram Count / input 4 gram Count
    else if the row's ngram model was 4
        score = 0.4 * matched 4 gram Count / input 3 gram Count
    else if the row's ngram model was 3
        score = 0.4 * 0.4 * matched 3 gram Count / input 2 gram Count
    else if the row's ngram model was 2
        score = 0.4 * 0.4 * 0.4 * matched 2 gram Count / input 1 gram Count
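The four branches of the pseudocode share one pattern: the count ratio is multiplied by 0.4 once for every step the model size falls below 5. A compact way to express this, sketched here in Python with hypothetical names, is to raise the factor to the power (5 - n). The sketch assumes the context count is non-zero, which holds whenever a matching row exists.

```python
ALPHA = 0.4   # backoff factor from the pseudocode
MAX_N = 5     # largest ngram model

def row_score(n, matched_count, context_count, alpha=ALPHA, max_n=MAX_N):
    """Score one matched row from the ngram model of size n (2 <= n <= max_n)."""
    # matched_count: frequency of the matched n-word sequence
    # context_count: frequency of its first n-1 words (the input context)
    return alpha ** (max_n - n) * matched_count / context_count

for n in (5, 4, 3, 2):
    print(n, row_score(n, matched_count=3, context_count=6))
```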
Finally, rows that predict the same word are grouped and their scores are summed.
For example, if the predicted word ‘you’ was found in ngram 4 (and thus also in ngram 3 and 2) with scores of 0.2, 0.1 and 0.05, the total score for ‘you’ is (0.2 + 0.1 + 0.05) = 0.35.
The top 3 words are then selected according to the highest total score.
When no results are found the 3 most common words from the English language (‘the’, ‘be’, ‘to’) are returned as a response.
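The grouping, ranking and fallback behaviour can be sketched as below. This is an illustrative Python version with hypothetical names; the input is assumed to be a list of (predicted word, score) pairs, one per matched row, and ties are broken alphabetically, which is a choice made here rather than something the report specifies.

```python
from collections import defaultdict

FALLBACK = ["the", "be", "to"]  # default words named in the report

def top_predictions(scored_rows, k=3):
    """Sum scores per predicted word and return the k best words."""
    totals = defaultdict(float)
    for word, score in scored_rows:
        totals[word] += score
    if not totals:
        return list(FALLBACK[:k])   # nothing matched in any model
    # Highest total first; alphabetical order breaks ties (assumption).
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]

rows = [("you", 0.2), ("you", 0.1), ("you", 0.05),
        ("me", 0.3), ("us", 0.02), ("them", 0.01)]
print(top_predictions(rows))
print(top_predictions([]))
```

With these example rows, ‘you’ totals 0.35 and outranks ‘me’ at 0.3, even though no single ‘you’ row scores higher than the ‘me’ row.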
The prediction model was evaluated using the Benchmark.R tool (see references for source).
Initial prediction accuracy was quite high, but predictions were also quite slow. Restricting the search to the 1-3 ngram models cut the prediction time in half and only dropped the accuracy by 10%.
To use the application navigate to the following URL
Start typing in text
For access to the code please contact the author using one of the contact links on the site.