Word prediction with Natural Language Processing
Introduction
This report outlines the methodology for building a word prediction application using Natural Language Processing (NLP) techniques as part of the Coursera Data Science Specialization.
The essence of the Capstone project is to create an application that, like SwiftKey's applications, uses NLP techniques and predictive analytics to take in a phrase and return the predicted next word.
The project was developed in partnership with SwiftKey, a company well known for its predictive text analytics. As of March 1, 2016, SwiftKey became part of the Microsoft family of products. SwiftKey applications run on Android and iOS, where they use Natural Language Processing (NLP) techniques to anticipate and offer next-word choices while the user types. Microsoft's Word Flow technology is another example of NLP in action.
Additionally, the work presented in this project follows the tenets of reproducible research and all code is available in an open-source repository to enable readers to review the approach, reproduce the results, and collaborate to enhance the model.
Overview of the application
The application was developed in R using a number of packages and the Shiny web framework.
The methodology used to build, predict with and evaluate the application is outlined below.
- A sample of text was taken from the SwiftKey corpus data (15% of the original)
- The text was cleaned of non-ASCII characters, derogatory language, punctuation, non-words and extra whitespace
- The corpus was processed to produce 5 ngram models, which were combined into a data table together with the total frequency of each ngram (bag of words)
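The application itself is written in R, so the following Python sketch is only an illustration of the cleaning and counting steps listed above, not the project's code. The `BLOCKED_WORDS` set, the `ngram_counts` helper and the exact cleaning rules are assumptions; only the name `clean_text` comes from the report.

```python
import re
from collections import Counter

# Hypothetical placeholder; the report does not say which word list was used.
BLOCKED_WORDS = {"badword"}


def clean_text(text):
    """Return lowercase word tokens with non-ASCII characters, punctuation,
    digits, blocked words and extra whitespace removed."""
    text = text.encode("ascii", errors="ignore").decode("ascii")  # drop non-ASCII
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)  # keep only letters, apostrophes, spaces
    tokens = [t.strip("'") for t in text.split()]  # split() also collapses whitespace
    return [t for t in tokens if t and t not in BLOCKED_WORDS]


def ngram_counts(sentences, max_n=5):
    """Return {n: Counter} mapping each n-word tuple to its total frequency."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = clean_text(sentence)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts


counts = ngram_counts(["The cat sat.", "The cat ran!"], max_n=2)
print(counts[2].most_common(1))
```

Keeping one counter per ngram size mirrors the idea of a single table keyed by ngram model and frequency, although a real implementation would likely store the counts in a more compact structure.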
Prediction
A prediction function then takes a sentence as input and executes the steps below:
- Validates and cleans the input sentence (using the same clean_text function that was used to build the ngram models)
- For each ngram model, takes the last N words of the input text, where N is the size of the ngram model minus 1. For example, the 5-gram model is searched with the last 4 words of the input and the 2-gram model with the last word.
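As an illustration of how the lookup context is chosen for each model, here is a minimal Python sketch (the application itself is written in R). The function name `context_for_model` and the sample sentence are hypothetical.

```python
def context_for_model(tokens, n):
    """Return the last n-1 tokens used to search an n-gram model,
    or None when the input is too short for that model."""
    k = n - 1
    if k == 0:
        return ()  # a 1-gram model needs no context
    if len(tokens) < k:
        return None
    return tuple(tokens[-k:])


tokens = "thanks for the follow".split()
for n in range(5, 1, -1):
    print(n, context_for_model(tokens, n))
```

With a four-word input, the 5-gram model is searched with all four words, the 4-gram model with the last three, and so on down to the 2-gram model, which uses only the last word.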
Description of the algorithm used to make the prediction
With a data table containing the ngram model, sentence, frequency and predicted word, the top 3 most probable words are predicted using a Stupid Backoff smoothing strategy.
A pseudo code description to calculate the score for each word follows:
if the row's ngram model was 5
    score = matched 5 gram Count / input 4 gram Count
else if the row's ngram model was 4
    score = 0.4 * matched 4 gram Count / input 3 gram Count
else if the row's ngram model was 3
    score = 0.4 * 0.4 * matched 3 gram Count / input 2 gram Count
else if the row's ngram model was 2
    score = 0.4 * 0.4 * 0.4 * matched 2 gram Count / input 1 gram Count
Finally, rows with the same predicted word are grouped and their scores summed.
For example, if the predicted word 'you' was found in the 4-gram model (and thus also in the 3-gram and 2-gram models), it receives one score from each model, for instance 0.2, 0.1 and 0.05.
The total score for the predicted word 'you' is then (0.2 + 0.1 + 0.05) = 0.35. The scores are summed in this way for each word, and the 3 words with the highest totals are selected. When no results are found, the 3 most common words in the English language ('the', 'be', 'to') are returned instead.
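The scoring and aggregation described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the application's R code: the `predict` function, its parameters and the `{n: Counter}` layout of `counts` are hypothetical, and the linear scan over every ngram is chosen for clarity rather than speed. Following the report, scores from all matching models are summed per word, whereas textbook Stupid Backoff only falls back to a lower order when the higher order has no match.

```python
from collections import Counter, defaultdict

LAMBDA = 0.4                    # backoff factor from the pseudocode
FALLBACK = ["the", "be", "to"]  # returned when no ngram matches


def predict(tokens, counts, max_n=5, top_k=3):
    """counts maps n to a Counter of n-word tuples.
    Returns up to top_k words ranked by their summed score."""
    scores = defaultdict(float)
    for n in range(max_n, 1, -1):
        k = n - 1
        if len(tokens) < k:
            continue  # input too short for this model
        context = tuple(tokens[-k:])
        context_count = counts.get(k, {}).get(context, 0)
        if context_count == 0:
            continue  # context never seen, nothing to score
        weight = LAMBDA ** (max_n - n)  # 1, 0.4, 0.16, ... as n decreases
        for ngram, c in counts.get(n, {}).items():
            if ngram[:-1] == context:
                scores[ngram[-1]] += weight * c / context_count
    if not scores:
        return list(FALLBACK)
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


counts = {
    1: Counter({("the",): 2, ("cat",): 2}),
    2: Counter({("the", "cat"): 2, ("cat", "sat"): 1, ("cat", "ran"): 1}),
}
print(predict(["the"], counts, max_n=2))
```

Summing across models rewards words that are supported by several ngram orders, which is how 'you' accumulates 0.35 in the example above.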
Evaluation
The prediction model was evaluated using the Benchmark.R tool (see references for source).
Initial prediction accuracy was quite high, but prediction was also quite slow. Using only the 1- to 3-gram models to speed up the search cut the time in half and lowered the accuracy by only 10%.
Instructions
To use the application, navigate to the following URL
Start typing in text