Charge Id – Analysing the data
So what are we trying to do here?
Hypothesis
Given a user's bank statement, we will be able to classify (within a statistical confidence level) the transactions in a period into categories and subcategories of spending.
Target Categories
- Banking & Finance
- Entertainment
- Food & Drinks
- Groceries
- Health & Beauty
- Holiday & Travel
- Home
- Household Utilities
- Income
- Insurance
- Kids
- Miscellaneous
- Shopping
- Transferring Money
- Transport
- Work & Study
Sub categories available on this link
So what does a bank statement usually contain?
- Account Summary
- Account Statement
- Account Transactions
Account types come in numerous varieties of credit/debit options.
What data are we trying to predict?
Each transaction will contain at least three fields that can be used to predict its category:
- Transaction date
- Transaction line description
- Amount
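As a rough sketch, those three fields could be modelled as a small record. The `Transaction` name, the sample rows, and the sign convention (negative means money out) are all assumptions made for illustration; real statements often use separate debit and credit columns instead.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal


@dataclass(frozen=True)
class Transaction:
    """One statement line: the three fields available for prediction."""
    date: datetime      # transaction date (and time, if the bank supplies it)
    description: str    # free-text line description
    amount: Decimal     # assumption: negative = money out, positive = money in

    @property
    def direction(self) -> str:
        # Assumed sign convention, not something every bank export follows.
        return "debit" if self.amount < 0 else "credit"


# Invented sample rows, not taken from a real statement.
coffee = Transaction(datetime(2019, 3, 2, 9, 15), "CORNER CAFE 123", Decimal("-4.50"))
salary = Transaction(datetime(2019, 3, 1), "EXAMPLE PTY LTD PAYROLL", Decimal("3200.00"))
print(coffee.direction, salary.direction)
```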
We want to know whether the payment was a debit or a credit, and for what reason, so we can analyse the consumer's financial decisions over time. Not all categories will be easy. Random text entered by the vendor can make a transaction type hard to identify.
But there are other ways to predict what type of transaction it is. For example, suppose an unclassified transaction (one with no recognisable keywords such as café, food or bar) occurred on a Saturday night between 5pm and 10pm, with an amount over 20 and under 500. The algorithm can look at the consumer's past payments in that time period and see that the most common categories at that time are food, restaurants and bars. PS let me know if you spend close to $500 on a bar bill – I think I want to party with you!
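The Saturday-night idea can be sketched as a simple majority vote. This is an illustration only, not the prediction model used in the project: the keyword table, the function names, the `(datetime, category)` history format and the sample data are all hypothetical, and `amount` is assumed to be the positive size of the payment.

```python
from collections import Counter
from datetime import datetime

# Hypothetical keyword table; a real one would be far larger.
KEYWORDS = {"cafe": "Food & Drinks", "bar": "Food & Drinks", "grocer": "Groceries"}
EVENING_HOURS = range(17, 22)  # 5pm up to, but not including, 10pm


def by_keyword(description):
    """Return a category if any whole word of the description is a known keyword."""
    words = set(description.lower().split())
    for keyword, category in KEYWORDS.items():
        if keyword in words:
            return category
    return None


def predict(when, description, amount, history):
    """Try keywords first, then fall back to the most common past category
    for the same weekday and evening window."""
    category = by_keyword(description)
    if category is not None:
        return category
    # The fallback only applies to evening payments over 20 and under 500.
    if when.hour not in EVENING_HOURS or not 20 < amount < 500:
        return None
    votes = Counter(
        past_category
        for past_when, past_category in history
        if past_when.weekday() == when.weekday() and past_when.hour in EVENING_HOURS
    )
    return votes.most_common(1)[0][0] if votes else None


# Invented history of already classified payments.
history = [
    (datetime(2019, 3, 2, 19, 30), "Food & Drinks"),   # Saturday evening
    (datetime(2019, 3, 9, 20, 0), "Food & Drinks"),    # Saturday evening
    (datetime(2019, 3, 16, 18, 15), "Entertainment"),  # Saturday evening
    (datetime(2019, 3, 12, 12, 0), "Groceries"),       # Tuesday lunchtime, ignored
]
guess = predict(datetime(2019, 3, 23, 19, 0), "XJ9 PTY LTD 0042", 85.0, history)
print(guess)  # Food & Drinks
```

A vote like this is only as good as the history behind it, so it makes sense to treat the result as a low-confidence guess rather than a firm classification.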
From this we are looking to sum the transactions for each subcategory into their parent group.
Below is the result of manually classifying 3 months' worth of bank statements into assigned categories.
Sample of Subcategories having > 50 occurrences on a bank statement
Each subcategory rolls up into its parent category, which gives a clearer view of where money is coming from and going to:
Total categories summed by subcategory
Summing the amount for each of these categories will give us the total income/expenditure.
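The roll-up itself is a small piece of arithmetic. Here is a minimal sketch, assuming a hypothetical subcategory-to-parent mapping and invented amounts (negative for spending, positive for income); the subcategory names are placeholders, not the project's real list.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical mapping from subcategory to parent category.
PARENT = {
    "Restaurants": "Food & Drinks",
    "Bars": "Food & Drinks",
    "Salary": "Income",
    "Fuel": "Transport",
}


def roll_up(rows):
    """Sum (subcategory, amount) rows into totals per parent category."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for subcategory, amount in rows:
        # Unknown subcategories fall back to Miscellaneous (an assumption).
        totals[PARENT.get(subcategory, "Miscellaneous")] += amount
    return dict(totals)


# Invented classified transactions.
rows = [
    ("Restaurants", Decimal("-60.00")),
    ("Bars", Decimal("-25.50")),
    ("Salary", Decimal("3200.00")),
    ("Fuel", Decimal("-70.00")),
]
totals = roll_up(rows)
net = sum(totals.values())
print(totals["Food & Drinks"], net)  # -85.50 3044.50
```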
Posts in this series
Charge Id – scratching the tech itch
Charge Id – lean canvas
Charge Id – solution overview
Charge Id – analysing the data
Charge Id – the prediction model
Charge Id – deploying a ML.Net Model to Azure
Code
https://github.com/chrismckelt/vita