This article experiments with a data pipeline and a popular ML classification algorithm (Random Forest) to solve the problem.
1. Suppose that on a shopping site, users sign up with their date of birth, gender, and city, and then perform a series of actions, such as viewing merchandise, adding favourites, and searching, before their first purchase. This gives two groups of features:
* user's demographics: age, gender, city
* user's behaviour: searches, views, favourites
After ETL, a user's record looks like this (gender and city are encoded as integers):
id | age | gender | city | searches | views | favourites |
12345 | 36 | 1 | 16 | 25 | 198 | 3 |
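As a rough illustration of what the ETL step might do, the sketch below turns one signup profile and a list of pre-purchase events into a feature row of the shape shown above. The raw field names, the event type strings, and the integer codes for gender and city are all assumptions made for this example; the original pipeline's encoding is not specified.

```python
from collections import Counter
from datetime import date

# Hypothetical raw inputs; field names and event types are assumptions.
profile = {"id": 12345, "dob": date(1988, 5, 17), "gender": "M", "city": "Springfield"}
events = ["search", "view", "view", "favourite", "search", "view"]

# Assumed integer encodings for the categorical fields.
GENDER_CODES = {"F": 0, "M": 1}
CITY_CODES = {"Springfield": 16}


def age_on(dob, today):
    """Whole years between dob and today."""
    # Subtract one year if this year's birthday has not happened yet.
    return today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))


def to_feature_row(profile, events, today):
    """Combine demographics and per-type event counts into one flat record."""
    counts = Counter(events)  # missing event types count as 0
    return {
        "id": profile["id"],
        "age": age_on(profile["dob"], today),
        "gender": GENDER_CODES[profile["gender"]],
        "city": CITY_CODES[profile["city"]],
        "searches": counts["search"],
        "views": counts["view"],
        "favourites": counts["favourite"],
    }


row = to_feature_row(profile, events, today=date(2024, 6, 1))
```

Only events that happened before the first purchase should be counted, otherwise the features leak information about the label.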
2. Sample users and label each one by whether they have paid or not (binary classification). The sample:
Total | Paid | Unpaid |
20k | 8k | 12k |
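One possible way to draw such a sample is to pick a fixed number of users from each class at random. The sketch below does this on a toy population scaled down 1000x from the table; the `paid` field name and the helper `sample_by_label` are hypothetical, and the original sampling procedure may have differed.

```python
import random


def sample_by_label(users, n_paid, n_unpaid, seed=42):
    """Draw fixed-size random samples of paid and unpaid users."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    paid = [u for u in users if u["paid"] == 1]
    unpaid = [u for u in users if u["paid"] == 0]
    return rng.sample(paid, n_paid) + rng.sample(unpaid, n_unpaid)


# Toy population: 20 paid and 40 unpaid users.
population = [{"id": i, "paid": 1 if i % 3 == 0 else 0} for i in range(60)]

# 8 paid and 12 unpaid, mirroring the 8k / 12k split in the table.
sample = sample_by_label(population, n_paid=8, n_unpaid=12)
paid_share = sum(u["paid"] for u in sample) / len(sample)
```

With the counts from the table, paid users make up 8k / 20k = 40% of the sample.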
3. Load the dataset as a DataFrame in Spark and split it into training and test sets. Train the model on the training set, then evaluate it on the test set. The data pipeline has four stages: load, split, train, evaluate.
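The load, split, train, and evaluate steps could be sketched in PySpark roughly as follows. This is a minimal sketch, not the article's actual code: the label column name `paid`, the 70/30 split, the number of trees, and the helper name `train_and_evaluate` are assumptions.

```python
FEATURE_COLS = ["age", "gender", "city", "searches", "views", "favourites"]
LABEL_COL = "paid"  # hypothetical name for the 0/1 paid flag


def train_and_evaluate(df, train_fraction=0.7, seed=42):
    """Split df, fit a random forest on the training part, return test accuracy."""
    # Imported here so the constants above can be reused without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql.functions import col

    # Spark ML expects a numeric label column; cast the flag to double.
    labelled = df.withColumn("label", col(LABEL_COL).cast("double"))
    train, test = labelled.randomSplit(
        [train_fraction, 1.0 - train_fraction], seed=seed)

    # Pack the six feature columns into the single vector column Spark ML uses.
    assembler = VectorAssembler(inputCols=FEATURE_COLS, outputCol="features")
    forest = RandomForestClassifier(
        labelCol="label", featuresCol="features", numTrees=50, seed=seed)
    model = Pipeline(stages=[assembler, forest]).fit(train)

    # Score the held-out rows and measure the share of correct predictions.
    predictions = model.transform(test)
    evaluator = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="accuracy")
    return evaluator.evaluate(predictions)
```

Given a SparkSession named `spark`, the DataFrame could be loaded with something like `spark.read.csv("users.csv", header=True, inferSchema=True)`, where the file name is a placeholder. Note that in this sketch the integer city codes are treated as a continuous feature; declaring them categorical (for example with `VectorIndexer`) may be a better fit.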
4. On the test set, the model's predictions reached ~90% accuracy.
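To put the accuracy figure in context, it helps to compare it with what a trivial model would score. The sketch below computes accuracy from label/prediction pairs and the majority-class baseline from the sample sizes given earlier (8k paid, 12k unpaid). The toy labels and predictions are made up for illustration and are not the article's results.

```python
def accuracy(labels, predictions):
    """Share of predictions that match the true labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def majority_baseline(n_paid, n_unpaid):
    """Accuracy of always predicting the larger class."""
    return max(n_paid, n_unpaid) / (n_paid + n_unpaid)


# Toy values for illustration only.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_predictions = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

toy_accuracy = accuracy(toy_labels, toy_predictions)
baseline = majority_baseline(n_paid=8_000, n_unpaid=12_000)
```

For the sample in this article the baseline is 12k / 20k = 60%, so ~90% accuracy is a clear improvement over always predicting "unpaid".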
References: interpreting random forests, Spark random forest