Distributed Machine Learning Recommender

Leveraging Spark and XGBoost to build a distributed recommender system for large-scale restaurant recommendation on Yelp.

Scalable Clustering

Designing a distributed, batch-based pipeline for clustering large-scale data.

Community Detection

Building a Spark-based algorithm for large-scale community detection on social graphs.

Market Basket Analysis

Designing a distributed computing algorithm to identify items frequently bought together in purchase-history data.
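
As a rough illustration of the idea, here is a minimal single-machine sketch that counts item pairs appearing together in at least `min_support` baskets. It is not the distributed algorithm itself, which is not named here; the function name `frequent_pairs`, the `min_support` parameter and the sample baskets are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return {(item_a, item_b): count} for pairs seen in >= min_support baskets."""
    # Pass 1: count how many baskets each item appears in.
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent_items = {i for i, c in item_counts.items() if c >= min_support}

    # Pass 2: count pairs made only of frequent items. A pair cannot be
    # frequent unless both of its items are, so this pruning is safe.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))

    return {pair: c for pair, c in pair_counts.items() if c >= min_support}


# Hypothetical toy data, for illustration only.
baskets = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["milk", "eggs"],
    ["bread", "milk"],
]
pairs = frequent_pairs(baskets, min_support=2)
```

Pruning infrequent items before counting pairs keeps the pair table small, which is the part that tends to matter most once the data no longer fits on one machine.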

Association Rule Mining

Mining association rules between locations, crimes and resolutions in crime data using Frequent-Pattern Growth (FP-Growth).

Data Stream Analytics

Building a scalable algorithm for estimating the number of unique active users in a data stream of fluctuating size.

Big Data Analytics

Using Spark and MapReduce to analyse and extract insights from a large dataset that calls for dedicated big data tools.

Bloom Filter

Tracking never-before-seen datapoints in a continuous data stream.
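
A minimal sketch of how a Bloom filter can flag first sightings in a stream, assuming string-like items and SHA-256 based hashing. The class name, the `num_bits` and `num_hashes` defaults and the helper `first_sightings` are illustrative choices, not details of the project.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; a real filter would pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive num_hashes positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely never added"; True means "probably added".
        return all(self.bits[pos] for pos in self._positions(item))


def first_sightings(stream, bloom):
    """Yield items the filter has not seen before, then record them."""
    for item in stream:
        if not bloom.might_contain(item):
            yield item
        bloom.add(item)


seen = BloomFilter(num_bits=1 << 16)
new_items = list(first_sightings(["a", "b", "a", "c", "b"], seen))
```

The filter never reports an already-added item as new, but it can occasionally miss a genuinely new item when a false positive occurs; growing `num_bits` relative to the number of distinct items lowers that rate.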

Reservoir Sampling

Designing a probability-adjusting data sampler for scalable unbiased sampling of data streams.
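
One common way to do this is the classic reservoir sampling scheme (Algorithm R), sketched below under the assumption of a fixed sample size `k`. The function name and the optional `rng` parameter are illustrative, not taken from the project.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the n-th item with probability k / n, replacing a
            # uniformly chosen slot; the probability shrinks as n grows.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1000), k=10, rng=random.Random(0))
```

Because the acceptance probability is adjusted at every step, each item seen so far has the same chance of being in the reservoir, and memory use stays at `k` items regardless of stream length.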

Netflix Movie Recommender

Exploring the Netflix movie dataset containing 100M movie ratings, then creating a recommender system based on vector similarity using sparse matrices.
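
To give a flavour of similarity over sparse data, here is a minimal sketch that assumes cosine similarity as the measure and uses plain dictionaries as stand-ins for sparse matrix rows. The similarity measure, the function names and the toy ratings are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {index: value}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(ratings, target, top_n=2):
    """Rank the other movies by similarity of their rating vectors to the target's."""
    scores = [
        (movie, cosine_similarity(ratings[target], vector))
        for movie, vector in ratings.items()
        if movie != target
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


# Hypothetical toy data: movie -> {user_id: rating}; only rated entries are stored.
ratings = {
    "movie_a": {1: 5.0, 2: 3.0},
    "movie_b": {1: 5.0, 2: 3.0},
    "movie_c": {3: 4.0},
}
neighbours = most_similar(ratings, "movie_a")
```

Storing only the ratings that exist is what makes this kind of computation feasible when most user and movie combinations have no rating at all.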