Portfolio

Visualizing Car Sales and Dealer Profits Using Snowflake Snowsight

Nov 2024

This project aimed to analyze car sales and dealer profits for SwiftAuto Traders by creating visualizations and presenting them as dashboards. The approach involved using Snowflake’s Snowsight to create and analyze business intelligence (BI) dashboards. Additionally, a Streamlit app was provided to produce the same visualizations using Streamlit-in-Snowflake (SiS).

Objectives:

Analyze historical trends in car sales for SwiftAuto Traders
Provide insights on car sales and profits for each dealer

Impact:

Identify key performance indicators (KPIs) such as profit and quantity sold
Inform business decisions based on data-driven insights

Skills: Python (Programming Language) · Data Engineering · Snowflake · Data Visualization · Security · Streamlit · Problem Solving · Data Warehousing · Business Intelligence

Predicting Food Truck Locations Using Snowpark ML and XGBoost

Sep 2024

This project aimed to predict the locations of the Freezing Point food truck by analyzing historical location data and creating a predictive model. The approach involved using Snowflake’s Snowpark ML and XGBoost to develop and evaluate the model. Additionally, the project included creating a complete end-to-end workflow for data processing and model training.

Objectives:

Create and upload a dataset to Snowflake
Clean and transform the data
Train an XGBoost model on the prepared data
Evaluate the model’s performance
Register the trained model in the Snowflake Model Registry

Impact:

Develop a predictive model to forecast the food truck’s location
Provide insights on the truck’s location patterns

Skills: Python (Programming Language) · Machine Learning · Snowflake · Data Engineering · Data Visualization · XGBoost · Problem Solving · Data Warehousing · Business Intelligence

Diabetes Prediction Using PySpark MLlib

Aug 2024

This project sought to build a logistic regression classifier using the PySpark machine learning library (MLlib) and Python to classify patients as diabetic or non-diabetic. My approach was to build a machine learning model that accurately predicts whether a patient has diabetes.

Objectives:

Set up PySpark on the Google Colab environment
Work with Spark DataFrames
Clean and prepare data for analysis
Build a logistic regression model with the PySpark machine learning library to classify patients as diabetic or non-diabetic

Impact:

Predict whether patients in a new set of data are diabetic

Skills: Python (Programming Language) · Data Engineering · Apache Spark ML · PySpark · Machine Learning · Data Science · Problem Solving · Apache Spark

Waste Management Data Warehouse using PostgreSQL and Cognos Analytics

Jun 2024

This project involved identifying patterns in the volume and location of waste collection across Brazil. My approach was to design a data warehouse and then build a visual representation of the waste collected by truck type, city, station ID and month.

Objectives:

Design a Data Warehouse
Load data into Data Warehouse
Write aggregation queries
Create materialized query tables (MQTs)
Create a Dashboard
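
The aggregation step could look something like the sketch below. It is an illustration only: SQLite (through Python's built-in sqlite3 module) stands in for PostgreSQL so the example is self-contained, and the table name fact_trips, its columns, and the sample rows are hypothetical rather than the project's actual schema or data.

```python
import sqlite3

# Hypothetical sample rows: (truck_type, city, station_id, month, waste_tonnes)
SAMPLE_ROWS = [
    ("truck", "Sao Paulo", 1, 1, 10.0),
    ("truck", "Sao Paulo", 1, 2, 12.5),
    ("van", "Sao Paulo", 2, 1, 4.0),
    ("truck", "Rio de Janeiro", 3, 1, 8.0),
]


def total_waste_by_city_and_truck_type(rows):
    """Load rows into an in-memory table and total the waste per city and truck type."""
    conn = sqlite3.connect(":memory:")
    try:
        # Assumed fact table layout, for illustration only.
        conn.execute(
            "CREATE TABLE fact_trips ("
            "truck_type TEXT, city TEXT, station_id INTEGER, "
            "month INTEGER, waste_tonnes REAL)"
        )
        conn.executemany("INSERT INTO fact_trips VALUES (?, ?, ?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT city, truck_type, SUM(waste_tonnes) AS total_waste "
            "FROM fact_trips "
            "GROUP BY city, truck_type "
            "ORDER BY city, truck_type"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for city, truck_type, total in total_waste_by_city_and_truck_type(SAMPLE_ROWS):
        print(city, truck_type, total)
```

In PostgreSQL, a query of this shape could be stored with CREATE MATERIALIZED VIEW so that a dashboard reads precomputed totals instead of re-running the aggregation.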

Impact:

Inform decisions on resource allocation and deployment
Forecast needs for future resources

Skills: Data Modeling · Data Engineering · Data Visualization · Problem Solving · PostgreSQL · Data Warehousing · IBM Cognos Analytics

Traffic Flow Optimization with Airflow and Kafka

Apr 2024

This project sought to improve traffic flow on national highways by analyzing road traffic data from various toll plazas. My approach was to consolidate the disparate data from different toll operators and IT systems into a single file and then create a data pipeline to continue collecting the streaming data into a database for future analysis. During the process, I encountered carriage return characters (^M) and provided two potential solutions.
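
As an illustration of the carriage-return issue, the sketch below shows one possible way to strip ^M characters in Python. It is not necessarily either of the two solutions referred to above, and the function name and sample rows are hypothetical.

```python
def strip_carriage_returns(lines):
    """Remove carriage returns (displayed as ^M in some editors) and trailing newlines."""
    return [line.replace("\r", "").rstrip("\n") for line in lines]


if __name__ == "__main__":
    # Hypothetical rows as they might appear in a file with Windows line endings.
    sample = ["1,car,PC7TV\r\n", "2,truck,4XF9Q\r\n"]
    print(strip_carriage_returns(sample))
```

Cleaning the characters at read time keeps the last field of each row free of a trailing "\r" before it reaches the database.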

Objectives:

Create an ETL pipeline using an Airflow DAG
Build a streaming ETL pipeline using Kafka

Impact:

Identify source(s) of traffic congestion
Inform possible solutions

Skills: Extract, Transform, Load (ETL) · Python (Programming Language) · Apache Airflow · Data Engineering · Bash · MySQL · Problem Solving · Apache Kafka · Shell Script

Acquiring and Processing Information on the World’s Largest Banks

Mar 2024

This project required the creation of a database wherein managers from London, Berlin and New Delhi could query the top 10 largest banks by market capitalization in their local currency. My approach was to compile the list of the top 10 largest banks ranked by market capitalization in billions of USD and then transform and store it in USD, GBP, EUR and INR based on the provided exchange rates.

Objectives:

Extract real-world data from a public website using web scraping and the Requests library in Python
Transform the data as per the problem statement
Load the data in the required file format as well as a SQLite database
Query the database to retrieve filtered information from the table
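
The transform, load and query steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the bank names, market capitalizations, exchange rates, table name and column names are all hypothetical placeholders, not the project's actual data or schema.

```python
import sqlite3

# Hypothetical exchange rates per 1 USD (placeholders, not the provided rates).
EXCHANGE_RATES = {"GBP": 0.8, "EUR": 0.9, "INR": 80.0}

# Columns that callers are allowed to query, keyed by currency code.
COLUMNS = {
    "USD": "mc_usd_billion",
    "GBP": "mc_gbp_billion",
    "EUR": "mc_eur_billion",
    "INR": "mc_inr_billion",
}


def transform(banks, rates):
    """Add one converted market-cap column per currency, rounded to 2 decimals."""
    rows = []
    for name, mc_usd in banks:
        row = {"name": name, "mc_usd_billion": mc_usd}
        for currency, rate in rates.items():
            row[COLUMNS[currency]] = round(mc_usd * rate, 2)
        rows.append(row)
    return rows


def load(conn, rows):
    """Create the (assumed) table and insert the transformed rows."""
    conn.execute(
        "CREATE TABLE largest_banks ("
        "name TEXT, mc_usd_billion REAL, mc_gbp_billion REAL, "
        "mc_eur_billion REAL, mc_inr_billion REAL)"
    )
    conn.executemany(
        "INSERT INTO largest_banks VALUES ("
        ":name, :mc_usd_billion, :mc_gbp_billion, "
        ":mc_eur_billion, :mc_inr_billion)",
        rows,
    )
    conn.commit()


def query_market_cap(conn, currency):
    """Return (name, market cap) pairs in the requested currency, largest first."""
    column = COLUMNS[currency.upper()]  # raises KeyError for unknown currencies
    sql = f"SELECT name, {column} FROM largest_banks ORDER BY mc_usd_billion DESC"
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    # Hypothetical banks and market caps in billions of USD.
    load(connection, transform([("Bank A", 100.0), ("Bank B", 50.0)], EXCHANGE_RATES))
    print(query_market_cap(connection, "GBP"))
    connection.close()
```

Looking the column name up in a fixed dictionary, rather than building it from user input, keeps the dynamic part of the query limited to known columns.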

Impact:

Identify the market capitalization of the top 10 banks in various currencies

Skills: Extract, Transform, Load (ETL) · Python (Programming Language) · Beautiful Soup · Data Engineering · Pandas · Web Scraping · Problem Solving · SQLite

Build a Machine Learning Pipeline for Airfoil Noise Prediction

Feb 2024

This project aimed to identify the optimal angle of attack and flow direction for airfoil noise reduction. My approach was to build extract, transform, load (ETL) and ML pipelines on data from a series of aerodynamic and acoustic tests of airfoil blade sections conducted in an anechoic wind tunnel.

Objectives:

Clean the dataset
Create a Machine Learning pipeline
Evaluate the model’s performance
Persist the model for future use

Impact:

Inform future airfoil development for noise reduction

Skills: Extract, Transform, Load (ETL) · Python (Programming Language) · Data Engineering · Apache Spark ML · PySpark · Problem Solving · Apache Spark

Data Analysis using Spark

Jan 2024

This project required the creation of a robust data pipeline capable of ingesting employee data in CSV format. To do this, I analyzed the data, implemented the necessary transformations, and enabled the extraction of valuable insights from the processed data.

Objectives:

Create a DataFrame from a CSV file
Define a schema for the data
Perform transformations and actions using Spark SQL

Impact:

Help HR analyze employees by salary, age, department, name and employee number

Skills: Data Engineering · Problem Solving

Working with NoSQL Databases

Dec 2023

This project tasked me with providing analysts with usable data. My approach was to move data from external sources into various databases, transfer data between different types of databases, and execute basic queries across various databases.

Objectives:

Replicate a Cloudant database
Create indexes on a Cloudant database
Query data in a Cloudant database
Import data into a MongoDB database
Query data in a MongoDB database
Export data from MongoDB
Import data into a Cassandra database
Query data in a Cassandra database

Impact:

Ensure analysts can focus on analyzing data without worrying about the technical intricacies of data movement and management

Skills: MongoDB · Data Engineering · IBM Cloudant · Problem Solving · Cassandra