Project: Streamlit App for Rainfall prediction

Tung Lam

April 6, 2025

🌧️ Rainfall Prediction App – Will it rain tomorrow in Australia?

In this project, I built a Streamlit-based interactive web application that predicts whether it will rain tomorrow in Australia. The model behind this app is trained on 10 years of historical weather observation data from the Australian Bureau of Meteorology (BOM) using a Random Forest Classifier.

🌟 Overview

This app helps users decide whether to carry an umbrella by predicting the likelihood of rainfall for the next day. It leverages 10 years of daily weather data across Australian cities, preprocesses and visualizes the data, and uses a Random Forest classification model for prediction.

View the source code on GitHub

✅ Key Features

🔍 Interactive data exploration and preprocessing visualizations
🧠 Machine learning model trained with resampling and hyperparameter tuning
⚖️ Handles imbalanced data through upsampling
🔢 One-hot encoding and outlier capping for clean, robust feature sets
📈 Model evaluation with ROC curve, confusion matrix, and metrics
📊 Streamlit sidebar for live user input and instant predictions

🛠️ Technical Details & Technologies

Language: Python 3.8+
Framework: Streamlit
Machine Learning: scikit-learn
Data Handling: Pandas, NumPy
Visualization: Matplotlib, Seaborn, Plotly
Model: Random Forest Classifier with RandomizedSearchCV for tuning
Other Tools: OneHotEncoder, StandardScaler

🚀 Getting Started

Prerequisites

Python 3.8+
pip or conda
Streamlit installed (pip install streamlit)

Installation

git clone https://github.com/your-username/rainfall-prediction-app.git
cd rainfall-prediction-app
pip install -r requirements.txt

Run the app

streamlit run app.py

📊 Dataset

Source: BOM weatherAUS dataset
Size: ~145,000 rows × 23 columns
Period: 2008–2017
Target Variable: RainTomorrow (Yes / No)
Features: Temperature, humidity, pressure, wind speed, and others
Missing Values: Handled via location-wise imputation (mean/mode)
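
To make the location-wise imputation concrete, here is a minimal sketch of one way it could be done with pandas. The helper name impute_by_location, the toy rows, and the column names Location, Humidity3pm, and WindGustDir are assumptions for illustration, as is the fallback to the overall mean/mode when a location has no observed values at all; the app's actual code may differ.

```python
import numpy as np
import pandas as pd


def impute_by_location(df, group_col="Location"):
    """Fill gaps per location: mean for numeric columns, mode for the others."""
    out = df.copy()
    for col in out.columns:
        if col == group_col:
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            per_group = out.groupby(group_col)[col].mean()
            fallback = out[col].mean()
        else:
            per_group = out.groupby(group_col)[col].agg(
                lambda s: s.mode().iloc[0] if not s.mode().empty else np.nan
            )
            overall = out[col].mode()
            fallback = overall.iloc[0] if not overall.empty else np.nan
        # Look up each row's location statistic and use it where the value is missing
        out[col] = out[col].fillna(out[group_col].map(per_group))
        # Assumed fallback: overall statistic when a location has no data for this column
        if pd.notna(fallback):
            out[col] = out[col].fillna(fallback)
    return out


# Hypothetical toy rows, not taken from the real dataset
sample = pd.DataFrame({
    "Location": ["Sydney", "Sydney", "Sydney", "Perth", "Perth", "Perth"],
    "Humidity3pm": [60.0, np.nan, 80.0, 30.0, 40.0, np.nan],
    "WindGustDir": ["W", "W", None, "E", None, "E"],
})
filled = impute_by_location(sample)
print(filled)
```

Grouping by location keeps a missing humidity reading in Perth from being filled with a value dominated by wetter cities, which is the main reason to prefer it over a single global mean.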

🤖 Model Development

🔍 Preprocessing

Dropped irrelevant columns like Date
Imputed missing values by location using mean/mode
One-hot encoding for categorical variables
Capped outliers using bounds based on the 25th and 75th percentiles
Upsampled minority class (RainTomorrow=Yes)
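
The last two steps can be sketched as follows. This is an illustration under stated assumptions, not the app's exact code: the capping uses the common IQR fence (Q1 - 1.5 × IQR, Q3 + 1.5 × IQR), which is my guess at how the percentile bounds are derived, and the helper names, the factor k, and the toy data are hypothetical. The upsampling uses sklearn.utils.resample to draw minority rows with replacement.

```python
import pandas as pd
from sklearn.utils import resample


def cap_outliers_iqr(series, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] (assumed fence rule)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


def upsample_minority(df, target="RainTomorrow", minority_label="Yes", random_state=42):
    """Duplicate minority rows at random until both classes have the same size."""
    minority = df[df[target] == minority_label]
    majority = df[df[target] != minority_label]
    minority_up = resample(
        minority, replace=True, n_samples=len(majority), random_state=random_state
    )
    # Shuffle so the duplicated rows are not grouped at the end
    balanced = pd.concat([majority, minority_up])
    return balanced.sample(frac=1, random_state=random_state).reset_index(drop=True)


# Hypothetical toy data
rainfall = pd.Series([0.0, 0.0, 1.0, 2.0, 3.0, 100.0])
capped = cap_outliers_iqr(rainfall)

toy = pd.DataFrame({
    "Humidity3pm": range(10),
    "RainTomorrow": ["No"] * 8 + ["Yes"] * 2,
})
balanced = upsample_minority(toy)
print(capped.max(), balanced["RainTomorrow"].value_counts().to_dict())
```

One design point worth keeping in mind: upsampling is usually applied to the training split only, because duplicated rows that land in both the training and test sets inflate the reported scores.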

🧠 Model

Algorithm: Random Forest Classifier
Training: Train/Test split + 8-fold Stratified Cross-Validation
Hyperparameter Tuning: RandomizedSearchCV
Scaling: StandardScaler applied to numerical features
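
A rough sketch of how these pieces fit together in scikit-learn is shown below. It runs on synthetic data from make_classification rather than the weather dataset, and the search space, n_iter, the F1 scoring choice, and the random seeds are assumptions chosen to keep the example fast; the values used in the app are not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed weather features
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so it is re-fitted on each CV training fold
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(random_state=0)),
])

# Hypothetical search space, kept small for speed
param_distributions = {
    "rf__n_estimators": [20, 50],
    "rf__max_depth": [None, 5, 10],
    "rf__min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=3,
    cv=StratifiedKFold(n_splits=8, shuffle=True, random_state=0),
    scoring="f1",
    random_state=0,
)
search.fit(X_train, y_train)

# With scoring="f1", score() reports F1 on the held-out test split
test_f1 = search.score(X_test, y_test)
print(search.best_params_, round(test_f1, 3))
```

Putting the scaler in the pipeline means the same fitted transformation is reused at prediction time, and the cross-validation folds never see statistics computed from their own validation rows.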

📏 Evaluation Metrics

Accuracy: ~0.9416
F1 Score: ~0.9415
ROC AUC: ~0.9415
Includes ROC Curve and Confusion Matrix
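
For reference, these metrics can be computed with sklearn.metrics along the following lines. The helper name evaluate, the 0.5 threshold, and the toy labels and probabilities are assumptions for illustration; they are not the app's real predictions.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    roc_auc_score,
    roc_curve,
)


def evaluate(y_true, y_prob, threshold=0.5):
    """Collect the metrics listed above from true labels and predicted probabilities."""
    # Hard labels come from thresholding; ROC metrics use the raw probabilities
    y_pred = [int(p >= threshold) for p in y_prob]
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_prob),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
        "roc_points": list(zip(fpr, tpr)),
    }


# Hypothetical labels (1 = rain) and predicted rain probabilities
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.65, 0.9, 0.7]
report = evaluate(y_true, y_prob)
print(report["accuracy"], report["f1"], report["roc_auc"])
print(report["confusion_matrix"])
```

The roc_points pairs are what a plotting library would draw as the ROC curve, with the false positive rate on the x-axis and the true positive rate on the y-axis.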

📸 Examples & Visualizations

Correlation heatmaps

Feature distributions and data range summaries

Location-wise trends in temperature, humidity, and pressure

ROC curve and confusion matrix for model performance

Sample prediction output:

“It will not rain tomorrow 🌞, with a probability of 82%”
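
A message like this could be produced from the model's predicted rain probability with a small helper. The function name, the 0.5 threshold, and the wording of the "rain" branch are assumptions; only the "not rain" wording comes from the sample output above.

```python
def format_prediction(prob_rain, threshold=0.5):
    """Turn the predicted probability of rain into a user-facing sentence."""
    if prob_rain >= threshold:
        # Wording of this branch is assumed by analogy with the sample output
        return f"It will rain tomorrow 🌧️, with a probability of {prob_rain * 100:.0f}%"
    # Report the probability of the predicted class (no rain), not of rain
    return f"It will not rain tomorrow 🌞, with a probability of {(1 - prob_rain) * 100:.0f}%"


message = format_prediction(0.18)
print(message)
```

In a Streamlit app the returned string would typically be passed to a display call such as st.write or st.success.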

🧠 Challenges & Learnings

Handling missing data by group-level imputation improved model generalization
Dealing with class imbalance kept the model from simply favoring the majority "no rain" class
Encoding and scaling had to be aligned precisely between training and inference pipelines
Learned how to design user-friendly ML dashboards using Streamlit
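
The alignment point is easiest to see in code. One way to keep encoding and scaling identical between training and inference is to fit a single ColumnTransformer on the training data and reuse it for user input. The sketch below is an assumption about how this could be done, with hypothetical column names and toy values; handle_unknown="ignore" makes a category unseen in training encode as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training rows
train = pd.DataFrame({
    "Humidity3pm": [60.0, 80.0, 30.0, 40.0],
    "WindGustDir": ["W", "N", "E", "W"],
})

preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), ["Humidity3pm"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["WindGustDir"]),
    ],
    sparse_threshold=0,  # always return a dense array
)
X_train = preprocessor.fit_transform(train)

# A single row as it might arrive from the sidebar; "S" never appeared in training
user_input = pd.DataFrame({"Humidity3pm": [55.0], "WindGustDir": ["S"]})
X_new = preprocessor.transform(user_input)  # transform only, never re-fit

print(X_train.shape, X_new.shape)
```

Because the transformer is fitted once, the inference row always has the same columns in the same order as the training matrix, and the scaler uses the training mean and standard deviation.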

🚧 Future Improvements

Deploy the app publicly (Streamlit Cloud, Hugging Face Spaces, or Heroku)
Improve mobile responsiveness and UI polish
Add SHAP/LIME model explainability tools
Compare with other models (e.g., XGBoost, LightGBM, neural networks)
Serialize model pipeline using joblib for efficient reloading
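
As a starting point for the joblib item, here is a minimal sketch of saving and reloading a fitted pipeline. The synthetic data, the file name rain_pipeline.joblib, and the temporary directory are placeholders for illustration.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the trained model
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
]).fit(X, y)

# Persist preprocessing and model together in one file (placeholder path)
model_path = os.path.join(tempfile.mkdtemp(), "rain_pipeline.joblib")
joblib.dump(pipeline, model_path)

# Later, e.g. at app start-up, reload instead of retraining
restored = joblib.load(model_path)
same_predictions = bool((restored.predict(X) == pipeline.predict(X)).all())
print(same_predictions)
```

Saving the whole pipeline rather than the bare classifier keeps the scaler and the model in sync, and the file should be loaded with the same scikit-learn version that wrote it.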