Global Earthquake–Tsunami Risk Prediction using Machine Learning
17 Oct 2025 · Reading time ± 3 min
🔍 Overview
End-to-end AI/ML project that predicts tsunami potential from global earthquake data (2001–2022).
Built using Python, Scikit-Learn, and Streamlit, this project explores how machine learning can analyze seismic data and forecast tsunami risk.
> 📊 Dataset: 782 global earthquakes (2001–2022) with 13 features and binary tsunami labels. Source: Global Earthquake-Tsunami Risk Assessment Dataset (Kaggle).
> 💡 Key Goal: Classify earthquake events as “Tsunami” or “Non-Tsunami.”
⚙️ Workflow
1. Data Preprocessing & EDA
- Cleaned 782 records and engineered two features (mag_depth_ratio, depth_category)
- Visualized global distribution, magnitude–depth relationships, and correlations
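The two engineered features can be sketched as follows. The post does not give their exact definitions, so this assumes mag_depth_ratio is magnitude divided by depth in km, and that depth_category follows the common seismological split into shallow (up to 70 km), intermediate (70–300 km), and deep (beyond 300 km) events. The column names `magnitude` and `depth` and the helper `add_engineered_features` are also assumptions for illustration.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with mag_depth_ratio and depth_category added."""
    out = df.copy()
    # Assumed definition: magnitude / depth (km).
    # Depth is clipped to 1 km so very shallow events do not divide by zero.
    out["mag_depth_ratio"] = out["magnitude"] / out["depth"].clip(lower=1.0)
    # Assumed bins: (-inf, 70] shallow, (70, 300] intermediate, (300, inf) deep.
    out["depth_category"] = pd.cut(
        out["depth"],
        bins=[-np.inf, 70, 300, np.inf],
        labels=["shallow", "intermediate", "deep"],
    )
    return out


# Tiny made-up sample, only to show the shape of the output.
sample = pd.DataFrame({"magnitude": [7.0, 6.5, 8.1], "depth": [10.0, 150.0, 600.0]})
features = add_engineered_features(sample)
print(features)
```

Clipping the depth is one possible guard against near-zero depths; the original project may handle that case differently.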
2. Modeling & Evaluation
- Compared Logistic Regression (AUC 0.93) vs Random Forest (AUC 0.965)
- Tuned hyperparameters using GridSearchCV → best AUC 0.966
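A minimal sketch of this comparison and tuning step is shown below. It runs on synthetic data from `make_classification` (782 rows, roughly 39% positives) rather than the real dataset, so the printed AUC values will not match the figures above. The train/test split, the scaler in front of Logistic Regression, and the parameter grid are all assumptions, not the project's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the earthquake table (not the real data).
X, y = make_classification(
    n_samples=782, n_features=12, n_informative=6,
    weights=[0.61, 0.39], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline comparison: Logistic Regression vs Random Forest, scored by ROC-AUC.
models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
auc_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    auc_scores[name] = roc_auc_score(y_test, proba)

# Hypothetical grid; the post does not list the tuned hyperparameters.
param_grid = {"n_estimators": [100, 200], "max_depth": [None, 10]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X_train, y_train)
tuned_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])

print(auc_scores)
print(search.best_params_, round(tuned_auc, 3))
```

Scoring the grid search with `scoring="roc_auc"` keeps the tuning objective aligned with the metric reported for the final model.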
3. Model Variants
- tsunami_rf_model.pkl → includes temporal features (Year, Month)
- tsunami_rf_noyear.pkl → excludes temporal features to avoid temporal bias; intended for predicting future events
Reason:
Year-based features improve accuracy on the 2001–2022 data,
but they tie the model to specific years, which hurts generalization to events outside that period.
A second, “future-ready” model was therefore built without Year and Month.
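The two variants differ only in which columns the model sees. A rough sketch, using a made-up table with an assumed subset of columns (`magnitude`, `depth`, `latitude`, `longitude`, `Year`, `Month`) and a toy label, might look like this. The helper `train_variant` and the Random Forest settings are hypothetical, and `joblib` is used here as one common way to write `.pkl` files; the post does not say which serializer was used.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up data with an assumed column layout (not the real dataset).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "magnitude": rng.uniform(6.5, 9.0, n),
    "depth": rng.uniform(1, 600, n),
    "latitude": rng.uniform(-60, 60, n),
    "longitude": rng.uniform(-180, 180, n),
    "Year": rng.integers(2001, 2023, n),
    "Month": rng.integers(1, 13, n),
})
# Toy label for illustration only.
df["tsunami"] = ((df["magnitude"] > 7.5) & (df["depth"] < 100)).astype(int)

TEMPORAL = ["Year", "Month"]
all_features = [c for c in df.columns if c != "tsunami"]
future_features = [c for c in all_features if c not in TEMPORAL]


def train_variant(columns):
    """Fit a Random Forest on the given feature columns (hypothetical helper)."""
    model = RandomForestClassifier(n_estimators=50, random_state=42)
    model.fit(df[columns], df["tsunami"])
    return model


variants = {
    "tsunami_rf_model.pkl": train_variant(all_features),      # with Year, Month
    "tsunami_rf_noyear.pkl": train_variant(future_features),  # without them
}

# Save both variants; a temporary directory stands in for /models.
with tempfile.TemporaryDirectory() as tmp:
    for filename, model in variants.items():
        joblib.dump(model, str(Path(tmp) / filename))
    saved = sorted(p.name for p in Path(tmp).iterdir())

print(saved)
```

Keeping the feature lists explicit makes it harder to feed the "no year" model a frame that still carries temporal columns at prediction time.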
4. Deployment
- Built an interactive web dashboard using Streamlit
- Mode 1: Historical analysis (maps, trends)
- Mode 2: Real-time prediction (user inputs magnitude, depth, etc.)
🧠 Key Insights
- Shallow, high-magnitude quakes near coastlines have the highest tsunami potential
- Temporal features improved accuracy historically, but reduced generalization
- Removing time-based features created a future-ready model
🖼️ Visual Highlights
| Visualization | Description |
|---|---|
| 🌍 Scatter Geo | Global earthquake–tsunami distribution |
| 🔥 Heatmap | High-risk tsunami zones (Pacific Ring of Fire) |
| 📅 Yearly Trends | Frequency of tsunami events (2001–2022) |
| 📈 Histogram | Magnitude distribution per class |
| 🧩 Correlation | Relationship between physical parameters |
| 🌋 Feature Importance | Top drivers: magnitude, depth, longitude, latitude |
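For the feature importance view, scikit-learn's Random Forest exposes impurity-based scores through `feature_importances_`. The sketch below ranks them on made-up data with a toy label, so the ordering it prints says nothing about the real model; the four column names are taken from the table above, everything else is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up data; the label depends only on magnitude and depth by construction.
rng = np.random.default_rng(1)
n = 300
X = pd.DataFrame({
    "magnitude": rng.uniform(6.5, 9.0, n),
    "depth": rng.uniform(1, 600, n),
    "latitude": rng.uniform(-60, 60, n),
    "longitude": rng.uniform(-180, 180, n),
})
y = ((X["magnitude"] > 7.5) & (X["depth"] < 100)).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances, sorted from most to least important.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

The resulting Series can be passed directly to a bar chart in Matplotlib or Plotly.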
🚀 Tech Stack
- Python, Pandas, NumPy
- Scikit-Learn, Matplotlib, Plotly
- Streamlit (interactive dashboard)
- Google Colab (training environment)
🧩 Results
- Best Model: Random Forest (Tuned)
- ROC–AUC: 0.966
- Reasonably balanced dataset (39% tsunami events)
- Fully deployed on Streamlit Cloud
- Interactive visualization + real-time tsunami prediction
🧱 Project Files
- /data/earthquake_data_tsunami.csv
- /models/tsunami_rf_model.pkl
- /models/tsunami_rf_noyear.pkl
- /app.py
- /requirements.txt
🖼️ Streamlit App Features
📊 1. Historical Analysis Mode
Displays historical data analysis and model behavior:
- Global scatter map (Tsunami vs Non-Tsunami)
- Magnitude vs Depth scatter
- Heatmap of high-risk zones
- Yearly trends and frequency
- Magnitude distribution
- Correlation heatmap
🔮 2. Prediction Mode
Predicts tsunami potential from the parameters of a new earthquake:
- Input: magnitude, depth, latitude, longitude, etc.
- Computes derived features: depth_category, mag_depth_ratio
- Output:
  - Tsunami Probability (%)
  - Prediction Result (Likely / Not Likely)
  - Epicenter Map Visualization
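The prediction flow above can be sketched as a plain function that the dashboard would call. This is not the code from app.py: the helpers `build_features` and `predict_tsunami`, the feature list, the integer encoding of depth_category, the 0.5 threshold, and the toy model trained on made-up data are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Assumed feature order; the real model may use more columns.
FEATURES = ["magnitude", "depth", "latitude", "longitude",
            "mag_depth_ratio", "depth_category"]


def build_features(magnitude, depth, latitude, longitude):
    """Turn raw user inputs into a one-row frame with derived features."""
    if depth <= 70:
        category = 0   # shallow (assumed bin)
    elif depth <= 300:
        category = 1   # intermediate (assumed bin)
    else:
        category = 2   # deep (assumed bin)
    row = {
        "magnitude": magnitude, "depth": depth,
        "latitude": latitude, "longitude": longitude,
        "mag_depth_ratio": magnitude / max(depth, 1.0),
        "depth_category": category,
    }
    return pd.DataFrame([row], columns=FEATURES)


def predict_tsunami(model, magnitude, depth, latitude, longitude, threshold=0.5):
    """Return the tsunami probability and a Likely / Not Likely label."""
    row = build_features(magnitude, depth, latitude, longitude)
    probability = float(model.predict_proba(row)[0, 1])
    return {
        "probability": probability,
        "probability_pct": round(probability * 100, 1),
        "result": "Likely" if probability >= threshold else "Not Likely",
    }


# Toy model on made-up data, standing in for the saved .pkl file.
rng = np.random.default_rng(7)
n = 300
train = pd.DataFrame({
    "magnitude": rng.uniform(6.5, 9.0, n),
    "depth": rng.uniform(1, 600, n),
    "latitude": rng.uniform(-60, 60, n),
    "longitude": rng.uniform(-180, 180, n),
})
train["mag_depth_ratio"] = train["magnitude"] / train["depth"].clip(lower=1.0)
train["depth_category"] = np.digitize(train["depth"], [70, 300], right=True)
toy_labels = ((train["magnitude"] > 7.5) & (train["depth"] < 100)).astype(int)
toy_model = RandomForestClassifier(n_estimators=100, random_state=42)
toy_model.fit(train[FEATURES], toy_labels)

result = predict_tsunami(toy_model, magnitude=8.2, depth=15.0,
                         latitude=38.3, longitude=142.4)
print(result)
```

In a Streamlit app, the inputs would typically come from widgets such as `st.number_input`, the probability could be shown with `st.metric`, and the epicenter could be drawn with `st.map`; how app.py actually wires these together is not stated in the post.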
🚀 Key Results
- Final Model ROC-AUC: 0.966
- Reasonably balanced dataset (39% tsunami events)
- Reliable separation of tsunami and non-tsunami events on the 2001–2022 data
- Fully deployed on Streamlit for interactive visualization
🧭 Tech Stack
| Category | Tools |
|---|---|
| Data Processing | Pandas, NumPy |
| Visualization | Plotly, Seaborn, Matplotlib |
| Modeling | Scikit-Learn (Logistic Regression, Random Forest) |
| Tuning | GridSearchCV |
| Deployment | Streamlit |
| Environment | Google Colab, Python 3.10 |
🧩 Project Structure
project/
│
├── data/
│ └── earthquake_data_tsunami.csv
│
├── notebooks/
│ ├── 01_data_preprocessing.ipynb
│ ├── 02_model_training.ipynb
│ └── 03_model_evaluation.ipynb
│
├── models/
│ ├── tsunami_rf_model.pkl
│ └── tsunami_rf_noyear.pkl
│
├── app.py
├── requirements.txt
└── README.md
🧠 Insights & Learnings
- Temporal features (Year) can bias predictions, so they were removed to improve generalization.
- Shallow, high-magnitude earthquakes near the coast have the highest tsunami potential.
- Combining physical and spatial parameters improves model accuracy.
- Streamlit makes it easier to present machine learning results interactively and intuitively.