Capstone Master Plan: Part VIII + Integrative Coding Project
This plan concludes the zero-to-PhD statistics curriculum (Parts 0–VII, 62 chapters). It has two deliverables:
- Part VIII (4 chapters): the planned capstone chapters, in the same 6-writer + 3-reviewer curriculum style (.md chapter + runnable .py + figures + solutions), in part-8-capstone/.
- Integrative Coding Project: for one flagship concept from each chapter (Parts 0–VII), (a) implement from scratch, (b) library version, (c) empirical demonstration/"proof", (d) visualization, all on real open data. It lives in capstone-project/ in both forms: notebook (.ipynb) and curriculum doc (.md, with solutions).
1. Real datasets (offline-bundled, capstone-project/data/)
| dataset | shape | type | main use |
|---|---|---|---|
| iris.csv | 150×5 | multivariate tabular | EDA, correlation, t-test/ANOVA, classification, PCA/clustering |
| wine.csv | 178×14 | 13 chem features + class | classification, PCA, regularization |
| breast_cancer.csv | 569×31 | 30 features + malignant/benign | logistic/SVM/tree, learning theory, dim-reduction |
| digits.csv | 1797×65 | 8×8 image + label | manifold/dim-reduction (Isomap/t-SNE/PCA), classification |
| diabetes.csv | 442×11 | 10 features + continuous target | linear/multiple regression, ridge/lasso, cross-validation |
| co2.csv | 2284×1 | Mauna Loa weekly time series | random processes, trend, martingale, convergence |
| sunspots.csv | 309×2 | annual sunspot counts | time series, Poisson/count, processes |
All are bundled reproducibly from sklearn.datasets / statsmodels (no network needed). The fixed seed 20260619 is used everywhere.
2. Deliverable A: Part VIII chapters (curriculum style, 6W+3R)
part-8-capstone/: 4 .md chapters following the 8-pillar template (introduction → concept → example → proof → code → figure → exercises → summary), Bangla prose + English terms, 4 figures per chapter, solutions.
- 8.1: End-to-End Data Analysis Project (08-01-end-to-end-project.md): a complete pipeline on one real dataset (breast_cancer or diabetes): question formulation → EDA → assumption checks → model (regression/classification) → inference/uncertainty → validation (CV) → interpretation and report. All the steps of Parts I–VI in one place.
- 8.2: Simulation Study (08-02-simulation-study.md): reproduce and verify a known theoretical result by simulation (CLT convergence rate, bootstrap coverage, bias–variance tradeoff, MLE consistency/asymptotic normality), including Monte Carlo experiment design.
- 8.3: Reproducing a Paper / Result (08-03-paper-reproduction.md): reproduce a classical statistics/ML result from scratch (e.g. James–Stein shrinkage, ridge vs OLS bias–variance, or the example from Efron's 1979 bootstrap paper) and verify it on real data.
- 8.4: Where Next: Research Readiness (08-04-where-next.md): research areas (causal inference, Bayesian nonparametrics, deep learning theory, stochastic processes), which books/papers/journals to read, reproducibility and open-science habits, and a roadmap for the path ahead.
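As one illustration of the simulation-based reproductions planned for 8.2 and 8.3, here is a minimal Monte Carlo sketch of James–Stein shrinkage. It assumes a p-dimensional Normal mean with identity covariance; the dimension, the true mean vector, and the replication count are illustrative choices, not values taken from this plan.

```python
import numpy as np

SEED = 20260619  # the plan's fixed seed
rng = np.random.default_rng(SEED)

p = 10                    # dimension (illustrative; James-Stein needs p >= 3)
n_reps = 5000             # Monte Carlo replications (illustrative)
theta = np.full(p, 0.5)   # hypothetical true mean vector

# Each row is one draw X ~ N(theta, I_p); the MLE of theta is X itself.
X = rng.normal(loc=theta, scale=1.0, size=(n_reps, p))

# James-Stein estimator: shrink each draw toward the origin.
sq_norm = np.sum(X**2, axis=1, keepdims=True)
js = (1.0 - (p - 2) / sq_norm) * X

# Monte Carlo estimates of the risk E||estimate - theta||^2.
mle_risk = np.mean(np.sum((X - theta) ** 2, axis=1))
js_risk = np.mean(np.sum((js - theta) ** 2, axis=1))

print(f"MLE risk ~ {mle_risk:.2f} (theory: {p})")
print(f"James-Stein risk ~ {js_risk:.2f}")
```

A real-data version would replace the synthetic draws with standardized group means from one of the bundled datasets; that step is left out here because the plan does not fix which dataset 8.3 uses.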
3. Deliverable B: Integrative Project (flagship demo per chapter)
capstone-project/ layout:
capstone-project/
    CAPSTONE_PLAN.md    # this file
    README.md           # overview + how to run
    data/               # 7 real datasets (bundled)
    src/common.py       # data loaders, figstyle, from-scratch helpers
    notebooks/          # 8 .ipynb, one per Part (00..07)
    docs/               # 8 .md, curriculum-style writeup per Part
    solutions/          # 8 .md, full worked solutions per Part
    figures/            # generated PNGs (prefix P-C-*)
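The layout above gives src/common.py the job of holding data loaders and helpers. A minimal sketch of what such a module might contain follows. The names `load_csv`, `make_rng`, `SEED`, and `DATA_DIR` are hypothetical (the plan does not specify them), and the loader assumes every column, including class labels, is numeric.

```python
import csv
from pathlib import Path

import numpy as np

SEED = 20260619                            # fixed seed from the plan
DATA_DIR = Path("capstone-project/data")   # bundled datasets (path from the layout)


def load_csv(name, data_dir=DATA_DIR):
    """Read a numeric CSV with a header row; return (column_names, 2-D float array)."""
    with open(Path(data_dir) / name, newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader)
        # Skip blank lines; assumes every remaining cell parses as a float.
        rows = [[float(v) for v in row] for row in reader if row]
    return header, np.asarray(rows, dtype=float)


def make_rng(seed=SEED):
    """Return a seeded generator so every demo draws the same random numbers."""
    return np.random.default_rng(seed)
```

A notebook would then call, for example, `names, X = load_csv("iris.csv")` and `rng = make_rng()` once at the top.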
Flagship-concept-per-chapter map (62 chapters)
Part 0: Foundations (dataset: diabetes/iris)
- 0.1 sets/logic → set operations and truth tables on feature subsets; verify De Morgan's laws in code
- 0.2 combinatorics → C(n,k)/permutations from scratch; subset counting on iris; Pascal triangle viz
- 0.3 calculus I (derivative/optimization) → gradient descent from scratch to minimize a real loss (diabetes OLS); convergence plot
- 0.4 calculus II (integration) → numeric integration (trapezoid/Simpson) from scratch; ∫ density; Riemann-sum viz
- 0.5 linear algebra → matrix ops, eigen-decomposition from scratch (power iteration) on iris covariance; vs numpy
- 0.6 Python/numpy → vectorization vs loop benchmark; broadcasting on a real feature matrix
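To make the 0.3 flagship concrete, here is a minimal sketch of gradient descent on an OLS loss, checked against numpy's least-squares solver. The data is a synthetic stand-in for a standardized feature matrix (the planned demo uses diabetes); the learning rate, step count, and coefficients are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(20260619)  # the plan's fixed seed

# Synthetic stand-in for standardized features and a continuous target.
n, p = 200, 3
X = rng.normal(size=(n, p))
true_w = np.array([1.5, -2.0, 0.5])    # hypothetical coefficients
y = X @ true_w + 0.1 * rng.normal(size=n)


def ols_gradient_descent(X, y, lr=0.1, n_steps=500):
    """Minimize (1/2n)||Xw - y||^2 with fixed-step gradient descent."""
    w = np.zeros(X.shape[1])
    losses = []
    for _ in range(n_steps):
        resid = X @ w - y
        losses.append(float(resid @ resid) / (2 * len(y)))
        grad = X.T @ resid / len(y)
        w = w - lr * grad
    return w, losses


w_gd, losses = ols_gradient_descent(X, y)
w_ls = np.linalg.lstsq(X, y, rcond=None)[0]   # library reference solution

print("gradient descent:", np.round(w_gd, 4))
print("lstsq           :", np.round(w_ls, 4))
```

Plotting `losses` against the step index gives the convergence plot the map calls for.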
Part I: Descriptive & EDA (iris, wine)
- 1.1 data types/pop-sample → sampling distribution of a sample mean from a real "population" (iris)
- 1.2 location/variability → mean/median/var/IQR/MAD scratch vs numpy; robustness to outliers viz
- 1.3 distributions/viz → histogram/KDE/ECDF scratch; fit + Q–Q plot on a real feature
- 1.4 correlation/bivariate → Pearson/Spearman scratch; correlation matrix heatmap (iris); Anscombe caution
- 1.5 EDA workflow → full EDA case study on wine (missing, outliers, transforms, multivariate viz)
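The 1.4 flagship (Pearson/Spearman from scratch) could look roughly like the sketch below. It uses a synthetic, tie-free stand-in rather than iris, and the rank helper assumes no ties; real measurements with repeated values would need average ranks instead.

```python
import numpy as np

rng = np.random.default_rng(20260619)  # the plan's fixed seed
x = rng.normal(size=150)               # synthetic stand-in for one feature
y = np.exp(x)                          # monotone but nonlinear in x


def pearson(a, b):
    """Pearson correlation from centered dot products."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return float(a_c @ b_c / np.sqrt((a_c @ a_c) * (b_c @ b_c)))


def ranks(a):
    """Ranks 1..n; assumes no ties (true for this continuous synthetic data)."""
    order = np.argsort(a)
    r = np.empty(len(a))
    r[order] = np.arange(1, len(a) + 1)
    return r


def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


r_p = pearson(x, y)
r_s = spearman(x, y)
print(f"Pearson  = {r_p:.3f}")
print(f"Spearman = {r_s:.3f}")
```

Because `y` is a monotone transform of `x`, the rank correlation is exactly 1 while the linear correlation is not, which is the kind of contrast the Anscombe caution is meant to highlight.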
Part II: Probability (iris, breast_cancer)
- 2.1 sample spaces/axioms → empirical vs axiomatic probability; simulate & verify axioms
- 2.2 conditional/Bayes → Bayes' theorem on a real classification (naive Bayes by hand on breast_cancer); posterior viz
- 2.3 discrete RVs → Binomial/Poisson pmf scratch; fit Poisson to sunspots counts
- 2.4 continuous dists → Normal/Exp pdf/cdf scratch; MLE-fit Normal to an iris feature; overlay
- 2.5 expectation/moments/MGF → E/Var/skew/kurtosis scratch; MGF numerically; verify on real feature
- 2.6 joint/covariance → covariance matrix scratch; bivariate normal fit to two iris features; contour
- 2.7 transformations/order stats → change-of-variable; order statistics dist; min/max of samples viz
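For the 2.3 flagship, a from-scratch Poisson pmf and its MLE fit might be sketched as follows. The counts are a synthetic stand-in (the planned demo uses sunspots), and the dispersion ratio is included as an assumption-level check on whether a Poisson fit is reasonable at all.

```python
import math

import numpy as np

rng = np.random.default_rng(20260619)     # the plan's fixed seed
counts = rng.poisson(lam=4.0, size=300)   # synthetic stand-in for count data


def poisson_pmf(k, lam):
    """P(K = k) for K ~ Poisson(lam), computed in log space for stability."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


lam_hat = counts.mean()                   # the Poisson MLE is the sample mean
dispersion = counts.var(ddof=1) / lam_hat  # ~1 under Poisson

ks = np.arange(0, counts.max() + 1)
fitted = np.array([poisson_pmf(int(k), lam_hat) for k in ks])
observed = np.bincount(counts, minlength=len(ks)) / len(counts)

print(f"lambda_hat = {lam_hat:.3f}, dispersion ratio = {dispersion:.3f}")
```

Overlaying `fitted` on `observed` as bars gives the comparison plot; a dispersion ratio far from 1 on real counts would suggest the Poisson model is a poor fit.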
Part III: Convergence & Processes (co2, sunspots, iris)
- 3.1 inequalities → Markov/Chebyshev/Hoeffding bounds vs empirical tail on a real feature
- 3.2 convergence types → a.s. vs prob vs dist illustrated with running stats on real data
- 3.3 LLN → running mean of iris feature → μ; convergence band viz
- 3.4 CLT & delta method → CLT for the mean of a skewed real feature; delta method for a ratio; histogram→Normal
- 3.5 random processes → Poisson process for events; Gaussian process fit to co2 residuals; sample paths
- 3.6 Markov chains & MCMC → discretize co2 differences into a Markov chain; estimate P; Metropolis sampler; stationary dist
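The first half of the 3.6 flagship (discretize a series, estimate the transition matrix P, find the stationary distribution) can be sketched as below. The series is a synthetic AR(1) stand-in for co2 differences, and the three tercile states are an illustrative choice, not something the plan specifies.

```python
import numpy as np

rng = np.random.default_rng(20260619)  # the plan's fixed seed

# Synthetic stand-in for a differenced series: an AR(1) process.
T = 2000
eps = rng.normal(size=T)
diffs = np.empty(T)
diffs[0] = eps[0]
for t in range(1, T):
    diffs[t] = 0.6 * diffs[t - 1] + eps[t]

# Discretize into 3 states (low / middle / high) using terciles.
K = 3
edges = np.quantile(diffs, [1 / 3, 2 / 3])
states = np.digitize(diffs, edges)     # integer labels 0, 1, 2

# Count observed transitions i -> j and normalize each row.
counts = np.zeros((K, K))
np.add.at(counts, (states[:-1], states[1:]), 1)
P = counts / counts.sum(axis=1, keepdims=True)

# Stationary distribution: left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()

print(np.round(P, 3))
print("stationary:", np.round(pi, 3))
```

The Metropolis sampler part of 3.6 is not covered by this sketch.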
Part IV: Inference (diabetes, breast_cancer, iris)
- 4.1 sampling distributions → bootstrap sampling distribution of a statistic on real data
- 4.2 method of moments → MoM estimate for a fitted distribution on a real feature
- 4.3 MLE → MLE scratch (Normal, Bernoulli) on real data; likelihood surface viz
- 4.4 estimator properties → bias/variance/MSE/consistency via simulation on real-data-calibrated model
- 4.5 sufficiency/Fisher/CRLB → Fisher information & CRLB for a real-feature model; efficiency check
- 4.6 confidence intervals → CI (normal, bootstrap, t) for a real mean; coverage simulation
- 4.7 hypothesis testing → two-sample t-test iris species; power curve; p-value distribution under H0
- 4.8 LRT/Wald/score → the three tests on a logistic fit (breast_cancer); agreement viz
- 4.9 bootstrap/jackknife/permutation → bootstrap CI + permutation test on a real difference
- 4.10 Bayesian inference → Beta–Binomial / Normal posterior on a real proportion/mean; credible interval
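A minimal sketch of the 4.9 flagship (percentile bootstrap CI plus a permutation test for a difference in means) is shown below. The two groups are synthetic stand-ins for two real subgroups; the group sizes, effect size, and resampling counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(20260619)  # the plan's fixed seed

# Synthetic stand-ins for one feature measured in two groups.
a = rng.normal(0.0, 1.0, size=50)
b = rng.normal(1.0, 1.0, size=50)
observed = b.mean() - a.mean()


def bootstrap_ci(a, b, rng, n_boot=2000, level=0.95):
    """Percentile bootstrap CI for mean(b) - mean(a), resampling each group."""
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        diffs[i] = (rng.choice(b, size=len(b)).mean()
                    - rng.choice(a, size=len(a)).mean())
    alpha = 1.0 - level
    return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])


def permutation_pvalue(a, b, rng, n_perm=2000):
    """Two-sided permutation p-value for the difference in means."""
    pooled = np.concatenate([a, b])
    obs = abs(b.mean() - a.mean())
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        stat = abs(perm[len(a):].mean() - perm[:len(a)].mean())
        hits += int(stat >= obs)
    return (hits + 1) / (n_perm + 1)   # add-one correction keeps p > 0


lo, hi = bootstrap_ci(a, b, rng)
p_value = permutation_pvalue(a, b, rng)
print(f"difference = {observed:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}], p = {p_value:.4f}")
```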
Part V: Modeling (diabetes, iris, wine, sunspots, digits)
- 5.1 linear regression → OLS scratch (normal equations + QR) on diabetes; vs statsmodels
- 5.2 diagnostics/selection → residual plots, leverage/Cook, VIF, AIC/BIC on diabetes fit
- 5.3 ANOVA → one-way ANOVA of iris feature across species; F-test scratch
- 5.4 logistic regression → IRLS scratch on breast_cancer; ROC/AUC; vs sklearn
- 5.5 Poisson regression → GLM Poisson scratch on sunspots/count; rate ratio; overdispersion check
- 5.6 mixed-effects → random-intercept model on a grouped real dataset; ICC/shrinkage
- 5.7 nonparametric regression → kernel smoother + B-spline scratch on co2 trend; bandwidth
- 5.8 cross-validation → K-fold/LOOCV scratch on diabetes; one-SE rule; model selection
- 5.9 PCA & clustering → PCA scratch (SVD) + k-means scratch on wine/digits; elbow/silhouette
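For the 5.4 flagship, IRLS for logistic regression amounts to repeated Newton steps on the log-likelihood. The sketch below runs it on a synthetic stand-in for breast_cancer; the sample size and coefficients are hypothetical, and the check at the end is that the score (gradient) vanishes at the fitted coefficients.

```python
import numpy as np

rng = np.random.default_rng(20260619)  # the plan's fixed seed


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Synthetic stand-in: intercept + 2 features, binary response.
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([-0.5, 1.0, -1.5])          # hypothetical coefficients
y = (rng.random(n) < sigmoid(X @ beta_true)).astype(float)


def irls_logistic(X, y, n_iter=25, tol=1e-10):
    """Fit logistic regression by Newton's method (equivalently IRLS)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = sigmoid(X @ beta)
        w = mu * (1.0 - mu)                      # IRLS weights
        hessian = X.T @ (X * w[:, None])         # X^T W X
        score = X.T @ (y - mu)                   # gradient of the log-likelihood
        step = np.linalg.solve(hessian, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


beta_hat = irls_logistic(X, y)
score_at_fit = X.T @ (y - sigmoid(X @ beta_hat))
print("beta_hat:", np.round(beta_hat, 3))
```

In the planned demo the same coefficients would also be compared against an sklearn fit with regularization effectively switched off; that comparison is not included here.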
Part VI: Statistical ML (breast_cancer, wine, digits)
- 6.1 learning theory → bias–variance decomposition sim + train/test gap on a real fit
- 6.2 regularization → ridge/lasso path scratch on diabetes; coefficient shrinkage viz
- 6.3 LDA/QDA/NB/kNN → all four scratch on wine; decision-boundary + accuracy compare
- 6.4 SVM & kernels → soft-margin + RBF (via sklearn) on breast_cancer 2D projection; margins
- 6.5 trees/bagging/RF → CART scratch (Gini) + bagging/RF on breast_cancer; OOB, importance
- 6.6 boosting → AdaBoost scratch on wine; stage-wise error; vs gradient boosting
- 6.7 EM/GMM → EM scratch for a 2-component GMM on an iris/wine feature; responsibilities; BIC
- 6.8 dim-reduction (manifold) → PCA vs Isomap vs t-SNE on digits → 2D; trustworthiness
- 6.9 anomaly/semi-sup → Mahalanobis + IsolationForest anomaly on breast_cancer; label-propagation
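The ridge half of the 6.2 flagship has a closed form, so a coefficient path is only a few lines. The sketch below uses synthetic standardized features in place of diabetes; the penalty grid and true coefficients are illustrative assumptions, and the lasso path (which needs an iterative solver) is not shown.

```python
import numpy as np

rng = np.random.default_rng(20260619)  # the plan's fixed seed

# Synthetic stand-in: standardized features, centered target.
n, p = 100, 5
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.0]) + rng.normal(size=n)
y = y - y.mean()


def ridge(X, y, lam):
    """Closed-form ridge solution (X^T X + lam I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


lams = [0.0, 1.0, 10.0, 100.0, 1000.0]           # illustrative penalty grid
path = np.array([ridge(X, y, lam) for lam in lams])
norms = np.linalg.norm(path, axis=1)

for lam, coef_norm in zip(lams, norms):
    print(f"lambda = {lam:7.1f}  ||beta|| = {coef_norm:.4f}")
```

Plotting each column of `path` against `lams` on a log axis gives the shrinkage visualization; at `lam = 0` the solution coincides with OLS.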
Part VII: Measure-Theoretic (analytic + empirical on real data)
- 7.1 why measure theory → Dirichlet/Riemann failure + covering ℚ (measure 0) code + Cantor viz; empirical measure of a real feature set
- 7.2 σ-algebra/measure → finite σ-algebra generator; empirical measure μ_n(A)=#{x∈A}/n on real feature bins; Carathéodory intuition viz
- 7.3 measurable maps/RVs → pushforward: law of Y=g(X) for a real feature X; histogram of pushforward vs transform
- 7.4 Lebesgue integral → ∫f dμ_n = (1/n)Σf(x_i) on real data (Lebesgue = averaging); MCT/DCT demo with a real-data sequence
- 7.5 Lp/Hilbert/Radon–Nikodym → ‖·‖_p of a real feature (p-sweep); L² projection = best predictor = regression; RN-derivative = density ratio of two real subgroups
- 7.6 independence/SLLN → SLLN a.s. convergence of a real feature mean; Borel–Cantelli via rare events; independence test
- 7.7 conditional expectation → E[X|group] = group means (best L² predictor) on iris; total-variance decomposition = ANOVA R²
- 7.8 martingales → build a martingale from co2 increments / a fair-game bet; optional stopping demo; Doob decomposition of a real series
- 7.9 martingale convergence → Pólya urn (synthetic) + a real-data bounded martingale converging; Doob maximal inequality check
- 7.10 char. functions & CLT → empirical characteristic function of a real feature; CLT of its standardized mean → N(0,1) via φ→e^{−t²/2} (the capstone)
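The 7.10 capstone demo compares the empirical characteristic function of a standardized mean with the N(0,1) limit e^{−t²/2}. A minimal sketch follows, using a skewed synthetic stand-in (Exponential(1), whose mean and standard deviation are both 1) in place of a real feature; the sample size, replication count, and t-grid are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(20260619)  # the plan's fixed seed

n, n_reps = 100, 5000                  # sample size per mean, number of means
samples = rng.exponential(scale=1.0, size=(n_reps, n))

# Standardized sample means; Exponential(1) has mean 1 and sd 1.
z = np.sqrt(n) * (samples.mean(axis=1) - 1.0) / 1.0


def ecf(data, t):
    """Empirical characteristic function: mean of exp(i t x_j) for each t."""
    return np.exp(1j * np.outer(t, data)).mean(axis=1)


t = np.linspace(-3.0, 3.0, 61)
phi_n = ecf(z, t)
phi_normal = np.exp(-t**2 / 2)         # characteristic function of N(0, 1)

max_gap = np.max(np.abs(phi_n - phi_normal))
print(f"max |phi_n(t) - exp(-t^2/2)| on [-3, 3] = {max_gap:.4f}")
```

With a real feature the population mean and standard deviation are unknown, so the standardization would have to resample from the data and use its sample moments instead.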
Format of each demo (notebook cell group and doc section)
- Concept: the flagship idea in one or two sentences (Bangla + English term).
- Scratch implementation: written by hand with numpy (no shortcut libraries).
- Library: the sklearn/scipy/statsmodels version, matched against the scratch result (assert close).
- Demonstrate/"prove": empirical verification (simulation/convergence/identity), showing numerically that the concept works.
- Visualize: 1 (occasionally 2) plots, saved as a figure.
- Real data: the dataset listed in the map above.
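The scratch → library → demonstrate sequence above might look like this for a single demo cell group. This is only a sketch: it uses a synthetic stand-in instead of a bundled dataset, takes `np.cov` as the "library" reference (the plan names sklearn/scipy/statsmodels), and omits the plotting step.

```python
import numpy as np

rng = np.random.default_rng(20260619)  # the plan's fixed seed

# Real data step: synthetic stand-in for a 150x4 feature matrix.
X = rng.normal(size=(150, 4)) @ rng.normal(size=(4, 4))


# Scratch implementation: sample covariance matrix by hand.
def covariance_scratch(X):
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / (X.shape[0] - 1)


S = covariance_scratch(X)

# Library version, matched against scratch ("assert close").
S_lib = np.cov(X, rowvar=False)
np.testing.assert_allclose(S, S_lib, rtol=1e-10, atol=1e-12)

# Demonstrate: a covariance matrix is symmetric positive semi-definite.
eigvals = np.linalg.eigvalsh(S)
assert np.allclose(S, S.T)
assert eigvals.min() > -1e-10

print("scratch matches library; smallest eigenvalue =", round(float(eigvals.min()), 4))
```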
4. Build method
Part VIII chapters: 6W+3R (as in Parts V–VII). Integrative project: each Part module (notebook + doc + solution) is built by several writer agents (splitting the flagship demo groups among them), plus a code-runner reviewer (all cells run, numbers are real, figures exist) and a pedagogy reviewer. The fixed seed is 20260619; notebooks are created with nbformat and verified with jupyter nbconvert --execute; docs are curriculum-style Bangla + English.
5. Order of build
- scaffold + src/common.py (data loaders, figstyle, helpers) ✅ started
- Integrative project Parts 0→VII (notebook + doc + solution for each)
- Part VIII chapters 8.1→8.4
- Finalize: capstone README, root README/PLAN status, glossary (new terms), memory, present