Mastering the Variables of Data
From data cleaning to complex structural modeling. Discover the roadmap to statistical validity.
The Primary Split
Qualitative vs. Quantitative Data
Qualitative (Categorical)
Labels, groups, or names. Mathematical operations (like averaging) don't make sense here.
Quantitative (Numerical)
Values that measure or count something. Differences between numbers are meaningful.
Example Dataset Composition
Statistical Prerequisites
Normality: Parametric vs. Non-parametric
Parametric Tests
- Assumption: Data are approximately normally distributed (Bell Curve).
- Measures: Mean ± SD.
- Examples: t-test, ANOVA, Pearson.
Non-parametric Tests
- Assumption: No normality required.
- Measures: Median (IQR).
- Examples: Mann-Whitney, Kruskal-Wallis.
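The two reporting styles above can be compared on the same data. This is a minimal sketch using only Python's standard library; the `values` list is a made-up sample with one extreme value, chosen to show how an outlier pulls the mean but barely moves the median.

```python
import statistics

# Hypothetical sample (e.g., reaction times in ms) with one extreme value.
values = [310, 325, 330, 340, 345, 350, 360, 900]

# Parametric-style summary: mean and sample standard deviation.
mean = statistics.mean(values)
sd = statistics.stdev(values)

# Non-parametric-style summary: median and interquartile range.
median = statistics.median(values)
q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points

print(f"Mean ± SD:    {mean:.1f} ± {sd:.1f}")
print(f"Median (IQR): {median:.1f} ({q1:.1f} to {q3:.1f})")
```

Here the single value of 900 drags the mean above the third quartile, which is the kind of pattern that suggests reporting Median (IQR) instead.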
Comparative Profile
The Blueprint
Integrated Master Plan for Research Analysis
A structured hierarchy of needs. Do not skip phases; reliability depends on the foundation.
Phase 1: The Foundation (Data Management)
"Garbage In, Garbage Out." Before any math, ensure cleanliness.
Data Entry & Coding
Rows = Participants. Columns = Variables. Convert "Male" to 1, "Female" to 0.
Cleaning & Imputation
Check for impossible values (e.g., Age=200). Fill missing data (Mean/Median imputation) or delete the affected rows.
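The coding and cleaning steps above can be sketched as follows. The records, the `MAX_PLAUSIBLE_AGE` cutoff, and the `clean` helper are all hypothetical; treating an impossible value as missing and then imputing the median is one reasonable choice, not the only one.

```python
import statistics

# Hypothetical raw records: one row per participant, one key per variable.
rows = [
    {"id": 1, "sex": "Male", "age": 34},
    {"id": 2, "sex": "Female", "age": 200},   # impossible value
    {"id": 3, "sex": "Female", "age": None},  # missing value
    {"id": 4, "sex": "Male", "age": 41},
    {"id": 5, "sex": "Female", "age": 29},
]

SEX_CODES = {"Male": 1, "Female": 0}
MAX_PLAUSIBLE_AGE = 120  # assumed cutoff for this illustration


def is_valid_age(age):
    return age is not None and 0 <= age <= MAX_PLAUSIBLE_AGE


def clean(rows):
    """Code sex numerically and replace missing/impossible ages with the median."""
    fill = statistics.median(r["age"] for r in rows if is_valid_age(r["age"]))
    cleaned = []
    for r in rows:
        imputed = not is_valid_age(r["age"])
        cleaned.append({
            "id": r["id"],
            "sex": SEX_CODES[r["sex"]],
            "age": fill if imputed else r["age"],
            "age_imputed": imputed,  # keep a flag so imputed values stay traceable
        })
    return cleaned


cleaned = clean(rows)
for r in cleaned:
    print(r)
```

Keeping the `age_imputed` flag makes it possible to rerun the analysis later with those rows deleted instead.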
🧠 Theoretical Core: Measurement Theory
Levels of Measurement (Stevens' Scale):
- Nominal: Labels (Eye color). Mode only.
- Ordinal: Ranked (Gold/Silver). Median allowed.
- Interval: Equal distance, no true zero (temperature in °C). Mean allowed.
- Ratio: True zero (Income). All math allowed.
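The interval/ratio distinction above can be checked with a little arithmetic. This sketch compares 10 °C and 20 °C (illustrative values): the difference is meaningful on either scale, but the "twice as hot" ratio only appears on the Celsius scale, whose zero is arbitrary.

```python
def celsius_to_kelvin(celsius):
    # Kelvin has a true zero, so it behaves as a ratio scale.
    return celsius + 273.15


cold_c, warm_c = 10.0, 20.0

# Differences survive the change of scale (interval property).
diff_c = warm_c - cold_c
diff_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cold_c)

# Ratios do not: 20 °C is not "twice as hot" as 10 °C.
ratio_c = warm_c / cold_c
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)

print(f"Difference: {diff_c:.1f} °C vs {diff_k:.1f} K")
print(f"Ratio:      {ratio_c:.3f} (Celsius) vs {ratio_k:.3f} (Kelvin)")
```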
Phase 2: Descriptive Analysis (Basic Level)
Summarize the dataset to understand its general characteristics.
Quantitative
Mean, Median, Mode.
Spread: Range, SD (σ).
Vis: Histogram.
Categorical
Frequencies & Percentages.
Vis: Bar Charts.
🧠 Theory: Distribution & Spread
Skewness: Measures asymmetry. (Tail right = Positive, Tail left = Negative).
Kurtosis: Measures "tailedness". Heavy tails = more outliers.
Standard Deviation (σ): The typical distance of points from the Mean (the square root of the average squared deviation). High SD = widely spread (volatile) data.
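The three measures above can be computed directly from their moment definitions. This is a sketch using population formulas (dividing by n); statistical packages often apply small-sample corrections, so their numbers may differ slightly. The datasets and the `describe_shape` helper are made up for illustration.

```python
import statistics


def describe_shape(values):
    """Return (sd, skewness, excess kurtosis) using population (divide-by-n) formulas."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    z = [(x - mean) / sd for x in values]      # standardized scores
    skewness = sum(v ** 3 for v in z) / n       # > 0: tail to the right
    excess_kurtosis = sum(v ** 4 for v in z) / n - 3  # > 0: heavier tails than normal
    return sd, skewness, excess_kurtosis


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 2, 3, 3, 3, 4, 4, 20]   # one large value stretches the right tail
left_tailed = [-x for x in right_tailed]      # mirror image

for name, data in [("symmetric", symmetric),
                   ("right-tailed", right_tailed),
                   ("left-tailed", left_tailed)]:
    sd, skew, kurt = describe_shape(data)
    print(f"{name:>12}: SD={sd:.2f}  skew={skew:+.2f}  excess kurtosis={kurt:+.2f}")
```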
Phase 3: Bivariate Analysis (Hypothesis Testing)
Look for relationships ($r$) and compare groups (p-value).
| Comparison | Parametric | Non-Parametric |
|---|---|---|
| 2 Groups | T-Test | Mann-Whitney |
| 3+ Groups | ANOVA | Kruskal-Wallis |
| Correlation | Pearson ($r$) | Spearman ($\rho$) |
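The correlation row of the table can be illustrated by computing both coefficients by hand. In this sketch, Spearman's $\rho$ is obtained as Pearson's $r$ applied to ranks (with tied values sharing their average rank). The data are invented: a perfectly monotonic but curved relationship, where the two coefficients disagree.

```python
import math
from statistics import fmean


def pearson(x, y):
    """Pearson's r: strength of the linear relationship."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # monotonic but not linear

print(f"Pearson r    = {pearson(x, y):.4f}")
print(f"Spearman rho = {spearman(x, y):.4f}")
```

Spearman reaches 1 because the ordering is perfectly preserved, while Pearson falls slightly short because the relationship is not a straight line.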
🧠 Theory: The P-Value & Errors
Null Hypothesis ($H_0$): Assumes no difference/effect.
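P-Value: Probability of obtaining results at least as extreme as those observed, assuming $H_0$ is true. It is not the probability that $H_0$ is true. If $p < 0.05$ (the conventional threshold), reject $H_0$.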
Type I Error: False Positive (Finding a fake effect).
Type II Error: False Negative (Missing a real effect).
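One way to see what a p-value measures is an exact permutation test, sketched below. It is swapped in here as an illustration and is not one of the tests named in the table: under $H_0$ the group labels are interchangeable, so the code tries every possible relabeling and counts how often the difference in means is at least as large as the one observed. The groups are made-up numbers, and the approach is only practical for small samples.

```python
from itertools import combinations
from statistics import fmean


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(fmean(group_a) - fmean(group_b))
    total = sum(pooled)

    as_extreme = 0
    n_splits = 0
    # Every way of choosing which observations get the label "A".
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            as_extreme += 1
        n_splits += 1
    return as_extreme / n_splits


treated = [5, 6, 7, 8]   # hypothetical scores
control = [1, 2, 3, 4]

p = permutation_p_value(treated, control)
print(f"p = {p:.4f}")
print("Reject H0" if p < 0.05 else "Fail to reject H0")
```

With these values only 2 of the 70 possible relabelings are as extreme as the observed split, so the result would be rare if $H_0$ were true.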
Phase 4: Multivariate (Predictive)
- 1. Regression: Predict outcomes. (Linear for values, Logistic for Yes/No).
- 2. PCA: Dimensionality Reduction. Reduce 50 variables into 5 components.
- 3. Clustering: Grouping similar items (K-Means).
🧠 Theory: OLS & Multicollinearity
Ordinary Least Squares (OLS): Fits a line minimizing the sum of squared vertical distances (residuals) between data and line.
Multicollinearity: When predictors correlate strongly with each other. This inflates the standard errors of the coefficients, so the model can't tell which variable is doing the work.
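Both ideas above can be sketched for the simplest case. The first part fits a one-predictor OLS line with the closed-form slope and intercept; the second computes a variance inflation factor (VIF) for a pair of predictors, where VIF = 1 / (1 − r²). All data are invented, and the helper names `fit_ols`, `pearson`, and `vif_two_predictors` are illustrative only.

```python
import math
from statistics import fmean


def fit_ols(x, y):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    mx, my = fmean(x), fmean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


def pearson(x, y):
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sum((a - mx) ** 2 for a in x)
                           * sum((b - my) ** 2 for b in y))


def vif_two_predictors(x1, x2):
    """With exactly two predictors, VIF = 1 / (1 - r^2) for both of them."""
    r = pearson(x1, x2)
    return 1 / (1 - r ** 2)


# Part 1: fit a line and inspect the residuals (actual minus predicted).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = fit_ols(x, y)
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
sse = sum(r ** 2 for r in residuals)  # the quantity OLS minimizes
print(f"y = {intercept:.2f} + {slope:.2f}x, SSE = {sse:.3f}")

# Part 2: two predictors that carry nearly the same information.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 6, 8, 11]
vif = vif_two_predictors(x1, x2)
print(f"VIF = {vif:.0f}")
```

A common rule of thumb treats a VIF above about 10 as a warning sign; the near-duplicate predictors here land far beyond that.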
Phase 5: Structural & Temporal (Expert Level)
SEM (Structural Equation Modeling): Tests hypothesized causal chains, such as mediation ($X \to M \to Y$).
Time Series (ARIMA): Analysis of trends over time for forecasting.
Meta-Analysis: Aggregating multiple studies (Forest Plots).
🧠 Theory: Latent Variables & Stationarity
Latent Variable: A theoretical construct (e.g., "Intelligence") inferred from observed variables (Test scores).
Stationarity: The property of a time series whose mean and variance don't change over time. ARIMA-type models assume it, typically after differencing away a trend.
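Differencing is the usual first remedy for a trending series, and a rough sketch shows why. The series below is synthetic (a linear trend plus an alternating wiggle), and comparing the means of the two halves is only an informal check, not a formal stationarity test.

```python
from statistics import fmean


def difference(series, lag=1):
    """Replace each value with its change from `lag` steps earlier."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


# Synthetic series: upward trend of 2 per step plus a +1/-1 wiggle.
series = [2 * t + (1 if t % 2 else -1) for t in range(20)]
diffed = difference(series)

# Informal check: does the mean drift between the first and second half?
trend_gap = fmean(series[10:]) - fmean(series[:10])
diff_gap = fmean(diffed[9:]) - fmean(diffed[:9])

print(f"Raw series:  mean shifts by {trend_gap:.2f} between halves")
print(f"Differenced: mean shifts by {diff_gap:.2f} between halves")
```

The raw series has a mean that keeps climbing, while the differenced series hovers around a constant level.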
Glossary of Key Terms
Variable
Any characteristic that can take different values across participants (e.g., Age). A "container" for values.
Imputation
Replacing missing data with an estimated value (e.g., the mean or median) so the record isn't lost.
Homoscedasticity
The assumption that the "spread" (variance) is equal across all groups.
Residuals
The difference between the Actual value (Data) and the Predicted value (Model): residual = actual - predicted.
Confidence Interval
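A range of plausible values for the true mean. A 95% interval comes from a procedure that captures the true mean in 95% of repeated samples; informally, "we are 95% confident the mean is between X and Y."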
Scree Plot
A chart used in PCA to decide how many components to keep (look for the "elbow").
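As a worked example for the Confidence Interval entry above, the sketch below computes an interval for a mean using the normal approximation (mean ± z × standard error). This is an assumption made to stay within the standard library: for a sample this small, a t-based interval would be somewhat wider and more appropriate. The scores are made up.

```python
import math
from statistics import NormalDist, fmean, stdev


def mean_confidence_interval(values, level=0.95):
    """Normal-approximation confidence interval for the mean."""
    n = len(values)
    center = fmean(values)
    standard_error = stdev(values) / math.sqrt(n)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% level
    return center - z * standard_error, center + z * standard_error


# Hypothetical exam scores.
sample = [72, 75, 78, 80, 81, 83, 85, 86, 88, 90, 91, 93]

low, high = mean_confidence_interval(sample)
print(f"Mean = {fmean(sample):.1f}, 95% CI = ({low:.1f}, {high:.1f})")
```

Raising `level` to 0.99 widens the interval: more confidence costs precision.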
Quick Selection Guide
| Goal | Statistical Method | Visual Presentation |
|---|---|---|
| Describe population | Mean, SD | Histogram |
| Compare 2 groups | T-test, Mann-Whitney | Boxplot |
| Compare 3+ groups | ANOVA, Kruskal-Wallis | Bar Chart + Error Bars |
| Association | Correlation ($r$) | Scatter Plot, Heat Map |
| Prediction | Regression | Line Plot |
| Reduce Complexity | PCA | Biplot, Scree Plot |
Non-Parametric Analysis
The Non-Parametric Toolkit
Non-parametric tests are the fallback when parametric assumptions (such as normality) are not met. They work on ranks rather than raw values.
| Test | Comparison | Parametric Equivalent |
|---|---|---|
| Mann-Whitney U | 2 independent groups | Independent t-test |
| Wilcoxon Signed-Rank | 2 paired groups | Paired t-test |
| Kruskal-Wallis | 3+ independent groups | One-way ANOVA |
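To show what the first test in the table actually counts, here is a sketch of the Mann-Whitney U statistic computed by comparing every pair of observations across the two groups (a tie counts as half). This pair-counting form gives the same value as the usual rank-sum formula. The sketch stops at the statistic: turning U into a p-value needs exact tables or a normal approximation, which are left out. The data are invented.

```python
def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b): how often a value in one group beats a value in the other."""
    u_a = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5  # ties are split between the groups
    u_b = len(group_a) * len(group_b) - u_a
    return u_a, u_b


control = [1, 2, 3, 4]   # hypothetical scores
treated = [5, 6, 7, 8]

u_control, u_treated = mann_whitney_u(control, treated)
u = min(u_control, u_treated)  # the smaller U is the one usually reported
print(f"U_control = {u_control}, U_treated = {u_treated}, U = {u}")
```

A U of 0 means the groups do not overlap at all; a U near half of n_a × n_b means they are thoroughly mixed.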
The Decision Pathway
Follow the path to classify your variable correctly.
Q1: Categories or Quantities?
Q2: Can it be ranked?
Q3: Any value or counts?
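The three questions can be read as a small decision function. The branch layout and the four outcome labels (nominal, ordinal, discrete, continuous) are an interpretation of the pathway, not something the questions spell out, and `classify_variable` is a hypothetical helper.

```python
def classify_variable(is_quantity, can_be_ranked=False, takes_any_value=False):
    """Walk the pathway: Q1 splits the path, then Q2 or Q3 applies.

    is_quantity      -- Q1: does it measure or count something?
    can_be_ranked    -- Q2 (categories only): is there a natural order?
    takes_any_value  -- Q3 (quantities only): any value in a range, or whole counts?
    """
    if not is_quantity:
        return "ordinal" if can_be_ranked else "nominal"
    return "continuous" if takes_any_value else "discrete"


# Illustrative variables.
examples = {
    "eye color": classify_variable(is_quantity=False, can_be_ranked=False),
    "medal (gold/silver)": classify_variable(is_quantity=False, can_be_ranked=True),
    "number of children": classify_variable(is_quantity=True, takes_any_value=False),
    "income": classify_variable(is_quantity=True, takes_any_value=True),
}

for name, kind in examples.items():
    print(f"{name}: {kind}")
```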