# Why R Remains a Powerful Tool for Data Analysts
R is a programming language and environment purpose-built for statistical computing and data visualization. Created by statisticians for statisticians, it has an enormous ecosystem of packages for virtually every analytical need — from basic data manipulation to advanced econometric modeling, spatial analysis, and bioinformatics. While Python has become dominant in data science and machine learning, R remains the preferred language in academia, clinical research, economics, and any domain where statistical rigor is paramount.
For data analysts, R offers a unique combination of strengths: an unmatched visualization library in ggplot2, the elegantly consistent tidyverse ecosystem for data wrangling, and world-class implementations of statistical tests, regression models, and time series methods that are often more mature than their Python equivalents.
## R vs. Python: When to Use Which
| Criterion | R | Python |
|---|---|---|
| Statistical analysis | Excellent — built-in and packages | Good — via statsmodels, scipy |
| Data visualization | ggplot2 is best-in-class | Matplotlib, Seaborn, Plotly |
| Data wrangling | tidyverse (dplyr, tidyr) | pandas |
| Machine learning | caret, tidymodels | scikit-learn (stronger ecosystem) |
| Deep learning | Limited | PyTorch, TensorFlow (dominant) |
| Reporting / reproducibility | R Markdown, Quarto | Jupyter, Quarto |
| Industry adoption | Academia, biostatistics, finance | Tech, data engineering, ML |
| Learning curve | Moderate (consistent tidyverse) | Moderate (broader ecosystem) |
## Getting Started: RStudio and the Tidyverse
RStudio is the standard IDE for R, offering a script editor, interactive console, environment explorer, and plot viewer in a single interface. Posit (formerly RStudio) also provides Quarto, a next-generation tool for creating reproducible analytical documents, reports, and presentations that can embed R code alongside narrative text.
The tidyverse is a collection of R packages designed around a consistent philosophy of tidy data — where each row is an observation, each column is a variable, and each cell contains a single value. The core tidyverse packages every analyst needs are: dplyr for data manipulation, tidyr for reshaping data, ggplot2 for visualization, readr for data import, and purrr for functional programming.
## Data Wrangling with dplyr
dplyr provides a grammar of data manipulation using five core verbs: filter() to select rows, select() to choose columns, mutate() to create new columns, summarise() to aggregate, and arrange() to sort. These verbs are chainable using the pipe operator (%>% or the native |>), creating readable, left-to-right data transformation pipelines.
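A minimal sketch of the five verbs chained with the native pipe, using the built-in mtcars dataset (assumes only that dplyr is installed):

```r
library(dplyr)

# Keep 4-cylinder cars, select three columns, derive a power-to-weight
# ratio, and sort by fuel economy (mtcars ships with base R)
result <- mtcars |>
  filter(cyl == 4) |>
  select(mpg, wt, hp) |>
  mutate(hp_per_ton = hp / (wt / 2)) |>   # wt is in 1000 lbs; wt/2 is tons
  arrange(desc(mpg))

nrow(result)   # 11 four-cylinder cars, best mpg first
```

Each verb takes a data frame and returns a data frame, which is what makes the left-to-right chaining work.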
group_by() combined with summarise() is the R equivalent of SQL's GROUP BY, enabling grouped aggregations. Joining data frames uses left_join(), inner_join(), right_join(), and full_join() — mirroring SQL join semantics with a tidyverse-consistent API. The combination of dplyr's expressive syntax and pipe chaining makes complex multi-step data transformations readable and maintainable.
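As a sketch of both patterns, the following aggregates mtcars by cylinder count and then joins on a small hypothetical lookup table (the `labels` table and its values are invented for illustration):

```r
library(dplyr)

# Grouped aggregation: average mpg and row count per cylinder class
by_cyl <- mtcars |>
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg), n = n())

# Hypothetical lookup table to demonstrate left_join() semantics
labels <- data.frame(cyl = c(4, 6, 8),
                     label = c("small", "mid", "large"))

joined <- left_join(by_cyl, labels, by = "cyl")
joined   # one row per cylinder class, with the matched label attached
```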
## Reshaping Data with tidyr
tidyr handles the common need to reshape data between wide and long formats. pivot_longer() converts wide data (multiple columns per variable) to long format (one row per observation-variable pair), and pivot_wider() does the reverse. These operations are frequently needed when preparing data for visualization in ggplot2, which expects data in long format, or for statistical models that require specific input structures.
separate() and unite() split columns that contain multiple values and combine multiple columns into one, respectively. complete() makes implicit missing values explicit by filling in all combinations of a set of variables — useful when working with time series that have gaps for certain categories.
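A small round-trip sketch of the two pivot functions, on invented quarterly sales data (assumes tidyr is installed):

```r
library(tidyr)

# A wide table: one sales column per quarter (hypothetical data)
wide <- data.frame(region = c("North", "South"),
                   q1 = c(100, 80),
                   q2 = c(120, 95))

# Wide -> long: one row per region-quarter observation
long <- pivot_longer(wide, cols = c(q1, q2),
                     names_to = "quarter", values_to = "sales")

# pivot_wider() reverses the reshape
back <- pivot_wider(long, names_from = quarter, values_from = sales)
```

The long form (four rows, one per region-quarter pair) is the shape ggplot2 expects when mapping `quarter` to an aesthetic.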
## Data Visualization with ggplot2
ggplot2 implements the Grammar of Graphics — a principled framework for building visualizations by layering components: data, aesthetic mappings (x, y, color, size, shape), geometric objects (points, lines, bars), scales, facets, and themes. This layered approach makes it easy to build complex, publication-quality visualizations with concise, readable code.
The basic structure is: ggplot(data, aes(x = var1, y = var2)) + geom_point(). Additional layers are added with +: + geom_smooth() adds a regression line, + facet_wrap(~category) creates small multiples for each category, + scale_color_brewer() applies a color palette, and + theme_minimal() applies a clean theme. The result is charts that are both beautiful and analytically precise.
ggplot2's faceting capabilities are particularly powerful for data analysis — creating a grid of charts segmented by one or two categorical variables in a single function call, enabling rapid visual comparison across subgroups.
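Putting the layers together, a sketch of a faceted scatter plot with a fitted trend line on mtcars (assumes ggplot2 is installed; the axis labels are illustrative):

```r
library(ggplot2)

# Weight vs. fuel economy, a linear trend per panel, one panel per
# cylinder count -- each + adds a layer to the plot object
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl))) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~cyl) +
  theme_minimal() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       color = "Cylinders")

# print(p) draws it; ggsave("mpg_by_weight.png", p) writes it to disk
```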
## Statistical Analysis in R
R's statistical foundation is one of its greatest strengths. Common statistical tests are built into base R: t.test() for t-tests, chisq.test() for chi-square tests, cor.test() for correlation tests, and aov() for ANOVA. The output includes test statistics, p-values, confidence intervals, and parameter estimates — everything needed to interpret and report results.
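For instance, a Welch two-sample t-test on mtcars takes one line of base R, no packages required:

```r
# Does mpg differ between automatic (am == 0) and manual (am == 1) cars?
auto   <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]

result <- t.test(manual, auto)   # Welch test is the default
result$p.value    # well below 0.05: the difference is significant
result$conf.int   # 95% confidence interval for the difference in means
```

Printing `result` directly shows the full report: test statistic, degrees of freedom, p-value, interval, and group means.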
Linear regression with lm() fits ordinary least squares models. The formula syntax (y ~ x1 + x2 + x1:x2) concisely specifies predictors and interaction terms. summary(lm_model) produces a comprehensive output with coefficients, standard errors, t-statistics, p-values, and R-squared. The broom package converts model outputs to tidy data frames for easier analysis and visualization.
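A sketch of the workflow described above: fit an OLS model with an interaction term, inspect the fit, and tidy the coefficients (the broom step is guarded since it is an optional extra package):

```r
# OLS: mpg predicted by weight, horsepower, and their interaction
model <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)

r2 <- summary(model)$r.squared   # proportion of variance explained
coef(model)                      # intercept plus three slope terms

# broom::tidy() returns the coefficient table as a tidy data frame
if (requireNamespace("broom", quietly = TRUE)) {
  print(broom::tidy(model))
}
```

`mpg ~ wt * hp` is equivalent shorthand for the same formula: `*` expands to both main effects plus the interaction.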
## Reproducible Reporting with R Markdown and Quarto
R Markdown and its successor Quarto enable analysts to combine code, output, and narrative in a single document that renders to HTML, PDF, Word, or presentation formats. Code chunks run inline, embedding tables and charts directly in the document. When data or analysis changes, re-rendering updates everything automatically — eliminating the manual copy-paste workflow that leads to stale reports.
This reproducibility is transformative for analytical work. An R Markdown report can be version-controlled, shared with collaborators, and rerun by anyone with the same data — making analyses fully transparent and auditable. For organizations that publish regular analytical reports, parameterized R Markdown documents enable one template to generate dozens of customized outputs automatically.
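As a sketch, parameterization lives in the document's YAML header (the title, field names, and default value here are hypothetical):

```yaml
title: "Monthly Sales Report"
output: html_document
params:
  region: "North"
```

Code chunks inside the document read `params$region`, and a call like `rmarkdown::render("report.Rmd", params = list(region = "South"))` regenerates the same template for another region — one template, many outputs.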
## Key R Packages by Domain
| Domain | Package | Purpose |
|---|---|---|
| Data wrangling | dplyr, tidyr, data.table | Manipulation and reshaping |
| Visualization | ggplot2, plotly | Static and interactive charts |
| Time series | forecast, tsibble, fable | Forecasting and temporal analysis |
| Machine learning | tidymodels, caret, mlr3 | Modeling frameworks |
| Statistical tests | base R, coin, rstatix | Hypothesis testing |
| Reporting | rmarkdown, quarto, knitr | Reproducible documents |
| Database access | DBI, dbplyr, odbc | Query databases from R |
| Web applications | shiny, shinydashboard | Interactive web apps from R |
## Conclusion
R is a uniquely powerful environment for data analysis, combining a rich statistical heritage with a modern, expressive ecosystem. The tidyverse makes data wrangling consistent and readable, ggplot2 produces visualizations that set the standard for clarity and elegance, and R Markdown enables fully reproducible analytical reporting. Whether you're conducting hypothesis tests, building regression models, or creating interactive dashboards with Shiny, R has mature, well-documented tools for the job. Learning R alongside Python gives you the best of both worlds and makes you a more versatile, effective analyst.