If you work with large administrative datasets or run simulation models, you’ve probably lost hours to slow code. Usually the fix is obvious in hindsight, but only once you know which kind of slow you’re dealing with.
Two kinds of computing tasks
Most analytical work falls into one of two categories.
To make the examples concrete, here’s some synthetic data to work with:
library(tidyverse)
set.seed(42)  # make the simulated tables below reproducible
# Student enrolment and test results
enrolment <- tibble(
  student_id = 1:500000,
  school_id = sample(1:2000, 500000, replace = TRUE),
  year = sample(2015:2023, 500000, replace = TRUE)
)
student_results <- tibble(
  student_id = 1:500000,
  school_id = sample(1:2000, 500000, replace = TRUE),
  year = sample(2015:2023, 500000, replace = TRUE),
  test_score = rnorm(500000, mean = 65, sd = 15)
)
# School funding data
school_funding <- tibble(
  school_id = 1:2000,
  funding_per_student = rnorm(2000, mean = 12000, sd = 2000)
)
# Wage data: 5 subgroups from entry-level to executive
set.seed(42)
n <- 50000
mus <- c(30000, 48000, 65000, 85000, 120000)
sigs <- c(5000, 6000, 7000, 9000, 12000)
weights <- c(0.15, 0.25, 0.25, 0.20, 0.15)
groups <- sample(1:5, n, replace = TRUE, prob = weights)
wages <- rnorm(n, mean = mus[groups], sd = sigs[groups])
The first category is data work: merging datasets, sorting, grouping, aggregating, reshaping. The computer’s job is to find the right data, move it into the right place, and hand it back in a different shape. The bottleneck is the data itself: how much of it there is and how much of it needs to move around to complete the task. Slowdowns here tend to scale with the size of your data. The bigger the dataset, the worse it gets.
# Joining student enrolment records to school funding data
system.time(
  enrolment |>
    left_join(school_funding, by = "school_id")
)
# Collapsing student results to school-year averages
# (.groups = "drop" returns an ungrouped tibble and silences the grouping message)
system.time(
  student_results |>
    group_by(school_id, year) |>
    summarise(mean_score = mean(test_score, na.rm = TRUE), .groups = "drop")
)
To feel this, scale up the data and re-run the same operations:
enrolment_large <- bind_rows(replicate(10, enrolment, simplify = FALSE))
student_results_large <- bind_rows(replicate(10, student_results, simplify = FALSE))
# Same join, ten times the rows
system.time(
  enrolment_large |>
    left_join(school_funding, by = "school_id")
)
# Same aggregation, ten times the rows
system.time(
  student_results_large |>
    group_by(school_id, year) |>
    summarise(mean_score = mean(test_score, na.rm = TRUE), .groups = "drop")
)
You increased the number of rows tenfold, and the wall time should grow roughly in step with it. The computation itself hasn’t changed, only the amount of data moving through it. That’s the signature of a data work problem.
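One way to check this scaling on your own machine is to time the same aggregation at several row counts. The sketch below uses base R's tapply() rather than dplyr so that it runs without extra packages. The helper name time_aggregation() and the chosen sizes are illustrative assumptions, not part of the examples above, and the timings will differ from machine to machine.

```r
# Hypothetical helper: simulate n_rows test scores and time one grouped mean.
time_aggregation <- function(n_rows, n_schools = 2000) {
  school_id  <- sample(seq_len(n_schools), n_rows, replace = TRUE)
  test_score <- rnorm(n_rows, mean = 65, sd = 15)
  # Only the aggregation is timed, not the data generation
  elapsed <- system.time(
    school_means <- tapply(test_score, school_id, mean)
  )[["elapsed"]]
  data.frame(n_rows = n_rows, elapsed = elapsed, n_groups = length(school_means))
}

set.seed(1)
sizes <- c(5e4, 5e5, 5e6)  # each step is ten times the previous one
scaling <- do.call(rbind, lapply(sizes, time_aggregation))
print(scaling)
```

If the elapsed column grows by something close to a factor of ten at each step, the operation is behaving like data work.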
The second category is computational work: fitting a nonlinear model, running a microsimulation, anything where the computer has to search for an answer rather than just reorganise your data. The bottleneck isn’t how many rows you have. It’s how hard the search is.
Suppose you suspect your wage data contains distinct subgroups (entry-level workers, mid-career professionals, senior management) but you have no variable identifying who belongs where. A mixture model estimates the subgroups directly from the distribution. The question is how many subgroups you ask it to find.
library(mixtools)
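# normalmixEM() draws random starting values, so fix the seed to make timings comparable
set.seed(42)
# Fit a 2-component mixture
system.time(mix2 <- normalmixEM(wages, k = 2, verb = FALSE))
# Fit a 5-component mixture: same data, same algorithm
set.seed(42)
system.time(mix5 <- normalmixEM(wages, k = 5, verb = FALSE))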
Same 50,000 observations. Same algorithm. But the 5-component model has 14 parameters instead of 5, and the surface the optimiser navigates gets dramatically more complex. The components can swap labels with each other, creating ridges and saddle points that slow the search. You didn’t add a single row of data. You made the model harder, and the computation exploded. That’s the signature of a computational work problem.
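The parameter counts quoted above follow from the structure of a univariate normal mixture: each of the k components has its own mean and standard deviation, and the k mixing weights must sum to one, which leaves k - 1 of them free. A minimal sketch of that arithmetic, with n_mixture_params() as a hypothetical helper name:

```r
# Free parameters in a k-component univariate normal mixture,
# assuming every component has its own mean and standard deviation.
n_mixture_params <- function(k) {
  n_means   <- k
  n_sds     <- k
  n_weights <- k - 1  # the weights sum to 1, so the last one is determined
  n_means + n_sds + n_weights
}

n_mixture_params(c(2, 5))  # returns 5 14
```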
How to tell which one you have
The simplest diagnostic is to ask what operation is running when things slow down, and what would make it slower.
If the slow operation is a merge, a sort, a grouped aggregation, or any reshaping of your data, and adding more rows would make it proportionally worse, you’re looking at a data work problem.
If the slow operation is an estimation routine, a simulation, or anything involving numerical optimisation, and adding more parameters or model complexity would make it worse regardless of how many rows you have, you’re looking at a computational work problem.
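A rough way to answer the first question is to time each stage of a script separately and see which one dominates. The sketch below is one way to do that in base R. The helper time_steps() and the two toy stages are hypothetical stand-ins for your own code. For anything larger, R's built-in profiler, Rprof(), gives a finer breakdown.

```r
# Hypothetical helper: run each named step once and report elapsed seconds,
# slowest first.
time_steps <- function(steps) {
  elapsed <- vapply(
    steps,
    function(step) system.time(step())[["elapsed"]],
    numeric(1)
  )
  sort(elapsed, decreasing = TRUE)
}

set.seed(1)
scores  <- rnorm(1e6, mean = 65, sd = 15)
schools <- sample(1:2000, 1e6, replace = TRUE)
sample_scores <- scores[1:5000]

timings <- time_steps(list(
  # Data work: a grouped mean over a million rows
  aggregate_scores = function() tapply(scores, schools, mean),
  # Computational work: a numerical search for a normal mean and log-sd
  fit_normal = function() optim(
    c(60, log(10)),
    function(p) -sum(dnorm(sample_scores, mean = p[1], sd = exp(p[2]), log = TRUE))
  )
))
print(timings)
```

Once you know which stage dominates, the second question is whether it gets slower when you add rows or when you add model complexity.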
Knowing which problem you have won’t fix it automatically, but it tells you where to look next, and stops you from wasting time on solutions aimed at the wrong problem.
It’s also worth knowing that the tools most analysts default to were designed with data work in mind. They’re excellent at it. But the language you use for data work may not be the right one for computational work.
The usual fixes for slow R code are to find a better package, swap dplyr for data.table, or rewrite a loop as a vectorised call. Those can work when the bottleneck is data. When the bottleneck is the model, the fix might not be in R at all.
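To illustrate rewriting a loop as a vectorised call, here is a minimal sketch of the same calculation written both ways. The standardisation task and the function names are illustrative assumptions. The point is only that both versions return the same numbers, while the vectorised one hands the looping to compiled code.

```r
# Explicit loop: standardise one element at a time
standardise_loop <- function(x) {
  m <- mean(x)
  s <- sd(x)
  out <- numeric(length(x))
  for (i in seq_along(x)) {
    out[i] <- (x[i] - m) / s
  }
  out
}

# Vectorised: the same arithmetic applied to the whole vector at once
standardise_vec <- function(x) (x - mean(x)) / sd(x)

set.seed(42)
scores <- rnorm(1e6, mean = 65, sd = 15)

system.time(z_loop <- standardise_loop(scores))
system.time(z_vec <- standardise_vec(scores))
all.equal(z_loop, z_vec)
```

A rewrite like this changes how R processes the rows. It does nothing to shorten a search over model parameters.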