What the heck is this?
A Random American generates demographically accurate Americans. The model covers home state, age, race/ethnicity, gender, and political affiliation. That sounds like it should be pretty simple, given that there's no shortage of statistics on all those categories individually. But putting them together is more difficult than you might think.
For example, about a third of the country is Republican, and about 10% live in California, but that doesn't mean 3.3% are California Republicans. The odds of being a Republican change once you decide that your random American is from California. Look up joint and conditional distributions if you want to explore this further: the one-category-at-a-time (marginal) statistics can't tell you how categories overlap, and the more categories you add, the harder those overlaps are to account for. You might have a good estimate of the age distribution of the US, but what about the age distribution of Hispanic Democratic women in Guam? (Including territories like Guam also made this much more difficult, since data is sparser and often not published alongside data for states.) There are many ways of dealing with this problem that vary in laziness, but our model uses real data wherever it's available, making it about as accurate as possible.
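To make the California example concrete, here is a minimal Python sketch. Every number in the table is invented for illustration and is not a real statistic; the `marginal` helper is hypothetical too. It shows that multiplying the two marginals gives a different answer than reading the joint cell directly.

```python
# Toy joint distribution: joint[(state, party)] = share of the whole population.
# All values are made up for illustration; they are NOT real statistics.
joint = {
    ("CA", "Republican"): 0.025,
    ("CA", "Other"): 0.075,
    ("Elsewhere", "Republican"): 0.305,
    ("Elsewhere", "Other"): 0.595,
}


def marginal(table, axis, value):
    """Add up every cell whose key has `value` at position `axis`."""
    return sum(p for key, p in table.items() if key[axis] == value)


p_ca = marginal(joint, 0, "CA")            # share living in California
p_rep = marginal(joint, 1, "Republican")   # share who are Republican

independent_guess = p_ca * p_rep           # what you get by assuming independence
actual = joint[("CA", "Republican")]       # what the joint table actually says
p_rep_given_ca = actual / p_ca             # conditional: Republican given California

print(f"independence guess: {independent_guess:.3f}")
print(f"actual joint share: {actual:.3f}")
print(f"P(Republican | CA): {p_rep_given_ca:.3f}")
```

In this toy table the marginals match the "a third" and "10%" from the example, but the true share of California Republicans is lower than the naive product, because the Republican rate inside California differs from the national rate.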
How people are generated
Generation begins by selecting a state according to its population, using population data from the 2020 U.S. Census Demographic and Housing Characteristics File. Within each state, age, gender, and race or ethnicity are sampled from the American Community Survey Public Use Microdata Sample (ACS PUMS), which preserves the relationships between these characteristics rather than treating them independently. Similar census data are used for Puerto Rico, Guam, American Samoa, the Northern Mariana Islands, and the U.S. Virgin Islands, which are the territories we included for this project.
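The two-step sampling described above can be sketched roughly as follows. This is not the site's actual code: the populations, the records, the `weight` field, and the function name `sample_person` are all hypothetical stand-ins. The idea is that drawing a whole microdata record at once keeps age, gender, and race tied together, instead of drawing each one separately.

```python
import random

# Hypothetical populations (arbitrary units), NOT real Census figures.
STATE_POPULATION = {"CA": 390, "TX": 290, "GU": 2}

# Hypothetical microdata. Each record is one surveyed person; `weight` is an
# assumed survey weight saying how many real people the record stands for.
MICRODATA = {
    "CA": [
        {"age": 34, "gender": "F", "race": "Hispanic", "weight": 120},
        {"age": 61, "gender": "M", "race": "White", "weight": 95},
    ],
    "TX": [
        {"age": 27, "gender": "M", "race": "Black", "weight": 80},
        {"age": 45, "gender": "F", "race": "White", "weight": 110},
    ],
    "GU": [
        {"age": 52, "gender": "F", "race": "Pacific Islander", "weight": 3},
    ],
}


def sample_person(rng):
    """Pick a state by population, then one whole record from that state."""
    states = list(STATE_POPULATION)
    state = rng.choices(states, weights=[STATE_POPULATION[s] for s in states])[0]
    records = MICRODATA[state]
    # Sampling the record as a unit preserves the joint relationship between
    # age, gender, and race that independent draws would destroy.
    record = rng.choices(records, weights=[r["weight"] for r in records])[0]
    return {
        "state": state,
        "age": record["age"],
        "gender": record["gender"],
        "race": record["race"],
    }


rng = random.Random(0)  # seeded only so the sketch is repeatable
person = sample_person(rng)
print(person)
```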
Political affiliation is assigned only to adults. Neither the Census nor PUMS collects this data, so our model instead combines national survey data from the Pew Research Center with state party registration statistics and Census demographics using a Bayesian model. The goal is to avoid double-counting correlations already captured by state demographics while still producing realistic differences between places and populations.
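The double-counting concern is easier to see with a small example. The sketch below is a much simpler stand-in for the Bayesian model, not the model itself, and every number, group label, and function name in it is invented. It first works out what a state's party shares would be if its residents simply followed national rates for their demographic group, then scales the national rates only by whatever difference is left over.

```python
PARTIES = ("Democrat", "Republican", "Independent")

# All numbers are invented for illustration.
# National survey: P(party | demographic group).
NATIONAL = {
    "group_a": {"Democrat": 0.50, "Republican": 0.20, "Independent": 0.30},
    "group_b": {"Democrat": 0.30, "Republican": 0.40, "Independent": 0.30},
}
# One hypothetical state: its demographic mix and its observed party shares.
STATE_GROUPS = {"group_a": 0.25, "group_b": 0.75}
STATE_PARTY = {"Democrat": 0.30, "Republican": 0.45, "Independent": 0.25}


def party_given_group_and_state(national, state_groups, state_party):
    """Adjust national per-group rates by the state's residual lean."""
    # Shares the state would show if demographics alone explained everything.
    expected = {
        party: sum(state_groups[g] * national[g][party] for g in state_groups)
        for party in PARTIES
    }
    # Residual lean: what demographics did NOT already explain.
    ratio = {party: state_party[party] / expected[party] for party in PARTIES}
    adjusted = {}
    for group, rates in national.items():
        scaled = {party: rates[party] * ratio[party] for party in PARTIES}
        total = sum(scaled.values())
        adjusted[group] = {party: v / total for party, v in scaled.items()}
    return adjusted


adjusted = party_given_group_and_state(NATIONAL, STATE_GROUPS, STATE_PARTY)
for group, rates in adjusted.items():
    print(group, {party: round(p, 3) for party, p in rates.items()})
```

Because the adjustment uses the ratio of observed to expected shares rather than the observed shares directly, a state that leans one way purely because of who lives there gets a ratio near 1 and is not pushed further. A single pass like this will not reproduce the state totals exactly; a fuller approach would iterate or fit a proper statistical model.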
Note that we also show names and locations for our generated people. Disclaimer: only the demographic info (age, gender, race/ethnicity, state, politics) is really accurate; the rest is done on a best-effort basis to give the site more personality and a better feel. If I had to incorporate more variables I would tear my hair out.
First names are chosen by combining several independent datasets. The Social Security Administration's baby names database provides the popularity of names by state, birth year, and sex, while U.S. Census name data provide estimates of how strongly names are associated with different racial and ethnic groups. However, the federal first-name data only account for men and women, so for nonbinary names we instead used Cassolotl's informal survey of nonbinary first names. Again, including nonbinary people made the model much more difficult due to the lack of data, but it was a worthwhile challenge. These sources are combined to produce names that are plausible for a person's age, gender, race, and location.
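One plausible way to combine a popularity table with a name-to-race association table is a Bayes-style reweighting, sketched below. This is an assumption about how such sources could be merged, not a description of the site's actual method, and the names, numbers, and the `name_weights` function are all made up. It also assumes that, once you know the name, race doesn't depend further on state, birth year, or sex, which is a simplification.

```python
import random

# Invented numbers for illustration.
# Popularity of each name within one (state, birth year, sex) cell.
POPULARITY = {"Maria": 0.02, "Emily": 0.03, "Aaliyah": 0.01}

# Assumed association table: P(race or ethnicity | name).
RACE_GIVEN_NAME = {
    "Maria": {"Hispanic": 0.70, "White": 0.25, "Black": 0.05},
    "Emily": {"Hispanic": 0.10, "White": 0.80, "Black": 0.10},
    "Aaliyah": {"Hispanic": 0.10, "White": 0.10, "Black": 0.80},
}


def name_weights(popularity, race_given_name, race):
    """P(name | race, cell) is proportional to P(name | cell) * P(race | name)."""
    raw = {
        name: popularity[name] * race_given_name[name][race]
        for name in popularity
    }
    total = sum(raw.values())
    # Normalizing absorbs the constant 1 / P(race) from Bayes' rule.
    return {name: w / total for name, w in raw.items()}


weights = name_weights(POPULARITY, RACE_GIVEN_NAME, "Hispanic")
rng = random.Random(0)  # seeded only so the sketch is repeatable
name = rng.choices(list(weights), weights=list(weights.values()))[0]
print(weights, name)
```

With these toy tables, a name that is only moderately popular overall can still dominate for one group if its association with that group is strong.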