NYT COVID-19 Data Infrastructure

Building scalable data infrastructure during the pandemic, triaging 30+ scrapers and migrating 2 million documents to support critical journalism

Role: Software Engineer
Year: 2020
Tech Stack: Python, BeautifulSoup, Pandas, AWS, GCP, GitHub Actions
Impact: Fixed/triaged 30+ scrapers, migrated 2M documents from AWS to GCP, enabled critical pandemic data journalism

Context

During the critical early months of the COVID-19 pandemic, The New York Times was racing to build data infrastructure that could track the virus’s spread across thousands of jurisdictions. As public health data systems struggled to keep pace, reliable journalism depended on scalable, accurate data collection. This work was part of a larger team effort that contributed to the Times’ Pulitzer Prize-winning pandemic coverage.

Timeline: June 2020 - November 2020
Team: NYT COVID-19 Data Team
Scope: Data scraping infrastructure, document migration, FOIA automation

Challenge

The pandemic created unprecedented demand for real-time public health data, but sources were fragmented across thousands of government websites with inconsistent formats. The existing scraping infrastructure was breaking under the scale and complexity.

Key Problems

  1. Scraper Maintenance at Scale: 30+ scrapers required constant monitoring and fixes as government websites changed data formats (a defensive-parsing sketch follows this list)

  2. Document Management: millions of FOIA documents needed organized storage and conversion for analysis

  3. Infrastructure Migration: 2 million documents had to be migrated from AWS to GCP while maintaining data integrity
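To give a flavor of what that maintenance looked like in practice, below is a minimal sketch of the defensive-parsing pattern such scrapers need. The URL, table id, and column schema are hypothetical; this is illustrative, not the Times' actual code.

```python
"""Minimal sketch of a defensive table scraper; URL, table id, and schema are hypothetical."""
import io
import sys

import pandas as pd
import requests
from bs4 import BeautifulSoup

SOURCE_URL = "https://health.example.gov/covid/cases"  # hypothetical endpoint
EXPECTED_COLUMNS = {"county", "cases", "deaths"}       # hypothetical schema


def scrape_cases(url: str = SOURCE_URL) -> pd.DataFrame:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    table = soup.find("table", id="case-counts")  # hypothetical element id
    if table is None:
        # Page layout changed: fail loudly so the run lands in the triage queue.
        raise RuntimeError(f"expected table not found at {url}")

    df = pd.read_html(io.StringIO(str(table)))[0]
    df.columns = [str(c).strip().lower() for c in df.columns]

    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        # Schema drift is the most common breakage; surface it explicitly.
        raise RuntimeError(f"schema changed, missing columns: {sorted(missing)}")
    return df


if __name__ == "__main__":
    try:
        print(scrape_cases().head())
    except Exception as exc:  # a real pipeline would alert or page here
        print(f"scraper failed: {exc}", file=sys.stderr)
        sys.exit(1)
```

Failing loudly on layout or schema drift is the point: a scraper that silently returns stale or malformed data is worse than one that errors into a triage queue.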

Solution

Built and maintained critical data infrastructure supporting NYT’s pandemic coverage:

  1. Triaged and fixed 30+ scrapers collecting public health data, hardening them against the format changes described above

  2. Organized and converted millions of FOIA documents into analyzable form (a conversion sketch follows this list)

  3. Migrated 2 million documents from AWS to GCP, validating data integrity throughout
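As an illustration of the second item, here is a minimal sketch of a batch PDF-to-text conversion pass. The source doesn't name the tooling that was actually used, so pdfminer.six and the directory layout here are assumptions.

```python
"""Sketch: batch-convert FOIA PDFs to plain text for analysis (tooling and paths assumed)."""
from pathlib import Path

from pdfminer.high_level import extract_text  # pdfminer.six, one common choice


def convert_tree(src_root: Path, dst_root: Path) -> None:
    """Mirror a directory of PDFs as .txt files, skipping already-converted docs."""
    for pdf_path in src_root.rglob("*.pdf"):
        rel = pdf_path.relative_to(src_root)
        txt_path = (dst_root / rel).with_suffix(".txt")
        if txt_path.exists():
            continue  # idempotent: safe to re-run after partial failures
        txt_path.parent.mkdir(parents=True, exist_ok=True)
        try:
            txt_path.write_text(extract_text(str(pdf_path)), encoding="utf-8")
        except Exception as exc:
            # Scanned or encrypted PDFs need OCR or manual handling; log and move on.
            print(f"skipped {pdf_path}: {exc}")


if __name__ == "__main__":
    convert_tree(Path("foia/raw"), Path("foia/text"))  # hypothetical paths
```

Making the pass idempotent matters at this scale: re-running after a partial failure only touches documents that haven't been converted yet.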

Impact

30+ scrapers maintained - Ensured reliable data collection from government sources during the most critical phase of the pandemic

2 million documents migrated - Successfully moved massive dataset to new infrastructure without service disruption

Critical journalism enabled - Data infrastructure supported hundreds of NYT articles and visualizations tracking the pandemic

Public resource - Contributed to an open-source dataset used by researchers, policymakers, and journalists worldwide

Technical Details

Infrastructure was built primarily in Python with an emphasis on reliability and maintainability. The scraping stack used BeautifulSoup and Pandas for data extraction and transformation. The migration project required careful orchestration between AWS and GCP services, with comprehensive validation to ensure data integrity at scale.
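As a sketch of what that validation might look like, the snippet below cross-checks S3 object checksums against their GCS counterparts after a copy. The bucket names are hypothetical, and the actual validation strategy isn't specified in the source.

```python
"""Sketch: verify an S3 -> GCS migration by comparing checksums (bucket names hypothetical)."""
import base64

import boto3
from google.cloud import storage

S3_BUCKET = "nyt-foia-docs"       # hypothetical
GCS_BUCKET = "nyt-foia-docs-gcp"  # hypothetical


def verify_migration() -> list[str]:
    """Return keys that are missing or mismatched on the GCS side."""
    s3 = boto3.client("s3")
    gcs_bucket = storage.Client().bucket(GCS_BUCKET)
    problems = []

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=S3_BUCKET):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            blob = gcs_bucket.get_blob(key)
            if blob is None:
                problems.append(key)  # object never made it across
                continue
            # For single-part uploads the S3 ETag is the object's MD5 (hex);
            # multipart uploads would need a different strategy (e.g. size
            # checks or re-hashing), which is elided here.
            s3_md5 = obj["ETag"].strip('"')
            gcs_md5 = base64.b64decode(blob.md5_hash).hex()
            if s3_md5 != gcs_md5:
                problems.append(key)
    return problems


if __name__ == "__main__":
    bad = verify_migration()
    print(f"{len(bad)} objects need attention")
```

A pass like this turns "migrated without service disruption" into a checkable claim: any object that is missing or corrupted on the destination side surfaces as a concrete key to re-copy.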

This project demonstrated the critical role of data engineering infrastructure in enabling urgent public service journalism.