Toronto Event Source Pipeline

A data pipeline to identify high-quality event sources in Toronto and the Greater Toronto Area (GTA) from the Web Data Commons Event dataset.

License: MIT · Python 3.11+

Built for CivicTechTO to enable better civic event discovery and aggregation.

🎯 Overview

This project analyzes terabytes of structured web data (N-Quads with Schema.org markup) to find domains that publish event information for Toronto and the GTA. A multi-stage filtering process identifies relevant domains efficiently, without processing the entire dataset.

Why This Matters

Toronto lacks a comprehensive, open-source list of event sources. This pipeline:

  • Discovers event sources automatically from web data
  • Validates locations using multiple geographic strategies
  • Ranks sources by confidence and data quality
  • Enables civic tech projects to build better event aggregators

The Challenge

The Web Data Commons Event dataset contains:

  • 133 compressed files (~20GB total)
  • 10M+ events from 50K+ domains worldwide
  • Only a tiny fraction of these are Toronto-specific

Our pipeline efficiently isolates the ~1-2% of domains that are Toronto-relevant.

✨ Features

🎯 Tri-State Domain Classification

Efficiently categorizes domains as INCLUDE, EXCLUDE, or UNKNOWN based on:

  • Top-level domains (TLDs) - .ca vs .co.uk
  • Keywords - "toronto", "gta", municipality names
  • Known institutions - Universities, venues, cultural organizations
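The decision logic can be sketched roughly as follows. This is a minimal illustration: the keyword, institution, and TLD sets below are made-up samples, not the pipeline's actual rule lists (those live in the pipeline scripts).

```python
# Illustrative tri-state classifier. All three rule sets here are tiny
# samples for demonstration, not the pipeline's real data.
TORONTO_KEYWORDS = {"toronto", "gta", "mississauga", "brampton", "markham"}
KNOWN_INSTITUTIONS = {"rcmusic.com", "torontopubliclibrary.ca", "ago.ca"}
FOREIGN_TLDS = (".co.uk", ".com.au", ".de", ".fr", ".jp")

def classify_domain(domain: str) -> str:
    """Return "INCLUDE", "EXCLUDE", or "UNKNOWN" for a domain."""
    domain = domain.lower()
    if domain in KNOWN_INSTITUTIONS:
        return "INCLUDE"
    if domain.endswith(FOREIGN_TLDS):
        return "EXCLUDE"
    if any(keyword in domain for keyword in TORONTO_KEYWORDS):
        return "INCLUDE"
    # A bare .ca is only a weak signal: Canadian, but not necessarily GTA
    return "UNKNOWN"
```

The UNKNOWN bucket is what makes the scheme efficient: only ambiguous domains need deeper (and more expensive) event-level analysis or manual review.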

🌍 Multi-Strategy Geo-Filtering

Identifies Toronto/GTA events using:

  • Postal Codes: M* (Toronto), L* (GTA regions)
  • Bounding Boxes: Lat/lon coordinates
  • City Names: Toronto, Mississauga, Brampton, etc.
  • Neighborhoods: Scarborough, Etobicoke, North York, etc.
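A sketch of how these strategies can be combined; the postal-code pattern and bounding box below are rough approximations for illustration, and the project's actual logic lives in src/toronto_events/core/geo_filter.py.

```python
import re

# Approximate values for illustration only.
POSTAL_RE = re.compile(r"\b[ML]\d[A-Z]\s?\d[A-Z]\d\b", re.IGNORECASE)
GTA_BBOX = (43.4, 44.0, -80.0, -78.9)  # (min_lat, max_lat, min_lon, max_lon)
GTA_LOCALITIES = {"toronto", "mississauga", "brampton", "markham", "vaughan",
                  "scarborough", "etobicoke", "north york"}

def in_gta(postal_code="", lat=None, lon=None, locality=""):
    """True if any strategy places the event in Toronto or the GTA."""
    if postal_code and POSTAL_RE.search(postal_code):
        return True
    if lat is not None and lon is not None:
        min_lat, max_lat, min_lon, max_lon = GTA_BBOX
        if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
            return True
    return locality.strip().lower() in GTA_LOCALITIES
```

Running multiple strategies in parallel matters because Schema.org event markup is inconsistent: some sites publish coordinates, others only a postal code or a city name.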

⚡ Performance Optimized

  • Uses orjson for 2-3x faster JSON processing
  • Uses the regex library for high-performance pattern matching
  • Streaming N-Quads parser for memory efficiency
  • Progress bars with tqdm for long operations
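The streaming design keeps memory flat: part files are decompressed and parsed one line at a time rather than loaded whole. A simplified sketch of the idea, using a deliberately naive line pattern (the full parser, with proper IRI and literal handling, is src/toronto_events/core/nquads_parser.py):

```python
import gzip
import re

# Simplified N-Quad line shape: subject, predicate, object, graph, dot.
# Real N-Quads need more careful literal/IRI parsing than this regex does.
NQUAD_RE = re.compile(r"^(\S+)\s+(\S+)\s+(.+?)\s+(\S+)\s*\.\s*$")

def iter_quads(path):
    """Stream (subject, predicate, object, graph) tuples from a .gz part file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            match = NQUAD_RE.match(line)
            if match:
                yield match.groups()
```

Because the generator yields one quad at a time, a 20GB dataset can be scanned without ever holding more than one line in memory.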

📊 Smart Prioritization

Identifies which data files contain the most Toronto-relevant domains, optimizing download and processing time.

🖥️ Manual Validation UI

Web-based interface for human review of uncertain classifications, with:

  • Keyboard shortcuts for fast validation
  • Auto-save to browser storage
  • Export/import of validation decisions

🚀 Quick Start

# 1. Install uv (fast Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone the repository
git clone https://github.com/CivicTechTO/toronto-events.git
cd toronto-events

# 3. Install dependencies
uv sync

# 4. Download metadata and sample data
uv run python scripts/download_wdc_events.py --metadata-only

# 5. Run a quick test (1 file, 1000 events)
uv run python scripts/download_wdc_events.py --parts part_101.gz
uv run python scripts/run_pipeline.py --parts part_101.gz --limit 1000

# 6. Check your results!
cat data/processed/toronto_event_sources.csv

See examples/TUTORIAL.md for a detailed walkthrough.

📦 Installation

Requirements

  • Python 3.11+ (3.12 recommended)
  • ~25GB disk space (20GB for data + 5GB for processing)
  • Stable internet (for downloading data)

Setup

This project uses uv for dependency management:

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone repository
git clone https://github.com/CivicTechTO/toronto-events.git
cd toronto-events

# Install dependencies
uv sync

# Verify installation
uv run python --version

Alternative: Traditional pip/venv setup:

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -e .

Note: The scripts use sys.path manipulation for imports. For a cleaner setup, install the package in editable mode with uv pip install -e . or pip install -e ..

📖 Usage

Full Pipeline

Process the complete dataset (~2-4 hours):

# Download all data (~20GB, can take hours)
uv run python scripts/download_wdc_events.py

# Run complete pipeline
uv run python scripts/run_pipeline.py

Selective Processing

Process specific part files only:

# Download specific parts
uv run python scripts/download_wdc_events.py --parts part_0.gz part_14.gz part_101.gz

# Process those parts
uv run python scripts/run_pipeline.py --parts part_0.gz part_14.gz part_101.gz

Testing & Development

Quick test runs for development:

# Limit to 100 events
uv run python scripts/run_pipeline.py --limit 100

# Skip already-completed phases
uv run python scripts/run_pipeline.py --skip-phase1

Advanced Options

# Run specific pipeline phases
uv run python scripts/analyze_domains.py      # Phase 1: Domain signals
uv run python scripts/identify_relevant_parts.py  # Phase 2: Prioritization
uv run python scripts/extract_events.py --limit 1000  # Phase 3: Extraction
uv run python scripts/score_domains.py         # Phase 5: Scoring
uv run python scripts/generate_outputs.py      # Phase 6: Outputs

# Download options
uv run python scripts/download_wdc_events.py --metadata-only  # Metadata only
uv run python scripts/download_wdc_events.py --resume        # Resume interrupted download

πŸ—οΈ Pipeline Architecture

┌────────────────────────────────────────────────────────┐
│  Web Data Commons Event Dataset (2024-12)              │
│  133 part files, ~20GB N-Quads, 10M+ events            │
└─────────────────┬──────────────────────────────────────┘
                  │
    ┌─────────────▼──────────────┐
    │ Phase 1: Domain Analysis   │  Tri-state classification
    │ (analyze_domains.py)       │  based on TLD, keywords,
    └─────────────┬──────────────┘  known institutions
                  │
    ┌─────────────▼──────────────┐
    │ Phase 2: Prioritization    │  Identify which files
    │ (identify_relevant_parts)  │  to download first
    └─────────────┬──────────────┘
                  │
    ┌─────────────▼──────────────┐
    │ Phase 3: Event Extraction  │  Parse N-Quads, skip
    │ (extract_events.py)        │  negative domains,
    └─────────────┬──────────────┘  reconstruct events
                  │
    ┌─────────────▼──────────────┐
    │ Phase 4 & 5: Geo & Scoring │  Multi-strategy location
    │ (score_domains.py)         │  matching, confidence
    └─────────────┬──────────────┘  scoring
                  │
    ┌─────────────▼──────────────┐
    │ Phase 6: Generate Outputs  │  Final deliverables:
    │ (generate_outputs.py)      │  CSV + NDJSON + queue
    └─────────────┬──────────────┘
                  │
         ┌────────▼─────────┐
         │ Validation UI    │  Manual review
         │ (optional)       │  of UNKNOWN domains
         └──────────────────┘

See ARCHITECTURE.md for detailed technical documentation.

πŸ“ Project Structure

toronto-events/
├── src/
│   └── toronto_events/          # Python package (reusable modules)
│       ├── core/                # Core functionality
│       │   ├── nquads_parser.py # N-Quads streaming parser
│       │   └── geo_filter.py    # Geographic filtering
│       ├── pipeline/            # Pipeline orchestration (future)
│       └── utils/               # Utility functions (future)
├── scripts/                     # Executable pipeline scripts
│   ├── run_pipeline.py         # Main pipeline orchestrator
│   ├── download_wdc_events.py  # Data downloader
│   ├── analyze_domains.py      # Phase 1: Domain signals
│   ├── identify_relevant_parts.py  # Phase 2: Part prioritization
│   ├── extract_events.py       # Phase 3: Event extraction
│   ├── score_domains.py        # Phase 5: Scoring & classification
│   ├── generate_outputs.py     # Phase 6: Output generation
│   ├── prepare_validation_data.py  # Prepare UI data
│   └── apply_validations.py    # Apply manual validations
├── validation_ui/               # Web-based validation interface
│   ├── index.html              # Main UI
│   ├── app.js                  # UI logic
│   ├── styles.css              # Styling
│   └── domains.json            # Data to validate
├── data/                        # Data directory (gitignored)
│   ├── raw/                    # Downloaded WDC files
│   ├── intermediate/           # Processing artifacts
│   └── processed/              # Final outputs ✨
├── docs/                        # Additional documentation
│   └── VALIDATION_UI.md        # Validation UI guide
├── examples/                    # Usage examples
│   └── TUTORIAL.md             # Step-by-step tutorial
├── README.md                    # This file
├── ARCHITECTURE.md              # Technical architecture
├── CONTRIBUTING.md              # Contribution guidelines
└── pyproject.toml              # Project configuration

🖥️ Validation UI

For domains classified as UNKNOWN, use the manual validation interface:

# 1. Prepare validation data
uv run python scripts/prepare_validation_data.py

# 2. Open UI (use a local server for best results)
python -m http.server 8000
# Then visit: http://localhost:8000/validation_ui/

# 3. Review domains using keyboard shortcuts:
#    A = Accept    R = Reject    U = Uncertain    N = Next

# 4. Export validations and apply
uv run python scripts/apply_validations.py

Features:

  • ⌨️ Keyboard shortcuts for fast validation
  • 💾 Auto-save to browser local storage
  • 📊 Progress tracking with visual indicators
  • 📤 Export validations for pipeline integration

See docs/VALIDATION_UI.md for detailed documentation.

📊 Output Files

All outputs are saved to data/processed/:

toronto_event_sources.csv

Main deliverable - Ranked list of Toronto event sources

domain,classification,confidence_score,total_events,gta_events,gta_percentage,match_reasons
rcmusic.com,INCLUDE,0.95,247,247,100.0,postal_code|known_institution
torontopubliclibrary.ca,INCLUDE,0.92,183,183,100.0,known_institution|locality
eventbrite.com,INCLUDE,0.45,892,421,47.2,locality|postal_code

Columns:

  • domain - Website domain name
  • classification - INCLUDE, EXCLUDE, or UNKNOWN
  • confidence_score - 0.0-1.0 confidence level
  • total_events - Total events found
  • gta_events - Events matched to GTA
  • gta_percentage - Percentage of GTA events
  • match_reasons - Why it matched (postal, coords, locality, etc.)
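As a quick way to consume the file, high-confidence INCLUDE rows can be pulled out with the standard library alone (column names as in the header shown above; the function name and threshold are illustrative):

```python
import csv

def top_sources(path, min_confidence=0.8):
    """Return INCLUDE rows at or above a confidence threshold, best first."""
    with open(path, newline="", encoding="utf-8") as fh:
        rows = [row for row in csv.DictReader(fh)
                if row["classification"] == "INCLUDE"
                and float(row["confidence_score"]) >= min_confidence]
    return sorted(rows, key=lambda r: float(r["confidence_score"]), reverse=True)

# Example: top_sources("data/processed/toronto_event_sources.csv")
```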

toronto_event_samples.ndjson

Sample events from each INCLUDE domain (JSON Lines format)

{"domain":"rcmusic.com","name":"Summer Concert Series","location":{"locality":"Toronto","postal_code":"M5B 1W8"},"start_date":"2024-07-15"}
{"domain":"ago.ca","name":"Art Gallery Exhibition","location":{"locality":"Toronto"},"start_date":"2024-08-01"}
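Each line is an independent JSON object, so the file can be consumed lazily. The stdlib json module is enough for reading (the pipeline itself uses orjson for speed); the helper name below is illustrative:

```python
import json

def iter_events(path):
    """Yield one event dict per non-empty line of an NDJSON file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)

# Example:
# for event in iter_events("data/processed/toronto_event_samples.ndjson"):
#     print(event["domain"], event["name"])
```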

manual_review_queue.csv

Domains classified as UNKNOWN, needing human review

domain,confidence_score,event_count,gta_percentage,reason
example.com,0.35,12,58.3,low_sample_size
another-site.org,0.42,45,60.0,ambiguous_location

🤝 Contributing

We welcome contributions! Whether you're:

  • πŸ› Reporting bugs
  • πŸ’‘ Suggesting features
  • πŸ“ Improving documentation
  • πŸ’» Writing code

Please see CONTRIBUTING.md for guidelines.

Quick Contribution Workflow

# 1. Fork and clone
git clone https://github.com/YOUR_USERNAME/toronto-events.git
cd toronto-events

# 2. Create a branch
git checkout -b feature/your-feature

# 3. Make changes and test
uv sync
# ... make your changes ...
uv run python scripts/run_pipeline.py --limit 100  # Test

# 4. Stage, commit, and push
git add .
git commit -m "Add: your feature description"
git push origin feature/your-feature

# 5. Create Pull Request on GitHub

📄 License

This project is licensed under the MIT License - see LICENSE file for details.

πŸ™ Acknowledgments

  • Web Data Commons - For providing the Event dataset
  • CivicTechTO - For supporting civic technology projects in Toronto
  • Contributors - Everyone who has contributed to this project

Made with ❤️ by CivicTechTO
