A data pipeline to identify high-quality event sources in Toronto and the Greater Toronto Area (GTA) from the Web Data Commons Event dataset.
Built for CivicTechTO to enable better civic event discovery and aggregation.
- Overview
- Features
- Quick Start
- Installation
- Usage
- Pipeline Architecture
- Project Structure
- Validation UI
- Output Files
- Documentation
- Contributing
- License
## Overview

This project analyzes the Web Data Commons Event dataset - structured web data (N-Quads with Schema.org markup) - to find domains that publish event information for Toronto and the GTA. It uses a multi-stage filtering process to identify relevant domains efficiently, rather than exhaustively processing the full ~20GB corpus.
Toronto lacks a comprehensive, open-source list of event sources. This pipeline:
- Discovers event sources automatically from web data
- Validates locations using multiple geographic strategies
- Ranks sources by confidence and data quality
- Enables civic tech projects to build better event aggregators
The Web Data Commons Event dataset contains:
- 133 compressed files (~20GB total)
- 10M+ events from 50K+ domains worldwide
- Only a tiny fraction are Toronto-specific
Our pipeline efficiently identifies the ~1-2% of relevant Toronto sources.
## Features

Efficiently categorizes domains as INCLUDE, EXCLUDE, or UNKNOWN based on:
- Top-level domains (TLDs) - `.ca` vs `.co.uk`
- Keywords - "toronto", "gta", municipality names
- Known institutions - Universities, venues, cultural organizations
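A minimal sketch of what such a tri-state check might look like. The signal lists and the `classify_domain` name here are illustrative, not the pipeline's actual code:

```python
import re

# Hypothetical signal lists; the real pipeline loads richer sets of signals.
POSITIVE_TLDS = (".on.ca", ".toronto.on.ca")
NEGATIVE_TLDS = (".co.uk", ".com.au", ".de", ".fr")
KEYWORDS = re.compile(r"toronto|gta|mississauga|brampton|scarborough")
KNOWN_INSTITUTIONS = {"rcmusic.com", "ago.ca", "torontopubliclibrary.ca"}

def classify_domain(domain: str) -> str:
    """Tri-state classification: INCLUDE, EXCLUDE, or UNKNOWN."""
    domain = domain.lower()
    if domain in KNOWN_INSTITUTIONS:
        return "INCLUDE"
    if domain.endswith(NEGATIVE_TLDS):   # clearly non-Canadian TLD
        return "EXCLUDE"
    if domain.endswith(POSITIVE_TLDS) or KEYWORDS.search(domain):
        return "INCLUDE"
    return "UNKNOWN"                     # defer to later phases / manual review
```

The point of the UNKNOWN bucket is that no cheap signal fires either way; those domains flow on to the geographic phases and, if still ambiguous, to the validation UI.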
Identifies Toronto/GTA events using:
- Postal Codes: M* (Toronto), L* (GTA regions)
- Bounding Boxes: Lat/lon coordinates
- City Names: Toronto, Mississauga, Brampton, etc.
- Neighborhoods: Scarborough, Etobicoke, North York, etc.
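The strategies above can be sketched as a single any-match check. The bounding box and name list below are rough approximations for illustration only:

```python
# Approximate signals, for illustration only; real boundaries are more careful.
GTA_NAMES = {"toronto", "mississauga", "brampton", "scarborough",
             "etobicoke", "north york"}

def is_gta_event(postal_code=None, lat=None, lon=None, locality=None):
    """Multi-strategy location check: any one matching signal is enough."""
    # Postal codes: M* = Toronto proper, L* = surrounding GTA regions
    if postal_code and postal_code.strip().upper().startswith(("M", "L")):
        return True
    # Rough lat/lon bounding box around the GTA (approximate!)
    if lat is not None and lon is not None:
        if 43.4 <= lat <= 44.1 and -80.0 <= lon <= -78.7:
            return True
    # City and neighbourhood names
    if locality and locality.strip().lower() in GTA_NAMES:
        return True
    return False
```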
- Uses `orjson` for 2-3x faster JSON processing
- Uses `regex` for high-performance pattern matching
- Streaming N-Quads parser for memory efficiency
- Progress bars with `tqdm` for long operations
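A simplified sketch of the streaming idea. The project's actual parser (`src/toronto_events/core/nquads_parser.py`) handles much more of the N-Quads grammar; this regex only covers the common line shape:

```python
import gzip
import re

# Very simplified N-Quads line pattern: subject, predicate, object, graph.
NQUAD = re.compile(r'^(\S+)\s+(\S+)\s+(".*?"|\S+)\s+(\S+)\s*\.$')

def parse_nquad(line):
    """Return (subject, predicate, object, graph) or None for unmatched lines."""
    m = NQUAD.match(line.strip())
    return m.groups() if m else None

def stream_nquads(path):
    """Yield quads one line at a time, so multi-GB part files
    never have to fit in memory."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            quad = parse_nquad(line)
            if quad:
                yield quad
```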
Identifies which data files contain the most Toronto-relevant domains, optimizing download and processing time.
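In outline, prioritization might look like the following. The data shape here is hypothetical; the actual script derives its counts from the WDC index metadata:

```python
from collections import Counter

def prioritize_parts(domain_to_part_counts, included_domains):
    """Rank part files by how many events they hold from INCLUDE domains.

    domain_to_part_counts: {domain: {part_file: event_count}}
    (hypothetical shape, for illustration).
    """
    scores = Counter()
    for domain, parts in domain_to_part_counts.items():
        if domain in included_domains:
            for part, count in parts.items():
                scores[part] += count
    # Highest-scoring part files should be downloaded and processed first.
    return [part for part, _ in scores.most_common()]
```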
Web-based interface for human review of uncertain classifications, with:
- Keyboard shortcuts for fast validation
- Auto-save to browser storage
- Export/import of validation decisions
## Quick Start

```bash
# 1. Install uv (fast Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone the repository
git clone https://github.com/CivicTechTO/toronto-events.git
cd toronto-events

# 3. Install dependencies
uv sync

# 4. Download metadata and sample data
uv run python scripts/download_wdc_events.py --metadata-only

# 5. Run a quick test (1 file, 1000 events)
uv run python scripts/download_wdc_events.py --parts part_101.gz
uv run python scripts/run_pipeline.py --parts part_101.gz --limit 1000

# 6. Check your results!
cat data/processed/toronto_event_sources.csv
```

See examples/TUTORIAL.md for a detailed walkthrough.
## Installation

Requirements:

- Python 3.11+ (3.12 recommended)
- ~25GB disk space (20GB for data + 5GB for processing)
- Stable internet (for downloading data)
This project uses uv for dependency management:
```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone repository
git clone https://github.com/CivicTechTO/toronto-events.git
cd toronto-events

# Install dependencies
uv sync

# Verify installation
uv run python --version
```

Alternative: traditional pip/venv setup:

```bash
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -e .
```

Note: The scripts use `sys.path` manipulation for imports. For a cleaner setup, install the package in editable mode with `uv pip install -e .` or `pip install -e .`.
## Usage

Process the complete dataset (~2-4 hours):

```bash
# Download all data (~20GB, can take hours)
uv run python scripts/download_wdc_events.py

# Run complete pipeline
uv run python scripts/run_pipeline.py
```

Process specific part files only:

```bash
# Download specific parts
uv run python scripts/download_wdc_events.py --parts part_0.gz part_14.gz part_101.gz

# Process those parts
uv run python scripts/run_pipeline.py --parts part_0.gz part_14.gz part_101.gz
```

Quick test runs for development:

```bash
# Limit to 100 events
uv run python scripts/run_pipeline.py --limit 100

# Skip already-completed phases
uv run python scripts/run_pipeline.py --skip-phase1
```

```bash
# Run specific pipeline phases
uv run python scripts/analyze_domains.py              # Phase 1: Domain signals
uv run python scripts/identify_relevant_parts.py      # Phase 2: Prioritization
uv run python scripts/extract_events.py --limit 1000  # Phase 3: Extraction
uv run python scripts/score_domains.py                # Phase 5: Scoring
uv run python scripts/generate_outputs.py             # Phase 6: Outputs
```
```bash
# Download options
uv run python scripts/download_wdc_events.py --metadata-only  # Metadata only
uv run python scripts/download_wdc_events.py --resume         # Resume interrupted download
```

## Pipeline Architecture

```
┌─────────────────────────────────────────────┐
│ Web Data Commons Event Dataset (2024-12)    │
│ 133 part files, ~20GB N-Quads, 10M+ events  │
└──────────────────┬──────────────────────────┘
                   │
    ┌──────────────▼─────────────┐
    │ Phase 1: Domain Analysis   │   Tri-state classification
    │ (analyze_domains.py)       │   based on TLD, keywords,
    └──────────────┬─────────────┘   known institutions
                   │
    ┌──────────────▼─────────────┐
    │ Phase 2: Prioritization    │   Identify which files
    │ (identify_relevant_parts)  │   to download first
    └──────────────┬─────────────┘
                   │
    ┌──────────────▼─────────────┐
    │ Phase 3: Event Extraction  │   Parse N-Quads, skip
    │ (extract_events.py)        │   negative domains,
    └──────────────┬─────────────┘   reconstruct events
                   │
    ┌──────────────▼─────────────┐
    │ Phase 4 & 5: Geo & Scoring │   Multi-strategy location
    │ (score_domains.py)         │   matching, confidence
    └──────────────┬─────────────┘   scoring
                   │
    ┌──────────────▼─────────────┐
    │ Phase 6: Generate Outputs  │   Final deliverables:
    │ (generate_outputs.py)      │   CSV + NDJSON + queue
    └──────────────┬─────────────┘
                   │
          ┌────────▼─────────┐
          │  Validation UI   │   Manual review
          │   (optional)     │   of UNKNOWN domains
          └──────────────────┘
```
See ARCHITECTURE.md for detailed technical documentation.
## Project Structure

```
toronto-events/
├── src/
│   └── toronto_events/              # Python package (reusable modules)
│       ├── core/                    # Core functionality
│       │   ├── nquads_parser.py     # N-Quads streaming parser
│       │   └── geo_filter.py        # Geographic filtering
│       ├── pipeline/                # Pipeline orchestration (future)
│       └── utils/                   # Utility functions (future)
├── scripts/                         # Executable pipeline scripts
│   ├── run_pipeline.py              # Main pipeline orchestrator
│   ├── download_wdc_events.py       # Data downloader
│   ├── analyze_domains.py           # Phase 1: Domain signals
│   ├── identify_relevant_parts.py   # Phase 2: Part prioritization
│   ├── extract_events.py            # Phase 3: Event extraction
│   ├── score_domains.py             # Phase 5: Scoring & classification
│   ├── generate_outputs.py          # Phase 6: Output generation
│   ├── prepare_validation_data.py   # Prepare UI data
│   └── apply_validations.py         # Apply manual validations
├── validation_ui/                   # Web-based validation interface
│   ├── index.html                   # Main UI
│   ├── app.js                       # UI logic
│   ├── styles.css                   # Styling
│   └── domains.json                 # Data to validate
├── data/                            # Data directory (gitignored)
│   ├── raw/                         # Downloaded WDC files
│   ├── intermediate/                # Processing artifacts
│   └── processed/                   # Final outputs ✨
├── docs/                            # Additional documentation
│   └── VALIDATION_UI.md             # Validation UI guide
├── examples/                        # Usage examples
│   └── TUTORIAL.md                  # Step-by-step tutorial
├── README.md                        # This file
├── ARCHITECTURE.md                  # Technical architecture
├── CONTRIBUTING.md                  # Contribution guidelines
└── pyproject.toml                   # Project configuration
```
## Validation UI

For domains classified as UNKNOWN, use the manual validation interface:
```bash
# 1. Prepare validation data
uv run python scripts/prepare_validation_data.py

# 2. Open UI (use a local server for best results)
python -m http.server 8000
# Then visit: http://localhost:8000/validation_ui/

# 3. Review domains using keyboard shortcuts:
#    A = Accept   R = Reject   U = Uncertain   N = Next

# 4. Export validations and apply
uv run python scripts/apply_validations.py
```

Features:

- ⌨️ Keyboard shortcuts for fast validation
- 💾 Auto-save to browser local storage
- 📊 Progress tracking with visual indicators
- 📤 Export validations for pipeline integration
See docs/VALIDATION_UI.md for detailed documentation.
## Output Files

All outputs are saved to `data/processed/`:
Main deliverable - Ranked list of Toronto event sources:

```csv
domain,classification,confidence_score,total_events,gta_events,gta_percentage,match_reasons
rcmusic.com,INCLUDE,0.95,247,247,100.0,postal_code|known_institution
torontopubliclibrary.ca,INCLUDE,0.92,183,183,100.0,known_institution|locality
eventbrite.com,INCLUDE,0.45,892,421,47.2,locality|postal_code
```

Columns:

- `domain` - Website domain name
- `classification` - INCLUDE, EXCLUDE, or UNKNOWN
- `confidence_score` - 0.0-1.0 confidence level
- `total_events` - Total events found
- `gta_events` - Events matched to GTA
- `gta_percentage` - Percentage of GTA events
- `match_reasons` - Why it matched (postal, coords, locality, etc.)
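The CSV can be consumed with only the standard library. A sketch, with a hypothetical `high_confidence_sources` helper and the sample rows inlined instead of opening `data/processed/` files:

```python
import csv
import io

# Rows inlined for illustration; normally you would open the CSV from
# data/processed/ instead.
SAMPLE = """domain,classification,confidence_score,total_events,gta_events,gta_percentage,match_reasons
rcmusic.com,INCLUDE,0.95,247,247,100.0,postal_code|known_institution
eventbrite.com,INCLUDE,0.45,892,421,47.2,locality|postal_code
"""

def high_confidence_sources(fh, threshold=0.8):
    """Return domains classified INCLUDE with confidence at or above threshold."""
    return [
        row["domain"]
        for row in csv.DictReader(fh)
        if row["classification"] == "INCLUDE"
        and float(row["confidence_score"]) >= threshold
    ]
```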
Sample events from each INCLUDE domain (JSON Lines format):

```json
{"domain":"rcmusic.com","name":"Summer Concert Series","location":{"locality":"Toronto","postal_code":"M5B 1W8"},"start_date":"2024-07-15"}
{"domain":"ago.ca","name":"Art Gallery Exhibition","location":{"locality":"Toronto"},"start_date":"2024-08-01"}
```
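Reading the NDJSON output needs nothing beyond the standard library; each line is an independent JSON document, so files can be streamed record by record. A sketch with the sample lines inlined:

```python
import io
import json

# Two example lines in the NDJSON output format, inlined for illustration.
SAMPLE = (
    '{"domain":"rcmusic.com","name":"Summer Concert Series",'
    '"location":{"locality":"Toronto","postal_code":"M5B 1W8"},"start_date":"2024-07-15"}\n'
    '{"domain":"ago.ca","name":"Art Gallery Exhibition",'
    '"location":{"locality":"Toronto"},"start_date":"2024-08-01"}\n'
)

def read_ndjson(fh):
    """Yield one parsed event per non-empty line."""
    for line in fh:
        if line.strip():
            yield json.loads(line)
```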
Domains classified as UNKNOWN, needing human review:

```csv
domain,confidence_score,event_count,gta_percentage,reason
example.com,0.35,12,58.3,low_sample_size
another-site.org,0.42,45,60.0,ambiguous_location
```

## Documentation

- examples/TUTORIAL.md - Step-by-step tutorial for beginners
- ARCHITECTURE.md - Technical architecture and design
- CONTRIBUTING.md - How to contribute to the project
- docs/VALIDATION_UI.md - Validation UI documentation
## Contributing

We welcome contributions! Whether you're:

- 🐛 Reporting bugs
- 💡 Suggesting features
- 📖 Improving documentation
- 💻 Writing code
Please see CONTRIBUTING.md for guidelines.
```bash
# 1. Fork and clone
git clone https://github.com/YOUR_USERNAME/toronto-events.git
cd toronto-events

# 2. Create a branch
git checkout -b feature/your-feature

# 3. Make changes and test
uv sync
# ... make your changes ...
uv run python scripts/run_pipeline.py --limit 100  # Test

# 4. Commit and push
git commit -m "Add: your feature description"
git push origin feature/your-feature

# 5. Create Pull Request on GitHub
```

## License

This project is licensed under the MIT License - see the LICENSE file for details.
- Web Data Commons - For providing the Event dataset
- CivicTechTO - For supporting civic technology projects in Toronto
- Contributors - Everyone who has contributed to this project
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- CivicTechTO: Website | Slack
Made with ❤️ by CivicTechTO