Methodology & Codebook
How the dataset was collected, how sentiments were classified, and known limitations.
1. Dataset Scope
This dashboard analyses YouTube comments related to Canadian immigration collected between 2013 and 2026. The corpus consists of:
- 113 channels — major Canadian news outlets (CBC, CTV, Global News), international news (CNN, TRT World, Firstpost, WION), and immigration-focused creators.
- 362 videos — selected by keyword search and channel expansion from seeds including: Canadian immigration, Canadian visas, deportation, immigrant settlement, anti-immigration, Indians in Canada, Punjabis in Canada.
- 508,577 top-level comments and 87,965 replies from unique commenter accounts.
- Temporal distribution is heavily skewed toward 2023–2025, reflecting YouTube's engagement patterns and the collection window.
Important limitation: This corpus is a purposive sample, not a representative sample of all YouTube discourse on Canadian immigration. Channel expansion may have introduced non-immigration content. Findings should be interpreted as patterns within this collected corpus, not as representative of all Canadian public opinion.
2. Data Collection Pipeline
Stage A — Keyword seed: YouTube search API queries using immigration-related keywords. Matching video URLs saved as seed list.
Stage B — Channel expansion: For each video in the seed, the publishing channel was identified and additional videos from that channel collected. This captures channel-level discourse but may include non-immigration content.
Stage C — Comment & reply collection: YouTube Data API v3 was used to collect all top-level comment threads, including replies, for each video. Comments are paginated and stored in full.
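The pagination in Stage C can be sketched as follows. This is a minimal illustration, not the production collector: the `fetch_page` callable is injected so the sketch runs without credentials, and in production it would wrap a call such as `youtube.commentThreads().list(...)` from `google-api-python-client`.

```python
def iter_comment_pages(fetch_page):
    """Yield every top-level comment thread for a video, following
    pageToken pagination as the YouTube Data API v3 returns it.

    `fetch_page` takes a page token (or None for the first page) and
    returns a dict with "items" and an optional "nextPageToken".
    """
    token = None
    while True:
        page = fetch_page(token)
        yield from page.get("items", [])
        token = page.get("nextPageToken")
        if token is None:
            break

# Fake two-page response illustrating the shape of API results.
pages = {
    None: {"items": [{"id": "c1"}, {"id": "c2"}], "nextPageToken": "p2"},
    "p2": {"items": [{"id": "c3"}]},
}
comments = list(iter_comment_pages(lambda tok: pages[tok]))
```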
Stage D — Incremental refresh: Subsequent ETL runs use per-video checkpoints (etl_comment_checkpoints) to fetch only comments published after the last collection run, avoiding duplication.
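The checkpoint filter in Stage D amounts to keeping only comments published after the stored timestamp. A sketch of that logic, assuming ISO-8601 `publishedAt` strings as the API returns them (the helper name is ours, not the ETL code's):

```python
from datetime import datetime

def comments_since(comments, checkpoint):
    """Keep only comments published strictly after the per-video
    checkpoint, mirroring what etl_comment_checkpoints enables."""
    def ts(value):
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    if checkpoint is None:
        return list(comments)  # first run: take everything
    cutoff = ts(checkpoint)
    return [c for c in comments if ts(c["publishedAt"]) > cutoff]

batch = [
    {"id": "a", "publishedAt": "2025-01-01T00:00:00Z"},
    {"id": "b", "publishedAt": "2025-06-01T00:00:00Z"},
]
fresh = comments_since(batch, "2025-03-01T00:00:00Z")
```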
Stage E — LLM classification: Comment text was batch-classified (500 comments per batch) using an LLM classifier (Gemini / DeepSeek / OpenAI-compatible endpoint) into the 25 sentiment categories described in Section 4. Note: the comment_llm_labels table (which stores richer per-comment labels including identity_frame, migration_stage, stance, and toxicity) is currently empty — advanced LLM labelling has not yet been run on the full corpus.
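The batching step in Stage E is a straightforward chunking of the comment list into groups of 500 per classification request. A minimal sketch (the function name is illustrative):

```python
def batched(items, size=500):
    """Split a list of comments into fixed-size batches for LLM
    classification (500 per request, as in Stage E); the final
    batch holds whatever remains."""
    return [items[i:i + size] for i in range(0, len(items), size)]

batches = batched(list(range(1201)), size=500)
```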
3. Sentiment Classification Model
Sentiment was assigned using a multi-label LLM classifier. Each comment may receive one or more category labels from the 25-category taxonomy (Section 4). Category assignments are stored in the comment_sentiment_categories join table — the current corpus has 1,033,643 category assignments across 469,744 classified comments.
Polarity (Positive / Negative / Ambiguous) is a property of the category, not independently assigned per comment: the corpus-level polarity distribution follows directly from the counts of category assignments.
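Because polarity is fixed per category, a corpus-level polarity distribution can be derived by joining assignments to the taxonomy. A sketch of that derivation, where the category names are illustrative stand-ins, not the actual 25-category taxonomy:

```python
from collections import Counter

# Hypothetical excerpt of the taxonomy: polarity is a fixed
# property of each category, not assigned per comment.
CATEGORY_POLARITY = {
    "gratitude": "Positive",
    "hostility": "Negative",
    "policy_debate": "Ambiguous",
}

def polarity_distribution(assignments):
    """Count polarities over (comment_id, category) rows, as stored
    in the comment_sentiment_categories join table. A multi-label
    comment contributes once per assigned category."""
    return Counter(CATEGORY_POLARITY[cat] for _, cat in assignments)

dist = polarity_distribution([
    ("c1", "gratitude"),
    ("c1", "hostility"),
    ("c2", "policy_debate"),
])
```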
Rationale: Each sentiment record in comment_sentiment includes an LLM-generated rationale explaining why the category was assigned. These are accessible via the Close Reading page.
Reliability note: LLM classifiers at this scale (batch 500) are not manually validated. Category assignments should be treated as exploratory indicators, not ground truth. Researchers conducting publishable analysis should validate a sample of assignments against human coders.
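One common way to validate a sample against human coders is chance-corrected agreement such as Cohen's kappa. A self-contained sketch (the dashboard itself does not currently compute this; category labels shown are illustrative):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two coders (e.g. LLM vs.
    human) on the same sample of comments."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Agreement on 4 comments: 3 matches, one disagreement.
kappa = cohens_kappa(
    ["pos", "neg", "pos", "neg"],
    ["pos", "neg", "neg", "neg"],
)
```

Values near 1 indicate near-perfect agreement; values near 0 indicate agreement no better than chance.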
4. Sentiment Category Codebook
The 25 categories were designed to capture the emotional and discursive register of immigration-related public commentary. Categories are grouped by polarity.
Ambiguous polarity
Negative polarity
Positive polarity
5. Citing This Dataset
When citing findings derived from this dashboard in academic work, record the date of access and the row counts relevant to the cited statistics (visible on the Overview page).
Note: as an evolving live database, row counts and sentiment distributions will change as new data is ingested, so cited figures may not match later states of the corpus.
6. Known Limitations
- Purposive sample: Channels and videos were chosen via keyword search and channel expansion — this is not a random sample of YouTube content.
- Geo NLP noise: Place-name extraction uses NLP entity recognition which produces false positives (e.g., "Quran", "West", "Danielle" appear as city entities). Geographic data should be used indicatively, not as precise counts.
- LLM classification reliability: Category assignments were generated at scale without human validation. Treat as exploratory.
- comment_llm_labels is empty: The richer per-comment LLM label table (with identity_frame, migration_stage, stance, toxicity) has not yet been populated. These fields are reserved for a future classification run.
- Platform bias: YouTube comment sections attract particular demographics and discourse styles. Comments are not representative of the general Canadian public.
- Missing subscriber/view counts: The channels table does not currently store subscriber counts or channel-level view counts — only comment-derived metrics are available.
- ETL errors: Recent ingestion runs (March 2026) have returned errors. Check the Django Admin ETL log for current pipeline status.
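The geo-extraction noise noted above can be mitigated with a simple blocklist filter applied before aggregating geographic counts. A sketch, where the blocklist is built from the false positives mentioned in this section and a production list would be curated against the actual entity output:

```python
# Illustrative blocklist of known-noisy "place" entities produced
# by the NER step (examples taken from the limitations above).
GEO_BLOCKLIST = {"quran", "west", "danielle"}

def filter_place_entities(entities):
    """Drop blocklisted entity strings before counting places,
    leaving plausible city names intact."""
    return [e for e in entities if e.lower() not in GEO_BLOCKLIST]

cleaned = filter_place_entities(["Toronto", "Quran", "Brampton", "West"])
```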