Canadian Immigration Dashboard [ CID ]

Methodology & Codebook

How the dataset was collected, how sentiments were classified, and known limitations.

  • 113 channels
  • 362 videos
  • 508,577 comments
  • 469,744 classified
  • 1,033,643 labels

1. Dataset Scope

This dashboard analyses YouTube comments related to Canadian immigration collected between 2013 and 2026. The corpus consists of:

  • 113 channels — major Canadian news outlets (CBC, CTV, Global News), international news (CNN, TRT World, Firstpost, WION), and immigration-focused creators.
  • 362 videos — selected by keyword search and channel expansion from seeds including: Canadian immigration, Canadian visas, deportation, immigrant settlement, anti-immigration, Indians in Canada, Punjabis in Canada.
  • 508,577 top-level comments and 87,965 replies from unique commenter accounts.
  • Temporal distribution is heavily skewed toward 2023–2025, reflecting YouTube's engagement patterns and the collection window.

Important limitation: This corpus is a purposive sample, not a representative sample of all YouTube discourse on Canadian immigration. Channel expansion may have introduced non-immigration content. Findings should be interpreted as patterns within this collected corpus, not as representative of all Canadian public opinion.

2. Data Collection Pipeline

Stage A — Keyword seed: YouTube search API queries using immigration-related keywords. Matching video URLs saved as seed list.

Stage B — Channel expansion: For each video in the seed, the publishing channel was identified and additional videos from that channel collected. This captures channel-level discourse but may include non-immigration content.
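
Channel expansion reduces to a set union over the seed videos' publishing channels. A minimal sketch, where `channel_of` and `uploads_of` are hypothetical stand-ins for the API lookups, not names from the actual pipeline:

```python
def expand_channels(seed_videos, channel_of, uploads_of):
    """Stage B sketch: from each seed video, identify its publishing
    channel and pull that channel's other uploads into the video set.

    channel_of: callable video_id -> channel_id
    uploads_of: callable channel_id -> iterable of video_ids
    """
    channels = {channel_of(v) for v in seed_videos}
    expanded = set(seed_videos)
    for ch in channels:
        # Note: this step is what may introduce non-immigration content.
        expanded.update(uploads_of(ch))
    return sorted(expanded)
```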

Stage C — Comment & reply collection: YouTube Data API v3 was used to collect all top-level comment threads, including replies, for each video. Comments are paginated and stored in full.
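
Pagination in Stage C follows the standard pageToken loop of the YouTube Data API v3 commentThreads endpoint. The sketch below abstracts the API call into a `list_page` callable so the loop logic is visible without credentials; that callable (and its name) is an assumption, not the pipeline's actual code:

```python
def fetch_all_comment_threads(list_page, video_id):
    """Collect every top-level comment thread (replies included) for one video.

    list_page mimics the commentThreads.list endpoint: it takes
    (video_id, page_token) and returns a dict with "items" and,
    while more pages remain, a "nextPageToken".
    """
    items, token = [], None
    while True:
        page = list_page(video_id, token)
        items.extend(page.get("items", []))
        token = page.get("nextPageToken")
        if token is None:  # last page reached
            return items
```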

Stage D — Incremental refresh: Subsequent ETL runs use per-video checkpoints (etl_comment_checkpoints) to fetch only comments published after the last collection run, avoiding duplication.
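
The checkpoint logic of Stage D can be sketched as a cutoff-and-advance loop. The dict below plays the role of the etl_comment_checkpoints table; the function name and `fetch_since` callable are illustrative, not the pipeline's real identifiers:

```python
from datetime import datetime, timezone

def incremental_fetch(video_id, checkpoints, fetch_since):
    """Stage D sketch: fetch only comments published after the checkpoint.

    checkpoints: dict video_id -> last-seen publish time (the role played
                 by the etl_comment_checkpoints table)
    fetch_since: callable (video_id, cutoff) -> list of
                 (comment_id, published_at) pairs newer than cutoff
    """
    cutoff = checkpoints.get(video_id,
                             datetime.min.replace(tzinfo=timezone.utc))
    new = fetch_since(video_id, cutoff)
    if new:  # advance the checkpoint so the next run skips these comments
        checkpoints[video_id] = max(ts for _, ts in new)
    return new
```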

Stage E — LLM classification: Comment text was batch-classified (500 comments per batch) using an LLM classifier (Gemini / DeepSeek / OpenAI-compatible endpoint) into the 25 sentiment categories described in Section 4. Note: the comment_llm_labels table (which stores richer per-comment labels including identity_frame, migration_stage, stance, and toxicity) is currently empty — advanced LLM labelling has not yet been run on the full corpus.
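
Batching for Stage E is a plain slicing loop. The batch size of 500 comes from the text above; the function name is illustrative:

```python
def batches(comments, size=500):
    """Yield fixed-size classification batches (the final batch may be short)."""
    for i in range(0, len(comments), size):
        yield comments[i:i + size]
```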

3. Sentiment Classification Model

Sentiment was assigned using a multi-label LLM classifier. Each comment may receive one or more category labels from the 25-category taxonomy (Section 4). Category assignments are stored in the comment_sentiment_categories join table — the current corpus has 1,033,643 category assignments across 469,744 classified comments.
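
The one-comment-to-many-categories relationship can be illustrated with a toy SQLite schema. Table and column names here are guesses at the shape implied by the text (a categories table plus the comment_sentiment_categories join table); the live Django schema will differ in detail:

```python
import sqlite3

# Toy schema mirroring the join-table design described above.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sentiment_categories (
        id INTEGER PRIMARY KEY, name TEXT, polarity TEXT);
    CREATE TABLE comment_sentiment_categories (
        comment_id TEXT, category_id INTEGER);
""")
con.executemany("INSERT INTO sentiment_categories VALUES (?, ?, ?)",
                [(1, "Hope", "Positive"), (2, "Fear", "Negative")])
# Comment c1 carries two labels: one row per (comment, category) pair.
con.executemany("INSERT INTO comment_sentiment_categories VALUES (?, ?)",
                [("c1", 1), ("c1", 2), ("c2", 2)])
rows = con.execute("""
    SELECT sc.name, COUNT(*) AS n
    FROM comment_sentiment_categories csc
    JOIN sentiment_categories sc ON sc.id = csc.category_id
    GROUP BY sc.name ORDER BY sc.name
""").fetchall()
# rows -> [('Fear', 2), ('Hope', 1)]
```

Because each label is one row in the join table, the 1,033,643 label count naturally exceeds the 469,744 classified-comment count.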

Polarity (Positive / Negative / Ambiguous) is a property of the category, not independently assigned per comment. The overall polarity distribution is:

  • Negative: 62.8%
  • Positive: 26.0%
  • Ambiguous: 11.2%
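
Since polarity is inherited from the category, the distribution above is a rollup of label assignments through a category-to-polarity map. A minimal sketch (function name assumed):

```python
from collections import Counter

def polarity_distribution(assignments, polarity_of):
    """Percent share of each polarity across all category assignments.

    assignments: iterable of category names, one entry per label assignment
                 (a multi-labelled comment contributes several entries)
    polarity_of: dict category name -> "Positive" | "Negative" | "Ambiguous"
    """
    counts = Counter(polarity_of[c] for c in assignments)
    total = sum(counts.values())
    return {p: round(100 * n / total, 1) for p, n in counts.items()}
```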

Rationale: Each sentiment record in comment_sentiment includes an LLM-generated rationale explaining why the category was assigned. These are accessible via the Close Reading page.

Reliability note: LLM classifiers at this scale (batch 500) are not manually validated. Category assignments should be treated as exploratory indicators, not ground truth. Researchers conducting publishable analysis should validate a sample of assignments against human coders.

4. Sentiment Category Codebook

The 25 categories were designed to capture the emotional and discursive register of immigration-related public commentary. Categories are grouped by polarity.

Ambiguous polarity

Identity Struggle
Comments reflecting confusion or conflict around cultural identity.
Mixed/Conflicted
Ambivalence or coexisting positive and negative tones.
Relief
Sense of escape or safety after hardship in home country.
Sarcasm/Irony
Masked criticism or humor with emotional complexity.

Negative polarity

Alienation
Feeling of being excluded, marginalized, or unwanted.
Disillusionment
Loss of faith or idealism, especially after arrival.
Economic Anxiety
Worries about cost of living, housing, or job market.
Fear
Expressions of anxiety or concern.
Frustration with System
Disappointment or anger at immigration processes, bureaucracy, or delays.
Homesickness
Expressions of longing for home country, family, or familiar culture.
Indignation
Moral outrage or ethical disapproval.
Policy Critique
Critical reflection on immigration laws or political decisions.
Racism Experienced
Describing or denouncing racial discrimination.
Resentment
Frustration, bitterness, or feelings of injustice.
Xenophobia or Prejudice
Negative attitudes toward immigrants or minority groups.

Positive polarity

Admiration for Canada
Expressions of praise or admiration for Canadian values, people, or policies.
Belonging
Feeling of community, acceptance, or inclusion.
Civic Optimism
Positive outlook on participating in democracy or civil life.
Empathy
Supportive, understanding, or compassionate sentiment.
Gratitude
Expressions of thankfulness, relief, or appreciation.
Hope
Aspirational or forward-looking optimism.
Hope for Children
Optimism centered on better futures for one's children.
Job Satisfaction
Positive reflections on career opportunities or working conditions.
Solidarity
Support expressed for others facing similar immigration struggles.
Thankfulness for Safety
Gratitude for peace, asylum, or protection.

5. Citing This Dataset

When citing findings derived from this dashboard in academic work, include the following information:

Canadian Immigration YouTube Analytics Dashboard. Data collected 2013–2026 via YouTube Data API v3. Sentiment classified using LLM-assisted thematic coding (25-category scheme). Live dashboard: https://youtubeanalytics.ca/ [Accessed: ].

Note: as an evolving live database, row counts and sentiment distributions will change as new data is ingested. Record the access date and relevant counts (visible on the Overview page) when citing specific statistics.

6. Known Limitations

  • Purposive sample: Channels and videos were chosen via keyword search and channel expansion — this is not a random sample of YouTube content.
  • Geo NLP noise: Place-name extraction uses NLP entity recognition which produces false positives (e.g., "Quran", "West", "Danielle" appear as city entities). Geographic data should be used indicatively, not as precise counts.
  • LLM classification reliability: Category assignments were generated at scale without human validation. Treat as exploratory.
  • comment_llm_labels is empty: The richer per-comment LLM label table (with identity_frame, migration_stage, stance, toxicity) has not yet been populated. These fields are reserved for a future classification run.
  • Platform bias: YouTube comment sections attract particular demographics and discourse styles. Comments are not representative of the general Canadian public.
  • Missing subscriber/view counts: The channels table does not currently store subscriber counts or channel-level view counts — only comment-derived metrics are available.
  • ETL errors: Recent ingestion runs (March 2026) have returned errors. Check the Django Admin ETL log for current pipeline status.