ML Notes - gensim-samples

Append-To-Existing-Dictionary

Fri 14 November 2025

title: "Append To Existing Dictionary" author: "Raja CSP Raman" date: 2019-05-02 description: "-" type: technical_note draft: false

import gensim
from gensim import corpora
from pprint import pprint

# How to create a dictionary from a list of sentences?
documents = [
        """More than half of survey participants also reported clicking on a headline expecting to read a balanced news account, only to find the story was pushing an agenda.
The survey found 48 per cent of respondents struggled to distinguish between fact and falsehood, while doubts about the authenticity of news stories had jumped 10 per cent in the past year.
The Canadian Journalism Foundation says the survey findings are troubling, particularly in the run-up to a federal election."""
    ]

# Tokenize (split) each document into words on whitespace
tokens = [doc.split() for doc in documents]

# Create dictionary
dictionary = corpora.Dictionary(tokens)

print(dictionary)

Dictionary(60 unique tokens: ['10', '48', 'Canadian', 'Foundation', 'Journalism']...)

print(type(dictionary))

<class 'gensim.corpora.dictionary.Dictionary'>

# Show the word to id map
print(dictionary.token2id)

{'10': 0, '48': 1, 'Canadian': 2, 'Foundation': 3, 'Journalism': 4, 'More': 5, 'The': 6, 'a': 7, 'about': 8, 'account,': 9, 'agenda.': 10, 'also': 11, 'an': 12, 'and': 13, 'are': 14, 'authenticity': 15, 'balanced': 16, 'between': 17, 'cent': 18, 'clicking': 19, 'distinguish': 20, 'doubts': 21, 'election.': 22, 'expecting': 23, 'fact': 24, 'falsehood,': 25, 'federal': 26, 'find': 27, 'findings': 28, 'found': 29, 'had': 30, 'half': 31, 'headline': 32, 'in': 33, 'jumped': 34, 'news': 35, 'of': 36, 'on': 37, 'only': 38, 'participants': 39, 'particularly': 40, 'past': 41, 'per': 42, 'pushing': 43, 'read': 44, 'reported': 45, 'respondents': 46, 'run-up': 47, 'says': 48, 'stories': 49, 'story': 50, 'struggled': 51, 'survey': 52, 'than': 53, 'the': 54, 'to': 55, 'troubling,': 56, 'was': 57, 'while': 58, 'year.': 59}

# Add a new document to the existing dictionary
documents_2 = [
        """The survey, conducted over a five-day period last month, sampled more than 2,300 Canadians.
        Online polls cannot be assigned a margin of error, according to the Marketing Research and Intelligence Association."""
    ]

texts_2 = [doc.split() for doc in documents_2]
dictionary.add_documents(texts_2)

print(dictionary)

Dictionary(83 unique tokens: ['10', '48', 'Canadian', 'Foundation', 'Journalism']...)

# Show the word to id map
print(dictionary.token2id)

{'10': 0, '48': 1, 'Canadian': 2, 'Foundation': 3, 'Journalism': 4, 'More': 5, 'The': 6, 'a': 7, 'about': 8, 'account,': 9, 'agenda.': 10, 'also': 11, 'an': 12, 'and': 13, 'are': 14, 'authenticity': 15, 'balanced': 16, 'between': 17, 'cent': 18, 'clicking': 19, 'distinguish': 20, 'doubts': 21, 'election.': 22, 'expecting': 23, 'fact': 24, 'falsehood,': 25, 'federal': 26, 'find': 27, 'findings': 28, 'found': 29, 'had': 30, 'half': 31, 'headline': 32, 'in': 33, 'jumped': 34, 'news': 35, 'of': 36, 'on': 37, 'only': 38, 'participants': 39, 'particularly': 40, 'past': 41, 'per': 42, 'pushing': 43, 'read': 44, 'reported': 45, 'respondents': 46, 'run-up': 47, 'says': 48, 'stories': 49, 'story': 50, 'struggled': 51, 'survey': 52, 'than': 53, 'the': 54, 'to': 55, 'troubling,': 56, 'was': 57, 'while': 58, 'year.': 59, '2,300': 60, 'Association.': 61, 'Canadians.': 62, 'Intelligence': 63, 'Marketing': 64, 'Online': 65, 'Research': 66, 'according': 67, 'assigned': 68, 'be': 69, 'cannot': 70, 'conducted': 71, 'error,': 72, 'five-day': 73, 'last': 74, 'margin': 75, 'month,': 76, 'more': 77, 'over': 78, 'period': 79, 'polls': 80, 'sampled': 81, 'survey,': 82}
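The output above shows that `add_documents` keeps every existing id and gives the new tokens the next free ids (60 to 82), in sorted order. The following is a minimal pure-Python sketch of that behaviour. `TinyDictionary` and the two short sample sentences are hypothetical, written only for illustration; this is not gensim's implementation and it models nothing beyond id assignment.

```python
class TinyDictionary:
    """Toy token-to-id map that mimics the append behaviour shown above."""

    def __init__(self, documents=None):
        self.token2id = {}
        if documents is not None:
            self.add_documents(documents)

    def add_documents(self, documents):
        # For each document, unseen tokens get the next free ids in sorted order.
        # Tokens that already have an id keep it.
        for tokens in documents:
            unseen = sorted({t for t in tokens if t not in self.token2id})
            for token in unseen:
                self.token2id[token] = len(self.token2id)

    def __len__(self):
        return len(self.token2id)


# Assumption: shortened sample sentences, not the full documents used above.
first = ["The survey found doubts about news stories".split()]
second = ["The survey, conducted last month, sampled more than 2,300 Canadians.".split()]

toy = TinyDictionary(first)
ids_before = dict(toy.token2id)

toy.add_documents(second)

# Every token from the first batch still maps to the same id.
unchanged = all(toy.token2id[t] == i for t, i in ids_before.items())
print(len(ids_before), len(toy), unchanged)
```

Because whitespace splitting keeps punctuation, `survey` and `survey,` count as different tokens here, just as they do in the output above.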

Score: 10

Category: gensim-samples

Bag-Of-Words

Fri 14 November 2025

title: "Bag of Words" author: "Raja CSP Raman" date: 2019-04-20 description: "-" type: technical_note draft: false

from gensim.utils import simple_preprocess
from gensim import corpora
from pprint import pprint

contents = [
    "More than half of survey participants also reported clicking on a headline expecting to read a balanced news account, only to …

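The excerpt is cut off before the bag-of-words step itself. As a rough illustration of the idea, not a reconstruction of the elided code, the sketch below maps each token to an integer id and counts occurrences per document, producing sparse `(token_id, count)` pairs in the same shape that gensim's `Dictionary.doc2bow` returns. The helpers `tokenize`, `build_token2id` and `doc2bow` and the sample sentences are assumptions made for this example; `tokenize` only loosely approximates `simple_preprocess`.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphabetic tokens of 2+ letters,
    # a loose stand-in for gensim's simple_preprocess.
    return re.findall(r"[a-z]{2,}", text.lower())


def build_token2id(tokenized_docs):
    # Assign ids document by document, new tokens in sorted order.
    token2id = {}
    for tokens in tokenized_docs:
        for token in sorted(set(tokens)):
            token2id.setdefault(token, len(token2id))
    return token2id


def doc2bow(tokens, token2id):
    # Count known tokens and return sorted (token_id, count) pairs.
    counts = Counter(t for t in tokens if t in token2id)
    return sorted((token2id[t], n) for t, n in counts.items())


# Hypothetical sample sentences, not the truncated text above.
contents = [
    "The survey found the news troubling",
    "News stories and the survey",
]
tokenized = [tokenize(doc) for doc in contents]
token2id = build_token2id(tokenized)
corpus = [doc2bow(tokens, token2id) for tokens in tokenized]

print(token2id)
print(corpus)
```

In the first document `the` occurs twice, so its pair carries a count of 2; every other pair has a count of 1.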
Category: gensim-samples

Basic-Vector

Fri 14 November 2025

title: "Basic Vector" author: "Raja CSP Raman" date: 2019-05-02 description: "-" type: technical_note draft: false

from gensim import corpora, models, similarities

corpus = [
    [(0, 1.0), (1, 1.0), (2, 1.0)],
    [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
    [(1, 1 …
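Each document in `corpus` is a sparse vector: a list of `(feature_id, weight)` pairs, with any id that is not listed taken as zero. As a small illustration of working with that representation, the sketch below computes the cosine similarity of the two complete rows shown above in plain Python. The helper name `cosine` is made up for this example and is not part of gensim.

```python
import math


def cosine(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as (id, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    # Only ids present in both vectors contribute to the dot product.
    dot = sum(weight * b.get(idx, 0.0) for idx, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)


doc_0 = [(0, 1.0), (1, 1.0), (2, 1.0)]
doc_1 = [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)]

print(round(cosine(doc_0, doc_1), 4))
```

The two rows share only feature 2, so the similarity is 1 / (sqrt(3) * sqrt(6)), roughly 0.2357.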

Category: gensim-samples

Bow-Counts

Fri 14 November 2025

title: "Bag of word Counts" author: "Raja CSP Raman" date: 2019-05-02 description: "-" type: technical_note draft: false

from gensim.utils import simple_preprocess
from gensim import corpora
from pprint import pprint

contents = [
    "The Star obtained a copy of the email outlining the latest in a series of Progressive Conservative provincial budget cuts …

Category: gensim-samples

Content-Summary

Fri 14 November 2025

title: "Content Summary" author: "Raja CSP Raman" date: 2019-04-20 description: "-" type: technical_note draft: false

# Note: gensim.summarization was removed in gensim 4.0; this import needs gensim < 4.0
from gensim.summarization import summarize, keywords
from pprint import pprint
from smart_open import smart_open

text = " ".join((line for line in smart_open('sample.txt', encoding='utf-8')))

text

'More than half of survey participants also reported clicking on …

Category: gensim-samples

File-To-Dictionary

Fri 14 November 2025

title: "File To Dictionary" author: "Raja CSP Raman" date: 2019-04-20 description: "-" type: technical_note draft: false

import gensim
from gensim import corpora
from gensim.utils import simple_preprocess
from smart_open import smart_open
import os

dictionary = corpora.Dictionary()

# Create a gensim dictionary from a single text file
dictionary = corpora.Dictionary(simple_preprocess(line, deacc=True …

Category: gensim-samples

Find-Odd

Fri 14 November 2025

title: "Find Odd" author: "Raja CSP Raman" date: 2019-05-02 description: "-" type: technical_note draft: false

import gensim.downloader as api

fasttext_model300 = api.load('fasttext-wiki-news-subwords-300')

print(fasttext_model300.doesnt_match(['one', 'two', 'eleven', 'thirty', 'tennis']))

tennis


/Users/rajacsp/anaconda3/envs/py36/lib/python3.6/site-packages/gensim/models/keyedvectors.py:858: FutureWarning: arrays to …
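`doesnt_match` returns the word whose vector fits least well with the rest of the list. One common way to do that, sketched here in plain Python, is to length-normalize each vector, average them, and pick the word with the lowest dot product against that mean. The 2-D vectors below are invented for illustration and are not fastText embeddings; `unit` and `odd_one_out` are hypothetical helpers, not the gensim implementation.

```python
import math


def unit(vec):
    # Scale a vector to length 1 so only its direction matters.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def odd_one_out(vectors):
    """Return the key whose unit vector has the lowest dot product with the mean."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(units.values())))
    mean = [sum(u[i] for u in units.values()) / len(units) for i in range(dim)]
    return min(units, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Made-up toy vectors: the number words point one way, "tennis" another.
toy_vectors = {
    "one": [1.0, 0.1],
    "two": [0.9, 0.2],
    "eleven": [1.0, 0.0],
    "thirty": [0.8, 0.1],
    "tennis": [0.1, 1.0],
}

print(odd_one_out(toy_vectors))
```

With these toy vectors the four number words dominate the mean direction, so `tennis` scores lowest and is returned.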

Category: gensim-samples

Text-2-Dictionary

Fri 14 November 2025

title: "Text 2 Dictionary" author: "Raja CSP Raman" date: 2019-04-20 description: "-" type: technical_note draft: false

import gensim
from gensim import corpora
from pprint import pprint

# How to create a dictionary from a list of sentences?
documents = [
        """More than half of survey participants also reported clicking on a headline expecting to …

Category: gensim-samples

Tokenize Sentence

Fri 14 November 2025

title: "Tokenize Sentences" author: "Raja CSP Raman" date: 2019-05-02 description: "-" type: technical_note draft: false source: https://tedboy.github.io/nlps/gensim_tutorial/tut1.html

from gensim import corpora

documents = [
    "The traditional paradigm just seems safer: be firm and a little distant from your employees", 

"The people who work for you should …

Category: gensim-samples

Append-To-Existing-Dictionary

Fri 14 November 2025

Bag-Of-Words

Fri 14 November 2025

Basic-Vector

Fri 14 November 2025

Bow-Counts

Fri 14 November 2025

Content-Summary

Fri 14 November 2025

File-To-Dictionary

Fri 14 November 2025

Find-Odd

Fri 14 November 2025

Text-2-Dictionary

Fri 14 November 2025

Tokenize Sentence

Fri 14 November 2025
