Append-To-Existing-Dictionary

Sat 17 May 2025

title: "Append To Existing Dictionary"
author: "Raja CSP Raman"
date: 2019-05-02
description: "-"
type: technical_note
draft: false


import gensim
from gensim import corpora
from pprint import pprint
# How to create a dictionary from a list of sentences?
documents = [
        """More than half of survey participants also reported clicking on a headline expecting to read a balanced news account, only to find the story was pushing an agenda.
The survey found 48 per cent of respondents struggled to distinguish between fact and falsehood, while doubts about the authenticity of news stories had jumped 10 per cent in the past year.
The Canadian Journalism Foundation says the survey findings are troubling, particularly in the run-up to a federal election."""
    ]
# Tokenize (split) the sentences into words
tokens = [doc.split() for doc in documents]
# Create dictionary
dictionary = corpora.Dictionary(tokens)
print(dictionary)
Dictionary(60 unique tokens: ['10', '48', 'Canadian', 'Foundation', 'Journalism']...)
print(type(dictionary))
<class 'gensim.corpora.dictionary.Dictionary'>
# Show the word to id map
print(dictionary.token2id)
{'10': 0, '48': 1, 'Canadian': 2, 'Foundation': 3, 'Journalism': 4, 'More': 5, 'The': 6, 'a': 7, 'about': 8, 'account,': 9, 'agenda.': 10, 'also': 11, 'an': 12, 'and': 13, 'are': 14, 'authenticity': 15, 'balanced': 16, 'between': 17, 'cent': 18, 'clicking': 19, 'distinguish': 20, 'doubts': 21, 'election.': 22, 'expecting': 23, 'fact': 24, 'falsehood,': 25, 'federal': 26, 'find': 27, 'findings': 28, 'found': 29, 'had': 30, 'half': 31, 'headline': 32, 'in': 33, 'jumped': 34, 'news': 35, 'of': 36, 'on': 37, 'only': 38, 'participants': 39, 'particularly': 40, 'past': 41, 'per': 42, 'pushing': 43, 'read': 44, 'reported': 45, 'respondents': 46, 'run-up': 47, 'says': 48, 'stories': 49, 'story': 50, 'struggled': 51, 'survey': 52, 'than': 53, 'the': 54, 'to': 55, 'troubling,': 56, 'was': 57, 'while': 58, 'year.': 59}
# Add a new document to the existing dictionary
documents_2 = [
        """The survey, conducted over a five-day period last month, sampled more than 2,300 Canadians.
        Online polls cannot be assigned a margin of error, according to the Marketing Research and Intelligence Association."""
    ]
texts_2 = [doc.split() for doc in documents_2]
dictionary.add_documents(texts_2)
print(dictionary)
Dictionary(83 unique tokens: ['10', '48', 'Canadian', 'Foundation', 'Journalism']...)
# Show the word to id map
print(dictionary.token2id)
{'10': 0, '48': 1, 'Canadian': 2, 'Foundation': 3, 'Journalism': 4, 'More': 5, 'The': 6, 'a': 7, 'about': 8, 'account,': 9, 'agenda.': 10, 'also': 11, 'an': 12, 'and': 13, 'are': 14, 'authenticity': 15, 'balanced': 16, 'between': 17, 'cent': 18, 'clicking': 19, 'distinguish': 20, 'doubts': 21, 'election.': 22, 'expecting': 23, 'fact': 24, 'falsehood,': 25, 'federal': 26, 'find': 27, 'findings': 28, 'found': 29, 'had': 30, 'half': 31, 'headline': 32, 'in': 33, 'jumped': 34, 'news': 35, 'of': 36, 'on': 37, 'only': 38, 'participants': 39, 'particularly': 40, 'past': 41, 'per': 42, 'pushing': 43, 'read': 44, 'reported': 45, 'respondents': 46, 'run-up': 47, 'says': 48, 'stories': 49, 'story': 50, 'struggled': 51, 'survey': 52, 'than': 53, 'the': 54, 'to': 55, 'troubling,': 56, 'was': 57, 'while': 58, 'year.': 59, '2,300': 60, 'Association.': 61, 'Canadians.': 62, 'Intelligence': 63, 'Marketing': 64, 'Online': 65, 'Research': 66, 'according': 67, 'assigned': 68, 'be': 69, 'cannot': 70, 'conducted': 71, 'error,': 72, 'five-day': 73, 'last': 74, 'margin': 75, 'month,': 76, 'more': 77, 'over': 78, 'period': 79, 'polls': 80, 'sampled': 81, 'survey,': 82}
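
The output above shows two things: ids 0 to 59 are unchanged after `add_documents`, and the 23 new tokens receive ids 60 to 82 in sorted order. The following is a minimal pure-Python sketch of that behaviour. It is a simplified stand-in written to match the pattern seen in the output, not gensim's actual implementation, and the function name and toy sentences are hypothetical.

```python
def add_documents(token2id, documents):
    """Assign ids to unseen tokens, one tokenized document at a time.

    Simplified stand-in (an assumption, not gensim's code): within each
    document, unseen tokens are taken in sorted order and given the next
    free id, so ids that already exist are never changed.
    """
    for doc in documents:
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


# Hypothetical toy data, much smaller than the news text above
token2id = add_documents({}, [["the", "survey", "found", "the"]])
ids_before = dict(token2id)
print(token2id)  # {'found': 0, 'survey': 1, 'the': 2}

# Appending a second batch only adds ids for tokens not seen before
add_documents(token2id, [["the", "polls", "cannot", "survey"]])
print(token2id)  # {'found': 0, 'survey': 1, 'the': 2, 'cannot': 3, 'polls': 4}
```

Because existing ids stay stable, any bag-of-words vectors built before the update remain valid afterwards.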

Score: 10

Category: gensim-samples


Bag-Of-Words

Sat 17 May 2025

title: "Bag of Words"
author: "Raja CSP Raman"
date: 2019-04-20
description: "-"
type: technical_note
draft: false


from gensim.utils import simple_preprocess
from gensim import corpora
from pprint import pprint
contents = [
    "More than half of survey participants also reported clicking on a headline expecting to read a balanced news account, only to …

Category: gensim-samples

Basic-Vector

Sat 17 May 2025

title: "Basic Vector"
author: "Raja CSP Raman"
date: 2019-05-02
description: "-"
type: technical_note
draft: false


from gensim import corpora, models, similarities
corpus = [
    [(0, 1.0), (1, 1.0), (2, 1.0)],
    [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
    [(1, 1 …

Category: gensim-samples

Bow-Counts

Sat 17 May 2025

title: "Bag of Word Counts"
author: "Raja CSP Raman"
date: 2019-05-02
description: "-"
type: technical_note
draft: false


from gensim.utils import simple_preprocess
from gensim import corpora
from pprint import pprint
contents = [
    "The Star obtained a copy of the email outlining the latest in a series of Progressive Conservative provincial budget cuts …

Category: gensim-samples

Content-Summary

Sat 17 May 2025

title: "Content Summary"
author: "Raja CSP Raman"
date: 2019-04-20
description: "-"
type: technical_note
draft: false


from gensim.summarization import summarize, keywords
from pprint import pprint
from smart_open import smart_open
text = " ".join((line for line in smart_open('sample.txt', encoding='utf-8')))
text
'More than half of survey participants also reported clicking on …

Category: gensim-samples

File-To-Dictionary

Sat 17 May 2025

title: "File To Dictionary"
author: "Raja CSP Raman"
date: 2019-04-20
description: "-"
type: technical_note
draft: false


import gensim
from gensim import corpora
from gensim.utils import simple_preprocess
from smart_open import smart_open
import os
dictionary = corpora.Dictionary()
# Create a gensim dictionary from a single text file
dictionary = corpora.Dictionary(simple_preprocess(line, deacc=True …

Category: gensim-samples

Find-Odd

Sat 17 May 2025

title: "Find Odd"
author: "Raja CSP Raman"
date: 2019-05-02
description: "-"
type: technical_note
draft: false


import gensim.downloader as api
fasttext_model300 = api.load('fasttext-wiki-news-subwords-300')
print(fasttext_model300.doesnt_match(['one', 'two', 'eleven', 'thirty', 'tennis']))  
tennis
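
To illustrate why `tennis` comes out as the odd word, here is a minimal pure-Python sketch of the usual approach: normalize each word vector, average them, and return the word least similar (by cosine) to that average. The helper names and the 2-D toy vectors are hypothetical stand-ins; the real model uses 300-dimensional fastText vectors, and this is a sketch of the idea, not gensim's code.

```python
import math


def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def odd_one_out(vectors):
    """Return the word whose vector is least similar to the group mean."""
    words = list(vectors)
    units = [unit(vectors[w]) for w in words]
    dim = len(units[0])
    # Mean of the unit vectors, itself normalized to unit length
    mean = unit([sum(u[i] for u in units) / len(units) for i in range(dim)])
    # Cosine similarity of each word to the mean (dot product of unit vectors)
    sims = {w: sum(a * b for a, b in zip(u, mean)) for w, u in zip(words, units)}
    return min(sims, key=sims.get)


# Hypothetical 2-D vectors: the number words point one way, "tennis" another
toy_vectors = {
    "one": [1.0, 0.1],
    "two": [0.9, 0.2],
    "eleven": [1.0, 0.0],
    "thirty": [0.8, 0.1],
    "tennis": [0.1, 1.0],
}
print(odd_one_out(toy_vectors))  # tennis
```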


Category: gensim-samples

Text-2-Dictionary

Sat 17 May 2025

title: "Text 2 Dictionary"
author: "Raja CSP Raman"
date: 2019-04-20
description: "-"
type: technical_note
draft: false


import gensim
from gensim import corpora
from pprint import pprint
# How to create a dictionary from a list of sentences?
documents = [
        """More than half of survey participants also reported clicking on a headline expecting to …

Category: gensim-samples

Tokenize Sentence

Sat 17 May 2025

title: "Tokenize Sentences"
author: "Raja CSP Raman"
date: 2019-05-02
description: "-"
type: technical_note
draft: false
source: https://tedboy.github.io/nlps/gensim_tutorial/tut1.html


from gensim import corpora
documents = [
    "The traditional paradigm just seems safer: be firm and a little distant from your employees", 

"The people who work for you should …

Category: gensim-samples
