ML Notes

Text-2-Dictionary

Fri 14 November 2025

title: "Text 2 Dictionary" author: "Raja CSP Raman" date: 2019-04-20 description: "-" type: technical_note draft: false

import gensim
from gensim import corpora
from pprint import pprint

# How to create a dictionary from a list of sentences?
documents = [
        """More than half of survey participants also reported clicking on a headline expecting to …

Category: gensim-samples

Fri 14 November 2025

title: "Template" author: "Rj" date: 2019-04-21 description: "-" type: technical_note draft: false

Basics

tokens

text1[0:100] - first 100 tokens (indices 0 to 99)

text2[5] - sixth token (indexing starts at 0)
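Python sequences are zero-indexed and slices exclude the end index, which is easy to miscount with tokens. A minimal sketch, using a plain list as a stand-in for an NLTK text (the `tokens` list is made up for illustration):

```python
# A plain list stands in for an NLTK Text here; the sample tokens are hypothetical.
tokens = ["Call", "me", "Ishmael", ".", "Some", "years", "ago", "-", "never", "mind"]

first_three = tokens[0:3]   # indices 0, 1, 2: three tokens, end index excluded
sixth_token = tokens[5]     # index 5 is the sixth token because counting starts at 0

print(first_three)          # ['Call', 'me', 'Ishmael']
print(sixth_token)          # years
print(len(tokens[0:100]))   # 10: a slice never runs past the end of the sequence
```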

concordance

text3.concordance('begat') - basic keyword-in-context

text1.concordance('sea', lines=100) - show other than default 25 lines

text1.concordance('sea', lines=all) - show all results

text1.concordance …

Category: textprocessing

Fri 14 November 2025

title: "Text Blob Classifier" author: "Rj" date: 2019-04-21 description: "-" type: technical_note draft: false

from textblob.classifiers import NaiveBayesClassifier

train = [
     ('I love this sandwich.', 'pos'),
     ('this is an amazing place!', 'pos'),
     ('I feel very good about these beers.', 'pos'),
     ('this is my best work.', 'pos'),
     ("what an awesome view", 'pos'),
     ('I …

Category: textprocessing

Fri 14 November 2025

title: "Text Classification - Naive Bayes - Stackoverflow Tags" author: "Rj" date: 2019-04-21 description: "-" type: technical_note draft: false

# Disclaimer: some code copied from https://towardsdatascience.com/multi-class-text-classification-model-comparison-and-selection-5eb066197568

import logging
import pandas as pd
import numpy as np
from numpy import random
import gensim
import nltk
from sklearn.model_selection import train_test_split
from …

Category: textprocessing

Fri 14 November 2025

title: "Text Decompose" author: "Rj" date: 2019-04-20 description: "List Test" type: technical_note draft: false

import requests
from bs4 import BeautifulSoup

# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup('<p>This is a slimy text and <i> I am slimer</i></p>', 'html.parser')
soup.i.decompose()
print(soup.text)

This is a slimy text and

Score: 0

Category: webreader

Fri 14 November 2025

title: "Text Diff" author: "Rj" date: 2019-04-20 description: "List Test" type: technical_note draft: false

import difflib

str1 = "I understand how customers do their choice. Difference"
str2 = "I understand how customers do their choice."

seq = difflib.SequenceMatcher(None, str1, str2)

d = seq.ratio()*100

88.65979381443299

def get_similarity(str1, str2 …
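The helper above is cut off, so the following is only a sketch of what such a function might look like, not the original. `similarity_percent` is a hypothetical name; `SequenceMatcher.ratio()` is the standard-library call already used in the note.

```python
import difflib


def similarity_percent(a: str, b: str) -> float:
    """Return the difflib similarity of two strings as a percentage (0-100)."""
    # ratio() is 2*M/T: M = matched characters, T = total characters in both strings.
    return difflib.SequenceMatcher(None, a, b).ratio() * 100


str1 = "I understand how customers do their choice. Difference"
str2 = "I understand how customers do their choice."

score = similarity_percent(str1, str2)
print(round(score, 2))                  # 88.66
print(similarity_percent(str2, str2))   # 100.0
```

Here the 43 characters of `str2` all match and the two strings hold 97 characters in total, so the ratio is 2 * 43 / 97.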

Category: basics

Fri 14 November 2025

title: "Text File 2 NLTK Text" author: "Rj" date: 2019-04-21 description: "-" type: technical_note draft: false

import nltk

f = open('canola.txt', 'r')
raw = f.read()

raw

'OTTAWA—The federal Liberals promised Wednesday to give Canada’s canola farmers much-needed financial aid to help lessen the impact of China’s decision …
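A possible variant of the read step, assuming a context manager is acceptable here: it closes the file automatically. The sample file and its contents are made up so the sketch runs anywhere, and the whitespace split is only a crude stand-in for a real tokenizer.

```python
import os
import tempfile

# Hypothetical sample file so the sketch runs anywhere; the note reads 'canola.txt'.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("Canola farmers asked for aid. Aid was promised.")

# The context manager closes the file even if read() raises.
with open(path, "r", encoding="utf-8") as f:
    raw = f.read()

tokens = raw.split()   # whitespace split: a crude stand-in for a real tokenizer
print(len(tokens))     # 8
print(tokens[:3])      # ['Canola', 'farmers', 'asked']
```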

Category: textprocessing

Fri 14 November 2025

title: "Text Index and Slicing" author: "Rj" date: 2019-04-21 description: "-" type: technical_note draft: false

import nltk

f = open('canola.txt', 'r')
raw = f.read()

raw

'OTTAWA—The federal Liberals promised Wednesday to give Canada’s canola farmers much-needed financial aid to help lessen the impact of China’s decision to …
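Since `raw` is an ordinary Python string, indexing and slicing work character by character. A small sketch on a short made-up string (not the actual file contents):

```python
# Hypothetical short string standing in for the contents of the file.
raw = "OTTAWA - The federal Liberals promised aid."

print(raw[0])      # O        first character (index 0)
print(raw[-1])     # .        last character
print(raw[:6])     # OTTAWA   characters 0 through 5
print(raw[9:12])   # The      end index 12 is excluded
```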

Category: textprocessing

Fri 14 November 2025

title: "Text Similarity" author: "Raja CSP Raman" date: 2019-04-20 description: "-" type: technical_note draft: false

import nltk, string
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

'''remove …
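The normalization step is cut off above, so this is only a guess at how the punctuation map is meant to be used: passing it to `str.translate()` deletes every mapped character. The sample sentence is made up.

```python
import string

# Map every punctuation code point to None so str.translate() deletes it.
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

sample = "Hello, world! It's 9 a.m."   # hypothetical input
cleaned = sample.lower().translate(remove_punctuation_map)

print(cleaned)           # hello world its 9 am
print(cleaned.split())   # ['hello', 'world', 'its', '9', 'am']
```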

Category: basics

Fri 14 November 2025

title: "Text Similarity Finder" author: "Raja CSP Raman" date: 2019-04-20 description: "-" type: technical_note draft: false

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def findSimilarity(param1, param2):
    documents = (
        param1,
        param2
    )
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
    cosine = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)

    print(cosine)

findSimilarity("In …
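`findSimilarity` prints a 1x2 array whose first entry is the first document compared with itself (always 1). A sketch of a variant that returns a single score instead; `similarity_score` is a hypothetical name and the sample sentences are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def similarity_score(text_a: str, text_b: str) -> float:
    """Cosine similarity of the TF-IDF vectors of two texts (hypothetical helper)."""
    tfidf_matrix = TfidfVectorizer().fit_transform([text_a, text_b])
    # Compare row 0 with row 1 only; the result is a 1x1 array.
    return float(cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0, 0])


same = similarity_score("the cat sat on the mat", "the cat sat on the mat")
different = similarity_score("the cat sat on the mat", "stock prices fell sharply")
print(round(same, 6))        # 1.0
print(round(different, 6))   # 0.0
```

Texts that share no words score 0, and identical texts score 1 up to floating-point rounding.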

Category: basics
