ML Notes - textprocessing

Print-Filenames

Fri 14 November 2025

title: "Print Filenames" author: "Rj" date: 2019-04-21 description: "-" type: technical_note draft: false

from os import listdir
from os.path import isfile, join

path = '/tmp'

onlyfiles = [f for f in listdir(path) if isfile(join(path, f))]

onlyfiles

['.BBE72B41371180178E084EEAF106AED4F350939DB95D3516864A1CC62E7AE82F']

for fle in onlyfiles:
    print(fle)

.BBE72B41371180178E084EEAF106AED4F350939DB95D3516864A1CC62E7AE82F
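The listing above only looks at the top level of `path`. A minimal sketch of a recursive variant built on `os.walk` follows; the helper name `list_files_recursive` and the throwaway demo directory are assumptions made for illustration and are not part of the original note.

```python
import os
import tempfile


def list_files_recursive(root):
    """Return sorted paths, relative to root, of every non-directory entry under root."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            found.append(os.path.relpath(full_path, root))
    return sorted(found)


# Demo on a temporary directory so the result does not depend on what is in /tmp.
with tempfile.TemporaryDirectory() as demo_root:
    os.makedirs(os.path.join(demo_root, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(demo_root, rel), "w") as handle:
            handle.write("x")
    demo_files = list_files_recursive(demo_root)

for rel_path in demo_files:
    print(rel_path)
```

Calling `list_files_recursive('/tmp')` would give the recursive counterpart of `onlyfiles`, with subdirectory names included in each returned path.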

Score …

Category: textprocessing

Fri 14 November 2025

title: "Text Classification - Naive Bayes - Product Summary" author: "Rj" date: 2019-04-21 description: "-" type: technical_note draft: false

# Disclaimer: some code copied from https://towardsdatascience.com/multi-class-text-classification-model-comparison-and-selection-5eb066197568

import logging
import pandas as pd
import numpy as np
from numpy import random
import gensim
import nltk
from sklearn.model_selection import train_test_split
from …

Category: textprocessing

Fri 14 November 2025

title: "Regexp Stemmer" author: "Rj" date: 2019-04-21 description: "-" type: technical_note draft: false

from nltk.stem import RegexpStemmer

re_stemmer = RegexpStemmer("ing$|s$|e$|able$", min=7)

words = [
    "wheels",
    "breaking",
    "thrones",
    "breakable"
]

words

['wheels', 'breaking', 'thrones', 'breakable']

result = [re_stemmer.stem(word) for word in words]

result

['wheels', 'break', 'throne', 'break']

As the …
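The result can be reproduced with the standard `re` module, which makes the role of `min` easier to see: a word shorter than `min` is returned untouched, and otherwise every match of the pattern is removed. This is a sketch of that behaviour, not NLTK's source; the function name `regexp_stem` and its `min_length` parameter are hypothetical.

```python
import re


def regexp_stem(word, pattern, min_length=0):
    """Remove every match of pattern from word, unless word is shorter than min_length."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)


SUFFIXES = r"ing$|s$|e$|able$"
words = ["wheels", "breaking", "thrones", "breakable"]

# "wheels" has 6 characters, below the threshold of 7, so it is left alone.
stems = [regexp_stem(word, SUFFIXES, min_length=7) for word in words]
print(stems)

# Without the length threshold the trailing "s" is stripped as well.
print(regexp_stem("wheels", SUFFIXES))
```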

Category: textprocessing

Fri 14 November 2025

title: "Search Text" author: "Rj" date: 2019-04-21 description: "-" type: technical_note draft: false

from nltk.book import text1

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby …

Category: textprocessing

Fri 14 November 2025

title: "Simple Text Processing" author: "Rj" date: 2019-04-20 description: "-" type: technical_note draft: false

import re
from nltk.tokenize import word_tokenize
from collections import Counter
from nltk.corpus import stopwords

text = ("""The cat is in the box. The cat likes the box. The box is over the cat.""")

tokens = [w for …
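The imports suggest a tokenize, filter, count pipeline. A standard-library sketch of that kind of pipeline on the same sentence is shown here; it substitutes `re.findall` for `word_tokenize` and a small hand-written `STOPWORDS` set for NLTK's stopword list, so both are assumptions and not the note's actual code.

```python
import re
from collections import Counter

# Hypothetical stand-in for NLTK's English stopword list (assumption, tiny subset).
STOPWORDS = {"the", "is", "in", "over"}

text = "The cat is in the box. The cat likes the box. The box is over the cat."

# Lowercase the text and keep runs of letters; a rough substitute for word_tokenize.
tokens = re.findall(r"[a-z]+", text.lower())

# Drop stopwords, then count what remains.
content_tokens = [token for token in tokens if token not in STOPWORDS]
counts = Counter(content_tokens)

print(counts.most_common(2))
```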

Category: textprocessing

Fri 14 November 2025

title: "Snowball Stemmer" author: "Rj" date: 2019-04-21 description: "-" type: technical_note draft: false

from nltk.stem.snowball import SnowballStemmer

words = [
    "hunting",
    "bunnies",
    "thinking"
]

words

['hunting', 'bunnies', 'thinking']

stemmer = SnowballStemmer("english")

result = [stemmer.stem(word) for word in words]

result

['hunt', 'bunni', 'think']

Score: 5

Category: textprocessing

Fri 14 November 2025

title: "Speech 2 Text" author: "Rj" date: 2019-04-21 description: "-" type: technical_note draft: false

import speech_recognition as sr

def startpy():

    # obtain audio from the microphone
    r = sr.Recognizer()
    d = ''
    while d != 'exit' and d != 'quit':
        with sr.Microphone() as source:
            print("Say something!")
            audio = r.listen(source)

    # recognize speech using Google …

Category: textprocessing

Fri 14 November 2025

title: "Stemmer with Stopwords" author: "Rj" date: 2019-04-21 description: "-" type: technical_note draft: false

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

print(stemmer.stem("having"))

have

stemmer2 = SnowballStemmer("english", ignore_stopwords=True)

print(stemmer2.stem("having"))

having
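The effect of `ignore_stopwords` can be pictured as a guard placed in front of the stemmer: if the word is a stopword it is returned as is, otherwise it is stemmed. The sketch below illustrates that idea only; `toy_stem` is a hypothetical suffix stripper (not the Snowball algorithm) and `TOY_STOPWORDS` is an illustrative set, not NLTK's list.

```python
def stem_unless_stopword(word, stem, stopwords):
    """Return word unchanged if it is a stopword, otherwise return stem(word)."""
    return word if word in stopwords else stem(word)


def toy_stem(word):
    """Hypothetical toy stemmer: drop a trailing 'ing'. Not the Snowball algorithm."""
    return word[:-3] if word.endswith("ing") else word


# Illustrative subset only (assumption), not NLTK's stopword list.
TOY_STOPWORDS = {"having", "being", "doing"}

protected = stem_unless_stopword("having", toy_stem, TOY_STOPWORDS)
stemmed = stem_unless_stopword("hunting", toy_stem, TOY_STOPWORDS)
print(protected, stemmed)
```

Passing an empty set as `stopwords` turns the guard off, which corresponds to the default stemmer shown first.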

Score: 5

Category: textprocessing

Fri 14 November 2025

title: "Stemming and Lemmatization" author: "Rj" date: 2019-04-21 description: "-" type: technical_note draft: false

Stemming and lemmatization are text normalization techniques in natural language processing (NLP) used to prepare text, words, and documents for further processing.
Text normalization is sometimes called word normalization.
Stemming is the process of keeping …

Category: textprocessing

Fri 14 November 2025

title: "Template" author: "Rj" date: 2019-04-21 description: "-" type: technical_note draft: false

Basics

tokens

text1[0:100] - first 100 tokens (indices 0 through 99)

text2[5] - sixth token (index 5; indexing starts at 0)
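Python slices exclude the end index and indexing starts at zero, which is easy to check on a plain list. The list below is a stand-in for an NLTK text object (an assumption made so the example runs without any corpus download).

```python
# Stand-in for a tokenized text: tok0, tok1, ..., tok199.
tokens = [f"tok{i}" for i in range(200)]

first_hundred = tokens[0:100]   # indices 0 through 99; index 100 is excluded
print(len(first_hundred))
print(first_hundred[-1])

# Index 5 is the sixth element because counting starts at 0.
print(tokens[5])
```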

concordance

text3.concordance('begat') - basic keyword-in-context

text1.concordance('sea', lines=100) - show other than default 25 lines

text1.concordance('sea', lines=all) - show all results

text1.concordance …
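To make the keyword-in-context idea concrete, here is a minimal pure-Python sketch that collects a few words on either side of each match. It is not NLTK's implementation; the function name `keyword_in_context`, the `window` parameter, the bracket formatting, and the sample sentence are all assumptions for illustration.

```python
def keyword_in_context(tokens, keyword, window=3, lines=25):
    """Return up to `lines` context strings for keyword (case-insensitive match).

    Pass lines=None to return every match.
    """
    target = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append(f"{left} [{token}] {right}".strip())
    return hits if lines is None else hits[:lines]


sample = "the sea was calm and the ship left for the open sea at dawn".split()
results = keyword_in_context(sample, "sea", window=2)
for line in results:
    print(line)
```

Here `lines` caps the number of rows returned, mirroring the role it plays in the cheatsheet entries above.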

Category: textprocessing
