Text to Vector

21 Apr 2019

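# word_tokenize needs the Punkt tokenizer data; if it is missing, run
# nltk.download('punkt') once ('punkt_tab' on newer NLTK releases).
import nltk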

content = "The Democrats — including more than 50 freshmen — are mindful that impeachment poses political risks that could endanger the seats of moderates and their majority, as well as strengthen Mr. Trump’s hand. "

content

'The Democrats — including more than 50 freshmen — are mindful that impeachment poses political risks that could endanger the seats of moderates and their majority, as well as strengthen Mr. Trump’s hand. '

tokens = nltk.word_tokenize(content)

tokens

['The',
 'Democrats',
 '—',
 'including',
 'more',
 'than',
 '50',
 'freshmen',
 '—',
 'are',
 'mindful',
 'that',
 'impeachment',
 'poses',
 'political',
 'risks',
 'that',
 'could',
 'endanger',
 'the',
 'seats',
 'of',
 'moderates',
 'and',
 'their',
 'majority',
 ',',
 'as',
 'well',
 'as',
 'strengthen',
 'Mr.',
 'Trump',
 '’',
 's',
 'hand',
 '.']

type(tokens)

list

Why text to vector?

To do machine learning on text, we need to transform our documents into numeric vectors, because machine learning algorithms operate on numbers rather than raw strings. This step is called feature extraction or vectorization.
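
As a rough illustration of what vectorization means, here is a minimal bag-of-words sketch in plain Python. It is one simple way to turn tokens into a vector, not necessarily the approach used later. The helper names `build_vocabulary` and `vectorize` are hypothetical, and `sample_tokens` is a short slice of the token list above.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct token to a column index (tokens sorted alphabetically)."""
    distinct = sorted({token for doc in documents for token in doc})
    return {token: index for index, token in enumerate(distinct)}


def vectorize(doc, vocabulary):
    """Return a count vector: position i holds how often token i occurs in doc."""
    vector = [0] * len(vocabulary)
    for token, count in Counter(doc).items():
        if token in vocabulary:  # tokens outside the vocabulary are ignored
            vector[vocabulary[token]] = count
    return vector


# A short slice of the tokens produced by nltk.word_tokenize above.
sample_tokens = ['that', 'impeachment', 'poses', 'political', 'risks', 'that', 'could']

vocabulary = build_vocabulary([sample_tokens])
vector = vectorize(sample_tokens, vocabulary)

print(vocabulary)
print(vector)
```

Each position in the vector corresponds to one word in the vocabulary, so 'that', which appears twice, gets a count of 2 while every other word gets 1. With several documents, the vocabulary would be built over all of them so that every vector has the same length.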