python 2.7 - How can I use word_tokenize in nltk and keep the spaces?
As far as I understand, the word_tokenize function in nltk takes a string representing a sentence and returns a list of all of its words:

    >>> from nltk import word_tokenize, wordpunct_tokenize
    >>> s = "Good muffins cost $3.88 in New York. Please buy me two of them.\n\nThank you.\n"
    >>> word_tokenize(s)
    ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', ...]

However, in my program it is important to keep the spaces for further computation, so I would like word_tokenize to return something like this:

    ['Good', ' ', 'muffins', ' ', 'cost', ' ', '$', '3.88', ' ', 'in', ' ', 'New', ' ', ...]

How can I change, replace, or tweak word_tokenize to do this?
Step 1: Take the string and split it on whitespace.
Step 2: Tokenize each word (each piece produced by the split in step 1) using word_tokenize, and put a ' ' after it.

    >>> import itertools
    >>> from nltk import word_tokenize
    >>> s = "Good muffins cost $3.88 in New York. Please buy me\n"
    >>> ll = [[word_tokenize(w), ' '] for w in s.split()]
    >>> list(itertools.chain(*list(itertools.chain(*ll))))
    ['Good', ' ', 'muffins', ' ', 'cost', ' ', '$', '3.88', ' ', 'in', ' ', 'New', ' ', 'York', '.', ' ', 'Please', ' ', 'buy', ' ', 'me', ' ']
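One caveat with the two steps above: s.split() throws the original whitespace away, so every gap (double spaces, tabs, newlines) comes back as a single ' ', and an extra ' ' is added after the last word. If the exact whitespace matters, a possible variation is to split with a capturing regular expression so the whitespace runs are kept as they are. The sketch below is only an illustration, not part of the original answer: tokenize_keep_spaces and simple_tokenize are hypothetical names, and simple_tokenize is a rough regex stand-in used so the example runs without NLTK installed.

```python
import re


def tokenize_keep_spaces(text, tokenize):
    """Tokenize text, keeping every run of whitespace as its own token.

    `tokenize` is any callable that turns one whitespace-free chunk into a
    list of tokens (for example nltk's word_tokenize).
    """
    tokens = []
    # The capturing group makes re.split return the separators too.
    for chunk in re.split(r'(\s+)', text):
        if not chunk:
            # re.split yields '' when the text starts or ends with whitespace.
            continue
        if chunk.isspace():
            tokens.append(chunk)             # keep the whitespace run unchanged
        else:
            tokens.extend(tokenize(chunk))   # tokenize the non-space chunk
    return tokens


def simple_tokenize(word):
    # Hypothetical stand-in for word_tokenize: numbers (with an optional
    # decimal part), runs of word characters, or single punctuation marks.
    return re.findall(r"\d+(?:\.\d+)?|\w+|[^\w\s]", word)
```

To use it with NLTK, pass word_tokenize in place of the stand-in, e.g. tokenize_keep_spaces(s, word_tokenize), assuming nltk and its tokenizer data are installed. With the stand-in, joining the returned tokens gives back the original string; with word_tokenize that may not hold in every case, because it can rewrite some tokens such as double quotes.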