python - Tagging Spanish text with Unicode characters not possible with NLTK?


I'm trying to parse some Spanish sentences that contain non-ASCII characters (for example película (film), atención (attention), etc.).

I am reading the lines from a file encoded in UTF-8. Here is a sample of my script:

    # -*- coding: utf-8 -*-

    import codecs
    import nltk
    import sys
    from nltk.corpus import cess_esp as cess
    from nltk import UnigramTagger as ut
    from nltk import BigramTagger as bt

    f = codecs.open('spanish_sentences', encoding='utf-8')
    results_file = codecs.open('tagging_results', encoding='utf-8', mode='w+')

    for line in iter(f):
        output_line = "Current line contents before tagging->" + str(line.decode('utf-8', 'replace'))
        print output_line
        results_file.write(output_line.encode('utf8'))

        output_line = "Unigram tagger->"
        print output_line
        results_file.write(output_line)

        s = line.decode('utf-8', 'replace')
        output_line = tagger.uni.tag(s.split())
        print output_line
        results_file.write(str(output_line).encode('utf8'))

    f.close()
    results_file.close()

At this line:

    output_line = tagger.uni.tag(s.split())

I am getting this error:

    /usr/local/lib/python2.7/dist-packages/nltk-2.0.4-py2.7.egg/nltk/tag/sequential.py:138: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
      return self._context_to_tag.get(context)

Here is some of the output I get for a simple sentence:

    Current line contents before tagging->tengo una queja y cada que hablo a atención me dejan en la linea media hora y cortan la llamada!!
    Unigram tagger->
    [(u'tengo', 'vmip1s0'), (u'una', 'di0fs0'), (u'queja', 'ncfs000'), (u'y', 'cc'), (u'cada', 'di0cs0'), (u'que', 'pr0cn000'), (u'hablo', 'vmip1s0'), (u'a', 'sps00'), (u'atenci\xf3n', None), (u'me', 'pp1cs000'), (u'dejan', 'vmip3p0'), (u'en', 'sps00'), (u'la', 'da0fs0'), ..., (u'y', 'cc'), (u'cortan', None), (u'la', 'da0fs0'), (u'llamada!!', None)]

If I understood correctly, the process is right: I decode the line from UTF-8 to Unicode, tag it, and then encode from Unicode to UTF-8 again. I do not understand this error.

Any idea what I am doing wrong?

Thank you, Alejandro

Edit: Found the problem. Basically, the Spanish cess_esp corpus is encoded with Latin-2, so its words have to be decoded before training. See the code below to train the tagger correctly:

    # Decode every corpus word from Latin-2 to Unicode before training
    tagged_sents = ([(word.decode('Latin2'), tag) for (word, tag) in sent] for sent in cess.tagged_sents())
    tagger = ut(tagged_sents)  # training a tagger

A better approach would be to ask the CorpusReader class for the corpus encoding, so that you do not need to know it beforehand.
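The mismatch described in the edit can be reproduced without NLTK. The following is only a minimal Python 3 sketch, not the original Python 2 script: `RAW_TAGGED_SENTS`, `decode_tagged_sents` and `build_lookup` are hypothetical names, the tiny corpus stands in for `cess.tagged_sents()`, and the tag given to "atención" is an assumption made for illustration. It shows that a lookup table keyed by encoded byte strings never matches a Unicode token, while the same table built from decoded words does.

```python
# Python 3 sketch. The corpus below is a hypothetical stand-in for
# cess.tagged_sents(), with words stored as Latin-2 encoded bytes.
RAW_TAGGED_SENTS = [
    [("la".encode("latin2"), "da0fs0"),
     ("atención".encode("latin2"), "ncfs000")],  # tag assumed for illustration
]


def decode_tagged_sents(tagged_sents, encoding):
    """Return the sentences with every byte-string word decoded to text."""
    return [
        [(word.decode(encoding), tag) for (word, tag) in sent]
        for sent in tagged_sents
    ]


def build_lookup(tagged_sents):
    """Map each word to the last tag seen for it (a toy unigram table)."""
    return {word: tag for sent in tagged_sents for (word, tag) in sent}


token = "atención"  # what a decoded UTF-8 input line yields

raw_lookup = build_lookup(RAW_TAGGED_SENTS)
fixed_lookup = build_lookup(decode_tagged_sents(RAW_TAGGED_SENTS, "latin2"))

print(raw_lookup.get(token))    # None: byte-string keys never equal a text token
print(fixed_lookup.get(token))  # ncfs000
```

The encoding is passed in as a parameter so it does not have to be hard-coded. With NLTK itself, the value could presumably come from the corpus reader's `encoding(fileid)` method rather than from a literal such as `'Latin2'`.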

Maybe there is something wrong with your tagger object or with how your file is being read. I have rewritten part of your code and it runs without error:

    # -*- coding: utf-8 -*-

    import urllib2, codecs

    from nltk.corpus import cess_esp as cess
    from nltk import word_tokenize
    from nltk import UnigramTagger as ut
    from nltk import BigramTagger as bt

    tagger = ut(cess.tagged_sents())

    url = 'https://db.tt/42Lt5M5K'
    # Download the sample text and decode it to Unicode once, at input time
    fin = urllib2.urlopen(url).read().decode('utf8')
    # codecs.open encodes to UTF-8 on write
    fout = codecs.open('tagger.out', 'w', 'utf8')
    for line in fin.split('\n'):
        print >>fout, "Current line contents before tagging->", line
        print >>fout, "Unigram tagger->",
        print >>fout, tagger.tag(word_tokenize(line))
        print >>fout, ""

[out]:

