python - Tagging Spanish text with Unicode characters not possible with NLTK?

- August 15, 2014

I'm trying to parse some Spanish sentences that contain non-ASCII characters (for example, película (film), atención (attention), etc.).

I am reading lines from a file that is encoded in UTF-8. Here is a sample of my script:

  # -*- coding: utf-8 -*-
  import codecs
  import sys
  import nltk
  from nltk.corpus import cess_esp as cess
  from nltk import UnigramTagger as ut
  from nltk import BigramTagger as bt

  f = codecs.open('...', encoding='utf-8')
  results_file = codecs.open('...', encoding='utf-8', mode='w+')

  for line in iter(f):
      output_line = "Current line contents before tagging->" + str(line.decode('utf-8', 'replace'))
      print output_line
      results_file.write(output_line.encode('utf8'))

      output_line = "Unigram tagger->"
      print output_line
      results_file.write(output_line)

      s = line.decode('utf-8', 'replace')
      output_line = tagger.uni.tag(s.split())
      print output_line
      results_file.write(str(output_line).encode('utf8'))

  f.close()
  results_file.close()

On this line:

  output_line = tagger.uni.tag(s.split())

I am getting this error:

  /usr/local/lib/python2.7/dist-packages/nltk-2.0.4-py2.7.egg/nltk/tag/sequential.py:138: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
    return self._context_to_tag.get(context)

Here is the output I get for a simple sentence:

  Current line contents before tagging->Tengo una queja y cada que hablo a atención me dejan en la línea media hora y cortan la llamada!!
  Unigram tagger->
  [(u'Tengo', 'vmip1s0'), (u'una', 'di0fs0'), (u'queja', 'ncfs000'), (u'y', 'cc'), (u'cada', 'di0cs0'), (u'que', 'pr0cn000'), (u'hablo', 'vmip1s0'), (u'a', 'sps00'), (u'atenci\xf3n', None), (u'me', 'pp1cs000'), (u'dejan', 'vmip3p0'), (u'en', 'sps00'), (u'la', 'da0fs0'), (u'l\xednea', ...), ..., (u'y', 'cc'), (u'cortan', None), (u'la', 'da0fs0'), (u'llamada!!', None)]
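For readers wondering what the warning and the `None` tags have in common, here is a minimal sketch of the kind of mismatch involved. It is written for Python 3 rather than the Python 2.7 used in the question. The `model` dictionary is a made-up stand-in for a tagger's lookup table, its tags are illustrative only, and the Latin-1 byte keys are an assumption about how the training words were stored.

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-in for a tagger's word -> tag table whose keys are
# still raw Latin-1 bytes (an assumption made for illustration only).
model = {
    u"atención".encode("latin-1"): "ncfs000",
    u"me".encode("latin-1"): "pp1cs000",
}

token = u"atención"  # text as it would look after decoding a UTF-8 file

# The same word has different byte representations in the two encodings.
utf8_bytes = token.encode("utf-8")      # b'atenci\xc3\xb3n'
latin1_bytes = token.encode("latin-1")  # b'atenci\xf3n'

# Looking up text in a table keyed by bytes finds nothing ...
miss = model.get(token)

# ... while a table whose keys were decoded to text first does match.
decoded_model = {key.decode("latin-1"): tag for key, tag in model.items()}
hit = decoded_model.get(token)

print(utf8_bytes != latin1_bytes, miss, hit)
```

Python 2 implicitly coerces ASCII-only byte strings when comparing them with Unicode strings, which is presumably why unaccented words such as `me` still receive a tag there while accented ones come back as `None`.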

If I understand it correctly, the process is right: I decode the lines from UTF-8 to Unicode, tag them, and then encode from Unicode back to UTF-8. I do not understand this error.

Any idea what I am doing wrong?

Thank you, Alejandro

Edit: Found the problem. Basically, the Spanish cess_esp corpus is encoded in Latin-2. See the code below for how I was able to train the tagger correctly:

  tagged_sents = ([(word.decode('Latin2'), tag) for (word, tag) in sent] for sent in cess.tagged_sents())
  tagger = ut(tagged_sents)  # training a tagger

A better approach would be to ask the corpus class for its encoding, so that you do not need to know it beforehand.
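The idea of decoding the corpus words before training can be sketched without NLTK. The snippet below is a Python 3 illustration under stated assumptions: `RAW_SENTS` is a made-up miniature corpus, and `decode_sents` and `train_unigram` are hypothetical helpers, not part of NLTK. The toy model simply remembers the most frequent tag per word.

```python
# -*- coding: utf-8 -*-
from collections import Counter, defaultdict

# Made-up miniature corpus: tagged sentences whose words are Latin-1 bytes.
RAW_SENTS = [
    [(b"la", "da0fs0"), (b"atenci\xf3n", "ncfs000")],
    [(b"la", "da0fs0"), (b"l\xednea", "ncfs000")],
]


def decode_sents(sents, encoding):
    """Return the tagged sentences with every word decoded to text."""
    return [[(word.decode(encoding), tag) for word, tag in sent] for sent in sents]


def train_unigram(sents):
    """Map each word to its most frequent tag (a toy unigram model)."""
    counts = defaultdict(Counter)
    for sent in sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


model = train_unigram(decode_sents(RAW_SENTS, "latin-1"))
tags = [model.get(word) for word in u"la atención".split()]
print(tags)

# Latin-1 and Latin-2 decode byte 0xF3 identically, but not byte 0xF1.
same_o = b"\xf3".decode("latin-1") == b"\xf3".decode("latin2")
same_n = b"\xf1".decode("latin-1") == b"\xf1".decode("latin2")
print(same_o, same_n)
```

The last two lines hint at why the codec name matters: byte `0xF3` decodes to `ó` under both codecs, but byte `0xF1` decodes to `ñ` under Latin-1 and to `ń` under Latin-2. If the corpus contains words with `ñ`, it may be worth checking a few of them after decoding.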

Maybe there is something wrong with your tagger object or with how your file is being read. I have rewritten part of your code and it runs without error:

  # -*- coding: utf-8 -*-
  import urllib2, codecs
  from nltk.corpus import cess_esp as cess
  from nltk import word_tokenize
  from nltk import UnigramTagger as ut
  from nltk import BigramTagger as bt

  tagger = ut(cess.tagged_sents())

  url = 'https://db.tt/42Lt5M5K'
  fin = urllib2.urlopen(url).read()
  fout = codecs.open('tagger.out', 'w', 'utf8')
  for line in fin.decode('utf8').split('\n'):
      print>>fout, "Current line contents before tagging->", line
      print>>fout, "Unigram tagger->",
      print>>fout, tagger.tag(word_tokenize(line))
      print>>fout

  [out]:
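For comparison, here is a rough Python 3 sketch of the same read, tag and write loop, where text is decoded once on input and encoded once on output. `toy_tag` is a hypothetical stand-in for `tagger.tag`, `tag_file` is a made-up helper, and the input file name is invented for the example. Only `tagger.out` comes from the code above.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def toy_tag(tokens):
    """Hypothetical stand-in for tagger.tag(); tags every token as None."""
    return [(token, None) for token in tokens]


def tag_file(in_path, out_path, tag=toy_tag):
    """Read UTF-8 lines, tag them, and write the results back as UTF-8."""
    with io.open(in_path, encoding="utf-8") as fin, \
            io.open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            fout.write(u"Current line contents before tagging-> %s\n" % line)
            fout.write(u"Unigram tagger-> %s\n" % (tag(line.split()),))


# Small round trip in a temporary directory.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "sentences.txt")
dst = os.path.join(workdir, "tagger.out")
with io.open(src, "w", encoding="utf-8") as handle:
    handle.write(u"la línea\n")
tag_file(src, dst)
with io.open(dst, encoding="utf-8") as handle:
    result = handle.read()
```

Because the files are opened with an explicit encoding, the loop only ever handles text, so no `decode` or `encode` calls are needed inside it.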



















