How to use CATMA and Stanford NER together

I know that at least 10% of my love for the tool CATMA (Computer Aided Textual Markup and Analysis) is due to the fact that my dear colleagues developed it, and I think they are doing a great job on it. I am also happy that they never get tired of dealing with annoying users like me, who always want to do more than is actually possible (and who often see it made possible somehow). But the other 90% is easily explained by the incredible number of features and the ease of using them. So when I searched for a tool to improve my NER (Named Entity Recognition; find more about it here) results, I almost automatically found my way back home to CATMA. I can fully recommend combining the two tools, especially now that CATMA has a new function that lets us upload XML directly and thus access the NER tags automatically.

How to import NER data to CATMA

First of all, if you don’t know how to use Stanford Named Entity Recognition, you will find all you need here. As for CATMA, you do not have to install anything, as it is a web-based tool. Just go here, log in with your Google account, and you’re in.

When you have finished running the NER on your text, go to "save tagged file" and give the file a name you can easily remember. Unfortunately, Stanford NER will not give you a full XML file but only a txt file with some XML tags in it. To upload the whole markup you have to correct that by giving it an XML frame, which is quite easily done: the file needs a single enclosing element around everything else. Just add an opening tag at the very beginning and the matching closing tag at the very end. For example, you could insert <NER-Markup> in the very first line of the text and </NER-Markup> in the last one. Afterwards, save the file as XML, which can be done in programs like TextWrangler or WordPad by changing the extension from txt to xml.
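If you have many files, the wrapping step can also be scripted. The following is only a minimal sketch in Python, not part of CATMA or Stanford NER; the function names `wrap_ner_output` and `convert_file` are made up for this illustration, and it assumes the tagged file is UTF-8 encoded.

```python
from pathlib import Path


def wrap_ner_output(text: str, root: str = "NER-Markup") -> str:
    """Enclose the inline-tagged NER text in one root element."""
    return f"<{root}>\n{text.strip()}\n</{root}>\n"


def convert_file(txt_path: str, root: str = "NER-Markup") -> Path:
    """Read a tagged .txt file and write the wrapped text next to it as .xml."""
    src = Path(txt_path)
    dst = src.with_suffix(".xml")  # same name, only the extension changes
    wrapped = wrap_ner_output(src.read_text(encoding="utf-8"), root)
    dst.write_text(wrapped, encoding="utf-8")
    return dst
```

Calling `convert_file("mytext.txt")` (a hypothetical file name) would leave the original untouched and create `mytext.xml` beside it.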

Then go back to CATMA and upload the file by following the step-by-step upload process. If everything has been done properly, CATMA will recognize the xml extension automatically. However, it will have problems if >…< is used instead of „…“ to mark direct speech, and it will give you an error if the sign & is used instead of the word "and". Just make sure these characters do not appear in your text (except, of course, for the angle brackets in the tags). Afterwards you will have an automatically created markup collection, called the Intrinsic Markup Collection, which is where your markup is stored.
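The clean-up of the problematic characters can be sketched in Python as well. This is an illustration under several assumptions, not a CATMA feature: the tag names in `TAG_NAMES` are guesses at what your NER model emits, every angle bracket outside those tags is treated as a direct-speech mark, and & is simply replaced by a word. The last function uses Python's standard XML parser to check that the result is well-formed before you upload it.

```python
import re
import xml.etree.ElementTree as ET

# Assumed tag names; adjust them to the root element and NER classes you use.
TAG_NAMES = ("NER-Markup", "PERSON", "LOCATION", "ORGANIZATION", "MISC")
TAG_RE = re.compile(r"(</?(?:%s)>)" % "|".join(re.escape(t) for t in TAG_NAMES))


def sanitize(text: str, ampersand_word: str = "and") -> str:
    """Replace & and stray angle brackets outside the known tags."""
    parts = TAG_RE.split(text)  # odd indices hold the tags themselves
    cleaned = []
    for i, part in enumerate(parts):
        if i % 2 == 0:  # plain text between tags
            part = part.replace("&", ampersand_word)
            # Assumption: > opens and < closes direct speech.
            part = part.replace(">", "„").replace("<", "“")
        cleaned.append(part)
    return "".join(cleaned)


def is_well_formed(xml_text: str) -> bool:
    """Return True if the standard library parser accepts the text as XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

For German texts you would probably pass `ampersand_word="und"`. If `is_well_formed` still returns False after sanitizing, some other character or an unclosed tag is likely the cause.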

[Screenshot, 2014-11-22, 11.42.03]

Next you may go to "analyse document". In the new window you will find a command line where you can type queries, such as regular expressions or tag queries, or you can just follow the GUI click system to build a query. For example, you can enter tag="NER-Markup/LOCATION%", click "execute query", and you will get a list of all the location tags from the NER output you uploaded. If you select all of them for KWIC (keyword in context), they will appear in a table in the window on the right. Underneath this window you will find a small xls file symbol. Clicking on it starts the download of the table, which you can then open in Excel. Or you can just use the KWIC view in CATMA to jump back to your text by double-clicking on one of the keywords. You can also visualize your data in a distribution graph by clicking the symbol underneath the tag list, or you can go to a tree visualization of a single keyword.
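If you want to cross-check what the location query returns, you can count the tags in your XML file locally. The sketch below does not use CATMA at all; it only reads the wrapped NER file with Python's standard library. The function name `count_entities` is made up for this example, and it assumes the file is well-formed XML with tags such as LOCATION directly around the entity text.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_entities(xml_text: str, tag: str = "LOCATION") -> Counter:
    """Count how often each entity string occurs inside the given tag."""
    root = ET.fromstring(xml_text)
    return Counter((el.text or "").strip() for el in root.iter(tag))
```

The sum of the counts should match the number of hits CATMA lists for the same tag; if it does not, something probably went wrong during the upload.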

[Screenshot, 2014-11-22, 11.44.21]

Of course you can also go back to your Excel file, as I did, and work on your output data. You can easily remove all the wrongly tagged entities (which will be some work if you work on German texts and much less if you work on English ones, because the English NER works much better at the moment). You can also import this table into a geolocalization tool, as I will explain in the next tutorial on this blog.
