5 Named Entity Recognition and Automated Annotation Tools

In the sections below we identify currently available tools for automated tagging and provide a brief evaluation of each, based on a set of criteria.

5.2 Open Source Standalone Applications

In this section, we describe the most popular generic, stand-alone NER systems, the majority of which are written in Java. Many of them can be customised to identify specific entities (people, places, events etc.). Many of them do not explicitly support semantic technologies such as RDF, but they can be modified relatively easily to generate this kind of output.

5.2.1 GATE (General Architecture for Text Engineering)
http://gate.ac.uk/
• Open-source tool developed by the University of Sheffield; can easily be embedded (as Java jars) in other systems – a minimal embedding sketch is shown at the end of this entry.
• Includes ANNIE, an information extraction and semantic tagging system that is highly tailorable and supports multiple languages and customised gazetteers (based on a flat list of terms or on an ontology).
• Extensive documentation on the web site; however, the system is relatively difficult to set up and configure – there is a set of training/certification modules.
• GATE Cloud is currently only available to GATE partners and is in alpha phase.
• Current projects that use GATE include:
o GATE/ETCSL – building generic tools for linguistic annotation and Web-based analysis of literary Sumerian.
o EMILLE (Enabling Minority Language Engineering) – building a 63-million-word electronic corpus of South Asian languages, especially those spoken in the UK.
o OldBailey Online – named entity recognition on 17th-century Old Bailey court reports, using a combination of manual markup and GATE.
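As an illustration of the embedding route, the following is a minimal sketch of running the bundled ANNIE pipeline through GATE Embedded, following the pattern in the GATE user guide. The plugin path and the .gapp file name are assumptions that depend on the local GATE installation.

```java
import java.io.File;
import gate.Corpus;
import gate.CorpusController;
import gate.Document;
import gate.Factory;
import gate.Gate;
import gate.util.persistence.PersistenceManager;

public class AnnieExample {
    public static void main(String[] args) throws Exception {
        // Initialise the GATE library (requires a local GATE installation).
        Gate.init();

        // Load the saved ANNIE application shipped with GATE
        // (path/file name assumed from the standard distribution layout).
        File annieGapp = new File(new File(Gate.getPluginsHome(), "ANNIE"),
                                  "ANNIE_with_defaults.gapp");
        CorpusController annie =
            (CorpusController) PersistenceManager.loadObjectFromFile(annieGapp);

        // Build a one-document corpus from a plain string.
        Corpus corpus = Factory.newCorpus("demo");
        Document doc = Factory.newDocument("Dr. Smith visited Sheffield in May.");
        corpus.add(doc);

        // Run the pipeline and print the Person/Location annotations found.
        annie.setCorpus(corpus);
        annie.execute();
        System.out.println(doc.getAnnotations().get("Person"));
        System.out.println(doc.getAnnotations().get("Location"));
    }
}
```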

5.2.2 YooName/Balie
http://yooname.com
• Proof-of-concept built by a PhD student, based on semi-supervised learning.
• Identifies 9 types of entities (100 sub-categories), including person, organisation, location, facility, product, event, natural object and unit.
• An evolved version of Balie (an open-source tool by the same developer): http://balie.sourceforge.net/

5.2.3 Mallet (MAchine Learning for LanguagE Toolkit)
http://mallet.cs.umass.edu/
• Open-source (CPL) Java-based tool.
• Documentation is aimed at people already familiar with NLP – relatively difficult to get started.
• Sequence-tagging features support named entity recognition, using hidden Markov models and linear-chain conditional random fields (CRFs).
5.2.4 FreeLing
http://www.lsi.upc.edu/~nlp/freeling/
• Open source (GPL), with APIs for both Python and PHP.
• Supports multiple languages (including Portuguese, Italian, Spanish and English).
• Recognises dates/times, quantities/ratios and named entities such as people.
• Includes an online demo: http://nlp.lsi.upc.edu/freeling/demo/demo.php
5.2.5 Illinois Named Entity Tagger
http://cogcomp.cs.illinois.edu/page/software_view/4
• From the University of Illinois at Urbana-Champaign.
• Tags people, organisations, locations and miscellaneous entities. Gazetteers are based on Wikipedia.
• Described in: L. Ratinov and D. Roth, "Design Challenges and Misconceptions in Named Entity Recognition", CoNLL 2009.
5.2.6 LingPipe
http://alias-i.com/lingpipe/
• Java API with source code; a typical usage pattern is sketched below.
• Online demo – the result is XML that labels the entities using ENAMEX tags identifying persons, organisations and locations.
• Can be trained to recognise entities from any domain or language, based on regular expressions or dictionaries.
• Free for research use; licenses available for commercial use.
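For illustration, a minimal sketch of running one of LingPipe's pre-trained news-domain chunkers over a string; the model file name is an assumption (the serialised models are downloaded separately from the LingPipe site).

```java
import java.io.File;
import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunker;
import com.aliasi.chunk.Chunking;
import com.aliasi.util.AbstractExternalizable;

public class LingPipeNerExample {
    public static void main(String[] args) throws Exception {
        // Deserialise a pre-trained named-entity model (file name assumed;
        // models are distributed separately by alias-i).
        Chunker chunker = (Chunker) AbstractExternalizable.readObject(
            new File("ne-en-news-muc6.AbstractCharLmRescoringChunker"));

        String text = "Thomson Reuters acquired ClearForest in 2007.";
        Chunking chunking = chunker.chunk(text);

        // Each chunk carries a character span and a type such as
        // PERSON, LOCATION or ORGANIZATION.
        for (Chunk c : chunking.chunkSet()) {
            System.out.printf("%s -> %s%n",
                text.substring(c.start(), c.end()), c.type());
        }
    }
}
```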
5.2.7 Open Pipeline
http://www.openpipeline.org/
• Open-source (Apache License 2.0) Java-based search pipeline platform.
• Includes wrappers for LingPipe and UIMA.
• Entity extraction via a commercial add-on.
5.2.8 MinorThird
http://sourceforge.net/apps/trac/minorthird/wiki
A toolkit and collection of Java classes that provides machine-learning methods for extracting entities, integrated with tools for manually and programmatically annotating text.
• Open-source (BSD) Java libraries.
• Annotation and visualisation system as well as entity recognition.
• Uses stand-off markup of textual documents stored in a database (TextBase).
• Cohen, W., "MinorThird: Methods for Identifying Names and Ontological Relations in Text using Heuristics for Inducing Regularities from Data", http://minorthird.sourceforge.net, 2004.
5.2.9 Stanford Named Entity Recognizer
http://nlp.stanford.edu/ner/index.shtml
• Open-source (GPL) Java-based tool (a commercial license is also available); a minimal usage sketch follows this entry.
• Needs to be trained to recognise entities, e.g., person, location, organization.
• Additional tools available, e.g., a Perl module providing a web service interface, and an Apache UIMA annotator.
• Recent new release; active community with mailing lists for support.
• Output formats include XML, inlineXML and slashTags.
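A minimal usage sketch, assuming one of the serialised classifier models shipped with the distribution (the model path shown is an assumption about the local layout):

```java
import edu.stanford.nlp.ie.crf.CRFClassifier;

public class StanfordNerExample {
    public static void main(String[] args) throws Exception {
        // Load a pre-trained serialized model from the distribution
        // (relative path assumed; adjust to the local install).
        CRFClassifier classifier = CRFClassifier.getClassifier(
            "classifiers/english.all.3class.distsim.crf.ser.gz");

        // inlineXML output wraps entities in tags like <PERSON>...</PERSON>.
        String tagged = classifier.classifyWithInlineXML(
            "Jim Hendler works at Rensselaer Polytechnic Institute in Troy.");
        System.out.println(tagged);
    }
}
```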
5.2.10 TextPro/Typhoon
http://textpro.fbk.eu/
TextPro/Typhoon is a classifier-combination system for named entity recognition (NER), in which two different classifiers are combined to exploit data redundancy and patterns extracted from a large text corpus.
• Demo recognises persons, locations and organisations.
• Works for both Italian and English.
• Free for research/non-profit purposes.
• Online demo available: http://textpro.fbk.eu/demo.php
• Typhoon is also available as a web service (Italian only): http://textpro.fbk.eu/typhoon

5.3 Web Services
An increasing number of Web services perform named entity recognition on textual documents via a Web interface. The majority, and the best, of these services are neither open source nor free. Some free web services exist (e.g., tagthe.net, http://www.tagthe.net/), but they generally perform poorly. Although the majority of services are commercial, some also have free components/versions with limited functionality or usage (e.g., 10,000 requests/day). Examples that apply this kind of restriction include:
• Evri
• OpenCalais
• AlchemyAPI
The most promising services also apply restrictions on the re-use of tags – for example, they do not provide a mechanism by which users can store the tags for re-use. There are also many web services that are to all intents and purposes commercial, because the amount of permitted free usage is very small: Meaningtool, Complexity Intelligence, TextDigger. Below is a survey of the most widely used, robust and best-performing of the semantic tagging web services.
5.3.1 Evri
http://www.evri.com
• Provides several APIs for NLP text analysis, content recommendations and relationships between semantic entities [43].
• The "Get Entities Based on Text" API extracts entities (people, places, things) from news articles, blog posts, Twitter tweets and other web content. The full schema of entities is not published, but a zeitgeist of the 1,000 most popular is available. Entities include persons, locations, concepts, products, organisations and events, as well as relations.
• Results are XML or JSON.
• Evri entities are identified by Evri URIs (but there are no Linked Data URIs into other databases).
• Has a mobile application that filters and delivers personalised content via an iPhone app.
• Free, with no fixed limit, for non-commercial use; however, caching of results is not permitted – exemptions are possible (e.g., for academic use) by contacting the company. Commercial licenses available.
5.3.2 OpenCalais
http://www.opencalais.com/
• OpenCalais is a product of Thomson Reuters that provides an open API which has been widely adopted by the open-source community; a minimal REST invocation sketch is shown after this list.
• Identifies specific entities, events and relations from the web and news domain (e.g., company merger, natural disaster, product recall, conviction). Also suggests social tags.
• A full list of available entities: http://www.opencalais.com/documentation/calais-web-service-api/api-metadata/entity-index-and-definitions
• See also the online demo/web service: http://viewer.opencalais.com/
• User-defined vocabularies are planned for "some point in the future".
• Many entities are identified using Calais URIs, with some sameAs links to DBpedia and Freebase.
• Supports disambiguation of companies, geographical locations and electronics products.
• Results available as RDF/XML, Microformats, custom XML ("Simple Format") or JSON.
• Provides character offsets that can be used to insert tags into content.
• Free for up to 50,000 requests per day after registering for an API key; subscription plans above that. Works on documents up to 100K.
• Supports English, French and Spanish.
• Detailed documentation available on the website, including an RDF schema and demo.
• It is the semantic tagging engine behind the OpenPublish platform (integrated with Drupal and WordPress).
• ClearForest (http://www.clearforest.com/) also offers a commercial product called OneCalais.
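For illustration, a minimal sketch of calling the REST interface from Java; the endpoint URL and header names follow the OpenCalais Web service documentation of the time, and the API key is a placeholder.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class CalaisExample {
    public static void main(String[] args) throws Exception {
        String apiKey = "YOUR_API_KEY";   // placeholder: obtained on registration
        String text = "Thomson Reuters is headquartered in New York.";

        // POST raw text to the enrich endpoint (URL/headers per Calais docs).
        HttpURLConnection conn = (HttpURLConnection)
            new URL("http://api.opencalais.com/tag/rs/enrich").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("x-calais-licenseID", apiKey);
        conn.setRequestProperty("Content-Type", "text/raw; charset=UTF-8");
        conn.setRequestProperty("Accept", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }

        // The JSON response maps Calais URIs to entity/relation descriptions.
        try (Scanner in = new Scanner(conn.getInputStream(), "UTF-8")) {
            while (in.hasNextLine()) System.out.println(in.nextLine());
        }
    }
}
```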
5.3.3 AlchemyAPI
http://www.alchemyapi.com/
• Automatically tags web pages, textual documents and scanned document images. Supports OCR to analyse scans of newspapers, documents etc.
• Supports multiple languages (English, Spanish, German, Russian, Italian and others).
• The Named Entity Extraction API identifies specific entities including people, companies, organisations, cities, geographic features, anniversaries, awards and holidays; a minimal invocation sketch follows this entry.
• Entities are identified by URIs from Linked Open Data (LOD) sources, e.g., Freebase, UMBEL, the CIA Factbook.
• Disambiguation support (although it appears to be missing disambiguated URIs for "person" entities) [43].
• Formats: XML, JSON, RDF, Microformats.
• Requires an access key to use the API.
• Free for up to 30,000 calls per day; commercial support can be purchased.
• Detailed documentation available on the website, including an RDF schema and an online demo: http://www.alchemyapi.com/api/entity/
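A minimal sketch of calling the entity-extraction endpoint over HTTP; the call name and parameters follow AlchemyAPI's documentation at the time, and the key is a placeholder.

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Scanner;

public class AlchemyExample {
    public static void main(String[] args) throws Exception {
        String apiKey = "YOUR_API_KEY";   // placeholder access key
        String text = URLEncoder.encode(
            "Tim Berners-Lee founded the World Wide Web Consortium.", "UTF-8");

        // TextGetRankedNamedEntities is the entity-extraction call for raw
        // text; outputMode=json selects JSON instead of the default XML.
        URL url = new URL(
            "http://access.alchemyapi.com/calls/text/TextGetRankedNamedEntities"
            + "?apikey=" + apiKey + "&text=" + text + "&outputMode=json");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (Scanner in = new Scanner(conn.getInputStream(), "UTF-8")) {
            while (in.hasNextLine()) System.out.println(in.nextLine());
        }
    }
}
```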
5.3.4 Zemanta
http://www.zemanta.com/api/
• Identifies the following entities: persons, books, music, movies, locations, stocks and companies (the documentation does not mention events).
• Also returns related tags, categories, pictures and articles.
• Free for up to 10,000 API calls per day; subscription plans above that.
• Returns RDF/XML, JSON or custom XML.
• The documentation says it supports custom taxonomies.
• See a recent comparison with OpenCalais: "Linked Data Entity Extraction with Zemanta and OpenCalais", http://bnode.org/blog/2010/07/28/linked-data-entity-extraction-with-zemanta-and-opencalais
5.3.5 OpenAmplify
http://www.openamplify.com/
• Provides natural language processing APIs for use in commercial applications.
• Analyses documents for topics (including named entities such as persons, organisations and locations), actions (i.e., events that can be identified by verbs such as give, learn, repair, request and say, together with when they have occurred or will occur), style, demographics etc.
• Results available as custom XML and JSON formats.
• Good documentation on the website, including code samples and tutorials.
• Free for up to 1,000 requests per day; commercial packages available beyond that.
5.3.6 Meaningtool
http://www.meaningtool.com/
• Identifies entities (organisations, companies, locations and persons only), categories, keywords and language.
• Supports English, Spanish and Portuguese texts.
• Supports user-defined trees for categorisation.
• Results available in JSON or custom XML.
• Free for up to 1,000 requests per day (plans available above that).
• Good documentation and demo on the website.
5.3.7 Complexity Intelligence
http://www.complexityintelligence.com/
• Free for 10,000 requests per month after registering.
• Identifies persons, companies and locations (perhaps more).
• Online demo available from the web site.
5.3.8 TextDigger
http://textdigger.com/
• Semantic content tagger – free to tag 25 URLs per day (additional capacity can be purchased).
• Results are not sent automatically – a page must be queued for tagging, and the results then retrieved via the web service.
• Assigned tags are used to retrieve links to related web pages.
• Results are returned as custom XML. Entities have numeric ids.
• The results are stored in a database.
5.3.9 Inform
http://www.informpublisherservices.com/
• Commercial Web service.
• Not much information on the website – further information by enquiry only.
5.3.10 mSpoke mSense
http://www.mspoke.com/mSense.html
• Commercial Web service.
• Identifies named entities: people, places, organisations (also topics and categories).
• The mSense taxonomy is based on Wikipedia; customised taxonomies are also possible.
• An mSense API is available.
• Further information available through enquiry.
5.3.11 Info(N)gen
http://www.infongen.com/
• Commercial Web service.
• The default taxonomy includes entities such as company, industry, language, country and products – targeted at the business, finance, pharma, energy, technology, consumer goods, retail, commercial services and media domains.
• Customised taxonomies are possible (they must be created using the InfoNgen Taxonomy Wizard).
• Results are RDF/XML or custom XML (via API or feed).
• Further information available through enquiry.
5.3.12 Alethes OpenEyes
http://www.alethes.it/openeyes.html
• Commercial system.
• Website is in Italian, but the system can be applied to 8 languages including English.
• Example: http://www.youtube.com/watch?v=VJdMM8Rhxdo
• Recognises people, organisations, places, quantities, dates and currency. Entities can be customised.
• Compatible with Apache UIMA.
5.3.13 TagThe.net
http://tagthe.net
• Returns custom XML containing tags identifying topics, locations and persons (but not events), plus tags for the title, size, content-type, author and language of the source document.
• Does not mark up the content (or indicate the location of entities within the content).
• Tags are text only (no identifiers or ontology).
• Uses a statistical approach (per the FAQ). The analysis component is written in Java.
• Free to use as-is, with no limitations on use but also no service-level guarantees.
• Can be invoked via HTTP requests, as the sketch below illustrates.
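A sketch of the HTTP interface; the parameter names here ("text", "view") are assumptions based on the service's published API, so consult tagthe.net for the exact form.

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Scanner;

public class TagTheExample {
    public static void main(String[] args) throws Exception {
        // Submit raw text; the service reportedly also accepts a "url"
        // parameter pointing at a page to tag instead.
        String text = URLEncoder.encode(
            "Barack Obama gave a speech in Berlin.", "UTF-8");
        URL url = new URL("http://tagthe.net/api/?text=" + text + "&view=xml");

        // Print the custom XML response containing the suggested tags.
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (Scanner in = new Scanner(conn.getInputStream(), "UTF-8")) {
            while (in.hasNextLine()) System.out.println(in.nextLine());
        }
    }
}
```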

5.4 Commercial Systems
A wide range of commercial named entity recognition (NER) systems is available. These systems typically use significant numbers of hand-coded rules, which enable them to achieve reasonable performance for limited numbers of entity types on well-circumscribed corpora, such as news articles. However, they generally do not permit customisation or tailoring for domains other than the one for which they were designed. Below we describe some of the more popular and widely used commercial systems for named entity tagging.
5.4.1 SAS Text Miner
http://www.sas.com/text-analytics/text-miner/index.html
• Mines text from PDFs, HTML and Word documents in multiple languages.
• Identifies named entities and parts of speech, and provides visualisation of concepts.
• Supports many different entity types, including person and company names, locations, dates, addresses, measurements, and e-mail and URL addresses.
• Supports user customisation of entity lists.
• Commercial system, previously known as Teragram.
5.4.2 Leximancer
http://www.leximancer.com/
• Commercial.
• Standalone software or hosted solution.
• Visualisations as well as named entity recognition.
5.4.3 Megaputer PolyAnalyst
http://www.megaputer.com/polyanalyst.php
• Commercial product.
• Supports keyword and entity extraction, as well as categorisation and clustering of textual documents.
5.4.4 Trifeed TRAILS
http://www.trifeed.com
• Identifies entities including people, companies, places, events, books, movies, dates and currency, with associated attributes (e.g., a person's position). Can also extract relations and quotes.
• Commercial.
• Aimed at the online news domain.
• Demo available on the website: http://www.trifeed.com/new-demo.jsp
5.4.5 Nogacom ClassLogic
http://www.nogacom.com/
• Commercial entity extraction and classification based on the Nogaclass data classification platform.
• Focused on the business domain: entities include customers, suppliers, partners, products, competitors, locations etc., drawn from their own business taxonomy.
• Supports 32 languages.
• The website contains mostly marketing material – not much technical information.
5.4.6 NetOWL
http://www.sra.com/netowl/
• The Extractor recognises entities based on their own NameTag (people, organisations, places, addresses, dates etc.), Link, Event (affiliation, transaction etc.) and Cyber Security ontologies.
• Supports multiple domains: business, security, finance, life sciences, military, politics etc.
• Supports multiple languages.
• Output includes custom XML and OWL.
• Provides a term extraction and visualisation tool (Java-based).
• APIs for Java and C; can also be run as a Web service.
5.4.7 Basis Technology Rosette Entity Extractor (REX)
http://www.basistech.com/entity-extraction/
• REX uses statistical modelling to learn patterns from large corpora of native-language text.
• Identifies and tags people, organisations, locations and dates using gazetteers.
• Available for Chinese, Japanese, Korean, Arabic, Farsi, Urdu, Russian, Dutch, English, French, Italian, German and Spanish.

5.5 Bio-medical Semantic Tagging Tools
The majority of discipline-specific NER systems have been developed for text mining of biomedical literature and MEDLINE abstracts. Below are some of the most popular and robust tools in this area. They generally enable the identification and tagging of biomedical entities such as protein, DNA, RNA, cell line and cell type.
5.5.1 BioNLP
http://bionlp.sourceforge.net/
BioNLP is an initiative by the Center for Computational Pharmacology at the University of Colorado to create and distribute code, software and data for applying natural language processing techniques to biomedical texts. It has generated a number of tools, of which the most relevant are:
• Knowtator: a Protégé plug-in for text annotation.
• MutationFinder: extracts biomedical entities from text.
5.5.2 PennBioIE
http://bioie.ldc.upenn.edu/
The aim of the PennBioIE project was to develop better methods for information extraction, specifically from biomedical literature, annotating texts in two domains of biomedical knowledge:
• inhibition of the cytochrome P450 family of enzymes (CYP450, or CYP for short);
• molecular genetics of cancer (oncology, or onco).
5.5.3 ABNER
http://pages.cs.wisc.edu/~bsettles/abner/
ABNER (A Biomedical Named Entity Recognizer) is an open-source (CPL) Java tool for molecular-biology entity extraction. It recognises proteins, DNA, RNA, cell lines and cell types.
5.5.4 POSBIOTM/W
http://isoft.postech.ac.kr/Research/Bio/bio.html#Requirements
POSBIOTM/W is a workbench for a machine-learning-oriented biomedical text mining system. It is intended to help biologists mine useful information efficiently from biomedical text resources.
5.5.5 GENIA Tagger
http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
The GENIA tagger analyses English sentences and outputs base forms, part-of-speech tags, chunk tags and named entity tags. The tagger is specifically tuned for biomedical text such as MEDLINE abstracts and identifies proteins, DNA, RNA, cell_line and cell_type entities.
5.5.6 AIIAGMT
http://bcsp1.iis.sinica.edu.tw/aiiagmt/
This NER system, developed by the AIIA Lab at Academia Sinica in Taiwan, tags genes and gene products mentioned in textual documents.
5.5.7 DECA – Disease Extraction
http://www.nactem.ac.uk/deca_details/start.cgi
DECA focuses mainly on disambiguation of model organisms commonly used in biological studies, such as E. coli, C. elegans, Drosophila and Homo sapiens. Given an article, DECA automatically identifies the species-indicating words (e.g., human) and biomedical named entities (e.g., protein P53) in the text, and assigns a unique NCBI Taxonomy ID to each entity.

5.6 Scientific and Chemistry Semantic Tagging Tools

5.6.1 OSCAR3 (Open Source Chemistry Analysis Routines)
http://sourceforge.net/projects/oscar3-chem/
OSCAR3 is a set of software modules designed to enable semantic annotation of chemistry-related documents. It provides two modules, OPSIN (a name-to-structure converter) and ChemTok (a tokeniser for chemical text), which are also available as standalone libraries. It also attempts to identify:
• Chemical names: singular nouns, plurals, verbs etc., as well as formulae and acronyms, and some enzyme and reaction names.
• Ontology terms from the ChEBI ontology (http://www.ebi.ac.uk/chebi/).
• Chemical data: spectra, melting/boiling points, yields etc. in experimental sections.
In addition, where possible, the chemical names that are detected are annotated with structures, either via lookup or name-to-structure parsing (OPSIN), and with identifiers from the chemical ontology ChEBI.
5.6.2 SAPIENT – Semantic Annotation tool for Scientific Research Papers
http://www.aber.ac.uk/compsci/Research/bio/art/sapient/
A web application designed to take as input full scientific papers represented in XML and compliant with the SciXML schema. It supports the annotation of papers using topics/concepts in physical chemistry and biochemistry taken from CISP (Core Information about Scientific Concepts). Examples of entities are: Background, Conclusion, Experiment, Goal, Hypothesis, Method, Model, Motivation, Object of Investigation, Observation and Result (based on the EXPO ontology). It provides both manual annotation and auto-annotation tools. The automatic annotation is performed by OSCAR3 and generates colour-coded annotations.

5.7 Research – Semantic Tagging of Texts
Automatic semantic annotation requires training before the annotation process can be carried out autonomously. As such, substantial human contribution is required to generate the training corpus and to maintain it as the domain ontology evolves over time. For this reason, significant research has focused on semi-supervised approaches that do not require a large annotated corpus for training, but may require some manual bootstrapping to start the learning process. A simple way to categorise semantic tagging systems is as follows:
• machine-learning systems, such as Amilcare, that require an annotated training corpus;
• rule-based systems, which rely on manually created rules;
• pattern-based systems, which require an initial set of seeds in order to discover patterns.
Armadillo [20] uses a pattern-based approach to annotation, based on the Amilcare information extraction system [21]. It is especially suitable for highly structured Web pages. The tool starts from a seed pattern and initially requires no human input – although the patterns for entity recognition have to be added manually.
The knowledge and information management (KIM) platform [22] consists of an ontology and knowledge base as well as an indexing and retrieval server. RDF data is stored in an RDF repository, whilst search is performed using LUCENE. KIM is based on an underlying ontology (KIMO or PROTON) that holds the knowledge required to semantically annotate documents, and on GATE to perform information extraction.
Magpie [23] is a suite of tools that supports the fully automatic annotation of Web pages, by mapping entities found in its internal knowledge base against those identified on Web pages. The quality of the results depends on the background ontology, which has to be manually modeled and populated.
MnM [24] is another tool that supports semi-automatic annotation, based on the Amilcare system. It uses machine-learning techniques and requires a training data set. The classic usage scenario MnM was designed for is the following: while browsing the Web, the user manually annotates selected Web pages in the MnM Web browser. While doing so, the system learns annotation rules, which are then tested against user feedback. The better the system does, the less user input is required.

The PANKOW algorithm [25] is a pattern-based approach to semantic annotation that makes use of the redundant nature of information on the Web. Based on an ontology, the system constructs patterns and combines entities into hypotheses that are validated manually.

S-CREAM [26] is another approach to semi-automatic annotation that combines two tools: Ont-O-Mat, a manual annotation editor implementing the CREAM framework, and the Amilcare system. S-CREAM can be trained for different domains, provided appropriate training data, and proposes a set of heuristics for post-processing and mapping of information extraction results to an ontology. S-CREAM uses the Amilcare machine-learning system, together with a training corpus of manually annotated documents, to automatically suggest appropriate tags for new documents.

ConAnnotator [27] uses Support Vector Machines (SVM) and natural language processing (NLP) approaches to facilitate the automated generation of annotations with the support of the domain ontology.

The SemTag system [28] is based on the TAP ontology (which is very similar to the KIM ontology). The system first annotates all occurrences of instances of the ontology; it then disambiguates the elements and assigns the correct ontological classes by analysing context.

More recently, the OntoNEO [29] system has been developed by Choi and Park to automatically semantically annotate named entities in texts. OntoNEO claims 18% better performance than the SemTag algorithm, achieved by using a Hidden Markov Model (HMM) to represent the probabilistic model of named entities from a corpus of documents.

The SCORE system [30] for management of semantic metadata (and data extraction) also contains a component for resolving ambiguities. SCORE uses associations from a knowledge base to determine the best match from candidate entities, but detailed implementation details are not available.

In ESpotter, named entities are recognised using a lexicon and/or patterns [31]. Ambiguities are resolved by using the URI of the webpage to determine the most likely domain of the term (probabilities are computed using hit counts of search-engine results). A minimal sketch of this hit-count idea appears below.
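To make the hit-count approach concrete, the following is an illustrative sketch (not ESpotter's or PANKOW's actual implementation): each candidate class for an ambiguous term is scored by the web frequency of an "X is a Y" pattern, and the highest-scoring class wins. The hitCount function is a hypothetical stand-in for a search-engine count query.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HitCountDisambiguator {

    // Hypothetical stand-in for querying a search engine's result count
    // for an exact phrase; a real system would call a search API here.
    static long hitCount(String phrase) {
        return phrase.hashCode() & 0xffff;  // dummy value for the sketch
    }

    /** Pick the candidate class whose "X is a Y" pattern is most frequent. */
    static String disambiguate(String entity, String[] candidateClasses) {
        Map<String, Long> scores = new LinkedHashMap<>();
        for (String cls : candidateClasses) {
            scores.put(cls, hitCount("\"" + entity + " is a " + cls + "\""));
        }
        // Return the class with the highest pattern frequency.
        return scores.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .get().getKey();
    }

    public static void main(String[] args) {
        String best = disambiguate("Jaguar",
                new String[] {"car", "animal", "operating system"});
        System.out.println("Most likely class: " + best);
    }
}
```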
Table 1: A classification of approaches for semantically annotating texts.

System Name | Nature                     | Method
Armadillo   | Automatic / Semi-automatic | Pre-defined ontology / Adapted ontology
KIM         | Automatic                  | Limited focus; KIMO ontology
Magpie      | Automatic / Semi-automatic | Pre-defined ontology / Adapted ontology
MnM         | Manual / Semi-automatic    | Without training / With training (KMi ontology)
Pankow      | Automatic                  | Limited focus
S-CREAM     | Manual / Semi-automatic    | No training / With training
SemTag      | Automatic                  | Limited focus; TAP ontology
OntoNEO     | Automatic                  | Limited focus
SCORE       | Automatic                  | Pre-defined ontology
ESpotter    | Automatic                  | Weighted ontology
5.8 Research – Semantic Tagging of Multimedia

Because manual annotation of multimedia is so time-consuming, expensive and subjective, significant research effort has focused on automatic semantic annotation of multimedia. Automatic feature extraction tools are often employed to extract low-level features (e.g., regions, colours, textures, shapes). The "Semantic Gap" refers to the difference between these low-level features and the high-level semantic descriptions of the content (e.g., people, places, events, keywords) represented in discipline-specific ontologies. A range of approaches has been applied (with varying success) to bridge the Semantic Gap. Typically these approaches involve a combination of:
• manual annotation of corpora of training content;
• interactively defined inferencing rules (that specify how to infer high-level descriptors from combinations of low-level features);
• neural networks or machine-learning techniques.
The most significant automatic/semi-automatic semantic annotation tools for multimedia are:
• AktiveMedia [32] – an ontology-based annotation system for images and text. It provides semi-automated annotation of JPG, GIF, BMP, PNG and TIFF images by suggesting tags interactively whilst the user is annotating.
• Caliph and Emir [51] – MPEG-7-based Java tools which combine automatic extraction of low-level MPEG-7 descriptors with tools for manually annotating digital photos and images with semantic tags. The resulting metadata is stored as an MPEG-7 XML file, which is used to enable content-based image retrieval.
• The MPEG-7 SpokenContent Description Scheme Extractor automatically recognises speech, to which one can then apply text-related annotation methods. The same applies to Transcriber [34].
• M-OntoMat-Annotizer [35] – a tool that allows the semantic annotation of images and videos for multimedia analysis and retrieval. It provides an interface for linking RDF(S) domain ontologies to automatically extracted low-level MPEG-7 visual descriptors.

Table 2: A classification of approaches for semantically annotating multimedia.

System               | Format | Nature                     | Technique
AktiveMedia          | Images | Semi-automatic             | Low-level semantics
Caliph               | Images | Automatic                  | Low-level semantics
SWAD                 | Images | Automatic                  | Low- and high-level semantics
MPEG-7 SCDSExtractor | Audio  | Automatic                  | Speech
Transcriber          | Audio  | Automatic                  | Speech
4M                   | Video  | Automatic / Semi-automatic | Low-level semantics; machine learning
M-OntoMat-Annotizer  | Video  | Automatic / Semi-automatic | Low-level semantics; machine learning

6 Further Research

One significant emerging area is semantic publishing tools – tools that enable users to create and publish content with semantic markup already embedded. Some examples of such approaches include:
• OpenPublish – Thomson Reuters and Phase2 Technology recently released OpenPublish, which combines Drupal with OpenCalais machine-assisted tagging and built-in RDFa formatting to semantically tag textual documents as they are published. http://drupal.org/project/openpublish
• Jiglu Insight and Jiglu Spaces – commercial products that automatically tag content, find hidden relationships to other published content and automatically create links. http://www.jiglu.com/

Other highly topical and emerging areas of research that are relevant to this report include:
• Automatic semantic annotation of dynamic web documents, such as blogs, wikis and Twitter tweets, i.e., unstructured textual resources that are constantly changing.
• Standardised, interoperable annotation models, such as the Open Annotation Collaboration [16], which promotes a common model based on Linked Open Data and URIs to ensure persistent and re-usable tags/annotations.
• Hybrid semi-automatic semantic tagging systems that combine machine learning with rule-based approaches and crowd-sourcing to generate the training set and correct the results. For example, Finin et al. use Amazon's Mechanical Turk to tag the named entities in Twitter data [40].
• The application of cloud computing to high-performance, large-scale text analysis and named entity recognition, e.g., GATE Cloud: http://gatecloud.net/

