SESL 2009 day two
by Andrew on Mar.31, 2009, under Events
Semantic Enrichment of the Scientific Literature 2009
Tue 31 Mar: “Semantic Enrichment of the literature for the benefit of all users”
(Monday’s notes are here)
Missed the early morning session. I don’t work in pharma any more so it didn’t seem worth a 5:45am wake-up (unhelpful train times). Although apparently Eric Neumann’s talk on linked data was good (“semantic web without the ‘semantic’”, according to Duncan).
Alfonso Valencia — ELIXIR — an EU project to upgrade Europe’s bioinformatics infrastructure. Includes a work package on literature integration — making lit. repositories, ontologies and traditional biological databases interoperate better. Good — too much text mining happens in isolation from the rest of the bioinformatics world. Targeted at wet-lab scientists not just computational people. Looks like it might include an effort to turn raw algorithms into usable tools/platforms. Still in the early phases.
He also discussed the BioCreative project which has released various data sets and held challenges on several aspects of text mining. A spin-off from these is the BioCreative MetaServer which identifies genes and proteins mentioned in text by aggregating predictions from several prediction services.
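To make the aggregation idea concrete, here is a minimal sketch of one way to combine several services' predictions: a simple vote count. This is my own illustration, not the MetaServer's actual method; the tagger names, the example mentions and the `min_votes` threshold are all made up.

```python
from collections import Counter

def aggregate_mentions(predictions, min_votes=2):
    """Keep each mention proposed by at least `min_votes` services."""
    votes = Counter()
    for service_mentions in predictions.values():
        # Use a set so one service cannot vote twice for the same mention.
        votes.update(set(service_mentions))
    return sorted(m for m, n in votes.items() if n >= min_votes)

# Hypothetical output from three made-up taggers for one sentence.
predictions = {
    "tagger_a": ["BRCA1", "TP53"],
    "tagger_b": ["BRCA1", "p53"],
    "tagger_c": ["BRCA1", "TP53", "kinase"],
}

print(aggregate_mentions(predictions))
```

Note that "p53" and "TP53" count as different strings here, so a real aggregator would presumably want to normalize mentions to database identifiers before voting.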
Dietrich Rebholz-Schuhmann — UKPMC — a UK mirror of PubMed Central (with added value apparently) co-ordinated by the British Library. Working on information retrieval and data integration improvements. Sounds like the funding bodies are getting involved, many referring specifically to UKPMC in their open access policies. Paying for OA journal submissions is an issue. Apparently the Wellcome Trust have an OA fund which is under-utilized.
Also, CALBC — a project to semantically annotate a large biomedical corpus (named entities only?) by getting a consensus annotation from iteratively integrating the output of various information extraction systems, and then manually cleaning up the disagreements.
Stefano Bertolo (EU) — funding calls — deadline 3rd November…
(Great analogy: Human history has entered a phase where we can produce information by machine quicker than we can interpret it. What we need is ‘cognitive levers’.)
7th framework, SO 4.3, Call 5, themes:
- Capturing tractable information
- Delivering pertinent information
- Collaboration and decision support
- Personal sphere
- Impact and science & tech leadership
They all sound a bit vague and buzzwordy without the explanations…
Key themes: large data sets and (close to) real-time processing. Requirement for robust, strongly-tested tools that can be distributed — not just ‘only works on the PC of the postdoc that wrote it, on a good day’
Informal queries about proposals: infso-e2 at ec.europa.eu
Lunch! Then…
Keynote from UMLS guru Olivier Bodenreider on normalizing terms/concepts across different lexical/taxonomical/ontological resources. Lexical vs. semantic approaches — e.g. string munging vs. traversing known relationships. Latter complicated by fact that some pairs of concepts are synonymous in one resource and hyponymous in another. Also, semantic similarity — lowest common subsumer/definition by extension, e.g. famous Resnik measure.
Also mentioned BioPortal, not sure exactly how this differs from the UMLS in scope, probably more biological than medical? Must be overlap though.
These are forming a key part of CALBC (see above).
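For my own reference, a minimal sketch of the Resnik measure mentioned in the keynote: the similarity of two concepts is the information content of their most informative common subsumer. The taxonomy and the frequency counts below are invented toy data, not taken from UMLS or any real resource, and it assumes a simple tree with one parent per concept.

```python
import math

# Hypothetical toy is-a taxonomy (child -> parent); not taken from UMLS.
PARENT = {
    "disease": None,
    "infection": "disease",
    "cancer": "disease",
    "flu": "infection",
    "measles": "infection",
    "melanoma": "cancer",
}

# Made-up corpus frequencies for each concept's own mentions.
COUNTS = {"disease": 10, "infection": 10, "cancer": 20,
          "flu": 30, "measles": 10, "melanoma": 20}

def ancestors(concept):
    """Return the concept plus everything that subsumes it, nearest first."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain

def information_content(concept):
    """IC(c) = -log p(c), where p(c) counts c and all its descendants."""
    total = sum(COUNTS.values())
    subsumed = sum(n for c, n in COUNTS.items() if concept in ancestors(c))
    return -math.log(subsumed / total)

def resnik(a, b):
    """Resnik similarity: IC of the most informative common subsumer."""
    common = set(ancestors(a)) & set(ancestors(b))
    return max(information_content(c) for c in common)

print(resnik("flu", "measles"))   # share the fairly specific "infection"
print(resnik("flu", "melanoma"))  # only share the root "disease"
```

Two concepts that only meet at the root score zero, because the root subsumes everything and so carries no information.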
Sophia Ananiadou from NaCTeM: the NLP view of semantic enrichment, covering terms and named entities, concepts, events and relationships. Termine and Acromine: extraction of terms and acronyms. Accelerated annotation methods (cunning). More on the importance of building proper tools rather than just prototypes/in-house algorithms. Glad the NLP scene is catching on to this. Hopefully they allow querying by unique accession rather than just names; this is another area where the NLP people don’t always understand what the bio people need.
She discussed some of NaCTeM’s flagship tools like MEDIE, FACTA and KLEIO — it does look like they’re starting to take all the pain out of text mining, by doing the difficult bits for us, so we can use the results to do actual mining. Also they are offering web service interfaces (‘overdue’ for some of them according to Sophia) — excellent news.
More from Udo: what do we mean by ‘semantics’? Mixed-bag talk. Problems with folksonomies/tag clouds, e.g. Flickr: “newyork” “newyorkcity” “nyc” “new” “york”. Biomedical lexicon an order of magnitude bigger than general English lexicon (based on WordNet and typical competent speaker). Wow. Domain dictionaries like GO/UMLS: these inherit some of the problems of natural language because the terms themselves are stated/defined in natural language! Also often ontological relations are vague/underspecified/changing.
Last session…
Anita de Waard (Elsevier): FEBS Letters structural digital abstracts experiment (author-provided PPI annotation). 75% author compliance, avg 1 hour per abstract. They’ve moved responsibility to the MINT curators instead of the authors, to increase compliance and efficiency! What does that tell us… Also mentioned OKKAM, a consortium trying to provide a UID for every single entity on the web. Umm… Holy crap. So far, 1.5 million entities covered, so they have a bit of a way to go, to say the least. She went on to discuss some aspects of discourse analysis of scientific text. Interesting point: hedging gets eroded by citation, so “these results suggest that” becomes “author X shows that” becomes just a cited fact.
She also discussed the Elsevier Grand Challenge — what’s the most interesting thing you can do with half a million full text articles? Finalists have been chosen, the winners will be announced next month. Next year: Future of Research Communication conference on same themes, probably March at Harvard.
EU-ADR (Erik van Mulligen) — federated data mining/text mining/epidemiological analysis to discover & monitor novel adverse drug reactions. Five-year pan-European project. Sounds like an enormous piece of work with lots of engineering challenges — anonymization etc. for a start.
WikiGenes (Robert Hoffmann) — a wiki for genes, chemicals, MeSH terms obviously — but pre-seeded with sentences yanked from iHOP. So experts can step in and add/fix stuff but without the momentum barrier to getting started. ‘Narcissistic drive’ for authors of missed papers to add their own — cunning. Custom engine based on Apache Cocoon and Lucene. Authorship tracking down to individual strings of text, and it’s easy to view this information. The idea is that scientists will want to add their own work and get credit for it.
He makes the point that this is in many ways a much better way to publish biological information than several thousand different journals, and gives much better influence metrics than impact factors and H-index etc.
Is it in direct competition with WikiProteins? Not according to Robert — that’s more about knowledge engineering and formal semantic relationships, machine-readable stuff, whereas this is more supposed to be a modern publishing medium for human-readable information. Which hopefully the biologists will take to more readily.
Workshop notes — SESL 2009
by Andrew on Mar.30, 2009, under Events
Semantic Enrichment of the Scientific Literature 2009
Monday 30 Mar: “Reliable factual data from the literature based on ontological resources”
Highlight of the morning session was Junichi Tsujii’s demo of the PathText system, which integrates manually-curated pathway information in CellDesigner or SBML format with text-mined relationships, and lets you browse the pathway maps and drill through to related literature.
It’s not finished yet but there’s a preview video available from NaCTeM.
Also a bit of a preview of the BioNLP 2009 Shared Task on extracting biomolecular events from text into semantic networks — which I’m reviewing entries for at the moment.
Lots of material about gene regulation today. An intro to the Gene Regulation Ontology (Jung-Jae Kim), a couple of talks about extracting regulatory events from free text (Kim and Udo Hahn), and the ORegAnno project which is using text mining to support its manual curation of regulatory events (Stephen Montgomery). The new(ish) GeneReg corpus will be useful to anyone building systems like this, as would be the BioNLP 2009 data; I’ll find out if it is available to non-entrants.
Also a talk about populating/extending ontologies automatically from clinical reports (Wendy Chapman). Patterns like “NOUN_PHRASE_1, such as NOUN_PHRASE_2”. Simple and effective.
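A rough sketch of how the “such as” pattern could be applied with nothing more than a regular expression. This is my own toy version, not the system from the talk: the noun-phrase stand-in (one or two lowercase words) is a crude assumption, and a real pipeline would use a proper chunker or parser.

```python
import re

# Crude stand-in for a noun phrase: one or two lowercase words.
# A real system would use a chunker or parser instead.
NP = r"[a-z]+(?: [a-z]+)?"
PATTERN = re.compile(rf"({NP}), such as ({NP})")

def find_hyponym_pairs(text):
    """Return (hyponym, hypernym) pairs matched by 'X, such as Y'."""
    return [(m.group(2), m.group(1)) for m in PATTERN.finditer(text.lower())]

# Made-up example sentence, not from a real clinical report.
report = "The scan showed respiratory infections, such as pneumonia."
print(find_hyponym_pairs(report))
```

Each pair is a candidate is-a link (here, pneumonia as a kind of respiratory infection) that a curator could accept or reject. Lists after “such as” (“pneumonia and sepsis”) would need extra handling that this sketch does not attempt.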
Back from lunch… And sitting at the front so I can hear better. Hence better notes!
Simonetta Montemagni just gave an excellent introduction to the BioLexicon project (also NaCTeM-related) which is essentially a huge database of biomedical/biological language, including such things as domain-specific syntax and semantic metadata, dead useful for text mining developers. Also contains a thesaurus of gene and protein names (inc. synonyms and variants) with links back to UniProt IDs which makes it much more useful for general bioinformatics use.
It isn’t all available yet, and will be published via a linguistic data provider, bit vague about licensing! So it may or may not be free (I’m guessing free for academic use, commercial for other uses).
Lots of data and tools for natural language processing from Udo Hahn’s group: http://www.julielab.de/ … Plus some war stories about the difference in information extraction accuracy between ‘lab’ tests and real world performance, e.g. from ~60% (close to human levels) to ~20% F-score. Ouch… But we’ve all been there. (See also note about GeneReg above)
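For anyone unfamiliar with the F-score quoted above, a minimal sketch of the usual balanced version (the harmonic mean of precision and recall). The counts in the example are invented to land on round numbers; they are not the figures from the talk.

```python
def f_score(tp, fp, fn):
    """Balanced F-score from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Invented counts: a 'lab' run versus a much worse 'real world' run.
print(round(f_score(tp=60, fp=40, fn=40), 2))
print(round(f_score(tp=20, fp=80, fn=80), 2))
```

Because it is a harmonic mean, the score is dragged down by whichever of precision or recall is worse, which is part of why real-world drops look so brutal.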
Su Jian talked about designing evaluation tasks for genomic information retrieval (i.e. search engine) algorithms, and improving said algorithms with dedicated gene/protein name recognizers. Bit specialized for me — lots of score functions I didn’t know the definitions of…
Quick coffee break!
Nice talk about the Experimental Factor Ontology from the ArrayExpress project (James Malone). This is for classifying experimental conditions in microarray experiments. They’ve gone to a lot of trouble to link their ontology into others as painlessly as possible, and have developed autonomous agents to trawl the semantic web for other ontologies that may be related, and to alert them when ontologies they link to change, as this might imply a link is no longer true. Cute. The EFO has also allowed them to offer federated queries with other databases, and they use it for sanity-checking the data people submit via reasoning rules — e.g. cardiovascular disease can’t occur in hair follicle cells.
UCSD (Lynn Fink) have written a very nice plugin for Word 2007 that watches your text as you type and automatically tags biomolecular database identifiers and terms from OBO ontologies when they appear — with the option to add/edit/remove/override manually of course, and the tags being preserved in Word’s XML files. Kind of like a spellchecker/thesaurus for semantic markup. I’m not a fan of word processors (give me LaTeX any day) but this is an excellent idea. Hopefully publishers and curators will be able to parse useful metadata out of the resulting files.
Some similar ideas from PaperMaker (Piotr Pezik) which also does semantic tagging, along with things like spotting missing references, acronyms that haven’t been defined, and genes/proteins that have been referred to by non-recommended identifiers. It can also trawl PubMed for similar publications, at the whole-doc or paragraph level. Throws in spell checking, word count etc. Neat work, but I’m not entirely sure who it’s aimed at — biologists would surely prefer this to be a Word plugin like the previous.
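The undefined-acronym check is the kind of thing that is easy to approximate. A minimal sketch, assuming (my assumption, not necessarily how PaperMaker does it) that an acronym counts as defined if it appears somewhere in parentheses, as in “protein-protein interaction (PPI)”:

```python
import re

def undefined_acronyms(text):
    """Return acronyms used in `text` that never appear as '(ACRONYM)'."""
    # Any run of two or more capitals is treated as an acronym.
    used = set(re.findall(r"\b[A-Z]{2,}\b", text))
    # An acronym in parentheses is taken to be its definition.
    defined = set(re.findall(r"\(([A-Z]{2,})\)", text))
    return sorted(used - defined)

# Made-up draft sentence for illustration.
draft = "Protein-protein interaction (PPI) data were mapped to GO terms using NER."
print(undefined_acronyms(draft))
```

This would flag GO and NER in the example, and would also produce false alarms on things like gene symbols, so a real tool presumably needs a whitelist or a dictionary lookup on top.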
More tomorrow.
Going back over notes and adding links as I have time.