Monday 30 Mar: “Reliable factual data from the literature based on ontological resources”
Highlight of the morning session was Junichi Tsujii’s demo of the PathText system, which integrates manually-curated pathway information in CellDesigner or SBML format with text-mined relationships, and lets you browse the pathway maps and drill through to related literature.
It’s not finished yet but there’s a preview video available from NaCTeM.
Also a bit of a preview of the BioNLP 2009 Shared Task on extracting biomolecular events from text into semantic networks — which I’m reviewing entries for at the moment.
Lots of material about gene regulation today: an intro to the Gene Regulation Ontology (Jung-Jae Kim), a couple of talks about extracting regulatory events from free text (Kim and Udo Hahn), and the ORegAnno project, which is using text mining to support its manual curation of regulatory events (Stephen Montgomery). The new(ish) GeneReg corpus will be useful to anyone building systems like this, as would the BioNLP 2009 data. I’ll find out if the latter is available to non-entrants.
Also a talk about populating/extending ontologies automatically from clinical reports (Wendy Chapman). Patterns like “NOUN_PHRASE_1, such as NOUN_PHRASE_2”. Simple and effective.
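To make the “such as” pattern concrete, here is a minimal sketch. It assumes an upstream chunker has already wrapped noun phrases in square brackets; that bracketed input format and the function name are my own inventions for illustration, not anything shown in the talk.

```python
import re

# ASSUMPTION: noun phrases arrive pre-bracketed by some chunker, e.g.
# "[respiratory symptoms], such as [cough]". This format is hypothetical.
SUCH_AS = re.compile(r"\[([^\]]+)\], such as \[([^\]]+)\]")

def extract_is_a(text):
    """Return (narrower term, broader term) pairs found via the pattern."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        broader, narrower = match.group(1), match.group(2)
        pairs.append((narrower, broader))
    return pairs

sentence = "The patient denied [respiratory symptoms], such as [cough]."
print(extract_is_a(sentence))
```

Each pair is a candidate "is a kind of" link that a curator could review before it goes into the ontology.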
Back from lunch… And sitting at the front so I can hear better. Hence better notes!
Simonetta Montemagni just gave an excellent introduction to the BioLexicon project (also NaCTeM-related), which is essentially a huge database of biomedical/biological language, including such things as domain-specific syntax and semantic metadata. Dead useful for text mining developers. It also contains a thesaurus of gene and protein names (inc. synonyms and variants) with links back to UniProt IDs, which makes it much more useful for general bioinformatics use.
It isn’t all available yet, and will be published via a linguistic data provider, bit vague about licensing! So it may or may not be free (I’m guessing free for academic use, commercial for other uses).
Lots of data and tools for natural language processing from Udo Hahn’s group: http://www.julielab.de/ … Plus some war stories about the difference in information extraction accuracy between ‘lab’ tests and real world performance, e.g. from ~60% (close to human levels) to ~20% F-score. Ouch… But we’ve all been there. (See also note about GeneReg above)
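For anyone who hasn't met the F-score: a quick sketch, assuming the balanced F1 variant (the harmonic mean of precision and recall). I don't know exactly which variant the quoted figures used, and the counts below are made up purely to show the arithmetic.

```python
def f_score(true_pos, false_pos, false_neg):
    """Balanced F1: harmonic mean of precision and recall."""
    if true_pos == 0:
        return 0.0  # avoids dividing by zero when nothing was found correctly
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# Invented counts, not data from the talk.
print(f_score(true_pos=6, false_pos=4, false_neg=4))
print(f_score(true_pos=2, false_pos=8, false_neg=8))
```

Because it is a harmonic mean, the score is dragged down by whichever of precision or recall is worse, so a system can't hide poor recall behind high precision.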
Su Jian talked about designing evaluation tasks for genomic information retrieval (i.e. search engine) algorithms, and improving said algorithms with dedicated gene/protein name recognizers. Bit specialized for me — lots of score functions I didn’t know the definitions of…
Quick coffee break!
Nice talk about the Experimental Factor Ontology from the ArrayExpress project (James Malone). This is for classifying experimental conditions in microarray experiments. They’ve gone to a lot of trouble to link their ontology into others as painlessly as possible, and have developed autonomous agents to trawl the semantic web for other ontologies that may be related, and to alert them when ontologies they link to change, as this might imply a link is no longer true. Cute. The EFO has also allowed them to offer federated queries with other databases, and they use it for sanity-checking the data people submit via reasoning rules — e.g. cardiovascular disease can’t occur in hair follicle cells.
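A toy illustration of that kind of sanity check, for the curious. This is not how EFO or ArrayExpress actually implement their rules; the dictionaries, names and the single rule are all hypothetical, and a real system would use an ontology reasoner rather than lookups.

```python
# Hypothetical mini knowledge base, invented for illustration only.
DISEASE_AFFECTS = {
    "cardiovascular disease": {"cardiovascular system"},
}
CELL_PART_OF = {
    "hair follicle cell": "integumentary system",
    "cardiomyocyte": "cardiovascular system",
}

def is_plausible(disease, cell_type):
    """Flag annotations where the cell type lies outside the systems a disease affects."""
    systems = DISEASE_AFFECTS.get(disease)
    location = CELL_PART_OF.get(cell_type)
    if systems is None or location is None:
        return True  # no rule covers this pair, so don't reject it
    return location in systems

print(is_plausible("cardiovascular disease", "hair follicle cell"))
print(is_plausible("cardiovascular disease", "cardiomyocyte"))
```

Letting unknown terms through by default is a deliberate choice in this sketch: a submission check should only complain when it has a rule that actually applies.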
UCSD (Lynn Fink) have written a very nice plugin for Word 2007 that watches your text as you type and automatically tags biomolecular database identifiers and terms from OBO ontologies when they appear — with the option to add/edit/remove/override manually of course, and the tags being preserved in Word’s XML files. Kind of like a spellchecker/thesaurus for semantic markup. I’m not a fan of word processors (give me LaTeX any day) but this is an excellent idea. Hopefully publishers and curators will be able to parse useful metadata out of the resulting files.
Some similar ideas from PaperMaker (Piotr Pezik), which also does semantic tagging, along with things like spotting missing references, acronyms that haven’t been defined, and genes/proteins that have been referred to by non-recommended identifiers. It can also trawl PubMed for similar publications, at the whole-doc or paragraph level. Throws in spell checking, word count etc. Neat work, but I’m not entirely sure who it’s aimed at: biologists would surely prefer this to be a Word plugin like the previous one.
Going back over notes and adding links as I have time.