Tue 31 Mar: “Semantic Enrichment of the literature for the benefit of all users”
(Monday’s notes are here)
Missed the early morning session. I don’t work in pharma any more so it didn’t seem worth a 5:45am wake-up (unhelpful train times). Although apparently Eric Neumann’s talk on linked data was good (“semantic web without the ‘semantic’” — Duncan)
Alfonso Valencia — ELIXIR — an EU project to upgrade Europe’s bioinformatics infrastructure. Includes a work package on literature integration — making lit. repositories, ontologies and traditional biological databases interoperate better. Good — too much text mining happens in isolation from the rest of the bioinformatics world. Targeted at wet-lab scientists not just computational people. Looks like it might include an effort to turn raw algorithms into usable tools/platforms. Still in the early phases.
He also discussed the BioCreative project which has released various data sets and held challenges on several aspects of text mining. A spin-off from these is the BioCreative MetaServer which identifies genes and proteins mentioned in text by aggregating predictions from several prediction services.
Dietrich Rebholz-Schuhmann — UKPMC — a UK mirror of PubMed Central (with added value apparently) co-ordinated by the British Library. Working on information retrieval and data integration improvements. Sounds like the funding bodies are getting involved, many referring specifically to UKPMC in their open access policies. Paying for OA journal submissions is an issue. Apparently the Wellcome Trust have an OA fund which is under-utilized.
Also, CALBC — a project to semantically annotate a large biomedical corpus (named entities only?) by getting a consensus annotation from iteratively integrating the output of various information extraction systems, and then manually cleaning up the disagreements.
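To make the consensus idea concrete, here is a minimal sketch of one integration pass. It is not CALBC's actual method: exact-match voting, the `min_votes` threshold, the `consensus` helper and the three example systems are all assumptions for illustration. Annotations that enough systems agree on are accepted, and the rest are set aside for manual clean-up.

```python
from collections import Counter

def consensus(annotations, min_votes):
    """Split candidate annotations into agreed and disputed sets.

    annotations: one set per system; each item is a (start, end, label) tuple.
    min_votes: how many systems must propose an item for it to count as agreed.
    (Hypothetical helper, not part of any CALBC tooling.)
    """
    # Each system contributes at most one vote per item because it is a set.
    votes = Counter(item for system in annotations for item in system)
    agreed = {item for item, n in votes.items() if n >= min_votes}
    disputed = set(votes) - agreed
    return agreed, disputed

# Made-up output from three systems over the same sentence.
system_a = {(0, 4, "GENE"), (10, 17, "DISEASE")}
system_b = {(0, 4, "GENE"), (10, 17, "GENE")}
system_c = {(0, 4, "GENE"), (10, 17, "DISEASE"), (20, 25, "SPECIES")}

agreed, disputed = consensus([system_a, system_b, system_c], min_votes=2)
```

Here `agreed` holds the two annotations backed by at least two systems, and `disputed` holds the conflicting label and the lone species mention, which is the pile a human curator would look at.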
Stefano Bertolo (EU) — funding calls — deadline 3rd November…
(Great analogy: Human history has entered a phase where we can produce information by machine quicker than we can interpret it. What we need is ‘cognitive levers’.)
7th framework, SO 4.3, Call 5, themes:
- Capturing tractable information
- Delivering pertinent information
- Collaboration and decision support
- Personal sphere
- Impact and science & tech leadership
They all sound a bit vague and buzzwordy without the explanations…
Key themes: large data sets and (close to) real-time processing. Requirement for robust, strongly-tested tools that can be distributed — not just ‘only works on the PC of the postdoc that wrote it, on a good day’ :-)
Informal queries about proposals: infso-e2 at ec.europa.eu
Keynote from UMLS guru Olivier Bodenreider on normalizing terms/concepts across different lexical/taxonomical/ontological resources. Lexical vs. semantic approaches — e.g. string munging vs. traversing known relationships. Latter complicated by fact that some pairs of concepts are synonymous in one resource and hyponymous in another. Also, semantic similarity — lowest common subsumer/definition by extension, e.g. famous Resnik measure.
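As a reminder of how the Resnik measure works: the similarity of two concepts is the information content, -log p(c), of their most informative common subsumer, where p(c) is the probability of meeting c or any of its descendants in a corpus. A small sketch follows; the toy taxonomy and the counts are invented for illustration and do not come from the talk or from the UMLS.

```python
import math

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "entity": None,
    "protein": "entity",
    "disease": "entity",
    "kinase": "protein",
    "phosphatase": "protein",
}

# Made-up corpus counts of direct mentions of each concept.
COUNTS = {"entity": 0, "protein": 10, "disease": 50, "kinase": 30, "phosphatase": 10}

def ancestors(concept):
    """Return the concept followed by all of its ancestors up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain

def probability(concept):
    """p(c): share of all mentions falling on c or any of its descendants."""
    total = sum(COUNTS.values())
    covered = sum(n for c, n in COUNTS.items() if concept in ancestors(c))
    return covered / total

def information_content(concept):
    """IC(c) = -log p(c); rarer (more specific) concepts score higher."""
    return -math.log(probability(concept))

def resnik(c1, c2):
    """Information content of the most informative common subsumer."""
    common = set(ancestors(c1)) & set(ancestors(c2))
    return max(information_content(c) for c in common)
```

With these numbers, `resnik("kinase", "phosphatase")` is the information content of "protein" (p = 0.5), while `resnik("kinase", "disease")` is zero because the only shared subsumer is the root.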
Also mentioned BioPortal, not sure exactly how this differs from the UMLS in scope, probably more biological than medical? Must be overlap though.
These are forming a key part of CALBC (see above).
Sophia Ananiadou from NaCTeM: NLP view of semantic enrichment, moving from terms and named entities, to concepts, to events and relationships. TerMine and Acromine: extraction of terms and acronyms. Accelerated annotation methods (cunning). More on the importance of building proper tools rather than just prototypes/in-house algorithms. Glad the NLP scene is catching on to this. Hopefully they allow querying by unique accession rather than just names; this is another area where the NLP people don’t always understand what the bio people need.
She discussed some of NaCTeM’s flagship tools like MEDIE, FACTA and KLEIO — it does look like they’re starting to take all the pain out of text mining, by doing the difficult bits for us, so we can use the results to do actual mining. Also they are offering web service interfaces (‘overdue’ for some of them according to Sophia) — excellent news.
More from Udo: what do we mean by ‘semantics’? Mixed-bag talk. Problems with folksonomies/tag clouds, e.g. Flickr: “newyork”, “newyorkcity”, “nyc”, “new”, “york”. Biomedical lexicon an order of magnitude bigger than general English lexicon (based on WordNet and typical competent speaker). Wow. Domain dictionaries like GO/UMLS: these inherit some of the problems of natural language because the terms themselves are stated/defined in natural language! Also, ontological relations are often vague/underspecified/changing.
Anita de Waard (Elsevier): FEBS Letters structured digital abstracts experiment (author-provided PPI annotation). 75% author compliance, avg 1 hour per abstract. They’ve moved responsibility to the MINT curators instead of the authors, to increase compliance and efficiency! What does that tell us… Also mentioned OKKAM, a consortium trying to provide a UID for every single entity on the web. Umm… Holy crap. So far, 1.5 million entities covered, so they have a bit of a way to go, to say the least. She went on to discuss some aspects of discourse analysis of scientific text. Interesting point: hedging gets eroded by citation. “These results suggest that” becomes “author X shows that”, which becomes just a cited fact.
She also discussed the Elsevier Grand Challenge — what’s the most interesting thing you can do with half a million full text articles? Finalists have been chosen, the winners will be announced next month. Next year: Future of Research Communication conference on same themes, probably March at Harvard.
EU-ADR (Erik van Mulligen) — federated data mining/text mining/epidemiological analysis to discover & monitor novel adverse drug reactions. Five-year pan-European project. Sounds like an enormous piece of work with lots of engineering challenges — anonymization etc. for a start.
WikiGenes (Robert Hoffmann) — a wiki for genes, chemicals, MeSH terms obviously — but pre-seeded with sentences yanked from iHOP. So experts can step in and add/fix stuff but without the momentum barrier to getting started. ‘Narcissistic drive’ for authors of missed papers to add their own — cunning. Custom engine based on Apache Cocoon and Lucene. Authorship tracking down to individual strings of text, and it’s easy to view this information. The idea is that scientists will want to add their own work and get credit for it.
He makes the point that this is in many ways a much better way to publish biological information than several thousand different journals, and gives much better influence metrics than impact factors and H-index etc.
Is it in direct competition with WikiProteins? Not according to Robert: that’s more about knowledge engineering and formal semantic relationships, machine-readable stuff, whereas this is supposed to be more of a modern publishing medium for human-readable information. Which hopefully the biologists will take to more readily.