The future of search engines

07.08.2017

Researchers combine artificial intelligence, crowdsourcing and supercomputers to develop better, and more reasoned, information extraction and classification methods

How do search engines generate lists of relevant links?


WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short definitions and usage examples, and records a number of relations among these synonym sets or their members. Researchers from The University of Texas at Austin developed a method to incorporate information from WordNet into information retrieval systems.

Credit: Visuwords

The results are the product of two powerful forces in the evolution of information retrieval: artificial intelligence -- especially natural language processing -- and crowdsourcing.

Computer algorithms interpret the relationship between the words we type and the vast number of possible web pages based on the frequency of linguistic connections in the billions of texts on which the system has been trained.
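
As a toy illustration of this kind of frequency-based matching (far simpler than anything a production search engine does), a handful of candidate pages can be ranked against a query by comparing their word statistics, here with TF-IDF weights and cosine similarity from the scikit-learn library:

```python
# Toy illustration of frequency-based relevance ranking: score pages by how
# strongly their word statistics match the query. Real engines use far richer
# signals; this only shows the basic statistical-matching idea.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pages = [
    "anemia treatment outcomes in a randomized clinical trial",
    "movie review: a heartfelt drama with a strong cast",
    "named entity recognition for breaking news stories",
]
query = "clinical trial for anemia"

vectorizer = TfidfVectorizer()
page_vectors = vectorizer.fit_transform(pages)   # learn word weights from the corpus
query_vector = vectorizer.transform([query])
scores = cosine_similarity(query_vector, page_vectors)[0]

# rank pages by similarity to the query, best match first
for score, page in sorted(zip(scores, pages), reverse=True):
    print(f"{score:.2f}  {page}")
```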

But that is not the only source of information. The semantic relationships get strengthened by professional annotators who hand-tune results -- and the algorithms that generate them -- for topics of importance, and by web searchers (us) who, in our clicks, tell the algorithms which connections are the best ones.

Despite the incredible, world-changing success of this model, it has its flaws. Search engine results are often not as "smart" as we'd like them to be, lacking a true understanding of language and human logic. Beyond that, they sometimes replicate and deepen the biases embedded in our searches, rather than bringing us new information or insight.

Matthew Lease, an associate professor in the School of Information at The University of Texas at Austin (UT Austin), believes there may be better ways to harness the dual power of computers and human minds to create more intelligent information retrieval (IR) systems.

Combining AI with the insights of annotators and the information encoded in domain-specific resources, he and his collaborators are developing new approaches to IR that will benefit general search engines, as well as niche ones like those for medical knowledge or non-English texts.

This week, at the Annual Meeting of the Association for Computational Linguistics in Vancouver, Canada, Lease and collaborators from UT Austin and Northeastern University presented two papers describing their novel IR systems. Their research leverages the supercomputing resources at the Texas Advanced Computing Center [https://www.tacc.utexas.edu/], one of the leading supercomputing research centers in the world.

ANNOTATOR CONSENSUS AND ATTRIBUTIONS PROVIDE RATIONALES FOR SEARCH RESULTS

In one paper, led by Ph.D. student An Nguyen, they presented a method that combines input from multiple annotators to determine the best overall annotation for a given text. They applied this method to two problems: analyzing free-text research articles describing medical studies to extract details of each study (e.g., the condition, patient demographics, treatments, and outcomes), and recognizing named entities -- analyzing breaking news stories to identify the events, people, and places involved.

"An important challenge in natural language processing is accurately finding important information contained in free-text, which lets us extract into databases and combine it with other data in order to make more intelligent decisions and new discoveries," Lease said. "We've been using crowdsourcing to annotate medical and news articles at scale so that our intelligent systems will be able to more accurately find the key information contained in each article."

Such annotation has traditionally been performed in-house by domain experts. More recently, however, crowdsourcing has become a popular way to acquire large labeled datasets at lower cost. Predictably, annotations from laypeople are of lower quality than those from domain experts, so it is necessary to estimate the reliability of crowd annotators and to aggregate their individual annotations into a single set of "reference standard" consensus labels.
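
A minimal sketch of that kind of aggregation, in the spirit of classic EM-style models such as Dawid and Skene (and not the more sophisticated method in the paper), alternates between taking a reliability-weighted vote on each item and re-estimating each worker's reliability from agreement with the current consensus:

```python
# Simplified sketch (not the paper's model): merge noisy crowd labels into
# consensus labels while estimating each worker's reliability, via an
# EM-style loop. Binary labels for brevity.
from collections import defaultdict

def aggregate(labels, n_iter=20):
    """labels: list of (item_id, worker_id, label) triples, label in {0, 1}."""
    workers = {w for _, w, _ in labels}
    items = {i for i, _, _ in labels}
    quality = {w: 0.8 for w in workers}   # initial guess: workers are mostly reliable
    consensus = {}

    for _ in range(n_iter):
        # E-step: reliability-weighted vote for each item's label
        votes = defaultdict(float)
        for item, worker, label in labels:
            votes[(item, label)] += quality[worker]
        for item in items:
            consensus[item] = max((votes[(item, lab)], lab) for lab in (0, 1))[1]

        # M-step: re-estimate each worker's reliability as agreement with consensus
        agree, total = defaultdict(float), defaultdict(float)
        for item, worker, label in labels:
            total[worker] += 1
            agree[worker] += float(label == consensus[item])
        quality = {w: agree[w] / total[w] for w in workers}

    return consensus, quality

# Toy usage: workers "a" and "b" agree; "c" disagrees with them on both items.
crowd = [(1, "a", 1), (1, "b", 1), (1, "c", 0),
         (2, "a", 0), (2, "b", 0), (2, "c", 1)]
consensus, quality = aggregate(crowd)
print(consensus)  # item 1 -> 1, item 2 -> 0
print(quality)    # "a" and "b" near 1.0, "c" near 0.0
```

The per-worker reliability estimates that fall out of such a loop are the kind of signal, described below, that can support error analysis and task routing.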

Lease's team found that their method was able to train a neural network -- a form of AI modeled on the human brain -- so it could very accurately predict named entities and extract relevant information in unannotated texts. The new method improved upon existing tagging and training methods.

The method also provides an estimate of each worker's label quality, which can be transferred between tasks and is useful for error analysis and intelligently routing tasks -- identifying the best person to annotate each particular text.

LEVERAGING EXISTING KNOWLEDGE TO CREATE BETTER NEURAL MODELS

The group's second paper, led by Ph.D. student Ye Zhang, addressed the fact that neural models for natural language processing (NLP) often ignore existing resources like WordNet -- a lexical database for the English language that groups words into sets of synonyms -- or domain-specific ontologies, such as the Unified Medical Language System, which encode knowledge about a given field.
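
For a concrete sense of what such a resource encodes, WordNet can be queried through NLTK's Python interface (an illustration only; it assumes the WordNet corpus has been downloaded with nltk.download('wordnet')):

```python
# Illustrative only: list WordNet's synonym sets for a term, with definitions,
# synonyms, and broader (hypernym) concepts.
from nltk.corpus import wordnet as wn

for syn in wn.synsets("anemia"):
    print(syn.name(), "-", syn.definition())
    print("  synonyms:", [lemma.name() for lemma in syn.lemmas()])
    print("  broader terms:", [h.name() for h in syn.hypernyms()])
```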

They proposed a method for exploiting these existing linguistic resources via weight sharing to improve NLP models for automatic text classification. For example, their model learns to classify whether or not published medical articles describing clinical trials are relevant to a well-specified clinical question.

In weight sharing, words that are similar share some fraction of a weight, or assigned numerical value. Weight sharing constrains the number of free parameters that a system must learn, thereby increasing the efficiency and accuracy of the neural model, and serving as a flexible way to incorporate prior knowledge. In doing so, they combine the best of human knowledge with machine learning.

"Neural network models have tons of parameters and need lots of data to fit them," said Lease. "We had this idea that if you could somehow reason about some words being related to other words a priori, then instead of having to have a parameter for each one of those word separately, you could tie together the parameters across multiple words and in that way need less data to learn the model. It would realize the benefits of deep learning without large data constraints."

They applied a form of weight sharing to a sentiment analysis of movie reviews and to a biomedical search related to anemia. Their approach consistently yielded improved performance on classification tasks compared to strategies that did not exploit weight sharing.

"This provides a general framework for codifying and exploiting domain knowledge in data-driven neural network models," say Byron Wallace, Lease's collaborator from Northeastern University. (Wallace was formerly also a faculty member at UT Austin and became a frequent user of TACC as well.)

Lease, Wallace and their collaborators used the GPUs (graphics processing units) on the Maverick supercomputer at TACC to enable their analyses and train the machine learning system.

"Training neural computing models for big data takes a lot of compute time," Lease said. "That's where TACC fits in as a wonderful resource, not only because of the great storage that's available, but also the large number of nodes and the high processing speeds available for training neural models."

In addition to GPUs, TACC deploys cutting-edge processing architectures developed by Intel with which the machine learning libraries are still playing catch-up, according to Lease.

"Though many deep learning libraries have been highly optimized for processing on GPUs, there's reason to think that these other architectures will be faster in the long term once they've been optimized as well," he said.

"With the introduction of Stampede2 and its many core infrastructure, we are glad to see more optimization of CPU-based machine learning frameworks," said Niall Gaffney, Director of Data Intensive Computing at TACC. "Project like Matt's show the power of machine learning in both measured and simulated data analysis."

Gaffney says that in TACC's initial work with Caffe -- a deep learning framework developed at the University of California, Berkeley, which has been optimized by Intel for Xeon Phi processors -- they are finding that CPUs deliver roughly equivalent performance to GPUs for many AI jobs.

"This can be transformative as it allows us to offer more nodes that can satisfy these researchers as well as allowing HPC users to leverage AI in their analysis phases, without having to move to a different GPU enabled system," he said.

By improving core natural language processing technologies for automatic information extraction and text classification, this research can help the web search engines built on those technologies continue to improve.

Lease has received grants from the National Science Foundation (NSF), the Institute of Museum and Library Services (IMLS) and the Defense Advanced Research Projects Agency (DARPA) to improve the quality of crowdsourcing across a variety of tasks, scales, and settings. He says that though commercial web search companies invest a lot of resources to develop practical, effective solutions, the demands of industry lead them to focus on problems with commercial application and short-term solutions.

"Industry is great at looking at near-term things, but they don't have the same freedom as academic researchers to pursue research ideas that are higher risk but could be more transformative in the long-term," Lease said. "This is where we benefit from public investment for powering discoveries. Resources like TACC are incredibly empowering for researchers in enabling us to pursue high-risk, potentially transformative research."

Media Contact

Aaron Dubrow
aarondubrow@tacc.utexas.edu
512-820-5785

 @TACC

http://www.tacc.utexas.edu/ 

Aaron Dubrow | EurekAlert!
