New algorithm can separate unstructured text into topics with high accuracy and reproducibility
Much of our reams of data sit in large databases of unstructured text. Finding insights among emails, text documents, and websites is extremely difficult unless we can search, characterize, and classify their text data in a meaningful way.
One of the leading big data algorithms for finding related topics within unstructured text (an area called topic modeling) is latent Dirichlet allocation (LDA). But when Northwestern University professor Luis Amaral set out to test LDA, he found that it was neither as accurate nor reproducible as a leading topic modeling algorithm should be.
Using his network analysis background, Amaral, professor of chemical and biological engineering in Northwestern's McCormick School of Engineering and Applied Science, developed a new topic modeling algorithm that has shown very high accuracy and reproducibility during tests. His results, published with co-author Konrad Kording, associate professor of physical medicine and rehabilitation, physiology, and applied mathematics at Northwestern, were published Jan. 29 in Physical Review X.
Topic modeling algorithms take unstructured text and find a set of topics that can be used to describe each document in the set. They are the workhorses of big data science, used as the foundation for recommendation systems, spam filtering, and digital image processing. The LDA topic modeling algorithm was developed in 2003 and has been widely used for academic research and for commercial applications, like search engines.
When Amaral explored how LDA worked, he found that the algorithm produced different results each time for the same set of data, and it often did so inaccurately. Amaral and his group tested LDA by running it on documents they created that were written in English, French, Spanish, and other languages. By doing this, they were able to prevent text overlap among documents.
"In this simple case, the algorithm should be able to perform at 100 percent accuracy and reproducibility," he said. But when LDA was used, it separated these documents into similar groups with only 90 percent accuracy and 80 percent reproducibility. "While these numbers may appear to be good, they are actually very poor, since they are for an exceedingly easy case," Amaral said.
To create a better algorithm, Amaral took a network approach. The result, called TopicMapping, begins by preprocessing data to replace words with their stem (so "star" and "stars" would be considered the same word). It then builds a network of connecting words and identifies a "community" of related words (just as one could look for communities of people in Facebook). The words within a given community define a topic.
The algorithm was able to perfectly separate the documents according to language and was able to reproduce its results. It also had high accuracy and reproducibility when separating 23,000 scientific papers and 1.2 million Wikipedia articles by topic.
These results show the need for more testing of big data algorithms and more research into making them more accurate and reproducible, Amaral said.
"Companies that make products must show that their products work," he said. "They must be certified. There is no such case for algorithms. We have a lot of uninformed consumers of big data algorithms that are using tools that haven't been tested for reproducibility and accuracy."
Megan Fellman | EurekAlert!
Stanford researchers create new special-purpose computer that may someday save us billions
21.10.2016 | Stanford University
New 3-D wiring technique brings scalable quantum computers closer to reality
19.10.2016 | University of Waterloo
Researchers from the Institute for Quantum Computing (IQC) at the University of Waterloo led the development of a new extensible wiring technique capable of controlling superconducting quantum bits, representing a significant step towards to the realization of a scalable quantum computer.
"The quantum socket is a wiring method that uses three-dimensional wires based on spring-loaded pins to address individual qubits," said Jeremy Béjanin, a PhD...
In a paper in Scientific Reports, a research team at Worcester Polytechnic Institute describes a novel light-activated phenomenon that could become the basis for applications as diverse as microscopic robotic grippers and more efficient solar cells.
A research team at Worcester Polytechnic Institute (WPI) has developed a revolutionary, light-activated semiconductor nanocomposite material that can be used...
By forcefully embedding two silicon atoms in a diamond matrix, Sandia researchers have demonstrated for the first time on a single chip all the components needed to create a quantum bridge to link quantum computers together.
"People have already built small quantum computers," says Sandia researcher Ryan Camacho. "Maybe the first useful one won't be a single giant quantum computer...
COMPAMED has become the leading international marketplace for suppliers of medical manufacturing. The trade fair, which takes place every November and is co-located to MEDICA in Dusseldorf, has been steadily growing over the past years and shows that medical technology remains a rapidly growing market.
In 2016, the joint pavilion by the IVAM Microtechnology Network, the Product Market “High-tech for Medical Devices”, will be located in Hall 8a again and will...
'Ferroelectric' materials can switch between different states of electrical polarization in response to an external electric field. This flexibility means they show promise for many applications, for example in electronic devices and computer memory. Current ferroelectric materials are highly valued for their thermal and chemical stability and rapid electro-mechanical responses, but creating a material that is scalable down to the tiny sizes needed for technologies like silicon-based semiconductors (Si-based CMOS) has proven challenging.
Now, Hiroshi Funakubo and co-workers at the Tokyo Institute of Technology, in collaboration with researchers across Japan, have conducted experiments to...
14.10.2016 | Event News
14.10.2016 | Event News
12.10.2016 | Event News
21.10.2016 | Health and Medicine
21.10.2016 | Information Technology
21.10.2016 | Materials Sciences