New algorithm can separate unstructured text into topics with high accuracy and reproducibility
Much of our data sits in large databases of unstructured text. Finding insights among emails, text documents, and websites is extremely difficult unless we can search, characterize, and classify their text in a meaningful way.
One of the leading big data algorithms for finding related topics within unstructured text (an area called topic modeling) is latent Dirichlet allocation (LDA). But when Northwestern University professor Luis Amaral set out to test LDA, he found that it was neither as accurate nor as reproducible as a leading topic modeling algorithm should be.
Using his network analysis background, Amaral, professor of chemical and biological engineering in Northwestern's McCormick School of Engineering and Applied Science, developed a new topic modeling algorithm that has shown very high accuracy and reproducibility during tests. His results, obtained with co-author Konrad Kording, associate professor of physical medicine and rehabilitation, physiology, and applied mathematics at Northwestern, were published Jan. 29 in Physical Review X.
Topic modeling algorithms take unstructured text and find a set of topics that can be used to describe each document in the set. They are the workhorses of big data science, used as the foundation for recommendation systems, spam filtering, and digital image processing. The LDA topic modeling algorithm was developed in 2003 and has been widely used for academic research and for commercial applications, like search engines.
When Amaral explored how LDA worked, he found that the algorithm produced different results each time it was run on the same set of data, and that those results were often inaccurate. Amaral and his group tested LDA by running it on documents they created that were written in English, French, Spanish, and other languages. Using documents in different languages allowed them to prevent text overlap among documents.
"In this simple case, the algorithm should be able to perform at 100 percent accuracy and reproducibility," he said. But when LDA was used, it separated these documents into similar groups with only 90 percent accuracy and 80 percent reproducibility. "While these numbers may appear to be good, they are actually very poor, since they are for an exceedingly easy case," Amaral said.
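The article does not say how reproducibility was scored. As one hedged illustration, the Python sketch below compares two runs of a grouping algorithm by the fraction of document pairs on which they agree (grouped together in both runs, or apart in both). The function name pair_agreement and the example label lists are hypothetical and are not taken from the study.

```python
from itertools import combinations


def pair_agreement(labels_a, labels_b):
    """Fraction of document pairs on which two runs agree.

    Two runs agree on a pair if both put the two documents in the same
    group, or both put them in different groups. Group names themselves
    do not matter, only which documents end up together.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both runs must label the same documents")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0  # zero or one document: nothing to disagree about
    agreeing = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreeing / len(pairs)


# Hypothetical group labels for four documents from three separate runs.
run_1 = [0, 0, 1, 1]
run_2 = [1, 1, 0, 0]  # same grouping as run_1, only the names differ
run_3 = [0, 1, 1, 1]  # a different grouping

same = pair_agreement(run_1, run_2)
different = pair_agreement(run_1, run_3)
print(same, different)
```

A score of 1.0 means two runs grouped the documents identically. In the language test described above, a fully reproducible algorithm would reach that score on every pair of runs.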
To create a better algorithm, Amaral took a network approach. The result, called TopicMapping, begins by preprocessing data to replace words with their stems (so "star" and "stars" would be considered the same word). It then builds a network of connected words and identifies "communities" of related words (just as one could look for communities of people on Facebook). The words within a given community define a topic.
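The steps described above can be illustrated with a toy Python sketch. It is not the TopicMapping implementation: the stemmer here only strips a trailing "s", words are linked whenever they share a document, and connected components stand in for community detection, which the article does not specify. All function names and sample documents are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def stem(word):
    # Toy stemmer (assumption): drop a trailing "s" so "stars" -> "star".
    w = word.lower()
    return w[:-1] if len(w) > 3 and w.endswith("s") else w


def build_word_network(documents):
    # Link two stems whenever they appear in the same document.
    graph = defaultdict(set)
    for doc in documents:
        stems = {stem(w) for w in doc.split()}
        for s in stems:
            graph.setdefault(s, set())  # keep words that have no neighbors
        for a, b in combinations(sorted(stems), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def find_communities(graph):
    # Stand-in for community detection: connected components of the network.
    seen = set()
    communities = []
    for start in sorted(graph):
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        communities.append(component)
    return communities


# Hypothetical documents in two languages with no shared words.
docs = ["stars shine bright", "the star shines", "le chat dort", "chat noir dort"]
communities = find_communities(build_word_network(docs))
for topic in communities:
    print(sorted(topic))
```

Because the English and French documents share no words, the network splits into two groups, one per language, which mirrors the simple test case the researchers used. Real text has overlapping vocabulary, so a practical method needs a community detection step that can separate densely linked groups inside one connected network.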
The algorithm was able to perfectly separate the documents according to language and was able to reproduce its results. It also had high accuracy and reproducibility when separating 23,000 scientific papers and 1.2 million Wikipedia articles by topic.
These results show the need for more testing of big data algorithms and more research into making them more accurate and reproducible, Amaral said.
"Companies that make products must show that their products work," he said. "They must be certified. There is no such case for algorithms. We have a lot of uninformed consumers of big data algorithms that are using tools that haven't been tested for reproducibility and accuracy."
Megan Fellman | EurekAlert!