New algorithm can separate unstructured text into topics with high accuracy and reproducibility
Much of our reams of data sit in large databases of unstructured text. Finding insights among emails, text documents, and websites is extremely difficult unless we can search, characterize, and classify their text data in a meaningful way.
One of the leading big data algorithms for finding related topics within unstructured text (an area called topic modeling) is latent Dirichlet allocation (LDA). But when Northwestern University professor Luis Amaral set out to test LDA, he found that it was neither as accurate nor reproducible as a leading topic modeling algorithm should be.
Using his network analysis background, Amaral, professor of chemical and biological engineering in Northwestern's McCormick School of Engineering and Applied Science, developed a new topic modeling algorithm that has shown very high accuracy and reproducibility during tests. His results, published with co-author Konrad Kording, associate professor of physical medicine and rehabilitation, physiology, and applied mathematics at Northwestern, were published Jan. 29 in Physical Review X.
Topic modeling algorithms take unstructured text and find a set of topics that can be used to describe each document in the set. They are the workhorses of big data science, used as the foundation for recommendation systems, spam filtering, and digital image processing. The LDA topic modeling algorithm was developed in 2003 and has been widely used for academic research and for commercial applications, like search engines.
When Amaral explored how LDA worked, he found that the algorithm produced different results each time for the same set of data, and it often did so inaccurately. Amaral and his group tested LDA by running it on documents they created that were written in English, French, Spanish, and other languages. By doing this, they were able to prevent text overlap among documents.
"In this simple case, the algorithm should be able to perform at 100 percent accuracy and reproducibility," he said. But when LDA was used, it separated these documents into similar groups with only 90 percent accuracy and 80 percent reproducibility. "While these numbers may appear to be good, they are actually very poor, since they are for an exceedingly easy case," Amaral said.
To create a better algorithm, Amaral took a network approach. The result, called TopicMapping, begins by preprocessing data to replace words with their stem (so "star" and "stars" would be considered the same word). It then builds a network of connecting words and identifies a "community" of related words (just as one could look for communities of people in Facebook). The words within a given community define a topic.
The algorithm was able to perfectly separate the documents according to language and was able to reproduce its results. It also had high accuracy and reproducibility when separating 23,000 scientific papers and 1.2 million Wikipedia articles by topic.
These results show the need for more testing of big data algorithms and more research into making them more accurate and reproducible, Amaral said.
"Companies that make products must show that their products work," he said. "They must be certified. There is no such case for algorithms. We have a lot of uninformed consumers of big data algorithms that are using tools that haven't been tested for reproducibility and accuracy."
Megan Fellman | EurekAlert!
Information integration and artificial intelligence for better diagnosis and therapy decisions
24.05.2017 | Fraunhofer MEVIS - Institut für Bildgestützte Medizin
World's thinnest hologram paves path to new 3-D world
18.05.2017 | RMIT University
Staphylococcus aureus is a feared pathogen (MRSA, multi-resistant S. aureus) due to frequent resistances against many antibiotics, especially in hospital infections. Researchers at the Paul-Ehrlich-Institut have identified immunological processes that prevent a successful immune response directed against the pathogenic agent. The delivery of bacterial proteins with RNA adjuvant or messenger RNA (mRNA) into immune cells allows the re-direction of the immune response towards an active defense against S. aureus. This could be of significant importance for the development of an effective vaccine. PLOS Pathogens has published these research results online on 25 May 2017.
Staphylococcus aureus (S. aureus) is a bacterium that colonizes by far more than half of the skin and the mucosa of adults, usually without causing infections....
Physicists from the University of Würzburg are capable of generating identical looking single light particles at the push of a button. Two new studies now demonstrate the potential this method holds.
The quantum computer has fuelled the imagination of scientists for decades: It is based on fundamentally different phenomena than a conventional computer....
An international team of physicists has monitored the scattering behaviour of electrons in a non-conducting material in real-time. Their insights could be beneficial for radiotherapy.
We can refer to electrons in non-conducting materials as ‘sluggish’. Typically, they remain fixed in a location, deep inside an atomic composite. It is hence...
Two-dimensional magnetic structures are regarded as a promising material for new types of data storage, since the magnetic properties of individual molecular building blocks can be investigated and modified. For the first time, researchers have now produced a wafer-thin ferrimagnet, in which molecules with different magnetic centers arrange themselves on a gold surface to form a checkerboard pattern. Scientists at the Swiss Nanoscience Institute at the University of Basel and the Paul Scherrer Institute published their findings in the journal Nature Communications.
Ferrimagnets are composed of two centers which are magnetized at different strengths and point in opposing directions. Two-dimensional, quasi-flat ferrimagnets...
An Australian-Chinese research team has created the world's thinnest hologram, paving the way towards the integration of 3D holography into everyday...
24.05.2017 | Event News
23.05.2017 | Event News
22.05.2017 | Event News
26.05.2017 | Life Sciences
26.05.2017 | Life Sciences
26.05.2017 | Physics and Astronomy