Building Trustworthy Big Data Algorithms for Insightful Analysis

Much of the data we collect sits in large databases of unstructured text. Finding insights among emails, text documents, and websites is extremely difficult unless we can search, characterize, and classify their text in a meaningful way.

One of the leading big data algorithms for finding related topics within unstructured text (an area called topic modeling) is latent Dirichlet allocation (LDA). But when Northwestern University professor Luis Amaral set out to test LDA, he found that it was neither as accurate nor as reproducible as a leading topic modeling algorithm should be.

Drawing on his background in network analysis, Amaral, professor of chemical and biological engineering in Northwestern's McCormick School of Engineering and Applied Science, developed a new topic modeling algorithm that has shown very high accuracy and reproducibility in tests. His results, obtained with co-author Konrad Kording, associate professor of physical medicine and rehabilitation, physiology, and applied mathematics at Northwestern, were published Jan. 29 in Physical Review X.

Topic modeling algorithms take unstructured text and find a set of topics that can be used to describe each document in the set. They are the workhorses of big data science, used as the foundation for recommendation systems, spam filtering, and digital image processing. The LDA topic modeling algorithm was developed in 2003 and has been widely used for academic research and for commercial applications, like search engines.

When Amaral explored how LDA worked, he found that the algorithm produced different results each time it was run on the same set of data, and that those results were often inaccurate. Amaral and his group tested LDA by running it on documents they created that were written in English, French, Spanish, and other languages. Keeping each document in a single language allowed them to prevent text overlap among documents in different languages.
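The test setup described above can be illustrated with a small sketch. The vocabularies, document sizes, and function names below are assumptions made for illustration; the article does not say which words or how many documents the group used.

```python
import random

# Hypothetical mini-vocabularies (accents omitted for simplicity);
# the article does not list the words the researchers used.
VOCAB = {
    "english": ["house", "water", "light", "tree", "street"],
    "french": ["maison", "eau", "lumiere", "arbre", "rue"],
    "spanish": ["casa", "agua", "luz", "arbol", "calle"],
}


def make_corpus(docs_per_language, words_per_doc, seed=0):
    """Return (documents, labels); each document draws words from one language only."""
    rng = random.Random(seed)
    documents, labels = [], []
    for language, words in VOCAB.items():
        for _ in range(docs_per_language):
            documents.append([rng.choice(words) for _ in range(words_per_doc)])
            labels.append(language)
    return documents, labels


def vocabularies_overlap(vocab):
    """Return True if any word appears in more than one language's vocabulary."""
    seen = set()
    for words in vocab.values():
        if seen & set(words):
            return True
        seen |= set(words)
    return False
```

Because no word is shared between the vocabularies, the correct grouping is known in advance, which is what makes a corpus like this a useful baseline test for any topic modeling algorithm.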

“In this simple case, the algorithm should be able to perform at 100 percent accuracy and reproducibility,” he said. But when LDA was used, it separated these documents into similar groups with only 90 percent accuracy and 80 percent reproducibility. “While these numbers may appear to be good, they are actually very poor, since they are for an exceedingly easy case,” Amaral said.
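The article does not say how accuracy and reproducibility were computed. One common way to score both, shown here purely as an assumption, is pairwise agreement between two groupings (the Rand index): accuracy compares a run against the known languages, and reproducibility compares two runs against each other. The function name and example labels are illustrative.

```python
from itertools import combinations


def pair_agreement(labels_a, labels_b):
    """Fraction of document pairs on which two groupings agree about
    'same group' versus 'different group' (the Rand index)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same documents")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Illustrative data: four documents, two languages, two algorithm runs.
truth = ["en", "en", "fr", "fr"]
run_1 = [0, 0, 1, 1]  # groups the documents exactly by language
run_2 = [0, 1, 1, 1]  # puts one English document in the wrong group

accuracy_run_1 = pair_agreement(truth, run_1)   # 1.0
accuracy_run_2 = pair_agreement(truth, run_2)   # 0.5
reproducibility = pair_agreement(run_1, run_2)  # 0.5
```

Scoring pairs rather than raw labels means the measure does not depend on how each run happens to number its groups.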

To create a better algorithm, Amaral took a network approach. The result, called TopicMapping, begins by preprocessing data to replace words with their stems (so “star” and “stars” would be considered the same word). It then builds a network of connected words and identifies “communities” of related words (just as one could look for communities of people on Facebook). The words within a given community define a topic.
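The steps in this paragraph can be sketched in a few lines. This is a toy illustration under stated assumptions, not the TopicMapping implementation: the stemmer only strips a trailing "s", words are linked whenever they share a document, and connected components stand in for a real community detection method. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def stem(word):
    """Toy stemmer: lowercase and strip one trailing 's' ('stars' -> 'star')."""
    word = word.lower()
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def build_word_network(documents):
    """Link two stems whenever they appear together in the same document."""
    graph = defaultdict(set)
    for doc in documents:
        stems = {stem(w) for w in doc.split()}
        for s in stems:
            graph[s]  # register the stem as a node even if it has no links
        for a, b in combinations(stems, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def communities(graph):
    """Connected components, used here as a simple stand-in for community detection."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(graph[node] - seen)
        groups.append(group)
    return groups


docs = ["stars and planets", "the star shines", "la casa blanca", "casas blancas"]
topics = communities(build_word_network(docs))
```

On this tiny corpus the English and Spanish words end up in two separate communities because no word links them. Real text has many shared words, which is why an actual community detection algorithm would be needed in place of connected components.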

The algorithm was able to perfectly separate the documents according to language and was able to reproduce its results. It also had high accuracy and reproducibility when separating 23,000 scientific papers and 1.2 million Wikipedia articles by topic.

These results show the need for more testing of big data algorithms and more research into making them more accurate and reproducible, Amaral said.

“Companies that make products must show that their products work,” he said. “They must be certified. There is no such case for algorithms. We have a lot of uninformed consumers of big data algorithms that are using tools that haven't been tested for reproducibility and accuracy.”
