Forum for Science, Industry and Business

Uncovering hidden structures in massive data collections

03.12.2013
Advances in computer storage have created collections of data so huge that researchers often have trouble uncovering critical patterns in connections among individual items, making it difficult for them to realize fully the power of computing as a research tool.

Now, computer scientists at Princeton University have developed a method that offers a solution to this data overload. Using a mathematical technique that calculates the likelihood of a pattern repeating throughout a subset of data, the researchers have been able to dramatically cut the time needed to find patterns in large collections of information such as social networks.

The tool allows researchers to quickly identify connections between seemingly disparate groups, such as theoretical physicists who study intermolecular forces and astrophysicists researching black holes.

"The data we are interested in are graphs of networks like friends on Facebook or lists of academic citations," said David Blei, an associate professor of computer science and co-author on the research, which was published Sept. 3 in the Proceedings of the National Academy of Sciences. "These are vast data sets and we want to apply sophisticated statistical models to them in order to understand various patterns."

Finding patterns in the connections among points of data can be critical for many applications. For example, checking citations to scientific papers can provide insights into the development of new fields of study or show overlap between different academic disciplines. Links between patents can map out groups that indicate new technological developments. And analysis of social networks can provide information about communities and allow predictions of future interests.

"The goal is to detect overlapping communities," Blei said. "The problem is that these data collections have gotten so big that the algorithms cannot solve the problem in a reasonable amount of time."

Currently, Blei said, many algorithms uncover hidden patterns by analyzing potential interactions between every pair of nodes (either connected or unconnected) in the entire data set; that becomes impractical for large amounts of data such as the collected citations of the U.S. Patent Office. Many are also limited to sorting data into single groups.
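To give a sense of scale, the short Python sketch below counts the unordered pairs of nodes an all-pairs method would have to consider. The function name `count_node_pairs` and the smaller example sizes are illustrative assumptions; the 3.7 million figure is the size of the patent data collection described later in this article.

```python
def count_node_pairs(num_nodes: int) -> int:
    """Number of unordered node pairs (connected or not): n * (n - 1) / 2."""
    return num_nodes * (num_nodes - 1) // 2


# Illustrative sizes; only the last one comes from the article.
for n in (1_000, 1_000_000, 3_700_000):
    print(f"{n:>9,} nodes -> {count_node_pairs(n):,} pairs")
```

For 3.7 million nodes the count is roughly 6.8 trillion pairs, which is why visiting every pair quickly becomes impractical.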

"In most cases, nodes belong to multiple groups," said Prem Gopalan, a doctoral student in Blei's research group and lead author of the paper. "We want to be able to reflect that."

The research was supported by the Office of Naval Research, the National Science Foundation and the Alfred P. Sloan Foundation.

In very basic terms, the researchers approached the problem by dividing the analysis into two broad tasks. In one, they created an algorithm that quickly analyzes a subset of a large database. The algorithm calculates the likelihood that nodes belong to various groups in the database. In the second broad task, the researchers created an adjustable matrix that accepts the analysis of the subset and assigns "weights" to each data point reflecting the likelihood that it belongs to different groups.

Blei and Gopalan designed the sampling algorithm to refine its accuracy as it samples more subsets. At the same time, the continual input from the sampling to the weighted matrix refines the accuracy of the overall analysis.
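The general shape of that loop (sample a small subset, estimate, fold the estimate into a table of per-node weights, repeat) can be sketched in a few lines of Python. This is not the researchers' algorithm: it swaps in a much simpler squared-error fit, and the toy graph, the function names, the step-size schedule and the small positive floor are all assumptions made for illustration.

```python
import random

FLOOR = 1e-6  # assumed small positive floor that keeps every weight above zero


def toy_graph():
    """Hypothetical example: two 4-node cliques that share node 3."""
    links = set()
    for group in ([0, 1, 2, 3], [3, 4, 5, 6]):
        for a in group:
            for b in group:
                if a < b:
                    links.add((a, b))
    return 7, links


def fit_memberships(num_nodes, links, num_groups,
                    num_steps=5000, batch_size=8, seed=0):
    """Toy stochastic fit of node-by-group weights from sampled node pairs."""
    rng = random.Random(seed)
    weights = [[rng.uniform(0.1, 0.5) for _ in range(num_groups)]
               for _ in range(num_nodes)]
    for step in range(1, num_steps + 1):
        step_size = 0.5 / (1.0 + step) ** 0.6  # shrinks as sampling continues
        for _ in range(batch_size):
            # Look at a small random subset of pairs, never all pairs at once.
            i, j = rng.sample(range(num_nodes), 2)
            linked = 1.0 if (min(i, j), max(i, j)) in links else 0.0
            w_i, w_j = weights[i][:], weights[j][:]
            # Error between the observed link and the current prediction.
            error = linked - sum(a * b for a, b in zip(w_i, w_j))
            # Gradient step on the squared error for both nodes' weight rows.
            weights[i] = [max(a + step_size * error * b, FLOOR)
                          for a, b in zip(w_i, w_j)]
            weights[j] = [max(b + step_size * error * a, FLOOR)
                          for a, b in zip(w_i, w_j)]
    # Normalise each row so it reads as a share of membership per group.
    return [[w / sum(row) for w in row] for row in weights]


num_nodes, links = toy_graph()
memberships = fit_memberships(num_nodes, links, num_groups=2)
for node, row in enumerate(memberships):
    print(node, [round(share, 2) for share in row])
```

Each row of the result is a set of shares that sum to one, so a node can carry weight in more than one group at once, which is the property the researchers say single-group methods cannot capture.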

The math behind the work is complex. Essentially, the researchers used a technique called stochastic optimization, which is a method to determine a central pattern from a group of data that seem chaotic or, as mathematicians call it, "noisy." Blei likens it to finding your way from New York to Los Angeles by stopping random people and asking for directions: if you ask enough people, you will eventually find your way. The key is to know what question to ask and how to interpret the answers.

"With noisy measurements, you can still make good progress by doing it many times as long as the average gives you the correct result," he said.
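Blei's point about noisy measurements can be illustrated with a minimal Python sketch. It is a textbook-style stochastic approximation loop on a one-dimensional problem, not the method from the paper; the target value, the noise level and the function names are assumptions chosen for the example. Each step sees only a noisy gradient, but because the noise averages to zero and the step size shrinks, the estimate settles near the true value.

```python
import random


def noisy_gradient(estimate, target, noise_scale, rng):
    """Gradient of 0.5 * (estimate - target)**2 plus zero-mean Gaussian noise."""
    return (estimate - target) + rng.gauss(0.0, noise_scale)


def stochastic_minimize(target=3.0, noise_scale=5.0, num_steps=20000, seed=0):
    """Follow many noisy gradient steps with a step size of 1 / step."""
    rng = random.Random(seed)
    estimate = 0.0
    for step in range(1, num_steps + 1):
        step_size = 1.0 / step  # later, noisier answers move the estimate less
        estimate -= step_size * noisy_gradient(estimate, target, noise_scale, rng)
    return estimate


estimate = stochastic_minimize()
print(f"estimate after noisy steps: {estimate:.3f}")
```

With this particular step size, the loop is equivalent to keeping a running average of noisy observations of the target, which is the "average gives you the correct result" idea in miniature.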

In their PNAS article, the researchers describe how they used their method to discover patterns in the connections between patents. Using public data from the U.S. National Bureau of Economic Research, Gopalan and Blei analyzed connections to the 1976 patent "Process for producing porous products."

The patent, filed by Robert W. Gore (who several years earlier discovered the process that led to the creation of the waterproof fabric Gore-Tex), described a method for producing porous material from tetrafluoroethylene polymers. The researchers analyzed a data collection of 3.7 million nodes and found that connections between Gore's 1976 filing and other patents formed 39 distinct communities in the database.

The patent "has influenced the design of many everyday materials such as waterproof laminate, adhesives, printed circuit boards, insulated conductors, dental floss and strings of musical instruments," the researchers wrote.

In the past, researchers struggled to find nuggets of critical information in data. The new challenge is not finding the needle in the data haystack, but finding the hidden patterns in the hay.

"Take the data from the world, from what you observe, and then untangle it," Blei said. "What generated it? What are the hidden structures?"

John Sullivan | EurekAlert!
Further information:
http://www.princeton.edu
