Data mining made faster

22.07.2010
New method eases analysis of 'multidimensional' information

To many big companies, you aren't just a customer, but are described by multiple "dimensions" of information within a computer database. Now, a University of Utah computer scientist has devised a new method for simpler, faster "data mining," or extracting and analyzing massive amounts of such data.

"Whether you like it or not, Google, Facebook, Walmart and the government are building profiles of you, and these consist of hundreds of attributes describing you" – your online searches, purchases, shared videos and recommendations to your Facebook friends, says Suresh Venkatasubramanian, an assistant professor of computer science.

"If you line them up for each person, you have a line of hundreds of numbers that paint a picture of a person: who they are, what their interests are, who their friends are and so forth," he says. "These strings of hundreds of attributes are called high-dimensional data because each attribute is called one dimension. Data mining is about digging up interesting information from this high-dimensional data."

A group of data-mining methods named "multidimensional scaling," or MDS, was first used in the 1930s by psychologists and has been used ever since to simplify data analysis by reducing the "dimensionality" of the data. Venkatasubramanian says it is "probably one of the most important tools in data mining and is used by countless researchers everywhere."

Now, Venkatasubramanian and colleagues have devised a new method of multidimensional scaling that is faster and simpler, can be used universally for numerous problems and can handle more data, basically by "squashing things [data] down to size."

He is scheduled to present the new method on Wednesday, July 28, in Washington at the premier meeting in his field, the Conference on Knowledge Discovery and Data Mining, sponsored by the Association for Computing Machinery.

"This problem of dimensionality reduction and data visualization is fundamental in many disciplines in natural and social sciences," says Venkatasubramanian. "So we believe our method will be useful in doing better data analysis in all of these areas."

"What our approach does is unify into one common framework a number of different methods for doing this dimensionality reduction" to simplify high-dimensional data, he says. "We have a computer program that unifies many different methods people have developed over the past 60 or 70 years. One thing that makes it really good for today's data – in addition to being a one-stop shopping procedure – is it also handles much larger data sets than prior methods were able to handle."

He adds: "Prior methods on modern computers struggle with data from more than 5,000 people. Our method smoothly handles well above 50,000 people."

Venkatasubramanian conducted the research with University of Utah computer science doctoral student Arvind Agarwal and postdoctoral fellow Jeff Phillips. It was funded by the National Science Foundation.

The Curse of Dimensionality

When analyzing long strings of attributes describing people, "you are looking at not just the individual variables but how they interact with each other," he says. "For example, if you describe a person by their height and weight, these are individual variables that describe a person. However, they have correlations among them; a person who is taller is expected to be heavier than someone who is shorter."
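
The height-and-weight example can be made concrete with a correlation coefficient. The following is a minimal sketch, assuming six made-up people; the numbers are invented for illustration and do not come from the research. A value near 1 means the two attributes rise together, so knowing one tells you a good deal about the other.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical heights (cm) and weights (kg); invented for illustration only.
heights = [155, 160, 168, 175, 182, 190]
weights = [52, 58, 63, 72, 80, 88]

r = pearson(heights, weights)
print(f"correlation between height and weight: {r:.2f}")
```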

The high "dimensionality" of data stems from the fact that "the variables interact with each other. That's where you get a [multidimensional] space, not just a list of variables."

"Data mining means finding patterns, relationships and correlations in high-dimensional data," Venkatasubramanian says. "You literally are digging through the data to find little veins of information."

He says uses of data mining include Amazon's recommendations to individual customers based not only on their past purchases, but on those of people with similar preferences, and Netflix's similar method for recommending films. Facebook recommends friends based on people who already are your friends, and on their friends.

"The challenge of data mining is dealing with the dimensionality of the data and the volume of it. So one expression common in the data mining community is 'the curse of dimensionality,'" says Venkatasubramanian.

"The curse of dimensionality is the observed phenomenon that as you throw in more attributes to describe individuals, the data mining tasks you wish to perform become exponentially more difficult," he adds. "We are now at the point where the dimensionality and size of the data is a big problem. It makes things computationally very difficult to find these patterns we want to find."

Multidimensional scaling to simplify multidimensional data is an attempt "to reduce the dimensionality of data by finding key attributes defining most of the behavior," says Venkatasubramanian.
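
One common way to state this goal mathematically, given here as an assumption because the article quotes no formula, is to look for low-dimensional points whose mutual distances match the original dissimilarities as closely as possible. Here $d_{ij}$ is the dissimilarity between items $i$ and $j$ in the original $D$-dimensional data, $x_i$ is the position assigned to item $i$ in $k$ dimensions, and $\sigma$ is the "raw stress," one of several cost functions used in multidimensional scaling:

```latex
\sigma(x_1, \dots, x_n) \;=\; \sum_{i<j} \bigl( \lVert x_i - x_j \rVert - d_{ij} \bigr)^2,
\qquad x_i \in \mathbb{R}^k, \quad k \ll D
```

A stress of zero would mean the reduced data reproduce every pairwise distance exactly; in practice the aim is to make it as small as possible.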

Universal, Fast Data Mining

Venkatasubramanian's new method is universal – "a new way of abstracting the problem into little pieces, and realizing many different versions of this problem can be abstracted the same way." In other words, one set of instructions can be used to do a wide variety of multidimensional scaling that previously required separate instructions.

The new method can handle large amounts of data because "rather than trying to analyze the entire set of data as a whole, we analyze it incrementally, sort of person by person," Venkatasubramanian says. That speeds data mining "because you don't need to have all the data in front of you before you start reducing its dimensionality."

Venkatasubramanian and colleagues performed a series of tests of their new method with "synthetic data" – data points in a "high-dimensional space."

The tests show the new way of data mining by multidimensional scaling "can be faster and equally accurate – and usually more accurate" than existing methods, he says.

The method has what is known as "guaranteed convergence," meaning that "it gets you a better and better and better answer, and it eventually will stop when it gets the best answer it can find," Venkatasubramanian says. It also is modular, which means parts of the software are easily swapped out as improvements are found.
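
The ideas of working point by point and of an answer that keeps improving until it stops can be sketched with a generic textbook-style update. This is a minimal sketch and not the researchers' program: the function names, the squared-error cost, the 2-D target and the synthetic test data are all assumptions made for illustration. Each step moves one point to the average of the positions that every other point would "prefer" for it, a majorization-style move that cannot increase the squared-error cost, and the loop stops once a full pass no longer improves the cost noticeably (which may be a local rather than a global optimum).

```python
import math
import random

def stress(points, dist):
    """Sum of squared gaps between embedded and target distances."""
    n = len(points)
    return sum(
        (math.dist(points[i], points[j]) - dist[i][j]) ** 2
        for i in range(n) for j in range(i + 1, n)
    )

def update_point(points, dist, i):
    """Move point i to the mean of its per-neighbour preferred positions."""
    k = len(points[i])
    total = [0.0] * k
    for j, pj in enumerate(points):
        if j == i:
            continue
        gap = math.dist(points[i], pj)
        if gap > 0:
            direction = [(a - b) / gap for a, b in zip(points[i], pj)]
        else:
            direction = [1.0] + [0.0] * (k - 1)  # arbitrary choice if points coincide
        # Preferred position: at target distance from pj, in the current direction.
        for c in range(k):
            total[c] += pj[c] + dist[i][j] * direction[c]
    points[i] = [t / (len(points) - 1) for t in total]

def embed(dist, k=2, sweeps=200, tol=1e-6, seed=0):
    """Reduce to k dimensions, one point at a time, until the cost stops improving."""
    rng = random.Random(seed)
    n = len(dist)
    points = [[rng.random() for _ in range(k)] for _ in range(n)]
    history = [stress(points, dist)]
    for _ in range(sweeps):
        for i in range(n):
            update_point(points, dist, i)
        history.append(stress(points, dist))
        if history[-2] - history[-1] < tol:
            break
    return points, history

# Synthetic data (assumed for illustration): 30 random points in 10 dimensions.
rng = random.Random(1)
data = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(30)]
dist = [[math.dist(p, q) for q in data] for p in data]

points, history = embed(dist)
print(f"stress went from {history[0]:.1f} to {history[-1]:.1f} "
      f"in {len(history) - 1} sweeps")
```

Because each update touches only one point, swapping in a different cost function means changing `update_point` and `stress` while leaving the outer loop alone, which is one way to read the article's description of the software as modular.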

Privacy and Data Mining

What of concerns that we are sacrificing our privacy to marketers?

"The issue of privacy in data mining is like any set of potentially negative consequences of scientific advances," says Venkatasubramanian, adding that much research has examined how to mine data in a manner that protects individual privacy.

He cites Netflix's movie recommendations, for example, noting that "if you target advertising based on what people need, it becomes useful. The better the advertising gets, the more it becomes useful information and not advertising."

"And the way we are being inundated with all forms of information in today's world, whether we like it or not we have no choice but to allow machines and automated systems to sift through all this to make sense of the deluge of information passing our eyes every day."

For more information on the University of Utah School of Computing and College of Engineering, see: http://www.cs.utah.edu and http://www.coe.utah.edu

University of Utah Public Relations
201 Presidents Circle, Room 308
Salt Lake City, Utah 84112-9017
(801) 581-6773 fax: (801) 585-3350

Kate Ferebee | EurekAlert!
Further information:
http://www..utah.edu
http://www.unews.utah.edu
