Forum for Science, Industry and Business

Data mining made faster

22.07.2010
New method eases analysis of 'multidimensional' information

To many big companies, you aren't just a customer, but are described by multiple "dimensions" of information within a computer database. Now, a University of Utah computer scientist has devised a new method for simpler, faster "data mining," or extracting and analyzing massive amounts of such data.

"Whether you like it or not, Google, Facebook, Walmart and the government are building profiles of you, and these consist of hundreds of attributes describing you" – your online searches, purchases, shared videos and recommendations to your Facebook friends, says Suresh Venkatasubramanian, an assistant professor of computer science.

"If you line them up for each person, you have a line of hundreds of numbers that paint a picture of a person: who they are, what their interests are, who their friends are and so forth," he says. "These strings of hundreds of attributes are called high-dimensional data because each attribute is called one dimension. Data mining is about digging up interesting information from this high-dimensional data."

A group of data-mining methods named "multidimensional scaling," or MDS, was first used in the 1930s by psychologists and has been used ever since to simplify data analysis by reducing the "dimensionality" of the data. Venkatasubramanian says it is "probably one of the most important tools in data mining and is used by countless researchers everywhere."

Now, Venkatasubramanian and colleagues have devised a new method of multidimensional scaling that is faster, simpler, can be used universally for numerous problems and can handle more data, basically by "squashing things [data] down to size."

He is scheduled to present the new method on Wednesday, July 28, in Washington at the premier meeting in his field, the Conference on Knowledge Discovery and Data Mining, sponsored by the Association for Computing Machinery.

"This problem of dimensionality reduction and data visualization is fundamental in many disciplines in natural and social sciences," says Venkatasubramanian. "So we believe our method will be useful in doing better data analysis in all of these areas."

"What our approach does is unify into one common framework a number of different methods for doing this dimensionality reduction" to simplify high-dimensional data, he says. "We have a computer program that unifies many different methods people have developed over the past 60 or 70 years. One thing that makes it really good for today's data – in addition to being a one-stop shopping procedure – is it also handles much larger data sets than prior methods were able to handle."

He adds: "Prior methods on modern computers struggle with data from more than 5,000 people. Our method smoothly handles well above 50,000 people."

Venkatasubramanian conducted the research with University of Utah computer science doctoral student Arvind Agarwal and postdoctoral fellow Jeff Phillips. It was funded by the National Science Foundation.

The Curse of Dimensionality

When analyzing long strings of attributes describing people, "you are looking at not just the individual variables but how they interact with each other," he says. "For example, if you describe a person by their height and weight, these are individual variables that describe a person. However, they have correlations among them; a person who is taller is expected to be heavier than someone who is shorter."
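The height and weight example can be made concrete with a correlation coefficient. The following is a minimal sketch that assumes a small, made-up sample of five people; the numbers and the helper name `pearson` are illustrative only and do not come from the research.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the mean (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical sample: heights in cm, weights in kg.
heights = [150, 160, 170, 180, 190]
weights = [50, 58, 69, 80, 93]

r = pearson(heights, weights)
print(f"correlation between height and weight: {r:.3f}")
```

A value close to 1 means the two attributes carry overlapping information, which is the kind of interaction between variables that dimensionality reduction exploits.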

The high "dimensionality" of data stems from the fact that "the variables interact with each other. That's where you get a [multidimensional] space, not just a list of variables."

"Data mining means finding patterns, relationships and correlations in high-dimensional data," Venkatasubramanian says. "You literally are digging through the data to find little veins of information."

He says uses of data mining include Amazon's recommendations to individual customers based not only on their past purchases, but on those of people with similar preferences, and Netflix's similar method for recommending films. Facebook recommends friends based on people who already are your friends, and on their friends.

"The challenge of data mining is dealing with the dimensionality of the data and the volume of it. So one expression common in the data mining community is 'the curse of dimensionality,'" says Venkatasubramanian.

"The curse of dimensionality is the observed phenomenon that as you throw in more attributes to describe individuals, the data mining tasks you wish to perform become exponentially more difficult," he adds. "We are now at the point where the dimensionality and size of the data is a big problem. It makes things computationally very difficult to find these patterns we want to find."

Multidimensional scaling to simplify multidimensional data is an attempt "to reduce the dimensionality of data by finding key attributes defining most of the behavior," says Venkatasubramanian.
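In more formal terms, multidimensional scaling is usually posed as finding low-dimensional coordinates whose pairwise distances match the original dissimilarities as closely as possible. One common formulation, given here as an assumption about the general technique and not necessarily the exact objective used in the new method, minimizes the "raw stress":

```latex
\sigma(x_1, \dots, x_n) \;=\; \sum_{i < j} \bigl( \lVert x_i - x_j \rVert - d_{ij} \bigr)^2
```

Here $d_{ij}$ is the given dissimilarity between items $i$ and $j$ in the high-dimensional data, and $x_1, \dots, x_n$ are the low-dimensional points being sought. Other variants of multidimensional scaling change how the mismatch between $\lVert x_i - x_j \rVert$ and $d_{ij}$ is measured.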

Universal, Fast Data Mining

Venkatasubramanian's new method is universal – "a new way of abstracting the problem into little pieces, and realizing many different versions of this problem can be abstracted the same way." In other words, one set of instructions can be used to do a wide variety of multidimensional scaling that previously required separate instructions.

The new method can handle large amounts of data because "rather than trying to analyze the entire set of data as a whole, we analyze it incrementally, sort of person by person," Venkatasubramanian says. That speeds data mining "because you don't need to have all the data in front of you before you start reducing its dimensionality."

Venkatasubramanian and colleagues performed a series of tests of their new method with "synthetic data" – data points in a "high-dimensional space."

The tests show the new way of data mining by multidimensional scaling "can be faster and equally accurate – and usually more accurate" than existing methods, he says.

The method has what is known as "guaranteed convergence," meaning that "it gets you a better and better and better answer, and it eventually will stop when it gets the best answer it can find," Venkatasubramanian says. It also is modular, which means parts of the software are easily swapped out as improvements are found.
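The point-by-point processing and the steadily improving answers described above can be sketched in a few lines. This is not the researchers' program: it is a minimal illustration that assumes a raw-stress objective (the sum of squared differences between embedded and original distances) and a standard majorization-style update that re-places one point at a time while the others stay fixed. All names (`stress`, `place_point`, `embed`) and the random input data are hypothetical.

```python
import math
import random

def stress(points, target):
    """Raw stress: sum of squared gaps between embedded and target distances."""
    n = len(points)
    return sum((math.dist(points[i], points[j]) - target[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def place_point(i, points, target):
    """Re-place point i with all other points held fixed.

    For each other point j, find the spot at distance target[i][j] from j
    in the direction of point i's current position, then return the
    average of those spots as the new position of point i.
    """
    dim = len(points[i])
    total = [0.0] * dim
    for j, other in enumerate(points):
        if j == i:
            continue
        gap = math.dist(points[i], other)
        scale = target[i][j] / gap if gap > 0 else 0.0
        for k in range(dim):
            total[k] += other[k] + scale * (points[i][k] - other[k])
    return [t / (len(points) - 1) for t in total]

def embed(target, dim=2, sweeps=50, seed=0):
    """Embed n items in `dim` dimensions, updating one point at a time."""
    rng = random.Random(seed)
    n = len(target)
    points = [[rng.random() for _ in range(dim)] for _ in range(n)]
    history = [stress(points, target)]  # stress after each full sweep
    for _ in range(sweeps):
        for i in range(n):
            points[i] = place_point(i, points, target)
        history.append(stress(points, target))
    return points, history

# Hypothetical input: 12 items described by 20 attributes each.
rng = random.Random(1)
items = [[rng.random() for _ in range(20)] for _ in range(12)]
target = [[math.dist(a, b) for b in items] for a in items]

points, history = embed(target)
print(f"stress before: {history[0]:.3f}, after: {history[-1]:.3f}")
```

With this update rule, re-placing a single point never increases the stress, so the values recorded in `history` never go up from one sweep to the next. That mirrors the "better and better" behavior described above, although the sketch makes no claim about matching the speed or accuracy reported for the actual method. Swapping `place_point` for a different update is one way to picture the modularity the researchers describe.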

Privacy and Data Mining

What of concerns that we are sacrificing our privacy to marketers?

"The issue of privacy in data mining is like any set of potentially negative consequences of scientific advances," says Venkatasubramanian, adding that much research has examined how to mine data in a manner that protects individual privacy.

He cites Netflix's movie recommendations, for example, noting that "if you target advertising based on what people need, it becomes useful. The better the advertising gets, the more it becomes useful information and not advertising."

"And the way we are being inundated with all forms of information in today's world, whether we like it or not we have no choice but to allow machines and automated systems to sift through all this to make sense of the deluge of information passing our eyes every day."

For more information on the University of Utah School of Computing and College of Engineering, see: http://www.cs.utah.edu and http://www.coe.utah.edu

University of Utah Public Relations
201 Presidents Circle, Room 308
Salt Lake City, Utah 84112-9017
(801) 581-6773 fax: (801) 585-3350

Kate Ferebee | EurekAlert!
Further information:
http://www..utah.edu
http://www.unews.utah.edu
