Forum for Science, Industry and Business


New search engine ranks tables by title, document content, text reference

13.08.2007
Penn State researchers have developed a search engine, TableSeer, that not only identifies and extracts tables from PDF documents but also indexes and ranks the search results using factors including the table's title, text references to the table and date of publication.

The engine's innovative ranking algorithm, TableRank, also can identify tables found in frequently cited documents and weigh that factor as well in the search results, said Prasenjit Mitra, an assistant professor in the Penn State College of Information Sciences and Technology (IST) and one of the lead researchers in the development of the search engine.
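The article does not describe how TableRank combines these factors. As a rough illustration only, the Python sketch below scores a table with a weighted sum of title match, match against the text that refers to the table, publication recency, and the citation count of the containing document. The names (TableRecord, WEIGHTS, score, rank), the weights and the sample data are assumptions made for this example and are not taken from TableSeer.

```python
import math
from dataclasses import dataclass


@dataclass
class TableRecord:
    """Hypothetical metadata stored for one extracted table."""
    title: str
    reference_text: str  # sentences in the document that refer to the table
    year: int            # publication year of the containing document
    citations: int       # citation count of the containing document


# Illustrative weights only; the real TableRank weighting is not published here.
WEIGHTS = {"title": 3.0, "reference": 1.5, "recency": 0.5, "citations": 1.0}


def term_overlap(query: str, text: str) -> float:
    """Return the fraction of query terms that occur in the text."""
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & text_terms) / len(query_terms)


def score(query: str, table: TableRecord, current_year: int = 2007) -> float:
    """Combine the four signals into one number (higher is better)."""
    age = max(current_year - table.year, 0)
    recency = 1.0 / (1.0 + age)                    # newer documents score higher
    citation_signal = math.log1p(table.citations)  # dampen very large counts
    return (
        WEIGHTS["title"] * term_overlap(query, table.title)
        + WEIGHTS["reference"] * term_overlap(query, table.reference_text)
        + WEIGHTS["recency"] * recency
        + WEIGHTS["citations"] * citation_signal
    )


def rank(query: str, tables: list[TableRecord]) -> list[TableRecord]:
    """Return the tables sorted from most to least relevant."""
    return sorted(tables, key=lambda t: score(query, t), reverse=True)


# Invented sample data for demonstration.
kinetics = TableRecord("Rate constants for ozone reactions",
                       "Table 1 lists rate constants", 2006, 10)
methods = TableRecord("Sample preparation details", "see Table 2", 2001, 0)
ranked = rank("rate constants", [methods, kinetics])
```

In this sketch a table whose title and referring text match the query, and whose document is recent and cited, sorts ahead of an older, uncited table with no matching terms.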

"TableSeer makes it easier for scientists and scholars to find and access the important information presented in tables, and as far as we know, is the first search engine for tables," Mitra said.

Tables are an important data resource for researchers. In a search of 10,000 documents from journals and conferences, the researchers found that more than 70 percent of papers in chemistry, biology and computer science included tables. Furthermore, most of those documents had multiple tables.

But while some software can identify and extract tables from text, existing software cannot search for tables across documents. That means scientists and scholars must manually browse documents in order to find tables, a time-consuming and cumbersome process.

TableSeer automates that process and captures data not only within the table but also in tables' titles and footnotes. In addition, it enables column-name-based search so that a user can search for a particular column in a table.
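One common way to support a column-name search is an inverted index that maps each column name to the tables containing it. The Python sketch below shows that general idea. It is not TableSeer's implementation, and the function names, table identifiers and sample columns are invented for illustration.

```python
from collections import defaultdict


def build_column_index(tables: dict[str, dict]) -> dict[str, set[str]]:
    """Map each lower-cased column name to the ids of tables that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for table_id, metadata in tables.items():
        for column in metadata["columns"]:
            index[column.strip().lower()].add(table_id)
    return index


def search_by_column(index: dict[str, set[str]], column_name: str) -> list[str]:
    """Return the sorted ids of tables that have a column with this name."""
    return sorted(index.get(column_name.strip().lower(), set()))


# Invented sample metadata: title, column names and footnotes per table.
tables = {
    "doc1-table2": {
        "title": "Dissolution rates of silicate minerals",
        "columns": ["Mineral", "Temperature", "Rate"],
        "footnotes": ["Rates measured at pH 7"],
    },
    "doc3-table1": {
        "title": "Reaction conditions",
        "columns": ["Temperature", "Pressure"],
        "footnotes": [],
    },
}

column_index = build_column_index(tables)
hits = search_by_column(column_index, "temperature")
```

Normalizing names to lower case before indexing makes the lookup case-insensitive, so a query for "temperature" finds columns labeled "Temperature".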

In tests with documents from the Royal Society of Chemistry, TableSeer correctly identified and retrieved 93.5 percent of tables created in text-based formats, Mitra said.

Searching for tables poses unique challenges: there is no standard table representation, and tables can appear in PDF, PowerPoint, HTML and Microsoft Word documents. The researchers chose to focus on PDF documents because of their growing popularity in digital libraries and because PDF documents had been overlooked in other table-search efforts.

"Tables can be made using a number of editor tools, and the techniques we are using in TableSeer should work with any text-based tool," said C. Lee Giles, professor of information sciences and technology and co-director of the IST Cyber-Infrastructure Lab where the research originated. "While we designed and developed TableSeer to facilitate searching of tables occurring in articles in the chemistry domain, it can be used in any domain where data is presented in tabular form including other scientific, technical, social and business areas."

The development of TableSeer is part of an open-source cyber-infrastructure project focusing on chemical document search for environmental chemistry and funded by the National Science Foundation. The grant awarded to the Penn State Department of Chemistry aims to enable automatic data analysis.

"Searching and extracting information from data tables is an essential component of data analysis in environmental science, where many research groups publish large amounts of kinetic data describing chemical changes in the environment," said Karl Mueller, professor of chemistry and principal investigator for the NSF grant.

"As we approach multidisciplinary problems within the Penn State Center for Environmental Kinetics Analysis, our students spend many days hunting down and compiling large amounts of data from tables. The TableSeer tools will definitely increase the efficiency of this process and allow more time to be spent on creative scientific analysis," he added.

TableSeer can be tested online (see http://chemxseer.ist.psu.edu). The source code will be made available near the completion of the project, the researchers said.

In the meantime, research is ongoing to improve the ranking algorithm by adding features. The researchers also are working on a search engine that can identify, extract and rank figures found in documents, as figures are another important device for disseminating data and findings in the natural sciences.

Margaret Hopkins | EurekAlert!
Further information:
http://chemxseer.ist.psu.edu
http://www.psu.edu
