Forum for Science, Industry and Business

Algorithm makes tongue tree

22.01.2002


Compression helps a computer tell Dante from Machiavelli


New computer program could settle literary debates.

Until now, computers, unlike human readers, have struggled to differentiate a page of Jane Austen from one by Jackie Collins. Now researchers in Italy have developed a program that can spot enough subtle differences between two authors’ works to attribute authorship [1].

The program can tell a text by Machiavelli from one by Pirandello, Dante or a host of other great Italian writers. It has also constructed a language tree showing the degree of affinity between 50 different tongues. The tree identifies all the main linguistic groups, such as Romance, Celtic and Slavic, and highlights Maltese (an Afro-Asiatic language) and Basque as anomalies.



As well as settling a few literary arguments, the technique might be useful for comparing other information-rich sequences of data. These might include genetic sequences, medical-monitoring measurements and stock-market fluctuations.

Style over substance

Identifying the language of a particular text is generally not hard in itself: one need simply look for the greatest overlap between the words used and those in a reference list for each language. Classifying linguistic styles is altogether more tricky.
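The word-overlap idea can be illustrated with a minimal sketch. The tiny reference lists and the names REFERENCE_WORDS, tokenize and guess_language are assumptions made up for this example; a real system would use far larger word lists.

```python
import re

# Hypothetical, deliberately tiny reference lists of common words.
REFERENCE_WORDS = {
    "english": {"the", "and", "of", "to", "is", "in", "that", "it"},
    "italian": {"il", "la", "e", "di", "che", "è", "un", "per", "non"},
    "french": {"le", "la", "et", "de", "que", "est", "un", "pour", "les"},
}


def tokenize(text):
    """Split text into lower-case words made only of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def guess_language(text, reference=REFERENCE_WORDS):
    """Return the language whose reference list overlaps most with the text."""
    words = tokenize(text)
    scores = {
        lang: sum(word in vocab for word in words)
        for lang, vocab in reference.items()
    }
    return max(scores, key=scores.get)
```

Calling guess_language on a line of Italian verse should favour "italian" simply because more of its words appear in that list. Nothing about the author's style enters the score.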

One obvious approach is to compare the range and frequency of words in the sample text against reference texts from various candidate authors. That might work for markedly different styles: it would quickly distinguish Shakespeare from Tom Clancy.
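As a rough illustration of this word-frequency approach, the sketch below scores a sample against each candidate author's reference text using cosine similarity of relative word frequencies. The choice of cosine similarity and the function names are assumptions for the example, not something the article specifies.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Map each word to its relative frequency in the text."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    total = len(words)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(words).items()}


def cosine_similarity(f, g):
    """Cosine of the angle between two frequency vectors (0 means no shared words)."""
    dot = sum(value * g.get(word, 0.0) for word, value in f.items())
    norm_f = math.sqrt(sum(v * v for v in f.values()))
    norm_g = math.sqrt(sum(v * v for v in g.values()))
    if norm_f == 0 or norm_g == 0:
        return 0.0
    return dot / (norm_f * norm_g)


def closest_author(sample, references):
    """Pick the author whose reference text has the most similar word frequencies."""
    sample_freq = word_frequencies(sample)
    return max(
        references,
        key=lambda author: cosine_similarity(
            sample_freq, word_frequencies(references[author])
        ),
    )
```

A score like this separates texts with very different vocabularies, but two authors writing on the same subject in the same period may share most of their common words, which is where the approach starts to struggle.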

But literary scholars often argue furiously about attributions for old texts. The task can become immensely difficult even for those with a great deal of knowledge about the candidate authors’ writing styles.

Clash of symbols

So Dario Benedetto and colleagues at the Università ‘La Sapienza’ in Rome try a different approach. They start from the premise that written language is in the end no more than a string of symbols. It might look rather random, but it is not.

Some groups of characters recur commonly (such as ‘the’ in English), and particular authors favour certain constructions and turns of phrase. These regularities can be measured, rather than judged from subjective impressions or anecdotal comparisons.

The team begin from the classic insight of the telecommunications engineer Claude Shannon in the 1940s that the information content of a message is related to its entropy. Roughly speaking, entropy measures how unpredictable a message is: the more redundancy a message contains, the lower its entropy. A closely related measure of information content is the size of the smallest program that will produce the original message as its output.
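For a concrete feel for the quantity, the sketch below estimates Shannon entropy in bits per character from single-character frequencies alone. This is a simplifying assumption: it ignores the correlations between neighbouring characters that matter in real language, and the function name is chosen for this example.

```python
import math
from collections import Counter


def shannon_entropy(message: str) -> float:
    """Entropy in bits per character, using single-character frequencies only."""
    counts = Counter(message)
    total = len(message)
    # Each character with probability p = c / total contributes p * log2(1 / p).
    return sum((c / total) * math.log2(total / c) for c in counts.values())
```

A string made of a single repeated character scores zero, while a string in which every character is equally likely scores the maximum for its alphabet size.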

For a random string of characters, this program would simply specify every character, so it would be the same size as the original message. For a string of just A’s, the program could be very concise: ‘repeat A’. Most real messages lie somewhere in between: they can usually be compressed a little without losing significant information. This is the basis of data-compression computer algorithms, used to make ‘zip’ files, for instance.
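The three cases can be tried out with an off-the-shelf compressor. The sketch below uses Python's zlib module as a stand-in for a 'zip' program; the sample prose, the fixed random seed and the names are assumptions chosen for the example.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller means more redundant)."""
    return len(zlib.compress(data, 9)) / len(data)


# Illustrative prose sample; any ordinary text would do.
PROSE = (
    "Some groups of characters recur commonly, and particular authors favour "
    "certain constructions and turns of phrase. Most real messages lie "
    "somewhere in between a random string and a run of identical letters, so "
    "they can usually be compressed a little without losing information."
).encode("utf-8")

N = len(PROSE)
_rng = random.Random(0)  # fixed seed so the run is repeatable
RANDOM_BYTES = bytes(_rng.getrandbits(8) for _ in range(N))
REPEATED_A = b"A" * N

if __name__ == "__main__":
    for name, data in [
        ("repeated A", REPEATED_A),
        ("prose", PROSE),
        ("random", RANDOM_BYTES),
    ]:
        print(f"{name:>10}: ratio {compression_ratio(data):.2f}")
```

On short inputs like these, the compressor's fixed overhead can push the ratio for random data slightly above 1, which is why the ordering of the three ratios is more informative than their exact values.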

Benedetto and his colleagues borrow the principles of data-compression algorithms to calculate a kind of relative entropy for two different character strings: a measure of how much they differ. This distance tends to be smaller for two works by the same author than for two works by different authors, which is what allows the program to attribute authorship.
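One way to turn a compressor into such a measure is sketched below. It is an assumption-laden illustration, not necessarily the researchers' exact procedure: it appends a short sample to each reference text and counts how many extra compressed bytes the sample costs. A compressor that has already 'seen' similar text should need fewer extra bytes. The zlib module, the function names and the short public-domain snippets are all choices made for this example.

```python
import zlib


def zipped_size(text: str) -> int:
    """Number of bytes after compressing the UTF-8 encoded text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(reference: str, sample: str) -> int:
    """Extra compressed bytes needed for the sample once the reference is known."""
    return zipped_size(reference + " " + sample) - zipped_size(reference)


def closest_reference(sample: str, references: dict) -> str:
    """Return the key of the reference text that makes the sample cheapest to encode."""
    return min(references, key=lambda name: extra_bytes(references[name], sample))


# Illustrative reference snippets (assumed for the example, far too short for real use).
REFERENCES = {
    "Austen": (
        "It is a truth universally acknowledged, that a single man in "
        "possession of a good fortune, must be in want of a wife. However "
        "little known the feelings or views of such a man may be on his first "
        "entering a neighbourhood, this truth is so well fixed in the minds "
        "of the surrounding families, that he is considered the rightful "
        "property of some one or other of their daughters."
    ),
    "Dante": (
        "Nel mezzo del cammin di nostra vita mi ritrovai per una selva "
        "oscura, ché la diritta via era smarrita. Ahi quanto a dir qual era è "
        "cosa dura esta selva selvaggia e aspra e forte che nel pensier "
        "rinova la paura! Tant'è amara che poco è più morte; ma per trattar "
        "del ben ch'i' vi trovai, dirò de l'altre cose ch'i' v'ho scorte."
    ),
}
```

With snippets this short the sketch mostly separates English from Italian; telling two authors in the same language apart would need much longer reference texts. zlib also only looks back over a window of about 32 kilobytes, so very long references would call for a different compressor or for chunking.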

References

  1. Benedetto, D., Caglioti, E. & Loreto, V. Language trees and zipping. Physical Review Letters 88, 048702 (2002).


PHILIP BALL | © Nature News Service
