Forum for Science, Industry and Business

Algorithm makes tongue tree

22.01.2002


Compression helps a computer tell Dante from Machiavelli


New computer program could settle literary debates.

To date, unlike us, computers have struggled to differentiate a page of Jane Austen from one by Jackie Collins. Now researchers in Italy have developed a program that can spot enough subtle differences between two authors’ works to attribute authorship [1].

The program can tell a text by Machiavelli from one by Pirandello, Dante or a host of other great Italian writers. It has also constructed a language tree showing the degree of affinity between 50 different tongues. The tree identifies all the main linguistic groups, such as Romance, Celtic and Slavic, and highlights Maltese (an Afro-Asiatic language) and Basque as anomalies.



As well as settling a few literary arguments, the technique might be useful for comparing other information-rich sequences of data. These might include genetic sequences, medical-monitoring measurements and stock-market fluctuations.

Style over substance

Identifying the language of a particular text is generally not hard in itself: one need simply look for the greatest overlap between the words used and those in a reference list for each language. Classifying linguistic styles is altogether more tricky.
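The word-overlap idea can be illustrated with a minimal sketch. The tiny reference lists and the function name guess_language below are hypothetical, chosen only for illustration; a practical system would need far larger lists.

```python
import re

# Hypothetical miniature reference lists (assumption: real lists would
# hold thousands of common words per language).
REFERENCE_WORDS = {
    "english": {"the", "and", "of", "to", "is", "in", "that", "it"},
    "italian": {"il", "la", "di", "che", "e", "in", "un", "per", "non"},
    "french": {"le", "la", "de", "et", "les", "des", "un", "que", "est"},
}


def guess_language(text):
    """Return the language whose reference list overlaps most with the text."""
    # Split the text into lower-case words, ignoring punctuation.
    words = set(re.findall(r"\w+", text.lower()))
    # Count how many distinct words of the text appear in each list.
    overlaps = {lang: len(words & ref) for lang, ref in REFERENCE_WORDS.items()}
    return max(overlaps, key=overlaps.get)
```

A scheme like this separates languages because they share few common words, which is also why it says little about style: two authors writing in the same language would score almost identically.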

One obvious approach is to compare the range and frequency of words in the sample text against reference texts from various candidate authors. That might work for markedly different styles: it would quickly distinguish Shakespeare from Tom Clancy.

But literary scholars often argue furiously about attributions for old texts. The task can become immensely difficult even for those with a great deal of knowledge about the candidate authors’ writing styles.

Clash of symbols

So Dario Benedetto and colleagues at the Università ‘La Sapienza’ in Rome try a different approach. They start from the premise that written language is in the end no more than a string of symbols. It might look rather random, but it is not.

Some groups of characters recur commonly (such as ‘the’ in English), and particular authors favour certain constructions and turns of phrase. Such regularities can be measured, rather than judged from subjective impressions or anecdotal comparisons.

The team begin from the classic insight of telecommunications engineer Claude Shannon in the 1940s that the information content of a message is related to its entropy. Roughly speaking, entropy measures how little redundancy a message contains: the more redundant the message, the lower its entropy. It can be defined as the length of the smallest program that will produce the original message as its output.

For a random string of characters, this program would simply specify every character, so it would be the same size as the original message. For a string consisting only of the letter A, the program could be very concise: ‘repeat A’. Most real messages lie somewhere in between: they can usually be compressed a little without losing significant information. This is the basis of data-compression computer algorithms, used to make ‘zip’ files, for instance.
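The contrast between random and repetitive strings is easy to check with an off-the-shelf compressor. The sketch below uses Python's standard zlib module as a stand-in for a 'zip' program; the helper name compressed_size is an assumption made for this illustration.

```python
import os
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


if __name__ == "__main__":
    n = 10_000
    repeated = b"A" * n          # maximally redundant message
    random_bytes = os.urandom(n)  # essentially no redundancy

    # The repeated string shrinks to a tiny fraction of its length,
    # while the random bytes stay at roughly their original size.
    print("repeated:", compressed_size(repeated), "bytes from", n)
    print("random:  ", compressed_size(random_bytes), "bytes from", n)
```

The compressed size serves as a practical, if rough, estimate of how much information a message really carries.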

Benedetto and his colleagues borrow the principles of data-compression algorithms to calculate a kind of relative entropy for two different character strings: a measure of how much they differ. This distance between two texts is smaller for two works by the same author than for two works by different authors.
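The idea can be sketched roughly as follows. This is an illustration in the spirit of the approach described, not the researchers' exact procedure: it uses Python's zlib as the compressor, and the function names extra_cost and closest_reference are hypothetical. A short sample is appended to a longer reference text, and the growth in compressed size shows how well the patterns learned from the reference fit the sample.

```python
import zlib


def zipped_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data."""
    return len(zlib.compress(data, 9))


def extra_cost(reference: bytes, sample: bytes) -> int:
    """Extra compressed bytes needed to encode the sample after the reference.

    A small value means the compressor could reuse many patterns from the
    reference, which suggests the two texts are similar.
    """
    return zipped_len(reference + sample) - zipped_len(reference)


def closest_reference(sample: bytes, references: dict) -> str:
    """Name of the reference text that encodes the sample most cheaply.

    `references` maps a label (for example an author's name) to the
    bytes of a known text by that author.
    """
    return min(references, key=lambda name: extra_cost(references[name], sample))
```

One caveat of this sketch: zlib only looks back about 32 kilobytes, so with long reference texts only the final portion influences the result, and the sample should be short compared with the reference.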

References

  1. Benedetto, D., Caglioti, E. & Loreto, V. Language trees and zipping. Physical Review Letters, 88, 048702, (2002).


PHILIP BALL | © Nature News Service
