Algorithm makes tongue tree

22.01.2002


Compression helps a computer tell Dante from Machiavelli


New computer program could settle literary debates.

Unlike human readers, computers have so far struggled to tell a page of Jane Austen from one by Jackie Collins. Now researchers in Italy have developed a program that can spot enough subtle differences between two authors’ works to attribute authorship [1].

The program can tell a text by Machiavelli from one by Pirandello, Dante or a host of other great Italian writers. It has also constructed a language tree showing the degree of affinity between 50 different tongues. The tree identifies all the main linguistic groups, such as Romance, Celtic and Slavic, and highlights Maltese (an Afro-Asiatic language) and Basque as anomalies.



As well as settling a few literary arguments, the technique might be useful for comparing other information-rich sequences of data. These might include genetic sequences, medical-monitoring measurements and stock-market fluctuations.

Style over substance

Identifying the language of a particular text is generally not hard in itself: one need simply look for the greatest overlap between the words used and those in a reference list for each language. Classifying linguistic styles is altogether trickier.
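The word-overlap idea can be sketched in a few lines of Python. This is only an illustration: the function name guess_language and the tiny reference lists are hypothetical, and a real system would draw on full dictionaries.

```python
import re

# Tiny, hypothetical reference lists; a real system would use full dictionaries.
REFERENCE_WORDS = {
    "english": {"the", "and", "of", "to", "is", "in", "that", "it"},
    "italian": {"il", "la", "di", "che", "e", "un", "per", "non"},
    "french": {"le", "la", "de", "et", "un", "que", "est", "les"},
}


def guess_language(text):
    """Return the language whose reference list shares the most words with text."""
    words = set(re.findall(r"[a-zà-ÿ]+", text.lower()))
    overlaps = {lang: len(words & vocab) for lang, vocab in REFERENCE_WORDS.items()}
    # On a tie, max() keeps the first language listed in REFERENCE_WORDS.
    return max(overlaps, key=overlaps.get)


print(guess_language("The cat is in the garden and it is happy"))
print(guess_language("Nel mezzo del cammin di nostra vita mi ritrovai per una selva oscura"))
```

Counting shared words says which language a text is written in, but nothing about who wrote it.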

One obvious approach is to compare the range and frequency of words in the sample text against reference texts from various candidate authors. That might work for markedly different styles: it would quickly distinguish Shakespeare from Tom Clancy.
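As a rough sketch of that word-frequency comparison, the Python below scores a sample against each candidate’s reference text using cosine similarity between word counts. The choice of cosine similarity, the function names and the toy reference texts are all assumptions made for illustration; the article does not say which measure such comparisons use.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each lower-cased word occurs in text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(freq_a, freq_b):
    """Similarity of two word-count tables: 1.0 for identical profiles, 0.0 for no shared words."""
    shared = set(freq_a) & set(freq_b)
    dot = sum(freq_a[w] * freq_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_author(sample, references):
    """Return the candidate whose reference text has the most similar word profile."""
    sample_freq = word_frequencies(sample)
    return max(
        references,
        key=lambda name: cosine_similarity(sample_freq, word_frequencies(references[name])),
    )


# Hypothetical toy reference texts, standing in for whole books.
references = {
    "author_a": "the sea the ship the storm",
    "author_b": "a missile a submarine a radar",
}
print(closest_author("the storm and the sea", references))
```

A measure like this leans heavily on vocabulary, which is why it copes best when the candidates write about very different things in very different ways.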

But literary scholars often argue furiously over the attribution of old texts, and the task can be immensely difficult even for those who know the candidate authors’ writing styles intimately.

Clash of symbols

So Dario Benedetto and colleagues at the Università ‘La Sapienza’ in Rome try a different approach. They start from the premise that written language is in the end no more than a string of symbols. It might look rather random, but it is not.

Some groups of characters recur commonly (such as ‘the’ in English), and particular authors favour certain constructions and turns of phrase. These regularities can be measured, rather than left to subjective impressions or anecdotal comparisons.

The team begin from the classic insight of telecommunications engineer Claude Shannon in the 1940s that the information content of a message is related to its entropy. Roughly speaking, entropy measures how much of a message is not redundant: the more redundancy a message contains, the lower its entropy. It is closely related to the length of the smallest program that will produce the original message as its output.

For a random string of characters, this program would simply have to specify every character, so it would be about the same size as the original message. For a string of just A’s, the program could be very concise: ‘repeat A’. Most real messages lie somewhere in between: they can usually be compressed a little without losing significant information. This is the basis of data-compression computer algorithms, used to make ‘zip’ files, for instance.
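The three cases are easy to try out. The sketch below uses Python’s standard zlib module, which implements the same family of compression used for ‘zip’ files, and reports the compressed size as a fraction of the original. The helper name compression_ratio and the sample strings are chosen here purely for illustration.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower means more redundancy)."""
    return len(zlib.compress(data, 9)) / len(data)


rng = random.Random(0)  # fixed seed so the run is repeatable
random_bytes = bytes(rng.getrandbits(8) for _ in range(10_000))
repeated = b"A" * 10_000
prose = (
    b"Some groups of characters recur commonly, and particular authors "
    b"favour certain constructions and turns of phrase. These regularities "
    b"can be measured. Most real messages lie somewhere in between a random "
    b"string and a single repeated letter, so they can usually be compressed "
    b"a little without losing significant information. This is the basis of "
    b"the data-compression algorithms used to make zip files."
)

for label, data in [("random", random_bytes), ("repeated A", repeated), ("prose", prose)]:
    print(f"{label:>10}: {compression_ratio(data):.3f}")
```

The random bytes barely shrink at all, the run of A’s collapses to almost nothing, and the prose lands between the two.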

Benedetto and his colleagues borrow the principles of data-compression algorithms to calculate a kind of relative entropy for two different character strings: a measure of how much they differ. This distance between two texts is smaller for two works by the same author than for two works by different authors.
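One way to get a feel for such a compression-based distance is to ask how many extra bytes a compressor needs for a fragment once it has already seen a reference text: the more the fragment resembles the reference, the fewer bytes it costs. The sketch below does this with Python’s zlib. It is an illustration under stated assumptions, not the team’s exact procedure: the function names, the use of zlib and the toy texts are all hypothetical, and real texts would need to be far longer for the numbers to be reliable.

```python
import zlib


def zipped_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data."""
    return len(zlib.compress(data, 9))


def extra_bytes(context: str, fragment: str) -> int:
    """Bytes the compressor needs for fragment after it has already seen context."""
    c = context.encode("utf-8")
    f = fragment.encode("utf-8")
    return zipped_size(c + f) - zipped_size(c)


def attribute(fragment, candidates):
    """Pick the candidate whose reference text makes the fragment cheapest to encode."""
    return min(candidates, key=lambda name: extra_bytes(candidates[name], fragment))


# Hypothetical toy corpora standing in for two authors' collected works.
candidates = {
    "sea_writer": (
        "The ship sailed over the grey sea. A storm rose in the night and the "
        "sailors prayed for dawn. By morning the storm had passed and the ship "
        "sailed on over the grey sea towards the harbour."
    ),
    "market_writer": (
        "The market fell sharply at noon. Traders sold shares in a panic and the "
        "bank raised its rates. By evening the market had steadied and traders "
        "bought shares again at the lower price."
    ),
}
fragment = "the storm had passed and the ship sailed over the grey sea"

for name, text in candidates.items():
    print(name, extra_bytes(text, fragment))
print(attribute(fragment, candidates))
```

Because the fragment reuses phrases the compressor has already met in the first corpus, it costs fewer extra bytes there than after the second.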

References

  1. Benedetto, D., Caglioti, E. & Loreto, V. Language trees and zipping. Physical Review Letters 88, 048702 (2002).


PHILIP BALL | © Nature News Service
