Forum for Science, Industry and Business

Algorithm makes tongue tree

22.01.2002


Compression helps a computer tell Dante from Machiavelli


New computer program could settle literary debates.

To date, unlike us, computers have struggled to differentiate a page of Jane Austen from one by Jackie Collins. Now researchers in Italy have developed a program that can spot enough subtle differences between two authors’ works to attribute authorship [1].

The program can tell a text by Machiavelli from one by Pirandello, Dante or a host of other great Italian writers. It has also constructed a language tree showing the degree of affinity between 50 different tongues. The tree identifies all the main linguistic groups, such as Romance, Celtic and Slavic, and highlights Maltese (an Afro-Asiatic language) and Basque as anomalies.



As well as settling a few literary arguments, the technique might be useful for comparing other information-rich sequences of data. These might include genetic sequences, medical-monitoring measurements and stock-market fluctuations.

Style over substance

Identifying the language of a particular text is generally not hard in itself: one need simply look for the greatest overlap between the words used and those in a reference list for each language. Classifying linguistic styles is altogether more tricky.
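The word-overlap idea can be sketched in a few lines of Python. This is only an illustration: the reference lists below are hypothetical and far too small for real use, and the function name guess_language is an assumption made for this example.

```python
import re

# Hypothetical, deliberately tiny reference lists of common words.
# A practical system would use far larger lists for each language.
REFERENCE_WORDS = {
    "english": {"the", "and", "of", "to", "is", "in", "that", "it"},
    "italian": {"il", "la", "di", "che", "e", "un", "per", "non"},
    "french": {"le", "la", "de", "et", "les", "des", "est", "que"},
}


def guess_language(text: str) -> str:
    """Return the language whose reference list shares the most words with text."""
    # Split the text into lower-case runs of letters (accented letters included).
    words = set(re.findall(r"[^\W\d_]+", text.lower()))
    overlap = {lang: len(words & ref) for lang, ref in REFERENCE_WORDS.items()}
    # On a tie, max() returns the first language listed in REFERENCE_WORDS.
    return max(overlap, key=overlap.get)


print(guess_language("It is a truth that the answer is in the text"))
print(guess_language("Nel mezzo del cammin di nostra vita mi ritrovai per una selva oscura"))
```

With lists this short, a sentence that happens to avoid the listed words would be misclassified, which is why real reference lists need to be large.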

One obvious approach is to compare the range and frequency of words in the sample text against reference texts from various candidate authors. That might work for markedly different styles: it would quickly distinguish Shakespeare from Tom Clancy.

But literary scholars often argue furiously about attributions for old texts. The task can become immensely difficult even for those with a great deal of knowledge about the candidate authors’ writing styles.

Clash of symbols

So Dario Benedetto and colleagues at the Università ‘La Sapienza’ in Rome try a different approach. They start from the premise that written language is, in the end, no more than a string of symbols. It might look rather random, but it is not.

Some groups of characters recur commonly (such as ‘the’ in English), and particular authors favour certain constructions and turns of phrase. These regularities can be measured, rather than judged from subjective impressions or anecdotal comparisons.

The team begin from the classic insight of telecommunications engineer Claude Shannon in the 1940s that the information content of a message is related to its entropy. Roughly speaking, entropy measures how little redundancy a message contains: the more redundant the message, the lower its entropy. In a closely related formulation, known as algorithmic complexity, it can be defined as the length of the smallest program that will produce the original message as its output.

For a random string of characters, this program would simply have to specify every character, so it would be about the same size as the original message. For a string consisting only of the letter A, the program could be very concise: ‘repeat A’. Most real messages lie somewhere in between: they can usually be compressed a little without losing significant information. This is the basis of data-compression computer algorithms, used to make ‘zip’ files, for instance.
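The three cases can be tried out with Python's standard zlib module, which implements the same family of compression used in ‘zip’ files. This is a rough illustration only: the helper compression_ratio and the sample prose are assumptions made for the example, and on inputs this short the compressor's fixed overhead makes the figures approximate.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller means more redundancy)."""
    return len(zlib.compress(data, 9)) / len(data)


# An ordinary piece of English prose, used here only as a test input.
prose = (
    "Some groups of characters recur commonly, and particular authors "
    "favour certain constructions and turns of phrase. Written language is "
    "in the end no more than a string of symbols. It might look rather "
    "random, but it is not. Most real messages lie somewhere in between: "
    "they can usually be compressed a little without losing information."
).encode("utf-8")

samples = {
    "random bytes": os.urandom(len(prose)),      # no pattern to exploit
    "ordinary prose": prose,                     # some redundancy
    "only the letter A": b"A" * len(prose),      # almost pure redundancy
}

for name, data in samples.items():
    print(f"{name:>18}: ratio {compression_ratio(data):.2f}")
```

The random bytes should not shrink at all (the ratio can even creep slightly above 1 because of the compressor's bookkeeping), the run of A's should collapse to almost nothing, and the prose should land in between.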

Benedetto and his colleagues borrow the principles of data-compression algorithms to calculate a kind of relative entropy for two different character strings: a measure of how much they differ. This distance tends to be smaller for two works by the same author than for two works by different authors.
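One way to get a feel for such a compression-based distance is to append a short sample to a reference text and see how many extra bytes the compressor needs. The sketch below is written in that spirit and is not the authors' exact procedure: the function names extra_cost and closest_reference are assumptions, the two reference passages are tiny, and the samples deliberately reuse phrases from one reference so that the effect shows up on a few hundred bytes. Real experiments would need much longer texts.

```python
import zlib


def zipped_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data."""
    return len(zlib.compress(data, 9))


def extra_cost(reference: bytes, sample: bytes) -> int:
    """Extra compressed bytes needed when sample is appended to reference.

    A small value means the compressor could reuse patterns it had already
    seen in the reference. zlib only looks back about 32 KiB, so this toy
    version is meaningful only for short texts.
    """
    return zipped_len(reference + sample) - zipped_len(reference)


def closest_reference(sample: bytes, references: dict) -> str:
    """Name of the reference text that makes the sample cheapest to encode."""
    return min(references, key=lambda name: extra_cost(references[name], sample))


# Toy reference texts (assumed inputs for illustration only).
references = {
    "austen": (
        "It is a truth universally acknowledged, that a single man in "
        "possession of a good fortune, must be in want of a wife. However "
        "little known the feelings or views of such a man may be on his "
        "first entering a neighbourhood, this truth is so well fixed in the "
        "minds of the surrounding families, that he is considered the "
        "rightful property of some one or other of their daughters."
    ).encode("utf-8"),
    "dante": (
        "Nel mezzo del cammin di nostra vita mi ritrovai per una selva "
        "oscura, ché la diritta via era smarrita. Ahi quanto a dir qual era "
        "è cosa dura esta selva selvaggia e aspra e forte che nel pensier "
        "rinova la paura!"
    ).encode("utf-8"),
}

# Contrived samples that reuse phrases from one of the references.
sample_a = b"a single man of good fortune, considered the rightful property of the surrounding families"
sample_b = b"una selva oscura e aspra, la diritta via di nostra vita"

for sample in (sample_a, sample_b):
    costs = {name: extra_cost(ref, sample) for name, ref in references.items()}
    print(closest_reference(sample, references), costs)
```

The design choice here is to measure only the increase in compressed size: whatever the compressor already learned from the reference is paid for in the first term, so the difference reflects how much of the sample is new to it.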

References

  1. Benedetto, D., Caglioti, E. & Loreto, V. Language trees and zipping. Physical Review Letters 88, 048702 (2002).


PHILIP BALL | © Nature News Service
