Novel Way to Identify Author of Text

That notable quotable is instantly recognizable to most people as a line from Shakespeare's Hamlet. But soon, even computers may be able to match strings of text to their authors automatically -- and by using a free computer program already available on the Internet.

According to a report to be published in the journal Physical Review Letters, researchers at La Sapienza University in Rome have found that a computer file compression program called Gzip provides an unusual means of analyzing strings of data.

Typically, computer compression programs such as Gzip shrink large computer files -- text files, for instance -- by searching for repetitive strings of information. By finding and identifying those patterns, the compression program can reduce the original file to a smaller one that contains just the basic "building blocks" of data and instructions on how to use those blocks to recreate the original, larger file.
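As a rough illustration of that idea, the short Python sketch below uses the standard library's gzip module to compress one highly repetitive input and one input with no patterns at all. The helper name compressed_size and both sample inputs are made up for this example; they do not come from the researchers' report.

```python
import gzip
import os


def compressed_size(data: bytes) -> int:
    """Number of bytes gzip needs to store `data` (header and trailer included)."""
    return len(gzip.compress(data))


# Highly repetitive input: one 20-byte phrase repeated 200 times (4000 bytes).
repetitive = b"the quick brown fox " * 200

# Same length, but random bytes that offer gzip no patterns to exploit.
patternless = os.urandom(len(repetitive))

print("repetitive :", len(repetitive), "->", compressed_size(repetitive))
print("patternless:", len(patternless), "->", compressed_size(patternless))
```

The repetitive input should shrink to a small fraction of its original size, while the random input stays at roughly its original size, since there are no repeated "building blocks" to reuse.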

But Emanuele Caglioti, an associate professor of mathematics at the university and one of the report's authors, says that the program's compression process is also the key that helps identify files of unknown data.

When a program such as Gzip shrinks or "zips" a file, "it is learning something about the file," says Caglioti. Specifically, it is learning the file's so-called entropy, or the minimum number of bits needed to encode the file. Files of similar content would share similar entropies since they share the same common "building blocks."
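One hedged way to picture this is to treat the compressed size as a rough estimate of entropy, measured in bits per byte of input. The sketch below assumes Python's standard gzip module; the function name bits_per_byte and the sample strings are invented for illustration. A compressor can only give an upper bound on the true entropy, never the exact figure.

```python
import gzip


def bits_per_byte(text: str) -> float:
    """Rough upper bound on entropy: compressed bits per byte of input."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must not be empty")
    return 8 * len(gzip.compress(raw)) / len(raw)


# The same sentence repeated a few times, then many times.
short_sample = "the cat sat on the mat. " * 5
long_sample = "the cat sat on the mat. " * 500

print(f"short: {bits_per_byte(short_sample):.2f} bits per byte")
print(f"long : {bits_per_byte(long_sample):.2f} bits per byte")
```

The estimate for the long sample comes out lower than for the short one, because the fixed overhead and the cost of first seeing the sentence are spread over far more input.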

"If you zip a file -- say one composed of English text -- while [the Gzip program] is reading the file, it's learning the statistics of English," says Caglioti. "The more it reads it, the more it can compress it." And adding more English files wouldn't produce a great change in the file's size since the basic pattern -- its entropy -- is already known.

But if the second file turns out to be Italian, Caglioti says, the process has to start all over again and a new entropy is created. "It has to learn [the] Italian," says Caglioti. "This effort has a cost in terms of bits. It takes more space to incorporate the Italian file because it's a different language."
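The "cost in terms of bits" can be sketched as the growth in compressed size when a second text is appended to a first. The Python below is a minimal illustration under stated assumptions: the function extra_cost is a hypothetical helper, and the three short passages are toy samples written for this example, not material from the study.

```python
import gzip


def compressed_size(data: bytes) -> int:
    """Number of bytes gzip needs to store `data`."""
    return len(gzip.compress(data))


def extra_cost(known: str, new: str) -> int:
    """Extra compressed bytes needed to append `new` after `known`."""
    a = known.encode("utf-8")
    b = new.encode("utf-8")
    return compressed_size(a + b) - compressed_size(a)


# Toy samples written for this illustration, not taken from the study.
english_a = ("The train leaves the station in the morning and arrives in the city "
             "before the shops open. The people on the train read the paper. ")
english_b = ("The shops in the city open in the morning, and the people read the "
             "paper on the train before the train leaves the station. ")
italian = ("Il treno parte dalla stazione di mattina e arriva in città prima che "
           "aprano i negozi. La gente sul treno legge il giornale. ")

print("append English to English:", extra_cost(english_a, english_b))
print("append Italian to English:", extra_cost(english_a, italian))
```

On samples this short the numbers are noisy, so they should be read as a demonstration of the mechanism rather than a measurement. Gzip also looks back only about 32 KB, so with a long first file only its tail helps encode the text that follows.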

And Caglioti and his team of researchers discovered that this same process and principle can be used to "identify" works by author. In their research, the Italian scientists collected 90 texts by 11 Italian authors, and in 93 percent of the cases the method correctly matched small text samples with their authors.
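A minimal sketch of how such matching might work: append the unknown sample to each author's known text, measure how much the compressed size grows, and pick the author with the smallest growth. The article does not spell out the researchers' exact procedure, so the function attribute, the author labels and the toy texts below are all assumptions made for illustration.

```python
import gzip


def compressed_size(data: bytes) -> int:
    """Number of bytes gzip needs to store `data`."""
    return len(gzip.compress(data))


def attribute(unknown: str, corpora: dict[str, str]) -> str:
    """Return the author whose known text absorbs `unknown` most cheaply."""
    sample = unknown.encode("utf-8")

    def cost(author: str) -> int:
        # Growth in compressed size when the sample follows this author's text.
        known = corpora[author].encode("utf-8")
        return compressed_size(known + sample) - compressed_size(known)

    return min(corpora, key=cost)


# Hypothetical authors and toy texts, invented for this example.
corpora = {
    "author_a": ("The harbour was quiet that morning, and the fishing boats lay "
                 "still upon the grey water while the gulls circled overhead. "),
    "author_b": ("Quarterly revenue rose sharply as the company expanded its "
                 "network of regional offices and cut its operating costs. "),
}

unknown = "the fishing boats lay still upon the grey water"
print(attribute(unknown, corpora))
```

Here the unknown sample is lifted word for word from the first toy text, which makes the result easy to predict. Real attribution would rely on much longer reference texts and on far subtler regularities in an author's writing.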

"It's pretty clever what they did," said James Riordon, a physicist with the American Institute of Physics, the group that publishes the Physical Review Letters. "Effectively, it's like you're training someone in a language to identify it."

And Caglioti says there's no reason to believe that the compression process couldn't be used in other ways. "Aside from text recognition, it can be used to compare Web pages and find ones that are similar," he says. In addition to creating a better Web search engine, Caglioti notes, "there is the challenge of biological DNA sequencing." He said genetic researchers have already reported in Bioinformatics on using similar zipper approaches to map the human genome.

Mark Adler, the programmer who co-created Gzip in the early 1990s as an alternative to other file compression programs, said he was surprised someone had used his program in such a manner. "It is impressive and a little surprising that simply comparing the length of the compressed output from concatenated known and unknown text provides such high accuracy," he says.

But he remains skeptical that the Italians' research paves the way to foolproof text identifiers -- at least until more studies are done.

"At some point using entropy as a measure may not be fine enough to distinguish between authors with similar styles or use of words and phrases," he says. "I'd wonder how well it would work for author recognition if you tried to distinguish between a thousand authors instead of a dozen."

"Up to now, this is more theoretical than practical," Caglioti concedes. But he says he and his team will continue to work with the program and see what else turns up. "We ought to try and see where it can work."

(China Daily January 31, 2002)

Copyright © China Internet Information Center. All Rights Reserved