Version: 1.00 Date: 29/1/05
Authors: Zach Solan, David Horn, Eytan Ruppin and Shimon Edelman


The ADIOS project addresses the problem, fundamental to linguistics, bioinformatics and certain other disciplines, of using corpora of raw symbolic sequential data to infer the underlying rules that govern their production. Given a corpus of strings (such as text, transcribed speech, nucleotide base pairs, amino acid sequence data, or musical notation), our unsupervised algorithm recursively distills from it hierarchically structured patterns. The ADIOS (Automatic DIstillation of Structure) algorithm relies on a statistical method for pattern extraction (the MEX algorithm) and on structured generalization, two processes that have been implicated in language acquisition. It has been evaluated on artificial context-free grammars with thousands of rules, on natural languages as diverse as English and Chinese, on coding regions in DNA sequences, and on protein data correlating sequence with function. This is the first time an unsupervised algorithm has been shown capable of learning complex syntax, generating grammatical novel sentences, scoring well in standard language proficiency tests, and proving useful in other fields that call for structure discovery from raw data, such as bioinformatics.

For further details see Zach Solan's thesis



Many types of sequential symbolic data possess structure that is (i) hierarchical, and (ii) context-sensitive. Natural-language text and transcribed speech are prime examples of such data: a corpus of language consists of sentences, defined over a finite lexicon of symbols such as words. Linguists traditionally analyze the sentences into recursively structured phrasal constituents (1); at the same time, a distributional analysis of partially aligned sentential contexts (2) reveals clusters in the lexicon that are said to correspond to various syntactic categories (such as nouns or verbs). Such structure, however, is not limited to natural languages: recurring motifs are found, on a level of description that is common to all life on earth, in the base sequences of DNA that constitute the genome. We introduce a novel unsupervised algorithm that discovers hierarchical structure in any sequence data, on the basis of the minimal assumption that the corpus at hand contains partially overlapping strings at multiple levels of organization. In the linguistic domain, our algorithm has been successfully tested both on artificial-grammar output and on natural-language corpora such as ATIS (3), CHILDES (4), and the Bible (5). In bioinformatics, the algorithm has been shown to extract from protein sequences syntactic structures that are highly correlated with the functional properties of these proteins.
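To make the idea of distributional analysis over partially aligned contexts concrete, the following is a minimal Python sketch. It is not the ADIOS or MEX algorithm: it simply groups words that fill the same slot in sentences that are otherwise identical, which is a much cruder alignment criterion than partial overlap. The function name `equivalence_candidates` and the toy corpus are assumptions made for illustration only.

```python
from collections import defaultdict


def equivalence_candidates(sentences):
    """Group words that fill the same slot in otherwise identical sentences.

    Illustrative simplification: two sentences are treated as "aligned"
    only if they differ in exactly one position.
    """
    slots = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            # The context is the sentence with position i left open.
            context = (tuple(tokens[:i]), tuple(tokens[i + 1:]))
            slots[context].add(word)
    # Keep only the slots that are filled by at least two different words.
    return {ctx: words for ctx, words in slots.items() if len(words) > 1}


# Hypothetical toy corpus, not taken from ATIS, CHILDES, or any real data set.
corpus = [
    "the cat chased the dog",
    "the cat chased the bird",
    "the dog chased the bird",
    "a cat saw the bird",
]

for (left, right), words in equivalence_candidates(corpus).items():
    # Show each shared context with its interchangeable words in brackets.
    print(" ".join(left), "[" + " | ".join(sorted(words)) + "]", " ".join(right))
```

On this toy corpus the sketch groups "dog" with "bird" and "cat" with "dog", because each pair occurs in an identical sentence frame. Clusters of this kind are the sort that a distributional analysis would associate with a syntactic category such as noun; a realistic system would additionally need a statistical criterion for deciding which overlaps are significant.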


