
Ebook: Biology, Computation and Linguistics

Over time, the root discipline of philosophy separated into many disciplines and sub-disciplines, each of which has developed its own specific methods. Whilst cross-disciplinary interaction between the three vertices of biology, computing and language processing has often occurred quite naturally in the past, these interactions were mostly two-way, because combining more than two disciplines presents significant challenges. But for some disciplines, reaching out to others is no longer a luxury, but a necessity, and an inverse process of integration is now required. Some disciplines have become unnaturally disconnected from others or from the whole, with the result that a broad view is sometimes lost, and parallels which could be exploited to great advantage cannot even be seen. This book presents a series of essays on biology, computation and linguistics. It seeks to make their connectedness more apparent so that these single disciplines, which relate naturally but which have drifted far apart, can fruitfully reconnect from their present degrees of specialization. Topics covered include: finding isomorphisms between genetic code and verbal language; viewing childhood dialects as modeling computers; the use of a computational formalism (concept formation) for mining both linguistic and biological texts; lower bounds for asymmetrical insertion-deletion languages; a computational model for linguistic complexity; enumerative speculation on possible languages; as well as an exploration of cognitive architectures for multi-agent systems from a computational point of view. The book aims to promote further interconnectedness between the humanistic and the formal sciences, for a less dichotomized, more integrated world.
Exciting times are ahead of us. Instantaneous communication made possible by contemporary technology is fostering more cross-disciplinary interactions than ever, as well as accelerating those within each field. And it is high time, because it is becoming ever clearer that some disciplines simply need to join forces with others.
Biology, one of our disciplines in focus, is a case in point: an unprecedented growth in the volume of biological data over the past few years (most notably, human language text produced in the form of articles, books, web sites, etc., and genetic code text in nucleic acid language, such as DNA sequences) has created formidable challenges for its timely and interrelated processing. Biologists' traditional methods for processing and making sense of such information can no longer keep up with the exponentially growing flood of information. Artificial Intelligence methods are coming to the rescue, in particular those for knowledge retrieval and analysis, computational linguistics and natural language processing. In the process, AI methods are specializing into biologically oriented methods, and influencing computing science methods in turn, in a full-circle dance.
Historically, cross-disciplinary interaction between our three vertices (Biology, Computing and Language Processing) has been a naturally occurring phenomenon in many cases: the Prolog language evolved from a computational linguistics formalism (Alain Colmerauer's Q-systems); the memoing technique underneath tabling-oriented computing platforms is intimately related to chart parsing in language processing; logic grammars were extensively used around the world to help find the human genome; statistical computing is permeating computational linguistics… However, these interactions have mostly been two-way ones, partly because combining two disciplines is challenging enough already: one must acquire at least a familiarity with the other discipline's jargon in order to communicate unambiguously and effectively with collaborators in that discipline; the usual methods in each discipline and the ways results are evaluated might differ; and even the conventions by which publications are viewed may differ greatly, e.g. in some disciplines a publication's main author is typically listed first, in others last.
The fact remains that for some disciplines, reaching out to others is no longer a luxury but a necessity. It is our belief that once two-way interactions develop enough to constitute a solid basis for further cross-fertilization, some now incipient multiple-discipline endeavors will become more prevalent. For instance, as soon as knowledge extraction from text – now largely dependent on keyword and corpus- or domain-oriented restrictions – matures enough to allow arbitrary text to be reliably transformed into usably coded knowledge that can later be consulted, the field of web text mining will be able to interact with the many fields that desperately need such abilities. These will include fields that already draw from two different disciplines, e.g. computational life sciences.
Just as our root discipline – Philosophy – needed to slowly separate into a myriad of disciplines and sub-disciplines in order to develop methods specific to each, to achieve depth, etc., we believe we are now at a time in which an inverse process of integration needs to happen: in specializing, some disciplines have become unnaturally disconnected from others or from the whole, with the result that the broad view of the forest is sometimes lost, and that parallels that could be exploited cannot even be seen. A reconnection from the more mature present standpoints of these different branches seems in order and, in any case, is simply happening.
It is with this integration in mind that we introduce in the present volume a series of essays on biology, computation and linguistics, as a first effort to promote a level of granularity among the three at which their connectedness becomes more apparent, and from which these single disciplines, which relate naturally but have drifted far apart, can fruitfully reconnect from their present degrees of specialization.
Thus, the influence of biology on computing science is examined from the point of view of using the structure of DNA molecules as computational tools, and of biologically inspired computational models. Computational tools such as tailored parsing methodologies and web querying, probabilistic extended regular expressions and matrix insertion-deletion systems are applied to health-science and biological tasks such as de-identifying medical records, modeling repeats in DNA, modeling intermolecular structures and defining ambiguity in gene sequences.
Biology and linguistics are joined from the point of view of finding isomorphisms between genetic code and verbal language, and of how childhood dialects can be seen as modeling computers. Three-way interdisciplinary work includes the use of a computational formalism (concept formation) for mining both linguistic and biological texts, and a three-way analysis of dependency crossing, from the points of view of language, biology and satisfiability.
The influences of computing science upon language are represented by work on lower bounds for asymmetrical insertion-deletion languages, on a computational model for linguistic complexity, on an enumerative speculation on possible languages, and on computational realizations of vowel-consonant speech segmentation by neuromorphic units. Finally, the language of acyclic recursions is applied to linguistic descriptions, and cognitive architectures for multi-agent systems are explored from a computational point of view.
With this work we hope to stimulate further research on these emerging themes and their three-way relationships, and to foster further interconnectedness in general between the humanistic and the formal sciences, for a less dichotomized, more integrated world.
Tarragona, March 2011
Gemma Bel-Enguix, Veronica Dahl, M. Dolores Jiménez-López
DNA molecules have several structural features that allow them to be used as tools for solving computational problems. Some of these same features allow DNA molecules to generate, in the presence of appropriate enzymes, potentially infinite ‘languages' over the alphabet of paired nucleotides A/T, C/G, G/C and T/A. Fundamental features of DNA are described in detail. Language generation, via ‘splicing systems', is then described at some length. Then, in three short sections, computations of solutions of toy instances of classical combinatorial problems are indicated, with references given for detailed study. We close with an invitation to investigate the exciting new work in DNA constructions and computations through the web pages of several researchers. These notes were prepared to accompany tutorial talks given at the 1st International Work-Conference on Linguistics, Biology and Computer Science: Interplays, in Tarragona, Spain, March 14–19, 2011.
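To make the splicing operation concrete: a splicing rule (u1, u2; u3, u4) cuts one word between contexts u1 and u2, cuts another between u3 and u4, and recombines the resulting prefix and suffix. The following Python sketch is illustrative only; the rule and the input words are invented toys, not drawn from the tutorial itself.

```python
# Minimal sketch of Head-style splicing, assuming a rule (u1, u2; u3, u4):
# if x = x1 u1 u2 x2 and y = y1 u3 u4 y2, splicing yields x1 u1 u4 y2.
# The rule and input words below are invented toy examples.

def splice(x: str, y: str, rule: tuple) -> set:
    """All words obtained by applying one splicing rule to the pair (x, y)."""
    u1, u2, u3, u4 = rule
    results = set()
    for i in range(len(x) + 1):
        if x[:i].endswith(u1) and x[i:].startswith(u2):          # cut site in x
            for j in range(len(y) + 1):
                if y[:j].endswith(u3) and y[j:].startswith(u4):  # cut site in y
                    results.add(x[:i] + y[j:])                   # x1 u1 | u4 y2
    return results

# A rule loosely mimicking a restriction enzyme cutting between A and T:
print(splice("CCATGG", "GGATCC", ("A", "T", "A", "T")))  # {'CCATCC'}
```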
In this paper we present a brief overview of models, results and research directions in membrane computing, and discuss variants and properties of networks of evolutionary processors, a framework and research area in bio-inspired computing. We also propose some new directions and topics for future research.
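For readers new to the second framework: a network of evolutionary processors alternates an evolutionary phase, in which each node rewrites its words by simple point mutations, with a communication phase, in which words migrate between nodes according to filters. The Python sketch below runs one such step; its nodes, rules and filters are invented placeholders, not a system from the paper.

```python
# Toy sketch of one NEP step: an evolutionary phase (each node applies a
# point substitution to its words) followed by a communication phase
# (words move to every node whose input filter they pass).

def substitute(word, a, b):
    """All words obtained by replacing one occurrence of a by b."""
    return {word[:i] + b + word[i + 1:] for i, c in enumerate(word) if c == a}

nodes = {
    "n1": {"words": {"aab"}, "rule": ("a", "b"), "filter": "a"},
    "n2": {"words": set(),   "rule": ("b", "c"), "filter": "b"},
}

def step(nodes):
    # Evolutionary phase: rewrite each node's words (keep a word if no
    # substitution applies).
    for node in nodes.values():
        new = set()
        for w in node["words"]:
            new |= substitute(w, *node["rule"]) or {w}
        node["words"] = new
    # Communication phase: a word enters every node whose filter symbol
    # it contains.
    pool = set().union(*(n["words"] for n in nodes.values()))
    for node in nodes.values():
        node["words"] = {w for w in pool if node["filter"] in w}

step(nodes)
print(nodes["n1"]["words"], nodes["n2"]["words"])
```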
Hidden Markov Models, or HMMs, are a family of probabilistic models used for describing and analyzing sequential phenomena such as written and spoken text, biological sequences, and sensor data from the monitoring of hospital patients and industrial plants. An inherent characteristic of all HMM subspecies is their control by some sort of probabilistic finite state machine, though they may differ in detailed structure and in the specific sorts of conditional probabilities. In the literature, however, the different HMM subspecies tend to be described as separate kingdoms, with their entrails and inference methods defined from scratch in each particular case. Here we suggest a unified characterization using a generic, probabilistic-logic framework and generic inference methods, which also promote experiments with new hybrids and mutations. This may even involve context dependencies that traditionally are considered beyond the reach of HMMs.
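What all these subspecies share is dynamic-programming inference over a probabilistic state machine. As a point of reference, here is a minimal Python sketch of the standard forward algorithm on an invented two-state toy model; the paper's own unified framework is probabilistic-logic, not Python.

```python
# Forward algorithm: compute P(observations) for a toy two-state HMM.
# The weather model below is invented purely for illustration.

states = ("Rainy", "Sunny")
start  = {"Rainy": 0.6, "Sunny": 0.4}
trans  = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
          "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit   = {"Rainy": {"walk": 0.1, "shop": 0.9},
          "Sunny": {"walk": 0.8, "shop": 0.2}}

def forward(observations):
    """P(observations), summing over all hidden state sequences."""
    alpha = {s: start[s] * emit[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {s: emit[s][obs] * sum(alpha[r] * trans[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())

print(forward(["walk", "shop", "walk"]))
```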
De-identification is the process of automatic removal of all Private Health Information (PHI) from medical records. The main focus in this active and important research area is on semi-structured records. This narrow focus has allowed the development of standard criteria that formally determine the boundaries of privacy and can be used for evaluations. However, medical records include, as well as semi-structured data from filling in forms, etc., free text in which identifiers are more difficult to detect. In this article we address the problem of de-identification within unstructured medical records. We show how, through the following methods, we are able to recognize, in some cases, identifiers that currently go undetected: (1) parsing free-form medical text into typed logical relationships, including assumptions for candidate identifiers; and (2) a novel use of state-of-the-art engines for processing English queries to the web. We also present a formal definition of our approach within a rigorous logical system that supports the implementation of our ideas.
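For contrast with the article's approach, a conventional baseline detects identifiers in free text with hand-written surface patterns. The sketch below is hypothetical (both the patterns and the example record are invented) and shows the kind of shallow matching that misses the harder cases the article targets.

```python
# A hypothetical rule-based PHI baseline: regex patterns over free text.
# Far simpler than the parsing/web-querying approach the article develops.
import re

PHI_PATTERNS = {
    "DATE":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "ID":    re.compile(r"\bMRN[:#]?\s*\d+\b", re.IGNORECASE),
}

def find_phi(text: str):
    """Yield (label, span, surface form) for each candidate identifier."""
    for label, pattern in PHI_PATTERNS.items():
        for m in pattern.finditer(text):
            yield label, m.span(), m.group()

record = "Patient seen on 03/14/2011, MRN: 48213. Callback 555-123-4567."
for label, span, surface in find_phi(record):
    print(label, span, surface)
```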
Regular expressions are a familiar and widely used formalism integrated into many modern programming languages. Contemporary versions of regular expressions are typically extended variants whose expressive power goes beyond regular languages. Extended regular expressions are inherently non-deterministic and require procedural control such as backtracking. We propose a probabilistic version of extended regular expressions, where the affinity for strings and matches can be learned from examples. The procedural control semantics is replaced by a probabilistic semantics, where the possible matches are ranked by their probability and the most probable match is the one returned. In the present paper, we show how probabilistic extended regular expressions can be used to model repeats in DNA. To deal with cases where the expressive power of probabilistic extended regular expressions is insufficient, we extend the syntax to integrate external functions, which may be deterministic or probabilistic.
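To see why the extension matters: backreferences already take regular expressions beyond regular languages, and suffice to match tandem repeats in DNA. The Python sketch below pairs such a pattern with an invented length-based score standing in for a learned probability; it illustrates the ranking idea only, not the paper's probabilistic semantics.

```python
# Backreference pattern for tandem repeats, with matches ranked by an
# invented score playing the role of a learned match probability.
import re

TANDEM = re.compile(r"([ACGT]{2,6})\1+")  # a unit of 2-6 bases, repeated

def ranked_repeats(seq: str):
    """Return candidate tandem repeats ranked by a toy probability score."""
    candidates = []
    for m in TANDEM.finditer(seq):
        p = 1.0 - 0.5 ** len(m.group())  # invented: longer match, higher score
        candidates.append((p, m.group(1), m.group()))
    return sorted(candidates, reverse=True)

print(ranked_repeats("TTACGACGACGTTCACACACAG"))
```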
Gene insertion and deletion are considered the basic operations in DNA processing and RNA editing. Based on these evolutionary transformations, a computing model known as insertion-deletion systems has been formulated in formal language theory. Recently, in [6], a new computing model named Matrix insertion-deletion system has been introduced to model various bio-molecular structures that occur at the intramolecular level, such as hairpin, stem and loop, pseudoknot, attenuator, cloverleaf and dumbbell. In this paper, we model some of the intermolecular structures, such as double strand languages, nick languages, hybrid molecules (with R-loops), Holliday structures, replication forks and linear hybridization (ligated) languages, using Matrix insertion-deletion systems. In [2], ambiguity in gene sequences was defined as deriving more than one structure for a single gene sequence. Here, we propose a different view of understanding ambiguity in gene sequences: a gene sequence is obtained in more than one way such that the intermediate sequences are different. We further classify this ambiguity into many levels based on the components: axiom, string (order of insertion/deletion) and contexts (order of the contexts used). We notice that some of the inter- and intramolecular structures obey the newly defined ambiguity levels.
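The two basic operations are easy to state concretely: an insertion rule (u, x, v) inserts x between contexts u and v, and a deletion rule removes x between the same contexts. The Python sketch below applies one step of each to a toy axiom; the rules are invented, not bio-molecular structures from the paper. Iterating the insertion rule shown generates the classic a^n b^n language.

```python
# One insertion step and one deletion step of an insertion-deletion system.
# Rules and axiom are invented toys.

def insert_step(w: str, rule: tuple) -> set:
    """All strings obtained by applying insertion rule (u, x, v) to w once."""
    u, x, v = rule
    out = set()
    for i in range(len(w) + 1):
        if w[:i].endswith(u) and w[i:].startswith(v):
            out.add(w[:i] + x + w[i:])   # insert x between contexts u and v
    return out

def delete_step(w: str, rule: tuple) -> set:
    """All strings obtained by deleting x between contexts u and v once."""
    u, x, v = rule
    out = set()
    for i in range(len(w) - len(x) + 1):
        if (w[i:i + len(x)] == x and w[:i].endswith(u)
                and w[i + len(x):].startswith(v)):
            out.add(w[:i] + w[i + len(x):])
    return out

print(insert_step("ab", ("a", "ab", "b")))    # {'aabb'}: toward a^n b^n
print(delete_step("aabb", ("a", "ab", "b")))  # {'ab'}: the inverse step
```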
The aim of this paper is to analyze the similarities between the genetic code and verbal language. First, some syntactic analogies between the two “systems of communication” are presented. Secondly, we discuss the possibility of finding similar semantic phenomena in both systems. Moreover, we provide an overview of several theories that have appeared in recent years and make a remarkable contribution to the topic, with some new data and suggestions for establishing a fruitful linguistic parallelism. Finally, some conclusions and discussion are presented.
This paper focuses on a subfield of machine learning, so-called grammatical inference. A new framework of grammatical inference motivated by the process of acquiring natural language is suggested. The final goal of this research is to improve techniques in machine learning by using natural language acquisition as a model. We claim that simulating the process of acquiring a natural language could improve grammatical inference techniques, and that this improvement could have important implications in the field of human language technologies.
We present, discuss and exemplify a fully implemented specialization of the Concept Formation Cognitive model into a model of text mining that can be applied to spoken languages as well as to molecular biology languages.
We discuss two dichotomic views on crossing dependencies. Such phenomena are relevant both for natural language and for biology, as has been extensively argued. On one hand, we discuss an approach that looks for formalisms more powerful than context-free grammars to capture the phenomena as a word recognition problem. On the other hand, we present an approach that deals with the same kind of problem using finite state technology, but instead of treating it as a word recognition problem, solves it by automata construction. Unrestricted crossing dependencies are exemplified using the well-known problem of propositional satisfiability.
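A standard illustration of crossing dependencies is the pattern {a^n b^m c^n d^m : n, m >= 1}, where the a's link to the c's across the intervening b's, which in turn link to the d's; no context-free grammar generates it. The Python checker below is a toy stand-in for the richer formalisms compared in the paper.

```python
# Toy recognizer for the cross-serial pattern a^n b^m c^n d^m, a classic
# example of crossing dependencies beyond context-free power.
import re

def crossing(w: str) -> bool:
    """Accept words of the form a^n b^m c^n d^m with n, m >= 1."""
    m = re.fullmatch(r"(a+)(b+)(c+)(d+)", w)
    return bool(m) and len(m.group(1)) == len(m.group(3)) \
                   and len(m.group(2)) == len(m.group(4))

print(crossing("aabbbccddd"))  # True:  n = 2, m = 3
print(crossing("aabbccddd"))   # False: the b's and d's do not match up
```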
This article investigates languages generated by insertion-deletion systems of small size. Four classes of asymmetrical insertion-deletion systems are considered; two of them are known not to be computationally complete. We show a series of proper inclusions that allow us to determine some non-trivial lower bounds for the corresponding families of languages.
This paper presents an approach contributing to the evaluation of linguistic complexity. Many different studies address this question; however, it remains very difficult to propose a precise evaluation on the basis of quantifiable criteria. The model we propose answers this need thanks to an original representation of linguistic information by means of constraints. It brings together different complexity factors coming from the linguistic, psycholinguistic and computational domains.
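One invented toy rendering of the idea "complexity from constraints": give each constraint a weight, and let a structure's complexity aggregate the weights of the constraints it violates. None of the constraints, weights or features below come from the paper; the sketch only makes the quantification idea tangible.

```python
# Toy constraint-based complexity score: weighted sum of violated constraints.
# Constraints, weights and the example structure are invented.

WEIGHTED_CONSTRAINTS = [
    ("agreement",   2.0, lambda s: s["subject_number"] == s["verb_number"]),
    ("short_deps",  1.0, lambda s: s["max_dependency_length"] <= 4),
    ("no_crossing", 3.0, lambda s: s["crossing_dependencies"] == 0),
]

def complexity(structure: dict) -> float:
    """Sum the weights of violated constraints: higher means more complex."""
    return sum(w for _, w, ok in WEIGHTED_CONSTRAINTS if not ok(structure))

sentence = {"subject_number": "sg", "verb_number": "sg",
            "max_dependency_length": 6, "crossing_dependencies": 1}
print(complexity(sentence))  # 4.0: violates short_deps and no_crossing
```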
We adopt a description of natural languages in terms of strings of crosslinguistically variable syntactic features (parameters), complying with a specified hypothesis of “universal grammar”, and we deal with two problems: first, assessing the statistical significance of language distances calculated on the basis of such features and recently used to reconstruct phylogenetic trees; second, counting the minimal overall number of possible human languages, i.e. of strings satisfying the implicational rules which describe dependencies between parameters of the specified universal grammar. In order to accomplish these tasks, we had to develop a sampling algorithm capable of dealing correctly with such rules. The potential significance of these results for historical and theoretical linguistics is then briefly highlighted.
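The counting task can be pictured with a toy version: treat each language as a binary parameter vector, and each implicational rule as "setting parameter i requires parameter j". The Python sketch below simply enumerates vectors, with invented parameters and rules; the paper instead develops a sampling algorithm that deals correctly with such rules.

```python
# Count binary parameter vectors (toy 'possible languages') satisfying
# implicational rules of the form: parameter i set => parameter j set.
from itertools import product

def count_languages(n_params: int, implications: list) -> int:
    """Count 0/1 vectors of length n_params satisfying all i -> j rules."""
    count = 0
    for vector in product((0, 1), repeat=n_params):
        if all(not vector[i] or vector[j] for i, j in implications):
            count += 1
    return count

# 4 parameters; parameter 0 presupposes 1, and 2 presupposes 3.
print(count_languages(4, [(0, 1), (2, 3)]))  # 9 of the 16 vectors survive
```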
Speech perception is still a highly complex process, far from fully understood. To gain insight into specific open problems in its automatic treatment (recognition, synthesis, diarization, segmentation, etc.), neuromorphisms and knowledge derived from our understanding of how the Auditory System proceeds may be of crucial importance. The present paper is part of a series of preliminary work carried out to translate some of this understanding into solutions for specific tasks such as speech segmentation and labeling, in a way that parallels the neural resources found in the Auditory Pathways and Cortex. The bio-inspired (neuromorphic) design of some elementary units (responsible for receptive field implementation) covering simple tasks, such as static and dynamic formant tracking or vowel labeling, is presented. In a further step it is shown how simple neural circuits employing these units may achieve successful vowel-consonant separation independently of the speaker. The paper is illustrated with results from simulations.
The formal language Lλar of acyclic recursion, with its type theory, was introduced by Moschovakis between 2003 and 2006 for applications to the computational semantics of human languages. Independently, implementations of large-scale grammars within the HPSG framework have been using semantic representations cast in the language of Minimal Recursion Semantics (MRS). While lacking strict formalization, MRS successfully represents ambiguous quantifier scoping. In this paper, we introduce the basic definitions of MRS while reflecting on possibilities for formalizing MRS with a version of the language Lλar.
The paper introduces work on a computational syntax-semantics interface in the lexicon for verbal inflection. It uses syntactic theory in the approach of Constraint-Based Lexicalist Grammar (CBLG). Semantic representations are rendered in Moschovakis' formal language of acyclic recursion.
We propose a basic computational framework that can be used for modeling hierarchical cognitive architectures. The system is built from at least two layers. The first is used as the semantic network for storing semiotic triads, i.e. objects, categories and the symbols which represent them. The network is constructed from an ensemble of machine learning or clustering methods, each trained on a set of prototypes for a given category. The semiotic triads are linked with weights representing the cognitive similarity between the represented symbols. The number of symbols that can be stored in such a network scales linearly with the number of stored datasets of prototypical objects, as assumed in the symbol grounding paradigm. The second layer is the communication, or interaction, network, which allows agents to communicate with each other, namely to speak about symbols and/or the underlying objects (categories). An agent can pass an object (a perceptual item) and its name to another agent, in order to modify the categorical system of the second agent, or to communicate the symbolic name for a given category. Both cognitive layers are simulated within the Agent-Based Modeling (ABM) approach. Therefore, a typical cognitive task can be implemented here as the dynamical process of equilibration for a set of semiotic networks in a population of interacting agents with language.
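One classic instance of such equilibration in a population of communicating agents is the naming game, sketched below in Python. It is a drastic simplification of the proposed framework (one shared category, no semantic-network layer, invented parameters), meant only to illustrate agents converging on a shared symbol through speaker/hearer episodes.

```python
# Minimal naming game: agents align on one symbol for a shared category.
# Dynamics and parameters are invented simplifications.
import random

class Agent:
    def __init__(self):
        self.names = []  # candidate symbols for the shared category

    def speak(self):
        if not self.names:
            self.names.append("w%d" % random.randrange(1000))  # coin a word
        return random.choice(self.names)

    def hear(self, word):
        if word in self.names:
            self.names = [word]   # success: hearer keeps only this word
            return True
        self.names.append(word)   # failure: adopt it as a candidate
        return False

agents = [Agent() for _ in range(20)]
for episode in range(3000):
    speaker, hearer = random.sample(agents, 2)
    word = speaker.speak()
    if hearer.hear(word):
        speaker.names = [word]    # success: speaker aligns too

print({tuple(a.names) for a in agents})  # typically one shared name remains
```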