L-systems are a special class of parallel grammars that can model the growth of living organisms. They produce sentences that can be interpreted graphically to produce images of fractals or organisms. The essential difference between sequential grammars such as BNF and L-systems is that in sequential grammars the production rules are applied sequentially, one at a time, whereas in L-systems they are applied in parallel and may simultaneously replace all letters in a given word.
All symbols that appear in the grammar are valid in the final string, and any symbol in the alphabet can head a rule [Prusinkiewicz and Hanan, ].
The recursive nature of L-system rules leads to self-similarity and fractal-like forms, which is also a property of DNA sequences [Abramson et al., ]. A rule consists of two strings: the predecessor, an individual symbol from V, and the successor, a string of length L composed of elements of V.
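The parallel rewriting step can be sketched in a few lines of Python. The rules, start string, and iteration count below are illustrative only, not the induced grammar from the paper:

```python
# A minimal sketch of parallel L-system rewriting over the DNA alphabet.
# Every symbol of the word is replaced simultaneously in one pass,
# unlike a sequential grammar that rewrites one occurrence at a time.

def rewrite(word, rules):
    """Apply all production rules in parallel: each symbol is replaced
    by its successor string; symbols without a rule are kept as-is."""
    return "".join(rules.get(symbol, symbol) for symbol in word)

# One deterministic rule per nucleotide (predecessor -> successor).
rules = {"A": "AT", "C": "CG", "G": "GA", "T": "TC"}

word = "AT"
for _ in range(3):
    word = rewrite(word, rules)
print(word)  # each iteration doubles the string length
```

With two-symbol successors the string length doubles per derivation step, giving the exponential growth characteristic of L-systems.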
Each production rule refers only to an individual symbol and not to its neighbors.

Machine Learning Using Support Vector Machines

Support Vector Machines (SVM) [Vapnik, ] are one of the most popular tools in bioinformatics for supervised classification of genetic data (biosequences, protein structure data, microarray gene expression, etc.).
SVM is a structural risk minimization-based method for creating binary classification functions from a set of labeled training data.
SVM requires that each data instance be represented as a vector of real numbers in feature space. Hence, if there are categorical attributes, we first have to convert them into numeric data.
First, SVM implicitly maps the training data into a (usually higher-dimensional) feature space. A hyperplane decision surface is then constructed in this feature space that bisects the two categories and maximizes the margin of separation between itself and those points lying nearest to it (the support vectors).
This decision surface can then be used as a basis for classifying vectors of unknown classification. Here we have a binary classification problem in which the outcomes are labeled as either the positive (P) or the negative (N) class. We use a simple accuracy (ACC) metric, which measures how well a binary classification test correctly identifies or excludes a condition.
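As a sketch, the accuracy metric is the fraction of correctly labeled instances over all instances; the toy labels below are illustrative:

```python
# Accuracy (ACC) for a binary classifier:
# ACC = (TP + TN) / (TP + TN + FP + FN),
# i.e. the fraction of instances whose predicted label matches the true one.

def accuracy(true_labels, predicted_labels):
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Example: positive class P = 1 (promoter), negative class N = 0.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
print(accuracy(y_true, y_pred))  # 0.75 (6 of 8 labels match)
```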
Grammar Rule Inference

Derivation of formal grammar rules, also known as grammatical induction or grammar inference, refers to the process of inducing a formal grammar (usually in the form of production rules) from a set of observations using machine learning techniques.
The result of grammar inference is a model that reflects the characteristics of the observed objects. The method suggests successively guessing grammar rules (productions) and testing them against positive and negative observations.
The best ruleset is then mutated randomly following the ideas of genetic programming until no improvement in accuracy is found within a certain number of iterations.
The rules of the L-system grammar are applied iteratively and simultaneously, starting from the initial string. We describe the grammar inference problem as a classical optimization problem as follows. The test file contains examples (promoters, introns, and coding sequences), each of bp length. The promoters were extracted from the Eukaryotic Promoter Database (rel. ).
The positive set consists of promoter sequences and the negative set consists of introns and coding sequences.

Description of L-grammar rule generation

The L-grammar rule generation process is performed in two stages, as follows. (1) Derivation of a learned classifier. The promoter dataset is used for training an SVM classifier. For training, the linear kernel function is used. The result is a model of a learned classifier that can separate promoter and non-promoter sequences.
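A minimal sketch of this first stage, using scikit-learn's linear-kernel SVM; the tiny toy sequences and labels below are illustrative stand-ins for the promoter dataset:

```python
# Stage 1 sketch: train a linear-kernel SVM on one-hot-encoded sequences.
# The dataset here is a toy; the real system uses the promoter dataset.
from sklearn.svm import SVC

def encode(seq):
    """Encode a DNA string as a flat vector of 4-dimensional orthogonal
    binary (one-hot) nucleotide codes, as described in the text."""
    one_hot = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
               "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}
    return [bit for base in seq for bit in one_hot[base]]

# Toy positive (promoter-like, label 1) and negative (label 0) examples.
X = [encode(s) for s in ["TATAAT", "TATAAA", "GCGCGC", "CCGGCC"]]
y = [1, 1, 0, 0]

clf = SVC(kernel="linear")
clf.fit(X, y)
print(clf.predict([encode("TATATA")])[0])
```

The trained `clf` model plays the role of the learned classifier used in stage 2 to score generated strings.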
(2) Generation of L-grammar rules. The classifier model is used for evaluating the fitness of the generated L-grammar rules. Rules are generated by the L-grammar rule generator using a directed random search method: the best ruleset so far is mutated randomly and used to generate L-system strings of bp length.
The mutation parameters are the start string, the successor strings and the production rule probabilities. The number of rules as well as the predecessor strings are fixed. There are four rules, one for each type of nucleotide: an A-rule, a C-rule, a G-rule and a T-rule. The generated strings are encoded as 4-dimensional orthogonal binary vectors, fed to the trained SVM classifier, and the accuracy of the classification is obtained. If the accuracy of the mutated ruleset is better than the accuracy of the previous best ruleset, the mutated ruleset is saved as the best ruleset; otherwise the previous best ruleset is retained.
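The directed random search can be sketched as a simple hill-climbing loop. The initial ruleset and the fitness function below are stand-ins (the real system scores generated strings with the trained SVM classifier's accuracy):

```python
# Sketch of the directed random search over L-grammar rulesets:
# mutate the best ruleset, regenerate strings, and keep the mutant
# only if its fitness improves.
import random

random.seed(0)
ALPHABET = "ACGT"

def mutate(ruleset):
    """Randomly change one character of one successor string; the
    predecessors and the number of rules stay fixed, as in the text."""
    new = dict(ruleset)
    pred = random.choice(ALPHABET)
    succ = list(new[pred])
    succ[random.randrange(len(succ))] = random.choice(ALPHABET)
    new[pred] = "".join(succ)
    return new

def generate(ruleset, start="A", steps=4):
    """Derive an L-system string by parallel rewriting."""
    word = start
    for _ in range(steps):
        word = "".join(ruleset[s] for s in word)
    return word

def fitness(word):
    # Stand-in fitness: fraction of 'TA' dinucleotides. The real system
    # uses the trained SVM classifier's accuracy on the generated strings.
    return word.count("TA") / max(len(word) - 1, 1)

best = {s: s * 2 for s in ALPHABET}   # illustrative initial ruleset
best_score = fitness(generate(best))
for _ in range(200):
    candidate = mutate(best)
    score = fitness(generate(candidate))
    if score > best_score:            # keep only strict improvements
        best, best_score = candidate, score
print(round(best_score, 3))
```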
The process is continued until the required accuracy is achieved. The structure of the L-grammar induction system is summarized in Fig. (Structure of the L-grammar induction system).

Results

The classification was done in two stages: (1) SVM was trained using promoter vs. non-promoter sequences.
The classification results, using the Accuracy metric (see Eq. ), show that artificially generated sequences can be classified against natural promoter sequences. This suggests that the induced L-grammar rules indeed capture some essential properties of promoter sequences. The results are worse for the vertebrate dataset, because vertebrate promoters have more complex patterns with more irregularities and exceptions. The derived stochastic L-grammar rules are presented in Fig.
Note that in the drosophila grammar rules the C symbol, and in the vertebrate grammar rules the T symbol, are completely missing from the successor (right-hand) side of the production rules. (Table: Classification results. Figure: Stochastic L-grammar rules for generating (a) drosophila and (b) vertebrate promoter-like sequences.)

Evaluation

After analyzing the induced promoter sequence production rules, we can conclude that these rules characterize fairly well the subsequences that are typical for promoters.
Other subsequences typical for promoter sequences are produced by the successive application of the production rules. The vertebrate promoters are typically more complex than the drosophila promoters, because there are many TATA-less promoters, which are characterized by other, more complex subsequences.
These subsequences can also be produced from the induced grammar rules.

Conclusion

The advantage of the machine learning method combined with formal grammar is that, in addition to recognition (which does not provide any structural information), the modular structure of the analyzed genetic sequences can be identified.
The classification results of the artificial promoter sequences generated using the derived L-grammar rules are almost as accurate as those of the natural promoters, thus showing a great deal of similarity between the discriminating features of both types of sequences.

Bibliography

[Abramson et al., ] Biosystems 49(1).
[Denise et al., ] Drosophila promoter dataset. Grammars 6(1).
[Gheorghe and Mitrana, ] Gheorghe, M. Comparative and Functional Genomics 5: 91–94.
[Grate et al., ] Human promoter dataset.
CGM: A context-free grammar manipulator

Rogerio Reis

We present an interactive graphical environment for the manipulation of context-free languages. The graphical environment allows the editing of a context-free grammar, the conversion to Chomsky normal form, word parsing and the interactive construction of parse trees. It is possible to store positive and negative datasets of words to be tested by a given grammar.
The main goal of this system is to provide a pedagogical tool for studying grammar construction and parse trees. For experimenting with, studying and teaching their formal and computational models, it is useful to have tools for manipulating languages as first-class objects. Automata theory and formal languages courses are mathematical in essence, and traditionally are taught without computers. Well-known advantages of the use of computers in education are: interactive manipulation, visualisation of concepts and feedback to the students.
In this paper, we describe an interactive graphical environment, implemented in Python [vR05], for manipulation of context-free languages.
The use of Python, a high-level object-oriented language with high-level data types and dynamic typing, allows us to have a system which is modular, extensible, clear, easy to implement, and portable.
The Python interface to the wxWidgets graphical toolkit [SRZD] gives a good platform to build a graphical environment. This work is part of the FAdo project, which aims at the development of interactive environments for symbolic manipulation of formal languages [MR,MR05]; it already includes a toolkit for the manipulation of finite automata and regular expressions [MR05], a Turing machine simulator and an LR parser generator, Yappy [RM03].
The context-free grammar manipulator presented here allows the editing of a context-free grammar, the conversion to Chomsky normal form and word parsing. It is possible to store positive and negative datasets of words to be tested by the grammar.
Grammar ambiguity will be detected if an input word has several parse trees, and all those parse trees will be shown. The paper is organized as follows. In the next section we review the basic concepts of context-free languages and context-free grammars. In Section 3 we describe the parsing algorithm that was implemented and discuss how it can be used to allow the interactive construction of parse trees.
Section 4 describes the CGM interactive graphical environment. Section 5 concludes with a short discussion of related work and some future work. To each word derivation corresponds a parse tree. Applications of grammars in programming language development, namely compilers, normally require that grammars be non-ambiguous; otherwise several semantics can be given to a word.
Efficient linear parsing algorithms are well known for sub-classes of context-free languages, such as the ones generated by LR-grammars [ASU86]. Here we wanted a parser for a generic context-free grammar that would allow the interactive construction of parse trees. There are two main approaches to parsing a word: top-down, building the parse tree from the start symbol down to the yield (the word); and bottom-up, building the parse tree from the leaves (the word) up to the start symbol. We chose a bottom-up parsing method: the CYK algorithm, due to J. Cocke, D. H. Younger and T. Kasami [HMU00]. All these steps can be automatically implemented. Given a CNF grammar G and an input word x, the CYK algorithm constructs a table that indicates which non-terminals derive which subwords of the input word. This is the recognition phase. In a second phase, the table and the grammar are used to construct all possible parse trees (derivations).
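The recognition phase can be sketched as follows; the grammar below (generating a^n b^n) and the function names are illustrative, not taken from the CGM implementation:

```python
# A sketch of the CYK recognition phase for a grammar in Chomsky normal
# form. Rules are (lhs, rhs) pairs where rhs is either a 1-tuple with a
# terminal or a 2-tuple of non-terminals.

def cyk(word, rules, start):
    n = len(word)
    # N[i][j] holds the non-terminals deriving the subword word[i:i+j+1].
    N = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):               # terminal productions
        for lhs, rhs in rules:
            if rhs == (ch,):
                N[i][0].add(lhs)
    for length in range(2, n + 1):              # longer subwords
        for i in range(n - length + 1):
            for k in range(1, length):          # split point
                for lhs, rhs in rules:
                    if len(rhs) == 2:
                        b, c = rhs
                        if b in N[i][k - 1] and c in N[i + k][length - k - 1]:
                            N[i][length - 1].add(lhs)
    return start in N[0][n - 1]                 # word derivable from start?

# Toy CNF grammar generating a^n b^n:
# S -> A B | A C, C -> S B, A -> a, B -> b
rules = [("S", ("A", "B")), ("S", ("A", "C")),
         ("C", ("S", "B")), ("A", ("a",)), ("B", ("b",))]
print(cyk("aabb", rules, "S"))  # → True
```

The table N built here is exactly what the second phase walks over to enumerate all parse trees of the word.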
But it is possible to recover the removed non-terminals and add that information when building the CYK recognition table. In our implementation we build a list of tuples that relates the original non-terminals with the ones that appear in the CNF grammar. When building a parse tree, each occurrence of a new symbol in the recognition table is substituted by the respective original symbol.
To verify whether a word belongs to the language, the program checks if the start symbol occurs in N[1, n]. If so, it begins to build the possible parse trees as tuples of trees. If not, the program will try to find trees for the subwords, which, in the worst case, will be the trees for each symbol of the word. In Figure 2 we present the general view of the CGM application, with the five main working areas identified. (Figure 2: The CGM interactive graphical environment.)
In the following we will describe each of the components and some implementation details. The grammar info panel provides information about a compiled grammar: the start symbol, the set of terminals and the set of non-terminals. Each grammar also has associated special characters that indicate the empty word and the concatenation symbol, which allows the separation of grammar symbols.
(A sample grammar with rules over Int, Frac and Digit appears here in the original.) Note that the concatenation symbol will also be used for the input words. For instance 1. By default the empty word will be represented by e, but both of the special symbols (concatenation and empty word) are configurable. If a rule is not well-formed, the editor will detect it and a red bullet will appear in the corresponding line.
There is no need to introduce the start symbol or the sets of terminals and non-terminals: whenever the grammar is compiled, those symbols are inferred from the set of rules. By default, the start symbol corresponds to the symbol on the left-hand side of the first rule. The line containing the rule with the start symbol will be marked with a blue arrow. If the auto-compile option is enabled, the grammar will be automatically compiled when the editor loses focus. If a grammar is successfully compiled, some information appears in the grammar info panel and words can then be input for parsing.