Regular Language Manipulator

(RLM)

1.0

 

 

 

Users’ Manual

(Document version 1.0)

 

 

Copyright (c) 2001 by David Theese


Acknowledgments

 

Sincere thanks are owed to Dr. Goran Konjevod of Arizona State University.  Dr. Konjevod contributed to this project in several ways.  His most important contribution was having taught me about regular languages in such a way that the subject was interesting enough to inspire me to take on this project.  In addition, Dr. Konjevod made significant contributions to the debugging process.  Finally, he served as an excellent sounding board when I needed to discuss a new idea or when problems arose.

 

Many thanks are also owed to Raul Aguilar of IPS-Sendero in Scottsdale, AZ.  Raul provided review of this document, made many suggestions on ways to improve the product and also served as an excellent sounding board.


 Introduction

 

The main goal of the Regular Language Manipulator (RLM) is to allow the user to convert freely between any of the following three equivalent representations of a regular language:

 

·         Finite Automaton

·         Regular Expression

·         Regular Grammar

 

In addition, RLM has the ability to perform several different manipulations on automata and can perform basic simplifications on a regular expression.

 

It is assumed throughout this document that the reader is familiar with regular languages, their various representations and the terminology of the subject.  Nevertheless, as individual menu options provided by RLM are explained, a brief description of the theory behind the command may be discussed.  In addition, a section is included which provides a brief review of terminology (but it IS brief and is intended only as review).  For the reader who desires a detailed treatment of the theory of regular languages (and perhaps others areas of theoretical Computer Science), the texts in [1] and [2] are suggested.

 

Many of the algorithms used by RLM are either taken verbatim from [1] or are very similar with slight optimizations made.  In addition, some of the example automata in this document are taken from this text.

 

This document includes a number of illustrations which graphically depict several different automata.  Definition files for these automata are included with your RLM distribution in the “examples” directory.  The names of these files are of the form “figure_x.nfa” where ‘x’ is the figure number.

 

An additional note about the illustrations:  the notation used in [1] has been adopted in this document and is described below.

 

·         Non-accepting states are designated by a single circle.

 

·         Accepting states are designated by a double circle.

 

·         The start state is identified by placing an arrow head (no shaft) pointing at and in contact with  the left side of the state.  In addition, the start states of the automata depicted in this document are always named “q0”.

 

Technical support for RLM may be obtained by sending email to this address:

 

regular_languages@yahoo.com

 

Bug reports and suggestions for improvement are very much appreciated and may also be sent to this email address.


Contents of Your RLM Distribution

 

If you received your RLM distribution by download or email, you should have received a compressed tar file or a zip file.  After unpacking this file, you will have duplicated the exact directory structure that exists on an RLM distribution CD.  This directory structure is shown below.  Directory names are shown in bold and a slightly larger point size.

 

bin

          sparc_sunos_57

                                rlm                                           (executable file)

          x86_dos

                                rlm.exe                                    (executable file)

          x86_linux

                                rlm                                           (executable file)

documentation

          adobe_acrobat

                                rlm.pdf

          html

                   rlm_files

                                                image001.gif

                                                image002.gif

                                                image003.gif

                                                image004.gif

                                                image005.gif

                                                image006.gif

                                                image007.gif

                                                image008.gif

                                                image009.gif

                                                image0010.gif

                                rlm.html

          ms_word_2000

                                rlm.doc

          ms_word_97_rtf

                                rlm.doc                                   (Word 97 Rich Text Format)

examples

figure_1.nfa

figure_2.nfa

figure_3.nfa

figure_4.nfa

figure_5.nfa

figure_6.nfa

figure_7.nfa

figure_8.nfa

figure_9.nfa

figure_10.nfa

sample.exp

sample.grm

 

The documents in the various subdirectories of “documentation” are electronic copies of this document and have identical content.  The documentation is provided in several formats for your convenience.


Installing / Running RLM

 

RLM is a command line-based application which consists of a single executable.  Currently, three platforms are supported:

 

·         DOS on an x86 architecture

·         Linux on an x86 architecture

·         SunOS 5.7 on a SPARC architecture

 

To get started, create a directory from which you intend to run RLM.  If you received your RLM distribution as a compressed tar file or a zip file, unpack the archive into this directory.  If you received your RLM distribution on CD, copy the entire CD contents to this directory (alternatively, you may simply run RLM directly from the distribution CD).

 

Locate the RLM executable in the directory under “bin” that corresponds to your platform.  Execute RLM as described next.

 

RLM is normally run without any command line parameters by simply typing the following at your DOS or UNIX / Linux command prompt:

 

rlm

 

However, RLM does support one command line option.  This option, “-v”, will provide you with version information for your RLM executable.  In this case, RLM is started as follows:

 

rlm -v

 

When run in this manner, RLM will display its version information and exit.  Should you need technical support, you may be requested to use this feature to determine the version of RLM you have.


A Brief Terminology Review

 

The terms “automaton” (plural: automata) and “finite automaton” are synonymous and are used interchangeably.  An automaton is categorized using the following three classifications:

 

·         Non-deterministic Finite Automata – Lambda (NFA-Lambda)

·         Non-deterministic Finite Automaton (NFA)

·         Deterministic Finite Automata (DFA)

 

An NFA-Lambda is the most general type of automaton – it can be non-deterministic, including non-determinism as a result of the presence of Lambda transitions (a review of non-determinism appears later in this document).  As such, ALL automata fall into this category.  This includes automata that are fully deterministic.  Though this may appear to be a contradiction, it is not since the fact there is no non-determinism doesn’t change the fact that there COULD be.

 

All automata that lack Lambda transitions can also be categorized as an NFA.  Again, this also includes automata that are fully deterministic for the same reason cited in the previous paragraph.

 

Finally, an automaton that is fully deterministic can also be categorized as a DFA.  Note that ONLY deterministic automata fall into this category.

 

An “alphabet” is a set of “letters”.  These letters are typically taken from the lowercase letters of the English alphabet.  As such, the definition of “alphabet” in this context is very similar to its common well-known definition.  The alphabet used in most academic examples usually consists of about 1 – 3 letters.  A very typical alphabet is {a, b}, but it certainly can be much larger, and there is no requirement that the letters be taken from the English alphabet.  This is just common practice.  It is also common to see cardinal numbers used as alphabet letters.

 

A “string” is a series of alphabet letters concatenated together.  This is much like an English word being made up of letters from the English alphabet.  Examples of strings over the alphabet {a, b} include a, b, ab, ba, aa, bb, aaa, bbb, aab, aba, etc…

 

A “language” is a set of strings over some particular alphabet.  Again, this is similar to the common definition of the word in that the English language is (partially) a set of words (it is also a set of rules for combining these words into sentences: a grammar; here, the analogy between regular languages and English starts to break down since the English grammar specifies rules for combining words (strings) into sentences whereas a regular grammar specifies rules for combining letters into strings (words).  From this perspective, the analogy holds up better if one thinks of the letters in the alphabet of a regular language representation to be analogous to words, rather than letters, in the English language).

 

Regular expressions and regular grammars can be thought of as “generators” of strings in a language and automata can be thought of as “acceptors” of strings in a language.  The set of all languages that these constructs are capable of working with is known as the “regular” languages.


Operational Overview

 

RLM allows no more than one each of a finite automaton, regular expression and regular grammar to be in memory at once.  It is important to understand that any of these regular language representations that are in memory simultaneously will always be equivalent.  So, for example, if you load an automaton and convert it to a regular expression, you will then have both an automaton and a regular expression in memory which represent the same regular language.  If you then load a different automaton, memory will first be cleared before performing the load.  If this were not done, the old regular expression and the new automaton would be in memory simultaneously even though they would not represent the same language.  This is in conflict with the M.O. RLM follows and is therefore not allowed.

 

With one exception, RLM does not provide facilities for the direct creation of finite automata, regular expressions or regular grammars by the user (the exception is generation of a random automaton).  Rather, the user creates a text file, using a syntax specified later in this document, to define one of these constructs.  Then, RLM is started and instructed to load this file.  The user may then proceed to perform operations on the regular language just loaded into memory.


The Menu Options

 

This section presents a detailed explanation for each of the individual pieces of functionality provided by RLM.  RLM is a menu-driven application with the following menu items available:

 

1.        Load a regular language

2.        Generate a random NFA-Lambda

3.        Complete the specification of an incomplete automaton

4.        Remove unreachable states from the automaton

5.        Remove non-determinism from an NFA-Lambda

6.        Minimize a DFA

7.        Reverse the language accepted by the automaton

8.        Complement the language accepted by a DFA

9.        Simplify the state names of the automaton

10.     Simplify the regular expression

11.     Convert an NFA-Lambda to a regular expression

12.     Convert the regular expression to an NFA-Lambda

13.     Convert an NFA to a regular grammar

14.     Convert the regular grammar to an NFA

15.     Display the automaton

16.     Display the characteristics of the automaton

17.     Display the regular expression

18.     Display the parse tree of the regular expression

19.     Display the regular grammar

20.     Save the automaton to disk

21.     Save the regular expression to disk

22.     Save the regular grammar to disk

23.     Quit

 

All operations that cause a regular language representation to come into existence or modify a representation already in memory display that representation immediately upon completion of the operation.

 

A note needs to be made for UNIX / Linux users with regard to the menu options for saving and loading regular language representations to / from disk.  The ‘~’ notation which is often used in the shell environment to refer to a user’s home directory may not be used when specifying the path to a file.  The underlying library call RLM makes to open the specified file will not accept a path with a ‘~’ character in it.  Note, however, that in all supported environments the “.” and “..” notations for current and parent directory, respectively, may be used.

 

Finally, a note should also be made regarding large automata, regular expressions and regular grammars.  The maximum size of these constructs is bounded only by the amount of memory in your system (except for randomly generated automata; here, there is a maximum of 1000 states and a maximum of 26 alphabet letters).  However, working with large instances of these constructs may prove impractical depending on which manipulations you wish to perform.  This is due to the running times of some of the algorithms RLM uses.  For example, the algorithm to convert an automaton to a regular expression has a running time of O(n4) where n is the number of states in the automaton.  As a result, trying to convert an automaton of, say, 100 states to a regular expression will not be possible (unless the graph of the automaton happens to be very sparse) - it will, quite simply, take a very, very long time.  This has not been a problem for academically-sized, and somewhat larger, examples the author has come across.  But, again, examples that are quite large may prove to be problematic.  It is not possible to characterize here in a general sense what a “large” regular language representation is - there are just so many variables that come into play in determining whether a given regular language representation will be practical to work with (including very non-quantifiable things such as a user’s patience level).  The best recommendation that can be made in this regard is to experiment and just try it if you want to work with large regular language representations.

 

Load a regular language

Invoke this option to load a regular language from disk.  RLM recognizes the following file extensions:

 

·         .nfa

·         .exp

·         .grm

 

These represent finite automata, regular expressions and regular grammars, respectively.

 

When prompted, enter the path to and name of the file to be loaded (the path is optional if the file to be loaded is in the same directory as the RLM executable).

 

As noted previously, memory is cleared before loading a regular language to ensure that non-equivalent regular language representations are not in memory simultaneously.

 

Generate a random NFA-Lambda

Invoke this option to randomly generate an NFA-Lambda.  You will be prompted for nine parameters which will be used in generating the automaton.  These parameters are:

 

1.        Minimum number of states.  This value must be in the range 1 - 1000, inclusive.

 

2.        Maximum number of states.  This must be greater than or equal to the previous value and no greater than 1000.

 

3.        Minimum number of alphabet letters.  This value must be in the range 1 - 26, inclusive.

 

4.        Maximum number of alphabet letters.  This must be greater than or equal to the previous value and no greater than 26.

 

5.        Minimum percentage of states to be accepting states

 

6.        Maximum percentage of states to be accepting states.  This must be greater than or equal to the previous value.

 

7.        Percentage probability of existence of the first outbound transition for a given state and letter.  If you wish to ensure a fully specified automaton, set this value to 100.

 

8.        Percentage probability of existence per subsequent outbound transition for a given state and letter.  If you wish to ensure a deterministic automaton, set both this and the following value to 0.

 

9.        Percentage probability of a Lambda transition from any given state to any other given state.  If you wish to ensure a deterministic automaton, set both this and the previous value to 0.

 

RLM will randomly generate an automaton with the following characteristics:

 

The number of states will be somewhere between the values supplied for parameters 1 and 2, inclusive.

 

The number of alphabet letters will be somewhere between the values supplied for parameters 3 and 4, inclusive.  Note that the algorithm which generates a random automaton uses lowercase letters from the English alphabet as members of the automaton’s alphabet.  As a result, no more than 26 alphabet letters are possible in a randomly generated automaton.

 

The percentage of states that are accepting states will be somewhere between the values supplied for parameters 5 and 6, inclusive.  In determining this range, the minimum number of accepting states will be rounded up (if it is not integral).  For example, if there are 9 states and the user has specified that a minimum of 20% of these states will be accepting states, this means that there will be no fewer than 1.8 accepting states.  Since it is not possible to have a non-integral number of accepting states, RLM will use a value of 2, as opposed to 1, as the minimum possible value when randomly choosing the number of accepting states.  Similarly, the maximum number of accepting states will be rounded down (if it is not integral).  For example, if there are 9 states and the user has specified that a maximum of 40% of these states will be accepting states, this means that there will be no more than 3.6 accepting states.  Since it is not possible to have a non-integral number of accepting states, RLM will use a value of 3, as opposed to 4, as the maximum possible value when randomly choosing the number of accepting states.

 

RLM will loop over every state / letter pair.  For each pair, a decision will be made as to whether that state will have an outbound transition on that latter.  This decision is made in a probabilistic manner based on the value supplied for parameter 7.  If the transition is added, the target will be randomly chosen (note that it is entirely possible for the target to be the same state).

 

For each transition that is added as described in the previous paragraph, RLM will decide in a probabilistic manner, based on the value of parameter 8, if a second outbound transition will be added for the same state / letter pair.  If the transition is to be added, its target will again be randomly selected but will be different than the first target.  This step is repeated until either the probabilistic condition for adding a transition is not met or an outbound transition has been added on that state / letter pair to all states.

 

Finally, RLM will loop over every permutation of two states (note that it’s looping over permutations, not combinations; this means order matters.  i.e. {q0, q1} is distinct from {q1, q0}).  RLM will decide in a probabilistic manner, based on the value of parameter 9, whether a Lambda transition will be added from the first state to the second state.

 

Complete the specification of an incomplete automaton

An automaton is said to be fully specified when there is a transition leaving every state for every letter in the alphabet.  An automaton that is not fully specified can be thought of as having an implied “error” state as the target of all missing transitions.  This implied state is non-accepting and also transitions to itself on every alphabet letter.  Adding this state to an automaton does not change the language accepted by the automaton but will ensure that any input string can be completely processed since a transition now exists for every state / letter pair that may be encountered.

 

Invoke this option to complete the specification of an incompletely specified automaton.  This is accomplished as described in the previous paragraph.  The new state will be named “ERROR”.

 

Figure 1 depicts an automaton which is not fully specified.  Figure 2 depicts an equivalent automaton which is fully specified (and whose state names have been simplified – this is discussed later in this document).

 

 

 

 

 

 

 

 

 

 


 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 


Remove unreachable states from the automaton

It is possible for an automaton to have unreachable states.  In fact, this can happen quite frequently when generating a random automaton (depending on how you specify the generation parameters).

 

Invoke this option to remove any unreachable states.  Clearly, doing this will not change the language accepted by an automaton, even if any of the removed states are accepting states, since the removed states could not affect a computation.

 

Figure 3 depicts an automaton which has an unreachable state (q2).

 

 

 

 

 

 

 

 

 

 

 

 


Remove non-determinism from an NFA-Lambda

An automaton is said to be non-deterministic when one or both of the following conditions are true:

 

·         For any state / letter pair, there are two or more outbound transitions on that state / letter pair.

 

·         The automaton has one or more Lambda transitions.  A Lambda transition is a transition between two given states that happens with no input at all.  i.e. transitioning is not contingent upon having read in a certain character from the input string.

 

Conversely, an automaton is said to be deterministic if neither of these conditions are true.

 

Invoke this option to convert a non-deterministic automaton to an equivalent deterministic automaton.  This will not change the language accepted by the automaton.  The algorithm RLM uses to accomplish this is taken from [1].

 

The term “non-deterministic” is used because a point may be reached during the processing of an input string where a “choice” exists on which state to transition to next.  As an example, consider the automaton in Figure 4.  If during a computation the automaton finds itself in state q1 and a ‘b’ is read in from the input string, there are two possibilities: transition to state q1 or transition to state q3.  The presence of multiple possible transitions here makes the automaton non-deterministic.

 

A brief discussion is in order regarding why the presence of Lambda transitions causes an automaton to be classified as non-deterministic.  In practice, a state which has a Lambda transition leaving it will almost always have at least one other transition leaving it, whether it be another Lambda transition or a transition on a letter.  As such, there is a “choice” to be made, regarding the next transition, once this state has been entered.  There is the somewhat degenerate case where an automaton has Lambda transitions but is not non-deterministic in the strictest sense of the term.  This situation exists when, for each Lambda transition, that Lambda transition is the one and only transition leaving the state from which it originates.  In this case, there is no “choice” of the next transition to be made when in this state.  This situation was referred to as “degenerate” since such states are simply “pass throughs” and could easily be removed.

 

To simplify terminology, an automaton that meets the conditions described in the previous paragraph is still classified as non-deterministic even though, in the strictest sense, it is deterministic.  Making this exception allows all automata with Lambda transitions to be classified as non-deterministic.

 

Non-deterministic automata are capable of representing exactly the regular languages.  Since this is also true of deterministic automata, it must also be true that for any non-deterministic automaton, there must be an equivalent deterministic automaton.  Computer Science has shown that this is indeed the case.

 

A note is in order regarding the state names of the automata resulting from this operation.  The algorithm for removing non-determinism can be thought of as exploring all possible paths through an automaton for input strings simultaneously.  As such, it becomes necessary to generate states in the new automaton which represent, for example, being in either state q1 or q2 in the old automaton (as you might imagine, this sometimes results in a combinatorial explosion of states in the new automaton).  States in the new automaton are named to indicate which combination of states in the old automaton they represent.  So, for example, you may see a state named {q1, q2, q5}.  This is just a normal state with a slightly complex name attached to it.  What the name is saying is that, if you wish, you can conceptually think of this state as representing a point in the computation where, in the old automaton, you could have been in either state q1, q2 or q5.  This method of naming states while removing non-determinism from an automaton conveys useful information and yields insight into the underlying algorithm.  This is why it is done.  Nevertheless, this feature can also result in state names that are cumbersome and difficult to read.  A feature is discussed later in this document which may be used to simplify state names that become too long or unwieldy.

 

Figure 4 depicts a non-deterministic automaton (the transitions which make it non-deterministic are bolded).  Figure 5 depicts an equivalent deterministic automaton (whose state names have been simplified – this is discussed later in this document).  As a regular expression, the language accepted by these automata is:

 

(a U b)*bb(a U b)* U (ab U b)* U b*a(b+a)*

 

Note that Figure 4 depicts Lambda transitions by the Greek letter of the same name (l).  This is in keeping with the notation used in [1].  As a side note, this is just one of the notational conventions the literature of this subject uses to represent the concept of a “free” transition that does not consume an input symbol.


 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 


Minimize a DFA

An automaton is said to be minimal if there is no other automaton with fewer states which represents exactly the same language.  As a side note, a given regular language has exactly one minimal automaton that represents it.  In other words, if, say, five states is the smallest number of states that can be used to represent a given language, it is not possible to have two or more different five-state automata that represent that language.  This is a result of Computer Science and it facilitates the comparison of representations of regular languages for equality.  This will be discussed later in this document.

 

Invoke this option to reduce an automaton to its minimal number of states.  This will not change the language accepted by the automaton.  The algorithm RLM uses to accomplish this is taken from [1].

 

It is important to note that the minimization algorithm only works properly when applied to an automaton that is both deterministic and fully specified.  As such, this option may be invoked only if these conditions are met.

 

The algorithms that remove non-determinism from, complete the specification of, minimize and simplify the state names of an automaton (along with the algorithms to convert regular expressions and regular grammars to automata) are meant to be used together to produce a standardized representation of a regular language.  These steps will transform all possible representations of a given regular language to exactly the same form (this is elaborated on in “Comparing Regular Languages”).  Since the minimization step comes after removing non-determinism and completing the specification of the automaton, automata that are non-deterministic, not fully specified or both are considered, by definition, to be non-minimal.

 

As with the algorithm to remove non-determinism, this algorithm results in elongated state names.  The minimization algorithm works by detecting and combining equivalent states.  The state resulting from combining, say, states q2 and q4 would be named {q2, q4}.  Again, this is just a normal state with a slightly complex name attached to it.  This method of naming states while minimizing an automaton conveys useful information and yields insight into the underlying algorithm.  This is why it is done.  Nevertheless, this feature can also result in state names that are cumbersome and difficult to read.  A feature is discussed later in this document which may be used to simplify state names that become too long or unwieldy.

 

Figure 6 depicts an automaton which is not minimal (states q1 and q2 are equivalent since they are both non-accepting, both lead to q3 on ‘a’ and both lead to q4 on ‘b’).  Figure 7 depicts an equivalent minimal automaton (whose state names have been simplified – this is discussed later in this document).  State q1 in the new automaton (Figure 7) represents states q1 and q2 in the old automaton (Figure 6) having been combined.  As a regular expression, the language accepted by these automata is:

 

(a U b)a


 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 


Reverse the language accepted by the automaton

The reversal of a language L is defined to be a language which contains the reversal of every string in L.  For example, ab becomes ba, b*dc becomes cdb*, etc…

 

Invoke this option to reverse the language accepted by an automaton.

 

Note: This operation adds a new state, which serves as the start state, named “NEW_START_STATE”.  An automaton with a state by this name cannot be reversed until state names are simplified (this will be discussed in detail later in this document, but it will suffice for our purposes here to note that doing this will cause the state to be renamed).  Otherwise, duplicate state names would be introduced; this is not allowed.  A direct consequence of this is that if you wish to reverse an automaton twice (resulting in an automaton equivalent to the one with which you started), you will need to simplify state names in between the reversals.

 

Figure 8 depicts a simple automaton which accepts only the string “ab”.  Figure 9 depicts the reversal of this automaton (with state names simplified – this is discussed later in this document).

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 


Complement the language accepted by a DFA

The complement of a language is defined, in terms of set theory, to be:

 

S* - L

 

Here, S represents the alphabet, S* represents all possible strings over that alphabet and L represents the set of all strings accepted by the automaton.  So, in words, the complement of a language L is all strings over the alphabet which are not in L.

 

Invoke this option to complement the language accepted by an automaton.  The algorithm RLM uses to accomplish this is taken from [1].

 

As a side note, this algorithm works by simply switching the accepting and non-accepting states.  This procedure works only when the automaton it operates on is deterministic and fully specified.  As such, this option may be invoked only if these conditions are met.

 

There is a ramification of the switching of accepting / non-accepting states that needs to be noted.  If an automaton with an error state (a state named “ERROR”) is complemented, this will result in the state named “ERROR” being an accepting state.  This is normally not permitted.  Specifically, an automaton definition file that defines an automaton with such a state will not load.  As such, an automaton with such a state may not be saved either.  It will be necessary to simplify the state names of the automaton (this will cause the offending state to be renamed) before it may be saved.

 

Figure 10 depicts the complement of the automaton in Figure 2.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 


Simplify the state names of the automaton

As described earlier in this document, the algorithm to remove non-determinism from an automaton and the algorithm to minimize an automaton result in elongated state names.  Though this is done for a reason, it can cause state names to become long and unwieldy.  In addition, completing the specification of an automaton adds a state named “ERROR” and reversing the language accepted by an automaton adds a state named “NEW_START_STATE”.  The presence of states by these names, though added for a reason, can cause problems during subsequent operations (as described earlier in this document).  The solution to these problems is to simplify the state names of the automaton.

 

Invoke this option to rename the states of an automaton in a standardized manner.  The following conventions are used:

 

The states are named q0 – qn where n + 1 is the total number of states in the automaton.  q0 is always the start state.  In addition, state names are assigned such that, if the automaton is deterministic, the same automaton always results regardless of which isomorph of the original automaton was started with.  That is, two automata that are identical except for state names will be changed into exactly the same automaton.  This will facilitate the comparison of representations of regular languages for equality.  This is discussed later in this document.

 

It is important to note, as stated above, that an automaton MUST be deterministic in order for the result of this name-simplification operation to be consistent, predictable, repeatable and well-defined.  In other words, two non-deterministic automata that are identical except for state names will NOT necessarily reduce to the exact same automaton.  Fortunately, this is not an issue since all non-deterministic automata can be converted to equivalent deterministic automata.

 

Simplify the regular expression

Invoke this option to simplify a regular expression.

 

RLM is capable of making 12 separate simplifications to a regular expression.  Note that this feature is NOT intended to reduce a regular expression to its “minimal” form or to convert it to some standardized form.  Rather, the simplifications are all rather elementary in nature and are intended to make the expression “clean” and to remove nonsensical (but legal) constructs the user may introduce.  The following simplifications are performed:

 

1.        Replace Lambda* / Lambda+ with Lambda

 

2.        Remove Lambda when concatenated with anything else

 

3.        Remove Lambda from a union if that union contains any subexpressions with * applied to them

 

4.        Combine a union of unions into a single larger union

 

5.        Order the subexpressions of a union in lexicographic order

 

6.        Make the following substitutions:

·         a*a* à a*

·         a*a+ / a+a* à a+

·         a+a+à aa+

 

7.        Remove duplicate entries in a union

 

8.        Make the following substitutions:

·         (a*)* / (a+)* / (a*)+à a*

·         (a+)+à a+

 

9.        Replace aa* / a*a with a+

 

10.     Combine a concatenation of concatenations into a single larger concatenation

 

11.     Replace a+a with aa+

 

12.     Remove excess parentheses

 

In the above list of simplifications, ‘a’ represents any arbitrarily complex subexpression.  It does not have to be a single character (though it certainly may be).

 

The above simplifications are applied repeatedly as a group until an expression is generated which has already been generated or which is equivalent to the original expression.  This is necessary since the application of one simplification may result in an expression to which another simplification could be applied.  Note that we don’t simply keep going until the same expression is generated twice in a row.  Instead, as stated above, we stop when an expression is generated which has already been generated or which is equal to the original expression.  In this way, if we get into a situation where we’re “oscillating” between several expressions, we won’t loop forever.  Basically, this can be thought of as cycle detection.

 

Simplification 5 is necessary for more than aesthetic reasons.  Without doing this, the comparison to previously generated expressions described in the previous paragraph would not be effective.  For example, “a U b” and “b U a” would not be seen as equivalent.  With this simplification, “b U a” gets converted to “a U b” which, obviously, is equivalent to the first subexpression.

 

Simplification 11 may appear odd at first glance, but there is a very good reason for it.  Multiple applications of this simplification, along with simplifications 6 and 9, allow expressions like the following to be simplified:

 

aa*aaa+a+a*a*aaa*a*a+


 

This expression simplifies to:

 

aaaaaaaa+

 

Without simplification 11, the a+ subexpressions that appear would not be able to be brought adjacent to each other.  Once these subexpressions are adjacent to each other, simplification 6 can be applied.

 

Convert an NFA-Lambda to a regular expression

Invoke this option to create a regular expression in memory which represents the same regular language as the automaton currently in memory.  The algorithm RLM uses to accomplish this is taken from [1], but a slight optimization has been made.

 

This operation may only be performed if the alphabet of the automaton consists only of single lowercase characters.

 

The condition above is necessary since the alphabet of the automaton maps directly to the letters used in composing the regular expression.  Regular expressions may use only lowercase characters since ‘U’ is used to represent union.  It’s much less confusing and much more straightforward to say that only lowercase characters may be used in regular expressions instead of saying that lowercase and all uppercase characters except for ‘U’ may be used.  This is the reason for this restriction.

 

A given regular language can often be represented by many different regular expressions.  Therefore, you may notice that if you convert an automaton to a regular language, perform manipulations on that automaton (such as removing non-determinism or minimizing) and then convert it to a regular expression again, the two regular expressions will likely appear very different even though they are, in fact, equivalent.  The automata in Figures 4 and 5, when converted to a regular expression, provide a dramatic example of this.  The automata in Figures 6 and 7, when converted to a regular expression, provide a less dramatic (but much clearer) example of this.

 

Convert the regular expression to an NFA-Lambda

Invoke this option to create an automaton in memory which represents the same regular language as the regular expression currently in memory.  The algorithm RLM uses to accomplish this is taken from [1].  Before performing the conversion, RLM makes a working copy of the regular expression (this is done so that the regular expression the user knows to be in memory remains unchanged) and applies the simplifications described earlier in this document to this working copy.  It is this regular expression which is then converted to an automaton.  This point is noted for the reader who is familiar with the procedure described in [1] for performing this conversion.  The resulting automaton may not be what you expect since it will not have been generated from the regular expression you know to be in memory.  Nevertheless, the generated automaton will be equivalent to that regular expression.

 

Convert an NFA to a regular grammar

Invoke this option to create a regular grammar in memory which represents the same regular language as the automaton currently in memory.  The algorithm RLM uses to accomplish this is taken from [1].

 

This operation may only be performed if the following conditions are all met:

 

·         The alphabet of the automaton consists only of single lowercase characters.  This is necessary since the alphabet of the automaton maps directly to the terminals of the regular grammar.  A regular grammar’s terminals may only be single lowercase characters, hence this restriction.

 

·         The automaton may not have any Lambda transitions

 

·         The automaton may not have more than 26 states.  This is necessary since each state in the automaton results in the creation of one variable in the grammar.  Since variables are always represented by single uppercase letters and since only 26 such letters exist, an automaton may have no more than 26 states if it is to be converted to a regular grammar.

 

Convert the regular grammar to an NFA

Invoke this option to create an automaton in memory which represents the same regular language as the regular grammar currently in memory.  The algorithm RLM uses to accomplish this is taken from [1].

 

Display the automaton

Invoke this option to display the automaton on the screen.  In addition to the automaton itself, the input transition table for the automaton will be displayed as well as a list of unreachable states (if any).  The input transition table is similar to the transition table except that it has all Lambda transitions removed.  This is accomplished by showing all states that can be reached from a given state with some combination of Lambda transitions followed by an alphabet letter followed by another combination of Lambda transitions.  As a simple example, consider the automaton that has states q0, q1 and q2 with a Lambda transition from q0 to q1 and a transition on ‘a’ from q1 to q2.  The transition table would be:

 

(q0, Lambda) à {q1}

(q1, a) à {q2}

 

whereas the input transition table would be:

 

(q0, a) à {q2}

(q1, a) à {q2}

 

For an automaton with no Lambda transitions, the transition table and the input transition table are identical.  Also, states with nothing but sequences of 0 or more Lambda transitions to be followed from them (i.e. it’s not possible, via Lambda transitions, to reach a state that has a non-Lambda transition leaving it) will not appear in the input transition table.  This is analogous to a state with no transitions out of it at all not appearing in the transition table.

 

Input transition tables are used during the removal of non-determinism.

 

Display the characteristics of the automaton

Invoke this option to display a select set of characteristics about an automaton.

 

The following characteristics are shown:

 

·         Whether or not the automaton is deterministic

 

·         Whether or not the automaton is fully specified

 

·         Whether or not the automaton has any Lambda transitions

 

·         Whether or not the automaton is minimal

 

·         Whether or not the automaton has any unreachable states (i.e. does it form a strongly connected graph with respect to the start state?).  If the automaton has unreachable states, they will be listed when the automaton is displayed.  Additionally, unreachable states are included, in comment form, in the .nfa file generated when RLM is used to save an automaton to disk.

 

·         Whether or not the language accepted by the automaton is finite (i.e. are there any cycles between the start state and any of the accepting states?)

 

·         Whether or not the automaton is convertible to a regular expression (an automaton may be converted to a regular expression as long as the condition in “Convert an NFA-Lambda to a regular expression” is met)

 

·         Whether or not the automaton is convertible to a regular grammar (an automaton may be converted to a regular grammar as long as the conditions in “Convert an NFA to a regular grammar” are met)

 

Display the regular expression

Invoke this option to display the regular expression on the screen.  The expression is broken up into multiple lines, if necessary, so that no single line is greater than 80 characters in length.

 

Display the parse tree of the regular expression

Invoke this option to display a textual representation of a parse tree for the regular expression.

 

Any time a regular expression is brought into memory, a parse tree is generated for that expression.  This is done to facilitate simplifying a regular expression and converting it to an automaton.  To perform these operations by directly manipulating the regular expression text itself would be exceedingly difficult.  Since this data structure is in memory, a feature has been added to RLM to make it visually accessible to the user.  Viewing the parse tree of a complex regular expression can greatly aid the user in comprehending its structure.

 

The parse tree for the regular expression Lambda U (ab)* U b+ is shown here:

 

U

                Lambda

                *

                                .

                                                a

                                                b

                +

                                b

 

The symbols used here are all self-evident except, perhaps, ‘.’.  This symbol denotes concatenation.

 

The user can quickly tell that, at a high level, this regular expression consists of three subexpressions unioned together.  The user may then look closer at any of the subexpressions if desired.

 

Display the regular grammar

Invoke this option to display the regular grammar on the screen.  A note needs to be made here about how multiple productions for a single variable are displayed.  Take the following as a case of a variable having three productions:

 

S aS

S aA

S a

 

This is how these productions would appear in the regular grammar definition file (though the use of white space may differ).  When displayed on the screen, these productions will be consolidated into a single line by separating the source from the targets with “à” and by separating the targets from each other with the ‘|’ character as follows:

 

S à aS | aA | a

 

This is the standard manner in which productions are displayed textually.  However, it is important to note that this notational convenience may NOT be used in a regular grammar definition file.  Instead, all productions must be specified individually as shown earlier.

 

Save the automaton to disk

Invoke this option to save an automaton to disk.  You will be prompted for a file name (to which you may optionally prepend a path) which must have a .nfa extension.  In addition to the automaton itself, the input transition table for the automaton will be saved as well as a list of unreachable states (if any).  These will be included in comment form in the generated .nfa file.  Input transition tables are discussed briefly in the section “Display the Automaton”.

 

Save the regular expression to disk

Invoke this option to save a regular expression to disk.  You will be prompted for a file name (to which you may optionally prepend a path) which must have a .exp extension.

 

The expression is broken up into multiple lines, if necessary, so that no single line is greater than 80 characters in length.

 

Save the regular grammar to disk

Invoke this option to save a regular grammar to disk.  You will be prompted for a file name (to which you may optionally prepend a path) which must have a .grm extension.

 

Quit

Invoke this option to quit RLM and return to the command line.


A Special Case: Converting Between Regular Expressions and Regular Grammars

 

RLM does not provide a way to convert directly between regular expressions and regular grammars.  To accomplish this, you must first convert to an automaton as an intermediate step.  The following items should be noted as they may impact your ability to accomplish this:

 

·         It is always possible to convert from a regular expression or a regular grammar to an automaton.

 

·         An automaton may be converted to a regular expression as long as the condition in “Convert an NFA-Lambda to a regular expression” is met.

 

·         An automaton may be converted to a regular grammar as long as the conditions in “Convert an NFA to a regular grammar” are met.

 

These items have the following implications for converting between regular expressions and regular grammars:

 

·         When converting from a regular expression to a regular grammar, the intermediate automaton must be able to be minimized to 26 or fewer states.

 

·         Note that converting a regular expression to an automaton results in an automaton with many Lambda transitions.  As such, this automaton cannot be converted directly to a regular grammar.  It must first have non-determinism removed to get rid of the Lambda transitions.

 

·         A regular grammar may always be converted to a regular expression since a regular grammar’s alphabet is subject to the same restrictions as a regular expression’s alphabet.


Comparing Regular Languages

 

There is an easy way to compare two regular language representations for equivalence using RLM.  To do this, perform the steps shown below for both languages.  Note that some steps may not actually execute.  For example, if an automaton is deterministic, step 2 will have no effect.  This is alright – just go ahead and perform all steps as shown.

 

1.        If the regular language is represented as a regular grammar or a regular expression, convert it to an automaton.  If the regular language is already represented as an automaton, skip to the next step.

 

2.        Remove non-determinism from the automaton

 

3.        Complete the specification of the automaton (this will not be necessary if the previous step actually ran – removing non-determinism always results in a fully specified automaton; otherwise, it MAY be necessary)

 

4.        Minimize the automaton

 

5.        Simplify the state names of the automaton.  This will ensure that states are named in a consistent, predictable, repeatable and well-defined manner.

 

6.        Save the automaton to disk using some unique name

 

Once both automata have been saved to disk, do a text compare of the two files.  In DOS, this is done with “fc”.  In UNIX / Linux, this is done with “diff”.  For example, enter the following if at a DOS prompt:

 

fc first.nfa second.nfa

 

or the following if at a UNIX / Linux prompt:

 

diff first.nfa second.nfa

 

(Here, “first.nfa” and “second.nfa” are the file names chosen in step 6 above.)

 

If the two regular languages are equivalent, the only difference between the two files will be the comment on the very first line indicating the file name.


File Format Descriptions

 

Automata, regular expressions and regular grammars are specified by the user in a text file.  Specifics of each of these three file types will be discussed below.  But first, traits common to all three file types will be discussed.

 

White space may be used in any manner.  This allows formatting in any way you wish for maximum readability.  White space is defined as any combination of space, tab, carriage return and line feed.

 

Comments may be inserted anywhere in the file.  A comment is begun with /* and ended with */.  Note that these comment delimiters MUST be separated from any surrounding text by white space.  If this is not done, comments will not be interpreted as such and the file will not load.  For example, the following will not parse properly:

 

/*This is a comment.*/

 

However, this will parse properly:

 

 /* This is a comment. */

 

Finally, definition files for the example automaton, regular expression and regular grammar that are shown in the following three sections are included with your RLM distribution.  The associated file names are indicated in comment form in the examples below.

 

Finite Automata

A text file containing an automaton definition must have a .nfa extension.  An example of an automaton definition is shown below.  This is the definition of the automaton shown in Figure 4.

 

/* figure_4.nfa */

 

STATES:

   q0

   q1

   q2

   q3

   q4

   q5

 

ALPHABET:

   a

   b

 

TRANSITION_TABLE:

   q0 Lambda q1

   q0 Lambda q2

   q1 a q1

   q1 b q1

   q1 b q3

   q2 a q5

   q2 b q2

   q3 b q4

   q4 a q4

   q4 b q4

   q5 b q2

 

START_STATE:

   q0

 

ACCEPTING_STATES:

   q2

   q4

   q5

 

The following is a complete list of the checks made by RLM while validating an automaton definition file.  Each of these conditions must be satisfied for the file to load.

 

1.        No states may be repeated

 

2.        No alphabet letters may be repeated

 

3.        No transitions may be repeated

 

4.        No accepting states may be repeated

 

5.        No state may have a Lambda transition to itself

 

6.        A transition may not reference non-existent states

 

7.        A transition may not reference non-existent alphabet letters

 

8.        The start state must be an existing state

 

9.        The accepting states must be existing states

 

10.     An alphabet letter named “Lambda” may not exist.  This is predefined to represent the null string and is automatically available for use.  Note that a capitalization-based variation of “Lambda” may not exist either.  Although alphabet letters are case-sensitive (as are state names) and this is therefore possible, this is not allowed in order to prevent confusion between, for example, a user-defined alphabet letter named “lambda” and the null string “Lambda”.

 

11.     If a state named “ERROR” exists, it must have transitions only to itself.  This is necessary since completing the specification of an automaton requires the presence of a state named “ERROR” which meets this criteria.  If this rule were not enforced, it would not be possible to complete the specification of an automaton which has a state named “ERROR” with transitions to states other than itself.

 

12.     If a state named “ERROR” exists, it must not be an accepting state.  This is necessary since completing the specification of an automaton requires the presence of a state named “ERROR” which meets this criteria.  If this rule were not enforced, it would not be possible to complete the specification of an automaton which has an accepting state named “ERROR”.

 

13.     No state name that is a capitalization-based variation of “ERROR” may exist.  Although state names are case-sensitive (as are alphabet letters) and this is therefore possible, this is not allowed in order to prevent confusion between, for example, a user defined state named “error” and an RLM-added state named “ERROR” (as the result of completing the specification of an automaton).

 

14.     The five keywords shown in the sample file above  (STATES:, ALPHABET:, TRANSITION_TABLE:, START_STATE: and ACCEPTING_STATES:) must all appear exactly once and they must appear in the order shown.  Additionally, each of these keywords MUST be separated from any surrounding text by white space.  For example, the following will not parse properly:

 

START_STATE:q0

 

However, this will parse properly:

 

START_STATE: q0

 

15.     All strings defining state names and alphabet letters must contain only alphanumeric characters.  Additionally, though it is not shown in the example above, it is possible for alphabet “letters” to be many characters long.  This is not recommended, however, since this will preclude converting the automaton to a regular expression or a regular grammar.

 

16.     There must be at least one state defined

 

17.     There must be exactly one start state

 

18.     The number of strings following the “TRANSITION_TABLE:” keyword must be a multiple of 3.  The first string in each group must refer to an existing state and represents the state the transition is leaving.  The second string in each group must refer to an existing alphabet letter and represents the letter the transition occurs on.  The third string in each group must refer to an existing state and represents the state the transition enters.

 

Regular Expressions

A text file containing a regular expression definition must have a .exp extension.  A simple example of a regular expression definition is as follows:

 

/* sample.exp */

 

Lambda U a U ab U (ba)* U (aba)+ U a*

 

The following is a complete list of the checks made by RLM while validating a regular expression definition file.  Each of these conditions must be satisfied for the file to load.

 

1.        The text of the regular expression must be composed of only white space, lowercase characters, “Lambda” or one of the following special characters: ( ) * + U

 

2.        Parenthesis structures must be properly formed and balanced.  The following are examples of erroneous use of parentheses:

 

·         ()                             Empty

·         (ab(cd)*(+             Improper structure

·         ((ab)*                     Unbalanced

 

3.        The * and + operators may only be applied to a single character, “Lambda” or a parenthized expression.  A consequence of this is that a** is not a legal expression.  This should be expressed as (a*)* instead (though this is somewhat nonsensical).

 

4.        The union operator (U) may only be preceded by the following: a – z, “Lambda”, *, + and )

 

5.        The union operator (U) may only be followed by the following: a – z, “Lambda” and (

 

Regular Grammars

A text file containing a regular grammar definition must have a .grm extension.  An example of a regular grammar definition is shown below.  This grammar is equivalent to the automaton shown in Figure 5.

 

/* sample.grm */

 

TERMINALS:

   a

   b

 

VARIABLES:

   S

   A

   B

   C

   D

   E

   F

   G

   H

 

START_SYMBOL:

   S

 

PRODUCTIONS:

   S Lambda

   S aA

   S bB

   A Lambda

   A aC

   A bB

   B Lambda

   B aA

   B bG

   C aC

   C bD

   D aC

   D bE

   E Lambda

   E aF

   E bE

   F Lambda

   F aF

   F bE

   G Lambda

   G aH

   G bG

   H Lambda

   H aF

   H bG

 

The following is a complete list of the checks made by RLM while validating a regular grammar definition file.  Each of these conditions must be satisfied for the file to load.

 

1.        No variables may be repeated

 

2.        No terminals may be repeated

 

3.        No productions may be repeated

 

4.        All terminals must be single lowercase characters

 

5.        All variables must be single uppercase characters

 

6.        The start symbol refers to an existing variable

 

7.        The four keywords shown in the sample file above  (TERMINALS:, VARIABLES:, START_SYMBOL: and PRODUCTIONS:) must all appear exactly once and they must appear in the order shown.  Additionally, each of these keywords MUST be separated from any surrounding text by white space.  For example, the following will not parse properly:

 

START_SYMBOL:S

 

However, this will parse properly:

 

START_SYMBOL: S

 

8.        There must be at least one variable defined

 

9.        There must be exactly one start symbol

 

10.     All productions must be of one of three forms:

 

·         <variable>             <terminal><variable>

·         <variable>             <terminal>

·         <variable>             Lambda

 

Note that a production of the second form does not appear in the example above.  An example production for this case would be:

 

A a

 

In the notationally convenient form described earlier in this document, this would be expressed as:

 

A à a

 

It bears repeating, however, that this notation may NOT be used in a regular grammar definition file.

 

11.     Productions may reference only existing terminals and variables


Bibliography

 

[1]           Sudkamp, T. [1997], Languages and Machines, Second Edition, Addison Wesley, 0-201-82136-2

 

[2]           HopCroft, J., Motwani, R., Ullman, J. [2001], Introduction to Automata Theory, Languages, and

Computation, Second Edition, Addison Wesley, 0-201-44124-1