@q file: errhdling.w @>
@q% Copyright Dave Bone 1998 - 2015@>
@q% /*@>
@q% This Source Code Form is subject to the terms of the Mozilla Public@>
@q% License, v. 2.0. If a copy of the MPL was not distributed with this@>
@q% file, You can obtain one at http://mozilla.org/MPL/2.0/.@>
@q% */@>
@** Error detection and handling.\fbreak
Let's review how this can be done.
Within a grammar's production there are points where an invalid symbol could arrive.
If one does not program for it, the parser will go kaput.
So what are the options open to a grammar writer?
First, there is a ``{\bf failed}'' directive in the ``fsm'' construct that will field aborted parses.
It is the last chance to deal with errors, in a rather insensitive way.
If there are many contexts within the grammar that could go wrong, then this approach is too coarse to be specific about the context's error point.
Though the errant current token is available to report on, what was the inappropriate context that threw it?
Well, you could try to figure it out from the remnants on the parse stack.
To deal with specific error points, the \QUEshift{}, \ALLshift{}, and \INVshift{} symbols can catch errant tokens, or one can be very specific in specifying the errant T to catch.
This last option can be very daunting when one has 500+ T to deal with and, let's be honest, is not really appropriate.
This was why I introduced the meta-terminals \QUEshift{} and \ALLshift{}.
To catch a rogue and associate syntax directed code to handle the situation, these symbols MUST be within prefix subrules where they are the last symbol in the subrule's symbol string.
What does this mean?
When these catch symbols are buried within a larger symbol string, the subrule containing them will not be executed, as its sentence has not been completely recognized.
For example:\fbreak
\INDENT{.4in}{\subrule{} a \QUEshift{} b --- will not handle the error at the \QUEshift{} point}
\INDENT{.4in}{\subrule{} a Rqueshift b --- will catch the problem}
\INDENT{.7in}{Rule Rqueshift \subrule{} \QUEshift{}...}
\INDENT{.9in}{will catch the error with the appropriate syntax directed code directive}
\fbreak
Caution: The ranking of meta-terminal shifts: 1 and a 2 and a 3 ---\QUEshift{}, \INVshift{}, \ALLshift{}\fbreak
The \QUEshift{} symbol is checked first for its presence within the current parse state, followed by the \INVshift{} symbol, as it is normally used to get out of a quasi-ambiguous parse.
The \ALLshift{} symbol, aka the wild shifter, is the last to be checked in the parse state.
It is their presence within the parse state that activates their use.
The \QUEshift{} is an error statement, which was my reason to put it at the head of the conditional shifts.
So watch your shifts, as this could catch you like it caught me: when both \ALLshift{} and \INVshift{} are present in the same state, \INVshift{} takes precedence, so remove 1 of the 2 competing shift symbols.
For the moment I have not issued an error message on this situation.
@^ To do error message conditional shift ranking both \ALLshift{} and \INVshift{} in state@>
@^ To do error message conditional shift \INVshift{} takes precedence over \ALLshift{}@>
\fbreak
\fbreak
Dictate no 1: Last symbol in subrule's symbol string must be the catcher in the Error\fbreak
Make sure your error catch point has \QUEshift{} or \ALLshift{} as the last symbol within its symbol string, and let your syntax directed code decree the error escape route to be taken.
Yeah, that's fine, but what if the symbol string to be recognized contains many catch points?
Just make each symbol string segment a separate rule, with the error catch point being the last symbol in the string competing with its legitimate accepted T symbols, and use these rules within another rule's subrule as part of its symbol string to be recognized!
The lr algorithm builds each state as a collection of symbol string configurations at their various accepted T points along the parse.
So by transitive closure these prefix rules get included in the state to be recognized along with the other similar prefix symbols.
When the prefix rule's ``rhs'' boundary is recognized, depending on the error catcher used, the reduce will fire either in good form or as an error.
\fbreak
\fbreak
What to do when an error is detected?\fbreak
For now I have not thought out error correction strategies, though I am marginally aware of the backtracking techniques.
I will now discuss the current programming options open to the grammar writer.
Depending on the context, the thread could abort, which is the most drastic.
This takes place when no error catching is programmed, and \O2 issues a runtime message on the aborted grammar with its run stack goodies.
This might be okay to get things going but isn't appropriate within a production environment.
The catch points have 2 programming options available:\fbreak
\INDENT{.4in}{1) return an error token back to the calling grammar and stop parsing of the active grammar}
\INDENT{.4in}{2) abort the parse and field it using the ``failed'' directive to return an error T}
Point 1 should be your main course of action.
Both macros |RSVP| and |RSVP_FSM| return a T back to the calling grammar through the accept queue facility as if the parse was successful.
Point 2 does the same using the |RSVP_FSM| macro, as its execution is within the ``fsm'' context of the grammar and not the reducing rule.
The calling grammar can then field this returned T specifically or use the two meta-terminals \QUEshift{} or \ALLshift{} to deal with it.
They are allowed in any subrule symbol string context: thread calls, where the returned T can be one of these symbols, and the regular subrule symbol string.
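\fbreak
As an illustration of the ranking of the meta-terminal shifts cautioned about earlier, here is a minimal C++ sketch.
It is a toy model and not the \O2 library: |ParseState|, |Shift|, |choose_shift|, |has_competing_shifts| and |self_check| are hypothetical names invented for the example.
It also assumes that an ordinary shift on the current T itself is tried before any meta-terminal; the text above only fixes the order among the three meta-terminals.

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy model only: hypothetical types, not the O2 library's data structures.
enum class Shift { Regular, Que, Inv, All, None };

struct ParseState {
    std::set<std::string> regular_ts;  // terminals with an ordinary shift entry
    bool has_que = false;              // QUEshift present in this state
    bool has_inv = false;              // INVshift present in this state
    bool has_all = false;              // ALLshift present in this state
};

// Pick the shift for the current token.
// Assumption: a regular shift on the token wins over every meta-terminal.
// Among the meta-terminals the order is QUEshift, INVshift, ALLshift.
Shift choose_shift(const ParseState& s, const std::string& tok) {
    if (s.regular_ts.count(tok) != 0) return Shift::Regular;
    if (s.has_que) return Shift::Que;  // rank 1: the error statement
    if (s.has_inv) return Shift::Inv;  // rank 2: way out of a quasi-ambiguous parse
    if (s.has_all) return Shift::All;  // rank 3: the wild shifter
    return Shift::None;                // no shift applies in this state
}

// The situation the to-do notes warn about: INVshift hides ALLshift.
bool has_competing_shifts(const ParseState& s) {
    return s.has_inv && s.has_all;
}

// Small self-check a reader can call from main().
void self_check() {
    ParseState s;
    s.regular_ts.insert("a");
    s.has_inv = true;
    s.has_all = true;
    assert(choose_shift(s, "a") == Shift::Regular);
    assert(choose_shift(s, "z") == Shift::Inv);  // ALLshift is never reached
    assert(has_competing_shifts(s));
}
```

In this model a state holding both \INVshift{} and \ALLshift{} never reaches the \ALLshift{} branch, which is the trap described in the caution.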
\fbreak
\fbreak
Pinpointing where the error occurred in the source file\fbreak
Built into \O2 is the facility to tag each T with its appropriate source file's GPS --- filename, line number, and character position.
These co-ordinates are used to print out the errant source line with an arrow underlining the errant source token.
So when an error T is created, use of |set_rc| and its variants allows one to pinpoint the error T against the GPS of the source file's T.
Have a read of ``Abstract symbol class for all symbols'' --- |CAbs_lr1_sym|.
\fbreak
\fbreak
Some subtleties on making the errant T fire off the error catching syntax directed code.\fbreak
Let me pose a question: What happens when the errant T is not in the lookahead set to reduce that subrule?
Well, the subrule will not get executed! Ugh.
This is just not acceptable, Dave.
Well, to the rescue is the \QUEshift{} symbol.
It is not in the token stream but represents an errant situation.
So where is this errant T placed?
When one enters the subrule's syntax directed code segment, all the subrule's elements have been shifted onto the parse stack, where this last errant symbol is represented by \QUEshift{}.
But the \QUEshift{} symbol {\bf does not advance past the errant T} as in regular parsing.
So what does this mean?
The current errant T is also the lookahead symbol for the reduction.
But wait, what if this T is not in the lookahead set to reduce this subrule?
Well, I made this type of reduce an lr(0) context: no lookahead symbol is required to reduce the subrule.
To get at the current elements on the parse stack, \O2 emits within each subrule's c++ code the stack frame, with each symbol of the subrule's symbol string assigned to ``|sf->pxx__|'' where xx is the symbol's position in the string.
This is the difference to \ALLshift: \ALLshift{} depends on the lookahead set to reduce.
Now what then is the advantage to using \ALLshift?
One can test its under-the-hood T's enumerate value and then take error action, or stop use of the \ALLshift{} facility so that the grammar can continue parsing up to the ``start rule''.
As it's a wild symbol shifter, it really lowers the size of the grammar's parse tables and eases the grammar writer's typing.
\fbreak
\fbreak
Dictate no 2: Games on returning the new lookahead T back to the calling grammar\fbreak
You can play games with resetting the new lookahead T that is passed back with its |RSVP| T companion within the accept queue.
When just 1 T is returned, the lookahead T marks the continue point in the parse stream, and its contents set the calling parser's current token to continue with.
As an aside, why use the returned lookahead T's contents instead of just resetting the continue T from the token stream's container using the lookahead token position?
Well, you could also remap the current token into another T type due to, say, a symbol table remapping --- like Pascal and its ``const-id'', ``function-id'' as described in the railroad diagrams of ``The Pascal Reference Manual''.
The remapping facility is open for use via the ``Table lookup functor'' facility.
The following methods adjust the parser's token stream:\fbreak
\INDENT{.4in}{|override_current_token_pos(symbol,position)|}
\INDENT{.4in}{|override_current_token(symbol)|}
\INDENT{.4in}{|reset_current_token(position)|}
In a dual competing threads situation, where each grammar has accepted its parse and is returning its booty to the calling grammar, the calling grammar must use arbitration to select the T gift and set its parse stream accordingly; the balance in the ``accept queue'' is, so to speak, thrown away.
Of course the {\bf arbitration} facility is programmed by the compiler writer when 2 or more successful threads are returning their booty back to the calling grammar.
Normally this does not occur, as there is just one thread that will report its findings, but this city is built on rock and nondeterminism.
So a subset / superset competition, or an accept and error combo, is quite acceptable and for the arbitrator's choosing.
Forgotten arbitration code will be regurgitated by the \O2 library in message form for your fixing.
The one caveat to watch for is: What is the current token and its position in the parse stream when the parser enters the subrule's syntax directed code?
\QUEshift{} still has the errant T as its current T, so to reset back to the previous T you only subtract 1 from the current token position.
\ALLshift{} demands that 2 be subtracted, as the current T is the new lookahead T.
So you've been warned.
\fbreak
\fbreak
Some comments on stopping a parse by syntax directed code:\fbreak
Apart from the don't-do-anything approach, the grammar writer can talk to the parser and dictate his intentions.
The 2 methods open are abort-the-parse or stop-parsing.
The abort-the-parse action allows the thread to stop without any T returned to the caller grammar, or to use the {\bf failed} directive as a last chance to return an error T back to the caller.
The stop-parsing approach returns a T back to the caller but does not continue the complete parse through to its ``start rule''.
It just short-circuits the overall grammar's parsing action.
Remember that if the parse has been successful, ``why complete the parsing thru to start-rule?''.
Depending on your local grammar logic this might be the most expedient way to program.
Here are the 2 methods to do this:\fbreak
\INDENT{.4in}{|set_abort_parse(true)|}
\INDENT{.4in}{|set_stop_parse(true)|}
What about the reducing of this subrule?
Well, it occurs: entry into the syntax directed code that contains the grammar writer's statements means the reducing conditions were kosher.
So why the ``abort-parse'' versus ``stop-parse'' difference?
``stop-parse'' should contain the |RSVP| macro that enters the returned T into the calling grammar's ``accept-queue''.
The ``abort-parse'' normally does not contain this action.
\fbreak
\fbreak
Warning no 3: if \ALLshift{} is being used, don't forget to turn it off.\fbreak
This symbol is voracious: it eats and eats everything in its path.
So you can arrive at trying to eat the ``end-of-the-parse-stream'' ``|eog|'' symbol forever...
\O2 guards against this but is rather abrupt in its message to the grammar writer and stops the parse immediately.
So you'll see in some of the suggested grammars the |set_use_all_shift_off| method being called to get out of this perpetual motion and possibly continue up the parse chain to the ``start rule''.
Here is a list of some \O2 grammars having error handling and premature stopping of a parse to learn from.
\INDENT{.4in}{1) |o2_lcl_opts.lex| and called thread |o2_lcl_opt.lex| --- command line parser}
\INDENT{.4in}{2) |la_express.lex| --- |set_abort_parse(true)| thread's la expression parser}
\INDENT{.4in}{3) |c_string.lex| --- semantic example stopping a parse and programmed fsa}
Point 1 gives an example of how the ``failed'' directive in the called thread |o2_lcl_opt.lex| is programmed, and of ``|set_stop_parse(true)|'' use in the calling monolithic grammar |o2_lcl_opts.lex|.
|pass3.lex| and point 2 give more examples of aborting within a monolithic grammar.
Point 3 also shows programming use of the ``|set_abort_parse(true)|''.
For the really curious, why not use the find/grep/xargs combo to settle your appetite against \O2's grammars.
\fbreak
\fbreak
The last word, amen and happy parsing.\fbreak
Remember that the normal flow of errors should be placed into the ``error queue'' and then post processed to report its findings.
|ADD_TOKEN_TO_ERROR_QUEUE| and its variant |FSM_ADD_TOKEN_TO_ERROR_QUEUE| allow you to do this.
|pass3.lex| gives lots of examples and \O2's program shows its way of post-verbing the troubles.
And with all this error stutter, each grammar does a post-execution cleanup of its current parse, ready for the next round of calling.
Again, what does this mean?
A semi-abort was done just to stop its execution, leaving the grammar in an abort state.
But each grammar resets itself to a clean slate for its next round of calling, either by ``procedure call'' if no nested calls of itself are occurring, or by the heavy thread call.
Hygiene is important so the cat washes itself for the next eating.
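\fbreak
To make the abort / stop distinction and the clean slate concrete, here is a minimal C++ sketch.
It is a toy model and not the \O2 library: |ToyThread|, |Outcome|, |call_thread|, |cleanup| and |self_check| are hypothetical names, and the string pushed onto the queue merely stands in for the T that |RSVP| would enter into the calling grammar's accept queue.
The sketch leaves out the ``failed'' directive, so in it an aborted parse returns nothing.

```cpp
#include <cassert>
#include <deque>
#include <string>

// Toy model only: hypothetical types, not the O2 library.
enum class Outcome { Stopped, Aborted };

struct ToyThread {
    bool abort_parse = false;
    bool stop_parse = false;

    void set_abort_parse(bool b) { abort_parse = b; }
    void set_stop_parse(bool b) { stop_parse = b; }

    // Post-execution cleanup: a clean slate for the next round of calling.
    void cleanup() {
        abort_parse = false;
        stop_parse = false;
    }
};

// One round of calling the thread. On good input the syntax directed code
// queues a T for the caller (the RSVP step) and stops the parse early; on
// bad input it aborts and queues nothing.
Outcome call_thread(ToyThread& t, bool input_ok,
                    std::deque<std::string>& caller_accept_queue) {
    if (input_ok) {
        caller_accept_queue.push_back("result-T");  // stands in for RSVP
        t.set_stop_parse(true);
    } else {
        t.set_abort_parse(true);
    }
    const Outcome outcome = t.abort_parse ? Outcome::Aborted : Outcome::Stopped;
    t.cleanup();  // the flags never leak into the next call
    return outcome;
}

// Small self-check a reader can call from main().
void self_check() {
    ToyThread t;
    std::deque<std::string> q;
    assert(call_thread(t, false, q) == Outcome::Aborted);
    assert(q.empty());
    assert(call_thread(t, true, q) == Outcome::Stopped);  // abort state is gone
    assert(q.size() == 1);
}
```

The second call in |self_check| succeeds even though the first one aborted, which is the hygiene described above.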