FAQ page: www.piclist.com/techref/index.htm?key=command+parsing
PICList Thread
'Command parsing'
1997\02\07@014245 by Ed Todd

I have some multilevel jump-table macros that I use for command parsing;
they put all the jumps in one table.  Example of the expanded code, leaving
out the range-checking tests (command byte >7, etc.):

JUMPTAB   ADDWF    PC,F
;     8 command byte 1 entries
                 GOTO      INVALIDCMD                 ; 0 = invalid
                 GOTO      INITCMD                        ; 1 = INIT
                 GOTO      CFGCMD                       ; 2 = configure
                 GOTO      CMD3
                 GOTO      CMD4
                 GOTO      CMD5
                 GOTO      CMD6
                 GOTO      CMD7
;     4 subcommand entries for INIT
                 GOTO      INITYPE0                      ; 0
                 GOTO      INITYPE1                      ; 1
                 GOTO      INITYPE2                      ; 2
                 GOTO      INITYPE3                      ; 3
;     4 subcommand entries for CFG
                 GOTO      CFGTYPE0                      ; 0
                 GOTO      CFGTYPE1                      ; 1
                 GOTO      CFGTYPE2                      ; 2
                 GOTO      CFGTYPE3                      ; 3
JUMPOFFSET  SET   0
       .......
       .......
COMMANDENTRY                                          ; entry point for
command analysis
                 MOVLW   JUMPOFFSET
                 ADDWF   CMDBYTE,W
                 GOTO      JUMPTAB
JUMPOFFSET  SET   JUMPOFFSET+8           ; 8 entries in table
       ........
       ........
INITCMD
                 MOVLW   JUMPOFFSET
                 ADDWF   SUBCMDBYTE,W
                 GOTO      JUMPTAB
JUMPOFFSET  SET   JUMPOFFSET+4           ; 4 entries in table
       ........
       ........
CFGCMD
                 MOVLW   JUMPOFFSET
                 ADDWF   SUBCMDBYTE,W
                 GOTO      JUMPTAB
JUMPOFFSET  SET   JUMPOFFSET+4           ; 4 entries in table
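Ed's `ADDWF PC,F` computed-GOTO trick is PIC-specific, but the same two-level dispatch can be sketched in C with function-pointer tables. The handler names below are hypothetical, and the tables are trimmed to two commands with two subcommands each:

```c
/* Hypothetical handlers; each returns an id so the dispatch can be checked. */
static int invalid_cmd(void) { return -1; }
static int init_type0(void)  { return 10; }
static int init_type1(void)  { return 11; }
static int cfg_type0(void)   { return 20; }
static int cfg_type1(void)   { return 21; }

/* Second-level tables, one per command, indexed by the subcommand byte. */
static int (*const init_tab[2])(void) = { init_type0, init_type1 };
static int (*const cfg_tab[2])(void)  = { cfg_type0,  cfg_type1  };

/* First level picks the table; each level range-checks before indexing,
   like the "command byte >7" tests left out of the listing above. */
static int dispatch(unsigned cmd, unsigned subcmd)
{
    switch (cmd) {
    case 1: return (subcmd < 2) ? init_tab[subcmd]() : invalid_cmd();
    case 2: return (subcmd < 2) ? cfg_tab[subcmd]()  : invalid_cmd();
    default: return invalid_cmd();
    }
}
```

On the PIC the two levels share one table and differ only in the offset; here each level gets its own array, which costs a little generality but keeps the range checks obvious.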



 Ed Todd  <edtodd@sni.net>


'Command Parsing'
1998\10\21@082149 by tjaart
Has anyone come up with an efficient way of parsing
incoming text-type commands on a PIC?

For example a PIC connected to some sensor or PC
that controls it via text-based commands. You have no
control over the protocol, so you can't use numeric
commands with a lookup table.

You have to distinguish between, say, 25 text
commands varying in length between 2 characters
and 15 characters.

I am not looking for code, but rather strategies.

--
Friendly Regards

Tjaart van der Walt
tjaart@wasp.co.za

|--------------------------------------------------|
|                WASP International                |
|R&D Engineer : GSM peripheral services development|
|--------------------------------------------------|
|SMS tjaart@sms.wasp.co.za  (160 chars max)        |
|     http://www.wasp.co.za/~tjaart/index.html     |
|Voice: +27-(0)11-622-8686  Fax: +27-(0)11-622-8973|
|          WGS-84 : 26°10.52'S 28°06.19'E          |
|--------------------------------------------------|

1998\10\21@090802 by James Cameron

Tjaart van der Walt wrote:
> I am not looking for code, but rather strategies.

Use the Forth, Luke.

Dictionary of words.  Sets of tables, of course, with a vector at the
end of the table to the execution of the word, and another vector to the
next word.  A "find" word, which, when given a delimited text stream
will scan the dictionary chain until either a match is found or
dictionary exhausted.

--
James Cameron  (cameron@stl.dec.com)
Digital Equipment Corporation (Australia) Pty. Ltd. A.C.N. 000 446 800

1998\10\21@094137 by tjaart

James Cameron wrote:

> Tjaart van der Walt wrote:
> > I am not looking for code, but rather strategies.
>
> Use the Forth, Luke.
>
> Dictionary of words.  Sets of tables, of course, with a vector at the
> end of the table to the execution of the word, and another vector to the
> next word.  A "find" word, which, when given a delimited text stream
> will scan the dictionary chain until either a match is found or
> dictionary exhausted.

I am looking for some or other super-strategy that would cut instruction
cycles and word size. Andy Kunz suggested a state-machine or
switch-based approach. I am indeed currently using a combination of these
two approaches, but I still feel it is bulky (I do my parsing in the ISR).

It would be impossible to scan a 'dictionary' on every character or marker
(like <CR>) because of time constraints.

I am sure there must be some look-up table/state machine combination
that would be faster than a switch/state machine approach. I've been
banging my head for a day trying to get such a solution.

--
Friendly Regards

Tjaart van der Walt
tjaart@wasp.co.za


1998\10\21@101558 by paulb

James Cameron wrote:

> Tjaart van der Walt wrote:
>> I am not looking for code, but rather strategies.

> Dictionary of words.  Sets of tables, of course, with a vector at the
> end of the table to the execution of the word, and another vector to
> the next word.  A "find" word, which, when given a delimited text
> stream will scan the dictionary chain until either a match is found or
> dictionary exhausted.

 Generally correct.  If you are talking about lots of words, then a
table match is better than sequential/alternative hard-coded "find"
routines.  As you can't do vectoring on the PIC, however, the execution
entry would have to be a GOTO instruction (not a CALL).  If you want a
CALL, of course, you simply CALL the routine which performs the table
execute.

 The Forth dictionary uses a linked STRUCT list as described (pattern,
link_next_word, execution_addr) and uses hi-bit tagged pattern words.  It
is probably easier for a PIC to use zero-terminated patterns, and because
the table is a homogeneous structure, you don't need the links to find
the next pattern; entries are adjacent.

 Point to start of table; look for a match.  If match found, point past
the terminal zero to the GOTO and jump to that directly (same code as
table read, but CALLed from a different part of code as on return, it
must exit the routine).  If no match found, scan to the null terminator
if not there already, skip one entry (execution GOTO) and start with the
next pattern.
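A C sketch of that scan, with hypothetical pattern names (on a PIC the struct array below would be the interleaved pattern/GOTO table Paul describes):

```c
#include <stddef.h>
#include <string.h>

typedef int (*handler_t)(void);

/* Hypothetical command routines. */
static int do_init(void) { return 1; }
static int do_cfg(void)  { return 2; }

/* Interleaved table: zero-terminated pattern, then the routine to run.
   A NULL pattern marks the end of the dictionary. */
static const struct { const char *pat; handler_t run; } table[] = {
    { "INIT", do_init },
    { "CFG",  do_cfg  },
    { NULL,   NULL    },
};

/* The "find" word: scan the dictionary until a match or exhaustion. */
static int find_and_run(const char *word)
{
    for (size_t i = 0; table[i].pat != NULL; i++)
        if (strcmp(word, table[i].pat) == 0)
            return table[i].run();
    return -1;                  /* dictionary exhausted, no match */
}
```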

 Sounds easy!  I started with the thought of using separate tables for
pointers to each pattern and the execution GOSUBs, but the code is
identical to read a byte or execute, so the table might as well be
interleaved.
--
 Cheers,
       Paul B.

1998\10\21@103154 by paulb

Tjaart van der Walt wrote:

> I am looking for some or other super-strategy that would cut
> instruction cycles and word size.

 Oh, I *see*!  You mean a hash table?  You scan the word and calculate
a CRC then use the CRC to index a match table which points to one
particular entry, then see if this entry actually matches.  It may be
good enough to assume if it has the same CRC and the same initial two
letters, it *is* the same, and only keep those two letters in the match
table.

 Again, some FORTH versions only matched the length and first three
letters!  The CRC is a bit more critical.  You may need to try a few
different hash seeds to make sure all words have a unique hash, and you
will want to use only enough hash bits to match the table size.  You
waste a few hash table entries, but win in execution speed.
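A minimal sketch of the hash-table idea, assuming a toy shift/XOR hash in place of the CRC and hypothetical words (the table would be built offline in practice, as Paul suggests):

```c
#include <stdint.h>

#define TABSIZE 16      /* use only enough hash bits to match the table size */

/* Toy 8-bit shift/XOR hash standing in for the CRC (an assumption). */
static uint8_t hash8(const char *s)
{
    uint8_t h = 0;
    while (*s)
        h = (uint8_t)((h << 1) ^ (uint8_t)*s++);
    return h;
}

/* The match table keeps only the first two letters, as suggested above. */
static struct { char first2[2]; int token; } tab[TABSIZE];

static void add_word(const char *word, int token)   /* done offline really */
{
    int i = hash8(word) & (TABSIZE - 1);
    tab[i].first2[0] = word[0];
    tab[i].first2[1] = word[1];
    tab[i].token = token;
}

/* Hash indexes one slot; the two stored letters confirm or reject the hit. */
static int lookup(const char *word)
{
    int i = hash8(word) & (TABSIZE - 1);
    return (tab[i].first2[0] == word[0] && tab[i].first2[1] == word[1])
           ? tab[i].token : 0;              /* 0 = no match */
}
```

The wasted empty slots are the price of the direct index; no scanning loop runs at lookup time.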
--
 Cheers,
       Paul B.

1998\10\21@105833 by tjaart

Paul B. Webster VK2BZC wrote:

> Tjaart van der Walt wrote:
>
> > I am looking for some or other super-strategy that would cut
> > instruction cycles and word size.
>
>   Oh, I *see*!  You mean a hash table?  You scan the word and calculate
> a CRC then use the CRC to index a match table which points to one
> particular entry, then see if this entry actually matches.  It may be
> good enough to assume if it has the same CRC and the same initial two
> letters, it *is* the same, and only keep those two letters in the match
> table.
>
>   Again, some FORTH versions only matched the length and first three
> letters!  The CRC is a bit more critical.  You may need to try a few
> different hash seeds to make sure all words have a unique hash, and you
> will want to use only enough hash bits to match the table size.  You
> waste a few hash table entries, but win in execution speed.

I've played with this, but I am a bit uncomfortable with it, as some of
the text may contain variable data. (Basically you reset the CRC with
every marker (like carriage return) you get, and start anew. You only
have to check the CRC value on each non-marker character.)

A few of the messages may look like this :

Door 1 is Closed<CarriageReturn>
Door 2 is Open<CarriageReturn>
Oven Temperature is 200dC<CarriageReturn>
No Operator Present<CarriageReturn>
Danger : Boiler May Explode<CarriageReturn>

If other timing critical routines are present, the program cannot spend
too much time going through strings of characters. On the other hand, a
certain degree of integrity must be maintained.

If MessageType is the return value for your main routine, the path
that the state machine follows for the first two states, might look
something like this :

   char1  char3
 -----+-------------> MessageType = 1
      |
      +------+------> MessageType = 2
      |      |
      |      +------> MessageType = 3
      |      |
      |      +------> MessageType = 4
      |
      +-------------> MessageType = 5
      |
      +------+------> MessageType = 6
             |
             +------> MessageType = 7

--
Friendly Regards

Tjaart van der Walt
tjaart@wasp.co.za


1998\10\21@120557 by Harold Hallikainen

>If other timing critical routines are present, the program cannot
>spend too much time going through strings of characters. On the other
>hand, a certain degree of integrity must be maintained.


       Can the interrupt routine just receive stuff to a buffer and have
the "main loop" evaluate the stuff?

Harold



Harold Hallikainen
harold@hallikainen.com
Hallikainen & Friends, Inc.
See the FCC Rules at http://hallikainen.com/FccRules and comments filed
in LPFM proceeding at http://hallikainen.com/lpfm




1998\10\21@120602 by Harold Hallikainen

       Thinking back on the way the old Microsoft 6800 Basic interpreter
I licensed years and years ago tokenized incoming text (which is kinda
similar): it just had a word list with zero flags at the end of each word.
It scanned the list, starting at the top, looking for a match, and
advancing a counter as each word was passed.  The resulting count when
there was a match (with the msb set) was the token.  Here, it would be
the argument into a jump table.  This linear search is not very fast...
A faster linear search would use a linked list where a link at the
beginning of each word pointed to the next.  Actually, instead of the
address of the next word, it could be the length of the current word,
which could then let you find the next word and would also eliminate the
need for a terminator for the word, since the length byte lets you find
the end.
       Another way might be to use some sort of hash function to
uniquely code each word back down to some reasonable size for a standard
jump table.
       Another way might be to only look at the number of characters
necessary to uniquely identify the word.
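Harold's length-byte variant can be sketched like this (word list hypothetical); the length byte doubles as the link to the next word, so no terminator is needed:

```c
#include <string.h>

/* Word list: each entry is a length byte followed by that many characters.
   A zero length ends the list.  Token = 1-based position in the list. */
static const unsigned char words[] = {
    4, 'I','N','I','T',
    3, 'C','F','G',
    5, 'P','R','I','N','T',
    0
};

static int tokenize(const char *in, unsigned len)
{
    const unsigned char *p = words;
    int token = 1;
    while (*p) {
        if (*p == len && memcmp(p + 1, in, len) == 0)
            return token;
        p += 1 + *p;        /* length byte links to the next word */
        token++;
    }
    return 0;               /* no match */
}
```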


Harold



Harold Hallikainen
harold@hallikainen.com
Hallikainen & Friends, Inc.
See the FCC Rules at http://hallikainen.com/FccRules and comments filed
in LPFM proceeding at http://hallikainen.com/lpfm



1998\10\21@122302 by tjaart

Harold Hallikainen wrote:

> >If other timing critical routines are present, the program cannot
> >spend too much time going through strings of characters. On the other
> >hand, a certain degree of integrity must be maintained.
>
>         Can the interrupt routine just receive stuff to a buffer and have
> the "main loop" evaluate the stuff?

The main loop can't guarantee that the buffer is not overwritten whilst
processing it, and it could also be too slow. I never use more than one
watchdog tag in my code, so it would also make my main loop (where
I tag the dog) a bit shaky...

I am experimenting now with a suggestion from Chip Weller to have
more than one hash function. I'm not sure how I'll handle variable
data, but it is fast enough to warrant a serious look...

--
Friendly Regards

Tjaart van der Walt
tjaart@wasp.co.za


1998\10\21@134433 by Peter L. Peres

On Wed, 21 Oct 1998, James Cameron wrote:

> Tjaart van der Walt wrote:
> > I am not looking for code, but rather strategies.
>
> Use the Forth, Luke.
>
> Dictionary of words.  Sets of tables, of course, with a vector at the
> end of the table to the execution of the word, and another vector to the
> next word.  A "find" word, which, when given a delimited text stream
> will scan the dictionary chain until either a match is found or
> dictionary exhausted.

And where do you store the tables, assuming you want some program space
left over?  A 25-English-word command set with 6 chars/command == 150
words of program space, about 1/3 of a smaller PIC. With the hash tables
you can store 25 English words in 27 program words ;) The hash code is
shorter than 10 words in any case.

Peter

1998\10\21@134439 by Peter L. Peres

On Wed, 21 Oct 1998, Tjaart van der Walt wrote:

> I've played with this, but I am a bit uncomfortable with it, as some of
> the text may contain variable data. (Basically you reset the CRC with
> every marker (like carriage return) you get, and start anew. You only
> have to check the CRC value on each non-marker character.)

I've played with this too, quite a bit ;)

The key to parsing is called lexing ;) Never try to match a phrase, only
words. Sometimes you can cheat by matching a large piece of static text at
once, but it is not such a great idea. To peel out variables, you will
HAVE to store them somewhere. You will be able to do this, because you
will lex them as words. A number in the 1st char is usually a trigger for
a numerical lexeme. You need to convert the whole number and store it
somewhere, and read other words until a full phrase match occurs and the
parser reaches a final state. You do not try to make sense of a variable
immediately.

Using the hash table lexer I described in my other posting, each of your
phrases will match a path describing a phrase, or not. Only when you
finish a complete phrase that fits your grammar will you be able to decide
whether you got garbage or data. (Reaching an error terminal state is also
a way of exiting).

To implement this, you'd need two state machines in the character
receiver. The outer one runs through word-states to match phrases in the
grammar, the inner one runs through hash cycles on a temp buffer and
matches words from the table. It only causes the outer machine to advance
when it reaches a terminal state (word matched or failed). It can also
store or convert (a) number(s).

The outer machine will run through states described by a table that marks
valid phrases in the grammar. It sometimes needs to make decisions etc.
The inner one runs through the hash algorithm and matches a word against
the hashed word table when finished (end of word, punctuation etc).

The inner machine is a lexer and the outer machine a parser.

What you are trying to achieve is a small brother of yacc and bison
(parser generators, usually for UNIX). For this reason I'd recommend you
look at how one of these is implemented for ideas. Hint: the lexer works
almost as described in my other posting, using hashes, and the parser is
an interpreter for a simple parsing language whose program is stored in a
table. The parser language is stored as pairs of matched lexeme-next
position. The next position can be a terminal. That's when you say bingo
and exit the code with a certain (unique or not) result code.

The code that executes after this, uses the output of the parser (its exit
code) as a mark of a properly matched phrase or error. Variables already
converted by the lexer (in this case) must be picked up from temporary
registers.

A properly built machine can understand things like:

The oven is very hot.
The oven is very very hot.
The oven is very ... hot.
                ^^^ many more very's here

The time is 2030 GMT.
                ^^^ignore this whatever it may be, e.g. IST, EET etc.

and understand and convert things like:

Widget A is at 10 degrees and 8 under.

where 10 is saved to var[0] and 8 to var[1] by the lexer, etc, etc.

Despite the apparent complexity of all this, it is VERY fast code.
The average read character causes 3 flag-branches, an XOR and shift, an
increment and a return from interrupt. The longest case is when the
machine reaches a terminal state. At the end of words, a table lookup
occurs (to map hash->index). There are ways to speed up the lookup, but
not easily on a PIC (btree etc). Just put often-used words first.
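A toy version of the two machines, assuming a shift/XOR hash, a three-word dictionary, a one-phrase grammar and single-space word separators (all hypothetical; the hash constants were computed offline for these words):

```c
#include <stddef.h>
#include <stdint.h>

enum { TOK_NONE = 0, TOK_OVEN, TOK_IS, TOK_HOT };

/* Inner machine (lexer): the running hash is mapped to a token at word end. */
static int token_for(uint8_t h)
{
    switch (h) {
    case 228: return TOK_OVEN;   /* shift/XOR hash of "OVEN" */
    case 193: return TOK_IS;     /* shift/XOR hash of "IS"   */
    case 234: return TOK_HOT;    /* shift/XOR hash of "HOT"  */
    default:  return TOK_NONE;
    }
}

/* Outer machine (parser): (state, token) -> next state; -1 = accept. */
static const struct { int state, token, next; } grammar[] = {
    { 0, TOK_OVEN,  1 },
    { 1, TOK_IS,    2 },
    { 2, TOK_HOT,  -1 },
};

static int parse_step(int state, int token)
{
    for (size_t i = 0; i < sizeof grammar / sizeof grammar[0]; i++)
        if (grammar[i].state == state && grammar[i].token == token)
            return grammar[i].next;
    return -2;                              /* error terminal */
}

/* Drive both machines over a phrase; returns -1 if the grammar accepts. */
static int parse_phrase(const char *text)
{
    uint8_t h = 0;
    int state = 0;
    for (;; text++) {
        char c = *text;
        if (c == ' ' || c == '\0') {        /* word boundary: run the parser */
            state = parse_step(state, token_for(h));
            h = 0;
            if (state < 0 || c == '\0')
                return state;
        } else {
            h = (uint8_t)((h << 1) ^ (uint8_t)c);
        }
    }
}
```

In the real thing the characters would arrive one at a time in the ISR, and the number-converting lexeme would be another branch in the inner machine.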

Sheesh, this was long. I won't do this again.

Peter

1998\10\21@134442 by Peter L. Peres

On Wed, 21 Oct 1998, Tjaart van der Walt wrote:

> Has anyone come up with an efficient way of parsing
> incoming text-type commands on a PIC?
>
> For example a PIC connected to some sensor or PC
> that controls it via text-based commands. You have no
> control over the protocol, so you can't use numeric
> commands with a lookup table.
>
> You have to distinguish between, say, 25 text
> commands varying in length between 2 characters
> and 15 characters.
>
> I am not looking for code, but rather strategies.

;) Ouch, bin there myself.

There are two ways out:

1. Use only one-letter codes. The state machine for this is trivial. You
can decide to 'eat' whitespace, punctuation etc. This allows users and
programs some latitude on CR/LF etc. Many terminal emulators can be
programmed to issue such codes as macros. This works very well and keeps
the code space in the PIC small (no tables). Each letter can be mapped to
one of 62 start addresses using a lookup table after bound checking.
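A sketch of option 1, with hypothetical handlers and a switch standing in for the 62-entry lookup table:

```c
/* Hypothetical one-letter command handlers. */
static int handle_r(void) { return 1; }      /* say, 'r' = read sensor  */
static int handle_w(void) { return 2; }      /* say, 'w' = write output */

/* Bound-check first, then dispatch on the letter. */
static int dispatch_letter(char c)
{
    if (c >= 'a' && c <= 'z') {
        switch (c) {
        case 'r': return handle_r();
        case 'w': return handle_w();
        }
    }
    return -1;                               /* unknown or out of range */
}
```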

2. Use a small buffer that stores the hash of the received word. You can
use any hash function as long as it gives a unique hash for each word in
your alphabet. On a PIC, using an 8-bit shift and XOR on a single temp
register and adding the length should yield a unique 8-bit number for
almost every word you want to use. Then you map the hash onto a table that
stores hash values. The index of the table where you find your hash is
your word's index. The table is best generated using a C program or Perl
or awk or such, outputting a table of words (numbered) and a table of
assembly (DB ..). The rest you write yourself (actually I have a shell
script that does all this from a list of words - I just wonder where that
backup is ?). You place often used words first. The hasher in the PIC is
fed by a state machine that resets to 0 at whitespace, punctuation or
whatever suits you, such as word length. It calls your routine(s) whenever
it finds a valid word, and calls an error routine whenever it finds an
unknown word. The hasher in the code generator script checks that each
word hashes to something unique, and complains if not. One has a few
simple solutions to fix that: shift the other way, increment/decrement
before shifting etc.
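A sketch of the hash and the uniqueness check the generator script would perform. This is my reading of the recipe; the shift direction and adding the length at the end (rather than mixing it in) are assumptions:

```c
#include <stdint.h>

/* 8-bit shift and XOR over the characters, plus the word's length. */
static uint8_t word_hash(const char *s)
{
    uint8_t h = 0, len = 0;
    for (; *s; s++, len++)
        h = (uint8_t)((h << 1) ^ (uint8_t)*s);
    return (uint8_t)(h + len);
}

/* Offline check: every word in the dictionary must hash uniquely.
   On a collision, tweak the hash (shift the other way, etc.) and retry. */
static int all_unique(const char *const *words, int n, uint8_t *out)
{
    for (int i = 0; i < n; i++) {
        out[i] = word_hash(words[i]);
        for (int j = 0; j < i; j++)
            if (out[j] == out[i])
                return 0;       /* collision */
    }
    return 1;
}
```

The `out` array is what gets emitted as the assembly DB table; the word's index in it is the token the PIC works with.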

There is another way, that stores the length too, but the
one-byte-per-word approach is pretty good if it can sustain your
dictionary ;)

sorry for the longish post ;)

Peter

1998\10\21@135825 by Peter L. Peres

On Wed, 21 Oct 1998, Harold Hallikainen wrote:

>         Thinking back on the way the old Microsoft 6800 Basic interpreter
> I licensed years and years ago tokenized incoming text, which is kinda
> similar, it just had a word list with zero flags at the end of each word.
>  It scanned the list, starting at the top, looking for a match, and
> advancing a counter as each word was passed.  The resulting count when
> there was a match (with the msb set) was the token.  Here, it would be

Yes, I'd trust M$ to use such a horror in a commercial product. After all,
they don't pay for the EPROM to store the tables. YOU do. Or did. Maybe
they could not afford yacc (which was already more than mature by then).

>         Another way might be to use some sort of hash function to
> uniquely code each word back down to some reasonable size for a standard
> jump table.

Yes !

>         Another way might be to only look at the number of characters
> necessary to uniquely identify the word.

Yes ! And the above, combined into a single hash. That's the way to go.

Peter

1998\10\21@160637 by Dwayne Reid

Peter L. Peres wrote:

>2. Use a small buffer that stores the hash of the received word. You can
>use any hash function as long as it gives a unique hash for each word in
>your alphabet. On a PIC, using an 8-bit shift and XOR on a single temp
>register and adding the length should yield a unique 8-bit number for
>almost every word you want to use.

Hmmm . . .

You have me interested!  I know NOTHING about hash functions (I'm a hardware
type from way back and have never had the opportunity to take computing
science courses).  Tell me more!  Any online FAQs or examples?  PIC examples?


>sorry for the longish post ;)

Gosh!  What is there to be sorry about?  This is good stuff!

dwayne




Dwayne Reid   <dwayner@planet.eon.net>
Trinity Electronics Systems Ltd    Edmonton, AB, CANADA
(403) 489-3199 voice     (403) 487-6397 fax

1998\10\21@164938 by John Payson

>>
>         Thinking back on the way the old Microsoft 6800 Basic interpreter
> I licensed years and years ago tokenized incoming text, which is kinda
> similar, it just had a word list with zero flags at the end of each word.
>  It scanned the list, starting at the top, looking for a match, and
> advancing a counter as each word was passed.  The resulting count when
> there was a match (with the msb set) was the token.  Here, it would be

Yes, I'd trust M$ to use such a horror in a commercial product. After all,
they don't pay for the EPROM to store the tables. YOU do. Or did. Maybe
they could not afford yacc (which was already more than mature by then).

<<

Actually, the technique Microsoft used at least on the 6502 Basic interpreters
wasn't too bad; the ROM storage requirement was not excessive--one byte per
character of a keyword; I can't see how they'd get much better than that.  On
the Commodore machines, it even had an interesting (albeit documented) quirk:
since the scanner looked for (typed_char xor table_char) to be 128 at the end
of a valid keyword, typing a shorter word but setting bit 7 of the last
character would allow "abbreviations" of keywords.

If you want ugly, you should see some of the stuff I coded to format output
strings using the CCS compiler; for example, how do you like this:

int ct;

void do_out(char ch)
{
 if (!(ch & 0x20)) ct--;
 if (!ct) putch(ch);
}

void output_weekday(int day)
{
 ct = day;
 do_out("SundayMondayTuesdayWednesdayThursdayFridaySaturday");
}

void output_month(int month)
{
 ct = month;
 do_out("JanuaryFebruaryMarchAprilMayJuneJulyAugust");
 do_out("SeptemberOctoberNovemberDecember");
}

Using an array of strings would be nicer if the compiler could handle it
cleanly, but at least at the time the CCS compiler had trouble fitting
large arrays in code-space; splitting the output by contrast was a relative
piece of cake.
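For reference, a portable-C restatement of the same packed-string trick (the CCS version above relies on the compiler calling do_out once per character of a constant string):

```c
#include <string.h>

/* Uppercase letters (ASCII bit 5 clear) mark the start of each packed name:
   count them down and copy out the Nth name (1-based). */
static void nth_name(const char *packed, int n, char *out)
{
    int ct = n;
    for (; *packed; packed++) {
        if (!(*packed & 0x20)) ct--;   /* uppercase: next name starts */
        if (ct == 0) *out++ = *packed; /* inside the wanted name: copy */
        if (ct < 0) break;             /* past it: done */
    }
    *out = '\0';
}
```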

1998\10\21@171925 by William Chops Westfield

Hashing is nice.  Reminds me of the basic programs where we'd only check the
first character of the user's response to the typical "yes or no" question,
so we'd accept input like "yeah", "yup", "not on your life", and so on.

A hash is a one-way function that compresses a large amount of input data
into a small number of output values.  Checksums and CRCs are both fine hash
functions.  A "good" hash  produces outputs that are unique within the input
data space - for each possible expected input, you get a different output.
A "great" hash has (in addition) the same number of possible outputs as it
has expected inputs.

Suppose you want to recognize keywords "add", "sub", "mul", "div", and
"print" to do the simple 4-banger rpn calculator.  Let's let the hash
function be simply adding the characters of each token together, and then
we'll take as many bits as necessary to come up with unique values (assuming
that this will be possible.)  Let's see:

(gdb) print /x 'a'+'d'+'d'
$1 = 0x129
(gdb) print /x 's'+'u'+'b'
$2 = 0x14a
(gdb) print /x 'm'+'u'+'l'
$3 = 0x14e
(gdb) print /x 'd'+'i'+'v'
$4 = 0x143
(gdb) print /x 'p'+'r'+'i'+'n'+'t'
$5 = 0x22d

Terrific - the low order 4 bits are 9, a, e, 3, and d - nice and unique
with no additional manipulation, and we can check for our 5 keywords with 5
(byte) compares or a small jump table - no slow string manipulation or bulky
keyword storage needed.  (actually, these turn out to be unique within 3
bits, and can't be within two bits, so it worked really well!)  Of course,
this algorithm means that "dda" and "dad" will be synonyms for "add" (and
"c" for "div") as well - that's what you put up with in a hash algorithm.  (more
careful algorithms will use the hash for speed, and compare against the
actual string for accuracy - but that doesn't solve the storage issues with
strings.)  Besides, think how much fun your customers will have learning how
to talk to your device in complete gibberish!
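BillW's toy hash in compilable form:

```c
/* Sum the characters of the token, keep only the low 4 bits. */
static unsigned rpn_hash(const char *s)
{
    unsigned h = 0;
    while (*s)
        h += (unsigned char)*s++;
    return h & 0xFu;
}
```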

BillW

1998\10\22@014310 by tjaart

Dwayne Reid wrote:

> Peter L. Peres wrote:
>
> >2. Use a small buffer that stores the hash of the received word. You can
> >use any hash function as long as it gives a unique hash for each word in
> >your alphabet. On a PIC, using an 8-bit shift and XOR on a single temp
> >register and adding the length should yield a unique 8-bit number for
> >almost every word you want to use.
>
> Hmmm . . .
>
> You have me interested!  I know NOTHING about hash functions (I'm a hardware
> type from way back and have never had the opportunity to take computing
> science courses).  Tell me more!  Any online FAQs or examples?  PIC examples?
>
> >sorry for the longish post ;)
>
> Gosh!  What is there to be sorry about?  This is good stuff!

Right you are!
The input on this topic has been really good. The approach I am
experimenting with, is
1) Four hash functions are implemented.
2) I keep the last three characters received, so I can copy these when I
   have a hit, or constantly check for a marker (<CR><LF> in my case).
3) All hash functions are cleared on a marker (<CR><LF>)
4) I only compare one hash function on every character received. If it
   matches, I also check for the other three.

Here are my functions :

  RRF(RX_Hash_3);
  RRF(RX_Hash_4);
  RX_Hash_1 ^= RX_Character;
  RX_Hash_2 += RX_Character;
  RX_Hash_3 ^= RX_Character;
  RX_Hash_4 += RX_Character;

Here is my current strategy :

On character received :
------------------------
1) Calculate Hash Functions
2) Check for Marker :
           If marker present, clear hash functions
3) Compare Hash Function 1 against List :
           If hit, compare functions 2,3,4 :
                   If all three compare, Bob's my uncle.


So far, so good.
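Tjaart's four functions, written out in C. Plain 8-bit rotates stand in for RRF here, which on a PIC rotates through the carry bit, so this is an approximation of the PIC behavior:

```c
#include <stdint.h>

/* Four running hashes, reset on the <CR><LF> marker. */
static uint8_t h1, h2, h3, h4;

static void hash_reset(void)            /* called on the marker */
{
    h1 = h2 = h3 = h4 = 0;
}

static void hash_char(uint8_t c)        /* called on every other character */
{
    h3 = (uint8_t)((h3 >> 1) | (h3 << 7));   /* RRF(RX_Hash_3) */
    h4 = (uint8_t)((h4 >> 1) | (h4 << 7));   /* RRF(RX_Hash_4) */
    h1 ^= c;
    h2 += c;
    h3 ^= c;
    h4 += c;
}
```

The rotates keep h3 and h4 sensitive to character order, which the plain XOR and sum (h1, h2) are not; that is what makes comparing all four a much stronger test than any one alone.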

--
Friendly Regards

Tjaart van der Walt
tjaart@wasp.co.za


1998\10\22@022222 by Dr. Imre Bartfai

Hi,

if the commands themselves are fixed, you can generate a hash value from
the individual values (e.g. checksum or CRC). These values can be placed
in a lookdown table (PicBasic terminology) and the incoming commands are
hashed the same way before processing. However, it is not true parsing...
But it can help if the commands do not vary (or if the fixed portion has
some separator).
Imre

1998\10\22@131144 by Peter L. Peres

On Thu, 22 Oct 1998, Tjaart van der Walt wrote:

> 1) Four hash functions are implemented.
> 2) I keep the last three characters received, so I can copy these when I
>    have a hit, or constantly check for a marker (<CR><LF> in my case).
> 3) All hash functions are cleared on a marker (<CR><LF>)
> 4) I only compare one hash function on every character received. If it
>    matches, I also check for the other three.
>
> Here are my functions :
>
>    RRF(RX_Hash_3);
>    RRF(RX_Hash_4);
>    RX_Hash_1 ^= RX_Character;
>    RX_Hash_2 += RX_Character;
>    RX_Hash_3 ^= RX_Character;
>    RX_Hash_4 += RX_Character;

IMHO you are over-doing it a little bit. Concentrate on one hash function
and make it better. Since you have a 4 bytes/entry table, use a 31 bit
hash number. If you throw dictionary English words at that you'll spend a
few centuries typing until you hit something duplicate. I've used 13 bit
wide tables for 100-word dictionaries using only xor and shift with no
collisions. The (worst) probability for a double hit is 100/(2^13-99) =
100/8093 ~= 1.2%. Note that hash-matching without comparing affects the
S/N ratio of the input (or the noise immunity and robustness of your
system).

Peter

1998\10\22@131148 by Peter L. Peres

On Wed, 21 Oct 1998, John Payson wrote:

> Actually, the technique Microsoft used at least on the 6502 Basic
> interpreters wasn't too bad; the ROM storage requirement was not
> excessive--one byte per character of a keyword; I can't see how they'd

Ah, it's called tokenized Basic. It was pioneered by Dartmouth (?) Tiny
Basic. Not Microsoft. Every small Basic machine seems to have used that at
one time or another, including Sinclair ZX80, 81, Spectrum, C64 etc. BTW
tokenized Basic is almost like Forth, once at the P-language level, except
the dictionary builds/does commands were clipped out early ;)

Peter

> get much better than that.  On the Commodore machines, it even had an
> interesting (albeit documented) quirk:  since the scanner looked for
> (typed_char xor table_char) to be 128 at the end of a valid keyword,
> typing a shorter word but setting bit 7 of the last char- acter would
> allow "abbreviations" of keywords.

Yes. Sinclair machines never let you type the command out, it was coded
into the keypress ;) Takes some getting used to.

Peter

1998\10\22@131151 by Peter L. Peres

On Wed, 21 Oct 1998, Dwayne Reid wrote:

> Hmmm . . .
>
> You have me interested!  I know NOTHING about hash functions (I'm a hardware
> type from way back and have never had the opportunity to take computing
> science courses).  Tell me more!  Any online FAQs or examples?  PIC examples?

I am not a software guy either, but I did do some reading up ;) Hash
functions are a very VERY important part of modern computing. They are
used in cryptography, security systems, parsing, databases and many other
things.  Hashing is and has been studied very seriously.

Any hash function takes an input of arbitrary length and size and computes
a number of a relatively small size from it. You can think of a hash
function as an inverse mapper for a quantity of white noise into a
pseudo-random generator state, where the prng is not perfect. This implies
that there is more than one input string (or word or image) that maps to
the same hash number. The trick is, to choose a hash function that maps
all the words of interest in your dictionary into unique hash numbers.
When this succeeds, the Perfect Hash of that particular dictionary,
using that function, is found. There is a Unix program called gperf
(GNU gperf) that does just this from a list of words, with output in C,
C++, Perl etc.  This is pretty useless in a PIC (the gperf tables are
large).

I am not Dr. Math. but just about every kid knows that performing modulo
operations on a field of prime size(s) yields rather unpredictable
results, i.e. noise. Academics have studied the matter profusely, and
very similar operations form the basis of many mainstream cryptographic
systems (such as PGP and DES).

One of the simplest kinds of functions based on these principles are the
rotate/xor functions. They are weak vs. more elaborate schemes but they
are very efficiently implemented in small micros and in hardware. I'll
give an example (of a miserly hash function):

7 is a prime number, and it is slightly smaller than 8, the number of bits
in a byte, such as a PIC file register. Now use XOR to fold each input
data byte into the register. With each input data byte:

       ;; UNTESTED
       ; merge new (7-bit) data into hash
       movf    DATA_SOURCE, W ; read data from somewhere, presumably @FSR
       xorwf   HASH, F        ; xor into HASH (affects Z only, NOT carry)

       ; rotate the low 7 bits of HASH left by one
       bcf     STATUS, C      ; XORWF does not clear carry, so clear it here
       rlf     HASH, F        ; old bit 6 -> bit 7, 0 -> bit 0
       btfsc   HASH, 7        ; did a bit fall off the 7-bit end ?
       incf    HASH, F        ; yes: wrap it around into bit 0
       bcf     HASH, 7        ; keep HASH down to 7 bits

Although this is a miserly hash function it beats any checksum of the same
length (but not a CRC polynomial probably). Other often used hash register
lengths are: 13 bits (16-3 = 2 registers), 17 bits (2 registers + carry
bit which must be saved), 23 (3 registers minus one bit) and 31 (four
registers minus one bit). The code above can be optimized in many ways.
One way is to notice that the xors repeat at the same position after a
certain number of shifts (7 in this case), and thus first xor every 7th
byte in a buffer and then shift and xor the 7 intermediary results
together. This is not the way to go on a PIC imho.
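A portable C mirror of the 7-bit update above may help check hash tables on a PC before burning a PIC (an untested sketch of one reading of its intent: the hash lives in bits 0..6 and bit 6 wraps to bit 0):

```c
#include <stdint.h>

/* One step of the 7-bit rotate/xor hash: fold in a 7-bit input byte,
   then rotate the 7-bit hash left by one position. */
static uint8_t hash7_update(uint8_t hash, uint8_t data)
{
    hash ^= (uint8_t)(data & 0x7F);                        /* merge input   */
    return (uint8_t)(((hash << 1) | (hash >> 6)) & 0x7F);  /* rotate 7 bits */
}
```

Unlike a plain checksum, the rotate makes the result depend on character order, which is the point of the shift.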

When all the chars in a string are processed, HASH contains a number. The
length of the string is added to it sometimes. This is the hash of the
string. If you store the hash and then later have a string to be
recognized, you hash it in the same way, and then compare the numbers. If
they match, it might be the same string. The larger the hash number size
vs. the size of the string, the lower the chances of a false string being
matched. The maths behind this is very complex, so get a book with tables
or get an advanced course in integer corpus maths imho, to obtain workable
data for a project. Even so, getting the hashing function right implies a
trial-and-error process in which the hash function is tweaked until it
yields a perfect hash for a given set of strings. A simple way to tweak,
is to rotate right instead of left in the example above, for example.
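The trial-and-error step described here can be sketched as a small collision check in C (untested; hash7_str and is_perfect are illustrative names, and the word lists are assumptions):

```c
#include <stdint.h>

/* Hash a whole string with the 7-bit rotate/xor function. */
static uint8_t hash7_str(const char *s)
{
    uint8_t h = 0;
    while (*s) {
        h ^= (uint8_t)(*s++ & 0x7F);
        h = (uint8_t)(((h << 1) | (h >> 6)) & 0x7F);
    }
    return h;
}

/* Return 1 if no two words in the dictionary collide (a perfect hash
   for this set), 0 otherwise; on 0, tweak the function and retry. */
static int is_perfect(const char *words[], int n)
{
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (hash7_str(words[i]) == hash7_str(words[j]))
                return 0;
    return 1;
}
```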

An example of a good hash function is MD5.

Reading material on hashing can be found in most CS books, including
algorithm books and in C language tutorial books (ANSI C, K&R etc).

hope this helps,

Peter

1998\10\22@170252 by John Payson

>>
> Actually, the technique Microsoft used at least on the 6502 Basic
> interpreters wasn't too bad; the ROM storage requirement was not
> excessive--one byte per character of a keyword; I can't see how they'd

Ah, it's called tokenized Basic. It was pioneered by Dartmouth (?) Tiny
Basic. Not Microsoft. Every small Basic machine seems to have used that at
one time or another, including Sinclair ZX80, 81, Spectrum, C64 etc. BTW
tokenized Basic is almost like Forth, once at the P-language level, except
the dictionary builds/does commands were clipped out early ;)
<<

I was referring in more detail to the storage mechanism used for the keywords,
rather than the general approach of tokenizing.  The speed of their algorithms
for compressing/decompressing was not nearly as fast as it could have been,
but was fast enough that it really didn't matter (when editing near the start
of a long program, the computer spent a lot more time doing the memory block
move than it did tokenizing, and when listing a program, de-tokenizing was far
less of a limiting factor than the human's reading speed).

As for similarity to Forth, I certainly wouldn't go that far on Microsoft's
6502 BASICs; variables are stored by name necessitating a lookup, and constants
are stored in ASCII form (evaluated at runtime).  Oddly, the constant zero would
be evaluated faster if it was written as "." than if it was written as "0".

Atari's BASIC was much nicer in that particular regard--variables were stored as
a "variable" token followed by an index number, and constants were stored as a
"constant" token followed by the value.  Atari's one big failing, though, was
that the only numeric format they supported was 6-byte floating-point BCD!
Thus, if a program contained a "goto 100" statement, the poor CPU would have to
convert the number to binary and then do the line number lookup.  Too bad Atari
ruined what would otherwise have been a very nice BASIC.

>>
> get much better than that.  On the Commodore machines, it even had an
> interesting (albeit documented) quirk:  since the scanner looked for
> (typed_char xor table_char) to be 128 at the end of a valid keyword,
> typing a shorter word but setting bit 7 of the last char- acter would
> allow "abbreviations" of keywords.

Yes. Sinclair machines never let you type the command out, it was coded
into the keypress ;) Takes some getting used to.
<<
I'm a little curious, though... since the unexpanded Sinclair had 1K RAM and a
screen of 32x24 characters, how was there room for ANYTHING?  Did you need extra
RAM to get the full screen size, or what?

Also, since the CPU was directly responsible for video generation (if I remember
right) did any games or anything do cute stuff with special display routines?



1998\10\23@013555 by tjaart

Peter L. Peres wrote:

{Quote hidden}

Is there a non-brute-force method of estimating the chances of a double hit? I wouldn't like to
spend too much time in 'hashing' or comparing. Currently
I only have to compare RX_Hash_1 (which is quite fast). I only test the other
functions after a hit on RX_Hash_1.

>  Note that hash-matching without comparing affects the
> S/N ratio of the input (or the noise immunity and robustness of your
> system).

What do you mean by this?
--
Friendly Regards

Tjaart van der Walt
tjaart@wasp.co.za

|--------------------------------------------------|
|                WASP International                |
|R&D Engineer : GSM peripheral services development|
|--------------------------------------------------|
|SMS tjaart@sms.wasp.co.za  (160 chars max)       |
|     http://www.wasp.co.za/~tjaart/index.html     |
|Voice: +27-(0)11-622-8686  Fax: +27-(0)11-622-8973|
|          WGS-84 : 26°10.52'S 28°06.19'E          |
|--------------------------------------------------|

1998\10\23@084851 by Peter L. Peres

On Fri, 23 Oct 1998, Tjaart van der Walt wrote:

> Peter L. Peres wrote:
> > imho you are over-doing it a little bit. Concentrate on one hash function
> > and make it better. Since you have a 4 bytes/entry table, use a 31 bit
> > hash number. If you throw dictionary English words at that you'll spend a
> > few centuries typing until you hit something duplicate. I've used 13 bit
> > wide tables for 100-word dictionaries using only xor and shift with no
> > collisions. The (worst)  probability for a double hit is 100/(2^13-99) =
> > 100/8193 ~= 1.3%.
   ^^^^^^^^ Oops! that's 100/8093, the final figure should be correct.

> Is there a non-brute-force method of estimating the chances of a double
> hit? I wouldn't like to spend too much time in 'hashing' or comparing.
> Currently I only have to compare RX_Hash_1 (which is quite fast). I only
> test the other functions after a hit on RX_Hash_1.

Your method will work, but a 31 bit hash will occupy the same space and be
less error prone. I can't prove this as I'm not up to date with the math,
but I feel it in my guts ;)

The non-brute force method is what I've used above: You assume that the
hash is perfect. The hash is encoded on N bits. You have a dictionary of K
words to hash. You assume that K-1 words were hashed without collision,
and use K-1 slots out of the 2^N available. Then you have 2^N-K+1 slots
left to put the Kth word into, which, still assuming the hash is perfect,
results in a probability of a hit of: 1/(2^N-K+1). I don't have the
figures for the project here but this sounds right.
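The per-word estimate above, 1/(2^N - K + 1), is a one-liner to evaluate (a sketch; the function name is an assumption):

```c
/* Probability that the Kth word collides, assuming a perfect N-bit hash
   already holding K-1 words without collision: 1 / (2^N - K + 1). */
static double double_hit_prob(int n_bits, int k_words)
{
    return 1.0 / ((double)(1UL << n_bits) - k_words + 1);
}
```

With N=13 and K=100 this gives 1/8093, matching the corrected figure earlier in the thread.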

> >  Note that hash-matching without comparing affects the
> > S/N ratio of the input (or the noise immunity and robustness of your
> > system).
>
> What do you mean by this?

Well, you have the usual problem of a noise affected signal source and a
filter of a certain bandwidth. With a given S/N ratio (f.ex. operator
typing errors or radio channel random errors) a perfect filter will
recognize all the errors that do not map as another word in the
dictionary. A hash parser that is not backed by a comparison may let other
words through that hash to the same symbol. The bandwidth of the 'filter'
is widened. You can express this as a worsening of the S/N at the input,
and calculate all the input words of a given length that map to all
certain hash codes (the pass codes) (this is trivial). Then see how many
input codes can really pass the filter and report this vs how many pass
with a direct perfect match (no hash). The result is the amount of
worsening of the S/N at your input.

Peter

1998\10\23@085057 by Peter L. Peres

On Thu, 22 Oct 1998, John Payson wrote:

> Yes. Sinclair machines never let you type the command out, it was coded
> into the keypress ;) Takes some getting used to.
> <<
>
> I'm a little curious, though... since the unexpanded Sinclair had 1K RAM
> and a screen of 32x24 characters, how was there room for ANYTHING?  Did
> you need extra RAM to get the full screen size, or what?

;) The Sinclair engineers who cooked that machine up were pure geniuses. I
do not dare guess what else they did. It's like this:

The ZX80 had a Z80A at 3.5 MHz, 4k of ROM, and 1k of RAM on board. Also 6
(six!) more bog standard LS TTL chips. Out of this, it made video (the Z80
was fetching nops in a certain area while it generated addresses to read
the ram into a LS165 (?) shift register), black and white and not very
standard, but ok for a TV, read a membrane keyboard and had a tape
cassette data storage interface. imho it is the most cunning
minimalist design to ever hit the consumer market in the entry-level
home computer domain, and it's darn hard to beat even today (I did do an
evil prototype using a NEC6145 OSD, a PIC, a seeprom, and a membrane
keyboard, and that's IT ;), but nothing came of it eventually ).

The ROM contained a tiny basic version that used tokenized statements. You
did not type in statements, instead, each key had a few statements
assigned to it (printed on it). Each statement was stored as a token code
in ram (1 byte), with some bits left over used as flags ;) Strings and
arrays had 1 character names. Variables had up to 5 character names if I
remember well. Math was 16 bit signed integer only.

A fiendishly cunning software mechanism allowed the program and the video
memory to share the same tiny space with the stack squeezed somewhere
between them.  When you had a longer program the screen would grow smaller
(fewer lines displayed) ;). There were 2 ways to run a program: 'fast' and
'slow'. In 'fast' mode the processor gave up screen refresh and it
blurred. In 'slow' mode there was screen output and the processor ran the
program during the blanking periods and in the empty lines on the screen
;)

In theory the full screen took 768 bytes, leaving 256 bytes for stack and
BASIC program, but with the fiendish mechanism described, you could easily
end up with 768 bytes program and 256 screen instead !

I have great respect for whoever made that system, hard and soft ;) I also
think that this or another similar design should be revived as a kit.

The ZX81 kept the same principles but added a better BASIC in 8k ROM, with
floating point. It still had 2 x 2114 RAMs (1k) but most serious people
bought expanders, up to 64k (paged). It had an ASIC that allowed the CPU
to spend more time in the user program, but still had fast and slow modes.
It also had some graphics.

The Spectrum also kept the same principles, but switched to a full graphic
screen, with soft character generator, an ASIC, no slow mode, 16k ROM and
48k of RAM (expandable by paging). It had 256*192*16+8+blink color memory
mapped graphics and a separate border color control byte.

> Also, since the CPU was directly responsible for video generation (if I
> remember right) did any games or anything do cute stuff with special
> display routines?

Certainly ;) Inventing new bit-blitting modes for games, using (then)
un-documented Z80 opcodes was a favorite pastime for many. There were
also C, Pascal, Lisp, assembler, disassembler and Prolog for the Spectrum,
all on cassette tapes. The Spectrum ROM source code was circulated as
reference material. It had been published as a book, in whole, with
comments ;)

John, would you mind stopping this, it seems to come with each mail of
yours:
> begin 600 WINMAIL.DAT
> M>)\^(A(5`0:0" `$```````!``$``0>0!@`(````Y 0```````#H``$(@ <`
> M& ```$E032Y-:6-R;W-O9G0@36%I;"Y.;W1E`#$(`0F `0`A````.#DP,T)!
- much more of the same snipped -

Peter

1998\10\23@095820 by tjaart

Peter L. Peres wrote:

> > >  Note that hash-matching without comparing affects the
> > > S/N ratio of the input (or the noise immunity and robustness of your
> > > system).
> >
> > What do you mean by this?
>
> Well, you have the usual problem of a noise affected signal source and a
> filter of a certain bandwidth. With a given S/N ratio (f.ex. operator
> typing errors or radio channel random errors) a perfect filter will
> recognize all the errors that do not map as another word in the
> dictionary. A hash parser that is not backed by a comparison may let other
> words through that hash to the same symbol. The bandwidth of the 'filter'
> is widened. You can express this as a worsening of the S/N at the input,
> and calculate all the input words of a given length that map to all
> certain hash codes (the pass codes) (this is trivial). Then see how many
> input codes can really pass the filter and report this vs how many pass
> with a direct perfect match (no hash). The result is the amount of
> worsening of the S/N at your input.

...so it is back to the brute-force method of checking. I'll write a
program to do it over a week-end or something...

Thanks.

--
Friendly Regards

Tjaart van der Walt
tjaart@wasp.co.za

|--------------------------------------------------|
|                WASP International                |
|R&D Engineer : GSM peripheral services development|
|--------------------------------------------------|
|SMS tjaart@sms.wasp.co.za  (160 chars max)       |
|     http://www.wasp.co.za/~tjaart/index.html     |
|Voice: +27-(0)11-622-8686  Fax: +27-(0)11-622-8973|
|          WGS-84 : 26°10.52'S 28°06.19'E          |
|--------------------------------------------------|

1998\10\23@102516 by Peter L. Peres

On Fri, 23 Oct 1998, Tjaart van der Walt wrote:

> > > >  Note that hash-matching without comparing affects the
> > > > S/N ratio of the input (or the noise immunity and robustness of your
> > > > system).
-snipped-

> ...so it is back to the brute-force method of checking. I'll write a
> program to do it over a week-end or something...

Tjaart, if you could have either a probability of zero of slipping an error
string through, or about 1/2^31 at a lower computing cost, which one would
you choose, assuming that this is a non-military, non-life-supporting
application (for which PICs are taboo anyway)?

Peter

1998\10\23@143023 by Thomas McGahee

Tjaart,
In one of your messages you said:

> A few of the messages may look like this :
>
> Door 1 is Closed<CarriageReturn>
> Door 2 is Open<CarriageReturn>
> Oven Temperature is 200dC<CarriageReturn>
> No Operator Present<CarriageReturn>
> Danger : Boiler May Explode<CarriageReturn>

The way I would go about handling these sort of
text strings is to reduce each to its simplest
form that still contains all the relevant information.

For the examples given you could reduce the info
down to a single tokenized value and a numeric
section. The process would look something like this:

Reset the numerical register(s) (set to 00 or 0xff).
Reset the numeric pointer.
Then enter loop that does checks:
If Character is space, ignore (or use to create Hash Value).
If Character is Numeric, store (and update numeric pointer).
If Character is Alphabetic, use to create Hash Value.
If Character is <CarriageReturn> then PROCESS
Loop until done.

PROCESS
Hash Value is then used to vector to appropriate
routine. Routine can then extract any necessary
numerical data from the numerical register(s)
RETURN (or JUMP to main)

Note that the Hashing routine can be somewhat
simplified if only the first character (or first n
characters) of a word are included in the Hash.
You should always ensure that the Hash algorithm
results in a one-to-one correspondence.
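The scan loop described above can be sketched in C like this (untested; scan_line, NUM_MAX, the struct layout, and the 8-bit rotate/xor hash are all illustrative assumptions, not Fr. Tom's code):

```c
#include <stdint.h>

#define NUM_MAX 8   /* assumed size of the numeric capture buffer */

struct msg {
    uint8_t hash;             /* tokenized identity of the message */
    char    num[NUM_MAX + 1]; /* numeric field, as received        */
};

/* Classify each character up to <CarriageReturn>: spaces are ignored,
   digits are stored via the numeric pointer, alphabetic (and other)
   characters feed the hash.  On '\r' the caller would vector through
   m->hash to the appropriate routine (the PROCESS step). */
static void scan_line(const char *line, struct msg *m)
{
    int np = 0;               /* numeric pointer */
    m->hash = 0;
    for (; *line && *line != '\r'; line++) {
        char c = *line;
        if (c == ' ')
            continue;                           /* ignore spaces   */
        if (c >= '0' && c <= '9') {
            if (np < NUM_MAX)
                m->num[np++] = c;               /* store digits    */
        } else {
            m->hash ^= (uint8_t)c;              /* letters -> hash */
            m->hash = (uint8_t)(((m->hash << 1) | (m->hash >> 7)) & 0xFF);
        }
    }
    m->num[np] = '\0';
}
```

Because the digits go to the numeric buffer instead of the hash, "Door 1 is Closed" and "Door 2 is Closed" hash identically and can share one handler that reads the door number from num.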

Hope this helps.
Fr. Tom McGahee
