PICList Thread
'[PIC]:How to make a dictionary/notepad using PIC'
2001\04\09@235429 by linyichao

Hello,
       I am going to make a dictionary machine using a PIC micro. I plan to use a micro + LCD + EEPROM, but I don't know how to arrange an effective storage structure for the words and their definitions in the EEPROM. Could anyone shed light on this?

Thanks in advance,
Regards,
Martin Lin

2001\04\10@011601 by Jinx

> Hello,
> I am going to make a dictionary machine using a PIC micro. I plan
> to use a micro + LCD + EEPROM, but I don't know how to arrange an
> effective storage structure for the words and their definitions in
> the EEPROM. Could anyone shed light on this?

About a year ago there was a discussion about using a PIC to
compress text based on the frequency rankings of letters. This
will help you save EEPROM space and make searches faster.

Can't remember what the thread was called, and a quick look in the
archives didn't find it, sorry. It's there somewhere; perhaps one of
the participants is still around.


2001\04\10@030710 by James Newton

Which archives are you searching? <GRIN>

I entered "text compression" at
http://www.piclist.com/techref/postbot.asp
and in two clicks I was at:
www.piclist.com/techref/method/compress/embedded.htm
which is linked directly from
http://www.piclist.com/techref/microchip/memory.htm

James Newton, PICList Admin #3
jamesnewton@piclist.com
1-619-652-0593 phone
http://www.piclist.com

{Original Message removed}

2001\04\10@035847 by Roman Black

> > I am going to make a dictionary machine using a PIC micro. I plan
> > to use a micro + LCD + EEPROM, but I don't know how to arrange an
> > effective storage structure for the words and their definitions in
> > the EEPROM. Could anyone shed light on this?


It really depends on whether you need to compress new
data with the PIC. That will be difficult and slow
using EEPROM, due to the access time, the recursive
nature of compression techniques, and the limited
multiplication/division ability of PICs.

It will be MUCH faster if you can just store the
already-compressed data in the eeprom, then use a
lookup table of the most common strings.

Since you only need about 26+26+10+15 = 77
actual characters for text, each byte value can
represent one of another 179 symbols.

Assuming you compress the text on your PC, you
scan through all the text and find the 179 most
common strings. "_and" is a four-letter string,
and "_the" is a four-letter string that also fits
on the front of many words. But don't just compile
the 179 most common strings; you need to choose
them by weighting, i.e. length x frequency:

_common (7 chars, occurs 30 times):
7x30 = 210 chars become 30 symbols, so 180 bytes saved

_and (4 chars, occurs 50 times):
4x50 = 200 chars become 50 symbols, so 150 bytes saved

So even though _and occurs more often, the
_common string saves more total bytes and
needs to be weighted higher in your final
selection of the 179 symbols you will use.
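
As an illustration, here is one way that PC-side selection could
look (a hedged C sketch, not from the thread; the candidate strings,
sample text, and cutoff are all placeholders -- a real run would
rank thousands of candidates and keep the top 179):

/* Hypothetical table builder: score candidate strings by the bytes
   they save (occurrences x (length - 1)) and keep the best ones. */
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

struct cand { const char *s; int count; int saved; };

static int count_occurrences(const char *text, const char *pat)
{
    int n = 0;
    size_t plen = strlen(pat);
    for (const char *p = strstr(text, pat); p; p = strstr(p + plen, pat))
        n++;                          /* non-overlapping matches */
    return n;
}

static int by_saved_desc(const void *a, const void *b)
{
    return ((const struct cand *)b)->saved - ((const struct cand *)a)->saved;
}

int main(void)
{
    const char *text = " the cat and the dog and the common cold are common ";
    struct cand cands[] = {
        { " the",    0, 0 },
        { " and",    0, 0 },
        { " common", 0, 0 },
    };
    int ncands = sizeof cands / sizeof cands[0];

    for (int i = 0; i < ncands; i++) {
        cands[i].count = count_occurrences(text, cands[i].s);
        /* each occurrence collapses to a single symbol byte: */
        cands[i].saved = cands[i].count * ((int)strlen(cands[i].s) - 1);
    }
    qsort(cands, ncands, sizeof cands[0], by_saved_desc);

    for (int i = 0; i < ncands; i++)
        printf("%-10s count=%d saved=%d\n",
               cands[i].s, cands[i].count, cands[i].saved);
    return 0;
}
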
Solway covers this quite well in his "Bigtext"
software. This byte-symbol system is very fast
to decompress (perfect for a PIC) and can offer
excellent compression.
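
A minimal sketch of such a byte-symbol decompressor (in C for
illustration; this is not Solway's actual code). It assumes literals
stay plain ASCII below 0x80 and codes 0x80..0xFF index the string
table; remapping the ~77 real characters would free up the full 179
codes. The table entries are invented:

#include <stdio.h>

static const char *symtab[128] = {
    " the", " and", " common",        /* ...chosen offline by weighting */
};

static void decompress(const unsigned char *in, unsigned len)
{
    while (len--) {
        unsigned char c = *in++;
        if (c < 0x80)
            putchar(c);                      /* literal character */
        else if (symtab[c - 0x80])
            fputs(symtab[c - 0x80], stdout); /* expand string symbol */
    }
}

int main(void)
{
    const unsigned char msg[] = { 'I', 'n', 0x82, 0x81, 0x80, ' ', 'c', 'a', 't' };
    decompress(msg, sizeof msg);             /* prints "In common and the cat" */
    putchar('\n');
    return 0;
}
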
-Roman


2001\04\11@154506 by David Cary

Roman Black <fastvid@EZY.NET.AU> on 2001-04-10 02:56:18 AM
had some really good points and presented some good ideas.

Most data compression and decompression algorithms don't use any multiplies or
divides, so it's no big deal that PICs have slow multiplies.

The *fastest* data compression routine I've seen is RLE (which is useless on
English text).

The next fastest data compression routine I've seen is
 LZRW1A
 http://www.ross.net/compression/lzrw1a.html
(compression source code; decompression source code is also there),
but it needs a block of "history" RAM when decompressing. The bigger
this block, the better the compression. Does the PIC have enough RAM
for "adequate" compression?

Some other ``small'' data compression ideas were discussed on Deja (now Google
Groups)
under subject headings ``compression on a PDA'' and ``Compressing the Bible for
a PDA''.

http://groups.google.com/groups?q=%22Compressing+the+Bible+for+a+PDA%22&hl=en&lr=&safe=off&btnG=Google+Search&meta=site%3Dgroups%26group%3Dcomp.compression.*
http://groups.google.com/groups?q=%22compression+on+a+PDA%22&hl=en&lr=&safe=off&btnG=Google+Search&meta=site%3Dgroups%26group%3Dcomp.compression.*

(be sure to view the whole thread).

I hear that ``spell checkers'' compress their lists of spelling words
by storing a few bits indicating how many letters this word has in
common with the start of the previous word (zero for the very first
word, for the first word starting with B, etc.), a few bits indicating
how much to increment the next letter (unless the previous word didn't
have a next letter), followed by the rest of the word.
For example,
 ex
 fear
 feared
 fearing
 fed
 feed
(26 bytes, not counting end-of-word indicator)
would be encoded
 <0> e x
 <0> <1> e a r
 <4> e d
 <4> <4> ng
 <2> <3>
 <2> <1> d
(If you just put the letters into bytes and the numbers into nybbles,
rather than something more efficient, this gives 15 bytes, not
counting the end-of-word indicator.)

I was playing with a very simple compression algorithm based on the letter
frequencies in
 www.piclist.com/techref/method/compress/embedded.htm
and
 www.piclist.com/techref/method/compress/etxtfreq.htm
(which really needs an "up" link to
 www.piclist.com/techref/method/compress.htm
)
that decoded like this:

 100: space
 101: e
 110: t
 111: n
 0100: r
 0101: o
 0110: a
 0111: i
 00_xxxx_xxxx: all other (8-bit) letters.

I think I could decode this with a pretty compact program on a PIC.
I think I could even *encode* this with a small subroutine on the PIC.
It has the feature that I can encode *any* sequence of raw bytes, so I
can send funky control codes.
If I didn't need all possible 8-bit values,
I could combine this with Roman Black's idea
and further decode some of those ``unused'' 10-bit symbols
(obviously I don't need the ones for ``_etnroai'')
into letter pairs, triplets, or entire words.
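
One way to decode this on a PC for testing -- a hedged C sketch, not
from the thread; a PIC version would shift bits out the same way:

#include <stdio.h>

static const unsigned char *bits_src;
static unsigned bit_pos;              /* absolute bit index, MSB-first */

static int next_bit(void)
{
    int b = (bits_src[bit_pos >> 3] >> (7 - (bit_pos & 7))) & 1;
    bit_pos++;
    return b;
}

static int decode_char(void)
{
    if (next_bit()) {                 /* 1xx: space, e, t, n */
        static const char top[4] = { ' ', 'e', 't', 'n' };
        int i = next_bit() << 1;
        i |= next_bit();
        return top[i];
    }
    if (next_bit()) {                 /* 01xx: r, o, a, i */
        static const char mid[4] = { 'r', 'o', 'a', 'i' };
        int i = next_bit() << 1;
        i |= next_bit();
        return mid[i];
    }
    int c = 0;                        /* 00 + 8 raw bits: any other byte */
    for (int i = 0; i < 8; i++)
        c = (c << 1) | next_bit();
    return c;
}

int main(void)
{
    const unsigned char packed[] = { 0xF7, 0x60 };  /* "net": 111 101 110 */
    bits_src = packed;
    bit_pos = 0;
    for (int i = 0; i < 3; i++)
        putchar(decode_char());       /* prints "net" */
    putchar('\n');
    return 0;
}
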

True Morse code compresses almost as well as the above scheme -- if I use "1"
for dot, "01" for dash, "00" for end-of-letter, I get
00 : space
100 : e
0100 : t
01100 : n
101100 : r
01010100 : o
10100 : a
1100 : i
11100 : s
...
for the most common letters.

--
David Cary


2001\04\12@092002 by Roman Black

David Cary wrote:

{Quote hidden}

This is a clever system and would give easy encoding
and decoding, but will it really compress that much?
Consider: if the 8 most common characters you picked
account for 50% of the text, the other 50% needs 10 bits,
so 50x3.5 + 50x10 = 675 bits for 100 chars. So it is
6.75 bits/char, not really that good a compression.
Although it IS excellent for using a very small lookup
table, as most other decompressors require a large
table.

But how would you handle capitalisation, etc?
:o)
-Roman


2001\04\16@185455 by David Cary

Dear Roman Black,

Roman Black <fastvid@EZY.NET.AU> on 2001-04-12 08:17:41 AM replied,

...
>>   100: space
>>   101: e
>>   110: t
>>   111: n
>>   0100: r
>>   0101: o
>>   0110: a
>>   0111: i
>>   00_xxxx_xxxx: all other (8 bit) letters.
...
{Quote hidden}

Say I restrict myself to 7-bit characters (changing the last line to
00_xxx_xxxx).

If the text I'm compressing is mostly uppercase, then those 8
most-frequent letters will be uppercase. Capital A will be represented
by the code
 B'01_10' ; A
and lowercase a will be represented by
 B'00_110_0001' ; a

If the text I'm compressing is mostly lowercase, then
 B'00_100_0001' ; A
 B'01_10' ; a

The letter frequencies at
 www.piclist.com/techref/method/compress/embedded.htm
and
 www.piclist.com/techref/method/compress/etxtfreq.htm
seem to imply that the top 4 characters (including space) account for about 1/2
of typical English text.

Then 100 characters of average text compress to
 3*50 + 9*50 = 150 + 450 = 600 bits, i.e. 6.00 bits/char.

Um... not really that good, is it? Ah well -- the best possible
one-letter-at-a-time compression (Huffman compression) can't get any
better than about 4 bits/char on English text.

Other popular one-letter-at-a-time compression schemes are ``base 40''
and the ``Zork Standard Code for Information Interchange (ZSCII)'' (see
 rdrop.com/~cary/html/data_compression.html#short
for details). They pack 3 letters plus a flag bit into 2 bytes (about
5.3 bits/char, plus overhead to handle capital/lowercase shifts).
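
A sketch of that three-letters-in-two-bytes packing, loosely after the
Z-machine layout (three 5-bit characters under a top end-of-string
flag bit); the a-z-plus-space alphabet here is a simplification of the
real ZSCII alphabets and shift codes:

#include <stdio.h>
#include <stdint.h>

static int code5(char c)              /* 0 = space, 1..26 = a..z */
{
    return c == ' ' ? 0 : c - 'a' + 1;
}

static uint16_t pack3(const char *s, int last)
{
    return (uint16_t)((last ? 0x8000 : 0)   /* end-of-string flag */
                      | (code5(s[0]) << 10)
                      | (code5(s[1]) << 5)
                      |  code5(s[2]));
}

int main(void)
{
    /* "the pic" -> "the", " pi", "c  " : 6 bytes instead of 7 */
    uint16_t w[] = { pack3("the", 0), pack3(" pi", 0), pack3("c  ", 1) };
    for (int i = 0; i < 3; i++)
        printf("0x%04X%c", w[i], i == 2 ? '\n' : ' ');
    return 0;
}
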

The really good text compressors (1.5 bits/char or less) have codes that
decompress into entire words or substrings.

Have you looked at LZRW1a?
 http://www.ross.net/compression/lzrw1a.html
The program itself fits into a PIC easily. The standard size output buffer (16
KB) is a bit large for the PIC. I wonder how well it would compress if we
reduced its output buffer to, say, 256 bytes (using 64 bytes from each of the 4
banks of RAM in the PIC16F877) or even a mere 95 bytes (so the output buffer
fits into Bank 3 RAM). The *compression* code (before unrolling) takes less than
100 instructions in 68000 assembler (maybe twice that in PIC code ?). The
decompression code (before unrolling) takes less than 40 instructions in 68000
assembler (maybe twice that in PIC code ?).
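
Not LZRW1a itself, but a hedged C sketch of the same LZ77 family,
showing where that "history" RAM goes: a generic decoder with a ring
buffer whose size is exactly the knob in question. The token format --
a control byte of literal/copy flags, each copy a (distance, length)
byte pair -- is invented for illustration:

#include <stdio.h>

#define HIST 256   /* the history-size knob: smaller = less RAM, worse ratio */

static unsigned char hist[HIST];
static unsigned hpos;

static void emit(unsigned char c)
{
    putchar(c);
    hist[hpos] = c;                            /* remember for later copies */
    hpos = (hpos + 1) % HIST;
}

static void decode(const unsigned char *in, unsigned len)
{
    unsigned i = 0;
    while (i < len) {
        unsigned char ctrl = in[i++];          /* 8 literal/copy flags */
        for (int bit = 0; bit < 8 && i < len; bit++) {
            if (ctrl & (1u << bit)) {          /* copy: (distance, length) */
                unsigned dist = in[i++];
                unsigned cnt  = in[i++];
                unsigned from = (hpos + HIST - dist) % HIST;
                while (cnt--) {
                    emit(hist[from]);          /* overlap works LZ77-style */
                    from = (from + 1) % HIST;
                }
            } else {
                emit(in[i++]);                 /* literal byte */
            }
        }
    }
}

int main(void)
{
    /* ctrl 0x08: items 0-2 are literals, item 3 a copy (back 3, length 6) */
    const unsigned char in[] = { 0x08, 'a', 'b', 'c', 3, 6 };
    decode(in, sizeof in);                     /* prints "abcabcabc" */
    putchar('\n');
    return 0;
}
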

--
David Cary


2001\04\16@190642 by Bill Westfield

   >>   100: space
   >>   101: e
   >>   110: t
   >>   111: n
   >>   0100: r
   >>   0101: o
   >>   0110: a
   >>   0111: i
   >>   00_xxxx_xxxx: all other (8 bit) letters.
   ...
   >This is a clever system and would give easy encoding
   >and decoding, but will it really compress that much??

It might not do too badly.  One of the early compression programs (for
CP/M?) did something like this, only it built the table (a nice tree
with easy rules for creating it) dynamically, based on the content of
the actual file being compressed, and saved the table along with the
compressed data.  (All the more recent algorithms, starting with
LZ/LZW, do much better, since they examine multi-byte sequences as
well.)

BillW


2001\04\16@221503 by rich+piclist

FWIW, you can compress sorted word lists by encoding the number of
prefix characters shared with the previous word. That is, you can
compress the list:

apple
applicant
application
applied
apply

to

apple
\4icant
\7tion
\5ed
\4y

which looks like almost 50% right there.
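
The matching decoder is tiny, which is what makes the scheme
attractive on a small micro. A hedged C sketch, assuming each "\N"
count fits in one byte and each suffix is NUL-terminated:

#include <stdio.h>

static void expand(const unsigned char *in, unsigned len)
{
    char word[32] = "";
    unsigned i = 0;
    while (i < len) {
        unsigned keep = in[i++];         /* chars shared with previous word */
        unsigned j = keep;
        while (i < len && in[i] != '\0')
            word[j++] = in[i++];         /* splice on the new suffix */
        word[j] = '\0';
        i++;                             /* skip the terminator */
        puts(word);
    }
}

int main(void)
{
    const unsigned char list[] =
        "\x00" "apple\0" "\x04" "icant\0" "\x07" "tion\0" "\x05" "ed\0" "\x04" "y";
    expand(list, sizeof list);           /* prints the five words above */
    return 0;
}
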

There's even a GNU program to do this, the name of which unfortunately
my memory is too small to contain.


2001\04\17@052915 by Roman Black

David Cary wrote:

{Quote hidden}

I agree totally: good text compression needs to compress
strings. But since many words are common and repeated, you
can get excellent compression using common text strings that
occur in most English-language texts.

According to Solway you can get compression to 33% using
standard text strings, and better than 25% by building a
custom string table. From memory, he allowed one byte for
every symbol, which uses about 80 for all upper/lower case,
numerals, punctuation, then about 170 spare symbols for
the most "compressive" strings.

Being byte-based it decompresses very quickly, basically
a lookup-table entry for each character/string. Perfect for
decompression using a PIC, but I would want to compress it
on something faster...

I remember something about early testing of compression
systems for the Bible: since there are only about 20,000
different words in the book, it gave great compression just
to use a 2-byte var for every word. I think they ended up
using 128 one-byte values for the 128 most common words, and
32,000 2-byte values for all the other words and phrases.
That's compression to about 15%, and extremely simple
and quick for decompression and searching.
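
The exact format of that experiment isn't given here, so the layout in
this C sketch is assumed: tokens under 128 index the common-word
table, and a token with the high bit set combines with the next byte
into a 15-bit index into the big table. Table contents are stand-ins:

#include <stdio.h>

static const char *common[128] = { "the", "and", "of", /* ... */ };
static const char *big[32768]  = { "beginning", "created", /* ... */ };

static void decode(const unsigned char *in, unsigned len)
{
    unsigned i = 0;
    while (i < len) {
        if (in[i] < 128) {
            fputs(common[in[i]], stdout);            /* 1-byte token */
            i += 1;
        } else {
            unsigned idx = ((in[i] & 0x7F) << 8) | in[i + 1];
            fputs(big[idx], stdout);                 /* 2-byte, 15-bit index */
            i += 2;
        }
        putchar(' ');            /* assume a space between words */
    }
}

int main(void)
{
    /* "the beginning and the created " with the stand-in tables above */
    const unsigned char text[] = { 0, 0x80, 0x00, 1, 0, 0x80, 0x01 };
    decode(text, sizeof text);
    putchar('\n');
    return 0;
}
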
-Roman

