There are a series of problems when trying for text compression on an embedded controller:

- Can't have big tables. I'm trying for 64 or fewer entries. I can easily combine the upper and lower case letters with a shift code and I'm using some huristics to guess when shifting should happen with out needing to send the code: After a period, ! or ?, expect the next letter to be shifted; one code means shift just the next letter but; if you have to shift two letters in a row, stay shifted untill you see the shift code again. I havent really settled on a way to handle symbols but I'm thinking about haveing more than one table so that I can take advantage of the fact that numbers are more likely to be A) grouped, B) follow a $ or a % or #. Since I have to use 8 bits in the main table, and need only 6, I'll use the upper 2 to shift to a new table when we hit certain symbols. The symbole "table" will actually be the end of the full table (change the lookup offset).
- Can't have big code. I've come up with a nasty hack on Huffman encoding that (I think) will simplify the length encoding of the frequency translated data.

I wrote some QBASIC routines to test the proof of concept:

This one produces a variable length string of "1"s and "0"s that effeciently
encodes numbers from 0 to 31 with smaller numbers haveing shorter lengths.
The end of each string is marked with a "00x" or a "11x" and no other part
of one value will have this pattern. Basically the 00 or 11 encodes the second
bit of the value and the x is the LSb. The higher order bits are encoded
by a series of "01"s or "10"s that are equal in number to the bit values.
I know it sounds like this could be done more effeciently, but I could not
find a way that did not involve A) a very big chunk of code or table in the
PIC or B) a less than ideal distribution of lengths at the lower values.

FUNCTION LenEnc$ (in) REM PRINT #2, RIGHT$(" " + STR$(in), 5); " "; out$ = "" oldbit = INT((in + 2) / 4) MOD 2 out$ = RIGHT$(STR$(oldbit), 1) WHILE in > 3 IF oldbit = 0 THEN out$ = out$ + "1" oldbit = 1 ELSE out$ = out$ + "0" oldbit = 0 END IF in = in - 4 WEND IF oldbit = 0 THEN out$ = out$ + "0" ELSE out$ = out$ + "1" END IF out$ = out$ + RIGHT$(STR$(in MOD 2), 1) LenEnc$ = out$ END FUNCTION

This code produces the following encodings for 0 to 31:

0 000 1 001 2 110 3 111 4 1000 5 1001 6 0110 7 0111 8 01000 9 01001 10 10110 11 10111 12 101000 13 101001 14 010110 15 010111 16 0101000 17 0101001 18 1010110 19 1010111 20 10101000 21 10101001 22 01010110 23 01010111 24 010101000 25 010101001 26 101010110 27 101010111 28 1010101000 29 1010101001 30 0101010110 31 0101010111

If you look at the length of common huffman codes, this seems to compare favorably. I would like to find one that more closely matches the lengths to the actually frequency of the letters. According to a variety of sources Space, E, and T are from 10% to 15% of common english, then N, R, O, A, and I are about 7.5% followed by a steady decline from S at 6.1% through D at 4.2% to H at 3.4%. After this L, H, C, F, P, U and M are all about 2.5% to 3.5%. If you graph that data,

*************** E************* T********* N******** R*******. O*******. A*******. I*******. S****** D**** L***. H***. C*** F*** P*** U**. M**. Y** G*. W*. V*. B* X. Q. K J Z

You can see that the pattern of 4, 3 bit codes followed by 4, 4 bit codes, etc... matches pretty darn well.

The decoding routine looks like heck in QBASIC, but once I get it into the PIC it should be very small.

FUNCTION LenDec (bits$) FOR i = 1 TO LEN(bits$) IF lastbit THEN accum = accum + VAL(MID$(bits$, i, 1)) bits$ = MID$(bits$, i + 1) LenDec = accum lastbit = false firstbit = true EXIT FUNCTION ELSE IF firstbit THEN firstbit = false 'not anymore accum = 0 ELSE IF oldbit = VAL(MID$(bits$, i, 1)) THEN lastbit = true accum = accum + (VAL(MID$(bits$, i, 1)) * 2) ELSE accum = accum + 4 END IF 'delta bit END IF 'first bit oldbit = VAL(MID$(bits$, i, 1)) END IF 'last bit NEXT i END FUNCTION

Anyway, the basic idea is:

- normalize everything to 32 values by inserting the shift code or table change codes
- Lookup each value in the frequency table to get the table index
- length encode the value
- ship it to the PIC
- length decode the value
- look it up in the table, minding the table shift codes
- reverse the shifting and
- output the byte

Comments:

- /techref/method/compress/etxtfreq.htm english text frequencies (letter frequencies, pair frequencies, most common words, etc.)+

Code:

file: /Techref/method/compress/embedded.htm, 5KB, , updated: 2016/12/13 07:48, local time: 2024/4/14 01:56, |

©2024 These pages are served without commercial sponsorship. (No popup ads, etc...).Bandwidth abuse increases hosting cost forcing sponsorship or shutdown. This server aggressively defends against automated copying for any reason including offline viewing, duplication, etc... Please respect this requirement and DO NOT RIP THIS SITE. Questions?<A HREF="http://www.piclist.com/techref/method/compress/embedded.htm"> Text compression for embedded controllers</A> |

Did you find what you needed? |

PICList 2024 contributors:
o List host: MIT, Site host massmind.org, Top posters @none found - Page Editors: James Newton, David Cary, and YOU!
* Roman Black of Black Robotics donates from sales of Linistep stepper controller kits. * Ashley Roll of Digital Nemesis donates from sales of RCL-1 RS232 to TTL converters. * Monthly Subscribers: Gregg Rew. on-going support is MOST appreciated!
* Contributors: Richard Seriani, Sr. |

## Welcome to www.piclist.com! |

.