PICList Thread
'Text companding / remote updates via email'
1999\11\16@184824 by eplus1

Has anyone had experience or have ideas about compressing / decompressing
English text with a PIC?

I have an application that would benefit from having the PC that is sending
data to the PIC compress text and then having the PIC expand it. The text is
just standard English words, phrases, sentences etc...

To make things more complex, the compressed text cannot use non-printable
characters, i.e. we get only 96 possible values per byte: one of the
following patterns in each byte (3 x 2^5 = 96).

001x xxxx
010x xxxx
011x xxxx

BCD numbers pack into this fairly nicely, 9 x 9 = 81 < 96 so each byte still
holds two BCD digits. But 26 letters plus shift plus space don't seem to map
well. <GRIN> (this is, of course, because they map perfectly) I was thinking
that some Huffman compression system may exist that works well here? Easy to
de-compress on a PIC is important as well.

If you have good text compression / expansion code that produces binary, I
can translate that into printable only but thought I would mention it on the
chance that someone else has hit this before.

If this sounds like a really weird application, think about this: if you
wanted to make a device that could be updated via an email.... without
running an application on the remote email users machine... just send the
email and tell the user to connect the device and copy the email out...
Anyone who has struggled with update files that get munched in transit or
when the user on the other end can't figure out how to run your app or
winzip (or doesn't want to or can't etc...) will see the utility here. And,
of course, the smaller the email the better...

James Newton jamesnewton@geocities.com phone:1-619-652-0593
http://techref.homepage.com NOW OPEN (R/O) TO NON-MEMBERS!
Members can add private/public comments/pages ($0 TANSTAAFL web hosting)
PIC/PICList FAQ: http://204.210.50.240/techref/default.asp?url=piclist.htm

1999\11\16@191511 by Wagner Lipnharski

James Newton wrote:
{Quote hidden}

Hey James... :)  why in the hell are you saying "English" text?... just
because French, Portuguese, Spanish and a few others <g> use lots of
accented characters (ã, ç, é and the like) and some other nice chars?
hehe,  remember that the byte code 0111-1111 could be used to flag the
next 8 bits as a non-compressed character... so you could use all the
rest of the 256 possible character codes. Obviously for each "special"
character out of your "English" <g> 95 chars, you will spend 13 bits...
but I am sure it will still compress a lot.  IBM used 6-bit character
codes in some old machines; at that time, a bit was worth gold... :)...
and by the way, you forgot the 000x-xxxx combination, you can go up to
128 combinations, not only 96... :)... (4 x 2^5)...

When you said "English" text, I thought you would talk about databases of
words and letter groups... easy to do... using a byte code such as "FFh"
as a flag, you can use the next two bytes to index a database (a table of
words or groups of letters) ordered from the most used in that specific
language. So there is no compression at all, just fast indexing.  You can
find statistics on the 1000 most used words in English by region, etc.
For example, the word "combination" could be just "FF 05 93", 3 bytes...
you could define FE as a flag for packets that use 3-byte coding, and so
on... think about it.

Wagner

1999\11\16@221114 by Peter Tiang

You can try using Huffman encoding.
The idea is to represent the highest-occurring characters
with the least number of bits.

The highest-occurring characters for English text would be R,S,T,L,N etc.


Regards,
Peter Tiang

{Original Message removed}

1999\11\17@043029 by Russell McMahon

>You can try using Huffman encoding.
>The idea is to represent the highest occurring characters
>with least number of bits.
>Highest occurring characters for English text would be R,S,T,L,N etc.

The most common character in the English language is " " :-)
This is worth noting when you are using Huffman coding or any other code
which depends on symbol frequency.
"e" features very high also.

Look at apps like eg PKZIP which has a number of algorithms depending on
the structure of the source data.

Consider using fewer bits per character and packing the "words" into bytes.
Baudot teletypes used 5 bits (32 symbols) which allows 26 letters plus 2
shift characters (shift to set A, shift to set B) giving a total of 48
working characters.
Baudot didn't use upper/lower case AFAIR but there is enough room in 48
chars to do so. This becomes perhaps inefficient IF you have lots of mixed
case eg "McMahon" :-) but is good for normal capitalisation.

eg    "Hello, my name is Russell."
= 26 characters = 26 bytes normally. Using the above scheme, where > <
marks a shift into the second set (capitals and punctuation etc), this
becomes

= ">h<ello my name is >r<ussell>.<"

=  31 x 5 bits = 155 bits, which rounds up to 20 bytes
Actually, I'm a bit scrambled here as I haven't well thought out the upper
case handling. Still, some indication of modest gains possible.
Max poss = 37.5% less than raw text.
Not good enough.
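The bit-packing half of Russell's scheme is at least easy on the PC side. A minimal C sketch (pack5/unpack5 are hypothetical names; symbol values 0..31 from a Baudot-style table are assumed):

```c
#include <stdint.h>
#include <string.h>

/* Pack n 5-bit symbols (values 0..31) into bytes, MSB first.
   Returns the number of bytes used. */
int pack5(const uint8_t *sym, int n, uint8_t *out) {
    int bitpos = 0;
    memset(out, 0, (n * 5 + 7) / 8);
    for (int i = 0; i < n; i++)
        for (int b = 4; b >= 0; b--) {
            if (sym[i] & (1 << b))
                out[bitpos / 8] |= (uint8_t)(0x80 >> (bitpos % 8));
            bitpos++;
        }
    return (bitpos + 7) / 8;
}

/* Reverse: read nsym 5-bit symbols back out of the packed buffer. */
void unpack5(const uint8_t *in, int nsym, uint8_t *sym) {
    int bitpos = 0;
    for (int i = 0; i < nsym; i++) {
        sym[i] = 0;
        for (int b = 4; b >= 0; b--) {
            if (in[bitpos / 8] & (0x80 >> (bitpos % 8)))
                sym[i] |= (uint8_t)(1 << b);
            bitpos++;
        }
    }
}
```

31 symbols pack into (31 x 5 + 7) / 8 = 20 bytes, matching the count in the example above.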

I think that about 2:1 is achievable without really fancy efforts.

Try Huffman.

     Russell McMahon
_____________________________

From other worlds - http://www.easttimor.com
                               http://www.sudan.com

What can one man* do?
Help the hungry at no cost to yourself!
at  http://www.thehungersite.com/

(* - or woman, child or internet enabled intelligent entity :-))

1999\11\17@084132 by Jim Hartmann

Buy "The Data Compression Book" by Mark Nelson.  Excellent book on the
subject and includes source code and disk.

1999\11\17@103551 by eplus1

<BLOCKQUOTE Author="Wagner Lipnharski">why in the hell are you saying
"English" text?... just because French, Portuguese, Spanish and a few others
<g> use lots of accented characters (ã, ç, é and the like) and some other
nice chars? hehe.</BLOCKQUOTE>

Yup! I do be a raving redneck and y'all damn ferners better start a'learnin
'Merican talk!
No.no.no.NO.NNNOOOOOO! Sheesh! You have to be P.C. everywhere nowadays! I
only said it was "standard English text" so that all would understand what
the frequency of characters was likely to be. An accented character is much
less likely than e (or space) in this application.

<BLOCKQUOTE Author="Wagner Lipnharski">
and by the way, you forgot the 000x-xxxx combination, you can go up to 128
combinations, not only 96... :)... (4 x 2^5)...</BLOCKQUOTE>

Nope. 000x-xxxx is not printable. Those are ASCII control codes, and 00011010
in particular will cause all kinds of hell if the customer does a
COPY email.txt COM1
instead of a
COPY email.txt COM1 /B
I won't even get into email systems and the null character.... It really has
to be 3 * 2^5.

<BLOCKQUOTE Author="Wagner Lipnharski">you can use the next two bytes to
index a data base (table of words or groups of letters) from the most used
in that specific language. </BLOCKQUOTE>

Certainly I could get good compression by tokenizing each word against the
dictionary, but would I not have to store the dictionary in the remote
device for de-compression? I was hoping for some sort of math transform
that would result in characters being "reconstituted" on the PIC end without
any need for a lookup table, or at least with only a small lookup table.

Maybe I can find patterns in the statistical data on letter frequency, like
for example "the most common characters have a pattern of 0xx1 010x so
compress 0xx1 010x and 0yy1 010y into 00xx xyyy."

Any other thoughts?

James Newton jamesnewton@geocities.com phone:1-619-652-0593
http://techref.homepage.com NOW OPEN (R/O) TO NON-MEMBERS!
Members can add private/public comments/pages ($0 TANSTAAFL web hosting)
PIC/PICList FAQ: http://204.210.50.240/techref/default.asp?url=piclist.htm

1999\11\17@111612 by Tracy Smith

--- James Newton <eplus1@san.rr.com> wrote:
{Quote hidden}

Actually, two BCD characters won't map. There are ten
digits in BCD, not nine.

> perfectly) I was thinking
> that some Huffman compression system may exist that
> works well here? Easy to
> de-compress on a PIC is important as well.

Huffman is good (perfect?) except for one tiny
problem: memory. I presume if you chose to implement a
static table then perhaps it'd work. But then it would
no longer be "optimal". However, if you know your
characters and their distribution then perhaps a
static table would work just fine.

How many characters do you "need"? I recall something
I believe Andy Warren posted a long time ago (where's
Andy?) that may be applicable here. It wasn't really
compression per se, but it made a more optimal use of
the available bits. Basically it involved sending
three characters in two bytes. Obviously you can't
send three 8-bit chunks in 16-bits, so you have to
reduce the character set size. The optimal character
set size is the largest cube that fits into 16-bits.
This happens to be 40: 40^3 = 64000 which is less than
2^16. This would give you the 26 letters of the
alphabet (only upper case or lower case), the numbers
0-9, and four punctuation characters of your choice.

The encoding and decoding process is essentially an
exercise in changing bases. For example, if you mapped
your 40 characters into the numbers 0-39, then you'd
encode them by:

x = a1 * 40^2 + a2 * 40^1 + a3 * 40^0
 = a1 * 1600 + a2 * 40 + a3

Where a1, a2, and a3 are the three encoded characters.
The decoding process goes like:

a1 = x div 40^2
x = x - a1 * 40^2
a2 = x div 40
x = x - a2 * 40
a3 = x

(where div is integer division)

Like I said, this is not compression, it's a
re-arrangement of sorts.
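In C, the pack/unpack pair is just a couple of multiplies and divides (pack40/unpack40 are hypothetical names; a1 is the most significant character):

```c
#include <stdint.h>

/* Pack three base-40 "characters" (values 0..39) into one 16-bit word.
   40^3 = 64000 fits in 2^16 = 65536. */
uint16_t pack40(int a1, int a2, int a3) {
    return (uint16_t)(a1 * 1600 + a2 * 40 + a3);   /* 40^2 = 1600 */
}

/* Unpack: integer-divide out the digits, most significant first. */
void unpack40(uint16_t x, int *a1, int *a2, int *a3) {
    *a1 = x / 1600;
    x  -= (uint16_t)(*a1 * 1600);
    *a2 = x / 40;
    *a3 = x - *a2 * 40;
}
```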

Now in your application, you don't have the full 16
bits to work with. In fact, you only have:
log2(96) = 6.58 bits
or about 6 and a half bits.

If you concatenate two of your "bytes" you get about
13 bits, which could encode 3 characters from a set of
20 characters. If you concatenate three bytes, you'd
have about 19.7 bits which would allow you to encode 4
characters from a set of 30.

So going back to my question, "How many characters do
you need?" Call this K. The number of characters n
encoded into m bytes is:

K^n <= 96^m

or

 m >= n*log(K)/log(96)

m and n are integers, so you have a simple Diophantine
equation.

As a couple of examples:

K=20: (we know the answer from above)

has the solutions
m=2,n=3
m=3,n=4
m=4,n=5
m=4,n=6

K=30:
m=3,n=3
m=3,n=4
m=4,n=5
m=5,n=6

K=40:
m=3,n=3
m=4,n=4
m=5,n=5
m=5,n=6

As you can see, when K approaches 96 there are
diminishing returns.

For Andy's case, you have:

m >= n*log(40)/log(2^16)

m=1,n=1
m=1,n=2
m=1,n=3
m=2,n=4
m=2,n=5
m=2,n=6

Where m is the number of 16-bit words and n is the
number of base 40 characters.

Neat.

.lo


1999\11\17@115649 by Dan Creagan

It has been said once in this thread, but I would like to weigh in also.
"The Data Compression Book" by Mark Nelson, ISBN 1558514341.  I have the
first edition (not the ISBN that I just quoted) and it is invaluable.
Whatever we say here, we cannot match the richness and breadth that Mark
gives you. The book is fairly cheap ($37 or so USD from Amazon) and it has a
disk of C code routines (mine was 5 1/4 - yours will be 3 1/2).  You should
be able to figure out a compression technique that optimizes your situation
from this text.  Put the book on the shelf next to your algorithm book. Like
the algorithm book, you won't use it every day, but it will never go out of
date and it will eventually pay back its price many fold.

Dan


{Original Message removed}

1999\11\17@134425 by eplus1

<BLOCKQUOTE Author="James Newton">BCD numbers pack into this fairly nicely,
9 x 9 = 81 < 96 = 3*2^5 </BLOCKQUOTE>
<BLOCKQUOTE Author="Tracy Smith">Actually, two BCD characters won't map.
There are ten digits in BCD, not nine. </BLOCKQUOTE>

AAARRRGGGHHHH! I can't believe I did that! Tried to get 0-99 out of 0-96.
That's what I get for trying to work when I already have a headache. Thanks
for saving me from a time wasting brain fart.

<BLOCKQUOTE Author="Tracy Smith">  m >= n*log(K)/log(96) </BLOCKQUOTE>

If I understand this correctly, you are rounding the result (the number of
bytes required) up so for my digit packing problem:

K=10
n=2, m=2 (1.009)
n=3, m=2 (1.513)
n=4, m=3 (2.017)
n=5, m=3 (2.522)
n=6, m=4 (3.026)
n=7, m=4 (3.531)

The good n's are the ones that result in the m's having the largest possible
fractional portion. In this case, packing 2 or 4 decimal digits would be a
bad idea since this would waste most of the last byte.

Can this also be used to calculate the number of bytes required to store
some number of bits? For example, one digit from a character set of 2^15 =
32768

K=32768
n=1, m=3 (2.278)

K=65536
n=1, m=3 (2.430)

Since I happen to know that the vast majority of the numbers I will have to
transmit will be less than 96, and in fact in the range -47 to +48, I will
probably use one byte per number and just multiply values together to
"escape" for larger numbers when needed. I will have a control byte with 2
free bits (range 01 to 11) that I can use to specify how many value bytes
are coming. Maybe I can combine different types of values to get a better
pack:

01; small signed           1 byte  -47 to  +48
10; small fraction         1 byte    0 to  .95
11; large unsigned         2 bytes   0 to 9215

so a -99999.9 would be a 11 times a 01 followed by a 10. A -.99 would be a
01 times a 10.

<BLOCKQUOTE Author="Tracy Smith">If you concatenate three bytes, you'd have
about 19.7 bits which would allow you to encode 4 characters from a set of
30. So going back to my question, "How many characters do you need?"
</BLOCKQUOTE>

I actually need all 96 printable characters, but the point is that some of
these 96 will be used more often than others.  I was hoping for
some sort of math transform that would result in characters being
"reconstituted" on the PIC end without any need for a lookup table, or at
least with only a small lookup table.

Maybe I can find patterns in the statistical data on letter frequency, like
for example "the most common characters have a pattern of 0xx1 010x so
compress 0xx1 010x and 0yy1 010y into 00xx xyyy and use 010x xxxx for the
next most common 32 and 011x xxxx except for 0111 1111 for another 31 common
symbols then use 0111 1111 and a second byte for the other 25 uncommonly
used symbols." If this was possible then a lot of text would pack 2:1 with
some at 1:1 and a few at 1:2.

James Newton jamesnewton@geocities.com phone:1-619-652-0593
http://techref.homepage.com NOW OPEN (R/O) TO NON-MEMBERS!
Members can add private/public comments/pages ($0 TANSTAAFL web hosting)
PIC/PICList FAQ: http://204.210.50.240/techref/default.asp?url=piclist.htm

1999\11\17@142548 by eplus1

<BLOCKQUOTE Author="Peter Tiang">You can try using Huffman
encoding.</BLOCKQUOTE>

Thanks for the suggestion. The only problem with Huffman is that a table
must be defined at the remote end to decompress the text. In the limited
memory of a PIC this may be prohibitive.

I was hoping for some sort of math transform that would result in
characters being "reconstituted" on the PIC end without any need for a
lookup table, or at least with only a small lookup table.

Maybe I can find patterns in the statistical data on letter frequency, like
for example "the most common characters have a pattern of 0xx1 010x so
compress 0xx1 010x and 0yy1 010y into 00xx xyyy and use 010x xxxx for the
next most common 32 and 011x xxxx except for 0111 1111 for another 31 common
symbols then use 0111 1111 and a second byte for the other 25 uncommonly
used symbols." If this was possible then a lot of text would pack 2:1 with
some at 1:1 and a few at 1:2.

BTW there is an awesome page on Huffman with a Java animation of a file
being analyzed, frequency tree growth and an encoding table being built at:

http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/huffman.html

James Newton jamesnewton@geocities.com phone:1-619-652-0593
http://techref.homepage.com NOW OPEN (R/O) TO NON-MEMBERS!
Members can add private/public comments/pages ($0 TANSTAAFL web hosting)
PIC/PICList FAQ: http://204.210.50.240/techref/default.asp?url=piclist.htm

1999\11\17@175117 by Harold M Hallikainen

On Wed, 17 Nov 1999 11:41:10 +0800 Peter Tiang <petertiang@PD.JARING.MY>
writes:
>You can try using Huffman encoding.
>The idea is to represent the highest-occurring characters
>with the least number of bits.
>
>The highest-occurring characters for English text would be R,S,T,L,N etc.
>

       Could also do MNP compression, which, I believe, is included in
all current modems (MNP or similar compression).

Harold


Harold Hallikainen
harold@hallikainen.com
Hallikainen & Friends, Inc.
See the FCC Rules at http://hallikainen.com/FccRules and comments filed
in LPFM proceeding at http://hallikainen.com/lpfm


1999\11\23@110250 by jamesnewton
Wow! Russell McMahon, Wagner Lipnharski, Peter Tiang, Jim Hartmann, Dan
Creagan and especially Dave V and Scott Dattalo, thanks for the great info!
Special thanks to Tracy Smith for saving me from myself. I haven't taken the
time to check up on MNP compression as suggested by Harold Hallikainen but I
will.

I've got a system close to working. This is a big post; if you are not
interested in text compression / expansion for a PIC - sorry! Drop me a note
and I'll not post such big things in the future.

There are a series of problems when trying for text compression on an
embedded controller:

1. Can't have big tables. I'm trying for 64 or fewer entries. I can easily
combine the upper and lower case letters with a shift code, and I'm using
some heuristics to guess when shifting should happen without needing to
send the code: after a period, ! or ?, expect the next letter to be shifted;
one code means shift just the next letter; but if you have to shift two
letters in a row, stay shifted until you see the shift code again. I haven't
really settled on a way to handle symbols, but I'm thinking about having
more than one table so that I can take advantage of the fact that numbers
are more likely to be A) grouped, B) following a $, % or #. Since I have to
use 8 bits in the main table, and need only 6, I'll use the upper 2 to shift
to a new table when we hit certain symbols. The symbol "table" will actually
be the end of the full table (change the lookup offset).
2. Can't have big code. I've come up with a nasty hack on Huffman encoding
that (I think) will simplify the length encoding of the frequency-translated
data.

I wrote some QBASIC routines to test the proof of concept:
This one produces a variable-length string of "1"s and "0"s that efficiently
encodes numbers from 0 to 31, with smaller numbers having shorter lengths.
The end of each string is marked with a "00x" or a "11x" and no other part
of one value will have this pattern. Basically the 00 or 11 encodes the
second bit of the value and the x is the LSb. The higher-order bits are
encoded by a series of "01"s or "10"s that are equal in number to the bit
values. I know it sounds like this could be done more efficiently, but I
could not find a way that did not involve A) a very big chunk of code or
table in the PIC or B) a less than ideal distribution of lengths at the
lower values.

FUNCTION LenEnc$ (in)
REM PRINT #2, RIGHT$("   " + STR$(in), 5); " ";
out$ = ""
oldbit = INT((in + 2) / 4) MOD 2
out$ = RIGHT$(STR$(oldbit), 1)
WHILE in > 3
 IF oldbit = 0 THEN
  out$ = out$ + "1"

  oldbit = 1
 ELSE
  out$ = out$ + "0"

  oldbit = 0
  END IF
 in = in - 4
 WEND
IF oldbit = 0 THEN
 out$ = out$ + "0"
ELSE
 out$ = out$ + "1"
END IF
out$ = out$ + RIGHT$(STR$(in MOD 2), 1)

LenEnc$ = out$

END FUNCTION

This code produces the following encodings for 0 to 31
 0 000
 1 001
 2 110
 3 111
 4 1000
 5 1001
 6 0110
 7 0111
 8 01000
 9 01001
10 10110
11 10111
12 101000
13 101001
14 010110
15 010111
16 0101000
17 0101001
18 1010110
19 1010111
20 10101000
21 10101001
22 01010110
23 01010111
24 010101000
25 010101001
26 101010110
27 101010111
28 1010101000
29 1010101001
30 0101010110
31 0101010111
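For reference, the QBASIC encoder above ports to C almost line for line (a sketch; the caller is assumed to supply a buffer of at least 12 chars):

```c
/* C port of LenEnc$: emit the variable-length code for in (0..31)
   as a string of '0'/'1' characters into out. */
void len_enc(int in, char *out) {
    int n = 0;
    int oldbit = ((in + 2) / 4) % 2;
    out[n++] = (char)('0' + oldbit);
    while (in > 3) {                   /* each alternation encodes 4 */
        oldbit = !oldbit;
        out[n++] = (char)('0' + oldbit);
        in -= 4;
    }
    out[n++] = (char)('0' + oldbit);   /* repeated bit marks the end */
    out[n++] = (char)('0' + in % 2);   /* LSb */
    out[n] = '\0';
}
```

Running it for 0..31 reproduces the table above (e.g. len_enc(14, buf) yields "010110").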

If you look at the length of common Huffman codes, this seems to compare
favorably. I would like to find one that more closely matches the lengths to
the actual frequency of the letters. According to a variety of sources
I are about 7.5% followed by a steady decline from S at 6.1% through D at
4.2% to H at 3.4%. After this L, H, C, F, P, U and M are all about 2.5% to
3.5%. If you graph that data,

***************
E*************
T*********
N********
R*******.
O*******.
A*******.
I*******.
S******
D****
L***.
H***.
C***
F***
P***
U**.
M**.
Y**
G*.
W*.
V*.
B*
X.
Q.
K
J
Z

You can see that the pattern of four 3-bit codes followed by four 4-bit
codes, etc... matches pretty darn well.

The decoding routine looks like heck in QBASIC, but once I get it into the
PIC it should be very small.
FUNCTION LenDec (bits$)
' Assumes module-level CONSTs true = -1 and false = 0, and that
' firstbit is initialized to true before the first call.
STATIC lastbit, firstbit, oldbit, accum
FOR i = 1 TO LEN(bits$)
 IF lastbit THEN
  accum = accum + VAL(MID$(bits$, i, 1))
  bits$ = MID$(bits$, i + 1)
  LenDec = accum
  lastbit = false
  firstbit = true
  EXIT FUNCTION
 ELSE
  IF firstbit THEN
   firstbit = false 'not anymore
   accum = 0
  ELSE
   IF oldbit = VAL(MID$(bits$, i, 1)) THEN
    lastbit = true
    accum = accum + (VAL(MID$(bits$, i, 1)) * 2)
   ELSE
    accum = accum + 4
    END IF 'delta bit
   END IF 'first bit
  oldbit = VAL(MID$(bits$, i, 1))
  END IF 'last bit
 NEXT i
END FUNCTION
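The decoder reads more naturally in C with an explicit cursor in place of the QBASIC shared state (a sketch; len_dec is a hypothetical name):

```c
/* Decode one value from a string of '0'/'1' characters starting at *pos;
   advances *pos past the code. Inverse of the LenEnc$ encoding above. */
int len_dec(const char *bits, int *pos) {
    int accum = 0;
    int oldbit = bits[(*pos)++] - '0';
    for (;;) {
        int b = bits[(*pos)++] - '0';
        if (b == oldbit)                       /* repeated bit: end marker */
            return accum + b * 2 + (bits[(*pos)++] - '0');
        accum += 4;                            /* each alternation adds 4 */
        oldbit = b;
    }
}
```

Because the "00x"/"11x" terminator never appears mid-code, successive codes can be decoded straight out of one concatenated bit string.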

Anyway, the basic idea is:
1. Normalize everything to 32 values by inserting the shift code or table
change codes
2. Look up each value in the frequency table to get the table index
3. Length-encode the value
4. Ship it to the PIC
5. Length-decode the value
6. Look it up in the table, minding the table shift codes
7. Reverse the shifting and
8. Output the byte

Any comments, advice, ideas, etc.. very much appreciated. I have devoted a
page to this at:
http://204.210.50.240/techref/default.asp?url=method/compress/embedded.htm

James Newton jamesnewton@geocities.com phone:1-619-652-0593
http://techref.homepage.com NOW OPEN (R/O) TO NON-MEMBERS!
Members can add private/public comments/pages ($0 TANSTAAFL web hosting)
PIC/PICList FAQ: http://www.piclist.com


'Text companding / remote updates via email'
1999\12\23@151354 by jamesnewton
Wagner is off-list so I have to throw out a challenge to anyone bored over
the holidays:

Has anyone out there tried to encode binary data to text for transmission
through text-only systems (like uuencode, binhex, base64, quoted-printable,
etc...) using the entire 96 ASCII printable characters rather than the 64
(or fewer) that are used in these systems?

If you think of a string of bytes as a base 256 number of multiple digits
(each digit has a value from 0 to 255 and there are as many digits as
bytes), and you modulo that entire number by 96 then integer divide it by
96, repeating until it is zero and recording the modulo each time, you have
a string of base 96 numbers that you can transmit to the remote system.
When the entire string is received, you can multiply the last base 96 digit
by 96 and add the next, then multiply that result by 96 and add the next,
until all the base 96 digits are used, and then you will find that you have
the original number or string of bytes.

The point of this is that your "mime" encoding will increase the data size
by less if you use 37.5% of the possible values rather than just 25% of
them. The increase should be about 12.5% less, no? Well... no, because on a
PIC (and everywhere else to some degree) there is a practical limit to how
large the number can be. By my calculations, limiting yourself to 16 bytes
at a time drops the improvement down to about 7%.

I had hoped to avoid that by developing a system for doing it on the fly,
knowing that a base 256 digit requires 8 bits and a base 96 digit requires
6.584962500721156 bits. If I put out accum mod 96 and set accum to int(accum
/ 96) then I have reduced the amount of data in the accum by 6.58... bits,
and when I have less than that amount in the accum I need to get the next
byte of source data, set accum to accum * 256 + the byte, and increase the
variable tracking the amount of data being carried by 8. I never have to
accumulate the entire input string. This produces a valid stream of data...
I've verified that. However, I just can't wrap my brain around a way to do
the same thing without having to accumulate the entire source data string
prior to outputting it on the remote end.
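The 16-bytes-at-a-time version can be sketched with plain long division. This is my reading of the scheme, not code from the post: each fixed 16-byte chunk becomes a fixed 20 printable digits (96^20 > 2^128, so nothing is lost, and the fixed sizes sidestep the leading-zero problem); the 0x20 offset for digit-to-character mapping is an assumption.

```c
#include <string.h>

#define CHUNK 16   /* bytes per chunk */
#define DIGITS 20  /* base-96 digits: 96^20 > 2^128 */

/* Encode CHUNK bytes into DIGITS printable characters (0x20..0x7F). */
void b96_encode(const unsigned char *in, char *out) {
    unsigned char num[CHUNK];
    memcpy(num, in, CHUNK);
    for (int d = DIGITS - 1; d >= 0; d--) {
        unsigned rem = 0;
        for (int i = 0; i < CHUNK; i++) {   /* long division of num by 96 */
            unsigned cur = rem * 256u + num[i];
            num[i] = (unsigned char)(cur / 96);
            rem = cur % 96;
        }
        out[d] = (char)(' ' + rem);         /* least significant digit last */
    }
    out[DIGITS] = '\0';
}

/* Decode DIGITS characters back into CHUNK bytes (multiply-accumulate). */
void b96_decode(const char *in, unsigned char *out) {
    memset(out, 0, CHUNK);
    for (int d = 0; d < DIGITS; d++) {
        unsigned carry = (unsigned)(in[d] - ' ');
        for (int i = CHUNK - 1; i >= 0; i--) { /* out = out * 96 + digit */
            unsigned cur = out[i] * 96u + carry;
            out[i] = (unsigned char)(cur & 0xFF);
            carry = cur >> 8;
        }
    }
}
```

20 digits for 16 bytes is 25% expansion, versus 33% for Base64, which is roughly where the "about 7%" figure above comes from.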

Questions:
A) Can this be done?
B) If not, why not?
C) If it can, is there any other reason besides extra work that you wouldn't
want to?

James Newton jamesnewton@geocities.com phone:1-619-652-0593
http://techref.homepage.com NOW OPEN (R/O) TO NON-MEMBERS!
Members can add private/public comments/pages ($0 TANSTAAFL web hosting)

1999\12\29@150343 by James Newton

Thanks for all the answers to the question I asked about:
<BLOCKQUOTE AUTHOR="James Newton">
Has anyone out there tried to encode binary data to text for transmission
through text only systems (like uuencode, binhex, base64, quote-printable,
etc...) using the entire 96 ASCII printable characters rather than the 64
(or less) that are used in these system?
</BLOCKQUOTE>

The answer is basically "NO, AND YOU DON'T WANT TO", but Adobe does use (and
document) a Base85 system that packs 4 bytes into 5 rather than the 3 into 4
achieved by the Base64 system.

None of the pages I could find on the internet covered the technical
details and selection criteria for the current encoding systems, so I've
compiled a page about this at:
http://204.210.50.240/techref/default.asp?url=method/encode.htm

I've received some horror stories about data loss due to email gateways
dropping or changing less common printable symbols, including some notes on
adobe.com techsupport regarding the need to send pdfs via Base64 when using
AOL (the biggest ISP in the country)!

I will stick with something like Base64 using 64 symbols from the ranges:
0...9, A...Z, a...z and 1 or 2 other common symbols like +, / or =.

When trying to squeeze this into a PIC, (now On Topic <GRIN>) we can't use
big translation tables, so I've rearranged the Base64 code mapping. Base64
maps 0...25 to "A"..."Z" then 26...51 to "a"..."z" then 52...61 to "0"..."9"
and 62 and 63 to "+" and "/". I'm thinking I should map 0...9 to "0"..."9"
then 10...35 to "A...Z" etc.. so that we can just:

       get value
       if value is "+"
               value = 62
       elseif value is "/"
               value = 63
       else
               value -= "0"
               if value > 9
                       value -= "A" - "9" - 1
                       if value > 35
                               value -= "a" - "Z" - 1
                               end if
                       end if
               end if

This also makes it easier to read the encoded data manually during testing.
Something tells me I am going to have an uncanny grasp of which position
each letter holds in our alphabet before this is over. <GRIN>
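Written out in C, the whole lookup collapses to a few compares rather than a 64-entry table (one possible arrangement of the tests; b64val is a hypothetical name):

```c
/* Map a character of the rearranged Base64 alphabet
   ('0'-'9' -> 0-9, 'A'-'Z' -> 10-35, 'a'-'z' -> 36-61, '+' -> 62,
   '/' -> 63) back to its 6-bit value without a lookup table. */
int b64val(char c) {
    if (c == '+') return 62;
    if (c == '/') return 63;
    if (c >= 'a') return c - 'a' + 36;
    if (c >= 'A') return c - 'A' + 10;
    return c - '0';
}
```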

I'll EOL rather than pad with "=" at the end (unless someone knows why I
shouldn't)

Text compression and decompression are working well (about 40% on standard
English text) with my tiny text compander. Can't release source but the
basis was documented in a previous post and at:
http://204.210.50.240/techref/default.asp?url=method/compress/embedded.htm

James Newton jamesnewton@geocities.com phone:1-619-652-0593
http://techref.homepage.com NOW OPEN (R/O) TO NON-MEMBERS!
Members can add private/public comments/pages ($0 TANSTAAFL web hosting)
