Acorn Arcade forums: Programming: Text compression
|
Text compression |
|
andrew (18:47 8/7/2004) ninj (12:29 9/7/2004) andrew (13:01 9/7/2004) monkeyson2 (13:41 9/7/2004) Loris (15:57 9/7/2004) ninj (16:31 9/7/2004) Loris (17:35 9/7/2004) andrew (12:56 10/7/2004) mavhc (13:10 10/7/2004) andrew (18:15 11/7/2004) ninj (18:46 12/7/2004) andrew (00:24 13/7/2004)
|
|
Andrew |
Message #56792, posted by andrew at 18:47, 8/7/2004 |
Handbag Boi
Posts: 3439
|
I know this may not be used v.often now but I'm trying out a method on the Beeb whereby I have a table of the bit representations of ASCII values for a limited set of characters. Then I'm intending to look for repeated stretches and compress that way. The only problem is I'll need a table of about 60 binary representations of ASCII values. Then the decompressed characters will be compared to the relevant ASCII lookup in the binary table say 8 characters at a time: 64 bytes input buffer. The table would be 60*8 bytes long so hopefully less than say, 2K, for the whole thing?
The main problem would seem to be it's going to be a bit slow in BASIC. Is there a simpler method of text compression?
[Edited by andrew at 19:50, 8/7/2004] |
|
[ Log in to reply ] |
|
ninjah |
Message #56802, posted by ninj at 12:29, 9/7/2004, in reply to message #56792 |
Member
Posts: 288
|
I suspect you'd get about the same efficiency from using a nibble based system (if that's the right way to describe it).
You can store the 15 most used letters using only a nibble. If the nibble==16, read the second one to get a further 15 characters. Anything else and you can read the next byte.
Or maybe it'd be more efficient to encode the 13 most popular characters using a nibble, then you could store a further 45 characters using two nibbles. That gives you space for all the letters (both upper and lower case), plus a few numbers.
Mind you, it's worth sampling some text. Dot and comma are actually likely to be far more common characters than Z or Q.
You can either use a fixed table (most useful if you're going to be compressing files of less than 1k), or you can spend 256 bytes in listing each of the characters in the order they're assigned to each nibble code.
Decompression I'd expect to be about the same speed, but compression should be a lot faster. |
|
[ Log in to reply ] |
|
Andrew |
Message #56808, posted by andrew at 13:01, 9/7/2004, in reply to message #56802 |
Handbag Boi
Posts: 3439
|
I hadn't thought of that. How much is a nibble? If it's half a byte then that's 32 characters possible isn't it and 64 in a byte? |
|
[ Log in to reply ] |
|
Phil Mellor |
Message #56810, posted by monkeyson2 at 13:41, 9/7/2004, in reply to message #56808 |
Please don't let them make me be a monkey butler
Posts: 12380
|
Depends on the ratio of upper/lower case as well. If you have very few upper case letters you could store everything in lower case and use a special character to toggle caps or indicate the next character is in upper case. This could halve the number of character codes that you need. |
|
[ Log in to reply ] |
|
Tony Haines |
Message #56814, posted by Loris at 15:57, 9/7/2004, in reply to message #56808 |
Ha ha, me mine, mwahahahaha
Posts: 1025
|
I hadn't thought of that. How much is a nibble? If it's half a byte then that's 32 characters possible isn't it and 64 in a byte? A nibble is 4 bits, can hold 0-15 (or rather, can have 16 different values) A byte is 8 bits, can hold 0-255 (or 256 different values) [1]
As Iain says, analyse the text you want to compress before deciding on a compression scheme. What text do you want to compress and for what purpose?
[1] In current useage, 8 bits is pretty standard, but it wasn't always so. |
|
[ Log in to reply ] |
|
ninjah |
Message #56815, posted by ninj at 16:31, 9/7/2004, in reply to message #56808 |
Member
Posts: 288
|
Good point, monkeyson. That might work quite well.
I hadn't thought of that. How much is a nibble? If it's half a byte then that's 32 characters possible isn't it and 64 in a byte? A nybble (sorry, I've been spelling it wrong before) is half a byte - 4 bits. That's 16 combinations. You don't get 32 in a byte though, since you need at least one combination of each nybble to say 'go look at the next one'. Hence you can get at least 30 characters in a byte, if you use one combination from each nibble to mean 'look at next'. I'd try something more as well though, such as (representing each nybble as a hex char):
0-C : chars 1-13 D, 0-E : chars 14-28 E, 0-E : chars 29-53 F, 0-E : chars 54-68 [gets 68 chars to a byte] D, F, 0-E : chars 69-83 E, F, 0-E : chars 84-98 F, F, 0-E : chars 99-113 D, F, F, 0-E : chars 114-128 E, F, F, 0-E : chars 129-143 etc. etc. (you can stop at two bytes if you limit yourself to visible chars)
Thinking about it, this effectively gives you monkeyson's approach more efficiently. Setting the first nibble to D-F shifts you into the next nibble. With monkeyson's approach, setting the first nibble to a special value shifts you into the next nibble (which you treat the same but upper-casing everything).
With the above approach, you can fit the lower case range into 0,0-D,C, and the upper case range into D,D-E,D. I.e a lower case char is on (mean) average 6 bits long (half are 4, half are wheras the upper-case chars are on average 8 bits long (they're all .
With monkeyson's approach (if I'm reading it right), the lower case chars would be as before (6 bits), while upper case would require an extra preceding nybble, making them 10 bits on average.
The code is fairly simple. You need a function to effectively read a nybble at a time then go:
char = getNybble() IF char < &D THEN: - char += 1 ELSE: - char = (char-&C) * 13: REM char becomes the base value temporarily - nybble = getNybble() - WHILE nybble >= &F: - - char += 55: REM Increase base value - - nybble = getNybble() - char += nybble+1 ENDIF PRINT CHR${char};
There's probably mistakes in there. Writing BASIC is making my head hurt too much. Note that I'm not allowing the 0 character, so you couldn't use null-terminated strings. A 0 byte would just represent the first character, twice. You could reserve some 8-bit long sequence, such as F-E, for the end signal, or store each string with their length.
For small samples of text (<5k) I'd expect the nybble approach to work better than a pattern based approach, but with more data, finding patterns becomes easier and more effective.
Edit: Been writing too much Python; forgot the ENDIF
[Edited by ninj at 19:48, 12/7/2004] |
|
[ Log in to reply ] |
|
Tony Haines |
Message #56818, posted by Loris at 17:35, 9/7/2004, in reply to message #56815 |
Ha ha, me mine, mwahahahaha
Posts: 1025
|
Or you could try something else entirely. I once had a good experience with a 'sliding dictionary' approach. The advantages of this is that you don't need a separate look-up-table, and the 'dictionary' remodels itself to the input as you go along.
The basic idea is that you copy strings from what you've already created.
Straight off the bat you should of course throw away any bits from the text you don't use. If it is plain text you probably don't use the top bit; Of course rare characters can be special-cased from an 'escape code' with a LUT (to an 8 bit character) if you want. You need some additional escape code(s) which can of course be never-used characters in-range. In ASCII there exist things like 'backspace' which you can appropriate for this purpose.
If you are using plain text the simplest approach would be to point to a word you'd previously written, following an escape code:
"Demo text waffle hi waffle text."
Could be encoded as:
"Demo text waffle hi <escape code>[01] <escape code>[03]."
Given the nature of text I'd consider having multiple escape codes, indicating whether the initial letter should be capitalised, whether the following punctuation should be included, if you want a space at the end and so on.
The size of your dictionary, that is how many bits you use for the look-back, should also be determined empirically.
A similar, slightly novel approach would embed 'bookmarks' in the string which would form a dictionary of pointers to useful strings. For which making a compresser would be fun. |
|
[ Log in to reply ] |
|
Andrew |
Message #56833, posted by andrew at 12:56, 10/7/2004, in reply to message #56815 |
Handbag Boi
Posts: 3439
|
Loris - I like the bookmark idea, might try that. You'll be one of the first to know what I want to use it for when ready I need lower case, upper case and I think 3 punctuation characters. On trying the repeated sequence of bits (actually bytes at the moment to test) method the best I got was about 70% compression so I'm thinking whether I could just try to compress the bits at least first. This won't be as good as the nibble method but I've got most of the code and tables done already
[Edited by andrew at 13:58, 10/7/2004] |
|
[ Log in to reply ] |
|
Mark Scholes |
Message #56835, posted by mavhc at 13:10, 10/7/2004, in reply to message #56833 |
Member
Posts: 660
|
Read old Micro Users, they had stuff on writing adventure games, and therefore compressing text.
One technique was to take the 15 most common letters, encode them into 1 nybble, with 16 being "use 2nd level", another 15 for those, and again 16 being "use 3rd level"
90% of stuff is 4bits, 8% 8bits, 2% 12/16 bits.
And your lookup table's not big. |
|
[ Log in to reply ] |
|
Andrew |
Message #56857, posted by andrew at 18:15, 11/7/2004, in reply to message #56835 |
Handbag Boi
Posts: 3439
|
I've done it 6 bits which should be useful - thanks. I could have done it in 5 bits and might with one bit to represent case-change but this is ok for now. |
|
[ Log in to reply ] |
|
ninjah |
Message #56891, posted by ninj at 18:46, 12/7/2004, in reply to message #56835 |
Member
Posts: 288
|
Read old Micro Users, they had stuff on writing adventure games, and therefore compressing text. Yeah, I meant to mention that I nicked the nybble idea from the old Micro User adventure game writing series, but I forgot as I neared the end of that last opus. I don't know if I implemented it in exactly the same way though. |
|
[ Log in to reply ] |
|
Andrew |
Message #56898, posted by andrew at 00:24, 13/7/2004, in reply to message #56891 |
Handbag Boi
Posts: 3439
|
Done the 5 byte method now which is quite efficient.
[Edited by andrew at 01:26, 13/7/2004] |
|
[ Log in to reply ] |
|
|
Acorn Arcade forums: Programming: Text compression |