So progress on my programming language, still codenamed H2D2 continues on
background threads :-) I've not given up, but I haven't made many posts
on the blog about it recently. It was probably getting boring anyway,
and recent work has been tidying things up and writing unit tests - which
does not make for interesting reading.
One thing that I wanted to try, was to compress the H2D2 bytecode so that
it doesn't take as much time to serialise a running program between machines.
I believe that Java JAR files are zipped, so I thought that I'd do something
similar with H2D2. But I also wanted something that I could write myself
that didn't take an age. Since I'm doing this for fun, it's nice to
start with a blank page and go from there. So anyway, my web wanderings
brought me to here.
That page talks about Run Length Encoding - something that I implemented
once myself back in the 1980s (or maybe the early 90s) - not that I have the
source code anymore. But I read about the RLE PackBits algorithm and thought
it might be a good way to go. Not a massive processing overhead and
reasonably simple to implement. I chose not to look at the example souce
code provided ... and just to write it from scratch myself. It sounded
like fun.
It took me about an hour whilst drinking my morning coffee to write
the compression code, and an hour or so in the evening to decompress the
data back to the original values (I also wrote some unit tests to check
for any obvious problems I could see). It seems to work OK: when I
compress H2D2 bytecode I'm getting about a 30% reduction. It's not as
good as zip compression obviously, but it will do for now.
I'm sure my code won't win any prizes for optimisation or style, but
at least it works. I guess it might also come in handy if you were trying
to squash some data on a microprocessor without needing a massive
library or lots of processing power. So here is an example C program:
// Example RLE compression with PackBits
// ...just an example, not production code :-)
#include <stdio.h>
#include <string.h> // for memset
void fgrow(char*, char*);
void fshrink(char*, char*);
int main(int argc, char *argv[])
{
if (argc==4 && strlen(argv[1])==2 && argv[1][0]=='/')
{
switch (argv[1][1])
{
case 'c':
printf("Compressing... ");
fshrink(argv[2], argv[3]);
printf("done!\r\n");
return 0;
case 'u':
printf("Uncompressing... ");
fgrow(argv[2], argv[3]);
printf("done!\r\n");
return 0;
}
}
printf("Usage: RLE { /c | /u } \r\n");
printf(" use the /c switch to compress or the");
printf(" /u switch to uncompress\r\n");
return 1;
}
void fgrow(char* inpath, char* outpath)
{
char ch=0, hdr=0;
FILE* op = fopen(outpath, "wb");
FILE* mp = fopen(inpath, "rb");
while (!feof(mp))
{
hdr=fgetc(mp);
if (feof(mp)) break;
if (hdr>=0) // copy bytes verbatim
{
for (; hdr>=0; hdr--)
{
ch = fgetc(mp);
fputc(ch, op);
}
}
else // copy a run of bytes
{
ch=fgetc(mp);
for (; hdr<=0; hdr++)
{
fputc(ch, op);
}
}
}
fclose(mp);
fclose(op);
}
void fshrink(char* inpath, char* outpath)
{
char buf[128];
memset(buf, 0, 128);
int n=0, hdr=0, runlength=0;
FILE* op = fopen(outpath, "wb");
FILE *mp = fopen(inpath, "rb");
while (!feof(mp))
{
buf[n++] = fgetc(mp);
if (feof(mp))
{
n--;
break;
}
if (n==128)
{
// buffer is full
hdr=127;
fputc(hdr, op);
fwrite(buf, 1, 128, op);
memset(buf, 0, 128);
n=0;
continue;
}
if (n >= 3 && buf[n-1]==buf[n-2] && buf[n-2]==buf[n-3])
{
// last three bytes are the same
hdr = n-4;
char symbol = buf[n-1];
char tmp = symbol;
if (n > 3)
{
fputc(hdr, op);
fwrite(buf, 1, n-3, op);
}
runlength=3;
while(!feof(mp) && tmp==symbol)
{
tmp = fgetc(mp);
n++;
runlength++;
}
hdr = 2-runlength;
fputc(hdr, op);
fputc(symbol, op);
// put the different byte back for next read
if (!feof(mp)) ungetc(tmp, mp);
memset(buf, 0, 128);
n=0;
}
}
// flush the buffer if not empty...
if (n > 0)
{
hdr = n-1;
fputc(hdr, op);
fwrite(buf, 1, n, op);
}
fclose(mp);
fclose(op);
}
Obviously this compression will only shrink files which have
runs of the same repeated byte value in them. Otherwise it may
increase the file size. You can try it from the command line and
pass /c as the first argument to compress a file and then use /u
to decompress the file back. The next two arguments need to be
the names of the input and output files respectively. This example
doesn't have any error trapping, so if the file fails to open, or
something like that it will go bang. But that would not be hard to add.
Of course, in my H2D2 code, I'm using the same algorithm, but
I've adapted the code to work on blocks of memory rather than on
files. But in H2D2 I have made a VFILE struct to work alongside the C
FILE pointers and I've added some functions that work the same as the C
IO functions but taking my VFILE pointer instead. This means that I can
easily swap code to operate either on disk or in memory...