For this example, consider a very large list of strings (and their lengths) that we need to persist to disk. Or, a very large array of these:
typedef struct {
char *data;
int len;
} data_t;
To make the problem non-trivial, we will consider the case of different length strings (in my tests I was using a million strings around 150 bytes in length — plus or minus a few to keep them from all being the same size).
Let’s get started
Here is a naive read/write for this data structure:
void naive_write(char* filename, data_t* data, int n)
{
FILE *f;
int i;
f = fopen (filename, "wb" );
fwrite(&n, sizeof(n), 1, f);
for (i=0;i<n;i++)
{
fwrite(&data[i].len, sizeof(data[i].len), 1, f);
fwrite(data[i].data, sizeof(char), data[i].len, f);
}
fclose(f);
}
data_t* naive_read(char* filename)
{
data_t* answer;
FILE *f;
int i, n;
f = fopen (filename, "rb");
fread(&n, sizeof(n), 1, f);
answer = malloc(n * sizeof(data_t));
for (i=0;i<n;i++)
{
fread(&answer[i].len, sizeof(answer[i].len), 1, f);
answer[i].data = malloc(sizeof(char) * answer[i].len);
fread(answer[i].data, sizeof(char), answer[i].len, f);
}
return answer;
}
We first write the number of strings, and then for each string we write its length followed by the string data. When reading, we read the number of strings (and malloc the array of strings) and then for each string we read its length (to malloc the string data itself) and then read the string. This implementation does many mallocs() and it does many freads(). It is, consequently, fairly slow.
With this data format on disk, though, this would be difficult to (easily) optimize much more. So, let’s change the layout.
Read more: { on programming and the internets }