Design of Column Based ARCHIVE Storage Engine

1. create and drop databases

First we set breakpoints on all the member functions of class ha_archive and Archive Storage which makes sure we can look at the execution flow of functions clearly.

We now input the following command in drizzle client to create a database named testdb:

[code lang=”shell”]
create database testdb;
Query OK, 1 row affected (0 sec)

Then type in:

[code lang=”shell”]
use testdb;
Database changed

And drizzle server stops at doGetTableNames() breakpoint .

Now, we type in:

[code lang=”shell”]
drop database testdb;
Query OK, 0 rows affected (17 sec)

Again, drizzle server stops at doGetTableNames() function. As we can see using and dropping databases operations are simple involved only one function doGetTableNames()

2. create tables

Now we create a table called students with three columns:

[code lang=”shell”]
Create table students(id int not null, name char(20) not null, phone char(12))
Query OK, 0 rows affected (1 min 20 sec)

It then stops at breakpoints:

Breakpoint 10, ArchiveEngine::doGetTableDefinition

Breakpoint 10, ArchiveEngine::doGetTableDefinition

Breakpoint 11, ha_archive (construction function of ha_archive class)

Breakpoint 23, ArchiveEngine::doCreateTable

And a file named students.arz is created in the data directory as shown below:

[code lang=”shell”]
liao@localhost testdb]$ ls -la
drwxrwx–x 2 liao liao 4096 04-05 15:25 .
drwxrwxr-x 3 liao liao 4096 04-05 15:14 ..
-rw——- 1 liao liao   25 04-05 15:14 db.opt
-rw-rw—- 1 liao liao  227 04-05 15:25 students.arz

And we know that doCreateTable() actually creates the students.arz data file using azopen() call shown below:

[code lang=”c” toolbar=”false”]

if (azopen(&create_stream, name_buff, O_CREAT|O_RDWR,AZ_METHOD_BLOCK) == 0)
error= errno;
goto error2;


So, this is the procedure of creation of data files in current row based ARCHIVE storage engine in Drizzle. But now, what we are going to do is to develop a column based one. The primary advantage of a column oriented storage engine is that it makes it possible to read just the subset of columns rather than to read all of the data in a record. For this reason, we create one file for each column, and one file that contains the NULL bitmap of the table. That is, assuming one table contains N columns, we create N+1 files. For doCreateTable() function in the column based ARCHIVE, we shall create N+1 files by calling azopen().

For files to each column, we have file names:

table_name + column_name + ‘.’ + “ARZ”;

For bitmap file, we have:

table_name + ‘.’ + “ARZ”.

These ensure that there would not be two identical files. For table names and column names, we get them from member variables of table argument which contains the table name and all the column names in the **field pointer.


In drizzle client, we input:

drizzle> Insert into students values (1, ‘Tom’, NULL);

Query OK, 1 row affected (38 sec)

And drizzle server reaches the following breakpoints:

Breakpoint 11, ha_archive

Breakpoint 21, ha_archive::open

Breakpoint 17, ha_archive::get_share

Breakpoint 7, ArchiveEngine::findOpenTable

Breakpoint 14, ArchiveShare

Breakpoint 16, ArchiveShare::prime

Breakpoint 8, ArchiveEngine::addOpenTable

Breakpoint 49, ha_archive::create_record_buffer

Breakpoint 43, ha_archive::store_lock

Breakpoint 45, ha_archive::start_bulk_insert

Breakpoint 27, ha_archive::write_row

Breakpoint 19, ha_archive::init_archive_writer

Breakpoint 24, ha_archive::real_write_row

Breakpoint 26, ha_archive::pack_row

Breakpoint 25, ha_archive::max_row_length

Breakpoint 35, ha_archive::fix_rec_buff

Breakpoint 46, ha_archive::end_bulk_insert

The key is the write_row() function. Again, current ARCHIVE is row-based, so create_record_buffer() is called in open() to malloc a buffer to store one row:

  record_buffer= create_record_buffer(table->s->reclength +


Now, things changed. We need buffering for each column, and Brian mentioned that azip handles compressing lots of records at once much better then doing lots of little records between writes. So it’s important to choose a suitable buffer size for each column. Better compression means less I/Os. Based on these, we may first buffer the data of each column and then call azip library functions to do the compression processing and flush the compressed data. But how to fetch data of each column in a row? Again, we get the information of columns from table->**field pointer and copy the data to buffer of each column. When the buffer for each column is full, we do compressing and write the data to disk, while current ARCHIVE flush data only if a read occur. In all, Figure 1 illustrates the design of insertion.


Figure 1. Insertion for column based ARCHIVE storage engine

Another issue is every column needs cursor to the files for insertion because columns may have NULL values. Current ARCHIVE storage engine is row-based and rows are packed in a single file, so only one cursor in the file is fine. But now each column gets it’s own file, so for a table that has N columns as well as N cursors are needed.

4. select(table scan)

For table scanning, we try:

drizzle> select * from students;

Breakpoints are:

Breakpoint 43, ha_archive::store_lock

Breakpoint 44, ha_archive::info

Breakpoint 33, ha_archive::rnd_init

Breakpoint 20, ha_archive::init_archive_reader

Breakpoint 12, ha_archive::read_data_header

Breakpoint 38, ha_archive::rnd_next

Breakpoint 34, ha_archive::get_row

Breakpoint 37, ha_archive::get_row_version3

Breakpoint 36, ha_archive::unpack_row

Breakpoint 38, ha_archive::rnd_next       

Breakpoint 34, ha_archive::get_row        

Breakpoint 37, ha_archive::get_row_version3     

Current ARCHIVE supports no indexing (and it needn’t to). While doing a select operation, it scans the whole table, as can be seen from above. First it calls rnd_init() to prepare for the scanning, then calls rnd_next(), get_row(), get_row_version3() and unpack_row(). It continues on until reach the end of the table. 

For column-based ARCHIVE, we first scan the bitmap file then do a bitwise AND operation to examine whether there are NULL values in a row. To explain this, assume we have written a row (1, ‘Liao’, NULL) into table students, we actually wrote to 3 files. First, the NULL bitmap(001) to bitmap file, then the first column file(1), and then the 2nd column file(‘liao’). Now we first fetch bitmap(001) and examine the 3rd column of this row has NULL value, then we needn’t read the 3rd column file.

After all the specified columns are read, we unpack the row. Also, some cursors may be need if the table has NULL values in some column.

For operations that do selection on only some of the columns, we need only scan the specified column files.