Skwish: a lightweight library for storing blobs on the file system

A lightweight library for storing blobs on the file system.


Packages
com.faunos.skwish               A lightweight library for storing blobs on the file system.
com.faunos.skwish.ext.http      An experimental read-only HTTP interface to skwish is provided here.
com.faunos.skwish.spi           A service provider interface is defined here.
com.faunos.skwish.sys           The basic file-based Segment structure is implemented here.
com.faunos.skwish.sys.filters   A collection of views on one or more underlying segments.
com.faunos.skwish.sys.mgr       The SegmentStore implementation.
com.faunos.util                 Root package for custom utilities that don't necessarily belong in any application per se.
com.faunos.util.boot            Application start-up utilities.
com.faunos.util.cc              Concurrency utilities.
com.faunos.util.dim.line        Utilities for one-dimensional objects.
com.faunos.util.io              Utilities and implementation classes for various I/O operations.
com.faunos.util.io.file         File system utilities.
com.faunos.util.main            Command-line parsing utilities.
com.faunos.util.net             Facilities for non-blocking network I/O.
com.faunos.util.net.http        A skeletal implementation of a simple non-blocking HTTP server.
com.faunos.util.net.http.file   Support for serving files over HTTP.
com.faunos.util.test            Support for unit tests.
com.faunos.util.tree            Support for tree structures.
com.faunos.util.xml             XML utilities.

 


Overview

Skwish is a Java library for storing and retrieving blobs of arbitrary size--entries, in Skwish-speak. The entries (blobs) can contain arbitrary content: to Skwish, every entry is simply an uninterpreted byte sequence. Skwish maintains a simple, fast mapping from numeric entry IDs to entry contents. These entry IDs are determined by Skwish on entry (blob) insertion: the IDs are doled out in ascending order. An application, thus, must maintain the entry IDs somewhere else--typically in an index or database.

So the functionality provided is quite spartan, which raises the question, "But what is it good for?" The idea is for Skwish to do one thing, blob storage management, and do it very well. While similar functionality may be found in many existing indexing and database tools (and indeed in the file system itself), Skwish is designed to address more niche scenarios. Here are some examples:

More generally, if you find your application needs to maintain both offset and content files, where the offset file delineates the boundaries of entries within the content file, then this library hopefully provides a ready-made solution for that part of the problem. In fact, that's how this project came into being in the first place.

Basic Design

The elementary storage structure is called a segment. Given a segment, a user can add an entry (which, again, is simply an uninterpreted byte sequence) and receive a corresponding entry ID. The returned entry ID can later be used as a key to retrieve the contents of the entry just added. The user does not control what the entry ID will be: the segment itself determines the ID of each new entry. From the user's perspective, that's all Skwish does in a nutshell. (A toy sketch of this contract follows the Entry Lifecycle list below.)

Append-only Semantics

Entries may only ever be added to or removed from a segment; they cannot be modified. This append-only restriction on the behavior of the system, while limiting in some respects, is also quite liberating. For example, this design choice (not allowing updates) avoids many of the contention issues that arise in a concurrent environment. It also makes the implementation a good deal simpler.

As it turns out, it's easy to make an append-only system transactional.

Entry Lifecycle

Any entry, thus, has the following simple lifecycle:

  1. Created: The entry is created (inserted/written) and is assigned an entry ID. The returned entry ID will forever be associated with the entry. While in this state, the entry can be read any number of times using the entry ID.
  2. Deleted: An existing (created) entry is deleted and the associated entry ID will forever signify an entry that is not retrievable. The associated entry ID is said to be killed. The system also supports creating and deleting an entry (effectively killing an entry ID) in one step: killNext.
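
The toy sketch below is illustrative only: it is not the Skwish API (the real segment is file-backed), but it shows how the two lifecycle states and the one-step killNext described above map onto code.

    import java.util.ArrayList;
    import java.util.List;

    /** Toy illustration of the segment contract and entry lifecycle; not the Skwish API. */
    class ToySegment {

        private final List<byte[]> entries = new ArrayList<>();  // a null slot marks a killed ID

        /** Created: appends an entry and returns the ascending ID the segment assigned to it. */
        long add(byte[] content) {
            entries.add(content.clone());
            return entries.size() - 1;
        }

        /** Readable any number of times while the entry exists. */
        byte[] get(long id) {
            byte[] e = entries.get((int) id);
            if (e == null)
                throw new IllegalStateException("entry " + id + " is deleted");
            return e.clone();
        }

        /** Deleted: the ID remains forever, but the entry is no longer retrievable. */
        void delete(long id) {
            entries.set((int) id, null);
        }

        /** Create-and-delete in one step: burns the next ID without storing any content. */
        long killNext() {
            entries.add(null);
            return entries.size() - 1;
        }
    }

An application would persist the IDs it receives from add somewhere else (an index, a database), as noted in the overview.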

Keystone Operations

The append-only semantics of the system allow it to be relatively fail-safe. That is, if a running instance is abruptly terminated, chances are very good that the system will still be in a consistent state on restart. (I say "chances are very good" because one small window remains, namely a partial write of the 8-byte value denoting the entry count in a segment's index file, in which abrupt termination can result in data corruption.)

In an effort to minimize the number and duration of such windows of vulnerability, the library employs certain keystone operations. A keystone operation is an operation (typically short in duration) that makes a set of previously uncommitted operations effective: until the keystone operation has been performed, the preceding operations have no effect on the data model.
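
Here is a sketch of the idea in code (illustrative, not Skwish's actual implementation): the writer appends content and offsets first, and only then overwrites the entry count in the index header; that final small write is the keystone. The header layout and field widths are the ones described under File Format below, and the sketch assumes the index starts at the beginning of its file.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    /** Keystone-commit sketch; assumes the index header layout described under File Format. */
    class KeystoneSketch {

        // OFFSETWIDTH (1 byte) + BASEID (8) + BASEOFFSET (8) precede ENTRYCOUNT,
        // assuming the index starts at file offset 0.
        static final long ENTRY_COUNT_POS = 1 + 8 + 8;

        static void commit(FileChannel contents, FileChannel index,
                           ByteBuffer newContent, ByteBuffer newOffsets,
                           long newEntryCount) throws IOException {
            // Non-keystone work: append the new bytes. If we crash anywhere in here,
            // readers still trust the old ENTRYCOUNT and simply ignore the tail.
            contents.write(newContent, contents.size());
            index.write(newOffsets, index.size());
            index.force(false);

            // Keystone: one small write that makes all of the work above effective.
            // (The partial-write window mentioned above lives right here.)
            ByteBuffer count = ByteBuffer.allocate(8).putLong(0, newEntryCount);
            index.write(count, ENTRY_COUNT_POS);
            index.force(false);
        }
    }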

Infrequent Delete Assumption

The implementation assumes that deletes are relatively infrequent. A deleted entry ID (the ID of a deleted entry) incurs anywhere from 2 to 8 bytes of storage overhead (4 bytes is typical). If the average (undeleted) entry size is small (say 32 bytes), this overhead begins to hurt. Here's a back-of-the-envelope estimate of the overhead.

Let the average existing (not deleted) entry size be s bytes.
Let the average number of deletes per existing entry be d.
Let the storage overhead per entry (existing and deleted) be w bytes. (This is the offset word width.)
Then the storage overhead h for the entries in a segment, expressed as a fraction of the content size, is given by
                h = w × (1 + d) / s

So as the average number of deletes per entry drops, or as the size of the typical entry increases, this storage overhead drops (percentage-wise). Consider a case where the calculated overhead is 100%. Setting

                d = 7,  (7/8 of the entries are deleted)
                w = 4,  (typical, 4-byte word)
                s = 32, (average entry size)
we get
                h = 1   (100% overhead)
To put it yet another way, for the typical segment (4-byte offset word), as the number of deletes approaches a quarter of the total byte size of the [remaining, undeleted] contents of the segment, the storage overhead approaches 100%.

So if the application does not delete with reckless abandon, this assumption shouldn't be too limiting.
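
The same arithmetic in code, for reference (purely illustrative):

    /** Storage overhead h = w * (1 + d) / s, per the formula above. */
    class OverheadEstimate {

        static double overhead(double w, double d, double s) {
            return w * (1 + d) / s;
        }

        public static void main(String[] args) {
            System.out.println(overhead(4, 7, 32));     // 1.0    -> the 100% case worked above
            System.out.println(overhead(4, 0.1, 1024)); // ~0.004 -> negligible for large entries
        }
    }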

File Format

The base implementation of a segment (BaseSegment) involves just two files: a contents file consisting of the entries appended one after the other, and an offsets file (also referred to as the index file) that delineates the offset boundaries of each entry in the contents file and thus provides an index into that file. (Additionally, the offsets file can encode that an entry is deleted.) So the contents file has no particular structure--aside from the fact that entries are appended in ascending entry ID order. The index file is more interesting. It has the following structure:

        INDEX       := HEADER, ENTRIES
        HEADER      := OFFSETWIDTH, BASEID, BASEOFFSET, ENTRYCOUNT
        ENTRIES     := <ENTRY>^(ENTRYCOUNT + 1)        // ENTRYCOUNT + 1 repetitions of ENTRY
        OFFSETWIDTH := BYTE
        BASEID      := LONG
        BASEOFFSET  := LONG
        ENTRYCOUNT  := LONG
        LONG        := <BYTE>^8                        // 8 bytes, big endian
        ENTRY       := <BYTE>^OFFSETWIDTH              // OFFSETWIDTH bytes, big endian, signed
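
A sketch of reading the fixed-size HEADER in Java (illustrative; field widths as in the grammar above, with error handling and short-read handling omitted):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    /** Illustrative parse of the fixed-size HEADER described above. */
    record IndexHeader(int offsetWidth, long baseId, long baseOffset, long entryCount) {

        static final int HEADER_SIZE = 1 + 8 + 8 + 8;

        /** Reads the header starting at {@code start}, the file offset where the index begins. */
        static IndexHeader read(FileChannel index, long start) throws IOException {
            ByteBuffer buf = ByteBuffer.allocate(HEADER_SIZE);   // ByteBuffer defaults to big endian
            index.read(buf, start);        // absolute read; assumes the buffer is filled in one call
            buf.flip();
            int offsetWidth = buf.get() & 0xFF;
            return new IndexHeader(offsetWidth, buf.getLong(), buf.getLong(), buf.getLong());
        }
    }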

The HEADER section of the index file is of fixed size; the ENTRIES section is of variable length. Given an entry ID id, the entry's offsets (into the contents file) are determined as follows (a sketch of the whole lookup in Java follows the list):

  1. Subtract the index's base ID (BASEID) from id to obtain its entry number entryNo in the index.
    
            entryNo = id - BASEID
    
  2. Read the entryNo-th and (entryNo + 1)-th ENTRY and interpret these as signed integral words (big endian byte order, OFFSETWIDTH bytes wide). Let's call these 2 values o1 and o2, respectively.
  3. If o1 < 0, then the specified entry is deleted (and there is no more work to be done).
  4. If o2 < 0, then the high bit of the high byte is set. Set this high bit to zero.
    
            o2 := o2 & 0x7F<FF>^(OFFSETWIDTH - 1)
    
    where 0x7F<FF>^(OFFSETWIDTH - 1) is understood to mean an OFFSETWIDTH-byte mask consisting of 0x7F followed by OFFSETWIDTH - 1 bytes of 0xFF: that is, every bit of the word set except the high bit.
  5. Finally, the offset boundaries of the entry with the given id (in the contents file) are determined by adding the index's base offset (BASEOFFSET) to o1 and o2.
    
            startOffset = BASEOFFSET + o1
            endOffset   = BASEOFFSET + o2
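
Putting steps 1 through 5 together, here is a sketch of the lookup in Java (illustrative, not the actual Skwish code; it reuses the IndexHeader sketch above, and short reads and bounds checks are omitted):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    /** Illustrative entry lookup following steps 1 through 5 above. */
    class EntryLookupSketch {

        /** Returns {startOffset, endOffset} into the contents file, or null if the entry is deleted. */
        static long[] locate(FileChannel index, long indexStart, IndexHeader h, long id)
                throws IOException {
            long entryNo = id - h.baseId();                               // step 1

            int w = h.offsetWidth();
            ByteBuffer buf = ByteBuffer.allocate(2 * w);
            long pos = indexStart + IndexHeader.HEADER_SIZE + entryNo * w;
            index.read(buf, pos);                                         // step 2: read two adjacent ENTRYs
            long o1 = signedWord(buf, 0, w);
            long o2 = signedWord(buf, w, w);

            if (o1 < 0)                                                   // step 3: the entry is deleted
                return null;
            if (o2 < 0)                                                   // step 4: clear the high bit
                o2 &= (1L << (8 * w - 1)) - 1;

            return new long[] {                                           // step 5
                h.baseOffset() + o1,
                h.baseOffset() + o2
            };
        }

        /** Interprets {@code width} bytes at {@code at} as a signed, big-endian word. */
        private static long signedWord(ByteBuffer buf, int at, int width) {
            long v = buf.get(at);                    // first byte carries the sign
            for (int i = 1; i < width; i++)
                v = (v << 8) | (buf.get(at + i) & 0xFF);
            return v;
        }
    }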
    

Implementation Notes

I'm going to pile up implementation notes here.
BF: Beware of high expectations, as in "pile up" -- 8/18/08

On where files begin

The base implementation does not assume that the file representation of the index starts at the beginning of the file it is given. Instead, the implementation requires that the file be positioned at the start of the index before instantiation. This means you can pad the file with whatever you want. (Another index, if you're adventurous.)

Generally speaking, having code not assume that the file representation of an entity starts at the beginning of the file is a good thing. So, for example, we can later prefix the index file with an implementation version number: the code continues to work if we first read the version number before constructing the index... you get the picture.
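
For example, a caller that stores its own prefix ahead of the index would simply position the channel past that prefix before handing it over. The positioning below uses the standard FileChannel API; the segment/index constructor itself is omitted because its exact signature is not shown in this document (see the BaseSegment javadocs).

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    /** Sketch: position the channel at the start of the index before constructing anything on it. */
    class PositionBeforeConstruction {

        static FileChannel openIndexChannel(Path indexFile, long prefixLength) throws IOException {
            FileChannel index = FileChannel.open(
                    indexFile, StandardOpenOption.READ, StandardOpenOption.WRITE);

            // Skip whatever the application stored ahead of the index (a version
            // number, another index, ...): the index is read from the channel's
            // current position at instantiation, not from file offset zero.
            index.position(prefixLength);
            return index;   // ... hand the positioned channel to the segment/index constructor
        }
    }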

On the current position of a file

To the extent possible, the implementation avoids using an underlying FileChannel's position field. For ordinary segments, for example, the only times the [entry/content] file's position matters are at instantiation (as described above) or when the segment is appending another segment to itself.

The point of this exercise, of course, is to support concurrent access.
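
Concretely, this means favoring FileChannel's absolute read, which takes an explicit position and leaves the channel's own position untouched. A sketch:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    /** Sketch of position-independent reading via FileChannel's absolute read. */
    class AbsoluteReadSketch {

        /** Reads {@code length} bytes starting at {@code position} without touching
         *  the channel's position field, so concurrent readers never trample each
         *  other's (shared) channel state. */
        static ByteBuffer readAt(FileChannel ch, long position, int length) throws IOException {
            ByteBuffer buf = ByteBuffer.allocate(length);
            while (buf.hasRemaining()) {
                int n = ch.read(buf, position + buf.position());   // absolute read
                if (n < 0)
                    throw new IOException("unexpected end of file");
            }
            buf.flip();
            return buf;
        }
    }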

Concurrency characteristics of Segments

The file structure of a [base] segment (more specifically, its simple index) lends itself to a multiple readers / single writer model of concurrent access. Here's why.

The BaseSegment class is designed not to change state under a read operation. That class does not use FileChannel's position: it only ever uses java.nio.channels.FileChannel methods that do not affect the state of the underlying channel and which are marked as suitable for concurrent access in the FileChannel API. So we [think we] know the whole contraption is safe for concurrent read access--as long as nobody's concurrently doing any writing.

Now let's consider the windows of vulnerability if we allow one writer concurrently with many readers. They are:

  1. A reader reads an entry that the writer is concurrently deleting (and possibly purging).
  2. A reader reads the entry count in the index header while the writer is updating it.

The first case (involving the delete) never causes a data consistency error. In fact, if the delete happens without specifying a purge (see Segment.delete(id, count, purge)), then there is no race condition whatsoever; if the purge parameter is set to true, then the reader may read a corrupted entry, but the data model is not violated.

The second case (involving updating the entry count) is easy to fix, so long as we restrict "concurrent" to mean concurrent access from within the same process: all we need to do is maintain an in-memory copy of the entry count, protected by something like a memory barrier. The Java volatile keyword does the trick.
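
A sketch of that in-memory guard (illustrative; the class, field, and method names are stand-ins, not Skwish's):

    /**
     * Sketch: readers consult an in-memory entry count instead of re-reading the
     * possibly mid-write 8-byte ENTRYCOUNT field; volatile provides the necessary
     * visibility guarantee between the single writer and the many readers.
     */
    class EntryCountGuard {

        private volatile long entryCount;

        /** Writer: publish the new count only after the entries themselves are written. */
        void publish(long newCount) {
            entryCount = newCount;
        }

        /** Readers: any ID at or beyond this count is not yet visible. */
        long entryCount() {
            return entryCount;
        }
    }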

So, on closer inspection, as long as the writer doesn't immediately purge (overwrite) the entries it deletes (the default behavior, because it's faster), a segment does support multiple readers concurrent with a single writer.

The next level up on the concurrency model is to allow multiple concurrent writers and readers. TxnSegment is designed to address that scenario.


