Package | Description |
---|---|
com.faunos.skwish | A lightweight library for storing blobs on the file system. |
com.faunos.skwish.ext.http | An experimental read-only HTTP interface to skwish is provided here. |
com.faunos.skwish.spi | A service provider interface is defined here. |
com.faunos.skwish.sys | The basic file-based Segment structure is implemented here. |
com.faunos.skwish.sys.filters | A collection of views on one or more underlying segments. |
com.faunos.skwish.sys.mgr | The SegmentStore implementation. |
com.faunos.util | Root package for custom utilities that don't necessarily belong in any application per se. |
com.faunos.util.boot | Application start up utilities. |
com.faunos.util.cc | Concurrency utilities. |
com.faunos.util.dim.line | Utilities for one dimensional objects. |
com.faunos.util.io | Utilities and implementation classes for various I/O operations. |
com.faunos.util.io.file | File system utilities. |
com.faunos.util.main | Command line parse utilities. |
com.faunos.util.net | Facilities for non-blocking network I/O. |
com.faunos.util.net.http | A skeletal implementation of a simple non-blocking HTTP server. |
com.faunos.util.net.http.file | Support for serving files over HTTP. |
com.faunos.util.test | Support for unit tests. |
com.faunos.util.tree | Support for tree structures. |
com.faunos.util.xml | XML utilities. |
A lightweight library for storing blobs on the file system.
Skwish is a Java library for storing and retrieving blobs of arbitrary size--entries, in Skwish-speak. The entries (blobs) can contain arbitrary content: to Skwish, every entry is simply an uninterpreted byte sequence. Skwish maintains a simple, fast mapping from numeric entry IDs to entry contents. These entry IDs are determined by Skwish on entry (blob) insertion: the IDs are doled out in ascending order. An application, thus, must maintain the entry IDs somewhere else--typically in an index or database.
So the functionality provided is quite spartan. This raises the question, "But what is it good for?" The idea is for Skwish to do one thing, blob storage management, and do it very well. While similar functionality may be found in many existing indexing and database tools (and indeed in the file system itself), Skwish is designed to address more niche scenarios. Here are some examples:
More generally, if you find your application needs to maintain both offset and content files, where the offset file delineates the boundaries of entries within the content file, then this library hopefully provides a ready-made solution for that part of the problem. In fact, that's how this project came into being in the first place.
The elementary storage structure is called a segment. Given a segment, a user can add an entry (which is simply an uninterpreted byte sequence) and receive a corresponding entry ID. The returned entry ID can later be used as a key to retrieve the contents of the entry just added. The user does not control what the entry ID will be: the segment itself determines the new entry's ID. From the user's perspective, that's all Skwish does in a nutshell.
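To make that usage pattern concrete, here is a minimal sketch. It deliberately defines its own stand-in interface rather than quoting the real Segment API; the method names insertEntry and getEntry are assumptions for illustration only.

import java.nio.ByteBuffer;

// Illustrative sketch only: the nested interface stands in for a skwish
// segment; its method names (insertEntry, getEntry) are assumptions for
// illustration, not necessarily the documented API.
public class SkwishUsageSketch {

    interface BlobSegment {                      // hypothetical stand-in for Segment
        long insertEntry(ByteBuffer contents);   // returns the ID chosen by the segment
        ByteBuffer getEntry(long id);            // retrieves the entry's contents
    }

    static long store(BlobSegment segment, byte[] blob) {
        // The segment, not the caller, decides the ID (IDs are doled out in
        // ascending order), so the application must record the returned ID
        // somewhere else--typically in its own index or database.
        return segment.insertEntry(ByteBuffer.wrap(blob));
    }

    static ByteBuffer load(BlobSegment segment, long id) {
        return segment.getEntry(id);
    }
}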
Entries may only ever be added to or removed from a segment; they cannot be modified. This append-only restriction, while limiting in some respects, is also quite liberating. For example, this design choice (no updates) avoids many of the contention issues that arise in a concurrent environment. It also makes the implementation a good deal simpler.
As it turns out, it's easy to make an append-only system transactional.
Any entry, thus, has a simple lifecycle: it is inserted (receiving its ID), may then be read any number of times, and may finally be deleted (killed, in Skwish-speak--see, for example, killNext).
The append-only semantics of the system allow it to be relatively fail-safe. That is, if a running instance is abruptly terminated, chances are very good that the system will still be in a consistent state on restart. (I say "chances are very good" because there is still a small window--the partial write of the 8-byte value denoting the entry count in a segment's index file--in which abrupt termination can result in data corruption.)
In an effort to minimize the number and duration of such windows of vulnerability, the library employs certain keystone operations. A keystone operation is an operation (typically short in duration) that makes a set of previously uncommitted operations effective: the previous operations have no effect on the data model while the keystone operation has not yet been performed.
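As a rough (non-Skwish) illustration of the idea, the following sketch appends an entry's bytes and its offset record first, and only then writes the 8-byte entry count; that final small write is the keystone, and a crash before it simply leaves the previously committed state in effect. The file layout and parameters here are assumptions made for the example.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;

// Minimal sketch of keystone-style ordering; not the library's actual code.
class KeystoneSketch {

    // 'countPosition' marks where the 8-byte entry count lives in the index file.
    static void append(FileChannel contents, FileChannel index,
                       long countPosition, long newEntryCount,
                       ByteBuffer entryBytes, ByteBuffer offsetRecord)
            throws IOException {

        // 1. Write the new data. Until the keystone below lands, readers still
        //    see the old entry count and therefore ignore these bytes.
        contents.write(entryBytes, contents.size());
        index.write(offsetRecord, index.size());
        contents.force(false);
        index.force(false);

        // 2. Keystone: publish the new entry count in one small, final write.
        ByteBuffer count = ByteBuffer.allocate(8).order(ByteOrder.BIG_ENDIAN);
        count.putLong(newEntryCount);
        count.flip();
        index.write(count, countPosition);
    }
}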
The implementation assumes that deletes are relatively infrequent. A deleted entry ID (the ID of a deleted entry) incurs anywhere from 2 to 8 bytes of storage overhead (4 bytes is typical). If the average (undeleted) entry size is small (say 32 bytes), this overhead begins to hurt. Here's a back-of-the-envelope estimate of the overhead.
Let the average existing (not deleted) entry size be s bytes, let the offset word width be w bytes, and let the average number of deleted entries per existing entry be d. Then the storage overhead ratio is

h = w × (1 + d) / s

So as the average number of deletes per entry drops, or as the size of the typical entry increases, this storage overhead drops (percentage-wise). Consider a case where our calculated overhead is 100%. Setting

d = 7  (7/8 of the entries are deleted)
w = 4  (typical, 4-byte offset word)
s = 32 (average entry size, in bytes)

we get

h = 1 (100% overhead)

To put it yet another way, for the typical segment (4-byte offset word), as the number of deleted entries approaches a quarter of the total byte size of the [remaining, undeleted] contents of the segment, the storage overhead approaches 100%.
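For contrast, plugging friendlier numbers into the same formula--the typical 4-byte word, but fewer deletes and larger entries, say d = 1 and s = 256--gives h = 4 × 2 / 256 ≈ 3% overhead.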
So if the application does not delete with reckless abandon, this assumption shouldn't be too limiting.
The base implementation of a segment (BaseSegment) involves just 2 files: a contents file consisting of the entries appended one after the other, and an offsets file that delineates the offset boundaries of each entry in the contents file and thus provides an index into that file. (Additionally, the offsets file supports encoding that an entry is deleted.) So the contents file has no particular structure--aside from the fact that entries are appended in ascending entry ID order. The index file is more interesting. It has the following structure:
INDEX       := HEADER, ENTRIES
HEADER      := OFFSETWIDTH, BASEID, BASEOFFSET, ENTRYCOUNT
ENTRIES     := <ENTRY> ^ (ENTRYCOUNT + 1)
OFFSETWIDTH := BYTE
BASEID      := LONG
BASEOFFSET  := LONG
ENTRYCOUNT  := LONG
LONG        := <BYTE> ^ 8            // (big endian)
ENTRY       := <BYTE> ^ OFFSETWIDTH

(where X ^ n denotes n consecutive repetitions of X)
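Read literally, the header amounts to one byte followed by three big-endian longs. The following sketch parses just the HEADER; it is an illustration of the layout above, not the library's own parser, and the headerStart parameter reflects the fact (noted later) that the index need not begin at file position 0.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;

// Illustration of the index HEADER layout described above (not library code).
final class IndexHeaderSketch {

    final int offsetWidth;   // OFFSETWIDTH: width in bytes of each ENTRY word
    final long baseId;       // BASEID
    final long baseOffset;   // BASEOFFSET
    final long entryCount;   // ENTRYCOUNT

    private IndexHeaderSketch(int w, long id, long off, long count) {
        offsetWidth = w;
        baseId = id;
        baseOffset = off;
        entryCount = count;
    }

    // 'headerStart' is wherever the index begins in the file (not necessarily 0).
    static IndexHeaderSketch read(FileChannel index, long headerStart) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(1 + 3 * 8).order(ByteOrder.BIG_ENDIAN);
        while (buf.hasRemaining() && index.read(buf, headerStart + buf.position()) != -1)
            ;   // positional reads: fill 'buf' without touching the channel's position
        buf.flip();
        return new IndexHeaderSketch(buf.get() & 0xFF, buf.getLong(), buf.getLong(), buf.getLong());
    }
}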
The HEADER section of the index file is of fixed size; the ENTRIES section of the file is variable width. Given an entry ID id, the entry's offsets (into the contents file) are determined as follows (a code sketch of these steps follows the list):

1. Subtract the index's base ID (BASEID) from id to obtain its entry number entryNo in the index:

   entryNo = id - BASEID

2. Read the entryNo-th and (entryNo + 1)-th ENTRY and interpret these as signed integral words (big endian byte order, OFFSETWIDTH bytes wide). Let's call these 2 values o1 and o2, respectively.

3. Clear the high (deletion marker) bit of o2:

   o2 := o2 & 0x7F[<FF> ^ (OFFSETWIDTH - 1)]

   where 0x7F[<FF> ^ (OFFSETWIDTH - 1)] is understood to mean the high bit complement bit field of an OFFSETWIDTH-wide word.

4. The start and end offsets of entry id (in the contents file) are then determined by adding the index's base offset (BASEOFFSET) to o1 and o2:

   startOffset = BASEOFFSET + o1
   endOffset   = BASEOFFSET + o2
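Putting the steps above together, here is a standalone sketch of the offset arithmetic (again an illustration, not the library's code); it takes the HEADER fields as plain parameters and treats the high bit of o2 as the deletion marker, exactly as described.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Standalone illustration of the offset arithmetic described above.
class EntryOffsetsSketch {

    /**
     * Returns {startOffset, endOffset} of the given entry in the contents file.
     * 'entriesStart' is the file position of the ENTRIES section of the index;
     * the remaining parameters mirror the HEADER fields.
     */
    static long[] locate(FileChannel index, long entriesStart, int offsetWidth,
                         long baseId, long baseOffset, long id) throws IOException {

        long entryNo = id - baseId;                                 // step 1

        // step 2: read the entryNo-th and (entryNo + 1)-th ENTRY words
        long o1 = readWord(index, entriesStart + entryNo * offsetWidth, offsetWidth);
        long o2 = readWord(index, entriesStart + (entryNo + 1) * offsetWidth, offsetWidth);

        // step 3: clear the high (deletion marker) bit of o2
        long highBit = 1L << (offsetWidth * 8 - 1);
        o2 &= ~highBit;

        // step 4: add the index's base offset
        return new long[] { baseOffset + o1, baseOffset + o2 };
    }

    // Reads an OFFSETWIDTH-wide big-endian word at the given file position.
    private static long readWord(FileChannel ch, long pos, int width) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(width);
        while (buf.hasRemaining() && ch.read(buf, pos + buf.position()) != -1)
            ;
        buf.flip();
        long word = 0;
        while (buf.hasRemaining())
            word = (word << 8) | (buf.get() & 0xFF);
        return word;
    }
}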
I'm going to pile up implementation notes here.
BF: Beware of high expectations, as in "pile up" -- 8/18/08
The base implementation does not assume that the file representation of the index starts at the beginning of the file it is given. Instead, the implementation requires that the file be positioned at the start of the index before instantiation. This means you can pad the file with whatever you want. (Another index, if you're adventurous.)
Generally speaking, having code not assume that the file representation of an entity starts at the beginning of the file is a good thing. So, for example, we can later pad the index file with an implementation version number: the code continues to work if we first read the version number before constructing the index... you get the picture.
To the extent possible, the implementation avoids using an underlying FileChannel's position field. For ordinary segments, for example, the only times the [entry/content] file's position matters are at instantiation (as described above) and when the segment is appending another segment to itself. The point of this exercise, of course, is to support concurrent access.
Concurrent access to Segments

The file structure of a [base] segment (more specifically, its simple index) lends itself to a multiple-readers / single-writer model of concurrent access. Here's why.
The BaseSegment class is designed not to change state under a read operation. That class does not use FileChannel's position: it only ever uses java.nio.channels.FileChannel methods that do not affect the state of the underlying channel and that are marked suitable for concurrent access in the FileChannel API. So we [think we] know the whole contraption is safe for concurrent read access--as long as nobody's concurrently doing any writing.
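FileChannel's position-taking read is one such method; a reader along these lines (a sketch, not the library's code) never mutates shared channel state:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Sketch: positional reads leave the channel's position untouched, so many
// reader threads can safely share one FileChannel for read access.
class PositionalReadSketch {

    static ByteBuffer readRange(FileChannel contents, long start, long end) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate((int) (end - start));
        long pos = start;
        while (buf.hasRemaining()) {
            int n = contents.read(buf, pos);   // does NOT use or alter channel position
            if (n < 0)
                break;                         // unexpected end of file
            pos += n;
        }
        buf.flip();
        return buf;
    }
}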
Now let's consider the windows of vulnerability if we allow one writer concurrently with many readers. They are:

1. A reader reads an entry that the writer is concurrently deleting (Segment.delete(id, count, purge)). If the writer does not purge the deleted contents, then there is no race condition whatsoever; if the purge parameter was set to true, then the reader may read a corrupted entry; the data model, however, is not violated.

2. A reader reads the entry count while the writer is updating it, and may observe a partially written 8-byte value.
The second case (involving updating the entry count) is easy to fix, so long as we restrict "concurrent" to mean concurrent access from within the same process: all we need to do is maintain an in-memory copy of the count, protected by something like a memory barrier. The Java volatile keyword does the trick.
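A sketch of that fix (an illustrative class, not one of the library's): the single writer persists the count to the index file as before, but in-process readers consult a volatile copy, so they can never observe a torn 8-byte value.

// Illustrative sketch of the in-process fix described above.
class EntryCountHolder {

    private volatile long entryCount;   // safely published to reader threads

    long getEntryCount() {              // readers never touch the on-disk count
        return entryCount;
    }

    void publish(long newCount) {       // called by the single writer, last
        // ... after the entry bytes, offset word, and on-disk count are written ...
        entryCount = newCount;          // volatile write: the in-memory "keystone"
    }
}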
So, on closer inspection, as long as the writer doesn't immediately purge (overwrite) the entries it deletes (the default behavior, because it's faster), a segment does support multiple readers concurrent with a single writer.
The next level up on the concurrency model is to allow multiple concurrent writers and readers. TxnSegment is designed to address that scenario.