The Log Structured Store
12. The Log Structured Store
The log structured store (LSS) exports an abstraction of a record-oriented repository. Records can be variable sized and indexed by record number (remember VSAM anyone?). The LSS also supports compression of the record payload.
In this description of the LSS, we will often talk about the
application level. That refers to the
abstraction exported by the LSS module, and in this context the
application is usually the pstore. However,
because LSS is defined as it's own module,
rs.db.lss
, user programs can be built
directly on top of the LSS.
12.1. File Format
An LSS volume consists of a sequence of tagged LSS Records (not to be
confused with the record abstraction exported by LSS). On disk, each
record starts with a record header. The header corresponds to the
structure
LSSRecordHeader
in
lssv3.h and is comprised of four words (native
byte order?) and describes the contents of the record.
See Figure 10.
The magic indicates what type of LSS Record is present, and is one of
DSEG
,
DATA
,
GAP
,
VOLF
,
ZIPA
,
COMM
,
*EOF
,
INDX
,
MIDX
, or
\LSS
.
Section 12.2
[Link to subsequent sect2 on details]
The recnum is used in different ways depending on the record type. For DATA records, it is the record number that is exposed to the client application (i.e., the pstore).
Figure 10. LSS Record Format ![]()
Figure 11. LSS File Format ![]()
An addressable offset in an LSS volume is a multiple of 16 bytes. A pointer to an LSS record is a 32-bit quantity. The top 4 bits denote the volume number (although an LSS repository may contain only 10 volumes, 0-9) and the low 28 bits specify the granule within the volume (each granule is 16 bytes). Hence, each volume may be up to 4GB in size and the total addressable disk space in LSS is 40GB.
12.2. Record Types
12.2.1.
DSEG
: Data Segment
A data segment record summarizes the records written since the previous data segment or EOF record. The length field is the total number of bytes in the data segment including the DSEG record.
The
DSEG
record was originally
designed to support traversing the LSS in the backwards direction, in
case a commit record was not found at the EOF. This works because a
GAP
record is inserted just before a DSEG to ensure
that the
DSEG
appears in the last granule of a
512-byte IO block. Hence, in the backwards direction, you see a
sequence of
*EOF
and
DSEG
records. However, this is not currently done because it was found to
be unreliable (probably because since a
DSEG
spans
multiple IO blocks, we could not be sure that the last IO block would
reliably contain a
DSEG
. It could be the middle of
some payload that happens to look like a
DSEG
. Of
course, that could be true for
*EOF
, too...)
12.2.2.
DATA
: Data Record
A data record stores the payload of an application-level record (e.g., a heap page from the pstore or an indirect page descriptor). The low 4 bits of the space word identifies the compression algorithm used to compress the payload. The high 28 bits is the number of granules occupied by the data record, not counting the record header (which, note, is 1 granule in size).
12.2.3.
GAP
: Granule Gap
A
GAP
record is used to provide padding so that
DSEG
and
*EOF
records can appear
in the last granule of an IO block.
12.2.4.
VOLF
: Volume File Name
A
VOLF
record contains the path to
other volumes. Although a
VOLF
is present
to describe the volume in which it appears, it is not used.
12.2.5.
ZIPA
: Compression Algorithm
A
ZIPA
record contains the string name
of a compression algorithm.
12.2.6.
COMM
: Commit Record
A
COMM
record represents a commit point
and stores all the information about the commit, including
a pointer to the master index record (
MIDX
)
which is used to map application-level record numbers to
LSS pointers. See Section 12.3 for
more details.
12.2.7.
*EOF
: End of File Marker
An
*EOF
record marks the end of the file. An
*EOF
record always appears at the end of an IO
block that also contains a commit record (
COMM
).
The presence of an
*EOF
record at the
end of the last IO block in a volume is how the LSS quickly finds the
most recent commit record in a volume. If the EOF record is not
present, then the fallback is to scan forward from the last known
commit record (as recorded in the volume header,
\LSS
) until a new commit record is found.
12.2.8.
INDX
: Index Record
An
INDX
record is a bucket in the hash
table which maps (application level) record numbers to
LSS record pointers. These records are loaded lazily into
IndexEntry
structures in memory,
where it is called a cache line.
12.2.9.
MIDX
: Master Index Record
A
MIDX
record contains the contents
of a hash table pointing to
INDX
records.
12.2.10.
\LSS
: Volume Header
A
\LSS
record appears only at the beginning
of a volume file, and describes the volume as a whole.
The payload of the
\LSS
record corresponds to the
LSSVolumeHeader
structure, and is
mostly constant except for the
last_cr_at
and
last_cr_generation
fields which
are updated occasionally to point to a recent commit record.
12.3. The LSS Commit Record
All the goodies... Metadata (generation #, version#, timestamp), bookmarks, diff list, midx pointer, prev CR ptr, VH fuel
12.4. LSS Startup Processing
How we find a commit record... how we find... (1) the last, if no gen# is specified, (2) a bookmarked one, (3) some other one (if gen# is specified), (4) one if EOF is messed up, (5) if there are multiple volumes
12.5. Commit Point Bookmarks
The LSS supports a concept of bookmarking particular checkpoints. The commit record maintains direct pointers to each of the bookmarked commit records in addition to the previous commit record.
Bookmarks are intended to support being able to easily roll back to particular significant points in history. For example, bookmark #1 is used by the pstore's online garbage collector to represent the snapshot against which the persistent traversal is taking place (since it employs a snapshot-at-beginning strategy, and multiple commits may happen between when the garbage collection cycle starts and when it completes).
12.6. Compression Algorithms
The payload of data records are compressed using one of
16 defined compression algorithms. The commit record contains
a table of 15 pointers to the compression algorithm descriptors,
which are
ZIPA
records. Algorithm 0 is always
the null algorith which denotes no compression.
12.7. Command-Line Tools
lssctl
13. How the PStore Uses the LSS
The persistent store stores three kinds of records in the underlying LSS.
-
Root. The root record, which is always record number 0 (zero), stores information about the pstore as a whole. The contents correspond to the structure
RStoreCommitInfo. -
Indirect Page Data. Certain kinds of indirect pages can contain information used to elaborate the actual objects at runtime. These record numbers are in the range
#x100-#xFFFFFF. -
Heap Page Data. The contents of each page of the persistent heap are stored in records. Each record corresponds to one page, starting at record
#x2000000.
13.1. Indirect Page Data
Currently, for example, symbols that are referenced by the store use indirect page data to store the string repr of the symbol. When a indirect page is referenced by a page that is being loaded,
13.2. Root Record
The contents of the root record (record number 0)
correspond to the structure
RStoreCommitInfo
in rstoret.h
and it basically just stores a pointer to the default, or "main",
allocation area.
13.3. Heap Page Data
Each page of data in the persistent heap is stored in it's own
record in the underlying LSS. Record numbers for heap pages
start at
#x2000000
(actually, in older
versions of RScheme they started at
#x1000000
[XXX which versions]).
Unlike in Texas, heap page data is explicitly serialized into the record, instead of being written directly to the underlying storage (with only the pointer rewriting transform). The serialized data is in turn compressed by LSS before being written to the disk. (See Figure 13)
The serialized data is a stream of words (sent to the LRU high-level compression model), and looks like:
Figure 12. Serialized Page Data ![]()
Figure 13. Flow of Data Between Memory Pages and Disk ![]()