The Log Structured Store

12. The Log Structured Store

The log structured store (LSS) exports an abstraction of a record-oriented repository. Records can be variable sized and indexed by record number (remember VSAM anyone?). The LSS also supports compression of the record payload.

In this description of the LSS, we will often talk about the application level. That refers to the abstraction exported by the LSS module, and in this context the application is usually the pstore. However, because LSS is defined as it's own module, rs.db.lss , user programs can be built directly on top of the LSS.

12.1. File Format

An LSS volume consists of a sequence of tagged LSS Records (not to be confused with the record abstraction exported by LSS). On disk, each record starts with a record header. The header corresponds to the structure LSSRecordHeader in lssv3.h and is comprised of four words (native byte order?) and describes the contents of the record. See Figure 10.

The magic indicates what type of LSS Record is present, and is one of DSEG , DATA , GAP , VOLF , ZIPA , COMM , *EOF , INDX , MIDX , or \LSS . Section 12.2 [Link to subsequent sect2 on details]

The recnum is used in different ways depending on the record type. For DATA records, it is the record number that is exposed to the client application (i.e., the pstore).

Figure 10. LSS Record Format
[General LSS Record Layout]
Figure 11. LSS File Format
[Example LSS File Layout]

An addressable offset in an LSS volume is a multiple of 16 bytes. A pointer to an LSS record is a 32-bit quantity. The top 4 bits denote the volume number (although an LSS repository may contain only 10 volumes, 0-9) and the low 28 bits specify the granule within the volume (each granule is 16 bytes). Hence, each volume may be up to 4GB in size and the total addressable disk space in LSS is 40GB.

12.2. Record Types

12.2.1. DSEG : Data Segment

A data segment record summarizes the records written since the previous data segment or EOF record. The length field is the total number of bytes in the data segment including the DSEG record.

The DSEG record was originally designed to support traversing the LSS in the backwards direction, in case a commit record was not found at the EOF. This works because a GAP record is inserted just before a DSEG to ensure that the DSEG appears in the last granule of a 512-byte IO block. Hence, in the backwards direction, you see a sequence of *EOF and DSEG records. However, this is not currently done because it was found to be unreliable (probably because since a DSEG spans multiple IO blocks, we could not be sure that the last IO block would reliably contain a DSEG . It could be the middle of some payload that happens to look like a DSEG . Of course, that could be true for *EOF , too...)

12.2.2. DATA : Data Record

A data record stores the payload of an application-level record (e.g., a heap page from the pstore or an indirect page descriptor). The low 4 bits of the space word identifies the compression algorithm used to compress the payload. The high 28 bits is the number of granules occupied by the data record, not counting the record header (which, note, is 1 granule in size).

12.2.3. GAP : Granule Gap

A GAP record is used to provide padding so that DSEG and *EOF records can appear in the last granule of an IO block.

12.2.4. VOLF : Volume File Name

A VOLF record contains the path to other volumes. Although a VOLF is present to describe the volume in which it appears, it is not used.

12.2.5. ZIPA : Compression Algorithm

A ZIPA record contains the string name of a compression algorithm.

12.2.6. COMM : Commit Record

A COMM record represents a commit point and stores all the information about the commit, including a pointer to the master index record ( MIDX ) which is used to map application-level record numbers to LSS pointers. See Section 12.3 for more details.

12.2.7. *EOF : End of File Marker

An *EOF record marks the end of the file. An *EOF record always appears at the end of an IO block that also contains a commit record ( COMM ).

The presence of an *EOF record at the end of the last IO block in a volume is how the LSS quickly finds the most recent commit record in a volume. If the EOF record is not present, then the fallback is to scan forward from the last known commit record (as recorded in the volume header, \LSS ) until a new commit record is found.

12.2.8. INDX : Index Record

An INDX record is a bucket in the hash table which maps (application level) record numbers to LSS record pointers. These records are loaded lazily into IndexEntry structures in memory, where it is called a cache line.

12.2.9. MIDX : Master Index Record

A MIDX record contains the contents of a hash table pointing to INDX records.

12.2.10. \LSS : Volume Header

A \LSS record appears only at the beginning of a volume file, and describes the volume as a whole. The payload of the \LSS record corresponds to the LSSVolumeHeader structure, and is mostly constant except for the last_cr_at and last_cr_generation fields which are updated occasionally to point to a recent commit record.

12.3. The LSS Commit Record

All the goodies... Metadata (generation #, version#, timestamp), bookmarks, diff list, midx pointer, prev CR ptr, VH fuel

12.4. LSS Startup Processing

How we find a commit record... how we find... (1) the last, if no gen# is specified, (2) a bookmarked one, (3) some other one (if gen# is specified), (4) one if EOF is messed up, (5) if there are multiple volumes

12.5. Commit Point Bookmarks

The LSS supports a concept of bookmarking particular checkpoints. The commit record maintains direct pointers to each of the bookmarked commit records in addition to the previous commit record.

Bookmarks are intended to support being able to easily roll back to particular significant points in history. For example, bookmark #1 is used by the pstore's online garbage collector to represent the snapshot against which the persistent traversal is taking place (since it employs a snapshot-at-beginning strategy, and multiple commits may happen between when the garbage collection cycle starts and when it completes).

12.6. Compression Algorithms

The payload of data records are compressed using one of 16 defined compression algorithms. The commit record contains a table of 15 pointers to the compression algorithm descriptors, which are ZIPA records. Algorithm 0 is always the null algorith which denotes no compression.

12.7. Command-Line Tools

lssctl

13. How the PStore Uses the LSS

The persistent store stores three kinds of records in the underlying LSS.

  1. Root. The root record, which is always record number 0 (zero), stores information about the pstore as a whole. The contents correspond to the structure RStoreCommitInfo .

  2. Indirect Page Data. Certain kinds of indirect pages can contain information used to elaborate the actual objects at runtime. These record numbers are in the range #x100 - #xFFFFFF .

  3. Heap Page Data. The contents of each page of the persistent heap are stored in records. Each record corresponds to one page, starting at record #x2000000 .

13.1. Indirect Page Data

Currently, for example, symbols that are referenced by the store use indirect page data to store the string repr of the symbol. When a indirect page is referenced by a page that is being loaded,

13.2. Root Record

The contents of the root record (record number 0) correspond to the structure RStoreCommitInfo in rstoret.h and it basically just stores a pointer to the default, or "main", allocation area.

13.3. Heap Page Data

Each page of data in the persistent heap is stored in it's own record in the underlying LSS. Record numbers for heap pages start at #x2000000 (actually, in older versions of RScheme they started at #x1000000 [XXX which versions]).

Unlike in Texas, heap page data is explicitly serialized into the record, instead of being written directly to the underlying storage (with only the pointer rewriting transform). The serialized data is in turn compressed by LSS before being written to the disk. (See Figure 13)

The serialized data is a stream of words (sent to the LRU high-level compression model), and looks like:

Figure 12. Serialized Page Data
[Serialized Page Data template]
Figure 13. Flow of Data Between Memory Pages and Disk
[Page] --words-- (serializer) --words-- (LRU model) --symbols-- (libz) --bytes-- [disk]
Post feedback about this page... Last modified 2012-12-23 13:23:51 UTC (1.79e+03 d ago)