The Persistent Object Store

Table of Contents

  1. Introduction
  2. Pointer Swizzling at Page-Fault Time
  3. Pivots
  4. The Page Lifecycle: How Objects Get Loaded
  5. Reachability Based Persistence: What Happens During Commit
  6. Closing a PStore
  7. Performance Issues
  8. Interaction with Transient Heap
  9. System Backups
  10. Garbage Collection
  11. Programmatic Interface
  12. The Log Structured Store
  13. How the PStore Uses the LSS
  14. Implementation Details
  15. Bibliography
  16. Aux Doc Notes

1. Introduction

RScheme's persistent object store module, located in rs.db.rstore, allows objects in persistent storage (i.e., on disk) to be mapped into memory for direct manipulation by an application program. The approach is based on pointer swizzling at page-fault time, as described in Paul Wilson's paper Pointer Swizzling at Page-Fault Time.
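
As a minimal sketch of the intended usage pattern: the names open-pstore and root-object below are illustrative placeholders rather than confirmed bindings of rs.db.rstore, while pstore-commit is the commit entry point used later in this document.

;; open-pstore and root-object are illustrative placeholder names.
(define *pstore* (open-pstore "app.sto"))   ; map the store's pages lazily

;; The root comes back as an ordinary in-memory value; once its page
;; is loaded, it is traversed like any other heap data.
(define *root* (root-object *pstore*))

;; Changes to loaded objects are captured at the next commit.
(pstore-commit *pstore*)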

2. Pointer Swizzling at Page-Fault Time

Pointer swizzling at page-fault time incrementally reserves additional virtual memory pages as objects from the store are loaded into memory. The incremental nature allows the system to access databases that are larger than the virtual memory of the machine (although not all at once: the working set of objects used by the application still has to fit within virtual memory).

Furthermore, pointers are translated when a page is loaded, and from then on are represented as normal memory pointers. Hence, the application can traverse the loaded objects with no additional runtime performance penalty after the access that loads the object into memory.
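
For example, once a persistent list hanging off the root has been faulted in, it can be walked with ordinary car and cdr, with no read barrier. The sketch below assumes root-object is a placeholder for whatever procedure returns the store's root.

(define (sum-persistent-list pstore)
  ;; root-object is an illustrative placeholder.  The first access may
  ;; fault pages in; after that, car/cdr are ordinary pointer
  ;; dereferences with no extra cost.
  (let loop ((l (root-object pstore))
             (acc 0))
    (if (null? l)
        acc
        (loop (cdr l) (+ acc (car l))))))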

3. Pivots

Pivots are pointers to "well-known" objects in the transient heap.

3.1. Builtin Pivots

The builtin pivots include symbols (which are dynamic builtin pivots) and most of the standard class objects.

3.2. Application Pivots

Application class objects. What happens if you miss one... classes... procedures (make-table)...

4. The Page Lifecycle: How Objects Get Loaded

VM pages that are part of the pstore subsystem are either unused, reserved, loaded, or dirty.

When a new pstore is opened, a virtual memory page is reserved to provide a mapping for the root object. With that mapping in hand, the persistent address for the root object can be returned to the application as a virtual memory address. For example, suppose that the persistent address of the root object is {page 201, offset 3}. (These page numbers and offsets are illustrative and not typical of what you would actually find. In practice, persistent page numbers start at 33554432 (#x2000000) and transient pages are often mapped up near #x40000000 on Linux. Offsets are also typically larger and more widely spaced, according to the memory requirements of the objects in question.)

Figure 1. Pages In Memory
[Six virtual memory pages, 2 loaded, 2 reserved, 2 unused]

Read faulting moves the targeted page from reserved to loaded (and moves newly referenced pages from unused to reserved), write faulting moves a page from loaded to dirty, and commit moves dirty pages back to loaded.
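
These transitions can be summarized as a small state machine. The sketch below is a conceptual model only, not actual pstore code.

(define (next-page-state state event)
  ;; Conceptual model of the page lifecycle.
  (case event
    ((reference)   (if (eq? state 'unused)   'reserved state))  ; first reference seen
    ((read-fault)  (if (eq? state 'reserved) 'loaded   state))  ; contents swizzled in
    ((write-fault) (if (eq? state 'loaded)   'dirty    state))  ; page now needs committing
    ((commit)      (if (eq? state 'dirty)    'loaded   state))  ; written back to the LSS
    (else state)))

;; (next-page-state 'reserved 'read-fault) => loaded
;; (next-page-state 'loaded 'write-fault)  => dirty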

5. Reachability Based Persistence: What Happens During Commit

High-level walk-through, with pointers to details; a conceptual sketch of the ordering follows the list. [Dirty pages are scanned, pointers swizzled, objects are copied.]

  • Dirty page scan. For each dirty page in memory, an attempt is made to write it to the underlying LSS. To do so, the page is scanned to convert transient pointers into persistent pointers. If no pointers outside the store are found, then the page is serialized and written to the appropriate LSS record.

  • Write Record 0. Once all the pages have been successfully written to the LSS, the root record (record 0) is written to the LSS.

  • LSS Commit. The LSS subsystem is committed to disk.

  • Reset transient state. After the LSS commits, various bits of transient state (such as the intergenerational pointer list) are reset.

  • Online GC work. If enabled, the online persistent garbage collection is given a chance to do some work.
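
The overall ordering can be sketched as follows. Every helper named here is a stand-in for internal pstore machinery, not an actual entry point; each stub simply names its step.

;; Stand-ins for internal machinery; each just names its step.
(define (scan-and-write-dirty-pages! ps) 'all-dirty-pages-written)
(define (write-root-record! ps)          'record-0-written)
(define (lss-commit! ps)                 'lss-synced-to-disk)
(define (reset-transient-state! ps)      'igp-list-cleared)
(define (online-gc-step! ps)             'gc-made-progress)

(define (conceptual-commit ps)
  (scan-and-write-dirty-pages! ps)  ; repeated until no external pointers remain
  (write-root-record! ps)           ; record 0
  (lss-commit! ps)                  ; the durable point
  (reset-transient-state! ps)
  (online-gc-step! ps))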

5.1. Dirty Page Scan

Pages are scanned in two phases. The first phase translates transient pointers into persistent pointers (specifically, the pstore's page table entries associated with referenced pages are looked up). In the process, it becomes obvious whether there are any references to objects that are not in the store. If all references are to pages in the store, then the second phase commences, which is to open the underlying LSS record for write and serialize the data into it.

The second phase can reuse most of the translation work done by the first phase because during the first phase the translations are stored in a temporary buffer.

During the second phase, any object marked as having intergenerational pointers is unmarked. This works because, if the second phase is executing, all references on the page must be to persistent objects (or to pivots, which are kept live by their presence in the pstore object's pivot table).
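
The control flow of the two-phase scan for a single dirty page can be sketched as follows, under the simplifying assumption that a page is just a vector of references and that lookup-persistent-addr consults the pstore's page table, returning #f for a reference that leads outside the store. The real code works on raw heap pages; this only models the control flow.

(define (scan-dirty-page page lookup-persistent-addr serialize!)
  ;; Phase 1: translate every transient pointer, remembering the
  ;; results in a temporary buffer so phase 2 can reuse them.
  (let* ((n (vector-length page))
         (translated (make-vector n #f))
         (all-in-store?
          (let loop ((i 0))
            (if (= i n)
                #t
                (let ((pp (lookup-persistent-addr (vector-ref page i))))
                  (if pp
                      (begin
                        (vector-set! translated i pp)
                        (loop (+ i 1)))
                      #f))))))              ; found an external reference
    ;; Phase 2: only if every reference is already in the store, open
    ;; the LSS record and serialize the translated data into it.
    (if all-in-store?
        (serialize! translated)
        'external-pointers-found)))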

5.2. Thread System Interaction

In general, the system does not lock out other threads during commit processing. A thread switch may happen while the pstore is handling a dirty page that was found to have non-pstore pointers. If another thread runs during that time and writes into the store, then when the pstore thread wakes back up it will traverse that page as well (if the page had already been written out, it would now be marked dirty again and would be rescanned). Hence, all other thread activity against the pstore is incorporated into the commit point up until the moment (which is determined atomically) at which all dirty pages can be written out without finding any external pointers.

In general, since the pstore thread is copying objects into the pstore during the dirty page scan, it can be difficult for other threads to get a consistent view. At any thread switch point, an object being stored into the pstore by another ("mutator") thread may lose object identity with the one in the pstore.

For example, suppose there are two threads, A and B. A calls pstore-commit. Meanwhile, B is writing into the store like so:

(define (b-thread-activity)
  (let ((x (list 1 2 3)))
    (store-in-pstore-index 'my-list x)                  ; ①
    (assert (eq? x (get-from-pstore-index 'my-list)))   ; ②
    (set-car! x 5)))                                    ; ③

The assertion ② may or may not succeed. If a thread switch occurred at ① and the page containing the pointer to the list (1 2 3) was written out (thereby copying the list into the store), then ② will fail because retrieving the index pointer will return the new, copied-into-the-store pointer which will be different from the transient heap pointer that is the value of the x binding.

Even if the assertion ② succeeds, by the time the program reaches ③, another thread switch may have occurred, in which case updating the car of the list x may be moot if the list has already been copied to the store.

If multiple threads are writing to the store in the presence of an asynchronous commit (which is a weird situation anyway: what is the meaning of a consistent state when things are being updated asynchronously?), then the program must ensure that side effects are happening only to persisted objects. The above fragment could be rewritten as follows:

(define (copy->ps x)
  (copy-into-pstore 
    (default-allocation-area *pstore*)
    x))

(define (b-thread-activity)
  (let ((x (copy->ps (list 1 2 3))))
    (store-in-pstore-index 'my-list x)
    (set-car! x 5)))

By ensuring that the list is persistent before using it, the program can be sure that it isn't accessing the transient copy. [Although note: copy-into-pstore is not recursive. Only the first pair gets eagerly persisted in this example.]
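
Because copy-into-pstore is not recursive, a program that wants the whole list eagerly persisted has to walk the structure itself. Below is a minimal sketch for proper lists of immediate elements, reusing copy->ps from the fragment above; deeper or shared structure would need a more general copy.

(define (list->ps lst)
  ;; Copy each pair into the pstore, deepest tail first, so every cdr
  ;; already points at a persistent pair.  Elements are assumed to be
  ;; immediates (numbers, booleans, ...) or already-persistent objects.
  (if (null? lst)
      '()
      (copy->ps (cons (car lst) (list->ps (cdr lst))))))

;; (store-in-pstore-index 'my-list (list->ps (list 1 2 3)))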

6. Closing a PStore

[Where does this section belong?]

7. Performance Issues

There are many performance issues to consider...

Although no additional runtime penalty is incurred when accessing (reading) an object that has been mapped into memory, there are fault penalties associated with (1) writing to an object on a page that is clean, (2) reading an object on a non-loaded page that has a lot of pointers, ...

8. Interaction with Transient Heap

The pstore looks like the "oldest generation" to the transient heap's collector. Intergenerational pointers are tracked until a commit, then flushed.

9. System Backups

Compared to most system files, the underlying log-structured store has a peculiar behavior. From the filesystem's point of view, the LSS consists of a small number of big files, one of which (the "tip" volume) is frequently appended to. If properly arranged, this can mesh nicely with the system's underlying backup facility. Otherwise, it can thrash the backup system.

(The idea is to manage multiple volumes just right: compact the youngest volume just before the backup runs, to minimize the footprint of the youngest volume in the backups. Alternatively, compact just after the backup, to maximize the data captured without having touched the next-oldest generation. This concept generalizes to multiple generations.)
