The Persistent Object Store
Table of Contents
- Introduction
- Pointer Swizzling at Page-Fault Time
- Pivots
- The Page Lifecycle: How Objects Get Loaded
- Reachability Based Persistence: What Happens During Commit
- Closing a PStore
- Performance Issues
- Interaction with Transient Heap
- System Backups
- Garbage Collection
- Programmatic Interface
- The Log Structured Store
- How the PStore Uses the LSS
- Implementation Details
- Bibliography
- Aux Doc Notes
1. Introduction
RScheme's persistent object store module, located in
rs.db.rstore
, is a system that allows
objects in persistent storage (i.e., on disk) to be mapped
into memory for direct manipulation by an application
program. The approach
is based on the
concept of pointer swizzling at page-fault time as described in Paul
Wilson's Pointer
Swizzling at Page-Fault Time.
2. Pointer Swizzling at Page-Fault Time
Pointer swizzling at page-fault time incrementally reserves additional virtual memory pages as objects from the store are loaded into memory. The incremental nature allows the system to access databases which are larger than the virtual memory of the machine (although not all at once. The working set of objects used by the application has to fit within virtual memory).
Furthermore, pointers are translated when a page is loaded, and from then on are represented as normal memory pointers. Hence, the application can traverse the loaded objects with no additional runtime performance penalty after the access that loads the object into memory.
3. Pivots
Pointers to "well-known" objects in transient land.
3.1. Builtin Pivots
Symbols (dynamic builtin pivots), most standard class objects.
3.2. Application Pivots
Application class objects. What happens if you miss one... classes... procedures (make-table)...
4. The Page Lifecycle: How Objects Get Loaded
VM pages that are part of the pstore subsystem are either
unused
,
reserved
,
loaded
, or
dirty
.
When a new pstore is opened, a virtual memory page is is reserved to provide a mapping for the root object. With that mapping in hand, the persistent address for the root object can be returned to the application as a virtual memory address. For example, suppose that the persistent address of the root object is {page 201, offset 3} (* These page numbers and offsets are illustrative and not typical of what you would actually find. In practice, persistent page numbers start at 33554432 (#x2000000) and transient pages are often mapped up near #x40000000 on Linux. Offsets are also typically larger and more widely spaced (according to the memory requirements of the objects in question))
Figure 1. Pages In Memory ![]()
Read faulting causes reserved->loaded of the targeted page and unused->reserved of newly referenced pages, write faulting causes loaded->dirty, and commit causes dirty->loaded.
5. Reachability Based Persistence: What Happens During Commit
High level walk through, with pointers to details. [Dirty pages are scanned, pointers swizzled, objects are copied.]
-
Dirty page scan. For each dirty page in memory, an attempt is made to write it to the underlying LSS. To do so, the page is scanned to convert transient pointers into persistent pointers. If no pointers outside the store are found, then the page is serialized and written to the apropriate LSS record.
-
Write Record 0. Once all the pages have been successfully written to the LSS, the root record (record 0) is written to the LSS.
-
LSS Commit. The LSS subsystem is committed to disk.
-
Reset transient state. After the LSS commits, various bits of transient state (such as the intergenerational pointer list) are reset.
-
Online GC work. If enabled, the online persistent garbage collection is given a chance to do some work.
5.1. Dirty Page Scan
Pages are scanned in two phases. The first phase translates transient pointers into persistent pointers (specifically, the pstore's page table entries associated with referenced pages are looked up). In the process, it becomes obvious whether there are any references to objects that are not in the store. If all references are to pages in the store, then the second phase commences, which is to open the underlying LSS record for write and serialize the data into it.
The second phase can reuse most of the translation work done by the first phase because during the first phase the translations are stored in a temporary buffer.
During the second phase, any object marked as having intergenerational pointers is unmarked. This works because if the second phase is executing, all references on the page must be to persistent objects (or pivots, which are kept live by the presence in the pstore object's pivot table).
5.2. Thread System Interaction
In general, the system does not lock out other threads during commit processing. A thread switch may happen during commit processing while the pstore is handling a dirty page that was found to have non-pstore pointers. If during that time another thread runs and it writes into the store, then when the pstore thread wakes back up it will traverse that page as well (if the page had been written out, then it would now be marked as dirty and would be rescanned). Hence, all other thread activity against the pstore will be incorporated into the commit point until the moment (which is determined atomically) that all dirty pages can be written out without finding any external pointers.
In general, since the pstore thread is copying objects into the pstore during the dirty page scan, it can be difficult for other threads to get a consistent view. At any thread switch point, an object being stored into the pstore by another ("mutator") thread may lose object identity with the one in the pstore.
For example, suppose there are two threads, A and B. A calls pstore-commit. Meanwhile, B is writing into the store like so:
(define (b-thread-activity)
(let ((x (list 1 2 3)))
(store-in-pstore-index 'my-list x)①
(assert (eq? x (get-from-pstore-index 'my-list)))②
③(set-car! x 5)))
The assertion ② may or may not succeed. If
a thread switch occurred at ① and the page
containing the pointer to the list
(1 2 3)
was written out (thereby copying the list into the store), then
② will fail because retrieving the index
pointer will return the new, copied-into-the-store pointer which
will be different from the transient heap pointer that is
the value of the x
binding.
Even if the assertion ② succeeds, by the
time the program reaches ③, another thread
switch may have occurred, in which case updating the car of the
list x may be moot if the list has already
been copied to the store.
If multiple threads are writing to the store in the presence of an asynchronous commit (which is a wierd situation anyway: what is the meaning of a consistent state when things are being updated asynchronously?), then the program must ensure that side effects are happening only to persisted objects. The above fragment could be rewritten to as follows:
(define (copy->ps x)
(copy-into-pstore
(default-allocation-area *pstore*)
x))
(define (b-thread-activity)
(let ((x (copy->ps (list 1 2 3))))
(store-in-pstore-index 'my-list x)
(set-car! x 5)))
By ensuring that the list is persistent before using it,
the program can be sure that it isn't accessing the transient
copy.
[Although note: copy-into-pstore
is not recursive. Only the first pair gets eagerly persisted
in this example.]
6. Closing a PStore
[Where does this section belong?]
7. Performance Issues
There are many performance issues to consider...
Although no additional runtime penalty is incurred when accessing (reading) an object that has been mapped into memory, there are fault penalties associated with (1) writing to an object on a page that is clean, (2) reading an object on a non-loaded page that has a lot of pointers, ...
8. Interaction with Transient Heap
PStore looks like the "oldest generation". Inter-generational pointers are tracked until a commit, then flushed.
9. System Backups
Compared to most system files, the underlying log-structured store has a peculiar behavior. From the filesystem's point of view, the LSS consists of a small number of big files, one of which (the "tip" volume) is frequently appended to. If properly arranged, this can mesh nicely with the system's underlying backup facility. Otherwise, it can thrash the backup system.
(The idea is to manage multiple volumes just right: compact the youngest volume just before the backup goes to minimize the footprint of the youngest volume in backups. Alternatively, do so after to maximize the data capture without having touched the next-oldest generation. This concept generalizes to multiple generations.)