This note describes a new saved workspace format and serialization mechanism that has been added to R. This format is now the default save format in the development version of R. The save function now takes a version argument; specifying version = 2 causes the new format to be used, and version = 1 requests the format used through R 1.3.1.
The main reason a new format is needed is to support name spaces. While the details of name spaces are not yet complete, the following seem clear at this point:
When save encounters a registered name space while saving a node graph, it should not save the contents of the name space. Instead, it should record identifying information for the name space in the saved work space.

When load finds a reference to a name space in a work space it is loading, it can use this identifying information to locate the name space, either within a table of loaded name spaces or by searching for and then loading an appropriate package.

The save mechanism needs to be modified to support this handling of name spaces.
Since a revision of the save format seems necessary to support name spaces, there are some other issues that can be addressed at the same time:
NA string values should be preserved on saving; currently they are not.
Strings with embedded null characters are not preserved properly. Much of R assumes that strlen gives the correct string length, so strings with embedded nulls are not of much use, but it would be nice to preserve them correctly in case this changes.
Sharing of reference objects is not preserved. If x and y refer to the same external pointer object and are saved, then x and y will refer to different external pointer objects after loading them from a saved work space. This is not a major issue for standard uses, since both external pointers and weak references are only meaningful within a session. But it may cause problems for non-standard uses.
The format is fairly verbose: fields that are NULL are always written out, and several items that need only one or a few bits are each written as full integers. This is not an issue for saving data, which usually involves few nodes and large vector data segments. But it is a problem for work spaces containing mostly code, such as those created by installing packages with a --save option.
The library code tries to fix up the environments of closures so that package environments are not saved. Despite this it is possible, though very unlikely, that a package environment can get saved. It might be useful to prevent this in the save code by handling packages along lines similar to name spaces.
Currently the save function is responsible both for serializing an object, i.e. turning it into a stream of bytes, and for writing the serialized stream to a file. It would be useful to separate the two. This would allow sending serialized objects over a socket connection, saving to a compressed file, or using some other form of external storage such as a data base of some sort.
In the current format for dotted pairs, the ATTRIB field is written after the CDR field. By writing the CDR last it is easier to arrange for a non-recursive algorithm to be used for both reading and writing.
The new serialization approach attempts to address some of these issues. In looking at the new code there are two major questions that need to be answered:
To take full advantage of the options offered it would be useful to
have a way for users to customize the action of save.image
and the
way a work space is saved on exit. This would allow users to request,
for example, that their work space be compressed or perhaps that it be
stored in a data base. This has not been addressed yet.
The only direct R interface provided in the core distribution for now
is through the save
function: calling it with version=2
produces a work space in the new format.
An experimental interface is available as a package, serialize, with two functions, serialize and unserialize.
The usage is
<serialize package usage>=
serialize(object, connection, ascii = FALSE, refhook = NULL)
unserialize(connection, refhook = NULL)

Defines serialize, unserialize.
The connection argument to serialize can be an open connection or
NULL
. If it is NULL
, then object
is serialized to a
string and that string is returned as the value of serialize
.
Otherwise object
is serialized to the connection and NULL
is
returned. The connection
argument to unserialize
can be an
open connection or a string.
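As a minimal illustration of the two calling forms (the file name sertmp2 here is used only for this sketch):

x <- list(1, 2, 3)
# serialize to an open binary connection; the bytes go to the connection
con <- file("sertmp2", "wb")
serialize(x, con)
close(con)
# serialize to a string by passing NULL as the connection
s <- serialize(x, NULL)
identical(unserialize(s), x)   # TRUE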
This interface may need some adjustment. It would be nice if we could use only proper connections, but the current version of text connections, in particular text output connections, doesn't seem quite adequate.
In terms of this interface,
<R session>= [D->]
save(list = mylist, file = "myRData", version = 2)

corresponds to

<save equivalent using serialize package>=
conn <- file("myRData", "wb")
writeChar("RDX2\n", conn, eos = NULL)
serialize(mylist, conn)
close(conn)
and loading this work space with load
corresponds to
<load equivalent using serialize package>=
conn <- file("myRData", "rb")
if (readChar(conn, 5) == "RDX2\n") {
    val <- unserialize(conn)
    names <- names(val)
    for (i in seq(along = names))
        assign(names[i], val[[i]], envir = envir)
} else ...
close(conn)
The refhook
functions can be used to customize the handling of
non-system reference objects (all external pointers, all weak
references, and all environments except .GlobalEnv
, name spaces,
and package environments). If serialize
is provided with a
refhook
function, then that function is called with each reference
object before that object is written. If refhook
returns NULL
then the object is written in the usual way. Otherwise, refhook
must return a character vector, and the character strings are saved
(only the strings, no names or other attributes). If a serialization
contains a value produced by a refhook
, then it must be
unserialized with a corresponding refhook
. The unserialize
hook is called with the reconstructed character vector as its
argument, and should return whatever object the character vector
indicates. Some examples of using this mechanism are given in
Section [->].
The pstream portion of the names used in this interface is meant to be short for persistent stream.
The C interface uses structures to represent input and output serialization streams.
<serialization stream declarations>= [D->]
typedef struct R_outpstream_st {
    ...
} *R_outpstream_t;

typedef struct R_inpstream_st {
    ...
} *R_inpstream_t;

Defines R_inpstream_st, R_inpstream_t, R_outpstream_st, R_outpstream_t.
The structure declarations are in Rinternals.h
, and user code must
provide storage for the structure, but it should not be used directly
as its contents might change. Instead, it should be initialized by
one of the initialization functions.
Two sets of higher level initialization functions are provided. One
allows writing to an open FILE *
pointer; this is used by the
internal save
code:
<serialization stream declarations>+= [<-D->]
void R_InitFileInPStream(R_inpstream_t stream, FILE *fp,
                         R_pstream_format_t type,
                         SEXP (*phook)(SEXP, SEXP), SEXP pdata);
void R_InitFileOutPStream(R_outpstream_t stream, FILE *fp,
                          R_pstream_format_t type, int version,
                          SEXP (*phook)(SEXP, SEXP), SEXP pdata);

Defines R_InitFileInPStream, R_InitFileOutPStream.
Stream formats are specified by the following enumeration.

<serialization stream declarations>+= [<-D->]
typedef enum {
    R_pstream_any_format,
    R_pstream_ascii_format,
    R_pstream_binary_format,
    R_pstream_xdr_format
} R_pstream_format_t;

Defines R_pstream_any_format, R_pstream_ascii_format, R_pstream_binary_format, R_pstream_format_t, R_pstream_xdr_format.
An explicit format must be provided for output. For input,
R_pstream_any_format
can be used to indicate that any format is
acceptable; if an explicit format is provided on input then an error
is raised if the actual stream format does not match the
specification.
The second higher level interface is for connections:
<serialization stream declarations>+= [<-D->]
void R_InitConnOutPStream(R_outpstream_t stream, Rconnection con,
                          R_pstream_format_t type, int version,
                          SEXP (*phook)(SEXP, SEXP), SEXP pdata);
void R_InitConnInPStream(R_inpstream_t stream, Rconnection con,
                         R_pstream_format_t type,
                         SEXP (*phook)(SEXP, SEXP), SEXP pdata);

Defines R_InitConnInPStream, R_InitConnOutPStream.
The connection must be open in the appropriate direction, and must be binary for binary streams.
The hook mechanism is analogous to the refhook
mechanism in the R
interface. For output, phook
is called with the reference object
and pdata
as its arguments, and should return R_NilValue
or an
STRSXP
of length at least one. For input, an STRSXP
is
supplied and the appropriate object should be returned.
A lower level initialization mechanism can be used to build higher level ones. The lower level mechanism requires two routines, one for handling single characters, used mostly for ascii streams and for writing header information, and one for handling blocks of bytes.
<serialization stream declarations>+= [<-D->]
void R_InitInPStream(R_inpstream_t stream, R_pstream_data_t data,
                     R_pstream_format_t type,
                     int (*inchar)(R_inpstream_t),
                     void (*inbytes)(R_inpstream_t, void *, int),
                     SEXP (*phook)(SEXP, SEXP), SEXP pdata);
void R_InitOutPStream(R_outpstream_t stream, R_pstream_data_t data,
                      R_pstream_format_t type, int version,
                      void (*outchar)(R_outpstream_t, int),
                      void (*outbytes)(R_outpstream_t, void *, int),
                      SEXP (*phook)(SEXP, SEXP), SEXP pdata);

Defines R_InitInPStream, R_InitOutPStream.
Once a stream is initialized, it is read and written by
<serialization stream declarations>+= [<-D]
void R_Serialize(SEXP s, R_outpstream_t ops);
SEXP R_Unserialize(R_inpstream_t ips);

Defines R_Serialize, R_Unserialize.
Some of these examples use an assert function that can be defined as
<assert function>=
assert <- function(expr)
    if (! expr)
        stop(paste("assertion failed:", deparse(substitute(expr))))

Defines assert.
<file examples>= [D->]
x <- list(1, 2, 3)
f <- file("sertmp", open = "wb")
serialize(x, f)
close(f)
f <- file("sertmp", open = "rb")
y <- unserialize(f)
close(f)
assert(identical(x, y))
By using the right magic number header, we can create a saved
work space that load
can read:
<file examples>+= [<-D->]
f <- file("sertmp", "wb")
writeChar("RDX2\n", f, eos = NULL)
serialize(list(x = 1, y = "2"), f)
close(f)
load("sertmp")
assert(x == 1 && y == 2)
Similarly, we can read a file written by save
with version = 2
:
<file examples>+= [<-D]
x <- list(1, 2, 3)
save("x", file = "sertmp", version = 2)
f <- file("sertmp", "rb")
readChar(f, 5)
y <- unserialize(f)
close(f)
assert(identical(x, y$x))
With NULL as the connection argument, serialize will serialize the object to a string. The string is likely to contain embedded null characters unless ascii = TRUE is specified. unserialize can handle this properly, but since other aspects of R can't, it might be worth considering an alternate form of return value for binary serializations to memory.
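For instance, a small sketch of the ascii = TRUE workaround (variable names here are only illustrative):

s <- serialize(list(1, 2, 3), NULL, ascii = TRUE)  # result contains no embedded nulls
y <- unserialize(s)
assert(identical(y, list(1, 2, 3)))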
<string examples>= [D->]
x <- list(1, 2, 3)
y <- unserialize(serialize(list(1, 2, 3), NULL))
assert(identical(x, y))
Sharing of environments is preserved within a serialization, but identity is not preserved by serialization:
<string examples>+= [<-D->]
e1 <- new.env()
e2 <- new.env()
y <- unserialize(serialize(list(e1, e2, e1, e2), NULL))
assert(identical(y[[1]], y[[3]]))
assert(! identical(y[[1]], e1))
We can use the refhook
mechanism to attempt to preserve identity
as well (but just in this artificial setting where we are saving from
and loading into the same process---in general this is of course
impossible):
<string examples>+= [<-D]
outhook <- function(e) {
    if (identical(e, e1)) "e1"
    else if (identical(e, e2)) "e2"
    else NULL
}
inhook <- function(n) get(n)
y <- unserialize(serialize(list(e1, e2, e1, e2), NULL, refhook = outhook),
                 refhook = inhook)
assert(identical(y[[1]], e1))
A simple form of external storage is the GNU dbm (gdbm) library. This library implements a simple key/value data base. Unlike the original dbm library, GNU dbm does not limit the size of keys or values.
The interface is quite inefficient since it uses a fresh connection
for each operation, but this should be adequate for simple
illustrative purposes. The interface, available as the
gdbm
package, is:
<gdbm interface usage>=
gdbmNew(name)
gdbmInsert(name, key, value)
gdbmFetch(name, key)
gdbmExists(name, key)
gdbmDelete(name, key)
gdbmList(name)

Defines gdbmDelete, gdbmExists, gdbmFetch, gdbmInsert, gdbmList, gdbmNew.
<gdbm examples>=
gdbmNew("mydb")
gdbmInsert("mydb", "bob", "dog")
gdbmInsert("mydb", "fred", "cat")
gdbmList("mydb")
gdbmExists("mydb", "bob")
gdbmExists("mydb", "joe")
gdbmFetch("mydb", "bob")
gdbmDelete("mydb", "bob")
gdbmExists("mydb", "bob")
Implementing this idea of loading code lazily from a data base would require surgery on envir.c. But we can partially simulate it by replacing closures in base by promises that load the closure from a data base. To measure the effect, we start with a regular R session and look at the memory usage:
<R session>+= [<-D->]
> gc()
         used (Mb) gc trigger (Mb)
Ncells 196188  5.3     407500 10.9
Vcells  37757  0.3     786432  6.0
To start, we need to store the closures in base in a data base. Since
this simple approach cannot deal with shared environments, only the
closures with .BaseNamespaceEnv
as their environment are stored.
<storing base in a gdbm data base>= [D->]
# create the data base
gdbmNew("base")

# fill it in
for (name in ls(NULL, all = TRUE)) {
    val <- get(name, env = NULL)
    if (typeof(val) == "closure" &&
        identical(environment(val), .BaseNamespaceEnv))
        gdbmInsert("base", name, serialize(val, NULL))
}

# check it
for (name in ls(NULL, all = TRUE)) {
    val <- get(name, env = NULL)
    if (typeof(val) == "closure" &&
        identical(environment(val), .BaseNamespaceEnv)) {
        if (! gdbmExists("base", name) ||
            ! identical(val, unserialize(gdbmFetch("base", name))))
            stop(name)
    }
}
Now we can replace all closures in base by promises that load them as needed:
<storing base in a gdbm data base>+= [<-D]
wrap <- function(name) {
    name <- name  # need to force evaluation!
    delay(unserialize(gdbmFetch("base", name)), env = environment())
}
for (i in gdbmList("base"))
    assign(i, wrap(i), env = NULL)
To see the effect, we can again run a gc
:
<R session>+= [<-D->]
> gc()
        used (Mb) gc trigger (Mb)
Ncells 41995  1.2     350000  9.4
Vcells 18345  0.2     786432  6.0
Memory usage has dropped from 5.6Mb to 1.4Mb. The promises do take up
some space, but even that could be eliminated by making a modification
in envir.c
.
Using a data base for persistent storage of R code seems like an idea worth exploring in more depth. GDBM is one option for the data base. GDBM ports are available for almost all, if not all, platforms where R runs, including Windows and classic Mac OS, so this may be a good default choice. Other data bases may work just as well and may in some cases be more suitable, so allowing a mechanism for choosing the data base is probably a good idea.
Most closures in base have .BaseNamespaceEnv as their environment, but a few do not. The simple approach of the previous section would fail if two closures shared a non-global environment, since separate serializations would not preserve that sharing. The refhook mechanism can be used to overcome this problem. This section provides a simple illustration of how this can be done. The code presented here is available as a package, shelf. The name is taken from a similar facility available in Python (though Python's facility does not try to preserve sharing across entries, just within entries, if I understand it correctly).
We need a couple of utility functions for working with environments. The first, envlist, takes an environment and returns a named list of the variable bindings it contains.

<shelf utilities>= (U->) [D->]
envlist <- function(e) {
    names <- ls(e, all = TRUE)
    list <- lapply(names, get, env = e, inherits = FALSE)
    names(list) <- names
    list
}

Defines envlist.
The second is sort of an inverse---given a named list and an environment, it adds the contents of the list to the environment.
<shelf utilities>+= (U->) [<-D->]
listIntoEnv <- function(list, e) {
    names <- names(list)
    for (i in seq(along = names))
        assign(names[i], list[[i]], env = e)
}

Defines listIntoEnv.
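A quick round trip shows how the two fit together (a throwaway illustration, not part of the shelf code proper):

e <- new.env()
listIntoEnv(list(a = 1, b = 2), e)
envlist(e)   # a named list with components "a" and "b"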
Next, we need a means of creating unique names for a set of
environments---a little data base of sorts. Given an environment, we
need to be able to ask for the name of the environment if it is
already in the data base. If it is not, we need to insert it and
generate a name for it. The generated names are of the form
env::<index>
. (The getenv
function is not actually needed.)
<shelf utilities>+= (U->) [<-D->]
envtable <- function() {
    idx <- 0
    envs <- NULL
    enames <- character(0)
    find <- function(v, keys, vals)
        for (i in seq(along = keys))
            if (identical(v, keys[[i]])) return(vals[i])
    getname <- function(e) find(e, envs, enames)
    getenv <- function(n) find(n, enames, envs)
    insert <- function(e) {
        idx <<- idx + 1
        name <- paste("env", idx, sep = "::")
        envs <<- c(e, envs)
        enames <<- c(name, enames)
        name
    }
    list(insert = insert, getenv = getenv, getname = getname)
}

Defines envtable.
<R session>+= [<-D->]
> et <- envtable()
> e <- new.env()
> et$getname(e)
> et$insert(e)
[1] "env::1"
> et$getname(e)
[1] "env::1"
Finally, we need simple data base connections: a writer connection that provides an insert method,
<shelf utilities>+= (U->) [<-D->]
makeGdbmWriter <- function(file) {
    force(file)
    gdbmNew(file)
    list(insert = function(name, value) gdbmInsert(file, name, value))
}

Defines makeGdbmWriter.
and a reader connection that provides list
, fetch
, and
exists
methods:
<shelf utilities>+= (U->) [<-D]
makeGdbmReader <- function(file) {
    force(file)
    list(list = function() gdbmList(file),
         fetch = function(name) gdbmFetch(file, name),
         exists = function(name) gdbmExists(file, name))
}

Defines makeGdbmReader.
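As a small sketch of how these connections might be used (the data base name tmpdb is arbitrary):

w <- makeGdbmWriter("tmpdb")
w$insert("key1", serialize("some value", NULL))   # store a serialized value
r <- makeGdbmReader("tmpdb")
r$list()                        # "key1"
r$exists("key1")                # TRUE
unserialize(r$fetch("key1"))    # "some value"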
Now we can write the code for creating a shelf. The value of a variable x is stored under the key var::x, and environments referenced by values are stored under the generated environment keys with an env:: prefix.
<creating a shelf>= (U->)
makeShelfWriter <- function(db, ascii = TRUE) {
    if (is.character(db))
        db <- makeGdbmWriter(db)
    table <- envtable()
    ser <- function(val)
        serialize(val, NULL, ascii = ascii, refhook = envhook)
    envhook <- function(e) {
        if (is.environment(e)) {
            name <- table$getname(e)
            if (is.null(name)) {
                name <- table$insert(e)
                data <- list(bindings = envlist(e), enclos = parent.env(e))
                db$insert(name, ser(data))
            }
            name
        }
    }
    putvar <- function(name, val, prefix = "var", sep = "::") {
        key <- paste("var", name, sep = "::")
        db$insert(key, ser(val))
    }
    list(putvar = putvar)
}

makeShelf <- function(list, db, ascii = TRUE) {
    names <- names(list)
    if (length(names) != length(list))
        stop("must provide a named list of values")
    s <- makeShelfWriter(db, ascii)
    for (i in seq(along = names))
        s$putvar(names[i], list[[i]])
}

Defines makeShelf, makeShelfWriter.
To get a listing of the variables in a shelf, we can use
<listing a shelf>= (U->)
listShelf <- function(db) {
    if (is.character(db))
        db <- makeGdbmReader(db)
    prefpat <- "^var::"
    sub(prefpat, "", grep(prefpat, db$list(), value = TRUE))
}

Defines listShelf.
We need a way of retrieving values from a shelf that ensures that
sharing of environments is maintained for values retrieved separately.
We do need some connection between retrievals to allow this, and that
connection is provided by a shelf connection object, which is created
and returned by openShelf
. Sharing is preserved for values
retrieved from the same shelf connection.
<opening shelf>= (U->)
openShelf <- function(db) {
    if (is.character(db))
        db <- makeGdbmReader(db)
    envenv <- new.env(hash = TRUE)
    varkey <- function(name) paste("var", name, sep = "::")
    fetch <- function(name)
        unserialize(db$fetch(name), refhook = envhook)
    envhook <- function(n) {
        if (exists(n, env = envenv, inherits = FALSE))
            get(n, env = envenv, inherits = FALSE)
        else {
            e <- new.env(hash = TRUE)
            assign(n, e, env = envenv) # MUST do this immediately
            data <- fetch(n)
            parent.env(e) <- data$enclos
            listIntoEnv(data$bindings, e)
            e
        }
    }
    list(getvar = function(name) fetch(varkey(name)),
         exists = function(name) db$exists(varkey(name)),
         list = function() listShelf(db))
}

Defines openShelf.
As a side note, having parent.env<-
available at the R level seems
like a really bad idea because of the potential for real serious
mischief, like clobbering the search list and totally confusing the
internal global cache mechanism. But the facility it provides is
essential in this case since the environment must be created
and registered before its contents and parent are unserialized so that
circular references to the environment are handled properly.
Currently with parent.env<-
available this can be handled in pure
R code. But unless we find a good way of preventing inadvertent use,
it would probably be good to get rid of this at the R level, and thus
require a little bit of C code to implement this stuff.
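As a minimal sketch of why the immediate registration matters, consider an environment that contains a reference to itself (the data base name circdb is just for illustration); registering the new environment in envhook before unserializing its contents is what lets the self-reference resolve to the environment being rebuilt:

e <- new.env()
assign("self", e, env = e)                     # e contains itself
makeShelf(list(e = e), "circdb")
s <- openShelf("circdb")
e2 <- s$getvar("e")
assert(identical(get("self", env = e2), e2))   # the cycle is reconstructed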
Here are some examples. First create some environments with some ordinary variables and some references to each other, and place these in a shelf:
<simple shelf example>=
e1 <- new.env()
e2 <- new.env(parent = e1)
listIntoEnv(list(x = 2, y = 3, v = 3, ee = e2), e1)
listIntoEnv(list(x = e1, y = e2), e2)
makeShelf(list(x = e1, y = e1, z = 3), "mydb")
The listShelf
and gdbmList
functions show the variables in the
shelf and the actual keys in the data base, respectively:
<R session>+= [<-D->]
> listShelf("mydb")
[1] "z" "y" "x"
> gdbmList("mydb")
[1] "env::2" "env::1" "var::z" "var::y" "var::x"
Now we can open the shelf and examine its contents. The pointer values in the environments show that sharing is being handled properly.
<R session>+= [<-D]
> s <- openShelf("mydb")
> s$list()
[1] "z" "y" "x"
> s$getvar("x")
<environment: 0x8f89ba4>
> s$getvar("y")
<environment: 0x8f89ba4>
> ls(s$getvar("x"))
[1] "ee" "v"  "x"  "y"
> get("ee", s$getvar("x"))
<environment: 0x8f89048>
> parent.env(get("ee", s$getvar("y")))
<environment: 0x8f89ba4>
Finally, we can combine the promise idea used earlier for closures in
base to produce lazy load and attach functions for a shelf. The
loadShelf function adds bindings for all variables in a shelf to
the specified environment. The bindings are promises that load the
values on demand using a common connection created when the shelf is
loaded.
<loading a shelf>= (U->)
loadShelf <- function(db, envir = parent.frame()) {
    s <- openShelf(db)
    wrap <- function(name) {
        name <- name  # need to force evaluation!
        delay(s$getvar(name), env = environment())
    }
    for (n in listShelf(db))
        assign(n, wrap(n), envir = envir)
}

Defines loadShelf.
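Continuing the earlier example, a sketch of lazily loading the shelf mydb into a fresh environment:

e <- new.env()
loadShelf("mydb", envir = e)
ls(e)                 # the shelf variables, bound as unevaluated promises
get("z", env = e)     # forces the promise and fetches the value from the data base: 3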
attachShelf
creates an environment on the search list and fills it
using loadShelf
.
<attaching a shelf>= (U->)
attachShelf <- function(db, pos = 2, name) {
    if (missing(name)) {
        if (is.character(db))
            name <- paste("shelf", db, sep = ":")
        else name <- "shelf:<no name>"
    }
    env <- attach(NULL, pos = pos, name = name)
    on.exit(detach(pos))
    loadShelf(db, env)
    on.exit()
}

Defines attachShelf.
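And a corresponding sketch for attaching the same shelf:

attachShelf("mydb")
search()[2]        # "shelf:mydb"
z                  # forces a fetch from the data base: 3
detach(pos = 2)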
Such lazy loading from a shelf might eventually provide an alternative to the saved images created by installing packages with the --save option.
This simple shelf system does not allow values to be deleted or new or changed values to be inserted. Something along these lines would be useful of course. It does however raise a number of complicating issues that would need to be addressed. Some are quite standard, such as dealing with managing coherency for different shelf connections within a given process or even from separate processes. Others are more specific to this approach, such as garbage collection. If variables are deleted then there may be environment entries that are no longer needed. Some mechanism for removing these would be needed.
<shelf.R>=
require(serialize)
require(gdbm)
<shelf utilities>
<creating a shelf>
<listing a shelf>
<opening shelf>
<loading a shelf>
<attaching a shelf>
Serialization can be used with sockets or with the Rpvm library to
allow code and data to be transferred between processes or machines
for distributed computing. One issue that arises is how to handle
environments. The
Obliq system uses a notion of
distributed scope: environments remain where they were created.
This seems to provide for a nice high level model for distributed
computation. It should be possible to use the refhook
interface
together with the active value ideas recently added to R to implement
(a part of) this sort of thing, but I have not had a chance to try
this yet.
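For example, here is a rough sketch of shipping an object between two R processes over a socket (the port number and blocking setup are illustrative only, and error handling is omitted):

# in the sending process
con <- socketConnection(port = 6011, server = TRUE, blocking = TRUE, open = "wb")
serialize(list(x = 1:10, f = function(x) x + 1), con)
close(con)

# in the receiving process
con <- socketConnection("localhost", port = 6011, blocking = TRUE, open = "rb")
obj <- unserialize(con)
close(con)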
This section gives a bit more information on the serialization algorithm.
The algorithm uses a single pass over the node tree to be serialized. Sharing of reference objects is preserved, but sharing among other objects is ignored. The first time a reference object is encountered it is entered in a hash table; the value stored with the object is the index in the sequence of reference objects (1 for first reference object, 2 for second, etc.). When an object is seen again, i.e. it is already in the hash table, a reference marker along with the index is written out. The unserialize code does not know in advance how many reference objects it will see, so it starts with an initial array of some reasonable size and doubles it each time space runs out. Reference objects are entered as they are encountered.
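A small R sketch (a toy illustration of the bookkeeping, not the internal C code) shows how first occurrences of reference objects get indices while repeats become references:

refTrace <- function(objs) {
    seen <- list()
    out <- character(0)
    for (o in objs) {
        hit <- 0
        for (i in seq(along = seen))
            if (identical(o, seen[[i]])) { hit <- i; break }
        if (hit == 0) {
            seen[[length(seen) + 1]] <- o      # first sighting: record it, index = position
            out <- c(out, "new")
        }
        else out <- c(out, paste("ref", hit))  # later sighting: emit a reference marker
    }
    out
}
e1 <- new.env(); e2 <- new.env()
refTrace(list(e1, e2, e1, e2))   # "new" "new" "ref 1" "ref 2"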
This means the serialize and unserialize code needs to agree on what is a reference object. Making a non-reference object into a reference object requires a version change in the format. An alternate design would be to precede each reference object with a marker that says the next thing is a possibly shared object and needs to be entered into the reference table.
Adding new SXP
types is easy, whether they are reference objects
or not. The unserialize code will signal an error if it sees a type
value it does not know. It is of course better to increment the
serialization format number when a new SXP
is added, but if that
SXP
is unlikely to be saved by users then it may be simpler to keep
the version number and let the error handling code deal with it.
The output format for dotted pairs writes the ATTRIB
value first
rather than last. This allows CDR
s to be processed by iterative
tail calls to avoid recursion stack overflows when processing long
lists. The writing code does take advantage of this, but the reading
code does not. It hasn't been a big issue so far---the only case
where it has come up is in saving a large unhashed environment where
saving succeeds but loading fails because the PROTECT
stack
overflows. With the ability to create hashed environments at the user
level this is likely to be even less of an issue now. But if we do
need to deal with it we can do so without a change in the
serialization format---just rewrite ReadItem
to pass the place to
store the CDR
it reads. (It's a bit of a pain to do, which is why it is being deferred until it is clearly needed.)
CHARSXP values are now handled in a way that preserves both embedded null
characters and NA_STRING
values.
The XDR
save format now only uses the in-memory XDR
facility
for converting integers and doubles to a portable format.
The output format packs the type flag and other flags into a single integer. This produces more compact output for code; it has little effect on data.
Environments recognized as package or name space environments are not
saved directly. Instead, a STRSXP
is saved that is then used to
attempt to find the package/name space when unserialized. The exact
mechanism for choosing the name and finding the package/name space
from the name still has to be developed, but the serialization format
should be able to accommodate any reasonable mechanism.
A mechanism is provided to allow special handling of non-system
reference objects (all weak references and external pointers, and all
environments other than package environments, name space environments,
and the global environment). The hook function consists of a function
pointer and a data value. The serialization function pointer is
called with the reference object and the data value as arguments. It
should return R_NilValue
for standard handling and an STRSXP
for special handling. If an STRSXP is returned, then a special handling mark is written followed by the strings in the STRSXP
(attributes are ignored). On unserializing, any specially marked
entry causes a call to the hook function with the reconstructed
STRSXP
and data value as arguments. This should return the value
to use for the reference object. A reasonable convention on how to
use this mechanism is needed, but again the format should be compatible
with any reasonable convention.
Eventually it may be useful to use these hooks to allow objects with a class to have a class-specific serialization mechanism. The serialization format should support this. It is trickier than in Java and other reference based languages where creation and initialization can be separated--we don't really have that option at the R level.