OpenMSX Git Changelog:
* [5/5] Tweak serialization of Clock EmuTime EmuDuration FixedPoint
The data types Clock, EmuTime, EmuDuration and FixedPoint internally contain
only a single (32 or 64 bit) integer. Serializing them to memory can therefore
be done with a simple memcpy(). This patch makes the serialization framework
aware of this, which allows the serialization of these types to be optimized
further (e.g. by grouping them together with other simple data types).
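As a rough illustration of the idea only (this is a hedged sketch, not openMSX's actual code; the SerializeAsMemcpy trait and the EmuTimeLike type are made up), a type that wraps a single integer could opt in to memcpy-based serialization via a compile-time trait:

#include <cstdint>
#include <type_traits>

// Hypothetical opt-in trait: "objects of this type may be serialized with
// a plain memcpy".
template<typename T> struct SerializeAsMemcpy : std::false_type {};

// An EmuTime-like wrapper whose only state is a single 64-bit counter.
class EmuTimeLike {
public:
    explicit EmuTimeLike(uint64_t t = 0) : time(t) {}
private:
    uint64_t time; // a memcpy of this object round-trips it exactly
};
template<> struct SerializeAsMemcpy<EmuTimeLike> : std::true_type {};

// A memory archive could then branch at compile time, e.g.
//   if constexpr (SerializeAsMemcpy<T>::value) { /* group with other simple fields */ }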
--
At the end of this patch series, the generated code looks a lot better. The
overhead of the reverse system has measurably decreased, but only by a small
amount (~1%). The reason is that by far the most time spent creating an
in-memory savestate goes into compressing the (V)RAM, not into storing the
(several hundred) individual small fields. I also have ideas to improve that
part (storing the RAMs), but I need more time to actually implement them.
* [4/5] Improve serialization of arrays
After the changes in this series so far I looked at the quality of the
generated code. There were still a few things that could be tweaked so that the
reserve-memory-only-once optimization can trigger more often. That's done in
this and in the following patch.
Before this patch, serializing an array (e.g. an array of 4 ints) was done in a
small loop. It is more efficient, however, to do a single memcpy(). This also
allows the serialization of the array to be merged with the serialization of
other simple elements.
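A minimal sketch of that idea (illustrative names and buffer type, not openMSX's actual code): copy the whole array in one go, so the buffer is checked/grown only once and the copy can be merged with neighbouring simple fields.

#include <cstdint>
#include <cstring>
#include <vector>

// Append a fixed-size int array to a byte buffer with a single memcpy,
// instead of pushing its 4 elements one by one in a loop.
void saveIntArray(std::vector<uint8_t>& out, const int (&a)[4])
{
    auto pos = out.size();
    out.resize(pos + sizeof(a));                  // one grow/bounds check
    std::memcpy(out.data() + pos, a, sizeof(a));  // one bulk copy
}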
* [3/5] Use the new variadic serialize(...) stuff
This is the most boring patch in the series, but also one of the more
important ones. It mechanically replaces sequences of serialize() calls with a
single variadic serialize() call.
* [2/5] Added variadic serialize(...) functions
Let's revisit the example I gave in the previous commit message:
template<typename Archive>
void MyClass::serialize(Archive& ar, unsigned version) {
    ar.serialize("foo", foo);
    ar.serialize("bar", bar);
}
The optimization we want has to be done across the two serialize() calls. That's
hard (my past attempts to do this failed). But if we use C++11 variadic
templates we can rewrite the above code as
template<typename Archive>
void MyClass::serialize(Archive& ar, unsigned version) {
    ar.serialize("foo", foo,
                 "bar", bar);
}
More generally, this new version of serialize() accepts not just one tag-variable
pair, but an arbitrary number of pairs. And now the implementation of
serialize() has a full view of what needs to be done.
In the case of saving/loading to/from XML, the variadic serialize() function does
_exactly_ the same as repeated calls with single pairs, so for those nothing
changes. For saving to memory we do the following:
- We walk over the list of pairs (walk over all arguments in steps of two).
- If the current pair can be 'optimized' (means serialized with a memcpy) we
don't process it just yet, but only remember it in a tuple of
still-to-be-processed elements.
- If the current pair cannot be optimized we handle it immediately in exactly
the same way as before.
- When we've walked over all parameters, we process the tuple of collected
  elements with a single call to insert_tuple_ptr() (see previous patch).
  A code sketch of this walk follows after the remarks below.
Some remarks:
- The order in which the elements are stored has changed compared to before
this patch. So loading from memory has to be changed accordingly.
- While walking over the elements we build a tuple of still-to-be-processed
  elements. This bookkeeping is _completely_ optimized away by the compiler, so
  after optimization it's as if the element pairs were immediately present in
  a suitable order.
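The following is a minimal sketch of that walk under some stated assumptions: MemOutputArchive, serializeComplex() and the use of std::is_trivially_copyable as the "can be memcpy'd" test are illustrative stand-ins rather than openMSX's real API, and C++17 conveniences (if constexpr, fold expressions, std::apply) are used for brevity where the original code was limited to C++11.

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <tuple>
#include <type_traits>
#include <utility>
#include <vector>

struct MemOutputArchive {
    std::vector<uint8_t> buf;

    // Variadic entry point: an arbitrary number of (tag, value) pairs.
    template<typename... Args>
    void serialize(Args&&... args) {
        serializeHelper(std::tuple<>{}, std::forward<Args>(args)...);
    }

private:
    // Fallback for elements that cannot be handled with a memcpy
    // (stub; the real code dispatches to the element's own serialize()).
    template<typename T>
    void serializeComplex(const char* /*tag*/, const T& /*t*/) {}

    // All pairs processed: flush the collected elements with a single
    // check/grow of the buffer (this is the job of insert_tuple_ptr()
    // from patch 1/5, inlined here to keep the sketch self-contained).
    template<typename... Ts>
    void serializeHelper(const std::tuple<Ts*...>& collected) {
        auto pos = buf.size();
        buf.resize(pos + (std::size_t{0} + ... + sizeof(Ts)));
        uint8_t* dst = buf.data() + pos;
        std::apply([&](Ts*... ps) {
            ((std::memcpy(dst, ps, sizeof(Ts)), dst += sizeof(Ts)), ...);
        }, collected);
    }

    // Handle one (tag, value) pair, then recurse on the remaining pairs.
    template<typename Tup, typename T, typename... Rest>
    void serializeHelper(const Tup& collected, const char* tag, const T& t,
                         Rest&&... rest) {
        if constexpr (std::is_trivially_copyable_v<T>) {
            // 'Optimizable': don't store it yet, only remember a pointer.
            serializeHelper(std::tuple_cat(collected, std::make_tuple(&t)),
                            std::forward<Rest>(rest)...);
        } else {
            // Not optimizable: handle it immediately, as before.
            serializeComplex(tag, t);
            serializeHelper(collected, std::forward<Rest>(rest)...);
        }
    }
};

With this, ar.serialize("foo", foo, "bar", bar) from the example above collects pointers to 'foo' and 'bar' during the walk and stores both with a single buffer check at the end.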
* [1/5] Extend OutputBuffer with insert_tuple_ptr()
First some background info:
Looking back, I'm quite happy with how the serialization code in openMSX turned
out. The user has to write one simple method like:
template<typename Archive>
void MyClass::serialize(Archive& ar, unsigned version) {
    ar.serialize("foo", foo);
    ar.serialize("bar", bar);
}
And from this single templatized method 4 instances are generated: 2 to
save/load to/from an XML file and 2 to save/load to/from a memory buffer. The
in-memory variants are _much_ faster than the XML variants (e.g. because they
don't need to bother with tags or with backwards compatibility) and are used to
take regular snapshots for the reverse system.
Now concentrating on the in-memory variants. When looking at the generated
code, I'm very happy with the load-from-memory version: that code is close to
optimal (e.g. to restore an integer value, it loads 4 bytes from the buffer,
stores them into the member and increments the pointer; only 3 x86
instructions, hard to do much better).
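For illustration only (the function name and signature are made up, not the actual generated code), the fast path for a simple member boils down to something like:

#include <cstdint>
#include <cstring>

// Copy sizeof(T) bytes from the current read position into the member and
// advance the position; for an int this is essentially a load, a store and
// a pointer increment.
template<typename T>
void loadSimple(const uint8_t*& pos, T& member)
{
    std::memcpy(&member, pos, sizeof(T));
    pos += sizeof(T);
}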
The generated code to save-to-memory is also reasonably good, but there's one
thing that has been bothering me for a long time. Let's explain with an
example. For the serialize() function above the generated code would do more or
less the following:
- ar.serialize("foo", foo)
-> make sure there's room for 'sizeof(foo)' extra bytes in the output,
so do some pointer comparisons and possibly grow the buffer
-> copy 'foo' to the output buffer
-> adjust the output pointer
- ar.serialize("foo", bar)
-> and exactly the same as above for the 'bar' member
A more efficient implementation would do:
-> make sure there's room for 'sizeof(foo) + sizeof(bar)' extra bytes
-> copy both 'foo' and 'bar' to the output
(if the 'foo' and 'bar' members happen to be adjacent in memory (and
suitably aligned) a smart compiler could even coalesce both load/store
pairs into a single pair)
-> adjust the output pointer once
Checking whether there's enough room is not very expensive in absolute terms.
But relative to doing the actual data move it is. By doing the above
optimization, saving to memory can become roughly twice as fast.
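In code, the more efficient version would look roughly like this (a sketch with made-up names, not openMSX's actual buffer class):

#include <cstdint>
#include <cstring>
#include <vector>

// Save 'foo' and 'bar' with a single check/grow of the buffer; a smart
// compiler may coalesce the two adjacent copies into one wider store.
void saveFooBar(std::vector<uint8_t>& out, int foo, int bar)
{
    auto pos = out.size();
    out.resize(pos + sizeof(foo) + sizeof(bar));  // reserve room once
    std::memcpy(out.data() + pos, &foo, sizeof(foo));
    std::memcpy(out.data() + pos + sizeof(foo), &bar, sizeof(bar));
}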
In the past I already made several (failed) attempts to teach the
serialization code to perform the above optimization. In this patch series I
finally, at least partly, succeed. The difference is that this time I'm using
variadic templates, a C++11 feature that wasn't available yet when I originally
wrote this code.
--
This patch only does some preparatory work. It adds a method insert_tuple_ptr()
to the OutputBuffer class (the class representing the buffer that in-memory
savestates are written to). This method makes it possible to put a bunch of
elements into the buffer with a single call, so that the OutputBuffer only
needs to check/grow the available space once. The new method is not yet used
in this patch.
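As a rough illustration only (this is a simplified stand-in, not openMSX's actual OutputBuffer, and it uses C++17 features for brevity), insert_tuple_ptr() takes a tuple of element pointers and grows the buffer at most once before copying all of them:

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <tuple>
#include <vector>

class OutputBuffer {
public:
    // Insert all elements pointed to by the tuple, checking/growing the
    // available space only once for the whole group.
    template<typename... Ts>
    void insert_tuple_ptr(const std::tuple<Ts*...>& tup) {
        auto pos = buf.size();
        buf.resize(pos + (std::size_t{0} + ... + sizeof(Ts))); // grow once
        uint8_t* dst = buf.data() + pos;
        std::apply([&](Ts*... ps) {
            ((std::memcpy(dst, ps, sizeof(Ts)), dst += sizeof(Ts)), ...);
        }, tup);
    }

private:
    std::vector<uint8_t> buf;
};

// Usage: store 'foo' and 'bar' with a single size check, e.g.
//   int foo = 42, bar = 7;
//   OutputBuffer out;
//   out.insert_tuple_ptr(std::make_tuple(&foo, &bar));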
The patch also renames some 'byte' types to the standard 'uint8_t' type and
cleans up some #includes.
Download: OpenMSX Git (2015/09/25) x86
Download: OpenMSX Git (2015/09/25) x64
Source: Here