Basic ideas of version tolerant serialization in C++

Consider the following scenario. An old version of a C++ application defines a structure A:

struct A
{
    double a;
    int b;
    std::string c;
};

An instance of A was serialized to a file in a binary format, and then the application was updated to a new version.

In the new version, structure A was modified by adding fields d and e and deleting field a:

struct A
{
    int b;
    std::vector<int> d;
    bool e;
    std::string c;
};

The new version of the application now needs to deserialize an instance of its new structure A from the file that contains the old version of A.

To support this scenario we need:

Metadata describing both the old and the new structure A, accessible in the new version of the application.
A serialization engine that maps the fields of the old and new structures, skips deleted fields, and initializes added fields to their default values.
A type identification mechanism that works with types at run time, together with a type representation that can be serialized into a stream. This is the first thing we will start with.

Type representation

The standard string representation of a type (for example, the one returned by typeid(T).name()) is not an option because it is implementation-defined. I did not find a better solution than listing all the types we serialize in a global std::variant (GV) and declaring it in every version of the application. A new version of the application may only extend GV by appending new types; it must not change or reorder the existing ones. As a result, the index of a type in GV does not vary from one application version to another and can serve as a serializable type identifier, usable at both compile time and run time.

The old version of the application working with structure A defines the following GV:

using GV = std::variant<int, double, std::string>;

and the new version defines the following:

using GV = std::variant<int, double, std::string, std::vector<int>, bool>;

The index of a type in GV can be found with code like this:

template <class T, class U, std::size_t... index>
static constexpr auto find_variant_type_impl(std::index_sequence<index...>) noexcept
{
    using NoRefT = std::remove_reference_t<T>;
    static_assert((std::size_t() + ... + std::is_same_v<NoRefT, std::variant_alternative_t<index, U>>) == 1, "There is no single exact match");
    return std::max({ (std::is_same_v<NoRefT, std::variant_alternative_t<index, U>> ? index : 0)... });
}
 
template <class T, class U>
static constexpr std::size_t find_variant_type_v = find_variant_type_impl<T, U>(std::make_index_sequence<std::variant_size_v<U>>());

Note that GV is never instantiated; it is used only for template metaprogramming, so including nontrivial data structures in it does not add any overhead.
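To illustrate why the index is stable across versions, here is a small self-contained check. It is only a sketch: index_in_variant is a simplified, hypothetical stand-in for find_variant_type_v, and the aliases OldGV and NewGV are names introduced just for this example.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>
#include <variant>
#include <vector>

// Hypothetical aliases for the two GV definitions shown above.
using OldGV = std::variant<int, double, std::string>;
using NewGV = std::variant<int, double, std::string, std::vector<int>, bool>;

// Simplified lookup: walks the alternatives of V until it finds T.
template <class T, class V, std::size_t I = 0>
constexpr std::size_t index_in_variant()
{
    static_assert(I < std::variant_size_v<V>, "T is not an alternative of V");
    if constexpr (std::is_same_v<T, std::variant_alternative_t<I, V>>)
    {
        return I;
    }
    else
    {
        return index_in_variant<T, V, I + 1>();
    }
}

// Types that exist in both versions keep their index.
static_assert(index_in_variant<double, OldGV>() == 1);
static_assert(index_in_variant<double, NewGV>() == 1);
static_assert(index_in_variant<std::string, OldGV>() == index_in_variant<std::string, NewGV>());

// Types appended in the new version get the next free indices.
static_assert(index_in_variant<std::vector<int>, NewGV>() == 3);
static_assert(index_in_variant<bool, NewGV>() == 4);
```

Because new types are only appended, an index written by the old application still names the same type when the new application reads it.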

Field mapping

To allow the serialization engine to map the old and new fields of a class (or structure) being serialized, we add the AWL_REFLECT macro to the class definition, so our structure A looks like this:

//Old version
struct A
{
    double a;
    int b;
    std::string c;

    AWL_REFLECT(a, b, c)
};

//New version
struct A
{
    int b;
    std::vector<int> d;
    bool e;
    std::string c;

    AWL_REFLECT(b, d, e, c)
};

In addition to making the class tuplizable, the AWL_REFLECT macro adds a method that returns an ordered container of the class member names (with an interface similar to std::vector<std::string>). Below is the implementation of the AWL_REFLECT macro:

#define AWL_REFLECT(...) \
    AWL_TUPLIZABLE(__VA_ARGS__) \
    static const awl::helpers::MemberList & get_member_names() \
    { \
        static const char va[] = #__VA_ARGS__; \
        static const awl::helpers::MemberList ml(va); \
        return ml; \
    }

Note that line breaks in the macro argument list are allowed, so we can use our macro like this:

...
    AWL_REFLECT(
      a,
      b,
      c)
...

and #__VA_ARGS__ will still expand into “a, b, c”, without the line breaks.
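The stringification step can be tried in isolation. The sketch below is not AWL's MemberList; split_member_names and DEMO_MEMBER_NAMES are hypothetical names used only to show how a stringified argument list could be turned into field names.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: splits a string such as "a, b, c" into member names,
// ignoring any whitespace around the commas.
inline std::vector<std::string> split_member_names(std::string_view s)
{
    std::vector<std::string> names;
    std::string current;
    for (char ch : s)
    {
        if (ch == ',')
        {
            names.push_back(current);
            current.clear();
        }
        else if (ch != ' ' && ch != '\t' && ch != '\n')
        {
            current.push_back(ch);
        }
    }
    if (!current.empty())
    {
        names.push_back(current);
    }
    return names;
}

// Hypothetical macro: stringifies its arguments, as AWL_REFLECT does with #__VA_ARGS__.
#define DEMO_MEMBER_NAMES(...) split_member_names(#__VA_ARGS__)

// The argument list is deliberately spread over several lines.
inline std::vector<std::string> demo_names()
{
    return DEMO_MEMBER_NAMES(
        a,
        b,
        c);
}
```

The helper ignores whitespace altogether, so it does not depend on exactly how the preprocessor normalizes the spacing between tokens.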

Class prototype

A class prototype is an ordered container of pairs, each consisting of a class field name (as defined by the AWL_REFLECT macro) and its type identifier, which is the index of the field type in GV.
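As a rough illustration, the prototypes of the old and new structure A could be written out by hand as follows. FieldDesc, ProtoSketch and the two factory functions are hypothetical names for this sketch; in the real library the prototypes are presumably derived from AWL_REFLECT and GV rather than typed in.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical field description: a name plus the index of its type in GV.
struct FieldDesc
{
    std::string name;
    std::size_t type;
};

// Hypothetical prototype: fields in declaration order.
using ProtoSketch = std::vector<FieldDesc>;

// With GV = std::variant<int, double, std::string, std::vector<int>, bool>
// the type indices are: int 0, double 1, std::string 2, std::vector<int> 3, bool 4.

// Old A: double a; int b; std::string c;
inline ProtoSketch old_a_prototype()
{
    return { {"a", 1}, {"b", 0}, {"c", 2} };
}

// New A: int b; std::vector<int> d; bool e; std::string c;
inline ProtoSketch new_a_prototype()
{
    return { {"b", 0}, {"d", 3}, {"e", 4}, {"c", 2} };
}
```

The old prototype has to be stored in the stream next to the data, since the new application no longer contains the old definition of A.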

Serialization engine

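The serialization engine maps the old field indices of each structure it deserializes to the new ones. For structure A, the map looks like this:

old index 0 (a) -> no new index (the field was deleted)
old index 1 (b) -> new index 0
old index 2 (c) -> new index 3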
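Such a map can be built by matching field names. The following is a minimal sketch under that assumption; make_name_map and no_index are hypothetical names and are not taken from AWL.

```cpp
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Marker for an old field that has no counterpart in the new structure.
constexpr std::size_t no_index = std::numeric_limits<std::size_t>::max();

// For every old field name, finds the position of the same name among the new fields.
inline std::vector<std::size_t> make_name_map(
    const std::vector<std::string>& old_names,
    const std::vector<std::string>& new_names)
{
    std::vector<std::size_t> map;
    map.reserve(old_names.size());
    for (const std::string& name : old_names)
    {
        std::size_t found = no_index;
        for (std::size_t i = 0; i < new_names.size(); ++i)
        {
            if (new_names[i] == name)
            {
                found = i;
                break;
            }
        }
        map.push_back(found);
    }
    return map;
}
```

For structure A, calling make_name_map({"a", "b", "c"}, {"b", "d", "e", "c"}) yields { no_index, 0, 3 }. New fields that no old field maps to (d and e here) are never read from the stream and keep their default values.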

For each structure, the engine creates an array of field readers with the following interface:

template <class Struct>
struct FieldReader
{
    virtual void ReadField(SequentialInputStream & in, Struct & val) const = 0;
};

For each GV type, it also creates a field skipper that constructs a temporary instance of the type, reads it from the stream, and destroys it:

struct FieldSkipper
{
    virtual void SkipField(SequentialInputStream & in) const = 0;
};
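One way a skipper could be implemented is sketched below. This is an assumption-laden example: it uses std::istream instead of SequentialInputStream, and the wire format (raw bytes for arithmetic types, a 32-bit length prefix for strings) as well as the names read_value, SkipperSketch, SkipperImpl and skip_demo are invented for the illustration.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <type_traits>

// Assumed wire format: arithmetic types are stored as raw bytes.
template <class T>
std::enable_if_t<std::is_arithmetic_v<T>> read_value(std::istream& in, T& val)
{
    in.read(reinterpret_cast<char*>(&val), sizeof(T));
}

// Assumed wire format: a string is a 32-bit length followed by its characters.
inline void read_value(std::istream& in, std::string& val)
{
    std::uint32_t len = 0;
    in.read(reinterpret_cast<char*>(&len), sizeof(len));
    val.resize(len);
    in.read(val.data(), len);
}

struct SkipperSketch
{
    virtual ~SkipperSketch() = default;
    virtual void SkipField(std::istream& in) const = 0;
};

template <class T>
struct SkipperImpl : SkipperSketch
{
    void SkipField(std::istream& in) const override
    {
        T fake{};             // temporary instance of the deleted field's type
        read_value(in, fake); // consume its bytes from the stream
    }                         // fake is destroyed here
};

// Writes a double followed by an int, skips the double and returns the int.
inline int skip_demo()
{
    std::stringstream s;
    const double a = 5.0;
    const int b = 3;
    s.write(reinterpret_cast<const char*>(&a), sizeof(a));
    s.write(reinterpret_cast<const char*>(&b), sizeof(b));

    SkipperImpl<double>().SkipField(s);

    int result = 0;
    read_value(s, result);
    return result;
}
```

Reading into a temporary is the simplest way to skip a variable-length value such as a string or a vector, at the cost of an allocation that is thrown away immediately.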

To deserialize an instance of a new structure, the serialization engine iterates over the old field indices (from 0 to 2 in the example with structure A) and calls either the field reader with the new field index or the field skipper with the field type index, depending on whether the old field still exists in the new structure.

Below is the source code of an experimental deserialization implementation:

template<class Stream, class Struct, class Context>
inline void ReadV(Stream & s, Struct & val, const Context & ctx)
{
    auto & new_proto = ctx.template FindNewPrototype<Struct>();
    auto & old_proto = ctx.template FindOldPrototype<Struct>();
        
    auto & readers = ctx.template FindFieldReaders<Struct>();
    auto & skippers = ctx.GetFieldSkippers();

    const std::vector<size_t> & name_map = ctx.template FindProtoMap<Struct>();

    assert(name_map.size() == old_proto.GetCount());

    for (size_t old_index = 0; old_index < name_map.size(); ++old_index)
    {
        const auto old_field = old_proto.GetField(old_index);

        const size_t new_index = name_map[old_index];

        if (new_index == Prototype::NoIndex)
        {
            if (!ctx.allowDelete)
            {
                throw FieldNotFoundException(old_field.name);
            }

            //Skip by type.
            skippers[old_field.type]->SkipField(s);
        }
        else
        {
            const auto new_field = new_proto.GetField(new_index);

            if (new_field.type != old_field.type)
            {
                throw TypeMismatchException(new_field.name, new_field.type, old_field.type);
            }

            //But read by index.
            readers[new_index]->ReadField(s, val);
        }
    }
}

Template metaprogramming techniques

Below are some examples of the template metaprogramming techniques used in the implementation of the serialization engine.

Iterating over class fields

The following code iterates over the fields of class A (the old version, with fields a, b and c) and converts an instance into a map and back:

A a1 = { 5.0, 3, "abc" };

std::map<std::string_view, std::any> map;

awl::for_each_index(a1.as_const_tuple(), [&a1, &map](auto & val, size_t index)
{
    map.emplace(a1.get_member_names()[index], val);
});

Assert::IsTrue(map.size() == 3);

A a2 = {};

awl::for_each_index(a2.as_tuple(), [&a2, &map](auto & val, size_t index)
{
    val = std::any_cast<std::remove_reference_t<decltype(val)>>(map[a2.get_member_names()[index]]);
});

Assert::IsTrue(a1 == a2);

Similar code can be used in an implementation of JSON or XML serialization.
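The helper awl::for_each_index is not shown in this post. A function with the same calling convention could be written with a fold expression, as in the following sketch; for_each_index_sketch and visit_order are hypothetical names, and the real AWL implementation may differ.

```cpp
#include <cstddef>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Calls f(element, index) for every tuple element, in order.
template <class Tuple, class Func, std::size_t... I>
void for_each_index_impl(Tuple& t, Func&& f, std::index_sequence<I...>)
{
    (f(std::get<I>(t), I), ...);
}

template <class... Args, class Func>
void for_each_index_sketch(std::tuple<Args...>& t, Func&& f)
{
    for_each_index_impl(t, f, std::index_sequence_for<Args...>{});
}

// Visits a tuple of references to three fields and records the visited indices.
inline std::vector<std::size_t> visit_order()
{
    double a = 5.0;
    int b = 3;
    std::string c = "abc";
    auto t = std::tie(a, b, c);

    std::vector<std::size_t> order;
    for_each_index_sketch(t, [&order](auto& /*val*/, std::size_t index)
    {
        order.push_back(index);
    });
    return order;
}
```

The comma fold evaluates its operands from left to right, so the fields are visited in declaration order.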

Transformation of a variant into a tuple

template <class V, template <class> class T, std::size_t... index>
inline constexpr auto transform_v2t(std::index_sequence<index...>)
{
    return std::make_tuple(T<std::variant_alternative_t<index, V>>() ...);
}

template <class V, template <class> class T>
inline constexpr auto transform_v2t()
{
    return transform_v2t<V, T>(std::make_index_sequence<std::variant_size_v<V>>());
}
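As a usage sketch, transform_v2t can build one helper object per variant alternative, which is presumably how the per-type field skippers are created. The two functions are repeated here so that the example compiles on its own; SizeOf, DemoVariant and sizes are hypothetical names.

```cpp
#include <cstddef>
#include <tuple>
#include <utility>
#include <variant>

// Copy of transform_v2t from above, so that the example is self-contained.
template <class V, template <class> class T, std::size_t... index>
constexpr auto transform_v2t(std::index_sequence<index...>)
{
    return std::make_tuple(T<std::variant_alternative_t<index, V>>() ...);
}

template <class V, template <class> class T>
constexpr auto transform_v2t()
{
    return transform_v2t<V, T>(std::make_index_sequence<std::variant_size_v<V>>());
}

// Hypothetical per-type helper that reports the size of its type.
template <class U>
struct SizeOf
{
    constexpr std::size_t value() const { return sizeof(U); }
};

using DemoVariant = std::variant<int, double, char>;

// A std::tuple<SizeOf<int>, SizeOf<double>, SizeOf<char>> built at compile time.
constexpr auto sizes = transform_v2t<DemoVariant, SizeOf>();

static_assert(std::tuple_size_v<decltype(sizes)> == 3);
static_assert(std::get<0>(sizes).value() == sizeof(int));
static_assert(std::get<2>(sizes).value() == 1);
```

The tuple element at position i corresponds to the variant alternative at index i, so a type identifier read from the stream can be used to select the matching helper.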

Transformation of a tuple into a tuple

template <template <class> class T, class Tuple, std::size_t... index>
inline constexpr auto transform_t2t(const Tuple & t, std::index_sequence<index...>)
{
    return std::make_tuple(T<std::tuple_element_t<index, Tuple>>(std::get<index>(t)) ...);
}

template <template <class> class T, class Tuple>
inline constexpr auto transform_t2t(const Tuple & t)
{
    return transform_t2t<T>(t, std::make_index_sequence<std::tuple_size_v<Tuple>>());
}

template <template <size_t index> class T, class Tuple, std::size_t... index>
inline constexpr auto transform_t2ti(const Tuple & t, std::index_sequence<index...>)
{
    return std::make_tuple(T<index>(std::get<index>(t)) ...);
}

template <template <size_t index> class T, class Tuple>
inline constexpr auto transform_t2ti(const Tuple & t)
{
    return transform_t2ti<T>(t, std::make_index_sequence<std::tuple_size_v<Tuple>>());
}

Transformation of a tuple into an array

template <typename... Args, typename Func, std::size_t... index>
inline constexpr auto tuple_to_array(const std::tuple<Args...>& t, Func&& f, std::index_sequence<index...>)
{
    return std::array{f(std::get<index>(t)) ...};
}

template <typename... Args, typename Func>
inline constexpr auto tuple_to_array(const std::tuple<Args...>& t, Func&& f)
{
    return tuple_to_array(t, f, std::index_sequence_for<Args...>{});
}

template <typename... Args, typename Func, std::size_t... index>
inline constexpr auto tuple_to_array(std::tuple<Args...>& t, Func&& f, std::index_sequence<index...>)
{
    return std::array{ f(std::get<index>(t)) ... };
}

template <typename... Args, typename Func>
inline constexpr auto tuple_to_array(std::tuple<Args...>& t, Func&& f)
{
    return tuple_to_array(t, f, std::index_sequence_for<Args...>{});
}

template <class I, typename... Args, std::size_t... index>
inline constexpr auto tuple_cast(const std::tuple<Args...>& t, std::index_sequence<index...>)
{
    return std::array{ static_cast<const I *>(&std::get<index>(t)) ... };
}

template <class I, typename... Args>
inline constexpr auto tuple_cast(const std::tuple<Args...>& t)
{
    return tuple_cast<I>(t, std::index_sequence_for<Args...>{});
}

template <class I, typename... Args, std::size_t... index>
inline constexpr auto tuple_cast(std::tuple<Args...>& t, std::index_sequence<index...>)
{
    return std::array{ static_cast<I *>(&std::get<index>(t)) ... };
}

template <class I, typename... Args>
inline constexpr auto tuple_cast(std::tuple<Args...>& t)
{
    return tuple_cast<I>(t, std::index_sequence_for<Args...>{});
}

Most of the transformation functions are constexpr, so they can be used at compile time; this works because std::make_tuple has also been constexpr since C++14.
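As a usage sketch for tuple_cast, the example below turns a tuple of concrete objects into an array of base-class pointers, which is presumably how an array of field readers can be obtained from a tuple of reader objects. The non-const overloads are repeated so that the example compiles on its own; Shape, Triangle, Square and total_sides are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <tuple>
#include <utility>

// Hypothetical interface and two implementations.
struct Shape
{
    virtual ~Shape() = default;
    virtual int sides() const = 0;
};

struct Triangle : Shape
{
    int sides() const override { return 3; }
};

struct Square : Shape
{
    int sides() const override { return 4; }
};

// Copy of the non-const tuple_cast overloads from above.
template <class I, typename... Args, std::size_t... index>
constexpr auto tuple_cast(std::tuple<Args...>& t, std::index_sequence<index...>)
{
    return std::array{ static_cast<I *>(&std::get<index>(t)) ... };
}

template <class I, typename... Args>
constexpr auto tuple_cast(std::tuple<Args...>& t)
{
    return tuple_cast<I>(t, std::index_sequence_for<Args...>{});
}

// Sums sides() through the array of interface pointers.
inline int total_sides()
{
    std::tuple<Triangle, Square> shapes;
    auto ptrs = tuple_cast<Shape>(shapes); // std::array<Shape *, 2>

    int total = 0;
    for (Shape* p : ptrs)
    {
        total += p->sides();
    }
    return total;
}
```

The array only stores pointers into the tuple, so the tuple has to outlive it.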

Conclusion

This technique is implemented on top of the simple C++ serialization framework (SSF), so it has the same limitations as SSF. Its advantages remain a lightweight, header-only implementation and near-zero overhead. Another advantage is that, unlike Protobuf for example, it does not require generating C++ wrappers and does not use intermediate data structures.

Current Status

Currently the AWL library contains an implementation of version tolerant serialization (VTS) based on the ideas described above. GV has become a template parameter (it is not global anymore), and it is generated automatically at compile time, with some metaprogramming techniques, from the given structures and data types. See an example and results of the performance tests. Theoretically this code can be used in real-life projects, but I can’t guarantee compatibility between library versions.

2 Responses to Basic ideas of version tolerant serialization in C++

cyberpiok says:

November 10, 2021 at 8:08 am

Thanks for the informative article.

I was doing something similar, but met with the problem of aliased pointer handling.
While searching for solutions I came across this blog, so I was wondering if you can share some thoughts on the problem…

To elaborate the problem with an example:

struct A { // old version of A
float x;
int y;
};

struct B {
A* pa;
float* px;
};

Suppose we’ve serialized the object graph of an instance of B, namely b, where b.px points to &b.pa->x, to a datablob, with memory aliasing correctly handled.
If we were to deserialize this datablob with exactly the same version of the type metadata, there would be no problem whatsoever: we would only need to fix all the pointers after deserialization, and the aliasing behaviour would be identical to that of the original data, exactly as we’d expect.

However, suppose a new version of A has its fields changed:
struct A { // new version of A
double z;
float x;
};

Now to deserialize the datablob of b under the new type version, the expected aliasing behaviour of b_deserialized.px becomes somewhat undefined, since the data layout of A has changed. The problem seems so messy that I just can’t get my head around it and find an actual reasonable solution.
I guess I could simply drop the aliasing guarantee altogether and instantiate a new instance every time I run into an unknown pointer, but that seems too high a price to pay.

I wonder what your suggestion on this would be, thanks!

1. dmitriano says:
  
  November 10, 2021 at 9:06 pm
  
  Hi! Serialization of pointers is an interesting task, but I have not solved it. Currently I can handle the situation where some fields were added to or removed from a struct; see a working example at https://github.com/dmitriano/Awl/blob/master/Tests/VtsTest.cpp . With pointers we also need to handle the situation where multiple pointers point to the same object, so while serializing the object graph we need to compare each pointer with the previously serialized ones, and so on.
  Theoretically your particular case with structs `A` and `B` can be solved with the mechanism used in my example: my code maps new fields to old fields, so it reads the new `x` from the old `x`, and theoretically a similar technique could map the old `px` to the new `px`. Search for `MakeProtoMap` in the AWL code at https://github.com/dmitriano/Awl

DeveloperNote.com

A software developer's blog