Serialization of untranslated data

Post by **Ayin** » March 20th, 2005, 4:48 pm

Currently, the data which is unserialized is automatically translated using gettext when marked as translatable. Then, when it is serialized, the translated version is stored.

In an environment where there is only one language, this is generally OK: re-serialized strings are correctly translated, and then the correct translation is stored. When loaded, they are not re-translated, because the stored strings are not marked as translatable. So everything works fine.

The problem occurs when there are several languages in an environment. For example, when a user saves a game, which is then opened by another user using a different language. Or -- which is more annoying -- when 2 users using different languages are playing a MP game.

In most cases, this problem is, today, solved using ids. For example, units have a name (which is translatable) and a unique id (which is not). When WML containing unit information is unserialized, the translated "name" string is thrown away, and then replaced by a new "name" string created from the id of the unit. This can be done because the game has a global database containing all units, along with their id, and their name.

The problem is that the game does not have a global database for everything. It has ones for units, factions, multiplayer sides, but not for levels, for example. This means that player Alice may set up a multiplayer game using a level she has, and player Bob may join Alice's game even if he does not have this level on his computer. But this also means that Bob will see the level's strings in Alice's language, and will be totally unable to have them in his language. Today, multiplayer games do not have many strings anyway, but if we are to introduce multiplayer campaigns, this will become much more of a problem.

Same problem with the Time-of-Day: ToDs are unlike units and terrain types: they are a part of a level, and the game has no database of them. This is the cause of a quite old, and still unfixed, bug: ToD names are always presented in the language of the game creator during multiplayer games.

To fix it, what I suggest (and what I started to implement) is to serialize the untranslated version of strings. In detail:

* The config class will have a new member, "untranslated_values"
* Upon unserializing text WML, translatable strings will be stored twice: the original version will go into untranslated_values, and the gettextized version into values
* Non-translatable strings will be stored, as before, into values
* Upon serializing text WML, only the untranslated version will be serialized (for translatable strings). Those will be marked with the _ prefix. Untranslatable strings will be added as usual.
* A new prefix will be added to binary WML: the character 0x01, which will be the equivalent of the "_" marker, but for binary. A string without this prefix will be loaded as untranslatable (shouldn't break anything with regard to previous saves), and a string with this prefix will be translated.
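The list above can be sketched roughly as follows. This is only an illustration, not the actual config class: `config_sketch`, `load`, `write` and the `translate` stub (standing in for a gettext lookup, with a made-up one-entry catalog) are all hypothetical names, and quoting/escaping of the written value is ignored.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a gettext lookup; the catalog is invented.
std::string translate(const std::string& msgid)
{
    static const std::map<std::string, std::string> catalog = {
        {"Dawn", "Aube"},
    };
    auto it = catalog.find(msgid);
    return it == catalog.end() ? msgid : it->second;
}

struct config_sketch
{
    std::map<std::string, std::string> values;               // what the game displays
    std::map<std::string, std::string> untranslated_values;  // originals, kept for serializing

    // Called by the text unserializer for each key/value pair it reads.
    void load(const std::string& key, const std::string& text, bool translatable)
    {
        assert(!key.empty());
        if (translatable) {
            untranslated_values[key] = text;  // original version
            values[key] = translate(text);    // gettextized version
        } else {
            values[key] = text;               // stored as before
        }
    }

    // Text serialization of one attribute: translatable strings are written
    // untranslated and keep their _ marker.
    std::string write(const std::string& key) const
    {
        auto u = untranslated_values.find(key);
        if (u != untranslated_values.end()) {
            return key + "=_ \"" + u->second + "\"";
        }
        return key + "=\"" + values.at(key) + "\"";
    }
};
```

With this layout, a save written by a French client still contains the original string, so the next reader can translate it into its own language.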

Waiting for comments, remarks, etc :)

Ayin

Post by **silene** » March 20th, 2005, 6:00 pm

Ayin wrote:A new prefix will be added to binary WML: the character 0x01, which will be the equivalent of the "_" marker, but for binary. A string without this prefix will be loaded as untranslatable (shouldn't break anything with regard to previous saves), and a string with this prefix will be translated.

How do you know which textdomain the string belongs to? Without this information, the string is not translatable. Or do you intend to use the same trick that is used now? (it would prevent applying the whole modularization model I was suggesting, but I can live without this model, since it would mean less work for me)

Post by **Ayin** » March 20th, 2005, 6:42 pm

silene: That's a good point.

I would just suggest attaching a textdomain to each config object. This would mean that:

* With the current system, the "textdomain" value, inside a WML element, would set the textdomain for this element. No leaking possible, as textdomains are contained inside an element. To ease things, we should make sure that the "textdomain" value is always written first when serializing.

* When we switch to a modular approach, the "textdomain" value disappears. Each configuration item automatically gets the textdomain of the module in which it was defined.

Would there be a problem with this approach?

Post by **silene** » March 20th, 2005, 7:19 pm

Ayin wrote:When we switch to a modular approach, the "textdomain" value disappears. Each configuration item automatically gets the textdomain of the module in which it was defined.

I'm all for automatic things, but I don't understand how the client can guess the (textdomain of the) module in which an item was defined. You gave the example of the time of day, could you please explain how it would interact?
I can easily see how it would work if there was an ID for the ToD, you would use the ID to get back to the module in which this particular ToD was defined. But if there is such an ID, the whole point of this thread is moot, so I guess it isn't what you are suggesting.

Post by **Ayin** » March 20th, 2005, 7:35 pm

silene wrote:
Ayin wrote:When we switch to a modular approach, the "textdomain" value disappears. Each configuration item automatically gets the textdomain of the module in which it was defined.
I'm all for automatic things, but I don't understand how the client can guess the (textdomain of the) module in which an item was defined. You gave the example of the time of day, could you please explain how it would interact?
I can easily see how it would work if there was an ID for the ToD, you would use the ID to get back to the module in which this particular ToD was defined. But if there is such an ID, the whole point of this thread is moot, so I guess it isn't what you are suggesting.

Mhh, when I talked about the "textdomain" value disappearing, I was thinking about the WML designer not setting it anymore. Of course, you're right, upon serializing, the writer would need to create it anyway.

Post by **silene** » March 20th, 2005, 7:45 pm

Okay then. You are right, I had understood in your description that the textdomain field would not be needed anymore.

Another problem I can think of, the interaction between string concatenation and translation. How is your protocol supposed to deal with

Code: Select all

description = _"I'm just a poor boy" + "=" + _"I need no sympathy"

Post by **Ayin** » March 20th, 2005, 7:55 pm

silene wrote:Okay then. You are right, I had understood in your description that the textdomain field would not be needed anymore.

Another problem I can think of, the interaction between string concatenation and translation. How is your protocol supposed to deal with
Code: Select all
description = _"I'm just a poor boy" + "=" + _"I need no sympathy"

Yeah, I did just stumble into this problem.

To fix this, I suggest the following:

* Use 2 binary prefixes: 0x01 and 0x02. 0x01 means "start of translatable string" and 0x02 means "start of untranslatable string".
* When unserializing strings, be it from binary or from text, store a string using those prefixes into the untranslated_values string_map. Your string, for example, would become (with octal-encoded character sequences):

\001I'm just a poor boy\002=\001I need no sympathy

Of course, a string starting with neither 0x01 nor 0x02 should be considered untranslatable.

* When serializing back to text, the string would be re-encoded with underscore prefixes and pluses.

* When serializing to binary, the string would be sent unchanged. The binary unserializer, then, would split it, and apply gettext to the relevant parts.
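As a rough sketch of the "serializing back to text" step, assuming the 0x01/0x02 prefixes described above: the function names `split_parts` and `to_text_wml` are made up for this example, and escaping of quotes inside the strings is left out.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

const char TRANSLATABLE = '\001';    // start of translatable part
const char UNTRANSLATABLE = '\002';  // start of untranslatable part

// Splits an encoded string into (is_translatable, text) parts.
std::vector<std::pair<bool, std::string>> split_parts(const std::string& encoded)
{
    std::vector<std::pair<bool, std::string>> parts;
    if (encoded.empty()) return parts;
    // A string starting with neither prefix is untranslatable as a whole.
    if (encoded[0] != TRANSLATABLE && encoded[0] != UNTRANSLATABLE) {
        parts.emplace_back(false, encoded);
        return parts;
    }
    for (char c : encoded) {
        if (c == TRANSLATABLE || c == UNTRANSLATABLE) {
            parts.emplace_back(c == TRANSLATABLE, std::string());
        } else {
            assert(!parts.empty());  // guaranteed: the first character is a prefix
            parts.back().second += c;
        }
    }
    return parts;
}

// Re-encodes the parts with underscore prefixes and pluses, as in text WML.
std::string to_text_wml(const std::string& encoded)
{
    const auto parts = split_parts(encoded);
    if (parts.empty()) return "\"\"";
    std::string out;
    for (const auto& part : parts) {
        if (!out.empty()) out += " + ";
        if (part.first) out += "_";
        out += '"' + part.second + '"';
    }
    return out;
}
```

The binary unserializer could reuse the same split and apply gettext only to the parts flagged as translatable.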

Post by **silene** » March 20th, 2005, 8:14 pm

Fine. I had always thought that string concatenation and translation should be the work of the preprocessor and not the parser (as it is now). If I understand you correctly, you are suggesting to move this whole work into a third layer. Let's call it an "interpreter"; it would be responsible for composing the final string.

With this new concept, it makes sense to me to change the way we are dealing with translations in the config system. We currently have a mix of translated and untranslated strings in the fields. I would suggest we just scrap all the translated strings and just keep around the original strings. Then your protocol would directly apply since we would only have uninterpreted strings in the config instances. In fact, the uninterpreted strings would directly be stored in the config instances with your protocol.

The interpreted strings would then be generated on the fly, when the fields are requested from the config instances. In order to keep the work to a minimum, a config instance could have a cache of already interpreted strings; but it would not be a part of the instance stricto sensu, and hence would not be serialized.

As a side note, such a scheme would also allow for savefiles to evolve when translations get updated instead of getting stuck for eternity.

So now it's my turn to ask: am I missing something?

Post by **silene** » March 20th, 2005, 9:27 pm

I thought a bit more about the details. And I really like what I suggested. So here are some additional details. The original WML code is

Code: Select all

[message]
description = _ "I'm just a poor boy" + {COLUMN_SEPARATOR} + _ "I need no sympathy"
[/message]

After preprocessing and parsing, the config instance is

Code: Select all

config<"message">= {
  ["description"] = "\001I'm just a poor boy\002=\001I need no sympathy",
  ["textdomain"] = "bohemian"
}

A textdomain field would be added to any config instance. It would be added right at creation by using the parent value. It could then be modified by WML code as it is currently possible. It would allow us to get rid of the textdomain stack that is present in the code. As an optimization, when the config instance is completely created, the textdomain field could be scrapped if not needed for the instance.

I'm also suggesting adding a \003 as a possible start value to say: don't interpret the string, use it as is (except for \003). It would be especially useful for uninterpretable strings that start with \001 and \002 (and \003 as a consequence).

Then, when the field is accessed, the string will be interpreted

Code: Select all

cfg["description"] -> "J'n kvtu b qpps cpz=J o..."

If it starts with \001, \002, or \003, it will be interpreted; otherwise it will be directly used. In the case of \00[12], the interpretation will use the value of the textdomain field to translate the substrings. The interpreted string could also be cached so that it doesn't have to be reinterpreted later on. This last part requires a bit of thinking, since it may be more interesting to have a global cache rather than a per-config instance cache, but I'm not sure (if we have this kind of string duplication, it would make sense to fix it, independently of whether we switch to this proposal or not).
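A minimal sketch of that interpretation step, without the cache: `interpret` is a hypothetical free function, and `translate` is a stub standing in for a dgettext-style lookup with an invented catalog for the "bohemian" textdomain. Empty parts are skipped on the assumption that an empty msgid should never be handed to gettext.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for dgettext(textdomain, msgid); the catalog is made up.
std::string translate(const std::string& textdomain, const std::string& msgid)
{
    static const std::map<std::string, std::map<std::string, std::string>> catalogs = {
        {"bohemian", {
            {"I'm just a poor boy", "Je suis juste un pauvre gars"},
            {"I need no sympathy", "Je n'ai besoin d'aucune compassion"},
        }},
    };
    auto domain = catalogs.find(textdomain);
    if (domain == catalogs.end()) return msgid;
    auto entry = domain->second.find(msgid);
    return entry == domain->second.end() ? msgid : entry->second;
}

// Composes the final string from a stored, uninterpreted one.
std::string interpret(const std::string& raw, const std::string& textdomain)
{
    if (raw.empty()) return raw;
    if (raw[0] == '\003') return raw.substr(1);            // use as is, minus the \003
    if (raw[0] != '\001' && raw[0] != '\002') return raw;  // no prefix: used directly

    std::string out, part;
    bool translatable = false;
    auto flush = [&]() {
        if (part.empty()) return;
        if (translatable) {
            assert(!textdomain.empty());  // a translatable part needs a textdomain
            out += translate(textdomain, part);
        } else {
            out += part;
        }
        part.clear();
    };
    for (char c : raw) {
        if (c == '\001' || c == '\002') {
            flush();                        // close the previous part
            translatable = (c == '\001');
        } else {
            part += c;
        }
    }
    flush();
    return out;
}
```

A cache, whether per instance or global, would simply wrap this function and key on the raw string plus the textdomain.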

The changes to the code should be minimal. The parser will get simpler since it won't have to translate strings anymore. config::operator[] will need some deep changes but it should be pretty trivial. And no change anywhere else should be needed. In particular, no need to change the serializers, be they text or binary.

Post by **Ayin** » March 21st, 2005, 7:51 am

I like your proposal. It fixes many issues with mine. However, it will probably need much more work than my original one to implement, because:

* I don't think it's very good design to have operator[]() and operator[]() const behave much differently. So interpretation should be done using another method.

* Unserialized objects are sometimes (quite often, that is) stored into their own structures, which do not know about the original config object which they used at construction. It would be reasonable for them to get uninterpreted strings, so they can be correctly re-serialized. This means that the "interpret" function cannot be a config method anyway; it must be a global, but this also means we must modify the existing code so that this function is always called before presenting a string to the user.

* This also means that objects that are supposed to contain translatable strings will need to contain a textdomain. Or maybe, textdomains should just be associated to uninterpreted strings.

silene wrote:A textdomain field would be added to any config instance. It would be added right at creation by using the parent value. It could then be modified by WML code as it is currently possible. It would allow us to get rid of the textdomain stack that is present in the code. As an optimization, when the config instance is completely created, the textdomain field could be scrapped if not needed for the instance.

Not really: the textdomain is generally needed for re-serialization; it's pretty hard to tell whether a particular config instance will have to be serialized later on or not.

---

Anyway, despite the changes needed, I think it's something worthwhile to implement.

Post by **silene** » March 21st, 2005, 8:26 am

Ayin wrote:I like your proposal. It fixes many issues with mine. However, it will probably need much more work than my original one to implement, because:

* I don't think it's very good design to have operator[]() and operator[]() const behave much differently. So interpretation should be done using another method.

I wholeheartedly agree. But please note that they already behave quite differently (you won't read back what you wrote in); so it's no new behavior.

Ayin wrote:* Unserialized objects are sometimes (quite often, that is) stored into their own structures, which do not know about the original config object which they used at construction. It would be reasonable for them to get uninterpreted strings, so they can be correctly re-serialized. This means that the "interpret" function cannot be a config method anyway; it must be a global, but this also means we must modify the existing code so that this function is always called before presenting a string to the user.

* This also means that objects that are supposed to contain translatable strings will need to contain a textdomain. Or maybe, textdomains should just be associated to uninterpreted strings.

Stuffing textdomains into every interpretable string could be quite costly. I would rather keep around config instances. But before going any further, can you give me an example of a translated string that is stored into a specialized structure, whose relevant config has been destroyed, and that will later need to be serialized to another client? I may be underestimating the phenomenon, but I'm not sure it happens that often.

Ayin wrote:
silene wrote:A textdomain field would be added to any config instance. It would be added right at creation by using the parent value. It could then be modified by WML code as it is currently possible. It would allow us to get rid of the textdomain stack that is present in the code. As an optimization, when the config instance is completely created, the textdomain field could be scrapped if not needed for the instance.
Not really: the textdomain is generally needed for re-serialization; it's pretty hard to tell whether a particular config instance will have to be serialized later on or not.

I don't think you understood my point. If a config instance contains no \001 string, why keep the textdomain around? It will never be needed anymore, even if the instance is serialized.

Post by **Ayin** » March 21st, 2005, 7:34 pm

silene wrote:I wholeheartedly agree. But please note that they already behave quite differently (you won't read back what you wrote in); so it's no new behavior.

Indeed not. But the current implementation already being questionable does not mean we should abuse it further :)

silene wrote:Stuffing textdomains into every interpretable string could be quite costly. I would rather keep around config instances. But before going any further, can you give me an example of a translated string that is stored into a specialized structure, whose relevant config has been destroyed, and that will later need to be serialized to another client? I may be underestimating the phenomenon, but I'm not sure it happens that often.

Mhh, at least the following classes are re-serialized from their data members and not from the config used to create them:

* unit
* team
* time_of_day
* game_state
* map (but nothing translatable here)
* map_labels

It would be possible, to fix those issues, to define a new class which would be more or less like that:

Code: Select all

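#include <string>

class t_string
{
public:
    t_string();
    t_string(const t_string&);
    t_string(const std::string& string);                                 // untranslatable
    t_string(const std::string& string, const std::string& textdomain);  // translatable
    ~t_string();

    t_string& operator=(const t_string&);
    t_string& operator=(const std::string&);

    operator std::string&();             // translates, caches and returns the result
    const std::string& value() const;    // untranslated value

private:
    int textdomain_id_;                  // index into the textdomain database
    std::string value_;                  // untranslated string

    std::string* translated_value_;      // cached translation, if any
};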

This class could be constructed either from a string, and be untranslatable, or from a string / textdomain pair, and be translatable. The textdomain would be stored as an ID, indexing a database that maps IDs to textdomain strings.

It would have an implicit conversion to std::string&, which would process the string, translate it, cache the translated value and return it (or just return value_ if it is not translatable), and a value() method, which would just return its untranslated value. We may also define stuff like operator==, and empty().

Then, we would change config::string_map to be a map of t_strings instead. Custom structures (like class, unit, etc) would, still, mainly use std::strings, but would use t_strings for strings which are known to be translatable. This would allow the changes to be made with much less effort, and still look elegant in the end.
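To make the lazy translation and caching concrete, here is a simplified sketch. It deviates from the declaration above on purpose: it is named `t_string_sketch`, stores the textdomain as a plain string rather than an ID, treats the whole value as a single msgid (no concatenation), and adds a `str()` accessor. `translate` is again a hypothetical stub for a dgettext-style call, and `lookup_count` exists only so the caching can be observed.

```cpp
#include <cassert>
#include <map>
#include <string>

int lookup_count = 0;  // counts catalog lookups, for illustration only

// Hypothetical stand-in for dgettext(); the catalog is invented.
std::string translate(const std::string& /*textdomain*/, const std::string& msgid)
{
    static const std::map<std::string, std::string> catalog = {{"Dawn", "Aube"}};
    ++lookup_count;
    auto it = catalog.find(msgid);
    return it == catalog.end() ? msgid : it->second;
}

class t_string_sketch
{
public:
    t_string_sketch() = default;
    // Untranslatable string.
    t_string_sketch(const std::string& value) : value_(value) {}
    // Translatable string, bound to a textdomain.
    t_string_sketch(const std::string& value, const std::string& textdomain)
        : value_(value), textdomain_(textdomain), translatable_(true)
    {
        assert(!textdomain.empty());
    }

    // Text for display: translated on first use, then served from the cache.
    const std::string& str() const
    {
        if (!translatable_) return value_;
        if (!cached_) {
            translated_ = translate(textdomain_, value_);
            cached_ = true;
        }
        return translated_;
    }
    operator const std::string&() const { return str(); }

    // Untranslated value and textdomain, i.e. what a serializer would write.
    const std::string& value() const { return value_; }
    const std::string& textdomain() const { return textdomain_; }

private:
    std::string value_;
    std::string textdomain_;
    bool translatable_ = false;
    mutable std::string translated_;  // cache, never serialized
    mutable bool cached_ = false;
};
```

A structure such as unit could then hold a member of this type for its name and still re-serialize `value()` and `textdomain()` unchanged.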

Post by **silene** » March 21st, 2005, 7:48 pm

Ayin wrote:The textdomain would be stored as an ID, indexing a base mapping them to textdomain strings.

That's no good. You want the textdomain to be serializable, and it will be hard to do so if the IDs depend on the number of user campaigns installed on the various clients.

But if you think of a good way to do it, I would simply stuff this integer directly in the strings:

Code: Select all

\004ID\001I'm just a...

Indeed, what you suggest wouldn't work.

Ayin wrote:Then, we would change config::string_map to be a map of t_strings instead.

It won't deal with string concatenations. And please don't suggest a map of vector of t_strings.

Post by **Ayin** » March 21st, 2005, 8:02 pm

silene wrote:That's no good. You want the textdomain to be serializable, and it will be hard to do so if the IDs depend on the number of user campaigns installed on the various clients.

Simple. The game would maintain a std::string to int map, and an int to std::string vector. Each time it would unserialize a string, it would look into the textdomain database to find whether its textdomain is present. If not, it would add the textdomain to the database, and create a new id. When serializing, textdomains would be translated back into strings: the id is, of course, meaningless without the textdomain database.
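Something along these lines, as a sketch only; `textdomain_db`, `id_of` and `name_of` are names invented for the example, and the textdomain names in the usage are placeholders.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

class textdomain_db
{
public:
    // Used when unserializing: returns the local id of a textdomain,
    // registering it with a fresh id if it was not known yet.
    int id_of(const std::string& name)
    {
        auto it = ids_.find(name);
        if (it != ids_.end()) return it->second;
        const int id = static_cast<int>(names_.size());
        names_.push_back(name);
        ids_[name] = id;
        return id;
    }

    // Used when serializing: turns a local id back into the textdomain string.
    const std::string& name_of(int id) const
    {
        assert(id >= 0 && static_cast<std::size_t>(id) < names_.size());
        return names_[static_cast<std::size_t>(id)];
    }

private:
    std::map<std::string, int> ids_;   // std::string to int map
    std::vector<std::string> names_;   // int to std::string vector
};
```

Ids handed out this way are local to one client: two clients may assign different ids to the same textdomain, which is harmless as long as only the strings ever go over the wire.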

silene wrote:But if you think of a good way to do it. I would simply stuff this integer directly in the strings:
Code: Select all
\004ID\001I'm just a...

Why not. But is there any point?

silene wrote:Indeed, what you suggest wouldn't work.
Ayin wrote:Then, we would change config::string_map to be a map of t_strings instead.
It won't deal with string concatenations. And please don't suggest a map of vector of t_strings.

Why wouldn't it work? The value_ member would contain the string, encoded according to the above protocol. The t_string is, actually, just a way to wrap encoded strings, and to associate them with a textdomain, with minimal changes to the code.

The Battle for Wesnoth Forums
