
Internationalization: Quick Start

From v0.98 on, Indico relies on Babel to manage language dictionaries and to provide basic date formatting and other locale-related functions.

Writing i18n-aware code

The basics

The basic tool in code i18n is _(). In Indico, this function is available everywhere by default (no need to import it!). So, the code:

print _('Fetch the cow')

would be displayed as:

Fetchez la vache

assuming that a Franglais dictionary has been loaded. A dictionary is just a map that associates an "original string" with a "translated string". Every language has its own dictionary, which is application dependent and has to be loaded beforehand (we will see more below).

Anyone who has done (web) interface programming before can immediately see some tricky use cases that would not work. For example:

print _('Fetch %d cows' % number)

would imply that we have an infinite dictionary. Of course the fix is easy:

print _('Fetch %d cows') % number

Both Babel and msgfmt can handle format strings, and can even warn you when a translation does not contain the same format specifiers as the original.
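
To see why the position of the % operator matters, here is a minimal sketch that stands in for _() with a plain dictionary lookup. The FRANGLAIS dictionary and the translate() helper are invented for this illustration (Indico's real _() is backed by Babel catalogs), and the sketch is written for Python 3 using only the standard library.

```python
# Toy stand-in for _(): a dictionary lookup that falls back to the original.
# FRANGLAIS and translate() are hypothetical names used only for this sketch.
FRANGLAIS = {
    "Fetch the cow": "Fetchez la vache",
    "Fetch %d cows": "Fetchez %d vaches",
}


def translate(message):
    """Return the translation if the dictionary has one, else the original."""
    return FRANGLAIS.get(message, message)


number = 3

# Wrong: the string is formatted first, so the lookup key is 'Fetch 3 cows',
# which is not in the dictionary. The original text comes back untranslated.
wrong = translate("Fetch %d cows" % number)

# Right: look up the constant template, then fill in the number.
right = translate("Fetch %d cows") % number

print(wrong)
print(right)
```

Only the second form gives the translator a single, fixed string to work on.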

Plurals

However, this is not enough. Suppose that 'number = 1'. How do we handle that? We could add an if expression and a single translation for "Fetch 1 cow". However, there is an easier/cleaner way:

from indico.util.i18n import ungettext # unicode ngettext
print ungettext("Fetch %s cow", "Fetch %s cows", number) % number

You might be asking why we do not simply do:

print "%s %s %s" % (_("Fetch"), number, ungettext("cow", "cows", number)) 

The answer is: we should never make any assumptions about the order of words in a sentence. While in English it is generally true that the verb comes before the object, that may not hold for other languages. That's why it's better to spend a few more characters and translate the whole sentence - this way, translators won't be limited by the assumptions you've made in your code.

Another good example:

print ungettext('Fetch the cow', 'Fetch the cows', number)

In Franglais this would be Fetchez la vache or Fetchez les vaches (notice that the article changes!). So, make no assumptions about phrase articulation.
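
The plural lookup can be tried out without any Indico code. The sketch below builds on the standard library's gettext.NullTranslations; the FranglaisCatalog class and its in-memory table are assumptions made for this example (real catalogs are compiled from PO files), and the two-form rule it applies is a simplification. It is written for Python 3.

```python
import gettext


class FranglaisCatalog(gettext.NullTranslations):
    """Hypothetical in-memory catalog, for illustration only."""

    # (singular, plural) in English -> (singular, plural) in Franglais
    PLURALS = {
        ("Fetch the cow", "Fetch the cows"):
            ("Fetchez la vache", "Fetchez les vaches"),
        ("Fetch %s cow", "Fetch %s cows"):
            ("Fetchez %s vache", "Fetchez %s vaches"),
    }

    def ngettext(self, singular, plural, n):
        forms = self.PLURALS.get((singular, plural))
        if forms is None:
            # Unknown message: fall back to the untranslated English forms.
            return super().ngettext(singular, plural, n)
        # Simplified rule: the first form is used only when n == 1.
        return forms[0] if n == 1 else forms[1]


catalog = FranglaisCatalog()
for number in (1, 3):
    # The whole sentence is translated as one unit, then formatted.
    print(catalog.ngettext("Fetch %s cow", "Fetch %s cows", number) % number)
```

Because each form is a complete sentence, the translator is free to change the article, the word order, or anything else between singular and plural.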

Dates/Times

It is tempting to use Python stdlib's strftime() each time we want to convert a datetime to a string. However, this function uses the currently set locale, which is a process-wide setting, so switching it per request is not thread safe. Babel provides a format_datetime function that works more or less the same way and can take a locale as a parameter. We put a wrapper around it so that it uses the currently defined (thread-specific) Indico locale by default, making things easier for everyone.

>>> from indico.util.date_time import format_datetime, now_utc

>>> format_datetime(now_utc())
'28 Jul 2011 12:38:03'

>>> format_datetime(now_utc(), locale="fr_FR")
'28 juil. 2011 12:39:23'

>>> format_datetime(now_utc(), locale="fr_FR", format="long", timezone='Europe/Zurich')
'28 juillet 2011 12:40:05 +0000'

Timezones are also supported:

>>> from pytz import timezone
>>> format_datetime(now_utc(), locale='pt_PT', format='full', timezone=timezone('Europe/Zurich'))
'quinta-feira, 28 de Julho de 2011 14H45m18s Horário Suíça'

Custom formats may be specified using  LDML:

>>> format_datetime(now_utc(), locale='es_ES', format='yyyy G')
'2011 d.C.'

More information can be found at  Babel's docs on Date Formatting.
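
The idea of a wrapper that falls back to a thread-specific locale can be sketched with the standard library alone. Everything here is an assumption made for illustration: set_locale(), current_locale(), format_date() and the tiny MONTH_ABBR table are invented names, and this is not how Indico's format_datetime wrapper is implemented (Babel takes its data from CLDR, not from a hand-written table). The sketch is written for Python 3.

```python
import threading
from datetime import datetime

# Hypothetical per-thread state holding the "current" locale.
_state = threading.local()

# Hand-written month abbreviations, for illustration only.
MONTH_ABBR = {
    "en_GB": ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
    "fr_FR": ["janv.", "févr.", "mars", "avr.", "mai", "juin",
              "juil.", "août", "sept.", "oct.", "nov.", "déc."],
}


def set_locale(name):
    """Set the locale for the calling thread only."""
    _state.locale = name


def current_locale():
    """Return the calling thread's locale, defaulting to en_GB."""
    return getattr(_state, "locale", "en_GB")


def format_date(dt, locale=None):
    """Format dt with an explicit locale, or with the thread's own locale."""
    months = MONTH_ABBR[locale or current_locale()]
    return "%d %s %d" % (dt.day, months[dt.month - 1], dt.year)


moment = datetime(2011, 7, 28, 12, 38, 3)
results = {}


def worker(locale_name):
    # Each thread sets its own locale without affecting the others.
    set_locale(locale_name)
    results[locale_name] = format_date(moment)


threads = [threading.Thread(target=worker, args=(name,)) for name in MONTH_ABBR]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(results)
```

Since no process-wide setting is touched, two requests served by different threads cannot change each other's output.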

Days/Months

A special case of date/time translation is the translation of fixed names for week days or months. Presently "Mon", "Tue", ... and "Jan", "Feb", ... are translated strings in PO files. Obviously we would prefer a function for that, rather than requesting all translations from human translators. Here are the instructions, if you need to get properly translated week day and month names, etc.

Parametrisation of similar messages and "universal" terms

Rather than multiple translated messages

print _('Value for day outside limits!')
print _('Value for month outside limits!')
print _('Value for duration outside limits!')

similar messages should be parametrised as in

print _('Value for %s outside limits!') % _("day")
print _('Value for %s outside limits!') % _("month")
print _('Value for %s outside limits!') % _("duration")

or better

value_msg = _('Value for %s outside limits!')
print value_msg % _("day")
print value_msg % _("month")
print value_msg % _("duration")

This will make the code slightly more complicated to handle. But the management of the n translation cases will be simplified significantly.

As a matter of fact, many languages import English expressions in everyday or technical discussions. It may also happen that users with different interface languages have to discuss matters or problems in Indico. Therefore the suggestion is to use explanatory translations for terms related to the Indico software.

print _('%s : status set to %s') % (xy, _("PENDING"))
print _('%s : status set to %s') % (xy, _("REFUSED"))
print _('%s : status set to %s') % (xy, _("ACCEPTED"))

The German translations could be:

PENDING PENDING (Warteschlange)
REFUSED REFUSED (abgelehnt)
ACCEPTED ACCEPTED (bestätigt)

A similar complication is the case of "material" (example en-fr) and the pre-defined values:

Slides Transparents
Minutes Compte-rendu

Whereas a "fr" user should be able to submit his slides as Transparencies, an "en" reader should see them under Slides.

Punctuation

Punctuation marks should be part of the original strings

print _('Error: <font color="red">%s</font>') % _('Cannot perform!')

rather than added automatically

print _('Error: <font color="red">%s!</font>') % _('Cannot perform')

Explanation: In this example the Spanish equivalent would use the form '¡No puede ejecutar!', whereas French typography requires 'Ne peut pas exécuter&nbsp;!'.
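
A minimal sketch of the recommended form, using made-up per-language dictionaries (CATALOGS and translate() are hypothetical names, not Indico code). Because the exclamation mark belongs to the translated string, each language can punctuate in its own way. Written for Python 3.

```python
# Hypothetical per-language dictionaries, for illustration only.
CATALOGS = {
    "es": {"Cannot perform!": "¡No puede ejecutar!"},
    "fr": {"Cannot perform!": "Ne peut pas exécuter&nbsp;!"},
}


def translate(message, lang):
    """Look the message up in the given language, else return it unchanged."""
    return CATALOGS.get(lang, {}).get(message, message)


# The template carries no punctuation of its own.
template = 'Error: <font color="red">%s</font>'

rendered = {}
for lang in ("en", "es", "fr"):
    rendered[lang] = template % translate("Cannot perform!", lang)
    print(rendered[lang])
```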

Numerical values

If you are using numbers, percentages, etc. maybe you should see  this.

Common mistakes

It is very hard to write/maintain an application that is 100% internationalized, since this requires a great deal of coordination not only with the translators, but between developers as well. Not only do people often forget to internationalize strings properly, but when they do remember, they often do it in the wrong place. Here are some examples of good and bad practices.

For example:

class UserContainer(object):
    _userTypes = {
        # wrong!
        'admin': _('Administrator'),
        'regular': _('Regular user')
    }

    # ...

    def getUserType(self):
        # can you spot the problem?
        return UserContainer._userTypes[self.user_type]

Why is this wrong? Class attributes are initialized only once, at module load time. So, since modules are normally only loaded once per process, a French user that is using this app will theoretically (see paragraph below) get the same translation as a Japanese one. The right way to do so would be delaying translation till the template is rendered: we could have a dictionary of "English" strings and then just call _() from the piece of template (or other i18n-safe code) that calls getUserType.

Fortunately, our current i18n code supports (thanks to Babel) "lazy translation": the result string is replaced with a proxy object that only performs the translation when it is asked for its string value. The situation above would therefore not be problematic. However, it is not at all good practice and should be avoided.
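
The principle behind lazy translation can be sketched in a few lines. This does not reproduce Babel's or Indico's implementation: LazyString, set_language(), translate() and the CATALOGS table are all hypothetical names invented for this example, written for Python 3.

```python
import threading

# Hypothetical per-thread "current language".
_state = threading.local()

# Made-up dictionary, for illustration only.
CATALOGS = {
    "fr": {"Administrator": "Administrateur"},
}


def set_language(lang):
    _state.lang = lang


def translate(message):
    """Translate into the calling thread's current language."""
    lang = getattr(_state, "lang", "en")
    return CATALOGS.get(lang, {}).get(message, message)


class LazyString(object):
    """Proxy that keeps the original string and translates it on str()."""

    def __init__(self, message):
        self.message = message

    def __str__(self):
        # The lookup happens here, at display time, not at import time.
        return translate(self.message)


# Built once at module load time, like the class attribute above.
USER_TYPES = {
    "admin": LazyString("Administrator"),
    "regular": LazyString("Regular user"),
}

set_language("fr")
french = str(USER_TYPES["admin"])

set_language("en")
english = str(USER_TYPES["admin"])

print(french, english)
```

The same object yields different text depending on the language in force when it is converted to a string, which is why the module-level dictionary stops being a problem.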

Another classic mistake is:

from persistent import Persistent

# ...

class User(Persistent):
    def __init__(self, name, position):
        self.position = _(position)

This one should be easier to spot, if you consider that Persistent objects are stored in the database. By translating position, we get it in the language that was in use at the time of object creation. The right way to do it is, once again, to translate "as late as possible".
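
A sketch of the "as late as possible" approach for stored objects, under the assumption of a toy dictionary lookup (CATALOGS, translate() and position_for() are hypothetical names, and the plain User class below stands in for a persistent one). Written for Python 3.

```python
# Made-up dictionary, for illustration only.
CATALOGS = {
    "fr": {"Manager": "Directeur"},
}


def translate(message, lang):
    """Look the message up in the given language, else return it unchanged."""
    return CATALOGS.get(lang, {}).get(message, message)


class User(object):
    def __init__(self, name, position):
        self.name = name
        # Store the untranslated string: this is what would be persisted.
        self.position = position

    def position_for(self, lang):
        # Translate only when the value is about to be displayed.
        return translate(self.position, lang)


user = User("Ada", "Manager")
print(user.position)            # what is stored
print(user.position_for("fr"))  # what a French user sees
print(user.position_for("ja"))  # no Japanese entry: falls back to the original
```

The stored value stays language-neutral, so every reader gets it in their own language regardless of who created the object.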

Managing dictionaries

This section concerns the programmatic management of the internationalisation files within InDiCo. If you want to use new translations or test your developments, you should read this.

Babel does all the dirty work of extracting internationalised strings for us and creating the respective messages.pot file. Since it is integrated with setuptools, nothing more than setup.py is needed.

For developers

Extracting messages

Extracting messages is the first step in the translation process. This should be done after any modification that includes new text strings that are displayed in the user interface or changes existing ones.

python setup.py extract_messages
python setup.py extract_messages_js

Notice that there are two catalogs: one for Python/Mako code (server side), and another one for JavaScript (client side). This allows us to have a lighter, client-side JavaScript dictionary.

Creating a new dictionary

In order to create a new translation dictionary for a new locale, you should use:

python setup.py init_catalog --locale=pt_PT
python setup.py init_catalog_js --locale=pt_PT

(Replace pt_PT with your locale's name)

This will take messages.pot and messages-js.pot and create new empty translation files (messages.po and messages-js.po) under the appropriate locale dir (indico/locale/pt_PT in this case).

Updating existing dictionaries

After every message extraction, you should update the dictionaries that may already exist with the new catalog. That can be done using the following commands:

python setup.py update_catalog
python setup.py update_catalog_js

After this, it's time to commit the result and push it to the global repo, so that Transifex will automatically fetch it and make it available for all the different translators. If for some reason this automatic mechanism doesn't work, you can always manually upload the *.pot files.

Compiling

After all translation work is finished, you should replace the *.po files with the ones downloaded from Transifex, and compile the dictionaries:

python setup.py compile_catalog
python setup.py compile_catalog_js

Notice that the JS dictionary gets converted to JS/JSON code instead of a *.mo file.

And now the translations are ready for use by Indico or binary distribution.

For translators

We have created a  transifex project for Indico. If you wish to take part in the translation effort, apply for the team corresponding to your language, or propose the creation of a new one if needed.

Internationalisation: More

Internationalisation is British spelling. ;-)

I would like to suggest that we have more pages for discussions and specifications for a coherent and efficient i18n activity. The following topics (pages) come to my mind. (I do not have time now, but if no one else wants to jump in, I will take care of them later.)

  • Internationalisation: German. Some translations are ambiguous ("Event" -> "Ereignis", "Veranstaltung", "Event"(!)). In order to avoid that a "Presenter" appears as "Sprecher" in one place and "Vortragender" in another, we must define language-dependent guidelines. Obviously, everything can be looked up in the source code and derived from the context. But it is more efficient to follow common guidelines. And different translators may have different opinions on ideal translations, thus leading to eternal "translations in circles".
  • Internationalisation: Choice of language codes. Presently we have projects en, fr_FR, es_ES, de, et_EE, pt_PT (and a request for ru). Does it make sense to specify fr_FR (and fr_BE, fr_CH, ...) rather than fr? For de, I am convinced - as far as InDiCo is concerned - that there is no difference between the "dialects" de_DE, de_CH, de_BE, de_AT, de_LI, de_LU! (Did I forget one?) By the way: Shouldn't xy_AB for country AB supersede the more general PO file for language xy? Then only exceptions from the rule would appear in the specific files xy_AB, and all common translations would go to xy. The choice of en (for "all English") is probably the most debatable, because the differences between European and American English are significant (summarize - summarise, flavor - flavour, center - centre, ...)
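
The fallback proposed in the last item (xy_AB first, then xy, then the original string) could look roughly like the sketch below. It only illustrates the suggestion and does not describe how Indico currently resolves locales; CATALOGS, lookup_chain() and translate() are hypothetical names, and the dictionary entries are made up. Written for Python 3.

```python
# Made-up dictionaries: "de" holds the common translations,
# "de_CH" only the exceptions for that country.
CATALOGS = {
    "de": {"Event": "Veranstaltung", "Presenter": "Vortragender"},
    "de_CH": {"Presenter": "Referent"},
}


def lookup_chain(locale):
    """Return the catalogs to try, most specific first (xy_AB, then xy)."""
    chain = [locale]
    language = locale.split("_")[0]
    if language != locale:
        chain.append(language)
    return chain


def translate(message, locale):
    """Return the first translation found along the chain, else the original."""
    for name in lookup_chain(locale):
        catalog = CATALOGS.get(name, {})
        if message in catalog:
            return catalog[message]
    return message


print(translate("Presenter", "de_CH"))  # country-specific exception
print(translate("Event", "de_CH"))      # inherited from the general "de"
print(translate("Presenter", "de_AT"))  # no de_AT file: falls back to "de"
print(translate("Event", "pt_PT"))      # nothing available: original string
```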