How I learned 12 languages - in one night

Using the latest from machine learning and some clever caching

Tom • Cloud, Personal, Machine Learning, UI & UX •

A long way to languages

I've been kicking around an idea for quite some time now that sounded interesting as well as challenging: what would be necessary to achieve multi-language support for my web app with as little maintenance as possible? Also considering costs as a main constraint (this site doesn’t serve any ads and doesn’t use any tracking at all, so no incoming revenue here), what would a valid solution look like?

The plan

Everything starts and ends with the choice of translation engine. Thanks to using Ubuntu as one of my daily drivers, I once discovered a nice little app in the app store called "Argos Translate", an open-source translation engine built on modern ML models similar to those that power DeepL. If you don't know DeepL, it's a great translator that you can use for free on their website.

But back to Argos: after taking a look at the related repository, I saw that there's also an OSS Python library available, which would fit nicely into a self-hosted environment. After toying around for a short period of time, I decided to look at some translators offered as SaaS, as the installation process of Argos Translate didn't work out as nicely as I had hoped.

I therefore settled on another service, Cloud Translation from GCP, which offers 500,000 characters for free per month and then charges per additional million characters.

It’s all about caching

Thanks to the setup of Next.js with ISR (Incremental Static Regeneration), I can request the translations for each page on demand, which simplifies planning quite a lot, as no single deployment has to ship all translations at once.
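In code, this on-demand regeneration boils down to the `revalidate` value returned by a page's `getStaticProps`. A minimal sketch, assuming the 4-hour window described below and a hypothetical `loadTranslatedPage` helper standing in for the real fetch-and-translate step:

```typescript
// Hypothetical loader standing in for the real "fetch + translate" step.
async function loadTranslatedPage(locale: string): Promise<{ body: string }> {
  return { body: `content for ${locale}` };
}

// In a real Next.js page this function would be exported from the page module.
// ISR rebuilds the page on demand, but at most once per revalidation window.
async function getStaticProps({ params }: { params: { locale: string } }) {
  const content = await loadTranslatedPage(params.locale);
  return {
    props: { content },
    revalidate: 60 * 60 * 4, // regenerate at most every 4 hours
  };
}
```

The nice part of this design is that a page that nobody requests never triggers a translation job at all.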

Still, I wasn't certain how to handle the caching of the translated strings. Sure, Vercel's edge network (where this PWA is hosted) can absolutely handle this task. But I wanted the deployments to be independent of the translations, which is why I created one extra layer of caching via a simple Firestore instance, also hosted on GCP.
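The caching layer can be sketched as a cache-first lookup. The names here are hypothetical, an in-memory `Map` stands in for the Firestore collection, and the translator is injected as a function (the real code would use the `@google-cloud/firestore` and Cloud Translation clients):

```typescript
// The translator is injected so the cache logic stays independent of GCP.
type TranslateFn = (text: string, target: string) => Promise<string>;

// In-memory stand-in for the Firestore collection.
const cache = new Map<string, string>();

// Hypothetical cache-key scheme: target language + source string.
function cacheKey(text: string, target: string): string {
  return `${target}:${text}`;
}

async function translateCached(
  text: string,
  target: string,
  translate: TranslateFn
): Promise<string> {
  const key = cacheKey(text, target);
  const hit = cache.get(key);
  if (hit !== undefined) return hit; // served from cache, no API call
  const translated = await translate(text, target); // e.g. Cloud Translation
  cache.set(key, translated); // persist for the next request
  return translated;
}
```

Because every string is only ever translated once per target language, the monthly character count against the free tier grows with new content, not with traffic.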

The largest challenge was parsing and replacing the block content of every article. In case you don't know: the block content describes the actual body of the article, which I create in a CMS. These blocks are not plain text; each one is embedded in a special data structure that stores semantic information and metadata. Reliably detecting and translating only the relevant strings was one of the larger parts of this implementation.
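To illustrate, here is a simplified, hypothetical version of such a block structure, together with a walker that translates only the text spans and passes all metadata through untouched (the real CMS format is richer than this):

```typescript
// Simplified, hypothetical block shapes; real CMS block content
// carries more fields, but the principle is the same.
interface Span { _type: "span"; text: string; marks?: string[] }
interface Block { _type: "block"; style: string; children: Span[] }

function translateBlocks(
  blocks: Block[],
  translate: (text: string) => string
): Block[] {
  return blocks.map((block) => ({
    ...block,
    // Only the `text` fields are handed to the translator; structural
    // metadata (styles, marks) is copied through unchanged.
    children: block.children.map((span) => ({
      ...span,
      text: translate(span.text),
    })),
  }));
}
```

Keeping the walk purely functional (new objects instead of in-place mutation) also means the original English blocks stay intact as the source of truth for every target language.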

One man, 12+ languages

The languages (currently) supported are:

  • "en": English
  • "de": German
  • "fr": French
  • "es": Spanish
  • "eo": Esperanto
  • "el": Greek
  • "ja": Japanese
  • "ru": Russian
  • "hi": Hindi
  • "he": Hebrew
  • "tr": Turkish
  • "af": Afrikaans
  • "ar": Arabic
  • "ko": Korean

To test the different variants, simply place the language code after the base URL, for example "https://flaming.codes/fr". And that's it!
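Resolving the variant from the URL can be as small as reading the first path segment; a sketch with a made-up helper name, falling back to English for unknown codes:

```typescript
// Language codes from the list above; English is the source language.
const SUPPORTED = [
  "en", "de", "fr", "es", "eo", "el", "ja",
  "ru", "hi", "he", "tr", "af", "ar", "ko",
];

// Hypothetical helper: "/fr/some/page" -> "fr", "/unknown" -> "en".
function localeFromPath(pathname: string): string {
  const first = pathname.split("/").filter(Boolean)[0] ?? "";
  return SUPPORTED.includes(first) ? first : "en";
}
```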

Summarizing my implementation, the setup looks like this:

  • each page gets statically rebuilt on demand, but at most once every 4 hours; this means that a new translation job is done at most every 4 hours for a given page
  • the translations themselves are first loaded from Firestore; only if nothing is available there do the strings get translated and then cached in Firestore

This setup works so well that I won't use translations in the classic way, e.g. manually creating JSON files that hold the key-value pairs. I'll use the Cloud Translation API for everything that needs to be internationalized, making it fully dynamic. Thanks to these changes, the PWA has around 430 pages as of writing.

Each page gets translated from English into 13 other languages, which represent the most spoken ones as well as those located around the globe between them. Let's see how it will evolve!

- Tom
