When your PWA starts to speak

Using WaveNet to add speech synthesis for articles

What we talk about when we talk about speech

After I had set up the automatic translation of all my articles into various other languages, I started to think about which modern and useful feature I could implement next. As you will surely agree, reading my posts is one of the most delightful experiences, so I thought that an automatic speech synthesis service could enhance the article UX even more. The plan was simple: add a new action at the start of each article that lets users listen to it via the browser’s audio player.

How to: speech synthesis

After taking a look at the app’s current architecture, I implemented the following workflow to enhance the PWA with automatic speech synthesis and an audio player:

  1. implementing the synthesis via GCP’s Text-to-Speech service, using the (much) pricier WaveNet option instead of the standard one
  2. storing the created MP3 file in Firebase Storage, a simple file store also hosted on GCP; the file’s name consists of the article’s slug and the locale used, which gives me an implicit ID and avoids a separate document that keeps a reference to all the URLs for a given speech
  3. calling the two steps above on the Vercel server whenever an article page is created or updated, which currently happens at most once every 24 hours
  4. lazy loading the web player in case an audio URL is available
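
Steps 1 and 2 could be sketched roughly like this. It is a minimal sketch under assumptions, not a definitive implementation: the helper names `audioFilePath` and `buildSynthesisRequest`, the `audio/` folder, the `slug.locale.mp3` naming pattern and the voice `en-US-Wavenet-D` are made up for illustration. The request body follows the shape that the Cloud Text-to-Speech REST endpoint `text:synthesize` accepts.

```typescript
// Shape of the JSON body for the Cloud Text-to-Speech `text:synthesize` call.
interface SynthesisRequest {
  input: { text: string };
  voice: { languageCode: string; name: string };
  audioConfig: { audioEncoding: "MP3" };
}

// Hypothetical helper: derives the storage path from slug and locale, so the
// path itself acts as the ID and no extra lookup document is needed.
// The folder and the naming pattern are assumptions.
function audioFilePath(slug: string, locale: string): string {
  return `audio/${slug}.${locale}.mp3`;
}

// Hypothetical helper: builds the request body for a given voice.
function buildSynthesisRequest(
  text: string,
  languageCode: string,
  voiceName: string
): SynthesisRequest {
  return {
    input: { text },
    voice: { languageCode, name: voiceName },
    audioConfig: { audioEncoding: "MP3" },
  };
}

// Example values; the slug and the voice name are illustrative only.
const examplePath = audioFilePath("when-your-pwa-starts-to-speak", "en");
const exampleRequest = buildSynthesisRequest(
  "When your PWA starts to speak",
  "en-US",
  "en-US-Wavenet-D"
);
console.log(examplePath, JSON.stringify(exampleRequest));
```

The service answers with the audio as a base64 string in the `audioContent` field, which would then be decoded and uploaded to Firebase Storage under the derived path.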

Being lazy is important

I don’t want to hurt my web application’s loading performance (and consequently its search engine ranking), so the web player gets loaded on demand, only after two conditions are met:

  • An audio URL for the article is actually available, which currently only applies to English texts, mainly to keep costs down
  • A user clicks on the play button, indicating the desire to actually listen to the article being read aloud
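
The two conditions can be expressed as a small click handler. This is a minimal sketch, assuming a hypothetical `PlayerModule` with a `mount` function and a hypothetical factory `createPlayHandler`; neither name comes from the actual app. The loader is passed in as a function, so nothing player-related is fetched until the first click.

```typescript
// Hypothetical interface of the lazily loaded player module.
interface PlayerModule {
  mount: (audioUrl: string) => void;
}

// Returns a click handler that loads the player only if an audio URL exists
// and only once the user actually clicks play.
function createPlayHandler(
  audioUrl: string | undefined,
  loadPlayer: () => Promise<PlayerModule>
): () => Promise<boolean> {
  let pending: Promise<PlayerModule> | undefined;

  return () => {
    const url = audioUrl;
    // Condition 1: no audio URL, so the player is never loaded.
    if (!url) {
      return Promise.resolve(false);
    }
    // Condition 2: this code only runs on a click; repeated clicks
    // reuse the same pending load instead of starting a new one.
    if (!pending) {
      pending = loadPlayer();
    }
    return pending.then((player) => {
      player.mount(url);
      return true;
    });
  };
}
```

In a real app the loader would typically be a dynamic import such as `() => import("./AudioPlayer")`, where the module path is hypothetical, so that the bundler splits the player into its own chunk.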

Why not all languages (for now)?

Because I use Google’s WaveNet as the speech synthesis model, cost is a main concern for this feature: WaveNet costs four times as much as the standard synthesis model. I’ve chosen it anyway, as WaveNet greatly outperforms most other models, not only Google’s own but also those from IBM, for instance.
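
To get a feeling for what the factor of four means per language, here is a rough back-of-the-envelope sketch. The base price of 4 USD per million characters for the standard model is an assumption for illustration only, as are the names `estimateCostUsd` and `ASSUMED_STANDARD_USD_PER_MILLION`; free tiers are ignored and current pricing should be checked before relying on the numbers.

```typescript
// ASSUMPTION: illustrative base price, not taken from an official price list.
const ASSUMED_STANDARD_USD_PER_MILLION = 4;
// From the text above: WaveNet costs four times the standard model.
const WAVENET_FACTOR = 4;

// Hypothetical helper: cost in USD for synthesizing a number of characters.
function estimateCostUsd(characters: number, usdPerMillionChars: number): number {
  return (characters / 1_000_000) * usdPerMillionChars;
}

// Example: 500,000 characters of article text in a single language.
const exampleCharacters = 500_000;
const standardCost = estimateCostUsd(exampleCharacters, ASSUMED_STANDARD_USD_PER_MILLION);
const waveNetCost = estimateCostUsd(
  exampleCharacters,
  ASSUMED_STANDARD_USD_PER_MILLION * WAVENET_FACTOR
);
console.log({ standardCost, waveNetCost });
```

Every additional locale synthesizes roughly the same amount of text again, so the cost scales about linearly with the number of supported languages, which is why only English is covered for now.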

Just the beginning

This was a quick overview of how I implemented a first version of speech synthesis for this PWA. Coding took only a few hours, since most of the setup was already in place from being a GCP customer. The generated output sounds incredibly good, which shows the strengths of WaveNet and ML-based approaches to speech as well as text handling. A future implementation might add the read-aloud feature for all supported languages. Based on the usage and costs in the upcoming months, I will determine how to proceed.