I bought my Pi with the idea of turning it into a small Siri-like personal assistant capable of performing simple tasks like checking emails, fetching the weather forecast, traffic info and so on. So obviously, I need my Pi to be able to speak and listen.
I played around with Jasper, an open source platform for creating voice-controlled applications. It took me two long evenings to get it running, but the results were not satisfying. You can configure Jasper by choosing the tts (text to speech) and stt (speech to text) engines of your choice. I was able to talk with Jasper only with the combination of Pocketsphinx (stt) and eSpeak (tts), and only in English. My pronunciation must be very bad, as Jasper understood roughly every fifth word and spoke back in a very „robotic” voice. I want my Pi to read me my new emails written in Polish, and with the eSpeak engine this sounds like completely unintelligible robot babble ;) Jasper allows you to configure web-based engines like the Google Speech API, wit.ai or Ivona, but this support is either buggy or obsolete: I was unable to get it running and couldn’t find any recent threads on the internet about the problems I was facing. So I gave up on Jasper and focused on finding a good solution for text to speech first.
My first thought was: Google. But after googling for a while it turned out that Google’s tts is somehow restricted to Android only. They are now promoting their new Google Cloud Speech API, but it can only turn speech into text, which is the opposite of what I want right now ;)
I finally found what I was looking for in an Amazon Web Services offering called Amazon Polly. What is really cool: it can speak 20 languages, includes different voices and sounds very natural :) Amazon’s former text to speech engine was Ivona (now deprecated), which may sound familiar to Polish people. Ivona was a speech synthesizer developed in Poland and bought by Amazon a few years ago. Unfortunately, using Polly is not free, but you can set up an account and test Polly for one year free of charge (monthly limits apply). Outside the free tier, reading one million characters will cost you $4, and this is a lot, I mean words, not money ;) Here you can find pricing details.
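To get a feeling for what that price means in practice, here is a minimal sketch of the arithmetic. It assumes the $4 per one million characters quoted above still applies (check the pricing page), and the email workload is a made-up example, not a measurement.

```python
# Rough cost estimate for Polly outside the free tier.
# The price is the figure quoted in this post; treat it as an assumption.
PRICE_PER_MILLION_CHARS = 4.0  # US dollars


def estimate_cost(num_chars, price_per_million=PRICE_PER_MILLION_CHARS):
    """Return the approximate cost in dollars of synthesizing num_chars characters."""
    return num_chars / 1000000.0 * price_per_million


# Hypothetical workload: 20 emails a day, about 1500 characters each, for 30 days.
monthly_chars = 20 * 1500 * 30
print("%d characters cost about $%.2f" % (monthly_chars, estimate_cost(monthly_chars)))
```

So even a fairly chatty email reader should stay in the range of a few dollars a month under these assumptions.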
As soon as your account is ready, it takes just a few minutes to configure Polly and start using it in your Python script :)
Create an AWS Account
- Go to aws.amazon.com and click the „Create an AWS Account” button.
- Follow the instructions on the screen. This includes providing your phone number (yes, they will call you) and credit card details (yes, they will charge you one dollar or so).
Create a user with access to Polly
- Sign in to your account and go to console.aws.amazon.com/iam. This is a dashboard for managing users and groups and granting access to Amazon’s web services.
- Start by creating a new group. Select Groups from the left menu, click „Create New Group”, enter a group name and go to the next step.
- Attach the AmazonPollyFullAccess policy in the next step and you’re done.
- Select Users from the left menu and click „Add user”. Enter a user name and select „Programmatic access” as the access type.
- In the next step add the user to the group with Polly access.
- After clicking „Create new user” you will see the access key ID and the secret access key. Copy both values now (you won’t have access to them later) and store them in a safe place. We’ll be using them in a while.
Ok, we’re done with configuring AWS, let’s get back to the Pi and write some code.
Note: I assume here that your Pi is up and running, equipped with speakers for playing sound, connected to the internet, and has a Python interpreter installed.
Configure Pi to access AWS
We’re going to use „pip”, a command line tool for managing Python packages. Open a terminal and type:
$ sudo pip install awscli
This will install a command line interface for Amazon Web Services. Now we have to configure the package with our credentials. Type:
$ sudo aws configure
It’s time to pass in your user’s access key ID and secret access key, as well as to specify the region. Mine is „eu-central-1”; for the list of available values see the AWS documentation. You can leave „default output format” empty.
Note: I’m using sudo here, which I feel should not be necessary. I’ll explain the reason for this later on.
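Because „aws configure” stores its answers in two ini files under the home directory of whoever ran it (~/.aws/credentials and ~/.aws/config), it is worth checking where they ended up. The following is a minimal sketch for Python 3; the helper names are my own and not part of the AWS tools, and it deliberately avoids printing the secret itself.

```python
import configparser
import os


def read_profile(path, profile="default"):
    """Return the key/value pairs stored for one profile in an AWS CLI ini file."""
    # interpolation=None so that a '%' in a value is never treated specially
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)  # a missing file is silently ignored
    if not parser.has_section(profile):
        return {}
    return dict(parser.items(profile))


def describe_setup(home):
    """Summarize what is stored under <home>/.aws without revealing secrets."""
    creds = read_profile(os.path.join(home, ".aws", "credentials"))
    config = read_profile(os.path.join(home, ".aws", "config"))
    return {
        "has_access_key": "aws_access_key_id" in creds,
        "has_secret_key": "aws_secret_access_key" in creds,
        "region": config.get("region"),
    }


# Checks the home directory of the user running this script.
print(describe_setup(os.path.expanduser("~")))
```

Running it once with and once without sudo shows which user actually has the credentials configured.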
The last step involves installing boto3, a Python package to interact with AWS.
$ sudo pip install boto3
Finally to the code
In order to play sound from Python we need to install one more package:
$ sudo pip install pygame
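Before going further, a quick sanity check can confirm that the packages are visible to the interpreter you plan to use. This is a small Python 3 sketch; „missing_packages” is a helper name I made up, and the package list reflects the two third-party imports (boto3 and pygame) used by the Polly class in this post.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# boto3 talks to AWS, pygame plays the sound.
print("missing:", missing_packages(["boto3", "pygame"]))
```

Run it the same way you will run the final script (same interpreter, with or without sudo), since packages installed for one Python may not be visible to another.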
I wrote a small Python library to interact with Polly. It is actually quite simple: make an API call and consume the result. The web service returns a stream, which we can either save to a file or play directly with the pygame mixer. You have to specify the voice Polly is going to use (language dependent) and, when saving, the name of the output file. For the list of available voices see the Polly documentation.
I’m not a Python developer, so please forgive my lame code ;) I had a problem running the script without „sudo”, as pygame was unwilling to cooperate. This is why I used „sudo” before „aws configure”: the AWS credentials are stored per user, so the configuration and the script have to run as the same user. If you have any clue why pygame behaves this way, please let me know in a comment ;)
import boto3
import pygame
import os
import time
import io


class Polly():
    OUTPUT_FORMAT = 'mp3'

    def __init__(self, voiceId):
        self.polly = boto3.client('polly')  # access the Amazon web service
        self.VOICE_ID = voiceId

    def say(self, textToSpeech):  # get Polly response and play it directly
        pollyResponse = self.polly.synthesize_speech(Text=textToSpeech, OutputFormat=self.OUTPUT_FORMAT, VoiceId=self.VOICE_ID)
        pygame.mixer.init()
        pygame.init()  # this is needed for pygame.event.* and needs to be called after mixer.init(), otherwise no sound is played
        if os.name != 'nt':
            pygame.display.set_mode((1, 1))  # doesn't work on Windows, required on Linux
        with io.BytesIO() as f:  # use a memory stream
            f.write(pollyResponse['AudioStream'].read())  # read the audio stream from Polly
            f.seek(0)
            pygame.mixer.music.load(f)
            pygame.mixer.music.set_endevent(pygame.USEREVENT)
            pygame.event.set_allowed(pygame.USEREVENT)
            pygame.mixer.music.play()
            pygame.event.wait()  # play() is asynchronous; this wait forces the speaking to finish before closing
            while pygame.mixer.music.get_busy():
                time.sleep(0.1)  # sleep briefly instead of spinning the CPU

    def saveToFile(self, textToSpeech, fileName):  # get Polly response and save it to a file
        pollyResponse = self.polly.synthesize_speech(Text=textToSpeech, OutputFormat=self.OUTPUT_FORMAT, VoiceId=self.VOICE_ID)
        with open(fileName, 'wb') as f:  # the with block closes the file for us
            f.write(pollyResponse['AudioStream'].read())
Copy the Polly class above and save it as „polly.py” in the folder where your script will be located. Now you can test it like this:
Open your favorite Linux editor ;)
$ sudo nano tts.py
Copy and paste the following into the editor:
from polly import Polly

tts = Polly('Joanna')
tts.say('Hi there, I\'m very glad you\'re reading my article. Leave a comment if you find it useful.')
tts.saveToFile('Hi there, save the speech for later', 'joanna.mp3')
Save and exit. Run the script:
$ sudo python tts.py
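If you want a voice other than „Joanna”, for example a Polish one, Polly can be asked which voices it offers for a given language. Here is a minimal sketch using the describe_voices call of the boto3 Polly client; „voice_ids” is my own helper name, and for simplicity it ignores pagination of the response.

```python
def voice_ids(polly_client, language_code):
    """Return the sorted voice ids Polly reports for one language code, e.g. 'pl-PL'."""
    response = polly_client.describe_voices(LanguageCode=language_code)
    return sorted(voice["Id"] for voice in response["Voices"])


# Usage on the Pi (needs boto3 and the credentials configured earlier):
#   import boto3
#   print(voice_ids(boto3.client('polly'), 'pl-PL'))
```

Any id returned this way can be passed to the Polly class in place of 'Joanna'.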
Listen to the results
This is a sample English sentence read by Joanna:
Hi there, I’m very glad you’re reading my article. Leave a comment if you find it useful.
Polish sentence read by Ewa:
Cześć, bardzo się cieszę, że czytasz ten artykuł. Zostaw komentarz, jeśli był przydatny.
Sounds very natural, doesn’t it? I’m going to use this nice text to speech feature in my Pi email reading app :)
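One thing to keep in mind for the email reader: as far as I know, Polly limits how much text a single request may contain, so long emails need to be cut into pieces first. Below is a minimal sketch of such a splitter. The default of 3000 characters is my assumption about that limit (please verify it in the Polly documentation), and „split_text” is a hypothetical helper, not part of boto3.

```python
def split_text(text, max_chars=3000):
    """Split text into chunks of at most max_chars, breaking at whitespace where possible."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    chunks = []
    current = ""
    for word in text.split():
        # A single word longer than the limit is cut into fixed-size pieces.
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


# Usage with the Polly class from this post (email_body is a hypothetical string):
#   for chunk in split_text(email_body):
#       tts.say(chunk)
```

Note that this collapses line breaks and repeated spaces into single spaces, which is fine for speech but not for displaying the text.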