Telegram Bot Parsing Human Speech to Text

Introduction

In recent years, a great many people have become permanently connected to the internet through their mobile phones, and they are immediately reachable through instant messaging systems. Messenger, WhatsApp, and Telegram are the way to communicate in the third millennium and allow instantaneous interaction with friends and businesses.

Therefore, it is an appealing idea to offer users a communication channel and have a chatbot manning it. While we are at it, why not go the extra mile and make it more natural? We can also let people speak to the chatbot in their natural language instead of typing.

Goals

In this short article we will show how to connect to Telegram using the Telegram API, disable privacy mode so that the chatbot can follow the conversation in the groups it is invited to, filter voice messages, and transcribe them to text.

Using Natural Language Processing (NLP) to make sense of the user’s message and having an AI answer adequately are left for another article.

Pre-Requisites

This article is quite rich in content: it requires an understanding of instant messaging systems (Telegram in particular), the use of APIs, and speech recognition.

However, most of those topics are handled by APIs and wrappers that abstract the details away, which greatly reduces the complexity of the code.

Method

Getting Started

The first thing we need to do is create a new bot and get its token from Telegram. From our Telegram account we can contact the BotFather and ask for a new bot.

By default, chatbots in Telegram only respond to a reduced set of direct requests; if we add them to a group, they do not listen to the messages exchanged in the group. Since we want to get the maximum impact from our chatbot, we enable a less restrictive privacy model.

We ask the BotFather to show /mybots, then we click on the bot we just created and enter the “Bot Settings” menu. The last thing we need to do is click on Group Privacy and deactivate it.
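
To double-check that the privacy setting took effect, one option is to ask the Bot API itself. The sketch below is not part of the original program: it uses only the standard library, calls the getMe method, and reads the can_read_all_group_messages field that Telegram reports there. The helper names (get_me_url, can_read_group_messages, fetch_get_me) are our own inventions for illustration.

```python
import json
import urllib.request

API_ROOT = "https://api.telegram.org"


def get_me_url(token):
    """Build the Bot API URL for the getMe method."""
    return "{}/bot{}/getMe".format(API_ROOT, token)


def can_read_group_messages(get_me_response):
    """Return True if a decoded getMe response reports privacy mode as disabled."""
    if not get_me_response.get("ok"):
        return False
    result = get_me_response.get("result", {})
    return bool(result.get("can_read_all_group_messages", False))


def fetch_get_me(token):
    """Call getMe over HTTPS and return the decoded JSON (needs network access)."""
    with urllib.request.urlopen(get_me_url(token), timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a real token, can_read_group_messages(fetch_get_me(token)) should return True once Group Privacy has been turned off.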

The Base Module

The program is a very simple executable Python module, as we really want to focus on the key concepts.

Imports

The imports are quite trivial and we limit ourselves to a bare minimum set of modules: the Telegram API wrapper, speech recognition, and audio conversion. Plus logging, as it is always useful.

# Wrapper for the Telegram API
from telegram.ext import Updater, CommandHandler, MessageHandler, Filters
# A speech Recognition library
import speech_recognition as sr
# Audio file conversion
import ftransc.core as ft
# A bit of logging is always useful!
import logging
logging.basicConfig(format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', level=logging.INFO) 

 

Command handlers

Now that we have all the libraries we need, it is time to define the handlers for a few commands.

  • /start returns a simple sentence to signal that the bot is active and ready to use
  • /hello returns a polite greeting; it is useful for verifying that the bot is active and answers a simple command
  • /help returns some help

Those commands are not mandatory, but it is good practice to show how to respond to commands, as in a more advanced version we can use them to enhance the behaviour of the ChatBot.

# Handlers use the (bot, update) callback signature of older python-telegram-bot releases
def start(bot, update):
    update.message.reply_text("I'm a ChatBot, please talk to me!")

# We always like to be polite (and have a very simple feature to see if the ChatBot is alive)
def hello(bot, update):
    update.message.reply_text(
        'Hello {}'.format(update.message.from_user.first_name))

# A bit of help does not hurt
def help(bot, update):
    update.message.reply_text(
        'This is a chatbot, it transcribes your voice messages\n' +
        'It also responds to a few commands\n' +
        '/start restarts the ChatBot\n' +
        '/help prints this help\n' +
        '/hello triggers a hello message'
    )

The Core Function of the ChatBot

The core function of the chatbot is a fairly streamlined sequence of operations:

  1. Fetch an audio file from Telegram
  2. Adjust the format from ogg to wav
  3. Parse the voice to text
  4. Return the transcript to the user

Optionally one can also auto-adjust for ambient noise (the adjust_for_ambient_noise call in the code). In this version we comment out the noise adjustment, as it needs about 1 second of audio to calibrate before it can improve transcription accuracy, and so it ‘eats’ the beginning of each sentence.

# Module-level logger used by the handler below
logger = logging.getLogger(__name__)

# Transcribe a voice note to text
def transcribe_voice(bot, update):
    duration = update.message.voice.duration
    logger.info('transcribe_voice. Message duration: {}'.format(duration))

    # Fetch voice message
    voice = bot.getFile(update.message.voice.file_id)

    # Transcode the voice message from audio/x-opus+ogg to audio/x-wav
    # One should use a unique in-memory file, but I went for a quick solution for demo purposes
    ft.transcode(voice.download('file.ogg'), 'wav')

    # Extract voice from the audio file (sr.AudioFile replaces the older sr.WavFile alias)
    r = sr.Recognizer()
    with sr.AudioFile('file.wav') as source:
        # r.adjust_for_ambient_noise(source)  # Optional
        audio = r.record(source)

    # Convert voice to text; without a transcript there is nothing to send back
    try:
        txt = r.recognize_google(audio)
        logger.info(txt)
    except sr.UnknownValueError:
        logger.warning('Speech to Text could not understand audio')
        return
    except sr.RequestError as e:
        logger.warning('Could not request results from Speech to Text service; {0}'.format(e))
        return

    # Return the voice message in text format
    update.message.reply_text(txt)
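
The fixed file names above mean that two voice notes arriving at the same time would overwrite each other. As a minimal sketch of one way around this (not what the original program does), the helpers below generate a unique pair of paths per message and clean them up afterwards. The names unique_audio_paths and remove_quietly are hypothetical.

```python
import os
import tempfile
import uuid


def unique_audio_paths(directory=None):
    """Return an (ogg_path, wav_path) pair sharing one unique base name."""
    directory = directory or tempfile.gettempdir()
    base = os.path.join(directory, "voice-{}".format(uuid.uuid4().hex))
    return base + ".ogg", base + ".wav"


def remove_quietly(*paths):
    """Delete the given files, ignoring those that do not exist."""
    for path in paths:
        try:
            os.remove(path)
        except FileNotFoundError:
            pass
```

Inside transcribe_voice, 'file.ogg' and 'file.wav' could then be replaced by the pair returned here, assuming (as the code above implies) that the transcoder writes the .wav next to the .ogg under the same base name. Calling remove_quietly(ogg_path, wav_path) once the transcript is sent keeps the temporary directory tidy.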

The Updater

The updater is the core of the message handling of a Telegram ChatBot. It orchestrates the communication with the API, initializes the Dispatcher, carries the handlers for commands and messages, and presides over the polling cycle.

First things first: we need to initialize an Updater and provide the Telegram token, so that it can communicate with the Telegram API.

Then we need to attach all our handlers to the dispatcher, so that every time the chatbot receives a command or message, it is dispatched to the right handler.

Once we are done with the command handlers, we can add the voice message handler. Note that we apply the Filters.voice filter, so that only voice messages are transcribed; all other messages are ignored.

At the very end we start the event loop and go idle, waiting for events. Note: this way of dispatching incoming events to registered handlers is known as the reactor pattern.
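
Since the token is a secret, it is worth keeping it out of the source file. A minimal sketch, assuming the token is stored in an environment variable that we arbitrarily call TELEGRAM_TOKEN (both the variable name and the read_token helper are our own choices, not something the library prescribes):

```python
import os


def read_token(env_var="TELEGRAM_TOKEN", environ=None):
    """Return the bot token stored in env_var, or raise if it is missing."""
    environ = os.environ if environ is None else environ
    token = environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            "Set the {} environment variable to your bot token".format(env_var))
    return token
```

The value returned by read_token() can then be passed to Updater in place of a hard-coded string.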

# Instantiate Updater
updater = Updater('PUT YOUR TOKEN HERE')

# Attach command handlers to dispatcher
updater.dispatcher.add_handler(CommandHandler('hello', hello))
updater.dispatcher.add_handler(CommandHandler('help', help))
updater.dispatcher.add_handler(CommandHandler('start', start))

# Attach voice message handler to dispatcher. Note the filter, as we only want the voice messages to be transcribed
updater.dispatcher.add_handler(MessageHandler(Filters.voice, transcribe_voice))

# Start polling for events from the message queue, then block until the process is interrupted.
updater.start_polling()
updater.idle()
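
To make the dispatching idea more tangible, here is a toy dispatcher in plain Python. It is only an illustration of the pattern and not how python-telegram-bot is implemented: ToyDispatcher, is_command, and is_voice are made-up names, and events are simple dictionaries standing in for Telegram updates.

```python
class ToyDispatcher:
    """Toy illustration of handler dispatch; not python-telegram-bot's code."""

    def __init__(self):
        self._handlers = []  # list of (predicate, callback) pairs

    def add_handler(self, predicate, callback):
        self._handlers.append((predicate, callback))

    def dispatch(self, event):
        """Run the first callback whose predicate accepts the event."""
        for predicate, callback in self._handlers:
            if predicate(event):
                return callback(event)
        return None  # no handler matched, so the event is ignored


def is_command(name):
    """Build a predicate matching text events such as '/hello'."""
    return lambda event: event.get("text", "").split(" ")[0] == "/" + name


def is_voice(event):
    """Predicate matching events that carry a voice note."""
    return "voice" in event


dispatcher = ToyDispatcher()
dispatcher.add_handler(is_command("hello"), lambda event: "Hello!")
dispatcher.add_handler(
    is_voice,
    lambda event: "transcribing {} s".format(event["voice"]["duration"]))

# The "event loop": a fixed list stands in for updates polled from Telegram.
events = [{"text": "/hello"}, {"voice": {"duration": 3}}, {"text": "just chatting"}]
results = [dispatcher.dispatch(event) for event in events]
```

The command and the voice note each reach their handler, while the plain text message matches no predicate and is dropped, which mirrors the filtering behaviour described above.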

 

Discussion

This is a very short example, yet dense with advanced concepts. We show how to develop a chatbot able to connect to an instant messaging API, set up a reactor pattern, fetch and convert audio files across different formats (the conversion could be done more cleanly, as it is quite messy), and use speech recognition libraries to transcribe voice to text (it might not be obvious, but we connect to an external service to do the actual transcription).

I was thinking about adding a little bit of NLP, but I have the feeling we had better keep it for another time, as it might be way too much for a single article.

I hope you are enjoying it.

P.S. Do not forget to install libav-tools (sudo apt-get install libav-tools), as avconv is necessary for converting the voice files from .ogg to .wav
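
A missing converter only shows up when the first voice note arrives, so it can help to check for it at start-up. The following is a small sketch using the standard library; find_tool and require_tool are hypothetical helper names, and the default of "avconv" simply follows the note above.

```python
import shutil


def find_tool(name="avconv"):
    """Return the full path of a command-line tool, or None if it is not on PATH."""
    return shutil.which(name)


def require_tool(name="avconv"):
    """Raise a readable error early instead of failing on the first voice note."""
    path = find_tool(name)
    if path is None:
        raise RuntimeError(
            "{} not found on PATH; install it before starting the bot".format(name))
    return path
```

Calling require_tool() before the polling loop starts turns a confusing conversion failure into an explicit message.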

References

https://python-telegram-bot.readthedocs.io/en/stable/
https://pypi.org/project/SpeechRecognition/
https://pypi.org/project/ftransc/
https://en.wikipedia.org/wiki/Reactor_pattern

 
