Teaching your AI companion to listen to your voice

Part 3/4 | Prerequisite: a little web dev knowledge (HTML, JS, CSS) | 45-60 minutes

Welcome back to our little journey to create a walking talking Orpheus AI Companion (but of course, your own AI Companion will serve its own purpose to you).

First Let's RECAP

In the first part of the Jams, we used Teachable machine to train a model in order to recognize the keyword "Orpheus". We then used the Replit online IDE to create an empty site that would print a message in the console every time the keyword was detected to be said.

console printing response to keyword being said

In the second part of the Jams, we gave Orpheus a voice! Now instead of printing out a message in the console, our AI Companion would speak and say it out loud.

Now in the third part of the Jams, we will allow Orpheus to convert the speech that we say when we talk to it into text.

Reminder: If you're ever stuck or confused ~~or have a spelling mistake in your code that your IDE isn't catching but continues to break the whole program~~, try asking AI such as ChatGPT to figure it out for you!

Outline:

Enter The Code Editor (start coding in your browser)
Set Up the HTML Some More (visualize the conversation with AI)
Let's Get Orpheus to Speak! (text to speech recognition!)

First let's store the history of our messages in our:

let messages = []

Go in and update the addMessage function.

function addMessage({ role, content }) {
  let message = document.createElement('div')
  message.innerText = content
  document.getElementById("messages").appendChild(message);
  
  if (role === 'user') message.classList.add('user')
  else message.classList.add('system')
  document.getElementById('messages').appendChild(message)
  message.scrollIntoView(false) // Scroll to the message
}

Now we are going to add a new async function that we're calling hear. Essentially, we are:

Creating new instance of SpeechRecognition()
Starting it with recognition.start()
Adding an event listener to it so that when the user is done speaking, so we can get a transcript of what the user said
Stopping the listening with recognition.stop()
And finally, returning the transcript successfully

async function hear() {
  return new Promise((resolve, reject) => {
    let SpeechRecognition =
      window.SpeechRecognition || window.webkitSpeechRecognition
    let recognition = new SpeechRecognition()
    recognition.start()
    recognition.addEventListener('result', function (event) {
      let current = event.resultIndex
      let transcript = event.results[current][0].transcript
      recognition.stop()
      resolve(transcript)

    })
  })
}

Now we can actually put it to use! We are going to edit the recognizer.listen function again so that after Orpheus says "Hey", she will start listening to us. This is how the code will look now:

recognizer.listen(
  result => {
    const orpheusNoise = result.scores[1]
    if (orpheusNoise > THRESHOLD && !listening) {
      listening = true
      speak('Hey!').then(() => {
        hear().then(response => {
          console.log(response)
        })
      })
    }
  },
  {
    includeSpectrogram: true,
    probabilityThreshold: 0.75,
    invokeCallbackOnNoiseAndUnknown: true,
    overlapFactor: 0.5
  }
)

Here's the edit you just made, if it's not obvious enough.

before and after image

See It In Action!

Now when you refresh the page, you should see the flow come to life:

Try saying the keyword and Orpheus will say "Hey" back.

Now try saying something else to your AI Companion and watch as it console.logs what you said! Woah!

console.log of speech

And now do you see how useful promises are? We can wait for speak() to finish before we start hear()ing, and then only after the computer is done hearing do we log the message.

The last thing to do in this workshop is to get this message to the screen when the user is done. We do also want to show a loading bubble while we're speaking. Here's how the final recognizer.listen function will look like:

recognizer.listen(
  result => {
    const orpheusNoise = result.scores[1]
    if (orpheusNoise > THRESHOLD && !listening) {
      listening = true
      speak('Hey!').then(() => {
        document.getElementById('result').style.display = 'block'
        hear().then(response => {
          document.getElementById('result').style.display = 'none'

          let message = { role: 'user', content: response }
          addMessage(message)
          document.getElementById('loading').style.display = 'block'
          
        })
      })
    }
  },
  {
    includeSpectrogram: true,
    probabilityThreshold: 0.75,
    invokeCallbackOnNoiseAndUnknown: true,
    overlapFactor: 0.5
  }
)

Now the process is:

Say hi to Orpheus
Orpheus responds "Hey"
Our bubble pops up
We start talking
The loading bubble disappears and our message shows up on the screen

Yay! Now Orpheus can hear you and convert your voice into text.

final demo of speech being printed as text on screen

Additional Challenge Hacking! (recommended)

It's that time you've all been waiting for! Let's Hack our Jam! Try doing one of the following: jam hacks

Change up your character! We're doing Orpheus but you don't have to.
Our UI is also going to look like a smart phone & text messaging system. But I encourage everyone to do their own play on this AI Companion. What if your screen has a sound visualizer?

Stay Tuned To Build Your Very Own Smart Voice Assistant

We now have a model that can understand what we are saying and convert that speech into text. But a good friend can't just listen! They also need to talk back!

In the next part, we'll dive deeper into APIs and integrate the OpenAI GPT-3.5 language model for chat completions. This is the most pivotal step and will give our AI Companion the skill of having a conversation with us.

Author

sahitid

Outline

First Let's RECAP
See It In Action!
Additional Challenge Hacking! (recommended)
Stay Tuned To Build Your Very Own Smart Voice Assistant