11/19/2014

Javascript: Make your web pages speak with Web Speech API / Text-to-Speech synthesis


tldr: This is a text-to-speech tutorial. Support for text-to-speech exists in many incarnations, but there is no universal approach for programmers of different devices and operating systems. Only about half of the major desktop and mobile browsers currently support the leading Javascript solution: Web Speech API. This Javascript API handles both text-to-speech (TTS) and speech-to-text, but since it's so new (2012), the results are mixed and much tinkering and testing is required. Skip to code.

I Love Talking Computers

Who doesn't love a talking computer? As a kid, I was fascinated by the way computers had conversations with their users -- Computer from Star Trek, Hal from 2001, Colossus from The Forbin Project and Edgar from Electric Dreams. When will it happen for me? 

Sure, there have been a lot of developments: toys that talk, video games that talk, special software that does the job, and some personal computers with speech synthesis built into their OS. Even my Commodore Amiga could do that back in 1985.

But we're on the web now, and the ability to connect our web applications to our browsers in our PCs and mobile devices ought to be a lot easier than it is. Am I right?

NOTE: THE REST OF THIS POST IS A WORK IN PROGRESS.

What Speech Synthesizers Are Available to Me NOW? 

In recent years, Google Translate has provided a free, online service that has an API that can be hacked to produce audible speech via its Chrome browser. In fact, Google responds with an MP3 that you can save, if you wish to do that. Here's an example:


Okay, that's nice, but it's only about Chrome (More about Chrome TTS), and you have to jump through lots of hoops to embed the MP3 on your page. What about the other browsers?

Many browsers have plug-ins available that can read browser text that has been selected, but that's still not what we want, right?
The latest mobile operating systems do have TTS built in, but that's still not good enough, because accessing those functions requires app coding for each specific operating system.

A Unified Solution? Web Speech API, the Javascript Solution, Is Working But It Is Very New

If you're looking for a cross-browser, cross-platform, 100% compatible solution, you are not going to find it today (Nov 2014).

However, there is a Javascript API that is emerging as the standard: the Web Speech API.

According to CanIUse.com and other sources, Web Speech synthesis works in some browsers including:
  • Chrome
  • Chrome for Android
  • Safari
  • iOS Safari
  • Opera (new in late 2014)
Firefox is a special case. Web Speech is not automatically turned on, and a "flag" needs to be set manually:
  • Firefox (new in 2015) - Type "about:config" in the address bar. Then toggle these two settings to TRUE: "media.webspeech.synth.enabled" & "media.webspeech.recognition.enable"
Microsoft Edge finally includes Web Speech as of Windows 10 Anniversary Release.
That leaves other popular browsers without a shared standard solution:
  • IE (Microsoft Internet Explorer)
  • Opera Mini
  • Android Browser
All right, now that we know where Web Speech will and will not work, let's move on.

What is the Web Speech API?

The Web Speech API is actually designed to do two wonderful things: 1) provide Text-to-Speech output capabilities, and 2) the inverse, Speech-to-Text input.

The basic thing to know for this tutorial is that Javascript produces speech through an object called:
  • SpeechSynthesis

The SpeechSynthesis object requires another object as input to process the text. It is called:
  • SpeechSynthesisUtterance (string of text)

The SpeechSynthesis object has several methods:

  • .speak(exampleSpeechSynthesisUtterance)
  • .cancel()
  • .pause()
  • .resume()
  • .getVoices()


The SpeechSynthesis object has several boolean attributes:
  • .pending boolean
  • .speaking boolean
  • .paused boolean
The SpeechSynthesisUtterance object has several attributes that you can Get and Set:
  • .text string
  • .lang string
  • .voiceURI string
  • .volume float
  • .rate float
  • .pitch float

The SpeechSynthesisUtterance object also has several EventHandler methods:
  • .onstart
  • .onend
  • .onerror
  • .onpause
  • .onresume
  • .onmark
  • .onboundary
The EventHandler will be associated with an Event object which is supposed to have its own set of attributes:
  • .charIndex
  • .elapsedTime float
  • .name

Not all of the properties and methods will produce the results you're hoping for. So, a lot of trial and error will be involved. Hopefully, support for Web Speech API will grow over the next year.
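
To make the three boolean attributes above a little more concrete, here is a minimal sketch of a helper that turns them into a status message. The name describeSynthState is my own invention, not part of the API. It takes the speechSynthesis object as a parameter, so you can also try it with a plain stand-in object.

```javascript
// Hypothetical helper (not part of the Web Speech API): summarize the
// three boolean attributes of a speechSynthesis-like object.
function describeSynthState(synth) {
  // Check "paused" first, because "speaking" can stay true while paused.
  if (synth.paused) return 'paused';
  if (synth.speaking) return 'speaking';
  if (synth.pending) return 'waiting in queue';
  return 'idle';
}

// In a supporting browser you would pass the real object:
if (typeof window !== 'undefined' && 'speechSynthesis' in window) {
  console.log(describeSynthState(window.speechSynthesis));
}
```

Since browsers differ in how faithfully they update these flags, treat the result as a hint rather than a guarantee.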

How Do I Use Web Speech API With My Javascript to Synthesize Text?

The simplest example, without any testing or special settings, looks like this:

<!DOCTYPE html>
<script type="text/javascript">
   function myTest() {
     window.speechSynthesis.speak(new SpeechSynthesisUtterance('Hello World'));
   }
</script>
<a href="#" onclick="myTest();">Speak</a>

NOTE: The keyword "new" is needed to construct a SpeechSynthesisUtterance object; browsers generally throw a TypeError when a built-in constructor like this is called without it. The example at W3C does not contain it, though.

When you click on the anchor, it calls the function myTest(). The function contains a single line of code that will speak: "Hello World." This command relies on the browser to have Web Speech enabled, and a default voice and language ready for use. If you install this code on your page, and you do not hear anything when you click the anchor text, the problems could be:

  1. Your volume is turned down too low or maybe you have your speaker switched off.
  2. Your browser doesn't support the Web Speech API.
  3. Some earlier Javascript (error?) was invoked and it conflicted with the running of this script.
  4. The speechSynthesis object was already running, and it needs to be canceled first.
  5. Your browser is crashing. (Unload it completely, then reload.)
  6. Your browser just hates this example code.... Move on to the next section.
If you heard it speak, then YAY!
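
Problems 2 and 4 in the list above can be checked in code. Here is a rough sketch, assuming a wrapper function that I am calling speakSafely (a made-up name). The "env" parameter stands in for the browser window, so the same function can be exercised with a fake object outside a browser.

```javascript
// Hypothetical wrapper: check for support, clear any stuck speech, then speak.
// "env" stands in for the browser window.
function speakSafely(text, env) {
  if (!env || !('speechSynthesis' in env) || !('SpeechSynthesisUtterance' in env)) {
    return false; // problem 2: no Web Speech support
  }
  var synth = env.speechSynthesis;
  if (synth.speaking || synth.pending) {
    synth.cancel(); // problem 4: something was already running
  }
  synth.speak(new env.SpeechSynthesisUtterance(text));
  return true;
}

// In a page you would call: speakSafely('Hello World', window);
```

A return value of false tells you the browser is the problem, not your speakers.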

A Quick Code Explanation


What you have here is a standard set of Javascript tags. Inside the script tags are 3 objects.
First, there is the static String "Hello World". That is being passed to an instantiation of the SpeechSynthesisUtterance object model. An utterance, like in English, is just a thing that is to be spoken. Finally, the utterance is passed to the browser window's speechSynthesis object -- specifically, to the .speak() method.

Test to See If Your Browser Even Has Web Speech API Capabilities

Let's back up a little and run a different script. Try it in a couple of different browsers -- Chrome, IE, your mobile's web browser. This script should indicate whether the browser can run the Web Speech API at all. (Update: The late-2014 desktop version of Opera works now, too. Before that, Opera was the one browser likely to give you a false positive. It seems that the Opera browser used to offer Speech capabilities but had dropped them.)

<!DOCTYPE html>
<script>
  if  ('speechSynthesis' in window) {
    alert("Yes, your browser has the speechSynthesis object in the browser window.");
  }
  else {
    alert("No, don't waste your time trying to get this browser to speak using speechSynthesis.");
  }
</script>

Begin Customizing Our Web Speech Javascript Code


Now, we're going to change the code above. We want to separate out the objects for easier understanding and manipulation.
I'm also going to change the String from the short "Hello World" to something longer.
I will create a div for output of status messages called "myOutput".
And, I will add four controlling methods.
I will also eliminate the function, and put the speechSynthesis commands in the anchors for simplicity and clarity of the example.

<!DOCTYPE html>
<div id="myOutput"> </div>
<script type="text/javascript">
     // VARIABLE TO STORE STRING
     var myText = 'We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the common defence, promote the general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America. Amendment 1. Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and to petition the Government for a redress of grievances.';
     // SEND TEXT TO myOutput DIV
     document.getElementById('myOutput').innerHTML += myText;    
     // VARIABLE TO HOLD UTTERANCE OBJECT
     var myUtterance = new SpeechSynthesisUtterance(myText);
</script>
<p><a href="#" onclick="speechSynthesis.cancel();">Cancel</a> |
<a href="#" onclick="speechSynthesis.speak(myUtterance);">Speak</a> |
<a href="#" onclick="speechSynthesis.pause();">Pause</a> |
<a href="#" onclick="speechSynthesis.resume();">Resume</a></p>
<p>You may need to hit Cancel if you started previously.</p>

There may be a short delay after you hit the Speak link. You may also need to hit "Cancel" first and then "Speak", especially if you are reloading or re-running this script.

Now, depending on the device, operating system, browser and settings that you're using, you may have heard a male voice, a female voice or a non-English voice. The voice may have been too fast or too slow. All these variations can be altered by changing the values of the utterance attributes.

Why Does Chrome Stop Speaking after a certain number of characters? The Limitations of Chrome TTS API.

[Note: 2016 -- This limitation may no longer be an issue with modern browsers.]

If you are in the U.S., and you are using the Chrome desktop browser on a Windows machine, you may have heard Google's female voice from Google Translate. AND the text may have stopped half-way through -- at around 300 characters. Chrome seems to default, at the time of this writing, to using Google's own TTS API instead of the computer's native voice and language. You can change the language to hear different accents but the same problem will likely occur. It is an unfortunate problem.

Try this: Add this line to change the language setting, before the ending </script> tag, to hear a British accent.

   myUtterance.lang = "en-GB";
</script>

One way around this problem is to switch away from Google's default settings for Translate, and instead use the operating system's default speech synthesizer. Try this: Change the language to this setting instead:

   myUtterance.lang = this.DEST_LANG;
</script>

NOTE: The Web Speech API specification says the utterance's .text attribute may be limited to 32,767 characters, which may be another problem. The solution then would be to chop very long text into chunks, and then read the chunks successively.
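
Here is one possible way to do the chopping, offered as a sketch rather than a tested fix for Chrome. The function name chunkText and the idea of breaking after a period are my own assumptions. It prefers to cut at the end of a sentence, falls back to a space between words, and only cuts mid-word when there is no space at all.

```javascript
// Hypothetical helper: split text into chunks of at most maxLen characters,
// breaking after a sentence where possible, otherwise between words.
function chunkText(text, maxLen) {
  var chunks = [];
  var rest = text.trim();
  while (rest.length > maxLen) {
    var slice = rest.slice(0, maxLen);
    var cut = slice.lastIndexOf('. ') + 1;      // keep the period with its sentence
    if (cut <= 0) cut = slice.lastIndexOf(' '); // otherwise break between words
    if (cut <= 0) cut = maxLen;                 // no space at all: hard cut
    chunks.push(rest.slice(0, cut).trim());
    rest = rest.slice(cut).trim();
  }
  if (rest.length > 0) chunks.push(rest);
  return chunks;
}

// Example: chunkText(myText, 200) keeps every chunk under the roughly
// 300-character point where Chrome stopped speaking for me.
```

The 200 in the example is only a guess at a safe size; adjust it after testing in your own browsers.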

Add an Event Listener To the Utterance Object


Let's throw in these four handlers before the ending </script> tag, too, to demonstrate the use of the event handler attributes:

     // INVOKE WHEN END OF STRING IS REACHED OR SPEECH IS CANCELED
     myUtterance.onend = function(event) {
       document.getElementById('myOutput').innerHTML = '<br />Utterance has ended.';
       // CLEAR OUT speechSynthesis OBJECT, RESET AND PREPARE FOR NEXT COMMAND
       window.speechSynthesis.cancel();
     };

     // INVOKE WHEN UTTERANCE BEGINS
     myUtterance.onstart = function(event) {
       document.getElementById('myOutput').innerHTML = '<br />Utterance has begun.';
     };

     // INVOKE WHEN UTTERANCE IS PAUSED
     myUtterance.onpause = function(event) {
       document.getElementById('myOutput').innerHTML = '<br />Utterance has been paused.';
     };

     // INVOKE WHEN UTTERANCE IS RESUMED
     myUtterance.onresume = function(event) {
       document.getElementById('myOutput').innerHTML = '<br />Utterance has been resumed.';
     };
</script>


More Settings and User Controls


Rate / Speed: If you have an iPhone and a PC, try listening to the working code on both devices. For me, the iPhone's Siri voice is chattering so fast that it's difficult to catch up. And the PC voice sounds terribly depressed. So, as a programmer, you need to build in some allowances for your users to set the speed.

The speed is set by changing the .rate attribute of the utterance.

Slowest:
    myUtterance.rate = .1;
Fastest:
    myUtterance.rate = 10;

Again, you can hard code it, but you will have no idea what the user's experience will be -- and that could make all your effort go to waste if it's unpleasant to listen to. What you ought to do is build in some kind of control object -- a slider or a menu from which the user can choose different speeds.

Let's try adding this code before the ending </script> tag and the controls:

  function changeSpeechRate(valueSent) {
    window.speechSynthesis.cancel();
    myUtterance.rate=(valueSent)/10;
  }
</script>

<div style="margin:1em 0em">
  <label for="myRangeSlider">Speech Rate:</label><br />
  Slower <input type="range" id="myRangeSlider" value="10" min="1" max="20" onchange=" changeSpeechRate(document.getElementById('myRangeSlider').value);" />Faster
</div>

<div style="margin:1em 0em">
  <label for="myRangeSelector">Speech Rate:</label><br />
  <select id="myRangeSelector" onchange=" changeSpeechRate(document.getElementById('myRangeSelector').value);">
    <option value="2">Very Slow</option>
    <option value="5">Slower</option>
    <option value="10" selected>Normal</option>
    <option value="15">Faster</option>
    <option value="20">Very Fast</option>
  </select>
</div>

Now, what we have here is a small function that cancels any speechSynthesis that might be running. Then the utterance rate is reset to a higher or lower value. (I used 1-20 as values in this example and then divided the value sent by 10 to set the rate -- so the real value will be 0.1 to 2.0.) The function is called whenever the user changes a value in either the selector or the slider.

I wish we could pause the speech, change the rate, and then resume it at the new rate, but I haven't been able to work that out. The quirky thing is, I should be able to restart the speech, at least, but I can't. And I also tried to use the event.charIndex attribute to locate the numeric position in the text where it stopped, so maybe I could restart it again from that text position somehow. But all I got for event.charIndex was zero. Trial and error.
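
A reader's comment on this post suggests reading charIndex from the onboundary event instead. Building on that, here is a sketch of how a restart at a new rate might work. The function createResumableReader and its method names are hypothetical. The "synth" and "Utterance" parameters stand in for window.speechSynthesis and window.SpeechSynthesisUtterance, and the whole thing assumes the browser actually fires boundary events, which not every voice does.

```javascript
// Sketch: remember the last word boundary, then restart from it at a new rate.
function createResumableReader(fullText, synth, Utterance) {
  var offset = 0;    // where the current utterance starts within fullText
  var lastIndex = 0; // last boundary reported, relative to fullText

  function speakFrom(position, rate) {
    offset = position;
    lastIndex = position;
    var utterance = new Utterance(fullText.slice(position));
    utterance.rate = rate;
    utterance.onboundary = function (event) {
      // charIndex is relative to this utterance's text, so add the offset
      lastIndex = offset + event.charIndex;
    };
    synth.speak(utterance);
    return utterance;
  }

  return {
    start: function (rate) { synth.cancel(); return speakFrom(0, rate); },
    changeRate: function (rate) { synth.cancel(); return speakFrom(lastIndex, rate); },
    position: function () { return lastIndex; }
  };
}

// In a page, something like:
//   var reader = createResumableReader(myText, window.speechSynthesis, window.SpeechSynthesisUtterance);
//   reader.start(1);
//   reader.changeRate(1.5); // restarts near the last word that was spoken
```

The restart begins at the start of the last reported word, so the listener may hear that word twice.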

Pitch: This utterance setting controls whether the voice will be high and squeaky or low and gravelly:

Low:
  myUtterance.pitch = .1;

High:
  myUtterance.pitch = 2;

Volume: This utterance attribute controls how loud the speech will be.

Quiet:
  myUtterance.volume = .1;

Loudest:
  myUtterance.volume = 1;
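
If the rate, pitch and volume come from user controls, it may help to keep them inside sensible limits before applying them. This is a minimal sketch, not a required step. The helper names clamp and applyVoiceSettings are made up, and the ranges (rate 0.1 to 10, pitch 0 to 2, volume 0 to 1) are the ones I understand the specification to allow.

```javascript
// Hypothetical helper: keep a number inside [min, max], or use a fallback
// when the value is not a number at all.
function clamp(value, min, max, fallback) {
  var n = Number(value);
  if (isNaN(n)) return fallback;
  return Math.min(max, Math.max(min, n));
}

// Apply user-chosen settings to an utterance-like object.
// Assumed ranges: rate 0.1 to 10, pitch 0 to 2, volume 0 to 1.
function applyVoiceSettings(utterance, settings) {
  utterance.rate = clamp(settings.rate, 0.1, 10, 1);
  utterance.pitch = clamp(settings.pitch, 0, 2, 1);
  utterance.volume = clamp(settings.volume, 0, 1, 1);
  return utterance;
}

// In a page: applyVoiceSettings(myUtterance, { rate: 1.2, pitch: 0.8, volume: 1 });
```

Any setting that is left out falls back to 1, which is the normal value for all three attributes.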


Future


Since support is not great at this time, we can only hope for the browser and OS companies to get a clue and fix everything.

My point for exploring this is to allow a user to hit a button to read the valuable parts of the news articles -- to skip menus and sidebars, headers, footers, advertisements, etc. In my WordPress theme, for example, I would combine the HTML contents of the <div id="title-main"> and the <div id="content-area">. Then run replace commands to filter out extraneous styles, scripts, tags, adverts, embeds, shortcodes, etc.
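
As a rough idea of what those replace commands could look like, here is a sketch using regular expressions. The function name htmlToSpeakableText is hypothetical, and regex stripping is a blunt instrument, not a real HTML parser. In particular, the shortcode rule removes anything in square brackets, including legitimate bracketed text.

```javascript
// Hypothetical helper: reduce an HTML fragment to plain text for speech.
function htmlToSpeakableText(html) {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, ' ') // drop scripts and styles with their contents
    .replace(/\[[^\]]*\]/g, ' ')                              // drop [shortcode]-style markers
    .replace(/<[^>]+>/g, ' ')                                 // drop the remaining tags
    .replace(/&nbsp;/g, ' ')                                  // turn non-breaking spaces into spaces
    .replace(/\s+/g, ' ')                                     // collapse runs of whitespace
    .trim();
}

// In a page, the input might be built like this:
//   var html = document.getElementById('title-main').innerHTML + ' ' +
//              document.getElementById('content-area').innerHTML;
//   var myText = htmlToSpeakableText(html);
```

Other HTML entities (such as &amp;amp;) are left alone here and would need their own handling.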

I want to develop a way to chop up the text so that: 1) You can add a slightly longer delay after headlines and paragraphs; 2) You can use different languages and voices without running into Chrome's 300-character limitation, the specification's 32,767-character limit, or whatever other problem there may be with longer text. The chunks of text could then be read aloud in successive order through a loop.
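
The successive reading could be sketched like this, assuming the chunks already exist as an array of strings. The function name speakChunks and the pauseMs delay are hypothetical. "synth" and "Utterance" stand in for window.speechSynthesis and window.SpeechSynthesisUtterance. Each chunk waits for the previous one's onend event, then a timer adds the pause.

```javascript
// Sketch: speak an array of chunks one after another, pausing between them.
function speakChunks(chunks, synth, Utterance, pauseMs, onDone) {
  var index = 0;
  function next() {
    if (index >= chunks.length) {
      if (onDone) onDone(); // everything has been read
      return;
    }
    var utterance = new Utterance(chunks[index]);
    index += 1;
    utterance.onend = function () {
      setTimeout(next, pauseMs); // wait a moment, then read the next chunk
    };
    synth.speak(utterance);
  }
  next();
}

// In a page, something like:
//   speakChunks(['A headline.', 'The first paragraph.'],
//               window.speechSynthesis, window.SpeechSynthesisUtterance, 500);
```

A longer pause after headlines would need a per-chunk delay instead of the single pauseMs value used here.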

The ability to control speed and volume should definitely be more dynamic. The ability to trace your location in the spoken text is also a must-have. I don't know why it doesn't work.

A bug in getVoices() needs to be fixed. It is possible to get the voice list in Chrome, but there's an odd delay in the request, and it requires some finagling with setInterval coding.
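
For the curious, the setInterval finagling might look something like this sketch. The function name whenVoicesReady is made up, and the polling interval and number of attempts are arbitrary assumptions. "synth" stands in for window.speechSynthesis. The idea is simply to keep asking until the list is no longer empty, then give up after a while.

```javascript
// Sketch: deliver the voice list once it is non-empty, polling if needed.
function whenVoicesReady(synth, callback, intervalMs, maxTries) {
  var voices = synth.getVoices();
  if (voices.length > 0) {
    callback(voices); // the list was ready right away
    return;
  }
  var tries = 0;
  var timer = setInterval(function () {
    voices = synth.getVoices();
    tries += 1;
    if (voices.length > 0 || tries >= maxTries) {
      clearInterval(timer);
      callback(voices); // may still be empty if the browser never reported any
    }
  }, intervalMs);
}

// In a page, something like:
//   whenVoicesReady(window.speechSynthesis, function (voices) {
//     console.log(voices.length + ' voices found');
//   }, 250, 20);
```

The specification also describes a voiceschanged event on the speechSynthesis object, which may be a cleaner route in browsers that fire it.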

There should be ways to mark up the input to change pauses, voices, speed, etc. I don't know if SSML is supported at all. Probably not.

I have lots more ideas and may add to this list as I learn more.

THANKS FOR READING! And please do donate to the development of this page by doing any of the support suggestions in the sidebar.

3 comments :

  1. Hi! I know this is close to one year old, but I was having the same problem with charIndex, only getting zeros with onresume, and found a solution. You can use the onboundary event to give the charIndex whenever it fires. It's not ideal, but does the job.

    I figure that maybe the onresume event resets the char pointer, and that's why it would always give zero, making a logic incoherence, not a bug per se. But that's just speculation.

    It works on Safari and Chrome, which is enough for me since I'm developing with Ionic framework.

    Anyway, this is where I found the solution: http://stackoverflow.com/questions/29213548/is-it-possible-to-highlight-strong-each-word-spoken-in-a-sentence-with-web-spe

    Best!

  2. Hey, Felipe! Hah, You're so right. I just figured that out myself today, too! Wish I'd seen this post a few weeks ago. It's the perfect solution for building a script that starts and stops at (or near) where the speech was halted --
    speechObj.onboundary = function (event) {
    stoppingPoint = event.charIndex;
    // DO SOMETHING SPECIAL WITH THE stoppingPoint
    }

  3. To contend in this manner, the speech producer needs to obviously and thoroughly raise each purpose of the issue and state realities about it. Furthermore, this announcement of certainties is the "why" of the legitimacy or not of your contention.female voice generator
