<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
<title>CSS Speech Module</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type" >
<link href="../default.css" rel="stylesheet" type="text/css" >
<style type="text/css">
p
{
padding-bottom : 1em;
}
p + p
{
text-indent : 0;
}
*:target
{
border : 1px dashed #66CC66;
}</style>
<!--
.prod
{
font-family : inherit;
font-size : inherit
}
pre.prod
{
white-space : pre-wrap;
margin : 1em 0 1em 2em
}
code
{
font-size : inherit;
}
#box-shadow-samples td
{
background : white;
color : black;
}
caption
{
text-align : left;
font-weight : bold
}
.note
{
font-style : italic
}
.issue
{
color : maroon;
font-style : italic
}
div.example pre
{
color : green;
margin-left : 2em
}
dl
{
margin-left : 2em
}
caption dfn
{
font-size : 120%
}
-->
<link href="http://www.w3.org/StyleSheets/TR/W3C-ED" rel="stylesheet" type="text/css" >
</head>
<body>
<div class="head">
<!--logo-->
<h1 id="top">CSS Speech Module</h1>
<h2 class="no-num no-toc">[LONGSTATUS] [DATE]</h2>
<dl id="versions">
<dt>This version:</dt>
<dd>
<!--<a href="http://www.w3.org/TR/[YEAR]/WD-[SHORTNAME]-[CDATE]/">[VERSION]</a>-->
<a href="http://dev.w3.org/csswg/css3-speech">http://dev.w3.org/csswg/css3-speech</a>
</dd>
<dt>Latest version:</dt>
<dd>
<a href="http://www.w3.org/TR/css3-speech/">[LATEST]</a>
</dd>
<dt>Previous versions:</dt>
<dd>
<a href="http://www.w3.org/TR/2011/WD-css3-speech-20110818/"
>http://www.w3.org/TR/2011/WD-css3-speech-20110818/</a>
</dd>
</dl>
<dl id="editors-list">
<dt>Editor:</dt>
<dd><a href="mailto:dweck@daisy.org">Daniel Weck</a> (<a href="http://www.daisy.org">DAISY
Consortium</a>)</dd>
<dt>Former editors:</dt>
<dd><a href="mailto:dsr@w3.org">Dave Raggett</a> (<a href="http://www.w3.org/">W3C</a>/<a
href="http://www.canon.com/">Canon</a>)</dd>
<dd><a href="mailto:daniel@glazman.org">Daniel Glazman</a> (<a
href="http://www.disruptive-innovations.com/">Disruptive Innovations</a>)</dd>
<dd><a href="mailto:csant@opera.com">Claudio Santambrogio</a> (<a
href="http://www.opera.com/">Opera Software</a>)</dd>
</dl>
<!--copyright-->
<hr title="Separator for header">
</div>
<h2 class="no-num no-toc" id="abstract">Abstract</h2>
<p>CSS (Cascading Style Sheets) is a language that describes the rendering of markup documents
(e.g. HTML, XML) on various supports, such as screen, paper, speech, etc. The Speech module
defines aural CSS properties that enable authors to declaratively control the rendering of
documents via speech synthesis, and using optional audio cues. Note that this standard was
developed in cooperation with the <a href="http://www.w3.org/Voice/">Voice Browser
Activity</a>.</p>
<h2 class="no-num no-toc" id="status">Status of this document</h2>
<!--status-->
<!-- <h3 class="no-num no-toc" id="maturity">Maturity Level</h3> -->
<p> This document is based on the <a href="http://www.w3.org/TR/2011/WD-css3-speech-20110818/"
>Last Call Working Draft (18 August 2011)</a> and includes changes that reflect the outcome
of the <a href="http://wiki.csswg.org/spec/css3-speech">disposition of comments</a>. </p>
<p>Before the specification can progress
to <a href="http://www.w3.org/2005/10/Process-20051014/tr#cfr"
>Proposed Recommendation,</a> the <a href="#exit">CR exit
criteria</a> must be met. The specification will not become
Proposed Recommendation before 20 September 2012. A test suite
and an implementation report will be made during the Candidate
Recommendation period. </p>
<p id="at-risk">The following features are at-risk and may be dropped at the end of the
Candidate Recommendation period if there has not been enough interest from implementers:
'voice-balance', 'voice-duration', 'voice-pitch', 'voice-range', and 'voice-stress'. </p>
<h2 class="no-num no-toc" id="contents">Table of contents</h2>
<!--toc-->
<h2 id="intro">Introduction, design goals</h2>
<p class="note">Note that this section is informative.</p>
<p>The aural presentation of information is commonly used by people who are blind,
visually-impaired or otherwise print-disabled. For instance, "screen readers" allow users to
interact with visual interfaces that would otherwise be inaccessible to them. There are also
circumstances in which <em>listening</em> to content (as opposed to <em>reading</em>) is
preferred, or sometimes even required, irrespective of a person's physical ability to access
information. For instance: playing an e-book whilst driving a vehicle, learning how to
manipulate industrial and medical devices, interacting with home entertainment systems,
teaching young children how to read.</p>
<p> The CSS properties defined in the Speech module enable authors to declaratively control the
presentation of a document in the aural dimension. The aural rendering of a document combines
speech synthesis (also known as "TTS", the acronym for "Text to Speech") and auditory icons
(which are referred to as "audio cues" in this specification). The CSS Speech properties
provide the ability to control speech pitch and rate, sound levels, TTS voices, etc. These
stylesheet properties can be used together with visual properties (mixed media), or as a
complete aural alternative to a visual presentation. </p>
<h2 id="background">Background information, CSS 2.1</h2>
<p class="note">Note that this section is informative.</p>
<p> The CSS Speech module is a re-work of the informative CSS2.1 Aural appendix, within which
the "aural" media type was described, but also deprecated (in favor of the "speech" media
type). Although the [[!CSS21]] specification reserves the "speech" media type, it doesn't
actually define the corresponding properties. The Speech module describes the CSS properties
that apply to the "speech" media type, and defines a new "box" model specifically for the
aural dimension. </p>
<p> Content creators can conditionally include CSS properties dedicated to user agents with text
to speech synthesis capabilities, by specifying the "speech" media type via the
<code>media</code> attribute of the <code>link</code> element, or with the
<code>@media</code> at-rule, or within an <code>@import</code> statement. When styles are
authored within the scope of such conditional statements, they are ignored by user agents that
do not support the Speech module. </p>
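<div class="example">
<p>As a minimal sketch, the three mechanisms described above could be written as follows. The
stylesheet file name <code>speech.css</code> and the declaration inside the
<code>@media</code> block are hypothetical, chosen only for illustration.</p>
<pre>
&lt;link rel="stylesheet" type="text/css" media="speech" href="speech.css"&gt;
...
@import url(speech.css) speech;
...
@media speech
{
h1 { voice-volume: soft; } /* only applied by speech-capable user agents */
}</pre>
</div>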
<h2 id="ssml-rel">Relationship with SSML</h2>
<p class="note">Note that this section is informative.</p>
<p>Some of the features in this specification are conceptually similar to functionality
described in the Speech Synthesis Markup Language (SSML) Version 1.1 [[!SSML]]. However, the
specificities of the CSS model mean that compatibility with SSML in terms of syntax and/or
semantics is only partially achievable. The definition of each property in the Speech module
includes informative statements, wherever necessary, to clarify its relationship with
similar functionality from SSML.</p>
<h2 id="css-values">CSS values</h2>
<p>This specification follows the <a href="http://www.w3.org/TR/CSS21/about.html#property-defs"
>CSS property definition conventions</a> from [[!CSS21]]. Value types not defined in this
specification are defined in CSS Value and Units Level 3 [[!CSS3VAL]]. </p>
<p>In addition to the property-specific values listed in their definitions, all properties
defined in this specification also accept the <a
href="http://www.w3.org/TR/CSS21/cascade.html#value-def-inherit">inherit</a> keyword as
their property value. For readability it has not been repeated explicitly. </p>
<h2 id="example">Example</h2>
<div class="example">
<p>This example shows how authors can tell the speech synthesizer to speak HTML headings with
a voice called "paul", using "moderate" emphasis (which is more than normal) and how to
insert an audio cue (pre-recorded audio clip located at the given URL) before the start of
TTS rendering for each heading. In a stereo-capable sound system, paragraphs marked with the
CSS class "heidi" are rendered on the left audio channel (and with a female voice, etc.),
whilst the class "peter" corresponds to the right channel (and to a male voice, etc.). The
volume level of text spans marked with the class "special" is lower than normal, and a
prosodic boundary is created by introducing a strong pause after it is spoken (note how the
<code>span</code> inherits the voice-family from its parent paragraph).</p>
<pre>
h1, h2, h3, h4, h5, h6
{
voice-family: paul;
voice-stress: moderate;
cue-before: url(../audio/ping.wav);
voice-volume: medium 6dB;
}
p.heidi
{
voice-family: female;
voice-balance: left;
voice-pitch: high;
voice-volume: -6dB;
}
p.peter
{
voice-family: male;
voice-balance: right;
voice-rate: fast;
}
span.special
{
voice-volume: soft;
pause-after: strong;
}
...
&lt;h1&gt;I am Paul, and I speak headings.&lt;/h1&gt;
&lt;p class="heidi"&gt;Hello, I am Heidi.&lt;/p&gt;
&lt;p class="peter"&gt;
&lt;span class="special"&gt;Can you hear me ?&lt;/span&gt;
I am Peter.
&lt;/p&gt;</pre>
</div>
<h2 id="aural-model">The aural formatting model</h2>
<p>The CSS formatting model for aural media is based on a sequence of sounds and silences that
occur within a nested context similar to the <a href="#box-model-def">visual box model</a>,
which we name the <dfn id="aural-box-model">aural "box" model</dfn>. The aural "canvas"
consists of a two-channel (stereo) space and of a temporal dimension, within which synthetic
speech and audio cues coexist. The selected element is surrounded by 'rest', 'cue' and 'pause'
properties (from the innermost to the outermost position). These can be seen as aural
equivalents to 'padding', 'border' and 'margin', respectively. When used, the ':before' and
':after' pseudo-elements [[!CSS21]] get inserted between the element's contents and the
'rest'. </p>
<p> The following diagram illustrates the equivalence between properties of the visual and aural
box models, applied to the selected &lt;element&gt;:</p>
<p>
<img
title="The aural 'box' model, illustrated by a diagram: the selected element is positioned in the center, on its left side are (from innermost to outermost) rest-before, cue-before, pause-before, on its right side are (from innermost to outermost) rest-after, cue-after, pause-after, where rest is conceptually similar to padding, cue is similar to border, pause is similar to margin."
alt="The aural 'box' model, illustrated by a diagram: the selected element is positioned in the center, on its left side are (from innermost to outermost) rest-before, cue-before, pause-before, on its right side are (from innermost to outermost) rest-after, cue-after, pause-after, where rest is conceptually similar to padding, cue is similar to border, pause is similar to margin."
id="aural-box" src="aural-box.png" />
</p>
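<div class="example">
<p>As an illustrative sketch of the ordering described above (the durations are hypothetical,
and any collapsing with the pauses of neighbouring elements is ignored), the following
declarations surround a heading with a pause, a cue and a rest on each side:</p>
<pre>
h2
{
pause-before: 1s; /* outermost, comparable to 'margin' */
cue-before: url(../audio/ping.wav); /* comparable to 'border' */
rest-before: 250ms; /* innermost, comparable to 'padding' */
rest-after: 250ms;
cue-after: url(../audio/ping.wav);
pause-after: 1s;
}
/* Resulting sequence in time:
1s silence, audio cue, 250ms silence, heading text,
250ms silence, audio cue, 1s silence */</pre>
</div>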
<h2 id="mixing-props">Mixing properties</h2>
<h3 id="mixing-props-voice-volume">The 'voice-volume' property</h3>
<table class="propdef" summary="name: syntax">
<tbody>
<tr>
<td>Name:</td>
<td>
<dfn id="voice-volume">voice-volume</dfn>
</td>
</tr>
<tr>
<td>
<em>Value:</em>
</td>
<td>silent | [[x-soft | soft | medium | loud | x-loud] || &lt;decibel&gt;] </td>
</tr>
<tr>
<td>
<em>Initial:</em>
</td>
<td>medium</td>
</tr>
<tr>
<td>
<em>Applies&nbsp;to:</em>
</td>
<td>all elements</td>
</tr>
<tr>
<td>
<em>Inherited:</em>
</td>
<td>yes</td>
</tr>
<tr>
<td>
<em>Percentages:</em>
</td>
<td>N/A</td>
</tr>
<tr>
<td>
<em>Media:</em>
</td>
<td>speech</td>
</tr>
<tr>
<td>
<em>Computed value:</em>
</td>
<td>'silent', or a keyword value and optionally also a decibel offset (if not zero)</td>
</tr>
</tbody>
</table>
<p>The 'voice-volume' property allows authors to control the amplitude of the audio waveform
generated by the speech synthesizer, and is also used to adjust the relative volume level of
<a href="#cue-props">audio cues</a> within the <a href="#aural-model">aural box model</a> of
the selected element. </p>
<p class="note"> Note that although the functionality provided by this property is similar to
the <a href="http://www.w3.org/TR/speech-synthesis11/#edef_prosody"><code>volume</code>
attribute of the <code>prosody</code> element</a> from the SSML markup language [[!SSML]],
there are notable discrepancies. For example, CSS Speech volume keywords and decibel units
are not mutually-exclusive, due to how values are inherited and combined for selected
elements. </p>
<dl>
<!-- dt>
<strong>normal</strong>
</dt>
<dd>
<p> Corresponds to +0.0dB, which means that there is no modification of volume level. This
value overrides the inherited value.</p>
</dd -->
<dt>
<strong>silent</strong>
</dt>
<dd>
<p> Specifies that no sound is generated (the text is read "silently").</p>
<p class="note"> Note that this has the same effect as using negative infinity decibels.
Also note that there is a difference between an element whose 'voice-volume' property has
a value of 'silent', and an element whose 'speak' property has the value 'none'. With the
former, the selected element takes up the same time as if it was spoken, including any
pause before and after the element, but no sound is generated (descendants within the <a
href="#aural-model">aural box model</a> of the selected element can override the
'voice-volume' value, and may therefore generate audio output). With the latter, the
selected element is not rendered in the aural dimension and no time is allocated for
playback (descendants within the <a href="#aural-model">aural box model</a> of the
selected element can override the 'speak' value, and may therefore generate audio output).
</p>
</dd>
<dt><strong>x-soft</strong>, <strong>soft</strong>, <strong>medium</strong>,
<strong>loud</strong>, <strong>x-loud</strong></dt>
<dd>
<p>This sequence of keywords corresponds to monotonically non-decreasing volume levels,
mapped to implementation-dependent values that meet the listener's requirements with
regards to perceived loudness. These audio levels are typically provided via a preference
mechanism that allows users to calibrate sound options according to their auditory
environment. The keyword 'x-soft' maps to the user's <em>minimum audible</em> volume
level, 'x-loud' maps to the user's <em>maximum tolerable</em> volume level, 'medium' maps
to the user's <em>preferred</em> volume level, 'soft' and 'loud' map to intermediary
values.</p>
</dd>
<dt>
<strong>&lt;decibel&gt;</strong>
</dt>
<dd>
<p>A <a href="#number-def">number</a> immediately followed by "dB" (decibel unit). This
represents a change (positive or negative) relative to the given keyword value (see
enumeration above), or to the default value for the root element, or otherwise to the
inherited volume level (which may itself be a combination of a keyword value and of a
decibel offset, in which case the decibel values are combined additively). When the
inherited volume level is 'silent', this 'voice-volume' resolves to 'silent' too,
regardless of the specified &lt;decibel&gt; value. Decibels represent the ratio of the
squares of the new signal amplitude (a1) and the current amplitude (a0), as per the
following logarithmic equation: volume(dB) = 20 log10 (a1 / a0) </p>
<p class="note"> Note that -6.0dB is approximately half the amplitude of the audio signal,
and +6.0dB is approximately twice the amplitude.</p>
</dd>
</dl>
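<div class="example">
<p>The following sketch shows how keyword values and decibel offsets could combine through
inheritance, and what typical offsets mean in terms of signal amplitude according to the
logarithmic equation given for &lt;decibel&gt;. The class names are hypothetical.</p>
<pre>
div.aside { voice-volume: soft 3dB; } /* 'soft', raised by 3dB */
div.aside span { voice-volume: -6dB; } /* relative to the inherited level:
soft + 3dB - 6dB = soft - 3dB */
div.muted { voice-volume: silent; }
div.muted span { voice-volume: 6dB; } /* inherited level is 'silent':
resolves to 'silent' */

/* Amplitude ratio for a given offset: a1 / a0 = 10^(dB / 20)
-6dB : 10^(-0.3) = 0.501 (approximately half the amplitude)
+6dB : 10^(0.3) = 1.995 (approximately twice the amplitude) */</pre>
</div>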
<p class="note">Note that perceived loudness depends on various factors, such as the listening
environment, user preferences or physical abilities. The effective volume variation between
'x-soft' and 'x-loud' represents the dynamic range (in terms of loudness) of the audio output.
Typically, this range would be compressed in a noisy context, i.e. the perceived loudness
corresponding to 'x-soft' would effectively be closer to 'x-loud' than it would be in a quiet
environment. There may also be situations where both 'x-soft' and 'x-loud' would map to low
volume levels, such as in listening environments requiring discretion (e.g. library,
night-reading). </p>
<h3 id="mixing-props-voice-balance">The 'voice-balance' property</h3>
<table class="propdef" summary="name: syntax">
<tbody>
<tr>
<td>Name:</td>
<td>
<dfn id="voice-balance">voice-balance</dfn>
</td>
</tr>
<tr>
<td>
<em>Value:</em>
</td>
<td>&lt;number&gt; | left | center | right | leftwards | rightwards </td>
</tr>
<tr>
<td>
<em>Initial:</em>
</td>
<td>center</td>
</tr>
<tr>
<td>
<em>Applies&nbsp;to:</em>
</td>
<td>all elements</td>
</tr>
<tr>
<td>
<em>Inherited:</em>
</td>
<td>yes</td>
</tr>
<tr>
<td>
<em>Percentages:</em>
</td>
<td>N/A</td>
</tr>
<tr>
<td>
<em>Media:</em>
</td>
<td>speech</td>
</tr>
<tr>
<td>
<em>Computed value:</em>
</td>
<td>the specified value resolved to a &lt;number&gt; between '-100' and '100'
(inclusive)</td>
</tr>
</tbody>
</table>
<p> The 'voice-balance' property controls the spatial distribution of audio output across a
lateral sound stage: one extremity is on the left, the other extremity is on the right hand
side, relative to the listener's position. Authors can specify intermediary steps between left
and right extremities, to represent the audio separation along the resulting left-right axis. </p>
<p class="note"> Note that the functionality provided by this property has no match in the SSML
markup language [[!SSML]]. </p>
<dl>
<dt>
<strong>&lt;number&gt;</strong>
</dt>
<dd>
<p>A <a href="#number-def">number</a> between '-100' and '100' (inclusive). Values smaller
than '-100' are clamped to '-100'. Values greater than '100' are clamped to '100'. The
value '-100' represents the left side, and the value '100' represents the right side. The
value '0' represents the center point whereby there is no discernible audio separation
between left and right sides (in a stereo sound system, this corresponds to equal
distribution of audio signals between left and right speakers). </p>
</dd>
<dt>
<strong>left</strong>
</dt>
<dd>
<p>Same as '-100'.</p>
</dd>
<dt>
<strong>center</strong>
</dt>
<dd>
<p>Same as '0'.</p>
</dd>
<dt>
<strong>right</strong>
</dt>
<dd>
<p>Same as '100'.</p>
</dd>
<dt>
<strong>leftwards</strong>
</dt>
<dd>
<p>Moves the sound to the left, by subtracting 20 from the inherited 'voice-balance' value,
and by clamping the resulting number to '-100'.</p>
</dd>
<dt>
<strong>rightwards</strong>
</dt>
<dd>
<p>Moves the sound to the right, by adding 20 to the inherited 'voice-balance' value, and by
clamping the resulting number to '100'.</p>
</dd>
</dl>
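<div class="example">
<p>The following sketch illustrates the clamping rules and the relative keywords defined above.
The class names are hypothetical.</p>
<pre>
p.narrator { voice-balance: -90; }
p.narrator span { voice-balance: leftwards; } /* -90 - 20 = -110, clamped to -100 */
p.guest { voice-balance: 150; } /* clamped to 100, same as 'right' */
p.guest em { voice-balance: leftwards; } /* 100 - 20 = 80 */</pre>
</div>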
<p> User agents may be connected to different kinds of sound systems, featuring varying audio
mixing capabilities. The expected behavior for mono, stereo, and surround sound systems is
defined as follows: </p>
<ul>
<li> When user agents produce audio via a mono-aural sound system (i.e. single-speaker setup),
the 'voice-balance' property has no effect. </li>
<li> When user agents produce audio through a stereo sound system (e.g. two speakers, a pair
of headphones), the left-right distribution of audio signals can precisely match the
authored values for the 'voice-balance' property. </li>
<li> When user agents are capable of mixing audio signals through more than 2 channels (e.g.
5-speaker surround sound system, including a dedicated center channel), the physical
distribution of audio signals resulting from the application of the 'voice-balance' property
should be performed so that the listener perceives sound as if it were coming from a basic
stereo layout. For example, the center channel as well as the left/right speakers may be
used together in order to emulate the behavior of the 'center' value. </li>
</ul>
<p> Future revisions of the CSS Speech module may include support for three-dimensional audio,
which would effectively enable authors to specify "azimuth" and "elevation" values. In the
future, content authored using the current specification may therefore be consumed by user
agents which are compliant with the version of CSS Speech that supports three-dimensional
audio. In order to prepare for this possibility, the values enabled by the current
'voice-balance' property are designed to remain compatible with "azimuth" angles. More
precisely, the mapping between the current left-right audio axis (lateral sound stage) and the
envisioned 360 degrees plane around the listener's position is defined as follows: </p>
<ul>
<li>The value '0' maps to zero degrees ('center'). This is in "front" of the listener, not
from "behind".</li>
<li>The value '-100' maps to -40 degrees ('left'). Negative angles are in the
counter-clockwise direction (the audio stage is seen from the top).</li>
<li>The value '100' maps to 40 degrees ('right'). Positive angles are in the clockwise
direction (the audio stage is seen from the top).</li>
<li>Intermediary values on the scale from '-100' to '100' map to the angles between -40 and 40
degrees in a numerically linearly-proportional manner. For example, '-50' maps to -20
degrees.</li>
</ul>
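<div class="example">
<p>Since the mapping above is linear, it can be summarized as: angle (in degrees) =
'voice-balance' value * 40 / 100. The following sketch works through a few values (the class
names are hypothetical):</p>
<pre>
.far-left { voice-balance: -100; } /* -100 * 0.4 = -40 degrees */
.mid-left { voice-balance: -50; } /* -50 * 0.4 = -20 degrees */
.front { voice-balance: 0; } /* 0 degrees, in front of the listener */
.near-right { voice-balance: 25; } /* 25 * 0.4 = 10 degrees */
.far-right { voice-balance: 100; } /* 100 * 0.4 = 40 degrees */</pre>
</div>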
<p class="note"> Note that sound systems may be configured by users in such a way that it would
interfere with the left-right audio distribution specified by document authors. Typically, the
various "surround" modes available in modern sound systems (including systems based on basic
stereo speakers) tend to greatly alter the perceived spatial arrangement of audio signals. The
illusion of a three-dimensional sound stage is often achieved using a combination of phase
shifting, digital delay, volume control (channel mixing), and other techniques. Some users may
even configure their system to "downgrade" any rendered sound to a single mono channel, in
which case the effect of the 'voice-balance' property would obviously not be perceivable at
all. The rendering fidelity of authored content is therefore dependent on such user
customizations, and the 'voice-balance' property merely specifies the desired end-result. </p>
<p class="note"> Note that many speech synthesizers only generate mono sound, and therefore do
not intrinsically support the 'voice-balance' property. The sound distribution along the
left-right axis consequently occurs at post-synthesis stage (when the speech-enabled user
agent mixes the various audio sources authored within the document). </p>
<h2 id="speaking-props">Speaking properties</h2>
<h3 id="speaking-props-speak">The 'speak' property</h3>
<table class="propdef" summary="name: syntax">
<tbody>
<tr>
<td>Name:</td>
<td>
<dfn id="speak">speak</dfn>
</td>
</tr>
<tr>
<td>
<em>Value:</em>
</td>
<td>auto | none | normal </td>
</tr>
<tr>
<td>
<em>Initial:</em>
</td>
<td>auto</td>
</tr>
<tr>
<td>
<em>Applies&nbsp;to:</em>
</td>
<td>all elements</td>
</tr>
<tr>
<td>
<em>Inherited:</em>
</td>
<td>yes</td>
</tr>
<tr>
<td>
<em>Percentages:</em>
</td>
<td>N/A</td>
</tr>
<tr>
<td>
<em>Media:</em>
</td>
<td>speech</td>
</tr>
<tr>
<td>
<em>Computed value:</em>
</td>
<td>specified value</td>
</tr>
</tbody>
</table>
<p>The 'speak' property determines whether or not to render text aurally.</p>
<p class="note"> Note that the functionality provided by this property has no match in the SSML
markup language [[!SSML]]. </p>
<dl>
<dt>
<strong>auto</strong>
</dt>
<dd>
<p>Resolves to a computed value of 'none' when <a href="#display-def">'display'</a> is
'none', otherwise resolves to a computed value of 'auto' which yields a used value of
'normal'. </p>
<p class="note"> Note that the 'none' value of the <a href="#display-def">'display'</a>
property cannot be overridden by descendants of the selected element, but the 'auto' value
of 'speak' can however be overridden using either of 'none' or 'normal'. </p>
</dd>
<dt>
<strong>none</strong>
</dt>
<dd>
<p> This value causes an element (including pauses, cues, rests and actual content) to not
be rendered (i.e., the element has no effect in the aural dimension).</p>
<p class="note"> Note that any of the descendants of the affected element are allowed to
override this value, so descendants can actually take part in the aural rendering despite
using 'none' at this level. However, the pauses, cues, and rests of the ancestor element
remain "deactivated" in the aural dimension, and therefore do not contribute to the <a
href="#collapsed-pauses">collapsing of pauses</a> or additive behavior of adjoining
rests. </p>
</dd>
<dt>
<strong>normal</strong>
</dt>
<dd>
<p> The element is rendered aurally (regardless of its <a href="#display-def">'display'</a>
value and the <a href="#display-def">'display'</a> and 'speak' values of its
ancestors).</p>
<p class="note"> Note that using this value can result in the element being rendered in the
aural dimension even though it would not be rendered on the visual canvas. </p>
</dd>
</dl>
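<div class="example">
<p>The following sketch illustrates how the values of 'speak' could interact with 'display' and
with inheritance, based on the definitions above. The class names are hypothetical.</p>
<pre>
.print-only { display: none; } /* 'speak: auto' computes to 'none': not spoken */
.aural-only { display: none; speak: normal; } /* not displayed, but spoken */
.skip { speak: none; } /* not spoken; its pauses, cues and rests have no effect */
.skip .keep { speak: normal; } /* descendant overrides 'none' and is spoken */</pre>
</div>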
<h3 id="speaking-props-speak-as">The 'speak-as' property</h3>
<table class="propdef" summary="name: syntax">
<tbody>
<tr>
<td>Name:</td>
<td>
<dfn id="speak-as">speak-as</dfn>
</td>
</tr>
<tr>
<td>
<em>Value:</em>
</td>
<td>normal | spell-out || digits || [ literal-punctuation | no-punctuation ] </td>
</tr>
<tr>
<td>
<em>Initial:</em>
</td>
<td>normal</td>
</tr>
<tr>
<td>
<em>Applies&nbsp;to:</em>
</td>
<td>all elements</td>
</tr>
<tr>
<td>
<em>Inherited:</em>
</td>
<td>yes</td>
</tr>
<tr>
<td>
<em>Percentages:</em>
</td>
<td>N/A</td>
</tr>
<tr>
<td>
<em>Media:</em>
</td>
<td>speech</td>
</tr>
<tr>
<td>
<em>Computed value:</em>
</td>
<td>specified value</td>
</tr>
</tbody>
</table>
<p>The 'speak-as' property determines in what manner text gets rendered aurally, based upon a
predefined list of possibilities.</p>
<p class="note"> Note that the functionality provided by this property is conceptually similar
to the <a href="http://www.w3.org/TR/speech-synthesis11/#edef_say-as"><code>say-as</code>
element</a> from the SSML markup language [[!SSML]] (whose possible values are described in
the [[SSML-SAYAS]] W3C Note). Although the design goals are similar, the CSS model is limited
to a basic set of pronunciation rules.</p>
<dl>
<dt>
<strong>normal</strong>
</dt>
<dd>
<p>Uses language-dependent pronunciation rules for rendering the element's content. For
example, punctuation is not spoken as-is, but instead rendered naturally as appropriate
pauses.</p>
</dd>
<dt>
<strong>spell-out</strong>
</dt>
<dd>
<p>Spells the text one letter at a time (useful for acronyms and abbreviations). In
languages where accented characters are rare, it is permitted to drop accents in favor of
alternative unaccented spellings. As an example, in English, the word "r&ocirc;le" can
also be written as "role". A conforming implementation would thus be able to spell-out
"r&ocirc;le" as "R O L E".</p>
</dd>
<dt>
<strong>digits</strong>
</dt>
<dd>
<p>Speaks numbers one digit at a time. For instance, "twelve" would be spoken as "one two",
and "31" as "three one".</p>
<p class="note">Speech synthesizers are knowledgeable about what a <em>number</em> is. The
'speak-as' property enables some level of control on how user agents render numbers, and
may be implemented as a preprocessing step before passing the text to the actual speech
synthesizer.</p>
</dd>
<dt>
<strong>literal-punctuation</strong>
</dt>
<dd>
<p> Punctuation such as semicolons, braces, and so on is named aloud (i.e. spoken literally)
rather than rendered naturally as appropriate pauses.</p>
</dd>
<dt>
<strong>no-punctuation</strong>
</dt>
<dd>
<p>Punctuation is not rendered: neither spoken nor rendered as pauses.</p>
</dd>
</dl>
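<div class="example">
<p>The following sketch shows how the 'speak-as' values might be applied, including a
combination permitted by the value grammar. The selectors and class names are hypothetical.</p>
<pre>
abbr.initialism { speak-as: spell-out; } /* "CSS" is spelled as "C S S" */
span.phone { speak-as: digits; } /* "31" is spoken as "three one" */
code { speak-as: literal-punctuation; } /* semicolons and braces are named aloud */
span.serial { speak-as: spell-out digits no-punctuation; } /* values combined */</pre>
</div>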
<h2 id="pause-props">Pause properties </h2>
<h3 id="pause-props-pause-before-after">The 'pause-before' and 'pause-after' properties</h3>
<table class="propdef" summary="name: syntax">
<tbody>
<tr>
<td>Name:</td>
<td>
<dfn id="pause-before">pause-before</dfn>
</td>
</tr>
<tr>
<td>
<em>Value:</em>
</td>
<td>&lt;time&gt; | none | x-weak | weak | medium | strong | x-strong </td>
</tr>
<tr>
<td>
<em>Initial:</em>
</td>
<td>none</td>
</tr>
<tr>
<td>
<em>Applies&nbsp;to:</em>
</td>
<td>all elements</td>
</tr>
<tr>
<td>
<em>Inherited:</em>
</td>
<td>no</td>
</tr>
<tr>
<td>
<em>Percentages:</em>
</td>
<td>N/A</td>
</tr>
<tr>
<td>
<em>Media:</em>
</td>
<td>speech</td>
</tr>
<tr>
<td>
<em>Computed value:</em>
</td>
<td>specified value</td>
</tr>
</tbody>
</table>
<p>&nbsp;</p>
<table class="propdef" summary="name: syntax">
<tbody>
<tr>
<td>Name:</td>
<td>
<dfn id="pause-after">pause-after</dfn>
</td>
</tr>
<tr>
<td>
<em>Value:</em>
</td>
<td>&lt;time&gt; | none | x-weak | weak | medium | strong | x-strong </td>
</tr>
<tr>
<td>
<em>Initial:</em>
</td>
<td>none</td>
</tr>
<tr>
<td>
<em>Applies&nbsp;to:</em>
</td>
<td>all elements</td>
</tr>
<tr>
<td>
<em>Inherited:</em>
</td>
<td>no</td>
</tr>
<tr>
<td>
<em>Percentages:</em>
</td>
<td>N/A</td>
</tr>
<tr>
<td>
<em>Media:</em>
</td>
<td>speech</td>
</tr>
<tr>
<td>
<em>Computed value:</em>
</td>
<td>specified value</td>
</tr>
</tbody>
</table>
<p>The 'pause-before' and 'pause-after' properties specify a prosodic boundary (silence with a
specific duration) that occurs before (or after) the speech synthesis rendition of the
selected element, or if any 'cue-before' (or 'cue-after') is specified, before (or after) the
cue within the <a href="#aural-model">aural box model</a>.</p>
<p class="note"> Note that although the functionality provided by this property is similar to
the <a href="http://www.w3.org/TR/speech-synthesis11/#edef_break"><code>break</code>
element</a> from the SSML markup language [[!SSML]], the application of 'pause' prosodic
boundaries within the <a href="#aural-model">aural box model</a> of CSS Speech requires
special considerations (e.g. <a href="#collapsed-pauses">"collapsed" pauses</a>). </p>
<dl>
<dt>
<strong>&lt;time&gt;</strong>
</dt>
<dd>
<p>Expresses the pause in absolute <a href="#time-def">time</a> units (seconds and
milliseconds, e.g. "+3s", "250ms"). Only non-negative values are allowed.</p>
</dd>
<dt>
<strong>none</strong>
</dt>
<dd>
<p> Equivalent to 0ms (no prosodic break is produced by the speech processor). </p>
</dd>
<dt>
<strong>x-weak</strong>, <strong>weak</strong>, <strong>medium</strong>,
<strong>strong</strong>, and <strong>x-strong</strong>
</dt>
<dd>
<p> Expresses the pause by the strength of the prosodic break in speech output. The exact
time is implementation-dependent. The values indicate monotonically non-decreasing
(conceptually increasing) break strength between elements. </p>
</dd>
</dl>
<p class="note"> Note that stronger content boundaries are typically accompanied by pauses. For
example, the breaks between paragraphs are typically much more substantial than the breaks
between words within a sentence. </p>
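<div class="example">
<p>As a brief sketch of the value types listed above (the selectors and durations are chosen
only for illustration), a pause can be expressed as a named break strength, as an absolute
duration, or disabled:</p>
<pre>
blockquote { pause-before: strong; } /* exact duration is implementation-dependent */
li { pause-before: 250ms; } /* absolute duration */
em { pause-after: none; } /* equivalent to 0ms */</pre>
</div>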
<div class="example">
<p> This example illustrates how the default strengths of prosodic breaks for specific
elements (which are defined by the user agent stylesheet) can be overridden by authored
styles. </p>
<pre>
p { pause: none } /* pause-before: none; pause-after: none */</pre>
</div>
<h3 id="pause-props-pause">The 'pause' shorthand property</h3>
<table class="propdef" summary="name: syntax">
<tbody>
<tr>
<td>Name:</td>
<td>
<dfn id="pause">pause</dfn>
</td>
</tr>
<tr>
<td>
<em>Value:</em>
</td>
<td>&lt;'pause-before'&gt; &lt;'pause-after'&gt;?</td>
</tr>
<tr>
<td>
<em>Initial:</em>
</td>
<td>N/A (see individual properties)</td>
</tr>
<tr>
<td>
<em>Applies&nbsp;to:</em>
</td>
<td>all elements</td>
</tr>
<tr>
<td>
<em>Inherited:</em>
</td>
<td>no</td>
</tr>
<tr>
<td>
<em>Percentages:</em>
</td>
<td>N/A</td>
</tr>
<tr>
<td>
<em>Media:</em>
</td>
<td>speech</td>
</tr>
<tr>
<td>
<em>Computed value:</em>
</td>
<td>N/A (see individual properties)</td>
</tr>
</tbody>
</table>
<p>The 'pause' property is a shorthand property for 'pause-before' and 'pause-after'. If two
values are given, the first value is 'pause-before' and the second is 'pause-after'. If only
one value is given, it applies to both properties.</p>
<div class="example">
<p> Examples of property values:</p>
<pre>
h1 { pause: 20ms; } /* pause-before: 20ms; pause-after: 20ms */
h2 { pause: 30ms 40ms; } /* pause-before: 30ms; pause-after: 40ms */
h3 { pause-after: 10ms; } /* pause-before: <i>unspecified</i>; pause-after: 10ms */</pre>
</div>
<h3 id="collapsed-pauses">Collapsing pauses</h3>
<p> The pause defines the minimum distance of the aural "box" to the aural "boxes" before and
after it. Adjoining pauses are merged by selecting the strongest named break and the longest
absolute time interval. For example, "strong" is selected when comparing "strong" and "weak",
"1s" is selected when comparing "1s" and "250ms", and "strong" and "250ms" take effect
additively when comparing "strong" and "250ms". </p>
<p>The following pauses are adjoining:</p>
<ol>
<li>The 'pause-after' of an aural "box" and the 'pause-after' of its last child, provided the
former has no 'rest-after' and no 'cue-after'.</li>
<li>The 'pause-before' of an aural "box" and the 'pause-before' of its first child, provided
the former has no 'rest-before' and no 'cue-before'.</li>
<li>The 'pause-after' of an aural "box" and the 'pause-before' of its next sibling.</li>
<li>The 'pause-before' and 'pause-after' of an aural "box", if the "box" has a
'voice-duration' of "0ms" and no 'rest-before' or 'rest-after' and no 'cue-before' or
'cue-after', or if the "box" has no rendered content at all (see 'speak').</li>
</ol>
<p>A collapsed pause is considered adjoining to another pause if any of its component pauses is
adjoining to that pause.</p>
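<div class="example">
<p>The following sketch illustrates two of the adjoining cases listed above. The class names
and durations are hypothetical.</p>
<pre>
/* Case 1: adjoining siblings, &lt;h1 class="title"&gt; followed by &lt;p class="lead"&gt; */
h1.title { pause-after: 1s; }
p.lead { pause-before: 250ms; }
/* The two pauses collapse: the silence between the heading and the
paragraph lasts 1s (the longest interval), not 1.25s. */

/* Case 2: a "box" and its first child, where the parent has
no 'rest-before' and no 'cue-before' */
div.chapter { pause-before: strong; }
div.chapter h2 { pause-before: weak; }
/* If the h2 is the first child of the div, the pauses collapse
and the strongest named break ('strong') is selected. */</pre>
</div>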
<p class="note"> Note that 'pause' has been moved from between the element's contents and any
'cue' to outside the 'cue'. This is not backwards compatible with the informative CSS2.1 Aural
appendix [[!CSS21]].</p>
<h2 id="rest-props">Rest properties</h2>
<h3 id="rest-props-rest-before-after">The 'rest-before' and 'rest-after' properties</h3>
<table class="propdef" summary="name: syntax">
<tbody>
<tr>
<td>Name:</td>
<td>
<dfn id="rest-before">rest-before</dfn>
</td>
</tr>
<tr>
<td>
<em>Value:</em>
</td>
<td>&lt;time&gt; | none | x-weak | weak | medium | strong | x-strong </td>
</tr>
<tr>
<td>
<em>Initial:</em>
</td>
<td>none</td>
</tr>
<tr>
<td>
<em>Applies&nbsp;to:</em>
</td>
<td>all elements</td>
</tr>
<tr>
<td>
<em>Inherited:</em>
</td>
<td>no</td>
</tr>
<tr>