<span class="special">Can you hear me ?</span>
I am Peter.
&lt;/p&gt;</pre>
</div>
<h2 id=aural-model><span class=secno>6. </span>The aural formatting model</h2>
<p>The CSS formatting model for aural media is based on a sequence of
sounds and silences that occur within a nested context similar to the <a
href="#box-model-def">visual box model</a>, which we name the <dfn
id=aural-box-model>aural "box" model</dfn>. The aural "canvas" consists of
a two-channel (stereo) space and of a temporal dimension, within which
synthetic speech and audio cues coexist. The selected element is
surrounded by ‘<a href="#rest"><code class=property>rest</code></a>’,
‘<a href="#cue"><code class=property>cue</code></a>’ and ‘<a
href="#pause"><code class=property>pause</code></a>’ properties (from
the innermost to the outermost position). These can be seen as aural
equivalents to ‘<code class=property>padding</code>’, ‘<code
class=property>border</code>’ and ‘<code
class=property>margin</code>’, respectively. When used, the ‘<code
class=css>:before</code>’ and ‘<code class=css>:after</code>’
pseudo-elements <a href="#ref-CSS21"
rel=biblioentry>[CSS21]<!--{{!CSS21}}--></a> get inserted between the
element's contents and the ‘<a href="#rest"><code
class=property>rest</code></a>’.
<p> The following diagram illustrates the equivalence between properties of
the visual and aural box models, applied to the selected &lt;element&gt;:
<p> <img
alt="The aural 'box' model, illustrated by a diagram: the selected element is positioned in the center, on its left side are (from innermost to outermost) rest-before, cue-before, pause-before, on its right side are (from innermost to outermost) rest-after, cue-after, pause-after, where rest is conceptually similar to padding, cue is similar to border, pause is similar to margin."
id=aural-box src=aural-box.png
title="The aural 'box' model, illustrated by a diagram: the selected element is positioned in the center, on its left side are (from innermost to outermost) rest-before, cue-before, pause-before, on its right side are (from innermost to outermost) rest-after, cue-after, pause-after, where rest is conceptually similar to padding, cue is similar to border, pause is similar to margin.">
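<div class=example>
<p>The nesting described above can be sketched with a single rule. This is
an illustrative sketch only: the sound file names are hypothetical and the
durations are arbitrary.
<pre>
h1 {
  pause: 1s 500ms;                     /* outermost, like 'margin': pause-before, then pause-after */
  cue-before: url(section-start.wav);  /* like 'border'; hypothetical file name */
  cue-after: url(section-end.wav);     /* like 'border'; hypothetical file name */
  rest: 200ms;                         /* innermost, like 'padding': rest-before and rest-after */
}
</pre>
<p>Assuming both cue files load, the expected playback order would be:
pause-before, cue-before, rest-before, the spoken content of the heading,
rest-after, cue-after, pause-after.
</div>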
<h2 id=mixing-props><span class=secno>7. </span>Mixing properties</h2>
<h3 id=mixing-props-voice-volume><span class=secno>7.1. </span>The ‘<a
href="#voice-volume"><code class=property>voice-volume</code></a>’
property</h3>
<table class=propdef summary="name: syntax">
<tbody>
<tr>
<td>Name:
<td> <dfn id=voice-volume>voice-volume</dfn>
<tr>
<td> <em>Value:</em>
<td>silent | [[x-soft | soft | medium | loud | x-loud] ||
&lt;decibel&gt;]
<tr>
<td> <em>Initial:</em>
<td>medium
<tr>
<td> <em>Applies to:</em>
<td>all elements
<tr>
<td> <em>Inherited:</em>
<td>yes
<tr>
<td> <em>Percentages:</em>
<td>N/A
<tr>
<td> <em>Computed value:</em>
<td>‘<code class=property>silent</code>’, or a keyword value and
optionally also a decibel offset (if not zero)
</table>
<p>The ‘<a href="#voice-volume"><code
class=property>voice-volume</code></a>’ property allows authors to
control the amplitude of the audio waveform generated by the speech
synthesiser, and is also used to adjust the relative volume level of <a
href="#cue-props">audio cues</a> within the <a href="#aural-model">aural
box model</a> of the selected element.
<p class=note> Note that although the functionality provided by this
property is similar to the <a
href="https://www.w3.org/TR/speech-synthesis11/#edef_prosody"><code>volume</code>
attribute of the <code>prosody</code> element</a> from the SSML markup
language <a href="#ref-SSML" rel=biblioentry>[SSML]<!--{{!SSML}}--></a>,
there are notable discrepancies. For example, CSS Speech volume keywords
and decibel units are not mutually exclusive, due to how values are
inherited and combined for selected elements.
<dl><!-- dt>
<strong>normal</strong>
</dt>
<dd>
<p> Corresponds to +0.0dB, which means that there is no modification of volume level. This
value overrides the inherited value.</p>
</dd -->
<dt> <strong>silent</strong>
<dd>
<p> Specifies that no sound is generated (the text is read "silently").
<p class=note> Note that this has the same effect as using negative
infinity decibels. Also note that there is a difference between an
element whose ‘<a href="#voice-volume"><code
class=property>voice-volume</code></a>’ property has a value of
‘<code class=property>silent</code>’, and an element whose ‘<a
href="#speak"><code class=property>speak</code></a>’ property has the
value ‘<code class=property>never</code>’. With the former, the
selected element takes up the same time as if it was spoken, including
any pause before and after the element, but no sound is generated
(descendants within the <a href="#aural-model">aural box model</a> of
the selected element can override the ‘<a href="#voice-volume"><code
class=property>voice-volume</code></a>’ value, and may therefore
generate audio output). With the latter, the selected element is not
rendered in the aural dimension and no time is allocated for playback
(descendants within the <a href="#aural-model">aural box model</a> of
the selected element can override the ‘<a href="#speak"><code
class=property>speak</code></a>’ value, and may therefore generate
audio output).
<dt><strong>x-soft</strong>, <strong>soft</strong>,
<strong>medium</strong>, <strong>loud</strong>, <strong>x-loud</strong>
<dd>
<p>This sequence of keywords corresponds to monotonically non-decreasing
volume levels, mapped to implementation-dependent values that meet the
listener's requirements with regard to perceived loudness. These audio
levels are typically provided via a preference mechanism that allows
users to calibrate sound options according to their auditory
environment. The keyword ‘<code class=property>x-soft</code>’ maps
to the user's <em>minimum audible</em> volume level, ‘<code
class=property>x-loud</code>’ maps to the user's <em>maximum
tolerable</em> volume level, ‘<code class=property>medium</code>’
maps to the user's <em>preferred</em> volume level, ‘<code
class=property>soft</code>’ and ‘<code class=property>loud</code>’
map to intermediary values.
<dt> <strong>&lt;decibel&gt;</strong>
<dd>
<p>A <a href="#number-def">number</a> immediately followed by "dB"
(decibel unit). This represents a change (positive or negative) relative
to the given keyword value (see enumeration above), or to the default
value for the root element, or otherwise to the inherited volume level
(which may itself be a combination of a keyword value and of a decibel
offset, in which case the decibel values are combined additively). When
the inherited volume level is ‘<code class=property>silent</code>’,
this ‘<a href="#voice-volume"><code
class=property>voice-volume</code></a>’ resolves to ‘<code
class=property>silent</code>’ too, regardless of the specified
&lt;decibel&gt; value. Decibels represent the ratio of the squares of
the new signal amplitude (a1) and the current amplitude (a0), as per the
following logarithmic equation: volume(dB) = 20 log10 (a1 / a0)
<p class=note> Note that -6.0dB is approximately half the amplitude of
the audio signal, and +6.0dB is approximately twice the amplitude.
</dl>
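<div class=example>
<p>The following sketch shows how keywords and decibel offsets might be
combined. The selectors and class names are hypothetical; the comments
restate the rules given in the definitions above.
<pre>
article       { voice-volume: soft; }       /* keyword only */
article em    { voice-volume: +6dB; }       /* offset relative to the inherited 'soft' */
aside         { voice-volume: loud -3dB; }  /* keyword and offset together */
.muted        { voice-volume: silent; }     /* time is still allocated, but no sound */
.muted em     { voice-volume: +6dB; }       /* inherited level is 'silent', so this stays silent */
.muted strong { voice-volume: medium; }     /* a keyword overrides the inherited 'silent' */
</pre>
<p>Applying the equation above, an offset of +6dB corresponds to an
amplitude ratio a1 / a0 = 10^(6 / 20), which is approximately 1.995, and
-6dB corresponds to 10^(-6 / 20), which is approximately 0.501.
</div>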
<p class=note>Note that perceived loudness depends on various factors, such
as the listening environment, user preferences or physical abilities. The
effective volume variation between ‘<code
class=property>x-soft</code>’ and ‘<code
class=property>x-loud</code>’ represents the dynamic range (in terms of
loudness) of the audio output. Typically, this range would be compressed
in a noisy context, i.e. the perceived loudness corresponding to ‘<code
class=property>x-soft</code>’ would effectively be closer to ‘<code
class=property>x-loud</code>’ than it would be in a quiet environment.
There may also be situations where both ‘<code
class=property>x-soft</code>’ and ‘<code
class=property>x-loud</code>’ would map to low volume levels, such as in
listening environments requiring discretion (e.g. library, night-reading).
<h3 id=mixing-props-voice-balance><span class=secno>7.2. </span>The ‘<a
href="#voice-balance"><code class=property>voice-balance</code></a>’
property</h3>
<table class=propdef summary="name: syntax">
<tbody>
<tr>
<td>Name:
<td> <dfn id=voice-balance>voice-balance</dfn>
<tr>
<td> <em>Value:</em>
<td>&lt;number&gt; | left | center | right | leftwards | rightwards
<tr>
<td> <em>Initial:</em>
<td>center
<tr>
<td> <em>Applies to:</em>
<td>all elements
<tr>
<td> <em>Inherited:</em>
<td>yes
<tr>
<td> <em>Percentages:</em>
<td>N/A
<tr>
<td> <em>Computed value:</em>
<td>the specified value resolved to a &lt;number&gt; between ‘<code
class=css>-100</code>’ and ‘<code class=css>100</code>’
(inclusive)
</table>
<p> The ‘<a href="#voice-balance"><code
class=property>voice-balance</code></a>’ property controls the spatial
distribution of audio output across a lateral sound stage: one extremity
is on the left, the other extremity is on the right hand side, relative to
the listener's position. Authors can specify intermediary steps between
left and right extremities, to represent the audio separation along the
resulting left-right axis.
<p class=note> Note that the functionality provided by this property has no
match in the SSML markup language <a href="#ref-SSML"
rel=biblioentry>[SSML]<!--{{!SSML}}--></a>.
<dl>
<dt> <strong>&lt;number&gt;</strong>
<dd>
<p>A <a href="#number-def">number</a> between ‘<code
class=css>-100</code>’ and ‘<code class=css>100</code>’
(inclusive). Values smaller than ‘<code class=css>-100</code>’ are
clamped to ‘<code class=css>-100</code>’. Values greater than
‘<code class=css>100</code>’ are clamped to ‘<code
class=css>100</code>’. The value ‘<code class=css>-100</code>’
represents the left side, and the value ‘<code class=css>100</code>’
represents the right side. The value ‘<code class=css>0</code>’
represents the center point whereby there is no discernible audio
separation between left and right sides (in a stereo sound system, this
corresponds to equal distribution of audio signals between left and
right speakers).
<dt> <strong>left</strong>
<dd>
<p>Same as ‘<code class=css>-100</code>’.
<dt> <strong>center</strong>
<dd>
<p>Same as ‘<code class=css>0</code>’.
<dt> <strong>right</strong>
<dd>
<p>Same as ‘<code class=css>100</code>’.
<dt> <strong>leftwards</strong>
<dd>
<p>Moves the sound to the left, by subtracting 20 from the inherited
‘<a href="#voice-balance"><code
class=property>voice-balance</code></a>’ value, and by clamping the
resulting number to ‘<code class=css>-100</code>’.
<dt> <strong>rightwards</strong>
<dd>
<p>Moves the sound to the right, by adding 20 to the inherited ‘<a
href="#voice-balance"><code class=property>voice-balance</code></a>’
value, and by clamping the resulting number to ‘<code
class=css>100</code>’.
</dl>
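<div class=example>
<p>A sketch of the relative keywords and of clamping. The class names are
hypothetical, and each <code>q</code> element is assumed to inherit its
balance directly from the element matched by the rule before it.
<pre>
.speaker-a    { voice-balance: -90; }
.speaker-a q  { voice-balance: leftwards; }   /* -90 - 20 = -110, clamped to -100 */
.speaker-b    { voice-balance: 60; }
.speaker-b q  { voice-balance: rightwards; }  /* 60 + 20 = 80 */
.announcement { voice-balance: 250; }         /* out of range, clamped to 100 */
</pre>
</div>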
<p> User agents may be connected to different kinds of sound systems,
featuring varying audio mixing capabilities. The expected behavior for
mono, stereo, and surround sound systems is defined as follows:
<ul>
<li> When user agents produce audio via a mono-aural sound system (i.e.
single-speaker setup), the ‘<a href="#voice-balance"><code
class=property>voice-balance</code></a>’ property has no effect.
<li> When user agents produce audio through a stereo sound system (e.g.
two speakers, a pair of headphones), the left-right distribution of audio
signals can precisely match the authored values for the ‘<a
href="#voice-balance"><code class=property>voice-balance</code></a>’
property.
<li> When user agents are capable of mixing audio signals through more
than 2 channels (e.g. 5-speaker surround sound system, including a
dedicated center channel), the physical distribution of audio signals
resulting from the application of the ‘<a href="#voice-balance"><code
class=property>voice-balance</code></a>’ property should be performed
so that the listener perceives sound as if it was coming from a basic
stereo layout. For example, the center channel as well as the left/right
speakers may be used together in order to emulate the behavior of the
‘<code class=property>center</code>’ value.
</ul>
<p> Future revisions of the CSS Speech module may include support for
three-dimensional audio, which would effectively enable authors to specify
"azimuth" and "elevation" values. In the future, content authored using
the current specification may therefore be consumed by user agents which
are compliant with the version of CSS Speech that supports
three-dimensional audio. In order to prepare for this possibility, the
values enabled by the current ‘<a href="#voice-balance"><code
class=property>voice-balance</code></a>’ property are designed to remain
compatible with "azimuth" angles. More precisely, the mapping between the
current left-right audio axis (lateral sound stage) and the envisioned 360
degrees plane around the listener's position is defined as follows:
<ul>
<li>The value ‘<code class=css>0</code>’ maps to zero degrees
(‘<code class=property>center</code>’). This is in "front" of the
listener, not "behind".
<li>The value ‘<code class=css>-100</code>’ maps to -40 degrees
(‘<code class=property>left</code>’). Negative angles are in the
counter-clockwise direction (the audio stage is seen from the top).
<li>The value ‘<code class=css>100</code>’ maps to 40 degrees
(‘<code class=property>right</code>’). Positive angles are in the
clockwise direction (the audio stage is seen from the top).
<li>Intermediary values on the scale from ‘<code
class=css>-100</code>’ to ‘<code class=css>100</code>’ map to the
angles between -40 and 40 degrees in a numerically linearly-proportional
manner. For example, ‘<code class=css>-50</code>’ maps to -20
degrees.
</ul>
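<div class=example>
<p>Because the mapping is linear, the envisioned azimuth angle can be
obtained by multiplying the balance value by 0.4 (40 degrees divided by
100). A few worked values, using hypothetical class names:
<pre>
.far-left   { voice-balance: -100; }  /* -100 × 0.4 = -40 degrees */
.half-left  { voice-balance: -50; }   /*  -50 × 0.4 = -20 degrees */
.middle     { voice-balance: 0; }     /*    0 × 0.4 =   0 degrees */
.near-right { voice-balance: 25; }    /*   25 × 0.4 =  10 degrees */
.far-right  { voice-balance: 100; }   /*  100 × 0.4 =  40 degrees */
</pre>
</div>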
<p class=note> Note that sound systems may be configured by users in such a
way that it would interfere with the left-right audio distribution
specified by document authors. Typically, the various "surround" modes
available in modern sound systems (including systems based on basic stereo
speakers) tend to greatly alter the perceived spatial arrangement of audio
signals. The illusion of a three-dimensional sound stage is often achieved
using a combination of phase shifting, digital delay, volume control
(channel mixing), and other techniques. Some users may even configure
their system to "downgrade" any rendered sound to a single mono channel,
in which case the effect of the ‘<a href="#voice-balance"><code
class=property>voice-balance</code></a>’ property would obviously not be
perceivable at all. The rendering fidelity of authored content is
therefore dependent on such user customizations, and the ‘<a
href="#voice-balance"><code class=property>voice-balance</code></a>’
property merely specifies the desired end-result.
<p class=note> Note that many speech synthesizers only generate mono sound,
and therefore do not intrinsically support the ‘<a
href="#voice-balance"><code class=property>voice-balance</code></a>’
property. The sound distribution along the left-right axis consequently
occurs at the post-synthesis stage (when the speech-enabled user agent mixes
the various audio sources authored within the document).
<h2 id=speaking-props><span class=secno>8. </span>Speaking properties</h2>
<h3 id=speaking-props-speak><span class=secno>8.1. </span>The ‘<a
href="#speak"><code class=property>speak</code></a>’ property</h3>
<table class=propdef summary="name: syntax">
<tbody>
<tr>
<td>Name:
<td> <dfn id=speak>speak</dfn>
<tr>
<td> <em>Value:</em>
<td>auto | never | always
<tr>
<td> <em>Initial:</em>
<td>auto
<tr>
<td> <em>Applies to:</em>
<td>all elements
<tr>
<td> <em>Inherited:</em>
<td>yes
<tr>
<td> <em>Percentages:</em>
<td>N/A
<tr>
<td> <em>Computed value:</em>
<td>specified value
</table>
<p>The ‘<a href="#speak"><code class=property>speak</code></a>’
property determines whether or not to render text aurally.
<p class=note> Note that the functionality provided by this property has no
match in the SSML markup language <a href="#ref-SSML"
rel=biblioentry>[SSML]<!--{{!SSML}}--></a>.
<dl>
<dt> <strong>auto</strong>
<dd>
<p>Resolves to a computed value of ‘<code class=css>never</code>’
when <a href="#display-def">‘<code
class=property>display</code>’</a> is ‘<code
class=property>none</code>’, otherwise resolves to a computed value of
‘<code class=property>auto</code>’. The used value is then
equivalent to ‘<code class=css>always</code>’ if ‘<code
class=property>visibility</code>’ is ‘<code
class=css>visible</code>’ and to ‘<code class=css>never</code>’
otherwise.
<p class=note> Note that the ‘<code class=property>none</code>’ value
of the <a href="#display-def">‘<code
class=property>display</code>’</a> property cannot be overridden by
descendants of the selected element, but the ‘<code
class=property>auto</code>’ value of ‘<a href="#speak"><code
class=property>speak</code></a>’ can however be overridden using
either of ‘<code class=property>never</code>’ or ‘<code
class=property>always</code>’.
<dt> <strong>never</strong>
<dd>
<p> This value causes an element (including pauses, cues, rests and
actual content) to not be rendered (i.e., the element has no effect in
the aural dimension).
<p class=note> Note that any of the descendants of the affected element
are allowed to override this value, so descendants can actually take
part in the aural rendering despite using ‘<code class=css>speak:
never</code>’ at this level. However, the pauses, cues, and rests of
the ancestor element remain "deactivated" in the aural dimension, and
therefore do not contribute to the <a
href="#collapsed-pauses">collapsing of pauses</a> or additive behavior
of adjoining rests.
<dt> <strong>always</strong>
<dd>
<p> The element is rendered aurally (regardless of its <a
href="#display-def">‘<code class=property>display</code>’</a> value,
or the <a href="#display-def">‘<code
class=property>display</code>’</a> or ‘<a href="#speak"><code
class=property>speak</code></a>’ values of its ancestors).
<p class=note> Note that using this value can result in the element being
rendered in the aural dimension even though it would not be rendered on
the visual canvas.
</dl>
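<div class=example>
<p>A sketch of how the three values might interact. The selectors are
hypothetical, and the last rule assumes that no ancestor sets ‘<code
class=property>speak</code>’ to anything other than ‘<code
class=property>auto</code>’.
<pre>
nav          { speak: never; }                  /* skipped, along with its pauses, cues and rests */
nav .current { speak: always; }                 /* a descendant can override 'never' */
.aural-only  { display: none; speak: always; }  /* spoken, but not rendered on the visual canvas */
.ghost       { visibility: hidden; }            /* 'speak' stays 'auto'; used value is equivalent to 'never' */
</pre>
</div>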
<h3 id=speaking-props-speak-as><span class=secno>8.2. </span>The ‘<a
href="#speak-as"><code class=property>speak-as</code></a>’ property</h3>
<table class=propdef summary="name: syntax">
<tbody>
<tr>
<td>Name:
<td> <dfn id=speak-as>speak-as</dfn>
<tr>
<td> <em>Value:</em>