Commit 7a254f4

committed
last update to address the LCWD disposition of comments.
1 parent 7d816de commit 7a254f4

2 files changed

Lines changed: 163 additions & 153 deletions

File tree

css3-speech/Overview.html

Lines changed: 93 additions & 76 deletions
Original file line number | Diff line number | Diff line change
@@ -88,14 +88,14 @@
8888

8989
<h1 id=top>CSS Speech Module</h1>
9090

91-
<h2 class="no-num no-toc" id=longstatus-date>Editor's Draft 20 February
91+
<h2 class="no-num no-toc" id=longstatus-date>Editor's Draft 21 February
9292
2012</h2>
9393

9494
<dl id=versions>
9595
<dt>This version:
9696

9797
<dd>
98-
<!--<a href="http://www.w3.org/TR/2012/WD-css3-speech-20120220/">http://www.w3.org/TR/2012/ED-css3-speech-20120220/</a>-->
98+
<!--<a href="http://www.w3.org/TR/2012/WD-css3-speech-20120221/">http://www.w3.org/TR/2012/ED-css3-speech-20120221/</a>-->
9999
<a
100100
href="http://dev.w3.org/csswg/css3-speech">http://dev.w3.org/csswg/css3-speech</a>
101101

@@ -535,8 +535,11 @@ <h2 id=aural-model><span class=secno>6. </span>The aural formatting model</h2>
535535
<p> The following diagram illustrates the equivalence between properties of
536536
the visual and aural box models, applied to the selected &lt;element&gt;:
537537

538-
<p> <img alt="A graph depicting the aural 'box' model." id=aural-box
539-
src=aural-box.png>
538+
<p> <img
539+
alt="The aural 'box' model, illustrated by a diagram: the selected element is positioned in the center, on its left side are (from innermost to outermost) rest-before, cue-before, pause-before, on its right side are (from innermost to outermost) rest-after, cue-after, pause-after, where rest is conceptually similar to padding, cue is similar to border, pause is similar to margin."
540+
id=aural-box src=aural-box.png
541+
title="The aural 'box' model, illustrated by a diagram: the selected element is positioned in the center, on its left side are (from innermost to outermost) rest-before, cue-before, pause-before, on its right side are (from innermost to outermost) rest-after, cue-after, pause-after, where rest is conceptually similar to padding, cue is similar to border, pause is similar to margin.">
542+
540543

541544
<h2 id=mixing-props><span class=secno>7. </span>Mixing properties</h2>
542545

@@ -585,15 +588,16 @@ <h3 id=mixing-props-voice-volume><span class=secno>7.1. </span>The
585588
<tr>
586589
<td> <em>Computed value:</em>
587590

588-
<td>a keyword value, and optionally also a decibel offset (if not zero)
591+
<td>&lsquo;<code class=property>silent</code>&rsquo;, or a keyword value
592+
and optionally also a decibel offset (if not zero)
589593
</table>
590594

591595
<p>The &lsquo;<a href="#voice-volume"><code
592596
class=property>voice-volume</code></a>&rsquo; property allows authors to
593597
control the amplitude of the audio waveform generated by the speech
594598
synthesiser, and is also used to adjust the relative volume level of <a
595-
href="#cue-props">audio cues</a> within the <a href="#aural-model">audio
596-
box model</a>.
599+
href="#cue-props">audio cues</a> within the <a href="#aural-model">aural
600+
box model</a> of the selected element.
597601

598602
<p class=note> Note that although the functionality provided by this
599603
property is similar to the <a
@@ -615,36 +619,40 @@ <h3 id=mixing-props-voice-volume><span class=secno>7.1. </span>The
615619
<dt> <strong>silent</strong>
616620

617621
<dd>
618-
<p> Specifies that no sound is generated (the text is read "silently").
619-
Corresponds to negative infinity in dB units.</p>
622+
<p> Specifies that no sound is generated (the text is read "silently").</p>
620623

621-
<p class=note> Note that there is a difference between an element whose
622-
&lsquo;<a href="#voice-volume"><code
624+
<p class=note> Note that this has the same effect as using negative
625+
infinity decibels. Also note that there is a difference between an
626+
element whose &lsquo;<a href="#voice-volume"><code
623627
class=property>voice-volume</code></a>&rsquo; property has a value of
624628
&lsquo;<code class=property>silent</code>&rsquo;, and an element whose
625629
&lsquo;<a href="#speak"><code class=property>speak</code></a>&rsquo;
626630
property has the value &lsquo;<code class=property>none</code>&rsquo;.
627631
With the former, the selected element takes up the same time as if it
628632
was spoken, including any pause before and after the element, but no
629-
sound is generated (descendants can override the &lsquo;<a
633+
sound is generated (descendants within the <a href="#aural-model">aural
634+
box model</a> of the selected element can override the &lsquo;<a
630635
href="#voice-volume"><code class=property>voice-volume</code></a>&rsquo;
631-
value and may therefore generate audio output). With the latter, the
636+
value, and may therefore generate audio output). With the latter, the
632637
selected element is not rendered in the aural dimension and no time is
633-
allocated for playback (descendants can override the &lsquo;<a
634-
href="#speak"><code class=property>speak</code></a>&rsquo; value and may
635-
therefore generate audio output).</p>
638+
allocated for playback (descendants within the <a
639+
href="#aural-model">aural box model</a> of the selected element can
640+
override the &lsquo;<a href="#speak"><code
641+
class=property>speak</code></a>&rsquo; value, and may therefore generate
642+
audio output).</p>
636643

637644
<dt><strong>x-soft</strong>, <strong>soft</strong>,
638645
<strong>medium</strong>, <strong>loud</strong>, <strong>x-loud</strong>
639646

640647
<dd>
641648
<p>This sequence of keywords corresponds to monotonically non-decreasing
642649
volume levels, mapped to implementation-dependent values that meet the
643-
listener's requirements with regards to perceived sound loudness. These
644-
audio levels are typically provided via a preference mechanism that
645-
allow users to set options according to their auditory environment. The
646-
keyword &lsquo;<code class=property>x-soft</code>&rsquo; maps to the
647-
user's <em>minimum audible</em> volume level, &lsquo;<code
650+
listener's requirements with regard to perceived loudness. These audio
651+
levels are typically provided via a preference mechanism that allows
652+
users to calibrate sound options according to their auditory
653+
environment. The keyword &lsquo;<code
654+
class=property>x-soft</code>&rsquo; maps to the user's <em>minimum
655+
audible</em> volume level, &lsquo;<code
648656
class=property>x-loud</code>&rsquo; maps to the user's <em>maximum
649657
tolerable</em> volume level, &lsquo;<code
650658
class=property>medium</code>&rsquo; maps to the user's
@@ -674,12 +682,12 @@ <h3 id=mixing-props-voice-volume><span class=secno>7.1. </span>The
674682
the audio signal, and +6.0dB is approximately twice the amplitude.</p>
675683
</dl>
676684

677-
<p class=note>Note that the actual perceived volume levels depend on
678-
various factors, such as the listening environment and personal user
679-
preferences. The effective volume variation between &lsquo;<code
685+
<p class=note>Note that perceived loudness depends on various factors, such
686+
as the listening environment, user preferences or physical abilities. The
687+
effective volume variation between &lsquo;<code
680688
class=property>x-soft</code>&rsquo; and &lsquo;<code
681689
class=property>x-loud</code>&rsquo; represents the dynamic range (in terms
682-
of loudness) of the speech output. Typically, this range would be
690+
of loudness) of the audio output. Typically, this range would be
683691
compressed in a noisy context, i.e. the perceived loudness corresponding
684692
to &lsquo;<code class=property>x-soft</code>&rsquo; would effectively be
685693
closer to &lsquo;<code class=property>x-loud</code>&rsquo; than it would
@@ -1485,7 +1493,7 @@ <h3 id=rest-props-rest-before-after><span class=secno>10.1. </span>The
14851493
href="#rest-after"><code class=property>rest-after</code></a>&rsquo;
14861494
properties specify a prosodic boundary (silence with a specific duration)
14871495
that occurs before (or after) the speech synthesis rendition of an element
1488-
within the <a href="#aural-model">audio box model</a>.
1496+
within the <a href="#aural-model">aural box model</a>.
14891497

14901498
<p class=note> Note that although the functionality provided by this
14911499
property is similar to the <a
@@ -1690,7 +1698,7 @@ <h3 id=cue-props-cue-before-after><span class=secno>11.1. </span>The
16901698
href="#cue-after"><code class=property>cue-after</code></a>&rsquo;
16911699
properties specify auditory icons (i.e. pre-recorded / pre-generated sound
16921700
clips) to be played before (or after) the selected element within the <a
1693-
href="#aural-model">audio box model</a>.
1701+
href="#aural-model">aural box model</a>.
16941702

16951703
<p class=note> Note that although the functionality provided by this
16961704
property may appear related to the <a
@@ -1724,21 +1732,21 @@ <h3 id=cue-props-cue-before-after><span class=secno>11.1. </span>The
17241732
to the computed value of the &lsquo;<a href="#voice-volume"><code
17251733
class=property>voice-volume</code></a>&rsquo; property within the <a
17261734
href="#aural-model">aural box model</a> of the selected element (as a
1727-
result, the volume level of audio cues changes when the &lsquo;<a
1735+
result, the volume level of an audio cue changes when the &lsquo;<a
17281736
href="#voice-volume"><code class=property>voice-volume</code></a>&rsquo;
17291737
property changes). When omitted, the implied value computes to 0dB.</p>
17301738

1731-
<p> When the &lsquo;<a href="#voice-volume"><code
1732-
class=property>voice-volume</code></a>&rsquo; property is set to
1733-
&lsquo;<code class=property>silent</code>&rsquo;, the audio cue is also
1734-
set to &lsquo;<code class=property>silent</code>&rsquo; (regardless of
1735-
this specified &lt;decibel&gt; value). Otherwise (when not &lsquo;<code
1739+
<p> When the computed value of the &lsquo;<a href="#voice-volume"><code
1740+
class=property>voice-volume</code></a>&rsquo; property is &lsquo;<code
1741+
class=property>silent</code>&rsquo;, the audio cue is also set to
1742+
&lsquo;<code class=property>silent</code>&rsquo; (regardless of this
1743+
specified &lt;decibel&gt; value). Otherwise (when not &lsquo;<code
17361744
class=property>silent</code>&rsquo;), &lsquo;<a
17371745
href="#voice-volume"><code class=property>voice-volume</code></a>&rsquo;
17381746
values are always specified relative to the volume level keywords (see
17391747
the definition of &lsquo;<a href="#voice-volume"><code
17401748
class=property>voice-volume</code></a>&rsquo;), which map to a
1741-
user-configured scale of "preferred" loudness settings. If the inherited
1749+
user-calibrated scale of "preferred" loudness settings. If the inherited
17421750
&lsquo;<a href="#voice-volume"><code
17431751
class=property>voice-volume</code></a>&rsquo; value already contains a
17441752
decibel offset, the dB offset specific to the audio cue is combined
@@ -1789,48 +1797,54 @@ <h3 id=cue-props-volume><span class=secno>11.2. </span>Relation between
17891797
For example, the desired effect of an audio cue whose volume level is set
17901798
at +0dB (as specified by the &lt;decibel&gt; value) is that its perceived
17911799
loudness during playback is close to that of the speech synthesis
1792-
rendition of the selected element, as dictated by computed value of the
1800+
rendition of the selected element, as dictated by the computed value of
1801+
the &lsquo;<a href="#voice-volume"><code
1802+
class=property>voice-volume</code></a>&rsquo; property. Note that a
1803+
&lsquo;<code class=property>silent</code>&rsquo; computed value for the
17931804
&lsquo;<a href="#voice-volume"><code
1794-
class=property>voice-volume</code></a>&rsquo; property (which is itself
1795-
based on a user-configured volume level keyword). Similarly, a
1796-
&lsquo;<code class=property>silent</code>&rsquo; value for the &lsquo;<a
1797-
href="#voice-volume"><code class=property>voice-volume</code></a>&rsquo;
1798-
property results on any audio cues being "silenced" as well.
1799-
1800-
<p> In order to achieve this effect, authors should ensure that the volume
1801-
level of audio cues (on average, as there may be discrete loudness
1802-
variations due to changes in the audio stream, such as intonation, stress,
1803-
etc.) matches that of a "typical" TTS voice output (based on the &lsquo;<a
1805+
class=property>voice-volume</code></a>&rsquo; property results in audio
1806+
cues being "forcefully" silenced as well (i.e. regardless of the specified
1807+
audio cue &lsquo;<code class=property>decibel</code>&rsquo; value).
1808+
1809+
<p> The volume keywords of the &lsquo;<a href="#voice-volume"><code
1810+
class=property>voice-volume</code></a>&rsquo; property are user-calibrated
1811+
to match requirements not known at authoring time (e.g. auditory
1812+
environment, personal preferences). Therefore, in order to achieve this
1813+
approximate loudness alignment of audio cues and speech synthesis, authors
1814+
should ensure that the volume level of audio cues (on average, as there
1815+
may be discrete variations of perceived loudness due to changes in the
1816+
audio stream, such as intonation, stress, etc.) matches the output of a
1817+
speech synthesis rendition based on the &lsquo;<a
18041818
href="#voice-family"><code class=property>voice-family</code></a>&rsquo;
1805-
intended for use), given "standard" listening conditions (i.e. default
1819+
intended for use, given "typical" listening conditions (i.e. default
18061820
system volume levels, centered equalization across the frequency
18071821
spectrum). As speech processors are capable of directly controlling the
18081822
waveform amplitude of generated text-to-speech audio, and because user
18091823
agents are able to adjust the volume output of audio cues (i.e. amplify or
18101824
attenuate audio signals based on the intrinsic waveform amplitude of
18111825
digitized sound clips), this sets a baseline that enables implementations
1812-
to "align" the loudness of both TTS and cue audio streams within the aural
1813-
box model, relative to user-configured volume levels (see the keywords
1826+
to manage the loudness of both TTS and cue audio streams within the aural
1827+
box model, relative to user-calibrated volume levels (see the keywords
18141828
defined in the &lsquo;<a href="#voice-volume"><code
18151829
class=property>voice-volume</code></a>&rsquo; property).
18161830

18171831
<p> Due to the complex relationship between perceived audio characteristics
18181832
(e.g. loudness) and the processing applied to the digitized audio signal
1819-
(e.g. "compression"), we refer to a simple scenario whereby the
1820-
attenuation is indicated in decibels, typically ranging from 0dB (maximum
1821-
audio input, near clipping threshold) to -60dB (total silence). Given this
1822-
context, a "standard" audio clip would oscillate between these values, the
1823-
loudest peak levels would be close to -3dB (to avoid distortion), and the
1824-
relevant audible passages would have average (RMS) volume levels as high
1825-
as possible (i.e. not too quiet, to avoid background noise during
1826-
amplification). This would roughly provide an audio experience that could
1827-
be seamlessly combined with text-to-speech output (i.e. there would be no
1828-
discernible difference in volume levels when switching from pre-recorded
1829-
audio to speech synthesis). Although there exists no industry-wide
1830-
standard to support such convention, different TTS engines tend to
1831-
generate comparably-loud audio signals when no gain or attenuation is
1832-
specified. For voice and soft music, -15dB RMS seems to be pretty
1833-
standard.
1833+
(e.g. signal compression), we refer to a simple scenario whereby the
1834+
attenuation is indicated in decibels, typically ranging from 0dB (i.e.
1835+
maximum audio input, near clipping threshold) to -60dB (i.e. total
1836+
silence). Given this context, a "standard" audio clip would oscillate
1837+
between these values, the loudest peak levels would be close to -3dB (to
1838+
avoid distortion), and the relevant audible passages would have average
1839+
(RMS) volume levels as high as possible (i.e. not too quiet, to avoid
1840+
background noise during amplification). This would roughly provide an
1841+
audio experience that could be seamlessly combined with text-to-speech
1842+
output (i.e. there would be no discernible difference in volume levels
1843+
when switching from pre-recorded audio to speech synthesis). Although
1844+
there exists no industry-wide standard to support such convention,
1845+
different TTS engines tend to generate comparably-loud audio signals when
1846+
no gain or attenuation is specified. For voice and soft music, -15dB RMS
1847+
seems to be pretty standard.
18341848

18351849
<h3 id=cue-props-cue><span class=secno>11.3. </span>The &lsquo;<a
18361850
href="#cue"><code class=property>cue</code></a>&rsquo; shorthand property</h3>
@@ -2257,7 +2271,10 @@ <h3 id=voice-props-voice-rate><span class=secno>12.2. </span>The &lsquo;<a
22572271
or otherwise to the inherited speaking rate (which may itself be a
22582272
combination of a keyword value and of a percentage, in which case
22592273
percentages are combined multiplicatively). For example, 50% means that
2260-
the speaking rate gets multiplied by 0.5 (half the value).</p>
2274+
the speaking rate gets multiplied by 0.5 (half the value). Percentages
2275+
above 100% result in faster speaking rates (relative to the base
2276+
keyword), whereas percentages below 100% result in slower speaking
2277+
rates.</p>
22612278
</dl>
22622279

22632280
<div class=example>
@@ -2289,7 +2306,7 @@ <h3 id=voice-props-voice-rate><span class=secno>12.2. </span>The &lsquo;<a
22892306
e2 { voice-rate: fast 120%; } /* the computed value is
22902307
['fast' and 120%], which will resolve
22912308
to the rate corresponding to 'fast'
2292-
multiplied by 1.2 (one and a half times the speaking rate) */
2309+
multiplied by 1.2 */
22932310

22942311
e3 { voice-rate: normal; /* "resets" the speaking rate to the intrinsic voice value,
22952312
the computed value is 'normal' (see comment below for actual value) */
@@ -2772,25 +2789,25 @@ <h3 id=voice-props-voice-stress><span class=secno>12.5. </span>The
27722789
<p>Examples of property values, with HTML sample:</p>
27732790

27742791
<pre>
2775-
span.default-emphasis { voice-stress: normal; }
2776-
span.lowered-emphasis { voice-stress: reduced; }
2777-
span.removed-emphasis { voice-stress: none; }
2778-
span.normal-emphasis { voice-stress: moderate; }
2779-
span.huge-emphasis { voice-stress: strong; }
2792+
.default-emphasis { voice-stress: normal; }
2793+
.lowered-emphasis { voice-stress: reduced; }
2794+
.removed-emphasis { voice-stress: none; }
2795+
.normal-emphasis { voice-stress: moderate; }
2796+
.huge-emphasis { voice-stress: strong; }
27802797

27812798
...
27822799

27832800
&lt;p&gt;This is a big car.&lt;/p&gt;
27842801
&lt;!-- The speech output from the line above is identical to the line below: --&gt;
2785-
&lt;p&gt;This is a &lt;span class="default-emphasis"&gt;big&lt;/span&gt; car.&lt;/p&gt;
2802+
&lt;p&gt;This is a &lt;em class="default-emphasis"&gt;big&lt;/em&gt; car.&lt;/p&gt;
27862803

2787-
&lt;p&gt;This car is &lt;span class="lowered-emphasis"&gt;massive&lt;/span&gt;!&lt;/p&gt;
2788-
&lt;!-- The "span" below is totally de-emphasized, whereas the emphasis in the line above is only reduced: --&gt;
2789-
&lt;p&gt;This car is &lt;span class="removed-emphasis"&gt;massive&lt;/span&gt;!&lt;/p&gt;
2804+
&lt;p&gt;This car is &lt;em class="lowered-emphasis"&gt;massive&lt;/em&gt;!&lt;/p&gt;
2805+
&lt;!-- The "em" below is totally de-emphasized, whereas the emphasis in the line above is only reduced: --&gt;
2806+
&lt;p&gt;This car is &lt;em class="removed-emphasis"&gt;massive&lt;/em&gt;!&lt;/p&gt;
27902807

27912808
&lt;!-- The lines below demonstrate increasing levels of emphasis: --&gt;
2792-
&lt;p&gt;This is a &lt;span class="normal-emphasis"&gt;big&lt;/span&gt; car!&lt;/p&gt;
2793-
&lt;p&gt;This is a &lt;span class="huge-emphasis"&gt;big&lt;/span&gt; car!!!&lt;/p&gt;</pre>
2809+
&lt;p&gt;This is a &lt;em class="normal-emphasis"&gt;big&lt;/em&gt; car!&lt;/p&gt;
2810+
&lt;p&gt;This is a &lt;em class="huge-emphasis"&gt;big&lt;/em&gt; car!!!&lt;/p&gt;</pre>
27942811
</div>
27952812

27962813
<h2 id=duration-props><span class=secno>13. </span>Voice duration property</h2>
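
The voice-volume and voice-rate hunks above rest on two small calculations. A decibel offset maps to a linear amplitude ratio: the text notes that +6.0dB is approximately twice the amplitude, and that 'silent' has the same effect as negative infinity decibels. Voice-rate percentages combine multiplicatively. The Python sketch below only illustrates that arithmetic and is not part of the commit or of the specification. The function names and the base rate of 200 are hypothetical, and the conversion assumes the usual 20*log10 amplitude convention.

```python
import math


def db_to_amplitude_ratio(db: float) -> float:
    """Convert a decibel offset to a linear amplitude ratio.

    Assumes the amplitude convention dB = 20 * log10(ratio).
    """
    return 10 ** (db / 20)


def resolve_rate(base_rate: float, *percentages: float) -> float:
    """Apply voice-rate percentages multiplicatively to a base rate.

    base_rate is a made-up number standing in for whatever rate a
    keyword such as 'fast' resolves to in a given implementation.
    """
    rate = base_rate
    for pct in percentages:
        rate *= pct / 100
    return rate


if __name__ == "__main__":
    # +6.0dB is close to, but not exactly, a doubling of amplitude.
    print(db_to_amplitude_ratio(6.0))
    # Negative infinity decibels collapses the ratio to zero ("silent").
    print(db_to_amplitude_ratio(float("-inf")))
    # 'fast 120%' with a hypothetical base of 200: multiplied by 1.2.
    print(resolve_rate(200, 120))
    # An inherited 50% combined with a further 120%: 0.5 * 1.2 of the base.
    print(resolve_rate(200, 50, 120))
```

This matches the corrected comment in the voice-rate example: 120% is a multiplication by 1.2, not "one and a half times" the speaking rate.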
