8888
8989 < h1 id =top > CSS Speech Module</ h1 >
9090
91- < h2 class ="no-num no-toc " id =longstatus-date > Editor's Draft 20 February
91+ < h2 class ="no-num no-toc " id =longstatus-date > Editor's Draft 21 February
9292 2012</ h2 >
9393
9494 < dl id =versions >
9595 < dt > This version:
9696
9797 < dd >
98- <!--<a href="http://www.w3.org/TR/2012/WD-css3-speech-20120220 /">http://www.w3.org/TR/2012/ED-css3-speech-20120220 /</a>-->
98+ <!--<a href="http://www.w3.org/TR/2012/WD-css3-speech-20120221 /">http://www.w3.org/TR/2012/ED-css3-speech-20120221 /</a>-->
9999 < a
100100 href ="http://dev.w3.org/csswg/css3-speech "> http://dev.w3.org/csswg/css3-speech</ a >
101101
@@ -535,8 +535,11 @@ <h2 id=aural-model><span class=secno>6. </span>The aural formatting model</h2>
535535 < p > The following diagram illustrates the equivalence between properties of
536536 the visual and aural box models, applied to the selected <element>:
537537
538- < p > < img alt ="A graph depicting the aural 'box' model. " id =aural-box
539- src =aural-box.png >
538+ < p > < img
539+ alt ="The aural 'box' model, illustrated by a diagram: the selected element is positioned in the center, on its left side are (from innermost to outermost) rest-before, cue-before, pause-before, on its right side are (from innermost to outermost) rest-after, cue-after, pause-after, where rest is conceptually similar to padding, cue is similar to border, pause is similar to margin. "
540+ id =aural-box src =aural-box.png
541+ title ="The aural 'box' model, illustrated by a diagram: the selected element is positioned in the center, on its left side are (from innermost to outermost) rest-before, cue-before, pause-before, on its right side are (from innermost to outermost) rest-after, cue-after, pause-after, where rest is conceptually similar to padding, cue is similar to border, pause is similar to margin. ">
542+
540543
541544 < h2 id =mixing-props > < span class =secno > 7. </ span > Mixing properties</ h2 >
542545
@@ -585,15 +588,16 @@ <h3 id=mixing-props-voice-volume><span class=secno>7.1. </span>The
585588 < tr >
586589 < td > < em > Computed value:</ em >
587590
588- < td > a keyword value, and optionally also a decibel offset (if not zero)
591+ < td > ‘< code class =property > silent</ code > ’, or a keyword value
592+ and optionally also a decibel offset (if not zero)
589593 </ table >
590594
591595 < p > The ‘< a href ="#voice-volume "> < code
592596 class =property > voice-volume</ code > </ a > ’ property allows authors to
593597 control the amplitude of the audio waveform generated by the speech
594598 synthesiser, and is also used to adjust the relative volume level of < a
595- href ="#cue-props "> audio cues</ a > within the < a href ="#aural-model "> audio
596- box model</ a > .
599+ href ="#cue-props "> audio cues</ a > within the < a href ="#aural-model "> aural
600+ box model</ a > of the selected element .
597601
598602 < p class =note > Note that although the functionality provided by this
599603 property is similar to the < a
@@ -615,36 +619,40 @@ <h3 id=mixing-props-voice-volume><span class=secno>7.1. </span>The
615619 < dt > < strong > silent</ strong >
616620
617621 < dd >
618- < p > Specifies that no sound is generated (the text is read "silently").
619- Corresponds to negative infinity in dB units.</ p >
622+ < p > Specifies that no sound is generated (the text is read "silently").</ p >
620623
621- < p class =note > Note that there is a difference between an element whose
622- ‘< a href ="#voice-volume "> < code
624+ < p class =note > Note that this has the same effect as using negative
625+ infinity decibels. Also note that there is a difference between an
626+ element whose ‘< a href ="#voice-volume "> < code
623627 class =property > voice-volume</ code > </ a > ’ property has a value of
624628 ‘< code class =property > silent</ code > ’, and an element whose
625629 ‘< a href ="#speak "> < code class =property > speak</ code > </ a > ’
626630 property has the value ‘< code class =property > none</ code > ’.
627631 With the former, the selected element takes up the same time as if it
628632 were spoken, including any pause before and after the element, but no
629- sound is generated (descendants can override the ‘< a
633+ sound is generated (descendants within the < a href ="#aural-model "> aural
634+ box model</ a > of the selected element can override the ‘< a
630635 href ="#voice-volume "> < code class =property > voice-volume</ code > </ a > ’
631- value and may therefore generate audio output). With the latter, the
636+ value, and may therefore generate audio output). With the latter, the
632637 selected element is not rendered in the aural dimension and no time is
633- allocated for playback (descendants can override the ‘< a
634- href ="#speak "> < code class =property > speak</ code > </ a > ’ value and may
635- therefore generate audio output).</ p >
638+ allocated for playback (descendants within the < a
639+ href ="#aural-model "> aural box model</ a > of the selected element can
640+ override the ‘< a href ="#speak "> < code
641+ class =property > speak</ code > </ a > ’ value, and may therefore generate
642+ audio output).</ p >
636643
637644 < dt > < strong > x-soft</ strong > , < strong > soft</ strong > ,
638645 < strong > medium</ strong > , < strong > loud</ strong > , < strong > x-loud</ strong >
639646
640647 < dd >
641648 < p > This sequence of keywords corresponds to monotonically non-decreasing
642649 volume levels, mapped to implementation-dependent values that meet the
643- listener's requirements with regards to perceived sound loudness. These
644- audio levels are typically provided via a preference mechanism that
645- allow users to set options according to their auditory environment. The
646- keyword ‘< code class =property > x-soft</ code > ’ maps to the
647- user's < em > minimum audible</ em > volume level, ‘< code
650+ listener's requirements with regard to perceived loudness. These audio
651+ levels are typically provided via a preference mechanism that allows
652+ users to calibrate sound options according to their auditory
653+ environment. The keyword ‘< code
654+ class =property > x-soft</ code > ’ maps to the user's < em > minimum
655+ audible</ em > volume level, ‘< code
648656 class =property > x-loud</ code > ’ maps to the user's < em > maximum
649657 tolerable</ em > volume level, ‘< code
650658 class =property > medium</ code > ’ maps to the user's
@@ -674,12 +682,12 @@ <h3 id=mixing-props-voice-volume><span class=secno>7.1. </span>The
674682 the audio signal, and +6.0dB is approximately twice the amplitude.</ p >
675683 </ dl >
676684
677- < p class =note > Note that the actual perceived volume levels depend on
678- various factors, such as the listening environment and personal user
679- preferences. The effective volume variation between ‘< code
685+ < p class =note > Note that perceived loudness depends on various factors, such
686+ as the listening environment, user preferences or physical abilities. The
687+ effective volume variation between ‘< code
680688 class =property > x-soft</ code > ’ and ‘< code
681689 class =property > x-loud</ code > ’ represents the dynamic range (in terms
682- of loudness) of the speech output. Typically, this range would be
690+ of loudness) of the audio output. Typically, this range would be
683691 compressed in a noisy context, i.e. the perceived loudness corresponding
684692 to ‘< code class =property > x-soft</ code > ’ would effectively be
685693 closer to ‘< code class =property > x-loud</ code > ’ than it would
@@ -1485,7 +1493,7 @@ <h3 id=rest-props-rest-before-after><span class=secno>10.1. </span>The
14851493 href ="#rest-after "> < code class =property > rest-after</ code > </ a > ’
14861494 properties specify a prosodic boundary (silence with a specific duration)
14871495 that occurs before (or after) the speech synthesis rendition of an element
1488- within the < a href ="#aural-model "> audio box model</ a > .
1496+ within the < a href ="#aural-model "> aural box model</ a > .
14891497
14901498 < p class =note > Note that although the functionality provided by this
14911499 property is similar to the < a
@@ -1690,7 +1698,7 @@ <h3 id=cue-props-cue-before-after><span class=secno>11.1. </span>The
16901698 href ="#cue-after "> < code class =property > cue-after</ code > </ a > ’
16911699 properties specify auditory icons (i.e. pre-recorded / pre-generated sound
16921700 clips) to be played before (or after) the selected element within the < a
1693- href ="#aural-model "> audio box model</ a > .
1701+ href ="#aural-model "> aural box model</ a > .
16941702
16951703 < p class =note > Note that although the functionality provided by this
16961704 property may appear related to the < a
@@ -1724,21 +1732,21 @@ <h3 id=cue-props-cue-before-after><span class=secno>11.1. </span>The
17241732 to the computed value of the ‘< a href ="#voice-volume "> < code
17251733 class =property > voice-volume</ code > </ a > ’ property within the < a
17261734 href ="#aural-model "> aural box model</ a > of the selected element (as a
1727- result, the volume level of audio cues changes when the ‘< a
1735+ result, the volume level of an audio cue changes when the ‘< a
17281736 href ="#voice-volume "> < code class =property > voice-volume</ code > </ a > ’
17291737 property changes). When omitted, the implied value computes to 0dB.</ p >
17301738
1731- < p > When the ‘< a href ="#voice-volume "> < code
1732- class =property > voice-volume</ code > </ a > ’ property is set to
1733- ‘ < code class =property > silent</ code > ’, the audio cue is also
1734- set to ‘< code class =property > silent</ code > ’ (regardless of
1735- this specified <decibel> value). Otherwise (when not ‘< code
1739+ < p > When the computed value of the ‘< a href ="#voice-volume "> < code
1740+ class =property > voice-volume</ code > </ a > ’ property is ‘ < code
1741+ class =property > silent</ code > ’, the audio cue is also set to
1742+ ‘< code class =property > silent</ code > ’ (regardless of this
1743+ specified <decibel> value). Otherwise (when not ‘< code
17361744 class =property > silent</ code > ’), ‘< a
17371745 href ="#voice-volume "> < code class =property > voice-volume</ code > </ a > ’
17381746 values are always specified relative to the volume level keywords (see
17391747 the definition of ‘< a href ="#voice-volume "> < code
17401748 class =property > voice-volume</ code > </ a > ’), which map to a
1741- user-configured scale of "preferred" loudness settings. If the inherited
1749+ user-calibrated scale of "preferred" loudness settings. If the inherited
17421750 ‘< a href ="#voice-volume "> < code
17431751 class =property > voice-volume</ code > </ a > ’ value already contains a
17441752 decibel offset, the dB offset specific to the audio cue is combined
@@ -1789,48 +1797,54 @@ <h3 id=cue-props-volume><span class=secno>11.2. </span>Relation between
17891797 For example, the desired effect of an audio cue whose volume level is set
17901798 at +0dB (as specified by the <decibel> value) is that its perceived
17911799 loudness during playback is close to that of the speech synthesis
1792- rendition of the selected element, as dictated by computed value of the
1800+ rendition of the selected element, as dictated by the computed value of
1801+ the ‘< a href ="#voice-volume "> < code
1802+ class =property > voice-volume</ code > </ a > ’ property. Note that a
1803+ ‘< code class =property > silent</ code > ’ computed value for the
17931804 ‘< a href ="#voice-volume "> < code
1794- class =property > voice-volume</ code > </ a > ’ property (which is itself
1795- based on a user-configured volume level keyword). Similarly, a
1796- ‘< code class =property > silent</ code > ’ value for the ‘< a
1797- href ="#voice-volume "> < code class =property > voice-volume</ code > </ a > ’
1798- property results on any audio cues being "silenced" as well.
1799-
1800- < p > In order to achieve this effect, authors should ensure that the volume
1801- level of audio cues (on average, as there may be discrete loudness
1802- variations due to changes in the audio stream, such as intonation, stress,
1803- etc.) matches that of a "typical" TTS voice output (based on the ‘< a
1805+ class =property > voice-volume</ code > </ a > ’ property results in audio
1806+ cues being "forcefully" silenced as well (i.e. regardless of the specified
1807+ audio cue ‘< code class =property > decibel</ code > ’ value)
1808+
1809+ < p > The volume keywords of the ‘< a href ="#voice-volume "> < code
1810+ class =property > voice-volume</ code > </ a > ’ property are user-calibrated
1811+ to match requirements not known at authoring time (e.g. auditory
1812+ environment, personal preferences). Therefore, in order to achieve this
1813+ approximate loudness alignment of audio cues and speech synthesis, authors
1814+ should ensure that the volume level of audio cues (on average, as there
1815+ may be discrete variations of perceived loudness due to changes in the
1816+ audio stream, such as intonation, stress, etc.) matches the output of a
1817+ speech synthesis rendition based on the ‘< a
18041818 href ="#voice-family "> < code class =property > voice-family</ code > </ a > ’
1805- intended for use) , given "standard " listening conditions (i.e. default
1819+ intended for use, given "typical " listening conditions (i.e. default
18061820 system volume levels, centered equalization across the frequency
18071821 spectrum). As speech processors are capable of directly controlling the
18081822 waveform amplitude of generated text-to-speech audio, and because user
18091823 agents are able to adjust the volume output of audio cues (i.e. amplify or
18101824 attenuate audio signals based on the intrinsic waveform amplitude of
18111825 digitized sound clips), this sets a baseline that enables implementations
1812- to "align" the loudness of both TTS and cue audio streams within the aural
1813- box model, relative to user-configured volume levels (see the keywords
1826+ to manage the loudness of both TTS and cue audio streams within the aural
1827+ box model, relative to user-calibrated volume levels (see the keywords
18141828 defined in the ‘< a href ="#voice-volume "> < code
18151829 class =property > voice-volume</ code > </ a > ’ property).
18161830
18171831 < p > Due to the complex relationship between perceived audio characteristics
18181832 (e.g. loudness) and the processing applied to the digitized audio signal
1819- (e.g. " compression" ), we refer to a simple scenario whereby the
1820- attenuation is indicated in decibels, typically ranging from 0dB (maximum
1821- audio input, near clipping threshold) to -60dB (total silence). Given this
1822- context, a "standard" audio clip would oscillate between these values, the
1823- loudest peak levels would be close to -3dB (to avoid distortion), and the
1824- relevant audible passages would have average (RMS) volume levels as high
1825- as possible (i.e. not too quiet, to avoid background noise during
1826- amplification). This would roughly provide an audio experience that could
1827- be seamlessly combined with text-to-speech output (i.e. there would be no
1828- discernible difference in volume levels when switching from pre-recorded
1829- audio to speech synthesis). Although there exists no industry-wide
1830- standard to support such convention, different TTS engines tend to
1831- generate comparably-loud audio signals when no gain or attenuation is
1832- specified. For voice and soft music, -15dB RMS seems to be pretty
1833- standard.
1833+ (e.g. signal compression), we refer to a simple scenario whereby the
1834+ attenuation is indicated in decibels, typically ranging from 0dB (i.e.
1835+ maximum audio input, near clipping threshold) to -60dB (i.e. total
1836+ silence). Given this context, a "standard" audio clip would oscillate
1837+ between these values, the loudest peak levels would be close to -3dB (to
1838+ avoid distortion), and the relevant audible passages would have average
1839+ (RMS) volume levels as high as possible (i.e. not too quiet, to avoid
1840+ background noise during amplification). This would roughly provide an
1841+ audio experience that could be seamlessly combined with text-to-speech
1842+ output (i.e. there would be no discernible difference in volume levels
1843+ when switching from pre-recorded audio to speech synthesis). Although
1844+ there exists no industry-wide standard to support such a convention,
1845+ different TTS engines tend to generate comparably-loud audio signals when
1846+ no gain or attenuation is specified. For voice and soft music, an average
1847+ level of about -15dB RMS appears to be common.
18341848
18351849 < h3 id =cue-props-cue > < span class =secno > 11.3. </ span > The ‘< a
18361850 href ="#cue "> < code class =property > cue</ code > </ a > ’ shorthand property</ h3 >
@@ -2257,7 +2271,10 @@ <h3 id=voice-props-voice-rate><span class=secno>12.2. </span>The ‘<a
22572271 or otherwise to the inherited speaking rate (which may itself be a
22582272 combination of a keyword value and of a percentage, in which case
22592273 percentages are combined multiplicatively). For example, 50% means that
2260- the speaking rate gets multiplied by 0.5 (half the value).</ p >
2274+ the speaking rate gets multiplied by 0.5 (half the value). Percentages
2275+ above 100% result in faster speaking rates (relative to the base
2276+ keyword), whereas percentages below 100% result in slower speaking
2277+ rates.</ p >
22612278 </ dl >
22622279
22632280 < div class =example >
@@ -2289,7 +2306,7 @@ <h3 id=voice-props-voice-rate><span class=secno>12.2. </span>The ‘<a
22892306 e2 { voice-rate: fast 120%; } /* the computed value is
22902307 ['fast' and 120%], which will resolve
22912308 to the rate corresponding to 'fast'
2292- multiplied by 1.2 (one and a half times the speaking rate) */
2309+ multiplied by 1.2 */
22932310
22942311 e3 { voice-rate: normal; /* "resets" the speaking rate to the intrinsic voice value,
22952312 the computed value is 'normal' (see comment below for actual value) */
@@ -2772,25 +2789,25 @@ <h3 id=voice-props-voice-stress><span class=secno>12.5. </span>The
27722789 < p > Examples of property values, with HTML sample:</ p >
27732790
27742791 < pre >
2775- span .default-emphasis { voice-stress: normal; }
2776- span .lowered-emphasis { voice-stress: reduced; }
2777- span .removed-emphasis { voice-stress: none; }
2778- span .normal-emphasis { voice-stress: moderate; }
2779- span .huge-emphasis { voice-stress: strong; }
2792+ .default-emphasis { voice-stress: normal; }
2793+ .lowered-emphasis { voice-stress: reduced; }
2794+ .removed-emphasis { voice-stress: none; }
2795+ .normal-emphasis { voice-stress: moderate; }
2796+ .huge-emphasis { voice-stress: strong; }
27802797
27812798 ...
27822799
27832800 <p>This is a big car.</p>
27842801 <!-- The speech output from the line above is identical to the line below: -->
2785- <p>This is a <span class="default-emphasis">big</span > car.</p>
2802+ <p>This is a <em class="default-emphasis">big</em > car.</p>
27862803
2787- <p>This car is <span class="lowered-emphasis">massive</span >!</p>
2788- <!-- The "span " below is totally de-emphasized, whereas the emphasis in the line above is only reduced: -->
2789- <p>This car is <span class="removed-emphasis">massive</span >!</p>
2804+ <p>This car is <em class="lowered-emphasis">massive</em >!</p>
2805+ <!-- The "em " below is totally de-emphasized, whereas the emphasis in the line above is only reduced: -->
2806+ <p>This car is <em class="removed-emphasis">massive</em >!</p>
27902807
27912808 <!-- The lines below demonstrate increasing levels of emphasis: -->
2792- <p>This is a <span class="normal-emphasis">big</span > car!</p>
2793- <p>This is a <span class="huge-emphasis">big</span > car!!!</p></ pre >
2809+ <p>This is a <em class="normal-emphasis">big</em > car!</p>
2810+ <p>This is a <em class="huge-emphasis">big</em > car!!!</p></ pre >
27942811 </ div >
27952812
27962813 < h2 id =duration-props > < span class =secno > 13. </ span > Voice duration property</ h2 >
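
The hunks above touch two arithmetic rules: decibel offsets (where +6.0dB is described as approximately twice the amplitude, and 'silent' as equivalent to negative infinity decibels) and 'voice-rate' percentages (which combine multiplicatively). The following is a minimal numerical sketch of both. It is not part of the commit or of the specification. It assumes the common 20*log10 amplitude convention for decibels, and the helper names `db_to_amplitude_ratio` and `combine_rate_percentages` are hypothetical.

```python
import math


def db_to_amplitude_ratio(db: float) -> float:
    """Convert a decibel offset to a linear amplitude ratio.

    Assumes the amplitude convention: ratio = 10 ** (dB / 20).
    Negative infinity decibels is treated as complete silence (ratio 0).
    """
    if db == -math.inf:
        return 0.0
    return 10.0 ** (db / 20.0)


def combine_rate_percentages(*percentages: float) -> float:
    """Combine speaking-rate percentages multiplicatively.

    Returns the multiplier applied to the base (keyword) speaking rate,
    e.g. 120 followed by 50 yields 1.2 * 0.5.
    """
    multiplier = 1.0
    for percentage in percentages:
        multiplier *= percentage / 100.0
    return multiplier


if __name__ == "__main__":
    # +6dB is close to, but not exactly, a doubling of the amplitude.
    print(round(db_to_amplitude_ratio(6.0), 3))
    # 'fast 120%' resolves to the 'fast' rate multiplied by 1.2.
    print(combine_rate_percentages(120))
```

Under these assumptions, a percentage above 100 gives a multiplier greater than 1 (faster speech) and a percentage below 100 gives a multiplier smaller than 1 (slower speech), which matches the wording added in the 'voice-rate' hunk.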