@@ -427,7 +427,7 @@ <h2 id=example><span class=secno>3. </span>Example</h2>
427427 < div class =example >
428428 < p > This example shows how authors can tell the speech synthesizer to speak
429429 HTML headings with a voice called "paul", using "moderate" emphasis
430- (which is more than normal) and how to insert an audio cue (prerecorded
430+ (which is more than normal) and how to insert an audio cue (pre-recorded
431431 audio clip located at the given URL) before the start of TTS rendering
432432 for each heading. In a stereo-capable sound system, paragraphs marked
433433 with the CSS class "heidi" are rendered on the left audio channel (and
@@ -554,9 +554,9 @@ <h3 id=mixing-props-voice-volume><span class=secno>5.1. </span>The
554554 </ table >
555555
556556 < p > The ‘< a href ="#voice-volume "> < code
557- class =property > voice-volume</ code > </ a > ’ property manipulates the
558- amplitude of the audio waveform generated by the speech synthesiser, and
559- is also used when calculating the relative volume level of < a
557+ class =property > voice-volume</ code > </ a > ’ property allows authors to
558+ control the amplitude of the audio waveform generated by the speech
559+ synthesiser, and is also used to adjust the relative volume level of < a
560560 href ="#cue-props "> audio cues</ a > within the < a href ="#aural-model "> audio
561561 "box" model</ a > .
562562
@@ -1225,9 +1225,9 @@ <h3 id=pause-props-pause><span class=secno>7.2. </span>The ‘<a
12251225 < td > < em > Value:</ em >
12261226
12271227 < td > <‘< a href ="#pause-before "> < code
1228- class =property > pause-before</ code > </ a > ’> || <‘< a
1228+ class =property > pause-before</ code > </ a > ’> <‘< a
12291229 href ="#pause-after "> < code
1230- class =property > pause-after</ code > </ a > ’>
1230+ class =property > pause-after</ code > </ a > ’>?
12311231
12321232 < tr >
12331233 < td > < em > Initial:</ em >
@@ -1495,9 +1495,9 @@ <h3 id=rest-props-rest><span class=secno>8.2. </span>The ‘<a
14951495 < td > < em > Value:</ em >
14961496
14971497 < td > <‘< a href ="#rest-before "> < code
1498- class =property > rest-before</ code > </ a > ’> || <‘< a
1498+ class =property > rest-before</ code > </ a > ’> <‘< a
14991499 href ="#rest-after "> < code
1500- class =property > rest-after</ code > </ a > ’>
1500+ class =property > rest-after</ code > </ a > ’>?
15011501
15021502 < tr >
15031503 < td > < em > Initial:</ em >
@@ -1639,8 +1639,8 @@ <h3 id=cue-props-cue-before-after><span class=secno>9.1. </span>The
16391639 < p > The ‘< a href ="#cue-before "> < code
16401640 class =property > cue-before</ code > </ a > ’ and ‘< a
16411641 href ="#cue-after "> < code class =property > cue-after</ code > </ a > ’
1642- properties specify auditory icons (i.e. prerecorded audio clips) to be
1643- played before (or after) the selected element within the < a
1642+ properties specify auditory icons (i.e. pre-recorded / pre-generated sound
1643+ clips) to be played before (or after) the selected element within the < a
16441644 href ="#aural-model "> audio "box" model</ a > .
16451645
16461646 < p class =note > Note that the functionality provided by this property is
@@ -1670,15 +1670,61 @@ <h3 id=cue-props-cue-before-after><span class=secno>9.1. </span>The
16701670 (decibel unit). This represents a change (positive or negative) relative
16711671 to the computed value of the ‘< a href ="#voice-volume "> < code
16721672 class =property > voice-volume</ code > </ a > ’ property within the < a
1673- href ="#aural-model "> aural "box" model</ a > of the selected element. When
1674- the ‘< a href ="#voice-volume "> < code
1673+ href ="#aural-model "> aural "box" model</ a > of the selected element.
1674+ Decibels express the ratio of the squares of the new signal amplitude
1675+ (a1) and the current amplitude (a0), as per the following logarithmic
1676+ equation: volume(dB) = 20 log10 (a1 / a0)</ p >
1677+
1678+ < p > When the ‘< a href ="#voice-volume "> < code
16751679 class =property > voice-volume</ code > </ a > ’ property is set to
16761680 ‘< code class =property > silent</ code > ’, the audio cue is also
16771681 set to ‘< code class =property > silent</ code > ’ (regardless of
1678- the value specified for this <decibel>). Decibels express the
1679- ratio of the squares of the new signal amplitude (a1) and the current
1680- amplitude (a0), as per the following logarithmic equation: volume(dB) =
1681- 20 log10 (a1 / a0)</ p >
1682+ this specified <decibel> value). Otherwise (when not ‘< code
1683+ class =property > silent</ code > ’), ‘< a
1684+ href ="#voice-volume "> < code class =property > voice-volume</ code > </ a > ’
1685+ values are always specified relative to the volume level keywords,
1686+ which map to a user-configured scale of "preferred" loudness settings
1687+ (see the definition of ‘< a href ="#voice-volume "> < code
1688+ class =property > voice-volume</ code > </ a > ’). If the inherited
1689+ ‘< a href ="#voice-volume "> < code
1690+ class =property > voice-volume</ code > </ a > ’ value already contains a
1691+ decibel offset, the dB offset specific to the audio cue is combined
1692+ additively.
1693+
1694+ < p > The desired effect of an audio cue set at +0dB is that the volume
1695+ level during playback of the pre-recorded / pre-generated audio signal
1696+ is effectively the same as the volume level of live (i.e. real-time)
1697+ speech synthesis rendition. In order to achieve this effect, speech
1698+ processors must be capable of directly controlling the waveform amplitude of
1699+ generated text-to-speech audio, user agents must be able to adjust the
1700+ volume output of audio cues (i.e. amplify or attenuate audio signals
1701+ based on the intrinsic waveform amplitude of sound clips), and last but
1702+ not least, authors must ensure that the "normal" volume level of
1703+ pre-recorded audio cues (on average, as there may be discrete variations
1704+ due to changes in the audio stream, such as intonation, stress, etc.)
1705+ matches that of a "typical" TTS voice output (based on the ‘< a
1706+ href ="#voice-family "> < code class =property > voice-family</ code > </ a > ’
1707+ intended for use), given standard listening conditions (i.e. default
1708+ system volume levels, centered equalization across the frequency
1709+ spectrum). This latter prerequisite sets a baseline that enables a user
1710+ agent to align the volume outputs of both TTS and cue audio streams
1711+ within the same "aural box model". Due to the complex relationship
1712+ between perceived audio characteristics and the processing applied to
1713+ the digitized audio signal, we will simplify the definition of "normal"
1714+ volume levels by referring to a canonical recording scenario, whereby
1715+ the attenuation is typically indicated in decibels, ranging from 0dB
1716+ (maximum audio input, near clipping threshold) to -60dB (total silence).
1717+ In this common context, a "standard" audio clip would oscillate between
1718+ these values, the loudest peak levels would be close to -3dB (to avoid
1719+ distortion), and the audible passages would have average volume levels
1720+ as high as possible (i.e. not too quiet, to avoid background noise
1721+ during amplification). This would roughly provide an audio experience
1722+ that could be seamlessly combined with text-to-speech output (i.e. there
1723+ would be no discernible difference in volume levels when switching from
1724+ pre-recorded audio to speech synthesis). Although there exists no
1725+ industry-wide standard to back up such a convention, TTS engines usually
1726+ generate comparably loud audio signals when no amplification (or
1727+ attenuation) is specified.</ p >
16821728
16831729 < p class =note > Note that -6.0dB is approximately half the amplitude of
16841730 the audio signal, and +6.0dB is approximately twice the amplitude.</ p >
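
The decibel relationship quoted in this hunk, volume(dB) = 20 log10 (a1 / a0), and the additive combination of an inherited dB offset with a cue-specific one can be checked numerically. The following is a minimal sketch, not part of the specification; the function names are hypothetical and chosen only for illustration.

```python
import math


def amplitude_ratio_to_db(a1: float, a0: float) -> float:
    """Return 20 * log10(a1 / a0): the change in dB from amplitude a0 to a1."""
    if a1 <= 0 or a0 <= 0:
        raise ValueError("amplitudes must be positive")
    return 20.0 * math.log10(a1 / a0)


def db_to_amplitude_ratio(db: float) -> float:
    """Inverse of the formula above: the amplitude ratio a1 / a0 for a dB change."""
    return 10.0 ** (db / 20.0)


def combine_offsets(inherited_db: float, cue_db: float) -> float:
    """Combine an inherited voice-volume dB offset with a cue offset additively.

    Assumption: the 'silent' case is handled elsewhere and is not modelled here.
    """
    return inherited_db + cue_db


if __name__ == "__main__":
    print(round(amplitude_ratio_to_db(2.0, 1.0), 2))  # doubling the amplitude
    print(round(db_to_amplitude_ratio(-6.0), 3))      # close to one half
    print(combine_offsets(6.0, -6.0))                 # offsets cancel out
```

Doubling the amplitude gives roughly +6.02dB, and -6.0dB corresponds to a ratio of about 0.501, which is why the note describes these values as "approximately" twice and half the amplitude.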
@@ -1906,15 +1952,16 @@ <h3 id=voice-props-voice-family><span class=secno>10.1. </span>The
19061952 ranges may be used by the processor-dependent voice-matching algorithm).
19071953 </ p >
19081954
1909- < p class =note > The interpretation of the relationship between a person's
1910- age and a recognizable type of voice cannot realistically be defined in
1911- a universal manner, as it effectively depends on numerous cultural and
1912- linguistic variations. The values provided by this specification
1913- therefore represent a simplified model that can be reasonably applied to
1914- a great variety of speech locales, albeit at the cost of a certain
1915- degree of approximation. Future versions of this specification may
1916- refine the level of precision of the voice-matching algorithm, as speech
1917- processor implementations become more standardized.</ p >
1955+ < p class =note > Note that the interpretation of the relationship between a
1956+ person's age and a recognizable type of voice cannot realistically be
1957+ defined in a universal manner, as it effectively depends on numerous
1958+ criteria (cultural, linguistic, biological, etc.). The values provided
1959+ by this specification therefore represent a simplified model that can be
1960+ reasonably applied to a broad variety of speech contexts, albeit at the
1961+ cost of a certain degree of approximation. Future versions of this
1962+ specification may refine the level of precision of the voice-matching
1963+ algorithm, as speech processor implementations become more standardized.
1964+ </ p >
19181965
19191966 < dt > < strong > <gender></ strong >
19201967
@@ -2218,10 +2265,11 @@ <h3 id=voice-props-voice-pitch><span class=secno>10.3. </span>The ‘<a
22182265 < tr >
22192266 < td > < em > Computed value:</ em >
22202267
2221- < td > one of the predefined keywords if only the keyword is specified by
2222- itself, otherwise a fixed frequency calculated by converting the
2223- keyword value (if any) to an absolute value based on the current
2224- voice-family and by applying the specified relative offset (if any)
2268+ < td > one of the predefined pitch keywords if only the keyword is
2269+ specified by itself, otherwise an absolute frequency calculated by
2270+ converting the keyword value (if any) to a fixed frequency based on the
2271+ current voice-family and by applying the specified relative offset (if
2272+ any)
22252273 </ table >
22262274
22272275 < p > The ‘< a href ="#voice-pitch "> < code
@@ -2306,14 +2354,14 @@ <h3 id=voice-props-voice-pitch><span class=secno>10.3. </span>The ‘<a
23062354 the conversion from a keyword to a concrete, voice-dependent frequency.</ p >
23072355 </ dl >
23082356
2309- < p > Computed absolute frequency values that are negative are clamped to
2310- zero Hertz. Speech-capable user agents are likely to support a specific
2311- range of values rather than the full range of possible calculated
2312- numerical values for frequencies. The actual values in user agents may
2313- therefore be clamped to implementation-dependent minimum and maximum
2314- boundaries. For example: although the 0Hz frequency can be legitimately
2315- calculated, it may be clamped to a more meaningful value in the context of
2316- the speech synthesizer.
2357+ < p > Computed absolute frequencies that are negative are clamped to zero
2358+ Hertz. Speech-capable user agents are likely to support a specific range
2359+ of values rather than the full range of possible calculated numerical
2360+ values for frequencies. The actual values in user agents may therefore be
2361+ clamped to implementation-dependent minimum and maximum boundaries. For
2362+ example: although the 0Hz frequency can be legitimately calculated, it may
2363+ be clamped to a more meaningful value in the context of the speech
2364+ synthesizer.
23172365
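
The two-stage clamping described in the paragraph above (negative values go to zero Hertz, then an optional implementation-dependent range applies) could be sketched as follows. This is an illustration under stated assumptions: the function name is hypothetical, and the 50Hz and 500Hz bounds in the usage lines are invented placeholders, not values taken from the specification or from any speech synthesizer.

```python
from typing import Optional


def clamp_frequency(hz: float,
                    ua_min: Optional[float] = None,
                    ua_max: Optional[float] = None) -> float:
    """Clamp a computed absolute frequency in Hertz.

    Step 1: negative values are clamped to zero Hertz.
    Step 2: if the user agent defines bounds (assumed, optional),
            the value is clamped to that range.
    """
    hz = max(hz, 0.0)
    if ua_min is not None:
        hz = max(hz, ua_min)
    if ua_max is not None:
        hz = min(hz, ua_max)
    return hz


if __name__ == "__main__":
    print(clamp_frequency(-20.0))              # no user agent bounds supplied
    print(clamp_frequency(0.0, 50.0, 500.0))   # hypothetical bounds
    print(clamp_frequency(800.0, 50.0, 500.0))
```

With no bounds, a negative result becomes 0Hz. With the placeholder bounds, a legitimately calculated 0Hz is raised to the lower bound, which mirrors the "more meaningful value" case mentioned in the text.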
23182366 < div class =example >
23192367 < p > Examples of property values:</ p >
@@ -2377,10 +2425,11 @@ <h3 id=voice-props-voice-range><span class=secno>10.4. </span>The ‘<a
23772425 < tr >
23782426 < td > < em > Computed value:</ em >
23792427
2380- < td > one of the predefined keywords if only the keyword is specified by
2381- itself, otherwise a fixed frequency calculated by converting the
2382- keyword value (if any) to an absolute value based on the current
2383- voice-family and by applying the specified relative offset (if any)
2428+ < td > one of the predefined pitch keywords if only the keyword is
2429+ specified by itself, otherwise an absolute frequency calculated by
2430+ converting the keyword value (if any) to a fixed frequency based on the
2431+ current voice-family and by applying the specified relative offset (if
2432+ any)
23842433 </ table >
23852434
23862435 < p > The ‘< a href ="#voice-range "> < code
@@ -2465,14 +2514,14 @@ <h3 id=voice-props-voice-range><span class=secno>10.4. </span>The ‘<a
24652514 the conversion from a keyword to a concrete, voice-dependent frequency.</ p >
24662515 </ dl >
24672516
2468- < p > Computed absolute frequency values that are negative are clamped to
2469- zero Hertz. Speech-capable user agents are likely to support a specific
2470- range of values rather than the full range of possible calculated
2471- numerical values for frequencies. The actual values in user agents may
2472- therefore be clamped to implementation-dependent minimum and maximum
2473- boundaries. For example: although the 0Hz frequency can be legitimately
2474- calculated, it may be clamped to a more meaningful value in the context of
2475- the speech synthesizer.
2517+ < p > Computed absolute frequencies that are negative are clamped to zero
2518+ Hertz. Speech-capable user agents are likely to support a specific range
2519+ of values rather than the full range of possible calculated numerical
2520+ values for frequencies. The actual values in user agents may therefore be
2521+ clamped to implementation-dependent minimum and maximum boundaries. For
2522+ example: although the 0Hz frequency can be legitimately calculated, it may
2523+ be clamped to a more meaningful value in the context of the speech
2524+ synthesizer.
24762525
24772526 < div class =example >
24782527 < p > Examples of inherited values:</ p >
@@ -3000,8 +3049,8 @@ <h2 class=no-num id=property-index>Appendix A — Property index</h2>
30003049 < tr >
30013050 < th > < a class =property href ="#pause "> pause</ a >
30023051
3003- < td > <‘pause-before’> ||
3004- <‘pause-after’>
3052+ < td > <‘pause-before’>
3053+ <‘pause-after’>?
30053054
30063055 < td > N/A (see individual properties)
30073056
@@ -3046,8 +3095,7 @@ <h2 class=no-num id=property-index>Appendix A — Property index</h2>
30463095 < tr >
30473096 < th > < a class =property href ="#rest "> rest</ a >
30483097
3049- < td > <‘rest-before’> ||
3050- <‘rest-after’>
3098+ < td > <‘rest-before’> <‘rest-after’>?
30513099
30523100 < td > N/A (see individual properties)
30533101