Commit 430b02a

attempt to normatively clarify the audio cue / TTS volume level normalization.
1 parent b7ac1c8 commit 430b02a

2 files changed

Lines changed: 172 additions & 94 deletions

css3-speech/Overview.html

Lines changed: 101 additions & 53 deletions
Original file line number | Diff line number | Diff line change
@@ -427,7 +427,7 @@ <h2 id=example><span class=secno>3. </span>Example</h2>
427427
<div class=example>
428428
<p>This example shows how authors can tell the speech synthesizer to speak
429429
HTML headings with a voice called "paul", using "moderate" emphasis
430-
(which is more than normal) and how to insert an audio cue (prerecorded
430+
(which is more than normal) and how to insert an audio cue (pre-recorded
431431
audio clip located at the given URL) before the start of TTS rendering
432432
for each heading. In a stereo-capable sound system, paragraphs marked
433433
with the CSS class "heidi" are rendered on the left audio channel (and
@@ -554,9 +554,9 @@ <h3 id=mixing-props-voice-volume><span class=secno>5.1. </span>The
554554
</table>
555555

556556
<p>The &lsquo;<a href="#voice-volume"><code
557-
class=property>voice-volume</code></a>&rsquo; property manipulates the
558-
amplitude of the audio waveform generated by the speech synthesiser, and
559-
is also used when calculating the relative volume level of <a
557+
class=property>voice-volume</code></a>&rsquo; property allows authors to
558+
control the amplitude of the audio waveform generated by the speech
559+
synthesiser, and is also used to adjust the relative volume level of <a
560560
href="#cue-props">audio cues</a> within the <a href="#aural-model">audio
561561
"box" model</a>.
562562

@@ -1225,9 +1225,9 @@ <h3 id=pause-props-pause><span class=secno>7.2. </span>The &lsquo;<a
12251225
<td> <em>Value:</em>
12261226

12271227
<td>&lt;&lsquo;<a href="#pause-before"><code
1228-
class=property>pause-before</code></a>&rsquo;&gt; || &lt;&lsquo;<a
1228+
class=property>pause-before</code></a>&rsquo;&gt; &lt;&lsquo;<a
12291229
href="#pause-after"><code
1230-
class=property>pause-after</code></a>&rsquo;&gt;
1230+
class=property>pause-after</code></a>&rsquo;&gt;?
12311231

12321232
<tr>
12331233
<td> <em>Initial:</em>
@@ -1495,9 +1495,9 @@ <h3 id=rest-props-rest><span class=secno>8.2. </span>The &lsquo;<a
14951495
<td> <em>Value:</em>
14961496

14971497
<td>&lt;&lsquo;<a href="#rest-before"><code
1498-
class=property>rest-before</code></a>&rsquo;&gt; || &lt;&lsquo;<a
1498+
class=property>rest-before</code></a>&rsquo;&gt; &lt;&lsquo;<a
14991499
href="#rest-after"><code
1500-
class=property>rest-after</code></a>&rsquo;&gt;
1500+
class=property>rest-after</code></a>&rsquo;&gt;?
15011501

15021502
<tr>
15031503
<td> <em>Initial:</em>
@@ -1639,8 +1639,8 @@ <h3 id=cue-props-cue-before-after><span class=secno>9.1. </span>The
16391639
<p>The &lsquo;<a href="#cue-before"><code
16401640
class=property>cue-before</code></a>&rsquo; and &lsquo;<a
16411641
href="#cue-after"><code class=property>cue-after</code></a>&rsquo;
1642-
properties specify auditory icons (i.e. prerecorded audio clips) to be
1643-
played before (or after) the selected element within the <a
1642+
properties specify auditory icons (i.e. pre-recorded / pre-generated sound
1643+
clips) to be played before (or after) the selected element within the <a
16441644
href="#aural-model">audio "box" model</a>.
16451645

16461646
<p class=note> Note that the functionality provided by this property is
@@ -1670,15 +1670,61 @@ <h3 id=cue-props-cue-before-after><span class=secno>9.1. </span>The
16701670
(decibel unit). This represents a change (positive or negative) relative
16711671
to the computed value of the &lsquo;<a href="#voice-volume"><code
16721672
class=property>voice-volume</code></a>&rsquo; property within the <a
1673-
href="#aural-model">aural "box" model</a> of the selected element. When
1674-
the &lsquo;<a href="#voice-volume"><code
1673+
href="#aural-model">aural "box" model</a> of the selected element.
1674+
Decibels express the ratio of the squares of the new signal amplitude
1675+
(a1) and the current amplitude (a0), as per the following logarithmic
1676+
equation: volume(dB) = 20 log10 (a1 / a0)</p>
1677+
1678+
<p> When the &lsquo;<a href="#voice-volume"><code
16751679
class=property>voice-volume</code></a>&rsquo; property is set to
16761680
&lsquo;<code class=property>silent</code>&rsquo;, the audio cue is also
16771681
set to &lsquo;<code class=property>silent</code>&rsquo; (regardless of
1678-
the value specified for this &lt;decibel&gt;). Decibels express the
1679-
ratio of the squares of the new signal amplitude (a1) and the current
1680-
amplitude (a0), as per the following logarithmic equation: volume(dB) =
1681-
20 log10 (a1 / a0)</p>
1682+
this specified &lt;decibel&gt; value). Otherwise (when not &lsquo;<code
1683+
class=property>silent</code>&rsquo;), &lsquo;<a
1684+
href="#voice-volume"><code class=property>voice-volume</code></a>&rsquo;
1685+
values are always specified relative to the volume level keywords,
1686+
which map to a user-configured scale of "preferred" loudness settings
1687+
(see the definition of &lsquo;<a href="#voice-volume"><code
1688+
class=property>voice-volume</code></a>&rsquo;). If the inherited
1689+
&lsquo;<a href="#voice-volume"><code
1690+
class=property>voice-volume</code></a>&rsquo; value already contains a
1691+
decibel offset, the dB offset specific to the audio cue is combined
1692+
additively.
1693+
1694+
<p> The desired effect of an audio cue set at +0dB is that the volume
1695+
level during playback of the pre-recorded / pre-generated audio signal
1696+
is effectively the same as the volume level of live (i.e. real-time)
1697+
speech synthesis rendition. In order to achieve this effect, speech
1698+
processors are capable of directly controlling the waveform amplitude of
1699+
generated text-to-speech audio, user agents must be able to adjust the
1700+
volume output of audio cues (i.e. amplify or attenuate audio signals
1701+
based on the intrinsic waveform amplitude of sound clips), and last but
1702+
not least, authors must ensure that the "normal" volume level of
1703+
pre-recorded audio cues (on average, as there may be discrete variations
1704+
due to changes in the audio stream, such as intonation, stress, etc.)
1705+
matches that of a "typical" TTS voice output (based on the &lsquo;<a
1706+
href="#voice-family"><code class=property>voice-family</code></a>&rsquo;
1707+
intended for use), given standard listening conditions (i.e. default
1708+
system volume levels, centered equalization across the frequency
1709+
spectrum). This latter prerequisite sets a baseline that enables a user
1710+
agent to align the volume outputs of both TTS and cue audio streams
1711+
within the same "aural box model". Due to the complex relationship
1712+
between perceived audio characteristics and the processing applied to
1713+
the digitized audio signal, we will simplify the definition of "normal"
1714+
volume levels by referring to a canonical recording scenario, whereby
1715+
the attenuation is typically indicated in decibels, ranging from 0dB
1716+
(maximum audio input, near clipping threshold) to -60dB (total silence).
1717+
In this common context, a "standard" audio clip would oscillate between
1718+
these values, the loudest peak levels would be close to -3dB (to avoid
1719+
distortion), and the audible passages would have average volume levels
1720+
as high as possible (i.e. not too quiet, to avoid background noise
1721+
during amplification). This would roughly provide an audio experience
1722+
that could be seamlessly combined with text-to-speech output (i.e. there
1723+
would be no discernible difference in volume levels when switching from
1724+
pre-recorded audio to speech synthesis). Although there exists no
1725+
industry-wide standard to back up such a convention, TTS engines usually
1726+
generate comparably-loud audio signals when no amplification (or
1727+
attenuation) is specified.</p>
16821728

16831729
<p class=note> Note that -6.0dB is approximately half the amplitude of
16841730
the audio signal, and +6.0dB is approximately twice the amplitude.</p>
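The decibel arithmetic described in this hunk can be checked with a small sketch. It is illustrative only and not part of the commit: the function names and the `voice_is_silent` flag are hypothetical, and the sketch assumes that a cue's dB value is simply added to any dB offset already carried by the inherited 'voice-volume' value, as the added text states.

```python
import math


def amplitude_ratio_to_db(a1: float, a0: float) -> float:
    """volume(dB) = 20 * log10(a1 / a0), as given in the diff text."""
    return 20.0 * math.log10(a1 / a0)


def db_to_amplitude_ratio(db: float) -> float:
    """Inverse of the formula above: a1 / a0 = 10 ** (dB / 20)."""
    return 10.0 ** (db / 20.0)


def effective_cue_offset_db(voice_is_silent: bool,
                            inherited_offset_db: float,
                            cue_db: float):
    """Combine the cue's dB value with the inherited voice-volume offset.

    Returns None when voice-volume is 'silent' (the cue is then silent
    regardless of its own dB value); otherwise the offsets are summed.
    """
    if voice_is_silent:
        return None
    return inherited_offset_db + cue_db


if __name__ == "__main__":
    print(round(db_to_amplitude_ratio(-6.0), 3))       # roughly half
    print(round(db_to_amplitude_ratio(+6.0), 3))       # roughly double
    print(effective_cue_offset_db(False, -3.0, +6.0))  # offsets add up
```

The inverse formula confirms the note above: -6.0dB and +6.0dB are only approximately half and twice the amplitude, since the exact doubling point is about 6.02dB.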
@@ -1906,15 +1952,16 @@ <h3 id=voice-props-voice-family><span class=secno>10.1. </span>The
19061952
ranges may be used by the processor-dependent voice-matching algorithm).
19071953
</p>
19081954

1909-
<p class=note> The interpretation of the relationship between a person's
1910-
age and a recognizable type of voice cannot realistically be defined in
1911-
a universal manner, as it effectively depends on numerous cultural and
1912-
linguistic variations. The values provided by this specification
1913-
therefore represent a simplified model that can be reasonably applied to
1914-
a great variety of speech locales, albeit at the cost of a certain
1915-
degree of approximation. Future versions of this specification may
1916-
refine the level of precision of the voice-matching algorithm, as speech
1917-
processor implementations become more standardized.</p>
1955+
<p class=note> Note that the interpretation of the relationship between a
1956+
person's age and a recognizable type of voice cannot realistically be
1957+
defined in a universal manner, as it effectively depends on numerous
1958+
criteria (cultural, linguistic, biological, etc.). The values provided
1959+
by this specification therefore represent a simplified model that can be
1960+
reasonably applied to a broad variety of speech contexts, albeit at the
1961+
cost of a certain degree of approximation. Future versions of this
1962+
specification may refine the level of precision of the voice-matching
1963+
algorithm, as speech processor implementations become more standardized.
1964+
</p>
19181965

19191966
<dt> <strong>&lt;gender&gt;</strong>
19201967

@@ -2218,10 +2265,11 @@ <h3 id=voice-props-voice-pitch><span class=secno>10.3. </span>The &lsquo;<a
22182265
<tr>
22192266
<td> <em>Computed value:</em>
22202267

2221-
<td> one of the predefined keywords if only the keyword is specified by
2222-
itself, otherwise a fixed frequency calculated by converting the
2223-
keyword value (if any) to an absolute value based on the current
2224-
voice-family and by applying the specified relative offset (if any)
2268+
<td> one of the predefined pitch keywords if only the keyword is
2269+
specified by itself, otherwise an absolute frequency calculated by
2270+
converting the keyword value (if any) to a fixed frequency based on the
2271+
current voice-family and by applying the specified relative offset (if
2272+
any)
22252273
</table>
22262274

22272275
<p>The &lsquo;<a href="#voice-pitch"><code
@@ -2306,14 +2354,14 @@ <h3 id=voice-props-voice-pitch><span class=secno>10.3. </span>The &lsquo;<a
23062354
the conversion from a keyword to a concrete, voice-dependent frequency.</p>
23072355
</dl>
23082356

2309-
<p> Computed absolute frequency values that are negative are clamped to
2310-
zero Hertz. Speech-capable user agents are likely to support a specific
2311-
range of values rather than the full range of possible calculated
2312-
numerical values for frequencies. The actual values in user agents may
2313-
therefore be clamped to implementation-dependent minimum and maximum
2314-
boundaries. For example: although the 0Hz frequency can be legitimately
2315-
calculated, it may be clamped to a more meaningful value in the context of
2316-
the speech synthesizer.
2357+
<p> Computed absolute frequencies that are negative are clamped to zero
2358+
Hertz. Speech-capable user agents are likely to support a specific range
2359+
of values rather than the full range of possible calculated numerical
2360+
values for frequencies. The actual values in user agents may therefore be
2361+
clamped to implementation-dependent minimum and maximum boundaries. For
2362+
example: although the 0Hz frequency can be legitimately calculated, it may
2363+
be clamped to a more meaningful value in the context of the speech
2364+
synthesizer.
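A minimal sketch of the clamping behaviour this paragraph describes, not taken from the commit. The parameters `ua_min` and `ua_max` are hypothetical stand-ins for the implementation-dependent boundaries, which the text leaves to each user agent.

```python
from typing import Optional


def clamp_frequency(hz: float,
                    ua_min: Optional[float] = None,
                    ua_max: Optional[float] = None) -> float:
    """Clamp a computed absolute frequency.

    Negative values are first clamped to 0 Hz. The optional ua_min and
    ua_max bounds model an implementation-dependent supported range
    (assumed values; none are defined in the text).
    """
    hz = max(hz, 0.0)
    if ua_min is not None:
        hz = max(hz, ua_min)
    if ua_max is not None:
        hz = min(hz, ua_max)
    return hz
```

For instance, a keyword resolved to 100Hz with a -120Hz offset computes to -20Hz and is clamped to 0Hz; a user agent with a supported range could then raise that to its own minimum.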
23172365

23182366
<div class=example>
23192367
<p>Examples of property values:</p>
@@ -2377,10 +2425,11 @@ <h3 id=voice-props-voice-range><span class=secno>10.4. </span>The &lsquo;<a
23772425
<tr>
23782426
<td> <em>Computed value:</em>
23792427

2380-
<td> one of the predefined keywords if only the keyword is specified by
2381-
itself, otherwise a fixed frequency calculated by converting the
2382-
keyword value (if any) to an absolute value based on the current
2383-
voice-family and by applying the specified relative offset (if any)
2428+
<td> one of the predefined pitch keywords if only the keyword is
2429+
specified by itself, otherwise an absolute frequency calculated by
2430+
converting the keyword value (if any) to a fixed frequency based on the
2431+
current voice-family and by applying the specified relative offset (if
2432+
any)
23842433
</table>
23852434

23862435
<p> The &lsquo;<a href="#voice-range"><code
@@ -2465,14 +2514,14 @@ <h3 id=voice-props-voice-range><span class=secno>10.4. </span>The &lsquo;<a
24652514
the conversion from a keyword to a concrete, voice-dependent frequency.</p>
24662515
</dl>
24672516

2468-
<p> Computed absolute frequency values that are negative are clamped to
2469-
zero Hertz. Speech-capable user agents are likely to support a specific
2470-
range of values rather than the full range of possible calculated
2471-
numerical values for frequencies. The actual values in user agents may
2472-
therefore be clamped to implementation-dependent minimum and maximum
2473-
boundaries. For example: although the 0Hz frequency can be legitimately
2474-
calculated, it may be clamped to a more meaningful value in the context of
2475-
the speech synthesizer.
2517+
<p> Computed absolute frequencies that are negative are clamped to zero
2518+
Hertz. Speech-capable user agents are likely to support a specific range
2519+
of values rather than the full range of possible calculated numerical
2520+
values for frequencies. The actual values in user agents may therefore be
2521+
clamped to implementation-dependent minimum and maximum boundaries. For
2522+
example: although the 0Hz frequency can be legitimately calculated, it may
2523+
be clamped to a more meaningful value in the context of the speech
2524+
synthesizer.
24762525

24772526
<div class=example>
24782527
<p>Examples of inherited values:</p>
@@ -3000,8 +3049,8 @@ <h2 class=no-num id=property-index>Appendix A &mdash; Property index</h2>
30003049
<tr>
30013050
<th><a class=property href="#pause">pause</a>
30023051

3003-
<td>&lt;&lsquo;pause-before&rsquo;&gt; ||
3004-
&lt;&lsquo;pause-after&rsquo;&gt;
3052+
<td>&lt;&lsquo;pause-before&rsquo;&gt;
3053+
&lt;&lsquo;pause-after&rsquo;&gt;?
30053054

30063055
<td>N/A (see individual properties)
30073056

@@ -3046,8 +3095,7 @@ <h2 class=no-num id=property-index>Appendix A &mdash; Property index</h2>
30463095
<tr>
30473096
<th><a class=property href="#rest">rest</a>
30483097

3049-
<td>&lt;&lsquo;rest-before&rsquo;&gt; ||
3050-
&lt;&lsquo;rest-after&rsquo;&gt;
3098+
<td>&lt;&lsquo;rest-before&rsquo;&gt; &lt;&lsquo;rest-after&rsquo;&gt;?
30513099

30523100
<td>N/A (see individual properties)
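The grammar change for the 'pause' and 'rest' shorthands in this commit (from `A || B` to `A B?`) makes the values positional: the first value is the "before" component and the optional second value is the "after" component. The sketch below is a hypothetical helper, not code from the commit. It assumes that a single value applies to both longhands, which comes from the shorthand definitions in the specification rather than from the lines shown in this diff.

```python
def expand_shorthand(value: str, prefix: str) -> dict:
    """Expand a 'pause' or 'rest' shorthand value into its longhands.

    Follows the '<before> <after>?' grammar: one or two
    whitespace-separated components, first 'before', then 'after'.
    Assumption: a lone value is used for both longhands.
    """
    parts = value.split()
    if not 1 <= len(parts) <= 2:
        raise ValueError(f"expected one or two values, got {len(parts)}")
    before = parts[0]
    after = parts[1] if len(parts) == 2 else before
    return {f"{prefix}-before": before, f"{prefix}-after": after}
```

Calling `expand_shorthand("20ms 40ms", "pause")` yields a 'pause-before' of `20ms` and a 'pause-after' of `40ms`, while `expand_shorthand("strong", "rest")` sets both longhands to `strong`.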
30533101
