Commit 3ef0139

minor reformulation
1 parent 430b02a commit 3ef0139

2 files changed

Lines changed: 54 additions & 51 deletions

css3-speech/Overview.html

Lines changed: 30 additions & 28 deletions
@@ -90,13 +90,13 @@
 
 <h1 id=top>CSS Speech Module</h1>
 
-<h2 class="no-num no-toc" id=longstatus-date>Editor's Draft 01 August 2011</h2>
+<h2 class="no-num no-toc" id=longstatus-date>Editor's Draft 02 August 2011</h2>
 
 <dl>
 <dt>This version:
 
 <dd>
-<!--<a href="http://www.w3.org/TR/2011/WD-css3-speech-20110801">http://www.w3.org/TR/2011/ED-css3-speech-20110801/</a>-->
+<!--<a href="http://www.w3.org/TR/2011/WD-css3-speech-20110802">http://www.w3.org/TR/2011/ED-css3-speech-20110802/</a>-->
 <a
 href="http://dev.w3.org/csswg/css3-speech">http://dev.w3.org/csswg/css3-speech</a>

@@ -1693,38 +1693,40 @@ <h3 id=cue-props-cue-before-after><span class=secno>9.1. </span>The
 
 <p> The desired effect of an audio cue set at +0dB is that the volume
 level during playback of the pre-recorded / pre-generated audio signal
-is effectively the same as the volume level of live (i.e. real-time)
-speech synthesis rendition. In order to achieve this effect, speech
-processors are capable of directly controlling the waveform amplitude of
-generated text-to-speech audio, user agents must be able to adjust the
-volume output of audio cues (i.e. amplify or attenuate audio signals
-based on the intrinsic waveform amplitude of sound clips), and last but
+is effectively the same as the loudness of live (i.e. real-time) speech
+synthesis rendition. In order to achieve this effect, speech processors
+are capable of directly controlling the waveform amplitude of generated
+text-to-speech audio, user agents must be able to adjust the volume
+output of audio cues (i.e. amplify or attenuate audio signals based on
+the intrinsic waveform amplitude of digitized sound clips), and last but
 not least, authors must ensure that the "normal" volume level of
-pre-recorded audio cues (on average, as there may be discrete variations
-due to changes in the audio stream, such as intonation, stress, etc.)
-matches that of a "typical" TTS voice output (based on the &lsquo;<a
-href="#voice-family"><code class=property>voice-family</code></a>&rsquo;
-intended for use), given standard listening conditions (i.e. default
-system volume levels, centered equalization across the frequency
-spectrum). This latter prerequisite sets a baseline that enables a user
-agent to align the volume outputs of both TTS and cue audio streams
-within the same "aural box model". Due to the complex relationship
-between perceived audio characteristics and the processing applied to
-the digitized audio signal, we will simplify the definition of "normal"
-volume levels by referring to a canonical recording scenario, whereby
-the attenuation is typically indicated in decibels, ranging from 0dB
-(maximum audio input, near clipping threshold) to -60dB (total silence).
-In this common context, a "standard" audio clip would oscillate between
-these values, the loudest peak levels would be close to -3dB (to avoid
-distortion), and the audible passages would have average volume levels
+pre-recorded audio cues (on average, as there may be discrete loudness
+variations due to changes in the audio stream, such as intonation,
+stress, etc.) matches that of a "typical" TTS voice output (based on the
+&lsquo;<a href="#voice-family"><code
+class=property>voice-family</code></a>&rsquo; intended for use), given
+standard listening conditions (i.e. default system volume levels,
+centered equalization across the frequency spectrum). This latter
+prerequisite sets a baseline that enables a user agent to align the
+volume outputs of both TTS and cue audio streams within the same "aural
+box model". Due to the complex relationship between perceived audio
+characteristics and the processing applied to the digitized audio
+signal, we will simplify the definition of "normal" volume levels by
+referring to a canonical recording scenario, whereby the attenuation is
+typically indicated in decibels, ranging from 0dB (maximum audio input,
+near clipping threshold) to -60dB (total silence). In this common
+context, a "standard" audio clip would oscillate between these values,
+the loudest peak levels would be close to -3dB (to avoid distortion),
+and the relevant audible passages would have average (RMS) volume levels
 as high as possible (i.e. not too quiet, to avoid background noise
 during amplification). This would roughly provide an audio experience
 that could be seamlessly combined with text-to-speech output (i.e. there
 would be no discernible difference in volume levels when switching from
 pre-recorded audio to speech synthesis). Although there exists no
-industry-wide standard to backup such convention, TTS engines usually
-generate comparably-loud audio signals when no amplification (or
-attenuation) is specified.</p>
+industry-wide standard to support such convention, TTS engines usually
+generate comparably-loud audio signals when no gain or attenuation is
+specified. For voice and soft music, -15dB RMS seems to be pretty
+standard.</p>
 
 <p class=note> Note that -6.0dB is approximately half the amplitude of
 the audio signal, and +6.0dB is approximately twice the amplitude.</p>
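
The note closing this hunk says that -6.0dB is roughly half the amplitude and +6.0dB roughly twice. A minimal Python sketch of that arithmetic, assuming the usual 20 * log10 convention for amplitude ratios; the function names are illustrative and do not appear in the spec:

```python
import math


def db_to_amplitude_ratio(db: float) -> float:
    """Convert a decibel offset into a linear amplitude ratio."""
    return 10 ** (db / 20)


def amplitude_ratio_to_db(ratio: float) -> float:
    """Convert a positive linear amplitude ratio into decibels."""
    if ratio <= 0:
        raise ValueError("amplitude ratio must be positive")
    return 20 * math.log10(ratio)


if __name__ == "__main__":
    for offset in (-6.0, 0.0, 6.0):
        print(f"{offset:+.1f}dB -> x{db_to_amplitude_ratio(offset):.3f}")
```

Under this convention the ratios are close to, but not exactly, 0.5 and 2, which is why the note says "approximately".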

css3-speech/Overview.src.html

Lines changed: 24 additions & 23 deletions
@@ -1284,31 +1284,32 @@ <h3 id="cue-props-cue-before-after">The 'cue-before' and 'cue-after' properties<
 definition of 'voice-volume'). If the inherited 'voice-volume' value already contains a
 decibel offset, the dB offset specific to the audio cue is combined additively. </p><p>
 The desired effect of an audio cue set at +0dB is that the volume level during playback of
-the pre-recorded / pre-generated audio signal is effectively the same as the volume level
-of live (i.e. real-time) speech synthesis rendition. In order to achieve this effect,
-speech processors are capable of directly controlling the waveform amplitude of generated
+the pre-recorded / pre-generated audio signal is effectively the same as the loudness of
+live (i.e. real-time) speech synthesis rendition. In order to achieve this effect, speech
+processors are capable of directly controlling the waveform amplitude of generated
 text-to-speech audio, user agents must be able to adjust the volume output of audio cues
 (i.e. amplify or attenuate audio signals based on the intrinsic waveform amplitude of
-sound clips), and last but not least, authors must ensure that the "normal" volume level
-of pre-recorded audio cues (on average, as there may be discrete variations due to changes
-in the audio stream, such as intonation, stress, etc.) matches that of a "typical" TTS
-voice output (based on the 'voice-family' intended for use), given standard listening
-conditions (i.e. default system volume levels, centered equalization across the frequency
-spectrum). This latter prerequisite sets a baseline that enables a user agent to align the
-volume outputs of both TTS and cue audio streams within the same "aural box model". Due to
-the complex relationship between perceived audio characteristics and the processing
-applied to the digitized audio signal, we will simplify the definition of "normal" volume
-levels by referring to a canonical recording scenario, whereby the attenuation is
-typically indicated in decibels, ranging from 0dB (maximum audio input, near clipping
-threshold) to -60dB (total silence). In this common context, a "standard" audio clip would
-oscillate between these values, the loudest peak levels would be close to -3dB (to avoid
-distortion), and the audible passages would have average volume levels as high as possible
-(i.e. not too quiet, to avoid background noise during amplification). This would roughly
-provide an audio experience that could be seamlessly combined with text-to-speech output
-(i.e. there would be no discernible difference in volume levels when switching from
-pre-recorded audio to speech synthesis). Although there exists no industry-wide standard
-to backup such convention, TTS engines usually generate comparably-loud audio signals when
-no amplification (or attenuation) is specified.</p>
+digitized sound clips), and last but not least, authors must ensure that the "normal"
+volume level of pre-recorded audio cues (on average, as there may be discrete loudness
+variations due to changes in the audio stream, such as intonation, stress, etc.) matches
+that of a "typical" TTS voice output (based on the 'voice-family' intended for use), given
+standard listening conditions (i.e. default system volume levels, centered equalization
+across the frequency spectrum). This latter prerequisite sets a baseline that enables a
+user agent to align the volume outputs of both TTS and cue audio streams within the same
+"aural box model". Due to the complex relationship between perceived audio characteristics
+and the processing applied to the digitized audio signal, we will simplify the definition
+of "normal" volume levels by referring to a canonical recording scenario, whereby the
+attenuation is typically indicated in decibels, ranging from 0dB (maximum audio input,
+near clipping threshold) to -60dB (total silence). In this common context, a "standard"
+audio clip would oscillate between these values, the loudest peak levels would be close to
+-3dB (to avoid distortion), and the relevant audible passages would have average (RMS)
+volume levels as high as possible (i.e. not too quiet, to avoid background noise during
+amplification). This would roughly provide an audio experience that could be seamlessly
+combined with text-to-speech output (i.e. there would be no discernible difference in
+volume levels when switching from pre-recorded audio to speech synthesis). Although there
+exists no industry-wide standard to support such convention, TTS engines usually generate
+comparably-loud audio signals when no gain or attenuation is specified. For voice and soft
+music, -15dB RMS seems to be pretty standard. </p>
 <p class="note"> Note that -6.0dB is approximately half the amplitude of the audio signal,
 and +6.0dB is approximately twice the amplitude.</p>
 <p class="note"> Note that there is a difference between an audio cue whose volume is set to
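
The revised paragraph describes a "standard" clip whose loudest peaks sit near -3dB and whose average (RMS) level is as high as possible, with -15dB RMS mentioned for voice and soft music. As a rough illustration only, the Python sketch below measures both levels for samples normalized to [-1.0, 1.0] (so 0dB is full scale) and suggests a gain. The function names, the default targets, and the policy of taking the smaller of the two limits are assumptions made for this example, not something the spec defines.

```python
import math
from typing import Sequence


def peak_db(samples: Sequence[float]) -> float:
    """Peak level in dB relative to full scale (1.0); -inf for silence."""
    peak = max((abs(s) for s in samples), default=0.0)
    return 20 * math.log10(peak) if peak > 0 else float("-inf")


def rms_db(samples: Sequence[float]) -> float:
    """Average (RMS) level in dB relative to full scale; -inf for silence."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")


def suggested_gain_db(
    samples: Sequence[float],
    target_rms_db: float = -15.0,   # assumed target, taken from the commit text
    peak_ceiling_db: float = -3.0,  # assumed ceiling, taken from the commit text
) -> float:
    """Gain (dB) that moves RMS toward the target without exceeding the peak ceiling."""
    current_rms = rms_db(samples)
    if current_rms == float("-inf"):
        return 0.0  # nothing to normalize in a silent clip
    gain_for_rms = target_rms_db - current_rms
    headroom = peak_ceiling_db - peak_db(samples)
    return min(gain_for_rms, headroom)


if __name__ == "__main__":
    # One full cycle of a full-scale sine wave as a stand-in for an audio cue.
    sine = [math.sin(2 * math.pi * k / 1000) for k in range(1000)]
    print(f"peak {peak_db(sine):.2f}dB, RMS {rms_db(sine):.2f}dB, "
          f"suggested gain {suggested_gain_db(sine):+.2f}dB")
```

A cue author could run a check like this on a clip before publishing it; a real tool would also need to decode the audio file and might weight loudness perceptually, both of which are outside this sketch.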
