|
90 | 90 |
|
91 | 91 | <h1 id=top>CSS Speech Module</h1> |
92 | 92 |
|
93 | | - <h2 class="no-num no-toc" id=longstatus-date>Editor's Draft 01 August 2011</h2> |
| 93 | + <h2 class="no-num no-toc" id=longstatus-date>Editor's Draft 02 August 2011</h2> |
94 | 94 |
|
95 | 95 | <dl> |
96 | 96 | <dt>This version: |
97 | 97 |
|
98 | 98 | <dd> |
99 | | - <!--<a href="http://www.w3.org/TR/2011/WD-css3-speech-20110801">http://www.w3.org/TR/2011/ED-css3-speech-20110801/</a>--> |
| 99 | + <!--<a href="http://www.w3.org/TR/2011/WD-css3-speech-20110802">http://www.w3.org/TR/2011/ED-css3-speech-20110802/</a>--> |
100 | 100 | <a |
101 | 101 | href="http://dev.w3.org/csswg/css3-speech">http://dev.w3.org/csswg/css3-speech</a> |
102 | 102 |
|
@@ -1693,38 +1693,40 @@ <h3 id=cue-props-cue-before-after><span class=secno>9.1. </span>The |
1693 | 1693 |
|
1694 | 1694 | <p> The desired effect of an audio cue set at +0dB is that the volume |
1695 | 1695 | level during playback of the pre-recorded / pre-generated audio signal |
1696 | | - is effectively the same as the volume level of live (i.e. real-time) |
1697 | | - speech synthesis rendition. In order to achieve this effect, speech |
1698 | | - processors are capable of directly controlling the waveform amplitude of |
1699 | | - generated text-to-speech audio, user agents must be able to adjust the |
1700 | | - volume output of audio cues (i.e. amplify or attenuate audio signals |
1701 | | - based on the intrinsic waveform amplitude of sound clips), and last but |
| 1696 | + is effectively the same as the loudness of live (i.e. real-time) speech |
| 1697 | + synthesis rendition. In order to achieve this effect, speech processors |
| 1698 | + are capable of directly controlling the waveform amplitude of generated |
| 1699 | + text-to-speech audio, user agents must be able to adjust the volume |
| 1700 | + output of audio cues (i.e. amplify or attenuate audio signals based on |
| 1701 | + the intrinsic waveform amplitude of digitized sound clips), and last but |
1702 | 1702 | not least, authors must ensure that the "normal" volume level of |
1703 | | - pre-recorded audio cues (on average, as there may be discrete variations |
1704 | | - due to changes in the audio stream, such as intonation, stress, etc.) |
1705 | | - matches that of a "typical" TTS voice output (based on the ‘<a |
1706 | | - href="#voice-family"><code class=property>voice-family</code></a>’ |
1707 | | - intended for use), given standard listening conditions (i.e. default |
1708 | | - system volume levels, centered equalization across the frequency |
1709 | | - spectrum). This latter prerequisite sets a baseline that enables a user |
1710 | | - agent to align the volume outputs of both TTS and cue audio streams |
1711 | | - within the same "aural box model". Due to the complex relationship |
1712 | | - between perceived audio characteristics and the processing applied to |
1713 | | - the digitized audio signal, we will simplify the definition of "normal" |
1714 | | - volume levels by referring to a canonical recording scenario, whereby |
1715 | | - the attenuation is typically indicated in decibels, ranging from 0dB |
1716 | | - (maximum audio input, near clipping threshold) to -60dB (total silence). |
1717 | | - In this common context, a "standard" audio clip would oscillate between |
1718 | | - these values, the loudest peak levels would be close to -3dB (to avoid |
1719 | | - distortion), and the audible passages would have average volume levels |
| 1703 | + pre-recorded audio cues (on average, as there may be discrete loudness |
| 1704 | + variations due to changes in the audio stream, such as intonation, |
| 1705 | + stress, etc.) matches that of a "typical" TTS voice output (based on the |
| 1706 | + ‘<a href="#voice-family"><code |
| 1707 | + class=property>voice-family</code></a>’ intended for use), given |
| 1708 | + standard listening conditions (i.e. default system volume levels, |
| 1709 | + centered equalization across the frequency spectrum). This latter |
| 1710 | + prerequisite sets a baseline that enables a user agent to align the |
| 1711 | + volume outputs of both TTS and cue audio streams within the same "aural |
| 1712 | + box model". Due to the complex relationship between perceived audio |
| 1713 | + characteristics and the processing applied to the digitized audio |
| 1714 | + signal, we will simplify the definition of "normal" volume levels by |
| 1715 | + referring to a canonical recording scenario, whereby the attenuation is |
| 1716 | + typically indicated in decibels, ranging from 0dB (maximum audio input, |
| 1717 | + near clipping threshold) to -60dB (total silence). In this common |
| 1718 | + context, a "standard" audio clip would oscillate between these values, |
| 1719 | + the loudest peak levels would be close to -3dB (to avoid distortion), |
| 1720 | + and the relevant audible passages would have average (RMS) volume levels |
1720 | 1721 | as high as possible (i.e. not too quiet, to avoid background noise |
1721 | 1722 | during amplification). This would roughly provide an audio experience |
1722 | 1723 | that could be seamlessly combined with text-to-speech output (i.e. there |
1723 | 1724 | would be no discernible difference in volume levels when switching from |
1724 | 1725 | pre-recorded audio to speech synthesis). Although there exists no |
1725 | | - industry-wide standard to backup such convention, TTS engines usually |
1726 | | - generate comparably-loud audio signals when no amplification (or |
1727 | | - attenuation) is specified.</p> |
| 1726 | + industry-wide standard to support such a convention, TTS engines usually |
| 1727 | + generate comparably-loud audio signals when no gain or attenuation is |
| 1728 | + specified. For voice and soft music, an average level of about -15dB RMS |
| 1729 | + appears to be a common convention.</p> |
1728 | 1730 |
|
1729 | 1731 | <p class=note> Note that -6.0dB is approximately half the amplitude of |
1730 | 1732 | the audio signal, and +6.0dB is approximately twice the amplitude.</p> |
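
The decibel figures in the paragraph and note above can be checked with a little arithmetic. The following is a minimal sketch, not part of the specification: it assumes the usual amplitude convention (dB = 20 * log10 of the amplitude ratio) and samples normalized so that full scale is 1.0. The function names are hypothetical and chosen only for illustration.

```python
import math


def db_to_amplitude_ratio(db: float) -> float:
    """Convert a gain in decibels to a linear amplitude ratio (assumes 20*log10)."""
    return 10.0 ** (db / 20.0)


def amplitude_ratio_to_db(ratio: float) -> float:
    """Convert a linear amplitude ratio to decibels; zero or negative maps to -inf."""
    if ratio <= 0.0:
        return float("-inf")
    return 20.0 * math.log10(ratio)


def rms_db(samples: list[float]) -> float:
    """Average (RMS) level of samples in dB relative to full scale (1.0)."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return amplitude_ratio_to_db(math.sqrt(mean_square))


# -6.0dB is roughly half the amplitude, +6.0dB roughly twice.
half = db_to_amplitude_ratio(-6.0)
double = db_to_amplitude_ratio(6.0)

# One period of a full-scale sine wave: its RMS level sits about 3dB below peak.
n = 1000
sine = [math.sin(2.0 * math.pi * i / n) for i in range(n)]
sine_rms_db = rms_db(sine)

# Linear gain that would bring this clip to a hypothetical -15dB RMS target.
gain_to_target = db_to_amplitude_ratio(-15.0 - sine_rms_db)
```

Under these assumptions, `half` and `double` come out close to 0.5 and 2.0, which matches the note's "approximately half" and "approximately twice". The last line shows how a cue could be scaled toward a chosen average level; the -15dB target is only the figure mentioned in the text, not a normative value.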
|