minor reformulation

danielweck · danielweck · commit 3ef0139ccbc7 · 2011-08-02T07:41:49.000Z
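
The decibel figures in the paragraph reworded by this change (the -6.0dB / +6.0dB note, the -3dB peak level and the -15dB RMS figure) can be sanity-checked with a small sketch. This is an illustration, not part of the specification: the function names are hypothetical, and it assumes the usual 20*log10 amplitude convention and samples normalized to the range [-1.0, 1.0], with RMS measured relative to a full-scale value of 1.0. The spec text does not say which reference its RMS figure uses.

```python
import math
from typing import Sequence


def db_to_amplitude_ratio(db: float) -> float:
    """Convert a decibel offset to a linear amplitude ratio.

    Assumes the amplitude (not power) convention: ratio = 10 ** (dB / 20).
    """
    return 10.0 ** (db / 20.0)


def rms_dbfs(samples: Sequence[float]) -> float:
    """Return the RMS level of normalized samples in dB relative to 1.0.

    Assumption: samples lie in [-1.0, 1.0] and full scale is 1.0.
    A completely silent clip yields negative infinity.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    # 10 * log10(mean square) equals 20 * log10(RMS).
    return 10.0 * math.log10(mean_square)


if __name__ == "__main__":
    for offset in (-6.0, 6.0, -3.0, -15.0):
        print(f"{offset:+.1f} dB -> amplitude x {db_to_amplitude_ratio(offset):.3f}")
```

Under these assumptions, -6.0dB corresponds to an amplitude ratio of about 0.501 and +6.0dB to about 1.995, which agrees with the "approximately half" and "approximately twice" wording of the note in the diff.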
diff --git a/css3-speech/Overview.html b/css3-speech/Overview.html
@@ -90,13 +90,13 @@
 
    <h1 id=top>CSS Speech Module</h1>
 
-   <h2 class="no-num no-toc" id=longstatus-date>Editor's Draft 01 August 2011</h2>
+   <h2 class="no-num no-toc" id=longstatus-date>Editor's Draft 02 August 2011</h2>
 
    <dl>
     <dt>This version:
 
     <dd>
-     <!--<a href="http://www.w3.org/TR/2011/WD-css3-speech-20110801">http://www.w3.org/TR/2011/ED-css3-speech-20110801/</a>-->
+     <!--<a href="http://www.w3.org/TR/2011/WD-css3-speech-20110802">http://www.w3.org/TR/2011/ED-css3-speech-20110802/</a>-->
      <a
      href="http://dev.w3.org/csswg/css3-speech">http://dev.w3.org/csswg/css3-speech</a>
      
@@ -1693,38 +1693,40 @@ <h3 id=cue-props-cue-before-after><span class=secno>9.1. </span>The
 
     <p> The desired effect of an audio cue set at +0dB is that the volume
      level during playback of the pre-recorded / pre-generated audio signal
-     is effectively the same as the volume level of live (i.e. real-time)
-     speech synthesis rendition. In order to achieve this effect, speech
-     processors are capable of directly controlling the waveform amplitude of
-     generated text-to-speech audio, user agents must be able to adjust the
-     volume output of audio cues (i.e. amplify or attenuate audio signals
-     based on the intrinsic waveform amplitude of sound clips), and last but
+     is effectively the same as the loudness of live (i.e. real-time) speech
+     synthesis rendition. In order to achieve this effect, speech processors
+     are capable of directly controlling the waveform amplitude of generated
+     text-to-speech audio, user agents must be able to adjust the volume
+     output of audio cues (i.e. amplify or attenuate audio signals based on
+     the intrinsic waveform amplitude of digitized sound clips), and last but
      not least, authors must ensure that the "normal" volume level of
-     pre-recorded audio cues (on average, as there may be discrete variations
-     due to changes in the audio stream, such as intonation, stress, etc.)
-     matches that of a "typical" TTS voice output (based on the &lsquo;<a
-     href="#voice-family"><code class=property>voice-family</code></a>&rsquo;
-     intended for use), given standard listening conditions (i.e. default
-     system volume levels, centered equalization across the frequency
-     spectrum). This latter prerequisite sets a baseline that enables a user
-     agent to align the volume outputs of both TTS and cue audio streams
-     within the same "aural box model". Due to the complex relationship
-     between perceived audio characteristics and the processing applied to
-     the digitized audio signal, we will simplify the definition of "normal"
-     volume levels by referring to a canonical recording scenario, whereby
-     the attenuation is typically indicated in decibels, ranging from 0dB
-     (maximum audio input, near clipping threshold) to -60dB (total silence).
-     In this common context, a "standard" audio clip would oscillate between
-     these values, the loudest peak levels would be close to -3dB (to avoid
-     distortion), and the audible passages would have average volume levels
+     pre-recorded audio cues (on average, as there may be discrete loudness
+     variations due to changes in the audio stream, such as intonation,
+     stress, etc.) matches that of a "typical" TTS voice output (based on the
+     &lsquo;<a href="#voice-family"><code
+     class=property>voice-family</code></a>&rsquo; intended for use), given
+     standard listening conditions (i.e. default system volume levels,
+     centered equalization across the frequency spectrum). This latter
+     prerequisite sets a baseline that enables a user agent to align the
+     volume outputs of both TTS and cue audio streams within the same "aural
+     box model". Due to the complex relationship between perceived audio
+     characteristics and the processing applied to the digitized audio
+     signal, we will simplify the definition of "normal" volume levels by
+     referring to a canonical recording scenario, whereby the attenuation is
+     typically indicated in decibels, ranging from 0dB (maximum audio input,
+     near clipping threshold) to -60dB (total silence). In this common
+     context, a "standard" audio clip would oscillate between these values,
+     the loudest peak levels would be close to -3dB (to avoid distortion),
+     and the relevant audible passages would have average (RMS) volume levels
      as high as possible (i.e. not too quiet, to avoid background noise
      during amplification). This would roughly provide an audio experience
      that could be seamlessly combined with text-to-speech output (i.e. there
      would be no discernible difference in volume levels when switching from
      pre-recorded audio to speech synthesis). Although there exists no
-     industry-wide standard to backup such convention, TTS engines usually
-     generate comparably-loud audio signals when no amplification (or
-     attenuation) is specified.</p>
+     industry-wide standard to support such a convention, TTS engines usually
+     generate comparably loud audio signals when no gain or attenuation is
+     specified. For voice and soft music, an average level of about -15dB RMS
+     is commonly used.</p>
 
     <p class=note> Note that -6.0dB is approximately half the amplitude of
      the audio signal, and +6.0dB is approximately twice the amplitude.</p>
diff --git a/css3-speech/Overview.src.html b/css3-speech/Overview.src.html
@@ -1284,31 +1284,32 @@ <h3 id="cue-props-cue-before-after">The 'cue-before' and 'cue-after' properties<
           definition of 'voice-volume'). If the inherited 'voice-volume' value already contains a
           decibel offset, the dB offset specific to the audio cue is combined additively. </p><p>
           The desired effect of an audio cue set at +0dB is that the volume level during playback of
-          the pre-recorded / pre-generated audio signal is effectively the same as the volume level
-          of live (i.e. real-time) speech synthesis rendition. In order to achieve this effect,
-          speech processors are capable of directly controlling the waveform amplitude of generated
+          the pre-recorded / pre-generated audio signal is effectively the same as the loudness of
+          live (i.e. real-time) speech synthesis rendition. In order to achieve this effect, speech
+          processors are capable of directly controlling the waveform amplitude of generated
           text-to-speech audio, user agents must be able to adjust the volume output of audio cues
           (i.e. amplify or attenuate audio signals based on the intrinsic waveform amplitude of
-          sound clips), and last but not least, authors must ensure that the "normal" volume level
-          of pre-recorded audio cues (on average, as there may be discrete variations due to changes
-          in the audio stream, such as intonation, stress, etc.) matches that of a "typical" TTS
-          voice output (based on the 'voice-family' intended for use), given standard listening
-          conditions (i.e. default system volume levels, centered equalization across the frequency
-          spectrum). This latter prerequisite sets a baseline that enables a user agent to align the
-          volume outputs of both TTS and cue audio streams within the same "aural box model". Due to
-          the complex relationship between perceived audio characteristics and the processing
-          applied to the digitized audio signal, we will simplify the definition of "normal" volume
-          levels by referring to a canonical recording scenario, whereby the attenuation is
-          typically indicated in decibels, ranging from 0dB (maximum audio input, near clipping
-          threshold) to -60dB (total silence). In this common context, a "standard" audio clip would
-          oscillate between these values, the loudest peak levels would be close to -3dB (to avoid
-          distortion), and the audible passages would have average volume levels as high as possible
-          (i.e. not too quiet, to avoid background noise during amplification). This would roughly
-          provide an audio experience that could be seamlessly combined with text-to-speech output
-          (i.e. there would be no discernible difference in volume levels when switching from
-          pre-recorded audio to speech synthesis). Although there exists no industry-wide standard
-          to backup such convention, TTS engines usually generate comparably-loud audio signals when
-          no amplification (or attenuation) is specified.</p>
+          digitized sound clips), and last but not least, authors must ensure that the "normal"
+          volume level of pre-recorded audio cues (on average, as there may be discrete loudness
+          variations due to changes in the audio stream, such as intonation, stress, etc.) matches
+          that of a "typical" TTS voice output (based on the 'voice-family' intended for use), given
+          standard listening conditions (i.e. default system volume levels, centered equalization
+          across the frequency spectrum). This latter prerequisite sets a baseline that enables a
+          user agent to align the volume outputs of both TTS and cue audio streams within the same
+          "aural box model". Due to the complex relationship between perceived audio characteristics
+          and the processing applied to the digitized audio signal, we will simplify the definition
+          of "normal" volume levels by referring to a canonical recording scenario, whereby the
+          attenuation is typically indicated in decibels, ranging from 0dB (maximum audio input,
+          near clipping threshold) to -60dB (total silence). In this common context, a "standard"
+          audio clip would oscillate between these values, the loudest peak levels would be close to
+          -3dB (to avoid distortion), and the relevant audible passages would have average (RMS)
+          volume levels as high as possible (i.e. not too quiet, to avoid background noise during
+          amplification). This would roughly provide an audio experience that could be seamlessly
+          combined with text-to-speech output (i.e. there would be no discernible difference in
+          volume levels when switching from pre-recorded audio to speech synthesis). Although there
+          exists no industry-wide standard to support such a convention, TTS engines usually generate
+          comparably loud audio signals when no gain or attenuation is specified. For voice and soft
+          music, an average level of about -15dB RMS is commonly used. </p>
         <p class="note"> Note that -6.0dB is approximately half the amplitude of the audio signal,
           and +6.0dB is approximately twice the amplitude.</p>
         <p class="note"> Note that there is a difference between an audio cue whose volume is set to