w3c
diff --git a/css3-speech/Overview.html b/css3-speech/Overview.html
Lines changed: 93 additions & 76 deletions
@@ -88,14 +88,14 @@
 
    <h1 id=top>CSS Speech Module</h1>
 
-   <h2 class="no-num no-toc" id=longstatus-date>Editor's Draft 20 February
+   <h2 class="no-num no-toc" id=longstatus-date>Editor's Draft 21 February
     2012</h2>
 
    <dl id=versions>
     <dt>This version:
 
     <dd>
-     <!--<a href="http://www.w3.org/TR/2012/WD-css3-speech-20120220/">http://www.w3.org/TR/2012/ED-css3-speech-20120220/</a>-->
+     <!--<a href="http://www.w3.org/TR/2012/WD-css3-speech-20120221/">http://www.w3.org/TR/2012/ED-css3-speech-20120221/</a>-->
      <a
      href="http://dev.w3.org/csswg/css3-speech">http://dev.w3.org/csswg/css3-speech</a>
 
@@ -535,8 +535,11 @@ <h2 id=aural-model><span class=secno>6. </span>The aural formatting model</h2>
   <p> The following diagram illustrates the equivalence between properties of
    the visual and aural box models, applied to the selected &lt;element&gt;:
 
-  <p> <img alt="A graph depicting the aural 'box' model." id=aural-box
-   src=aural-box.png>
+  <p> <img
+   alt="The aural 'box' model, illustrated by a diagram: the selected element is positioned in the center, on its left side are (from innermost to outermost) rest-before, cue-before, pause-before, on its right side are (from innermost to outermost) rest-after, cue-after, pause-after, where rest is conceptually similar to padding, cue is similar to border, pause is similar to margin."
+   id=aural-box src=aural-box.png
+   title="The aural 'box' model, illustrated by a diagram: the selected element is positioned in the center, on its left side are (from innermost to outermost) rest-before, cue-before, pause-before, on its right side are (from innermost to outermost) rest-after, cue-after, pause-after, where rest is conceptually similar to padding, cue is similar to border, pause is similar to margin.">
+   
 
   <h2 id=mixing-props><span class=secno>7. </span>Mixing properties</h2>
 
@@ -585,15 +588,16 @@ <h3 id=mixing-props-voice-volume><span class=secno>7.1. </span>The
     <tr>
      <td> <em>Computed value:</em>
 
-     <td>a keyword value, and optionally also a decibel offset (if not zero)
+     <td>&lsquo;<code class=property>silent</code>&rsquo;, or a keyword value
+      and optionally also a decibel offset (if not zero)
   </table>
 
   <p>The &lsquo;<a href="#voice-volume"><code
    class=property>voice-volume</code></a>&rsquo; property allows authors to
    control the amplitude of the audio waveform generated by the speech
    synthesiser, and is also used to adjust the relative volume level of <a
-   href="#cue-props">audio cues</a> within the <a href="#aural-model">audio
-   box model</a>.
+   href="#cue-props">audio cues</a> within the <a href="#aural-model">aural
+   box model</a> of the selected element.
 
   <p class=note> Note that although the functionality provided by this
    property is similar to the <a
@@ -615,36 +619,40 @@ <h3 id=mixing-props-voice-volume><span class=secno>7.1. </span>The
    <dt> <strong>silent</strong>
 
    <dd>
-    <p> Specifies that no sound is generated (the text is read "silently").
-     Corresponds to negative infinity in dB units.</p>
+    <p> Specifies that no sound is generated (the text is read "silently").</p>
 
-    <p class=note> Note that there is a difference between an element whose
-     &lsquo;<a href="#voice-volume"><code
+    <p class=note> Note that this has the same effect as using negative
+     infinity decibels. Also note that there is a difference between an
+     element whose &lsquo;<a href="#voice-volume"><code
      class=property>voice-volume</code></a>&rsquo; property has a value of
      &lsquo;<code class=property>silent</code>&rsquo;, and an element whose
      &lsquo;<a href="#speak"><code class=property>speak</code></a>&rsquo;
      property has the value &lsquo;<code class=property>none</code>&rsquo;.
      With the former, the selected element takes up the same time as if it
      was spoken, including any pause before and after the element, but no
-     sound is generated (descendants can override the &lsquo;<a
+     sound is generated (descendants within the <a href="#aural-model">aural
+     box model</a> of the selected element can override the &lsquo;<a
      href="#voice-volume"><code class=property>voice-volume</code></a>&rsquo;
-     value and may therefore generate audio output). With the latter, the
+     value, and may therefore generate audio output). With the latter, the
      selected element is not rendered in the aural dimension and no time is
-     allocated for playback (descendants can override the &lsquo;<a
-     href="#speak"><code class=property>speak</code></a>&rsquo; value and may
-     therefore generate audio output).</p>
+     allocated for playback (descendants within the <a
+     href="#aural-model">aural box model</a> of the selected element can
+     override the &lsquo;<a href="#speak"><code
+     class=property>speak</code></a>&rsquo; value, and may therefore generate
+     audio output).</p>
 
    <dt><strong>x-soft</strong>, <strong>soft</strong>,
     <strong>medium</strong>, <strong>loud</strong>, <strong>x-loud</strong>
 
    <dd>
     <p>This sequence of keywords corresponds to monotonically non-decreasing
      volume levels, mapped to implementation-dependent values that meet the
-     listener's requirements with regards to perceived sound loudness. These
-     audio levels are typically provided via a preference mechanism that
-     allow users to set options according to their auditory environment. The
-     keyword &lsquo;<code class=property>x-soft</code>&rsquo; maps to the
-     user's <em>minimum audible</em> volume level, &lsquo;<code
+     listener's requirements with regard to perceived loudness. These audio
+     levels are typically provided via a preference mechanism that allows
+     users to calibrate sound options according to their auditory
+     environment. The keyword &lsquo;<code
+     class=property>x-soft</code>&rsquo; maps to the user's <em>minimum
+     audible</em> volume level, &lsquo;<code
      class=property>x-loud</code>&rsquo; maps to the user's <em>maximum
      tolerable</em> volume level, &lsquo;<code
      class=property>medium</code>&rsquo; maps to the user's
@@ -674,12 +682,12 @@ <h3 id=mixing-props-voice-volume><span class=secno>7.1. </span>The
      the audio signal, and +6.0dB is approximately twice the amplitude.</p>
   </dl>
 
-  <p class=note>Note that the actual perceived volume levels depend on
-   various factors, such as the listening environment and personal user
-   preferences. The effective volume variation between &lsquo;<code
+  <p class=note>Note that perceived loudness depends on various factors, such
+   as the listening environment, user preferences or physical abilities. The
+   effective volume variation between &lsquo;<code
    class=property>x-soft</code>&rsquo; and &lsquo;<code
    class=property>x-loud</code>&rsquo; represents the dynamic range (in terms
-   of loudness) of the speech output. Typically, this range would be
+   of loudness) of the audio output. Typically, this range would be
    compressed in a noisy context, i.e. the perceived loudness corresponding
    to &lsquo;<code class=property>x-soft</code>&rsquo; would effectively be
    closer to &lsquo;<code class=property>x-loud</code>&rsquo; than it would
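The hunk above states that +6.0dB is approximately twice the amplitude of the audio signal. As a rough check, the following Python sketch converts a decibel offset into a linear amplitude ratio. It assumes the conventional amplitude definition of the decibel (dB = 20 * log10(ratio)); the function name `db_to_amplitude_ratio` is illustrative and not part of the specification.

```python
def db_to_amplitude_ratio(decibels: float) -> float:
    """Convert a decibel offset to a linear amplitude ratio.

    Assumption: amplitude decibels, i.e. dB = 20 * log10(ratio).
    """
    return 10.0 ** (decibels / 20.0)


if __name__ == "__main__":
    # Print the ratio for an attenuation, no change, and a gain.
    for offset in (-6.0, 0.0, 6.0):
        ratio = db_to_amplitude_ratio(offset)
        print(f"{offset:+.1f}dB -> amplitude x{ratio:.3f}")
```

Under this assumption the ratio for +6.0dB is slightly below 2, and the ratio for -6.0dB is slightly above 0.5, which is consistent with the "approximately" wording in the text.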
@@ -1485,7 +1493,7 @@ <h3 id=rest-props-rest-before-after><span class=secno>10.1. </span>The
    href="#rest-after"><code class=property>rest-after</code></a>&rsquo;
    properties specify a prosodic boundary (silence with a specific duration)
    that occurs before (or after) the speech synthesis rendition of an element
-   within the <a href="#aural-model">audio box model</a>.
+   within the <a href="#aural-model">aural box model</a>.
 
   <p class=note> Note that although the functionality provided by this
    property is similar to the <a
@@ -1690,7 +1698,7 @@ <h3 id=cue-props-cue-before-after><span class=secno>11.1. </span>The
    href="#cue-after"><code class=property>cue-after</code></a>&rsquo;
    properties specify auditory icons (i.e. pre-recorded / pre-generated sound
    clips) to be played before (or after) the selected element within the <a
-   href="#aural-model">audio box model</a>.
+   href="#aural-model">aural box model</a>.
 
   <p class=note> Note that although the functionality provided by this
    property may appear related to the <a
@@ -1724,21 +1732,21 @@ <h3 id=cue-props-cue-before-after><span class=secno>11.1. </span>The
      to the computed value of the &lsquo;<a href="#voice-volume"><code
      class=property>voice-volume</code></a>&rsquo; property within the <a
      href="#aural-model">aural box model</a> of the selected element (as a
-     result, the volume level of audio cues changes when the &lsquo;<a
+     result, the volume level of an audio cue changes when the &lsquo;<a
      href="#voice-volume"><code class=property>voice-volume</code></a>&rsquo;
      property changes). When omitted, the implied value computes to 0dB.</p>
 
-    <p> When the &lsquo;<a href="#voice-volume"><code
-     class=property>voice-volume</code></a>&rsquo; property is set to
-     &lsquo;<code class=property>silent</code>&rsquo;, the audio cue is also
-     set to &lsquo;<code class=property>silent</code>&rsquo; (regardless of
-     this specified &lt;decibel&gt; value). Otherwise (when not &lsquo;<code
+    <p> When the computed value of the &lsquo;<a href="#voice-volume"><code
+     class=property>voice-volume</code></a>&rsquo; property is &lsquo;<code
+     class=property>silent</code>&rsquo;, the audio cue is also set to
+     &lsquo;<code class=property>silent</code>&rsquo; (regardless of this
+     specified &lt;decibel&gt; value). Otherwise (when not &lsquo;<code
      class=property>silent</code>&rsquo;), &lsquo;<a
      href="#voice-volume"><code class=property>voice-volume</code></a>&rsquo;
      values are always specified relatively to the volume level keywords (see
      the definition of &lsquo;<a href="#voice-volume"><code
      class=property>voice-volume</code></a>&rsquo;), which map to a
-     user-configured scale of "preferred" loudness settings. If the inherited
+     user-calibrated scale of "preferred" loudness settings. If the inherited
      &lsquo;<a href="#voice-volume"><code
      class=property>voice-volume</code></a>&rsquo; value already contains a
      decibel offset, the dB offset specific to the audio cue is combined
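The rule in the hunk above can be sketched in Python as follows. This is a minimal illustration, not the specification's algorithm: the hunk is cut off after "is combined", so the additive combination of decibel offsets used here is an assumption, as are the function name `effective_cue_volume` and the tuple representation of a computed 'voice-volume' value.

```python
from typing import Tuple, Union

# Hypothetical representation of a computed 'voice-volume' value:
# either the string "silent", or a (keyword, decibel offset) pair.
VoiceVolume = Union[str, Tuple[str, float]]


def effective_cue_volume(voice_volume: VoiceVolume,
                         cue_offset_db: float = 0.0) -> VoiceVolume:
    """Return the volume applied to an audio cue.

    A 'silent' voice-volume silences the cue regardless of the cue's own
    decibel value. Otherwise the cue's offset is combined with the
    voice-volume offset (assumed additive here). An omitted cue offset
    is treated as 0dB.
    """
    if voice_volume == "silent":
        return "silent"
    keyword, voice_offset_db = voice_volume
    return (keyword, voice_offset_db + cue_offset_db)


if __name__ == "__main__":
    print(effective_cue_volume("silent", 6.0))
    print(effective_cue_volume(("medium", -3.0), 6.0))
```

The keyword is carried through unchanged because it maps to a user-calibrated level that is only known at playback time.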
@@ -1789,48 +1797,54 @@ <h3 id=cue-props-volume><span class=secno>11.2. </span>Relation between
    For example, the desired effect of an audio cue whose volume level is set
    at +0dB (as specified by the &lt;decibel&gt; value) is that its perceived
    loudness during playback is close to that of the speech synthesis
-   rendition of the selected element, as dictated by computed value of the
+   rendition of the selected element, as dictated by the computed value of
+   the &lsquo;<a href="#voice-volume"><code
+   class=property>voice-volume</code></a>&rsquo; property. Note that a
+   &lsquo;<code class=property>silent</code>&rsquo; computed value for the
    &lsquo;<a href="#voice-volume"><code
-   class=property>voice-volume</code></a>&rsquo; property (which is itself
-   based on a user-configured volume level keyword). Similarly, a
-   &lsquo;<code class=property>silent</code>&rsquo; value for the &lsquo;<a
-   href="#voice-volume"><code class=property>voice-volume</code></a>&rsquo;
-   property results on any audio cues being "silenced" as well.
-
-  <p> In order to achieve this effect, authors should ensure that the volume
-   level of audio cues (on average, as there may be discrete loudness
-   variations due to changes in the audio stream, such as intonation, stress,
-   etc.) matches that of a "typical" TTS voice output (based on the &lsquo;<a
+   class=property>voice-volume</code></a>&rsquo; property results in audio
+   cues being "forcefully" silenced as well (i.e. regardless of the specified
+   audio cue &lsquo;<code class=property>decibel</code>&rsquo; value).
+
+  <p> The volume keywords of the &lsquo;<a href="#voice-volume"><code
+   class=property>voice-volume</code></a>&rsquo; property are user-calibrated
+   to match requirements not known at authoring time (e.g. auditory
+   environment, personal preferences). Therefore, in order to achieve this
+   approximate loudness alignment of audio cues and speech synthesis, authors
+   should ensure that the volume level of audio cues (on average, as there
+   may be discrete variations of perceived loudness due to changes in the
+   audio stream, such as intonation, stress, etc.) matches the output of a
+   speech synthesis rendition based on the &lsquo;<a
    href="#voice-family"><code class=property>voice-family</code></a>&rsquo;
-   intended for use), given "standard" listening conditions (i.e. default
+   intended for use, given "typical" listening conditions (i.e. default
    system volume levels, centered equalization across the frequency
    spectrum). As speech processors are capable of directly controlling the
    waveform amplitude of generated text-to-speech audio, and because user
    agents are able to adjust the volume output of audio cues (i.e. amplify or
    attenuate audio signals based on the intrinsic waveform amplitude of
    digitized sound clips), this sets a baseline that enables implementations
-   to "align" the loudness of both TTS and cue audio streams within the aural
-   box model, relative to user-configured volume levels (see the keywords
+   to manage the loudness of both TTS and cue audio streams within the aural
+   box model, relative to user-calibrated volume levels (see the keywords
    defined in the &lsquo;<a href="#voice-volume"><code
    class=property>voice-volume</code></a>&rsquo; property).
 
   <p> Due to the complex relationship between perceived audio characteristics
    (e.g. loudness) and the processing applied to the digitized audio signal
-   (e.g. "compression"), we refer to a simple scenario whereby the
-   attenuation is indicated in decibels, typically ranging from 0dB (maximum
-   audio input, near clipping threshold) to -60dB (total silence). Given this
-   context, a "standard" audio clip would oscillate between these values, the
-   loudest peak levels would be close to -3dB (to avoid distortion), and the
-   relevant audible passages would have average (RMS) volume levels as high
-   as possible (i.e. not too quiet, to avoid background noise during
-   amplification). This would roughly provide an audio experience that could
-   be seamlessly combined with text-to-speech output (i.e. there would be no
-   discernible difference in volume levels when switching from pre-recorded
-   audio to speech synthesis). Although there exists no industry-wide
-   standard to support such convention, different TTS engines tend to
-   generate comparably-loud audio signals when no gain or attenuation is
-   specified. For voice and soft music, -15dB RMS seems to be pretty
-   standard.
+   (e.g. signal compression), we refer to a simple scenario whereby the
+   attenuation is indicated in decibels, typically ranging from 0dB (i.e.
+   maximum audio input, near clipping threshold) to -60dB (i.e. total
+   silence). Given this context, a "standard" audio clip would oscillate
+   between these values, the loudest peak levels would be close to -3dB (to
+   avoid distortion), and the relevant audible passages would have average
+   (RMS) volume levels as high as possible (i.e. not too quiet, to avoid
+   background noise during amplification). This would roughly provide an
+   audio experience that could be seamlessly combined with text-to-speech
+   output (i.e. there would be no discernible difference in volume levels
+   when switching from pre-recorded audio to speech synthesis). Although
+   there exists no industry-wide standard to support such a convention,
+   different TTS engines tend to generate comparably loud audio signals when
+   no gain or attenuation is specified. For voice and soft music, an average
+   level of around -15dB RMS appears to be common.
 
   <h3 id=cue-props-cue><span class=secno>11.3. </span>The &lsquo;<a
    href="#cue"><code class=property>cue</code></a>&rsquo; shorthand property</h3>
@@ -2257,7 +2271,10 @@ <h3 id=voice-props-voice-rate><span class=secno>12.2. </span>The &lsquo;<a
      or otherwise to the inherited speaking rate (which may itself be a
      combination of a keyword value and of a percentage, in which case
      percentages are combined multiplicatively). For example, 50% means that
-     the speaking rate gets multiplied by 0.5 (half the value).</p>
+     the speaking rate gets multiplied by 0.5 (half the value). Percentages
+     above 100% result in faster speaking rates (relative to the base
+     keyword), whereas percentages below 100% result in slower speaking
+     rates.</p>
   </dl>
 
   <div class=example>
@@ -2289,7 +2306,7 @@ <h3 id=voice-props-voice-rate><span class=secno>12.2. </span>The &lsquo;<a
 e2 { voice-rate: fast 120%; } /* the computed value is
                           ['fast' and 120%], which will resolve
                           to the rate corresponding to 'fast'
-                          multiplied by 1.2 (one and a half times the speaking rate) */
+                          multiplied by 1.2 */
 
 e3 { voice-rate: normal; /* "resets" the speaking rate to the intrinsic voice value,
                             the computed value is 'normal' (see comment below for actual value) */
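The multiplicative treatment of 'voice-rate' percentages described in the hunks above can be sketched in Python. The base rate of 200 words per minute used for 'fast' is a hypothetical figure chosen for illustration (keyword rates are implementation-dependent), and the function name `resolve_rate` is likewise an assumption.

```python
from math import prod
from typing import Iterable


def resolve_rate(base_rate_wpm: float, percentages: Iterable[float]) -> float:
    """Apply a chain of percentage values to a base speaking rate.

    Percentages are combined multiplicatively: 120% multiplies the rate
    by 1.2, and 50% multiplies it by 0.5.
    """
    factor = prod(p / 100.0 for p in percentages)
    return base_rate_wpm * factor


if __name__ == "__main__":
    fast_wpm = 200.0  # hypothetical rate for the 'fast' keyword
    print(resolve_rate(fast_wpm, [120.0]))        # 'fast 120%'
    print(resolve_rate(fast_wpm, [120.0, 50.0]))  # a further 50% on top
```

Values above 100% yield a faster rate than the base keyword and values below 100% a slower one, matching the sentence added in this change.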
@@ -2772,25 +2789,25 @@ <h3 id=voice-props-voice-stress><span class=secno>12.5. </span>The
    <p>Examples of property values, with HTML sample:</p>
 
    <pre>
-span.default-emphasis { voice-stress: normal; }
-span.lowered-emphasis { voice-stress: reduced; }
-span.removed-emphasis { voice-stress: none; }
-span.normal-emphasis { voice-stress: moderate; }
-span.huge-emphasis { voice-stress: strong; }
+.default-emphasis { voice-stress: normal; }
+.lowered-emphasis { voice-stress: reduced; }
+.removed-emphasis { voice-stress: none; }
+.normal-emphasis { voice-stress: moderate; }
+.huge-emphasis { voice-stress: strong; }
 
 ...
 
 &lt;p&gt;This is a big car.&lt;/p&gt;
 &lt;!-- The speech output from the line above is identical to the line below: --&gt;
-&lt;p&gt;This is a &lt;span class="default-emphasis"&gt;big&lt;/span&gt; car.&lt;/p&gt;
+&lt;p&gt;This is a &lt;em class="default-emphasis"&gt;big&lt;/em&gt; car.&lt;/p&gt;
 
-&lt;p&gt;This car is &lt;span class="lowered-emphasis"&gt;massive&lt;/span&gt;!&lt;/p&gt;
-&lt;!-- The "span" below is totally de-emphasized, whereas the emphasis in the line above is only reduced: --&gt;
-&lt;p&gt;This car is &lt;span class="removed-emphasis"&gt;massive&lt;/span&gt;!&lt;/p&gt;
+&lt;p&gt;This car is &lt;em class="lowered-emphasis"&gt;massive&lt;/em&gt;!&lt;/p&gt;
+&lt;!-- The "em" below is totally de-emphasized, whereas the emphasis in the line above is only reduced: --&gt;
+&lt;p&gt;This car is &lt;em class="removed-emphasis"&gt;massive&lt;/em&gt;!&lt;/p&gt;
 
 &lt;!-- The lines below demonstrate increasing levels of emphasis: --&gt;
-&lt;p&gt;This is a &lt;span class="normal-emphasis"&gt;big&lt;/span&gt; car!&lt;/p&gt;
-&lt;p&gt;This is a &lt;span class="huge-emphasis"&gt;big&lt;/span&gt; car!!!&lt;/p&gt;</pre>
+&lt;p&gt;This is a &lt;em class="normal-emphasis"&gt;big&lt;/em&gt; car!&lt;/p&gt;
+&lt;p&gt;This is a &lt;em class="huge-emphasis"&gt;big&lt;/em&gt; car!!!&lt;/p&gt;</pre>
   </div>
 
   <h2 id=duration-props><span class=secno>13. </span>Voice duration property</h2>