
Commit b5657da

Apply documentation patch from Matthew Pocock. Thank you Matthew!
git-svn-id: https://svn.apache.org/repos/asf/commons/proper/codec/trunk@1201511 13f79535-47bb-0310-9956-ffa450edef68
1 parent b453bb6 commit b5657da

7 files changed

Lines changed: 157 additions & 8 deletions

File tree

src/main/java/org/apache/commons/codec/language/bm/BeiderMorseEncoder.java

Lines changed: 45 additions & 0 deletions
@@ -31,11 +31,56 @@
3131
* This encoder is intentionally mutable to allow dynamic configuration through bean properties. As such, it may not be
3232
* thread-safe. If you require a guaranteed thread-safe encoding then use {@link PhoneticEngine} directly.
3333
* </p>
34+
*
35+
* <h2>Encoding overview</h2>
36+
*
37+
* <p>
38+
* Beider-Morse phonetic encoding is a multi-step process. Firstly, a table of rules is consulted to guess what
39+
* language the word comes from. For example, if it ends in "<code>ault</code>" then it infers that the word is French. Next,
40+
* the word is translated into a phonetic representation using a language-specific phonetics table. Some runs of letters
41+
* can be pronounced in multiple ways, and a single run of letters may be potentially broken up into phonemes at
42+
* different places, so this stage results in a set of possible language-specific phonetic representations. Lastly,
43+
* this language-specific phonetic representation is processed by a table of rules that re-writes it phonetically taking
44+
* into account systematic pronunciation differences between languages, to move it towards a pan-indo-european phonetic
45+
* representation. Again, sometimes there are multiple ways this could be done and sometimes things that can be
46+
* pronounced in several ways in the source language have only one way to represent them in this average phonetic
47+
* language, so the result is again a set of phonetic spellings.
48+
* </p>
49+
*
50+
* <p>
51+
* Some names are treated as having multiple parts. This can be due to two things. Firstly, they may be hyphenated. In
52+
* this case, each individual hyphenated word is encoded, and then these are combined end-to-end for the final encoding.
53+
* Secondly, some names have standard prefixes, for example, "<code>Mac/Mc</code>" in Scottish (English) names. As sometimes it is
54+
* ambiguous whether the prefix is intended or is an accident of the spelling, the word is encoded once with the prefix
55+
* and once without it. The resulting encoding contains one and then the other result.
56+
* </p>
57+
*
58+
*
59+
* <h2>Encoding format</h2>
60+
*
61+
* Individual phonetic spellings of an input word are represented in upper- and lower-case roman characters. Where there
62+
* are multiple possible phonetic representations, these are joined with a pipe (<code>|</code>) character. If multiple hyphenated
63+
* words were found, or if the word may contain a name prefix, each encoded word is placed in parentheses and these blocks
64+
* are then joined with hyphens. For example, "<code>d'ortley</code>" has a possible prefix. The form without prefix encodes to
65+
* "<code>ortlaj|ortlej</code>", while the form with prefix encodes to "<code>dortlaj|dortlej</code>". Thus, the full, combined encoding is
66+
* "<code>(ortlaj|ortlej)-(dortlaj|dortlej)</code>".
67+
*
68+
* <p>
69+
* The encoded forms are often quite a bit longer than the input strings. This is because a single input may have many
70+
* potential phonetic interpretations. For example, "<code>Renault</code>" encodes to
71+
* "<code>rYnDlt|rYnalt|rYnult|rinDlt|rinalt|rinult</code>". The <code>APPROX</code> rules will tend to produce larger
72+
* encodings as they consider a wider range of possible, approximate phonetic interpretations of the original word.
73+
* Down-stream applications may wish to further process the encoding for indexing or lookup purposes, for example, by
74+
* splitting on pipe (<code>|</code>) and indexing under each of these alternatives.
75+
* </p>
3476
*
3577
* @author Apache Software Foundation
3678
* @since 1.6
3779
*/
3880
public class BeiderMorseEncoder implements StringEncoder {
81+
// implementation note: This class is a spring-friendly facade to PhoneticEngine. It allows read/write configuration
82+
// of an immutable PhoneticEngine instance that will be delegated to for the actual encoding.
83+
3984
// a cached object
4085
private PhoneticEngine engine = new PhoneticEngine(NameType.GENERIC, RuleType.APPROX, true);
4186
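The "Encoding format" Javadoc above suggests that downstream code may split an encoding on the pipe character and index each alternative. A minimal sketch of that post-processing follows. The class `BmEncodingSplitter` and its `split` method are hypothetical names that are not part of Commons Codec, and the sketch assumes the combined format is exactly the `(a|b)-(c|d)` shape described in the Javadoc.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Commons Codec: breaks a Beider-Morse
// encoding string into blocks, each holding its pipe-separated alternatives.
final class BmEncodingSplitter {

    static List<List<String>> split(String encoded) {
        List<List<String>> blocks = new ArrayList<>();
        // Assumed format: blocks are joined as "(a|b)-(c|d)";
        // an encoding with a single block carries no parentheses.
        for (String block : encoded.split("\\)-\\(")) {
            // Drop the outer parentheses left over at either end.
            String stripped = block.replace("(", "").replace(")", "");
            blocks.add(Arrays.asList(stripped.split("\\|")));
        }
        return blocks;
    }
}
```

For the `d'ortley` example from the Javadoc, `BmEncodingSplitter.split("(ortlaj|ortlej)-(dortlaj|dortlej)")` yields two blocks with two alternatives each, which an index could then store under every alternative.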

src/main/java/org/apache/commons/codec/language/bm/Lang.java

Lines changed: 7 additions & 0 deletions
@@ -71,6 +71,13 @@
7171
* @since 1.6
7272
*/
7373
public class Lang {
74+
// implementation note: This class is divided into two sections. The first part is a static factory interface that
75+
// exposes the LANGUAGE_RULES_RN resource as a Lang instance. The second part is the Lang instance methods that
76+
// encapsulate a particular language-guessing rule table and the language guessing itself.
77+
//
78+
// It may make sense in the future to expose the private constructor to allow power users to build custom language-
79+
// guessing rules, perhaps by marking it protected and allowing sub-classing. However, the vast majority of users
80+
// should be strongly encouraged to use the static factory <code>instance</code> method to get their Lang instances.
7481

7582
private static final class LangRule {
7683
private final boolean acceptOnMatch;

src/main/java/org/apache/commons/codec/language/bm/Languages.java

Lines changed: 3 additions & 0 deletions
@@ -53,6 +53,9 @@
5353
* @since 1.6
5454
*/
5555
public class Languages {
56+
// implementation note: This class is divided into two sections. The first part is a static factory interface that
57+
// exposes org/apache/commons/codec/language/bm/%s_languages.txt for %s in NameType.* as a list of supported
58+
// languages, and a second part that provides instance methods for accessing this set of supported languages.
5659

5760
/**
5861
* A set of languages.

src/main/java/org/apache/commons/codec/language/bm/NameType.java

Lines changed: 3 additions & 1 deletion
@@ -18,7 +18,9 @@
1818
package org.apache.commons.codec.language.bm;
1919

2020
/**
21-
* Supported types of names. Unless you are matching particular family names, use {@link #GENERIC}.
21+
* Supported types of names. Unless you are matching particular family names, use {@link #GENERIC}. The
22+
* <code>GENERIC</code> NameType should work reasonably well for non-name words. The other encodings are specifically
23+
* tuned to family names, and may not work well at all for general text.
2224
*
2325
* @author Apache Software Foundation
2426
* @since 1.6

src/main/java/org/apache/commons/codec/language/bm/PhoneticEngine.java

Lines changed: 87 additions & 5 deletions
@@ -51,8 +51,23 @@
5151
*/
5252
public class PhoneticEngine {
5353

54+
/**
55+
* Utility for manipulating a set of phonemes as they are being built up. Not intended for use outside this package,
56+
* and probably not outside the {@link PhoneticEngine} class.
57+
*
58+
* @author Apache Software Foundation
59+
* @since 1.6
60+
*/
5461
static final class PhonemeBuilder {
5562

63+
/**
64+
* An empty builder where all phonemes must come from some set of languages. This will contain a single
65+
* phoneme of zero characters. This can then be appended to. This should be the only way to create a new
66+
* phoneme from scratch.
67+
*
68+
* @param languages the set of languages
69+
* @return a new, empty phoneme builder
70+
*/
5671
public static PhonemeBuilder empty(Languages.LanguageSet languages) {
5772
return new PhonemeBuilder(Collections.singleton(new Rule.Phoneme("", languages)));
5873
}
@@ -63,6 +78,12 @@ private PhonemeBuilder(Set<Rule.Phoneme> phonemes) {
6378
this.phonemes = phonemes;
6479
}
6580

81+
/**
82+
* Create a new phoneme builder containing all phonemes in this one extended by <code>str</code>.
83+
*
84+
* @param str the characters to append to the phonemes
85+
* @return a new phoneme builder lengthened by <code>str</code>
86+
*/
6687
public PhonemeBuilder append(CharSequence str) {
6788
Set<Rule.Phoneme> newPhonemes = new HashSet<Rule.Phoneme>();
6889

@@ -73,6 +94,16 @@ public PhonemeBuilder append(CharSequence str) {
7394
return new PhonemeBuilder(newPhonemes);
7495
}
7596

97+
/**
98+
* Create a new phoneme builder containing the application of the expression to all phonemes in this builder.
99+
*
100+
* This will lengthen phonemes that have compatible language sets to the expression, and drop those that are
101+
* incompatible.
102+
*
103+
* @param phonemeExpr the expression to apply
104+
* @return a new phoneme builder containing the results of <code>phonemeExpr</code> applied to each phoneme
105+
* in turn
106+
*/
76107
public PhonemeBuilder apply(Rule.PhonemeExpr phonemeExpr) {
77108
Set<Rule.Phoneme> newPhonemes = new HashSet<Rule.Phoneme>();
78109

@@ -88,10 +119,22 @@ public PhonemeBuilder apply(Rule.PhonemeExpr phonemeExpr) {
88119
return new PhonemeBuilder(newPhonemes);
89120
}
90121

122+
/**
123+
* The underlying phoneme set. Please don't mutate.
124+
*
125+
* @return the phoneme set
126+
*/
91127
public Set<Rule.Phoneme> getPhonemes() {
92128
return this.phonemes;
93129
}
94130

131+
/**
132+
* Stringify the phoneme set. This produces a single string of the strings of each phoneme, joined with a pipe.
133+
* This is explicitly provided in place of toString as it is a potentially expensive operation, which should be
134+
* avoided when debugging.
135+
*
136+
* @return the stringified phoneme set
137+
*/
95138
public String makeString() {
96139

97140
StringBuilder sb = new StringBuilder();
@@ -108,6 +151,17 @@ public String makeString() {
108151
}
109152
}
110153

154+
/**
155+
* A function closure capturing the application of a list of rules to an input sequence at a particular offset.
156+
* After invocation, the values <code>i</code> and <code>found</code> are updated. <code>i</code> points to the
157+
* index of the next char in <code>input</code> that must be processed next (the input up to that index having been
158+
* processed already), and <code>found</code> indicates if a matching rule was found or not. In the case where a
159+
* matching rule was found, <code>phonemeBuilder</code> is replaced with a new builder containing the phonemes
160+
* updated by the matching rule.
161+
*
162+
* @author Apache Software Foundation
163+
* @since 1.6
164+
*/
111165
private static final class RulesApplication {
112166
private final List<Rule> finalRules;
113167
private final CharSequence input;
@@ -134,6 +188,13 @@ public PhonemeBuilder getPhonemeBuilder() {
134188
return this.phonemeBuilder;
135189
}
136190

191+
/**
192+
* This invokes the rules. It loops over the rules list, stopping at the first one that has a matching context
193+
* and pattern. It then applies this rule to the phoneme builder to produce updated phonemes. If there was no
194+
* match, <code>i</code> is advanced by one and the character is silently dropped from the phonetic spelling.
195+
*
196+
* @return <code>this</code>
197+
*/
137198
public RulesApplication invoke() {
138199
this.found = false;
139200
int patternLength = 0;
@@ -176,6 +237,12 @@ public boolean isFound() {
176237
"de la", "della", "des", "di", "do", "dos", "du", "van", "von"))));
177238
}
178239

240+
/**
241+
* This is a performance hack to avoid overhead associated with very frequent CharSequence.subSequence calls.
242+
*
243+
* @param cached the character sequence to cache
244+
* @return a <code>CharSequence</code> that internally memoises subSequence values
245+
*/
179246
private static CharSequence cacheSubSequence(final CharSequence cached) {
180247
// return cached;
181248
final CharSequence[][] cache = new CharSequence[cached.length()][cached.length()];
@@ -203,6 +270,12 @@ public CharSequence subSequence(int start, int end) {
203270
};
204271
}
205272

273+
/**
274+
* Join some strings with an internal separator.
275+
* @param strings Strings to join
276+
* @param sep String to separate them with
277+
* @return a single String consisting of each element of <code>strings</code> interleaved by <code>sep</code>
278+
*/
206279
private static String join(Iterable<String> strings, String sep) {
207280
StringBuilder sb = new StringBuilder();
208281
Iterator<String> si = strings.iterator();
@@ -244,6 +317,14 @@ public PhoneticEngine(NameType nameType, RuleType ruleType, boolean concat) {
244317
this.lang = Lang.instance(nameType);
245318
}
246319

320+
/**
321+
* Apply the final rules to convert from a language-specific phonetic representation to a language-independent
322+
* representation.
323+
*
324+
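* @param phonemeBuilder the builder holding the language-specific phonemes
325+
* @param finalRules the final rules to apply
326+
* @return a phoneme builder holding the language-independent phonemes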
327+
*/
247328
private PhonemeBuilder applyFinalRules(PhonemeBuilder phonemeBuilder, List<Rule> finalRules) {
248329
if (finalRules == null) {
249330
throw new NullPointerException("finalRules can not be null");
@@ -304,8 +385,11 @@ public String encode(String input) {
304385
*/
305386
public String encode(String input, final Languages.LanguageSet languageSet) {
306387
final List<Rule> rules = Rule.getInstance(this.nameType, RuleType.RULES, languageSet);
388+
// rules common across many (all) languages
307389
final List<Rule> finalRules1 = Rule.getInstance(this.nameType, this.ruleType, "common");
390+
// rules that apply to a specific language that may be ambiguous or wrong if applied to other languages
308391
final List<Rule> finalRules2 = Rule.getInstance(this.nameType, this.ruleType, languageSet);
392+
309393
// System.err.println("Languages: " + languageSet);
310394
// System.err.println("Rules: " + rules);
311395

@@ -333,6 +417,7 @@ public String encode(String input, final Languages.LanguageSet languageSet) {
333417
final List<String> words = Arrays.asList(input.split("\\s+"));
334418
final List<String> words2 = new ArrayList<String>();
335419

420+
// special-case handling of word prefixes based upon the name type
336421
switch (this.nameType) {
337422
case SEPHARDIC:
338423
for (String aWord : words) {
@@ -380,13 +465,10 @@ public String encode(String input, final Languages.LanguageSet languageSet) {
380465
// System.err.println(input + " " + i + ": " + phonemeBuilder.makeString());
381466
}
382467

383-
// System.err.println("Applying general rules");
468+
// Apply the general rules
384469
phonemeBuilder = applyFinalRules(phonemeBuilder, finalRules1);
385-
// System.err.println("Now got: " + phonemeBuilder.makeString());
386-
// System.err.println("Applying language-specific rules");
470+
// Apply the language-specific rules
387471
phonemeBuilder = applyFinalRules(phonemeBuilder, finalRules2);
388-
// System.err.println("Now got: " + phonemeBuilder.makeString());
389-
// System.err.println("Done");
390472

391473
return phonemeBuilder.makeString();
392474
}
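The `PhonemeBuilder` Javadoc above describes an immutable set of partial spellings that starts with one empty phoneme and grows as rules are applied. The sketch below illustrates that idea only. `SpellingSet` is a hypothetical, simplified stand-in that uses plain strings and ignores the language-set compatibility check that the real `apply` performs.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical simplification of PhonemeBuilder: plain strings instead of
// Rule.Phoneme, and no language sets, so nothing is ever dropped.
final class SpellingSet {
    private final Set<String> spellings;

    private SpellingSet(Set<String> spellings) {
        this.spellings = spellings;
    }

    // Starts with a single spelling of zero characters, as empty() does above.
    static SpellingSet empty() {
        Set<String> initial = new LinkedHashSet<>();
        initial.add("");
        return new SpellingSet(initial);
    }

    // Returns a new set in which every existing spelling is extended by every
    // alternative; the receiver is left unchanged.
    SpellingSet apply(List<String> alternatives) {
        Set<String> next = new LinkedHashSet<>();
        for (String left : spellings) {
            for (String right : alternatives) {
                next.add(left + right);
            }
        }
        return new SpellingSet(next);
    }

    // Joins the spellings with a pipe, mirroring makeString().
    String makeString() {
        return String.join("|", spellings);
    }
}
```

Applying one alternative `r`, then two alternatives `Y` and `i`, then `n` gives `rYn|rin`. This is why a single ambiguous run of letters multiplies the number of spellings in the final encoding.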

src/main/java/org/apache/commons/codec/language/bm/Rule.java

Lines changed: 6 additions & 1 deletion
@@ -583,7 +583,9 @@ public RPattern getRContext() {
583583
}
584584

585585
/**
586-
* Decides if the pattern and context match the input starting at a position.
586+
* Decides if the pattern and context match the input starting at a position. It is a match if the
587+
* <code>lContext</code> matches <code>input</code> up to <code>i</code>, <code>pattern</code> matches at i and
588+
* <code>rContext</code> matches from the end of the match of <code>pattern</code> to the end of <code>input</code>.
587589
*
588590
* @param input
589591
* the input String
@@ -604,6 +606,9 @@ public boolean patternAndContextMatches(CharSequence input, int i) {
604606
return false;
605607
}
606608

609+
// fixme: this is a readability/speed trade-off - these 3 expressions should be inlined for speed to avoid
610+
// evaluating later ones if earlier ones have already failed, but that would make the code a lot harder to
611+
// read
607612
boolean patternMatches = input.subSequence(i, ipl).equals(this.pattern);
608613
boolean rContextMatches = this.rContext.isMatch(input.subSequence(ipl, input.length()));
609614
boolean lContextMatches = this.lContext.isMatch(input.subSequence(0, i));
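The `fixme` note above describes a trade-off: three named booleans read well, but all three are always evaluated. One possible middle ground is to chain the checks with `&&`, which short-circuits as soon as one fails. The sketch below is hypothetical. `SimpleRule` is not the Commons Codec `Rule` class, and it assumes the contexts are plain regular expressions anchored to the end of the left context and the start of the right context.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for Rule: the pattern is a literal string and the
// two contexts are regular expressions (an assumption of this sketch).
final class SimpleRule {
    private final String pattern;
    private final Pattern lContext;
    private final Pattern rContext;

    SimpleRule(String pattern, String lContext, String rContext) {
        this.pattern = pattern;
        // The left context must match immediately before the pattern.
        this.lContext = Pattern.compile(lContext + "$");
        // The right context must match immediately after the pattern.
        this.rContext = Pattern.compile("^" + rContext);
    }

    boolean matches(CharSequence input, int i) {
        int end = i + pattern.length();
        if (i < 0 || end > input.length()) {
            return false; // the pattern cannot fit at this offset
        }
        // && stops at the first failing check, so the regex matchers only
        // run when the cheap literal comparison has already succeeded.
        return input.subSequence(i, end).toString().equals(pattern)
                && rContext.matcher(input.subSequence(end, input.length())).find()
                && lContext.matcher(input.subSequence(0, i)).find();
    }
}
```

With `new SimpleRule("au", "", "lt")`, the call `matches("renault", 3)` succeeds because `au` sits at offset 3 and is followed by `lt`, while `matches("renault", 2)` fails on the literal comparison without touching either context.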

src/main/java/org/apache/commons/codec/language/bm/RuleType.java

Lines changed: 6 additions & 1 deletion
@@ -25,7 +25,12 @@
2525
*/
2626
public enum RuleType {
2727

28-
APPROX("approx"), EXACT("exact"), RULES("rules");
28+
/** Approximate rules, which will lead to the largest number of phonetic interpretations. */
29+
APPROX("approx"),
30+
/** Exact rules, which will lead to a minimum number of phonetic interpretations. */
31+
EXACT("exact"),
32+
/** For internal use only. Please use {@link #APPROX} or {@link #EXACT}. */
33+
RULES("rules");
2934

3035
private final String name;
3136
