23 | 23 | /** |
24 | 24 | * Encodes strings into their Beider-Morse phonetic encoding. |
25 | 25 | * <p> |
26 | | - * Beider-Morse phonetic encodings are optimised for family names. However, they may be useful for a wide range |
27 | | - * of words. |
| 26 | + * Beider-Morse phonetic encodings are optimised for family names. However, they may be useful for a wide range of |
| 27 | + * words. |
28 | 28 | * <p> |
29 | | - * This encoder is intentionally mutable to allow dynamic configuration through bean properties. As such, it |
30 | | - * is mutable, and may not be thread-safe. If you require a guaranteed thread-safe encoding then use |
31 | | - * {@link PhoneticEngine} directly. |
| 29 | + * This encoder is intentionally mutable to allow dynamic configuration through bean properties. As such, it is mutable, |
| 30 | + * and may not be thread-safe. If you require a guaranteed thread-safe encoding then use {@link PhoneticEngine} |
| 31 | + * directly. |
32 | 32 | * <p> |
33 | 33 | * <b>Encoding overview</b> |
34 | 34 | * <p> |
35 | 35 | * Beider-Morse phonetic encoding is a multi-step process. Firstly, a table of rules is consulted to guess what |
36 | 36 | * language the word comes from. For example, if it ends in "<code>ault</code>" then it infers that the word is French. |
37 | | - * Next, the word is translated into a phonetic representation using a language-specific phonetics table. Some |
38 | | - * runs of letters can be pronounced in multiple ways, and a single run of letters may be potentially broken up |
39 | | - * into phonemes at different places, so this stage results in a set of possible language-specific phonetic |
40 | | - * representations. Lastly, this language-specific phonetic representation is processed by a table of rules that |
41 | | - * re-writes it phonetically taking into account systematic pronunciation differences between languages, to move |
42 | | - * it towards a pan-indo-european phonetic representation. Again, sometimes there are multiple ways this could be |
43 | | - * done and sometimes things that can be pronounced in several ways in the source language have only one way to |
44 | | - * represent them in this average phonetic language, so the result is again a set of phonetic spellings. |
| 37 | + * Next, the word is translated into a phonetic representation using a language-specific phonetics table. Some runs of |
| 38 | + * letters can be pronounced in multiple ways, and a single run of letters may be potentially broken up into phonemes at |
| 39 | + * different places, so this stage results in a set of possible language-specific phonetic representations. Lastly, this |
| 40 | + * language-specific phonetic representation is processed by a table of rules that re-writes it phonetically taking into |
| 41 | + * account systematic pronunciation differences between languages, to move it towards a pan-Indo-European phonetic |
| 42 | + * representation. Again, sometimes there are multiple ways this could be done and sometimes things that can be |
| 43 | + * pronounced in several ways in the source language have only one way to represent them in this average phonetic |
| 44 | + * language, so the result is again a set of phonetic spellings. |
45 | 45 | * <p> |
46 | | - * Some names are treated as having multiple parts. This can be due to two things. Firstly, they may be hyphenated. |
47 | | - * In this case, each individual hyphenated word is encoded, and then these are combined end-to-end for the final |
48 | | - * encoding. Secondly, some names have standard prefixes, for example, "<code>Mac/Mc</code>" in Scottish (English) |
49 | | - * names. As sometimes it is ambiguous whether the prefix is intended or is an accident of the spelling, the word |
50 | | - * is encoded once with the prefix and once without it. The resulting encoding contains one and then the other result. |
| 46 | + * Some names are treated as having multiple parts. This can be due to two things. Firstly, they may be hyphenated. In |
| 47 | + * this case, each individual hyphenated word is encoded, and then these are combined end-to-end for the final encoding. |
| 48 | + * Secondly, some names have standard prefixes, for example, "<code>Mac/Mc</code>" in Scottish (English) names. As |
| 49 | + * sometimes it is ambiguous whether the prefix is intended or is an accident of the spelling, the word is encoded once |
| 50 | + * with the prefix and once without it. The resulting encoding contains one and then the other result. |
51 | 51 | * <p> |
52 | 52 | * <b>Encoding format</b> |
53 | 53 | * <p> |
54 | | - * Individual phonetic spellings of an input word are represented in upper- and lower-case roman characters. Where |
55 | | - * there are multiple possible phonetic representations, these are joined with a pipe (<code>|</code>) character. |
56 | | - * If multiple hyphenated words where found, or if the word may contain a name prefix, each encoded word is placed |
57 | | - * in elipses and these blocks are then joined with hyphens. For example, "<code>d'ortley</code>" has a possible |
58 | | - * prefix. The form without prefix encodes to "<code>ortlaj|ortlej</code>", while the form with prefix encodes to |
59 | | - * "<code>dortlaj|dortlej</code>". Thus, the full, combined encoding is "<code>(ortlaj|ortlej)-(dortlaj|dortlej)</code>". |
| 54 | + * Individual phonetic spellings of an input word are represented in upper- and lower-case roman characters. Where there |
| 55 | + * are multiple possible phonetic representations, these are joined with a pipe (<code>|</code>) character. If multiple |
| 56 | + * hyphenated words were found, or if the word may contain a name prefix, each encoded word is placed in parentheses and |
| 57 | + * these blocks are then joined with hyphens. For example, "<code>d'ortley</code>" has a possible prefix. The form |
| 58 | + * without prefix encodes to "<code>ortlaj|ortlej</code>", while the form with prefix encodes to " |
| 59 | + * <code>dortlaj|dortlej</code>". Thus, the full, combined encoding is "<code>(ortlaj|ortlej)-(dortlaj|dortlej)</code>". |
60 | 60 | * <p> |
61 | 61 | * The encoded forms are often quite a bit longer than the input strings. This is because a single input may have many |
62 | | - * potential phonetic interpretations. For example, "<code>Renault</code>" encodes to |
63 | | - * "<code>rYnDlt|rYnalt|rYnult|rinDlt|rinalt|rinult</code>". The <code>APPROX</code> rules will tend to produce larger |
| 62 | + * potential phonetic interpretations. For example, "<code>Renault</code>" encodes to " |
| 63 | + * <code>rYnDlt|rYnalt|rYnult|rinDlt|rinalt|rinult</code>". The <code>APPROX</code> rules will tend to produce larger |
64 | 64 | * encodings as they consider a wider range of possible, approximate phonetic interpretations of the original word. |
65 | 65 | * Down-stream applications may wish to further process the encoding for indexing or lookup purposes, for example, by |
66 | 66 | * splitting on pipe (<code>|</code>) and indexing under each of these alternatives. |
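
The down-stream splitting suggested in the Javadoc could look roughly like the sketch below. It relies only on the format described in the comment: alternatives joined with a pipe, and multi-part names written as parenthesised blocks joined with hyphens. The class `BmEncodingSplitter` and its `split` method are hypothetical helpers, not part of Commons Codec, and the code assumes well-formed input in exactly that format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not part of Commons Codec. */
final class BmEncodingSplitter {

    /**
     * Splits an encoded string into blocks, each holding its alternative spellings.
     * Assumes the format described in the Javadoc: "a|b" or "(a|b)-(c|d)".
     */
    static List<List<String>> split(String encoded) {
        List<List<String>> blocks = new ArrayList<>();
        if (encoded.startsWith("(") && encoded.endsWith(")")) {
            // Multi-part form: drop the outer parentheses, then cut at each ")-(" boundary.
            String inner = encoded.substring(1, encoded.length() - 1);
            for (String block : inner.split("\\)-\\(")) {
                blocks.add(Arrays.asList(block.split("\\|")));
            }
        } else {
            // Single-part form: only pipe-separated alternatives.
            blocks.add(Arrays.asList(encoded.split("\\|")));
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Both inputs are the example encodings quoted in the Javadoc.
        System.out.println(split("(ortlaj|ortlej)-(dortlaj|dortlej)"));
        System.out.println(split("rYnDlt|rYnalt|rYnult|rinDlt|rinalt|rinult"));
    }
}
```

An index could then store the original name under every alternative in every block, so that a lookup on any one phonetic spelling finds it. Whether blocks should be indexed separately or combined is an application decision that the Javadoc leaves open.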
|