高远

Discovering Prefixes and Suffixes

Posted on September 25, 2012

language, text processing, morphology, coding, unix, shell script

English is a language that has a concatenative morphology. For example, the word preinstallation can be split into three parts: a prefix pre-, a stem install and a suffix -ation. Given a corpus of English words, how do we discover the prefixes and suffixes computationally? This is an interesting question, for we can use similar methods to analyze other languages that also have a concatenative morphology.

The corpus of English words can be represented by a forward tree (and a backward tree). The root of the forward tree is the token that denotes the start of a word. Each node in the tree corresponds to a character, so a word in the corpus is a path starting from the root. Similarly, for the backward tree the root is the end-of-word token, and a path starting from the root corresponds to a word read backwards.
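
In practice the tree can be kept implicit: a node at depth k is just a k-character prefix shared by some set of words, and its frequency is the size of that set. Here is a minimal sketch of the idea (the three-word toy corpus is made up for illustration; the real scripts at the end of the post use SRILM instead of plain awk):

# toy corpus: every prefix of every word is a node of the forward tree,
# and the number of words sharing that prefix is the node's frequency
printf 'cat\ncar\ncab\n' | sed 's/./& /g;s/ $//' |
awk '{ path = "<s>"
       for (i = 1; i <= NF; i++) { path = path " " $i; cnt[path]++ } }
     END { for (n in cnt) print n, cnt[n] }' | sort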

After building this tree representation, we can calculate the frequency of each node, defined as the frequency of the path from the root to that node. Since prefixes (and suffixes) tend to appear in many words, at a fixed character-sequence length, or equivalently a fixed depth in the tree, the high-frequency candidates are the more likely prefixes (or suffixes). For instance, if we sort the forward-tree candidates at depth 5 (the first four characters of a word) by frequency, we get

candidate   freq
OVER 378
INTE 338
COMP 233
CONS 212
UNDE 205
TRAN 183
CONT 179
STRA 150
COMM 142
PRES 133
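
For reference, a list like this can be produced with a couple of commands; a sketch, assuming the file CH (one character-separated word per line) built by the full script at the end of the post:

# count all 5-grams, keep the word-initial ones (those starting with the
# <s> token), and sort by the count stored in the last (6th) field
ngram-count -text CH -order 5 -write5 CH.ORDER5
grep '^<s>' CH.ORDER5 | sort -k6 -rn | head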

It seems that we can capture some prefixes solely by their frequencies. The top of the list, OVER, is indeed a common prefix. However, many of the high-frequency candidates are not actually prefixes, like INTE and UNDE. Why do they appear so often? Because INTER and UNDER are prefixes. So high frequency by itself is not a sufficient criterion for a prefix.

It turns out we can eliminate the bad ones by the distribution of their children. Almost every occurrence of INTE is followed by R, so the distribution of the children of INTE should be highly biased. The usual measure of this bias is entropy, defined as \(-\sum_ip_i \ln p_i\), where \(p_i\) is the probability of the i-th child. A high entropy implies the children are evenly distributed. A node with a high entropy is evidence for a good cut between prefix and stem: various stems can be attached to the same prefix, so the character following the prefix is rather unpredictable.
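
As a quick sanity check, the same computation can be done in a few lines of awk (the child counts 90, 5 and 5 are made up): a heavily biased node like this one gets an entropy of about 0.39, while three equally likely children would give \(\ln 3 \approx 1.1\).

# entropy of a node, reading one child count per line
awk '{ sum += $1; a[++n] = $1 }
     END { for (i = 1; i <= n; i++) { p = a[i]/sum; h -= p*log(p) }
           print h }' <<EOF
90
5
5
EOF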

There are, however, cases where the boundary of a prefix or suffix does not have a high entropy. The suffix -ing is a good example: ling happens to appear more often than other -ing endings, like ting or ming, so ing itself does not have a very high entropy. Combining the frequency and entropy criteria helps a lot, but some corner cases remain.

What if we print the frequency and entropy path of each candidate? Below we select some of the top depth-6 candidates that have both high frequencies and high entropies, and print the path of each one back toward the root:

candidate   freq   entropy
I N T E R 275 2.71212
I N T E 338 0.745482
I N T 439 0.818788
I N 1663 2.41109
I 2849 1.64657
M I C R O 87 2.63535
M I C R 87 0
M I C 179 1.33012
M I 1377 2.23578
M 8424 1.7602
U N D E R 187 2.67801
U N D E 205 0.466284
U N D 247 0.711374
U N 1109 2.68328
U 1608 1.33352
W A T E R 51 2.52729
W A T E 51 0
W A T 99 1.67263
W A 854 2.45031
W 3543 1.7672

The pattern becomes clear: a prefix should be one whose parent has a low entropy while the prefix itself has a high entropy. The same applies to suffixes. Now if we look at the entropy path of ing, we find

candidate   entropy
ING 2.66174
NG 0.282628
G 0.646555

Since NG has a low entropy and ING has a high entropy, it becomes clear that ING should be a suffix.
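
This rule is easy to apply mechanically: keep a candidate only if its own entropy is high and its parent's entropy is low. A sketch using the entropies from the tables above, with the thresholds 2.0 and 1.0 picked by hand for illustration:

# each line: candidate, its entropy, its parent's entropy;
# the thresholds below are illustrative, not tuned
awk '$2 > 2.0 && $3 < 1.0 { print $1 }' <<EOF
INTER 2.71212 0.745482
INTE 0.745482 0.818788
ING 2.66174 0.282628
EOF
# prints INTER and ING; INTE is rejected because its own entropy is low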

The experiment above uses the CMU Pronouncing Dictionary as its corpus. Below is the shell script that generates the frequencies and entropies for the forward and backward trees. It uses ngram-count, the SRILM tool that generates all the n-grams found in a training text.

#!/bin/bash
#--Generate frequency and entropy for forward & backward trees.--
# == parameters ==
DICT=../data/cmudict.0.7a.txt
ORDER=7
# == function ngramFrequencyEntropy ==
# Generate 1-$2 order grams with frequency and entropy
# $1 : file name
# $2 : order
# $3 : output file (e.g. ch.order7)
ngramFrequencyEntropy(){
  CMD="ngram-count -text $1 -order $2"
  for (( i = 1; i <= $2; i ++ )); do
    CMD=$CMD" -write$i $1.ORDER$i"
  done
  eval $CMD
  for (( i = 1; i <= $2; i ++ )); do
    grep '^<s>' $1.ORDER$i > $1.ORDER$i.PRE # keep word-initial ngrams only
  done
  eval "cat $1.ORDER[1-$(($2-1))].PRE > $3" # cat all the prefixes
  # add the entropy of each node
  LINENUM=1
  AWKCMD='{sum+=$NF; a[++i]=$NF } END ' # built in pieces; it's too long
  AWKCMD+='{ for(j=1;j<=i;j++) {p = a[j]/sum; entropy += -p*log(p)} } END '
  AWKCMD+='{print entropy}'
  for (( i = 1; i < $2; i ++ )); do
    while IFS= read -r line; do
      # strip the count, find the node's children in the next order,
      # and compute the entropy of their count distribution
      ENTROPY=`echo "$line" | rev | cut -f2- | rev | \
      grep -f- $1.ORDER$(($i+1)).PRE | awk "$AWKCMD"`
      if [ "x$ENTROPY" = "x" ]; then
        ENTROPY=0
      fi
      awk -v ent=$ENTROPY -v ln=$LINENUM \
      '{if (NR==ln) {print $0, ent} else {print $0}}' $3 > TEMP && mv TEMP $3
      LINENUM=$(($LINENUM+1))
    done < $1.ORDER$i.PRE
  done
  # split the combined file back into the per-order files (ch.order3 etc.)
  # that the candidate and path scripts below expect
  BASE=${3%$2} # e.g. ch.order7 -> ch.order
  START=1
  for (( i = 1; i < $2; i ++ )); do
    CNT=`wc -l < $1.ORDER$i.PRE`
    sed -n "${START},$(($START+$CNT-1))p" $3 > $BASE$i
    START=$(($START+$CNT))
  done
}
# == Generate phoneme and grapheme suffix and prefix trees ==
echo 'Generating grapheme trees ...'
sed '/;;;/d;s/  .*//g;/[^A-Za-z]/d' $DICT | \
uniq | sed 's/./ &/g;s/^ //' > CH # char-only
ngramFrequencyEntropy CH $ORDER ch.order$ORDER
rev CH > CH_REV
ngramFrequencyEntropy CH_REV $ORDER ch_rev.order$ORDER
echo 'Generating phoneme trees ...'
sed '/;;;/d;s/.*  //g' $DICT > PH # phone-only
ngramFrequencyEntropy PH $ORDER ph.order$ORDER
rev PH > PH_REV
ngramFrequencyEntropy PH_REV $ORDER ph_rev.order$ORDER
# == Remove temp files ==
rm CH CH_REV PH PH_REV *ORDER*

Below is the code that generates possible candidates for prefixes and suffixes by selecting those that have both high frequencies and high entropies:

#!/bin/bash
# -- generate the candidates using frequency and entropy --
# == parameters ==
ORDER=7
FREQ_CUTOFF=50
ENTR_CUTOFF=50
topFreqEntr(){
  # for an order-$i line, field $i+1 is the count and field $i+2 the entropy
  for (( i = 3; i < $2; i ++ )); do
    sort -k$((i+1)) -r -n $1.order$i | head -$FREQ_CUTOFF > topfreq
    sort -k$((i+2)) -r -n $1.order$i | head -$ENTR_CUTOFF > topentr
    sort -o topfreq topfreq
    sort -o topentr topentr
    comm -12 topfreq topentr | sed "s/^\t//g" > $1.order$i.candidates
  done
}
topFreqEntr ch $ORDER
topFreqEntr ph $ORDER
topFreqEntr ch_rev $ORDER
topFreqEntr ph_rev $ORDER
# == clear junk files ==
rm topfreq topentr

Below is the code that prints the frequency and entropy path of candidates:

#!/bin/bash
# -- print the path of the candidates --
# == parameters ==
ORDER=7
printPath(){
  while IFS= read -r line; do
    NUM_COL=`expr $2 - 1`
    echo $line
    # drop one character at a time and look the shorter path up
    # in the corresponding lower-order file
    while [ $NUM_COL -ge 2 ]; do
      echo $line | cut -d' ' -f 1-$NUM_COL | grep -f- $1.order$NUM_COL
      NUM_COL=`expr $NUM_COL - 1`
    done
    echo ""
  done < $1.order$2.candidates
}
for (( i = 3; i < $ORDER; i++ )); do
  printPath ch $i > ch.order$i.candidates.path
  printPath ph $i > ph.order$i.candidates.path
  printPath ch_rev $i > ch_rev.order$i.candidates.path
  printPath ph_rev $i > ph_rev.order$i.candidates.path
done