Object
This is the Porter 2 stemming algorithm, as described at snowball.tartarus.org/algorithms/english/stemmer.html. The original paper is:
Porter, 1980, “An algorithm for suffix stripping”, Program, Vol. 14, no. 3, pp 130-137
Constants for the stemmer are in the Porter2 module.
Procedures that implement the stemmer are added to the String class.
The stemmer algorithm is implemented in the porter2_stem procedure.
There isn’t much internationalisation support, as this stemmer only works for English.
The gb_english flag to the various procedures allows the stemmer to treat the British English ’-ise’ the same as the American English ’-ize’.
Several places in the algorithm require matching the longest suffix of a word. The regexp engine in Ruby 1.9 seems to handle alternatives in regexps by finding the alternative that matches at the first position in the string. As we’re only talking about suffixes, that first match is also the longest suffix. If the regexp engine changes, this behaviour may change and break the stemmer.
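A minimal sketch of the behaviour described above. The helper name longest_suffix is hypothetical and is not part of lib/porter2.rb. It assumes the backtracking behaviour of current Ruby regexp engines: alternatives are tried in order at each start position, and the earliest start position that leads to a full match wins. With an end anchor, the earliest start is necessarily the longest suffix, whatever the order of the alternatives.

```ruby
# Hypothetical helper, not part of lib/porter2.rb.
# Returns the longest of the given suffixes that ends the word, or nil.
def longest_suffix(word, suffixes)
  alternatives = suffixes.map { |s| Regexp.escape(s) }.join("|")
  # The end anchor forces every candidate match to finish at the end of the
  # word, so the leftmost match is also the longest suffix.
  word[/(?:#{alternatives})\z/]
end

longest_suffix("hopefully", %w[ly fully])   # => "fully"
longest_suffix("hopefully", %w[fully ly])   # => "fully"

# Without the anchor, the order of the alternatives decides the result:
"fully"[/ful|fully/]                        # => "ful"
```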
Returns true if the word ends with a short syllable
    # File lib/porter2.rb, line 87
    def porter2_ends_with_short_syllable?
      self =~ /#{Porter2::SHORT_SYLLABLE}$/ ? true : false
    end
A word is short if it ends in a short syllable, and R1 is null
    # File lib/porter2.rb, line 93
    def porter2_is_short_word?
      self.porter2_ends_with_short_syllable? and self.porter2_r1.empty?
    end
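The two predicates above can be sketched as standalone functions. The constants of the Porter2 module are not shown on this page, so the patterns below are assumptions taken from the Snowball description (a short syllable is a non-vowel, a vowel, then a non-vowel other than w, x or Y, or a vowel at the start of the word followed by a non-vowel). The module name ShortWordSketch and its method names are hypothetical, and the R1 helper ignores the ‘gener-’, ‘commun-’ and ‘arsen-’ exceptions.

```ruby
# Hypothetical sketch; patterns are assumptions based on the Snowball
# description, not the actual constants in the Porter2 module.
module ShortWordSketch
  V = "[aeiouy]"
  C = "[^aeiouy]"
  SHORT_SYLLABLE = "(?:#{C}#{V}[^aeiouywxY]|^#{V}#{C})"

  module_function

  def ends_with_short_syllable?(word)
    word.match?(/#{SHORT_SYLLABLE}\z/)
  end

  # Simplified R1: everything after the first non-vowel that follows a vowel.
  def r1(word)
    word[/#{V}#{C}(.*)\z/, 1] || ""
  end

  def short_word?(word)
    ends_with_short_syllable?(word) && r1(word).empty?
  end
end

ShortWordSketch.ends_with_short_syllable?("entrap")  # => true
ShortWordSketch.ends_with_short_syllable?("bestow")  # => false
ShortWordSketch.short_word?("shred")                 # => true
ShortWordSketch.short_word?("embed")                 # => false
```

‘embed’ ends with a short syllable, but its R1 is ‘bed’, so it is not a short word.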
Turn all Y letters into y
    # File lib/porter2.rb, line 289
    def porter2_postprocess
      self.gsub(/Y/, 'y')
    end
Preprocess the word. Remove any initial ’, if present. Then, set initial y, or y after a vowel, to Y
(The comment to ‘establish the regions R1 and R2’ in the original description is an implementation optimisation that identifies where the regions start. As no modifications are made to the word that affect those positions, you may want to cache them now. This implementation doesn’t do that.)
    # File lib/porter2.rb, line 53
    def porter2_preprocess
      w = self.dup

      # remove any initial apostrophe
      w.gsub!(/^'*(.)/, '\1')

      # set initial y, or y after a vowel, to Y
      w.gsub!(/^y/, "Y")
      w.gsub!(/(#{Porter2::V})y/, '\1Y')

      w
    end
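A small sketch of the marking and unmarking of y, written as standalone functions rather than String methods. The names mark_consonant_y and unmark_y are hypothetical, and the vowel class [aeiouy] is an assumption standing in for Porter2::V, which is not shown on this page.

```ruby
# Hypothetical standalone versions of the pre- and postprocessing steps.
def mark_consonant_y(word)
  w = word.sub(/\A'*(.)/, '\1')   # drop any leading apostrophes
  w = w.sub(/\Ay/, "Y")           # an initial y acts as a consonant
  w.gsub(/([aeiouy])y/, '\1Y')    # so does a y that follows a vowel
end

def unmark_y(word)
  word.tr("Y", "y")
end

mark_consonant_y("youth")             # => "Youth"
mark_consonant_y("boyish")            # => "boYish"
mark_consonant_y("'tis")              # => "tis"
mark_consonant_y("fly")               # => "fly"
unmark_y(mark_consonant_y("boyish"))  # => "boyish"
```

Marking the consonant uses of y as Y lets the later steps treat a plain y as a vowel without any extra bookkeeping.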
R1 is the portion of the word after the first non-vowel after the first vowel (with words beginning ‘gener-’, ‘commun-’, and ‘arsen-’ treated as special cases).
    # File lib/porter2.rb, line 69
    def porter2_r1
      if self =~ /^(gener|commun|arsen)(?<r1>.*)/
        Regexp.last_match(:r1)
      else
        self =~ /#{Porter2::V}#{Porter2::C}(?<r1>.*)$/
        Regexp.last_match(:r1) || ""
      end
    end
R2 is the portion of R1 (porter2_r1) after the first non-vowel after the first vowel
    # File lib/porter2.rb, line 80
    def porter2_r2
      self.porter2_r1 =~ /#{Porter2::V}#{Porter2::C}(?<r2>.*)$/
      Regexp.last_match(:r2) || ""
    end
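The regions can be illustrated with a standalone sketch. The module name RegionSketch is hypothetical, and the vowel and non-vowel classes are assumptions standing in for Porter2::V and Porter2::C, which are not shown on this page. The sample words are those used in the Snowball description.

```ruby
# Hypothetical sketch of R1 and R2; V and C are assumed character classes.
module RegionSketch
  V = "[aeiouy]"
  C = "[^aeiouy]"

  module_function

  # The part of the text after the first non-vowel that follows a vowel.
  def after_first_vowel_nonvowel(text)
    text[/#{V}#{C}(.*)\z/, 1] || ""
  end

  def r1(word)
    if (m = word.match(/\A(?:gener|commun|arsen)(.*)\z/))
      m[1]                      # special-case prefixes
    else
      after_first_vowel_nonvowel(word)
    end
  end

  def r2(word)
    after_first_vowel_nonvowel(r1(word))
  end
end

RegionSketch.r1("beautiful")   # => "iful"
RegionSketch.r2("beautiful")   # => "ul"
RegionSketch.r1("beauty")      # => "y"
RegionSketch.r2("beauty")      # => ""
RegionSketch.r1("generous")    # => "ous" (special case; the plain rule would give "erous")
```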
Perform the stemming procedure. If gb_english is true, treat ’-ise’ and similar British English suffixes in the same way as the American English ’-ize’ forms.
    # File lib/porter2.rb, line 297
    def porter2_stem(gb_english = false)
      preword = self.porter2_tidy
      return preword if preword.length <= 2

      word = preword.porter2_preprocess

      if Porter2::SPECIAL_CASES.has_key? word
        Porter2::SPECIAL_CASES[word]
      else
        w1a = word.porter2_step0.porter2_step1a
        if Porter2::STEP_1A_SPECIAL_CASES.include? w1a
          w1a
        else
          w1a.porter2_step1b(gb_english).porter2_step1c.porter2_step2(gb_english).porter2_step3(gb_english).porter2_step4(gb_english).porter2_step5.porter2_postprocess
        end
      end
    end
A verbose version of porter2_stem that prints the output of each stage to STDOUT
    # File lib/porter2.rb, line 316
    def porter2_stem_verbose(gb_english = false)
      preword = self.porter2_tidy
      puts "Preword: #{preword}"
      return preword if preword.length <= 2

      word = preword.porter2_preprocess
      puts "Preprocessed: #{word}"

      if Porter2::SPECIAL_CASES.has_key? word
        puts "Returning #{word} as special case #{Porter2::SPECIAL_CASES[word]}"
        Porter2::SPECIAL_CASES[word]
      else
        r1 = word.porter2_r1
        r2 = word.porter2_r2
        puts "R1 = #{r1}, R2 = #{r2}"

        w0 = word.porter2_step0 ; puts "After step 0: #{w0} (R1 = #{w0.porter2_r1}, R2 = #{w0.porter2_r2})"
        w1a = w0.porter2_step1a ; puts "After step 1a: #{w1a} (R1 = #{w1a.porter2_r1}, R2 = #{w1a.porter2_r2})"

        if Porter2::STEP_1A_SPECIAL_CASES.include? w1a
          puts "Returning #{w1a} as 1a special case"
          w1a
        else
          w1b = w1a.porter2_step1b(gb_english) ; puts "After step 1b: #{w1b} (R1 = #{w1b.porter2_r1}, R2 = #{w1b.porter2_r2})"
          w1c = w1b.porter2_step1c ; puts "After step 1c: #{w1c} (R1 = #{w1c.porter2_r1}, R2 = #{w1c.porter2_r2})"
          w2 = w1c.porter2_step2(gb_english) ; puts "After step 2: #{w2} (R1 = #{w2.porter2_r1}, R2 = #{w2.porter2_r2})"
          w3 = w2.porter2_step3(gb_english) ; puts "After step 3: #{w3} (R1 = #{w3.porter2_r1}, R2 = #{w3.porter2_r2})"
          w4 = w3.porter2_step4(gb_english) ; puts "After step 4: #{w4} (R1 = #{w4.porter2_r1}, R2 = #{w4.porter2_r2})"
          w5 = w4.porter2_step5 ; puts "After step 5: #{w5}"
          wpost = w5.porter2_postprocess ; puts "After postprocess: #{wpost}"
          wpost
        end
      end
    end
Search for the longest among the suffixes
'
's
's'
and remove it if found.
    # File lib/porter2.rb, line 103
    def porter2_step0
      # Non-destructive sub: returns a new string and leaves the receiver unmodified
      self.sub(/(.)('s'|'s|')$/, '\1')
    end
Search for the longest among the following suffixes, and perform the action indicated.
sses | replace by ss |
ied, ies | replace by i if preceded by more than one letter, otherwise by ie |
s | delete if the preceding word part contains a vowel not immediately before the s |
us, ss | do nothing |
    # File lib/porter2.rb, line 113
    def porter2_step1a
      if self =~ /sses$/
        self.sub(/sses$/, 'ss')
      elsif self =~ /..(ied|ies)$/
        self.sub(/(ied|ies)$/, 'i')
      elsif self =~ /(ied|ies)$/
        self.sub(/(ied|ies)$/, 'ie')
      elsif self =~ /(us|ss)$/
        self
      elsif self =~ /s$/
        if self =~ /(#{Porter2::V}.+)s$/
          self.sub(/s$/, '')
        else
          self
        end
      else
        self
      end
    end
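The table for step 1a can be sketched as a single case expression. The function name step_1a is hypothetical, and the vowel class [aeiouy] is an assumption standing in for Porter2::V. The sample words are those given in the Snowball description.

```ruby
# Hypothetical standalone sketch of step 1a.
def step_1a(word)
  case word
  when /sses\z/           then word.sub(/sses\z/, "ss")
  when /..(?:ied|ies)\z/  then word.sub(/(?:ied|ies)\z/, "i")   # more than one letter before
  when /(?:ied|ies)\z/    then word.sub(/(?:ied|ies)\z/, "ie")
  when /(?:us|ss)\z/      then word
  when /[aeiouy].+s\z/    then word.chomp("s")  # a vowel, not immediately before the s
  else word
  end
end

step_1a("caresses")  # => "caress"
step_1a("cries")     # => "cri"
step_1a("ties")      # => "tie"
step_1a("gaps")      # => "gap"
step_1a("gas")       # => "gas"
step_1a("this")      # => "this"
```

‘gas’ and ‘this’ keep their s because the only vowel sits immediately before it.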
Search for the longest among the following suffixes, and perform the action indicated.
eed, eedly | replace by ee if the suffix is also in R1 |
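ed, edly, ing, ingly | delete if the preceding word part contains a vowel and, after the deletion: if the word ends at, bl or iz, add e; if the word ends with a double, remove the last letter; if the word is short, add e |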
(If gb_english is true, treat the ‘is’ suffix as ‘iz’ above.)
    # File lib/porter2.rb, line 143
    def porter2_step1b(gb_english = false)
      if self =~ /(eed|eedly)$/
        if self.porter2_r1 =~ /(eed|eedly)$/
          self.sub(/(eed|eedly)$/, 'ee')
        else
          self
        end
      else
        w = self.dup
        if w =~ /#{Porter2::V}.*(ed|edly|ing|ingly)$/
          w.sub!(/(ed|edly|ing|ingly)$/, '')
          # the algorithm description lists the endings at, bl and iz
          if w =~ /(at|bl|iz)$/
            w += 'e'
          elsif w =~ /is$/ and gb_english
            w += 'e'
          elsif w =~ /#{Porter2::Double}$/
            w.chop!
          elsif w.porter2_is_short_word?
            w += 'e'
          end
        end
        w
      end
    end
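A standalone sketch of step 1b, following the rules in the Snowball description. The module name Step1bSketch is hypothetical. The vowel classes, the short-syllable pattern and the list of doubles (bb, dd, ff, gg, mm, nn, pp, rr, tt) are assumptions taken from that description rather than from the Porter2 module, and the R1 helper ignores the ‘gener-’, ‘commun-’ and ‘arsen-’ exceptions.

```ruby
# Hypothetical sketch; constants are assumptions, not those of the Porter2 module.
module Step1bSketch
  V = "[aeiouy]"
  C = "[^aeiouy]"
  SHORT_SYLLABLE = "(?:#{C}#{V}[^aeiouywxY]|^#{V}#{C})"
  DOUBLE = "(?:bb|dd|ff|gg|mm|nn|pp|rr|tt)"

  module_function

  def r1(word)
    word[/#{V}#{C}(.*)\z/, 1] || ""
  end

  def short_word?(word)
    word.match?(/#{SHORT_SYLLABLE}\z/) && r1(word).empty?
  end

  def step_1b(word, gb_english = false)
    if word.match?(/eed(?:ly)?\z/)
      # eed, eedly: replace by ee only when the suffix lies in R1
      return r1(word).match?(/eed(?:ly)?\z/) ? word.sub(/eed(?:ly)?\z/, "ee") : word
    end
    # ed, edly, ing, ingly: only when a vowel precedes the suffix
    return word unless word.match?(/#{V}.*(?:ed|ing)(?:ly)?\z/)

    stem = word.sub(/(?:ed|ing)(?:ly)?\z/, "")
    if stem.match?(/(?:at|bl|iz)\z/) || (gb_english && stem.end_with?("is"))
      stem + "e"
    elsif stem.match?(/#{DOUBLE}\z/)
      stem.chop
    elsif short_word?(stem)
      stem + "e"
    else
      stem
    end
  end
end

Step1bSketch.step_1b("hopping")           # => "hop"
Step1bSketch.step_1b("hoping")            # => "hope"
Step1bSketch.step_1b("luxuriated")        # => "luxuriate"
Step1bSketch.step_1b("organising", true)  # => "organise"
```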
Replace a suffix of y or Y by i if it is preceded by a non-vowel which is not the first letter of the word.
    # File lib/porter2.rb, line 171
    def porter2_step1c
      if self =~ /.+#{Porter2::C}(y|Y)$/
        self.sub(/(y|Y)$/, 'i')
      else
        self
      end
    end
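A short sketch of the rule above, with a hypothetical function name and [^aeiouy] assumed as the non-vowel class in place of Porter2::C:

```ruby
# Hypothetical standalone sketch of step 1c.
def step_1c(word)
  # .+ requires at least one letter before the non-vowel, so the non-vowel
  # cannot be the first letter of the word.
  word.match?(/.+[^aeiouy][yY]\z/) ? word.chop + "i" : word
end

step_1c("cry")   # => "cri"
step_1c("by")    # => "by"
step_1c("saY")   # => "saY" (preceded by a vowel)
```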
Search for the longest among the suffixes listed in the keys of Porter2::STEP_2_MAPS. If one is found and that suffix occurs in R1, replace it with the value found in STEP_2_MAPS.
(Suffixes ‘ogi’ and ‘li’ are treated as special cases in the procedure.)
(If gb_english is true, replace the ‘iser’ and ‘isation’ suffixes with ‘ise’, similarly to how ‘izer’ and ‘ization’ are treated.)
    # File lib/porter2.rb, line 188
    def porter2_step2(gb_english = false)
      r1 = self.porter2_r1
      s2m = Porter2::STEP_2_MAPS.dup
      if gb_english
        s2m["iser"] = "ise"
        s2m["isation"] = "ise"
      end
      step_2_re = Regexp.union(s2m.keys.map {|r| Regexp.new(r + "$")})
      if self =~ step_2_re
        if r1 =~ /#{$&}$/
          self.sub(/#{$&}$/, s2m[$&])
        else
          self
        end
      elsif r1 =~ /li$/ and self =~ /(#{Porter2::Valid_LI})li$/
        self.sub(/li$/, '')
      elsif r1 =~ /ogi$/ and self =~ /logi$/
        self.sub(/ogi$/, 'og')
      else
        self
      end
    end
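The map lookup used in steps 2 to 4 can be sketched as follows. This is not the procedure above: instead of a regexp union it sorts the suffixes longest-first and uses String#end_with?, which does not depend on how the regexp engine orders alternatives. The module name Step2Sketch is hypothetical, SUFFIX_MAP is a small illustrative subset taken from the Snowball description rather than the contents of Porter2::STEP_2_MAPS, and the ‘li’ and ‘ogi’ special cases and the R1 prefix exceptions are left out.

```ruby
# Hypothetical sketch of a longest-suffix map lookup restricted to R1.
module Step2Sketch
  # Illustrative subset only; not the actual Porter2::STEP_2_MAPS.
  SUFFIX_MAP = {
    "ational" => "ate",
    "tional"  => "tion",
    "izer"    => "ize",
    "ization" => "ize",
    "fulness" => "ful"
  }.freeze

  module_function

  def r1(word)
    word[/[aeiouy][^aeiouy](.*)\z/, 1] || ""
  end

  def step_2(word, gb_english = false)
    map = SUFFIX_MAP.dup
    map.merge!("iser" => "ise", "isation" => "ise") if gb_english

    # Pick the longest suffix that ends the word, then require it to lie in R1.
    suffix = map.keys.sort_by { |s| -s.length }.find { |s| word.end_with?(s) }
    return word unless suffix && r1(word).end_with?(suffix)

    word[0, word.length - suffix.length] + map[suffix]
  end
end

Step2Sketch.step_2("relational")       # => "relate"
Step2Sketch.step_2("conditional")      # => "condition"
Step2Sketch.step_2("rational")         # => "rational"
Step2Sketch.step_2("organiser", true)  # => "organise"
```

‘rational’ is left alone: its longest matching suffix is ‘ational’, which does not fit inside R1 (‘ional’), and the shorter ‘tional’ is not tried once a longer suffix has been found.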
Search for the longest among the suffixes listed in the keys of Porter2::STEP_3_MAPS. If one is found and that suffix occurs in R1, replace it with the value found in STEP_3_MAPS.
(Suffix ‘ative’ is treated as a special case in the procedure.)
(If gb_english is true, replace the ‘alise’ suffix with ‘al’, similarly to how ‘alize’ is treated.)
    # File lib/porter2.rb, line 220
    def porter2_step3(gb_english = false)
      if self =~ /ative$/ and self.porter2_r2 =~ /ative$/
        self.sub(/ative$/, '')
      else
        s3m = Porter2::STEP_3_MAPS.dup
        if gb_english
          s3m["alise"] = "al"
        end
        step_3_re = Regexp.union(s3m.keys.map {|r| Regexp.new(r + "$")})
        r1 = self.porter2_r1
        if self =~ step_3_re and r1 =~ /#{$&}$/
          self.sub(/#{$&}$/, s3m[$&])
        else
          self
        end
      end
    end
Search for the longest among the suffixes listed in the keys of Porter2::STEP_4_MAPS. If one is found and that suffix occurs in R2, replace it with the value found in STEP_4_MAPS.
(Suffix ‘ion’ is treated as a special case in the procedure.)
(If gb_english is true, delete the ‘ise’ suffix if found.)
    # File lib/porter2.rb, line 246
    def porter2_step4(gb_english = false)
      if self.porter2_r2 =~ /ion$/ and self =~ /(s|t)ion$/
        self.sub(/ion$/, '')
      else
        s4m = Porter2::STEP_4_MAPS.dup
        if gb_english
          s4m["ise"] = ""
        end
        step_4_re = Regexp.union(s4m.keys.map {|r| Regexp.new(r + "$")})
        r2 = self.porter2_r2
        if self =~ step_4_re
          if r2 =~ /#{$&}/
            self.sub(/#{$&}$/, s4m[$&])
          else
            self
          end
        else
          self
        end
      end
    end
Search for the following suffixes, and, if found, perform the action indicated.
e | delete if in R2, or in R1 and not preceded by a short syllable |
l | delete if in R2 and preceded by l |
    # File lib/porter2.rb, line 272
    def porter2_step5
      if self =~ /ll$/ and self.porter2_r2 =~ /l$/
        self.sub(/ll$/, 'l')
      elsif self =~ /e$/ and self.porter2_r2 =~ /e$/
        self.sub(/e$/, '')
      else
        r1 = self.porter2_r1
        if self =~ /e$/ and r1 =~ /e$/ and not self =~ /#{Porter2::SHORT_SYLLABLE}e$/
          self.sub(/e$/, '')
        else
          self
        end
      end
    end
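A standalone sketch of step 5. The module name Step5Sketch is hypothetical, the character classes and short-syllable pattern are assumptions based on the Snowball description, and the region helper ignores the ‘gener-’, ‘commun-’ and ‘arsen-’ exceptions.

```ruby
# Hypothetical sketch; constants are assumptions, not those of the Porter2 module.
module Step5Sketch
  V = "[aeiouy]"
  C = "[^aeiouy]"
  SHORT_SYLLABLE = "(?:#{C}#{V}[^aeiouywxY]|^#{V}#{C})"

  module_function

  # Applied to a word this gives R1; applied to R1 it gives R2.
  def region(text)
    text[/#{V}#{C}(.*)\z/, 1] || ""
  end

  def step_5(word)
    r1 = region(word)
    r2 = region(r1)
    if word.end_with?("ll") && r2.end_with?("l")
      word.chop
    elsif word.end_with?("e") &&
          (r2.end_with?("e") ||
           (r1.end_with?("e") && !word.match?(/#{SHORT_SYLLABLE}e\z/)))
      word.chop
    else
      word
    end
  end
end

Step5Sketch.step_5("parade")    # => "parad"   (e is in R2)
Step5Sketch.step_5("rate")      # => "rate"    (e is in R1 after a short syllable)
Step5Sketch.step_5("controll")  # => "control" (l is in R2 and preceded by l)
```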
Tidy up the word before we get down to the algorithm
    # File lib/porter2.rb, line 35
    def porter2_tidy
      preword = self.to_s.strip.downcase

      # map apostrophe-like characters to apostrophes
      preword.gsub!(/‘/, "'")
      preword.gsub!(/’/, "'")

      preword
    end
Generated with the Darkfish Rdoc Generator 1.1.6.