Porter stemmer in Ruby.
This is the Porter 2 stemming algorithm, as described at snowball.tartarus.org/algorithms/english/stemmer.html. The original paper is:
Porter, 1980, An algorithm for suffix stripping, Program, Vol. 14, no. 3, pp 130-137
A non-vowel
A vowel
A non-vowel other than w, x, or Y
Doubled letters created when a suffix is added: these are undoubled when stemming
A valid letter that can come before ‘li’
A specification for a short syllable
Suffix transformations used in Step 2. (ogi, li endings dealt with in procedure)
Suffix transformations used in Step 3. (ative ending dealt with in procedure)
Suffix transformations used in Step 4.
Special-case stemmings
Special case words to ignore after step 1a.
A short syllable in a word is either
(a) a vowel followed by a non-vowel other than w, x or Y and preceded by a non-vowel, or
(b) a vowel at the beginning of the word followed by a non-vowel.
    # File lib/porter2.rb, line 155
    def porter2_ends_with_short_syllable?
      self =~ /#{SHORT_SYLLABLE}$/ ? true : false
    end
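The short-syllable test can be tried in isolation. The following is a minimal sketch, not the library's code: the character classes and the names `VOWEL`, `NON_VOWEL`, `NON_VOWEL_WXY`, `SHORT_SYLLABLE_AT_END` and `ends_with_short_syllable?` are assumptions made for illustration, since the values of the constants are not shown on this page.

```ruby
# Assumed character classes (the library's own constants are not shown above).
VOWEL         = "[aeiouy]"
NON_VOWEL     = "[^aeiouy]"
NON_VOWEL_WXY = "[^aeiouywxY]" # a non-vowel other than w, x or Y

# (a) non-vowel, vowel, non-vowel other than w/x/Y at the end of the word, or
# (b) the whole word is a vowel followed by a non-vowel.
SHORT_SYLLABLE_AT_END =
  /(?:#{NON_VOWEL}#{VOWEL}#{NON_VOWEL_WXY}|\A#{VOWEL}#{NON_VOWEL})\z/

# Hypothetical helper: true if the word ends in a short syllable.
def ends_with_short_syllable?(word)
  word.match?(SHORT_SYLLABLE_AT_END)
end

puts ends_with_short_syllable?("entrap")
puts ends_with_short_syllable?("bestow")
```

Under these assumptions "rap", "trap", "entrap", "ow", "on" and "at" end in a short syllable, while "uproot", "bestow" and "disturb" do not.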
A word is short if it ends in a short syllable and R1 is empty
    # File lib/porter2.rb, line 160
    def porter2_is_short_word?
      self.porter2_ends_with_short_syllable? and self.porter2_r1.empty?
    end
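As a rough illustration of the short-word rule, the sketch below combines a short-syllable test with an R1 computation. It is a standalone approximation: the character classes and the helper names `r1_of` and `short_word?` are hypothetical, and the gener/commun/arsen exceptions to R1 are left out for brevity.

```ruby
# Assumed character classes (not taken from the library).
VOWEL     = "[aeiouy]"
NON_VOWEL = "[^aeiouy]"
SHORT_SYLLABLE_AT_END =
  /(?:#{NON_VOWEL}#{VOWEL}[^aeiouywxY]|\A#{VOWEL}#{NON_VOWEL})\z/

# R1: everything after the first vowel that is directly followed by a
# non-vowel; empty when there is no such pair. Prefix exceptions omitted.
def r1_of(word)
  m = word.match(/#{VOWEL}#{NON_VOWEL}(.*)\z/)
  m ? m[1] : ""
end

# A word is short if it ends in a short syllable and R1 is empty.
def short_word?(word)
  word.match?(SHORT_SYLLABLE_AT_END) && r1_of(word).empty?
end

puts short_word?("shred")
puts short_word?("embed")
```

With these definitions "bed", "shed" and "shred" are short words, and "bead", "embed" and "beds" are not.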
    # File lib/porter2.rb, line 311
    def porter2_postprocess
      self.gsub(/Y/, 'y')
    end
    # File lib/porter2.rb, line 122
    def porter2_preprocess
      w = self.dup

      # remove any initial apostrophe
      w.gsub!(/^'*(.)/, '\1')

      # set initial y, or y after a vowel, to Y
      w.gsub!(/^y/, "Y")
      w.gsub!(/(#{V})y/, '\1Y')

      w
    end
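The marking of consonant-like y as Y, and its later reversal, can be sketched as a pair of plain functions. The names `mark_consonant_y` and `unmark_consonant_y` are hypothetical, and the vowel class `[aeiouy]` is an assumption about the constant `V`, which is not shown on this page.

```ruby
# Hypothetical preprocessing helper: returns a new string, input untouched.
def mark_consonant_y(word)
  w = word.sub(/\A'+/, "")       # drop any leading apostrophes
  w = w.sub(/\Ay/, "Y")          # an initial y behaves as a consonant
  w.gsub(/([aeiouy])y/, '\1Y')   # so does a y that follows a vowel
end

# Hypothetical postprocessing helper: turn the Y markers back into y.
def unmark_consonant_y(word)
  word.tr("Y", "y")
end

%w[youth boy boyish fly].each do |w|
  puts "#{w} -> #{mark_consonant_y(w)}"
end
```

The y in "fly" follows a non-vowel, so it is left as a vowel, whereas the y in "boy" and "youth" is marked.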
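R1 is the part of the word after the first non-vowel that follows a vowel; it is empty if there is no such non-vowel. For words beginning gener, commun or arsen, R1 is whatever follows that prefix.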
    # File lib/porter2.rb, line 136
    def porter2_r1
      if self =~ /^(gener|commun|arsen)(?<r1>.*)/
        Regexp.last_match(:r1)
      else
        self =~ /#{V}#{C}(?<r1>.*)$/
        Regexp.last_match(:r1) || ""
      end
    end
R2 is the part of R1 after the first non-vowel that follows a vowel within R1; it is empty if there is no such non-vowel.
    # File lib/porter2.rb, line 146
    def porter2_r2
      self.porter2_r1 =~ /#{V}#{C}(?<r2>.*)$/
      Regexp.last_match(:r2) || ""
    end
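To see what R1 and R2 look like for concrete words, here is a small standalone sketch. The helper names `region_after_first_vc`, `r1_of` and `r2_of` are hypothetical, and the vowel class `[aeiouy]` is an assumption about the constants `V` and `C`.

```ruby
# Text after the first vowel that is directly followed by a non-vowel,
# or "" when there is no such pair.
def region_after_first_vc(text)
  m = text.match(/[aeiouy][^aeiouy](.*)\z/)
  m ? m[1] : ""
end

# R1, including the gener/commun/arsen prefix exceptions.
def r1_of(word)
  exception = word.match(/\A(?:gener|commun|arsen)(.*)\z/)
  exception ? exception[1] : region_after_first_vc(word)
end

# R2 applies the same rule to R1.
def r2_of(word)
  region_after_first_vc(r1_of(word))
end

%w[beautiful beauty animadversion generate].each do |w|
  puts "#{w}: R1=#{r1_of(w).inspect} R2=#{r2_of(w).inspect}"
end
```

For "beautiful" this gives R1 = "iful" and R2 = "ul"; for "beauty", R1 = "y" and R2 is empty.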
    # File lib/porter2.rb, line 316
    def porter2_stem(gb_english = false)
      preword = self.porter2_tidy
      return preword if preword.length <= 2

      word = preword.porter2_preprocess

      if SPECIAL_CASES.has_key? word
        SPECIAL_CASES[word]
      else
        w1a = word.step_0.step_1a
        if STEP_1A_SPECIAL_CASES.include? w1a
          w1a
        else
          w1a.step_1b(gb_english).step_1c.step_2(gb_english).step_3(gb_english).step_4(gb_english).step_5.porter2_postprocess
        end
      end
    end
    # File lib/porter2.rb, line 334
    def porter2_stem_verbose(gb_english = false)
      preword = self.porter2_tidy
      puts "Preword: #{preword}"
      return preword if preword.length <= 2

      word = preword.porter2_preprocess
      puts "Preprocessed: #{word}"

      if SPECIAL_CASES.has_key? word
        puts "Returning #{word} as special case #{SPECIAL_CASES[word]}"
        SPECIAL_CASES[word]
      else
        r1 = word.porter2_r1
        r2 = word.porter2_r2
        puts "R1 = #{r1}, R2 = #{r2}"

        w0 = word.step_0 ; puts "After step 0:  #{w0} (R1 = #{w0.porter2_r1}, R2 = #{w0.porter2_r2})"
        w1a = w0.step_1a ; puts "After step 1a: #{w1a} (R1 = #{w1a.porter2_r1}, R2 = #{w1a.porter2_r2})"

        if STEP_1A_SPECIAL_CASES.include? w1a
          puts "Returning #{w1a} as 1a special case"
          w1a
        else
          w1b = w1a.step_1b(gb_english) ; puts "After step 1b: #{w1b} (R1 = #{w1b.porter2_r1}, R2 = #{w1b.porter2_r2})"
          w1c = w1b.step_1c ; puts "After step 1c: #{w1c} (R1 = #{w1c.porter2_r1}, R2 = #{w1c.porter2_r2})"
          w2 = w1c.step_2(gb_english) ; puts "After step 2:  #{w2} (R1 = #{w2.porter2_r1}, R2 = #{w2.porter2_r2})"
          w3 = w2.step_3(gb_english) ; puts "After step 3:  #{w3} (R1 = #{w3.porter2_r1}, R2 = #{w3.porter2_r2})"
          w4 = w3.step_4(gb_english) ; puts "After step 4:  #{w4} (R1 = #{w4.porter2_r1}, R2 = #{w4.porter2_r2})"
          w5 = w4.step_5 ; puts "After step 5:  #{w5}"
          wpost = w5.porter2_postprocess ; puts "After postprocess: #{wpost}"
          wpost
        end
      end
    end
Tidy up the word before we get down to the algorithm
    # File lib/porter2.rb, line 112
    def porter2_tidy
      preword = self.to_s.strip.downcase

      # map apostrophe-like characters to apostrophes
      preword.gsub!(/‘/, "'")
      preword.gsub!(/’/, "'")

      preword
    end
Search for the longest among the suffixes,
’
’s
’s’
and remove if found.
    # File lib/porter2.rb, line 169
    def step_0
      # non-destructive sub: returns a copy and leaves the receiver unchanged
      self.sub(/(.)('s'|'s|')$/, '\1')
    end
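The longest-suffix behaviour of step 0 can be checked with a small standalone function. This is a sketch under the assumption that apostrophes have already been normalised to `'`; the name `strip_possessive` is hypothetical.

```ruby
# Remove a trailing 's', 's or ' (longest match wins), provided at least
# one character precedes it. Returns a new string.
def strip_possessive(word)
  word.sub(/(.)(?:'s'|'s|')\z/, '\1')
end

["dog's", "dogs'", "dog's'", "dog"].each do |w|
  puts "#{w} -> #{strip_possessive(w)}"
end
```

Because the regular expression engine takes the leftmost match that reaches the end of the word, the longest of the three suffixes is the one removed.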
Remove plural suffixes
    # File lib/porter2.rb, line 174
    def step_1a
      if self =~ /sses$/
        self.sub(/sses$/, 'ss')
      elsif self =~ /..(ied|ies)$/
        self.sub(/(ied|ies)$/, 'i')
      elsif self =~ /(ied|ies)$/
        self.sub(/(ied|ies)$/, 'ie')
      elsif self =~ /(us|ss)$/
        self
      elsif self =~ /s$/
        if self =~ /(#{V}.+)s$/
          self.sub(/s$/, '')
        else
          self
        end
      else
        self
      end
    end
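The plural rules are easier to follow with a few worked inputs. The sketch below restates them as a standalone function; the name `strip_plural` is hypothetical and the vowel class `[aeiouy]` is an assumption about the constant `V`.

```ruby
# Hypothetical standalone version of the plural-stripping rules.
def strip_plural(word)
  case word
  when /sses\z/          then word.sub(/sses\z/, "ss")
  when /..(?:ied|ies)\z/ then word.sub(/(?:ied|ies)\z/, "i")  # more than one letter before
  when /(?:ied|ies)\z/   then word.sub(/(?:ied|ies)\z/, "ie") # at most one letter before
  when /(?:us|ss)\z/     then word
  when /[aeiouy].+s\z/   then word.chomp("s") # a vowel earlier, not directly before the s
  else word
  end
end

%w[caresses ties cries gas this gaps kiwis].each do |w|
  puts "#{w} -> #{strip_plural(w)}"
end
```

Note that "gas" and "this" keep their final s because the only vowel sits immediately before it, while "gaps" and "kiwis" lose it.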
    # File lib/porter2.rb, line 194
    def step_1b(gb_english = false)
      if self =~ /(eed|eedly)$/
        if self.porter2_r1 =~ /(eed|eedly)$/
          self.sub(/(eed|eedly)$/, 'ee')
        else
          self
        end
      else
        w = self.dup
        if w =~ /#{V}.*(ed|edly|ing|ingly)$/
          w.sub!(/(ed|edly|ing|ingly)$/, '')
          # restore a final e after at, bl or iz (luxuriat -> luxuriate)
          if w =~ /(at|bl|iz)$/
            w += 'e'
          elsif w =~ /is$/ and gb_english
            w += 'e'
          elsif w =~ /#{Double}$/
            w.chop!
          elsif w.porter2_is_short_word?
            w += 'e'
          end
        end
        w
      end
    end
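The follow-up adjustments after removing ed/ing are the subtle part of this step. The sketch below is a standalone approximation without the `gb_english` option; the character classes, the list of doubled letters (taken from the Snowball description, since `Double` is not shown here) and the names `r1_of` and `strip_ed_ing` are assumptions for illustration.

```ruby
VOWEL     = "[aeiouy]"
NON_VOWEL = "[^aeiouy]"
SHORT_SYLLABLE_AT_END =
  /(?:#{NON_VOWEL}#{VOWEL}[^aeiouywxY]|\A#{VOWEL}#{NON_VOWEL})\z/
DOUBLED_AT_END = /(?:bb|dd|ff|gg|mm|nn|pp|rr|tt)\z/ # assumed list

# R1 without the gener/commun/arsen exceptions.
def r1_of(word)
  m = word.match(/#{VOWEL}#{NON_VOWEL}(.*)\z/)
  m ? m[1] : ""
end

def strip_ed_ing(word)
  if word.match?(/eed(?:ly)?\z/)
    # eed/eedly become ee only when the suffix lies inside R1
    return r1_of(word).match?(/eed(?:ly)?\z/) ? word.sub(/eed(?:ly)?\z/, "ee") : word
  end
  # the part before the suffix must contain a vowel
  return word unless word.match?(/#{VOWEL}.*(?:ed|edly|ing|ingly)\z/)

  stem = word.sub(/(?:ed|edly|ing|ingly)\z/, "")
  if stem.match?(/(?:at|bl|iz)\z/)
    stem + "e"                       # luxuriat -> luxuriate
  elsif stem.match?(DOUBLED_AT_END)
    stem.chop                        # hopp -> hop
  elsif stem.match?(SHORT_SYLLABLE_AT_END) && r1_of(stem).empty?
    stem + "e"                       # hop -> hope (short word)
  else
    stem
  end
end

%w[hopping hoping luxuriated agreed sing].each do |w|
  puts "#{w} -> #{strip_ed_ing(w)}"
end
```

The pair "hopping" and "hoping" shows why both the undoubling rule and the short-word rule are needed: the first yields "hop", the second "hope".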
    # File lib/porter2.rb, line 220
    def step_1c
      if self =~ /.+#{C}(y|Y)$/
        self.sub(/(y|Y)$/, 'i')
      else
        self
      end
    end
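A short standalone sketch of the y-to-i rule follows. The name `y_to_i` is hypothetical, and `[^aeiouy]` is an assumption about the non-vowel class `C`; the input is expected to have been preprocessed so that a y after a vowel is already marked as Y.

```ruby
# Replace a final y or Y by i when it follows a non-vowel that is not the
# first letter of the word.
def y_to_i(word)
  word.match?(/.[^aeiouy][yY]\z/) ? word.sub(/[yY]\z/, "i") : word
end

%w[cry happy by saY].each do |w|
  puts "#{w} -> #{y_to_i(w)}"
end
```

Here "by" is unchanged because the non-vowel is the first letter, and "saY" is unchanged because the letter before the Y is a vowel.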
    # File lib/porter2.rb, line 229
    def step_2(gb_english = false)
      r1 = self.porter2_r1
      s2m = STEP_2_MAPS.dup
      if gb_english
        s2m["iser"] = "ise"
        s2m["isation"] = "ise"
      end
      step_2_re = Regexp.union(s2m.keys.map {|r| Regexp.new(r + "$")})
      if self =~ step_2_re
        if r1 =~ /#{$&}$/
          self.sub(/#{$&}$/, s2m[$&])
        else
          self
        end
      elsif r1 =~ /li$/ and self =~ /(#{Valid_LI})li$/
        self.sub(/li$/, '')
      elsif r1 =~ /ogi$/ and self =~ /logi$/
        self.sub(/ogi$/, 'og')
      else
        self
      end
    end
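The table-driven replacement used here can be illustrated with a handful of entries. In the sketch below, `STEP_2_SAMPLE` is a hypothetical four-entry subset chosen for illustration (the full contents of `STEP_2_MAPS` are not shown on this page), and it picks the longest suffix by sorting keys rather than by building a regular expression union, which is a different technique from the method above. The helper names are also assumptions.

```ruby
# Hypothetical sample of suffix replacements; not the library's full table.
STEP_2_SAMPLE = {
  "ational" => "ate",
  "tional"  => "tion",
  "biliti"  => "ble",
  "fulness" => "ful"
}.freeze

# R1 without the gener/commun/arsen exceptions.
def r1_of(word)
  m = word.match(/[aeiouy][^aeiouy](.*)\z/)
  m ? m[1] : ""
end

# Find the longest listed suffix of the word; replace it only if it lies
# entirely inside R1, otherwise leave the word alone.
def apply_r1_suffix_map(word, map)
  suffix = map.keys.sort_by { |s| -s.length }.find { |s| word.end_with?(s) }
  return word unless suffix && r1_of(word).end_with?(suffix)

  word[0...-suffix.length] + map[suffix]
end

%w[relational rational conditional sensibiliti].each do |w|
  puts "#{w} -> #{apply_r1_suffix_map(w, STEP_2_SAMPLE)}"
end
```

"rational" is left alone because its longest matching suffix, "ational", does not fit inside R1 ("ional"); the shorter "tional" is not tried as a fallback.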
    # File lib/porter2.rb, line 253
    def step_3(gb_english = false)
      if self =~ /ative$/ and self.porter2_r2 =~ /ative$/
        self.sub(/ative$/, '')
      else
        s3m = STEP_3_MAPS.dup
        if gb_english
          s3m["alise"] = "al"
        end
        step_3_re = Regexp.union(s3m.keys.map {|r| Regexp.new(r + "$")})
        r1 = self.porter2_r1
        if self =~ step_3_re and r1 =~ /#{$&}$/
          self.sub(/#{$&}$/, s3m[$&])
        else
          self
        end
      end
    end
    # File lib/porter2.rb, line 272
    def step_4(gb_english = false)
      if self.porter2_r2 =~ /ion$/ and self =~ /(s|t)ion$/
        self.sub(/ion$/, '')
      else
        s4m = STEP_4_MAPS.dup
        if gb_english
          s4m["ise"] = ""
        end
        step_4_re = Regexp.union(s4m.keys.map {|r| Regexp.new(r + "$")})
        r2 = self.porter2_r2
        if self =~ step_4_re
          # the suffix must sit at the end of R2 (anchored, as in steps 2 and 3)
          if r2 =~ /#{$&}$/
            self.sub(/#{$&}$/, s4m[$&])
          else
            self
          end
        else
          self
        end
      end
    end
    # File lib/porter2.rb, line 295
    def step_5
      if self =~ /ll$/ and self.porter2_r2 =~ /l$/
        self.sub(/ll$/, 'l')
      elsif self =~ /e$/ and self.porter2_r2 =~ /e$/
        self.sub(/e$/, '')
      else
        r1 = self.porter2_r1
        if self =~ /e$/ and r1 =~ /e$/ and not self =~ /#{SHORT_SYLLABLE}e$/
          self.sub(/e$/, '')
        else
          self
        end
      end
    end
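The final step depends on R1, R2 and the short-syllable test together, so a few worked inputs help. The following is a standalone sketch only: the character classes and the names `SHORT_SYLLABLE`, `after_first_vc` and `strip_final_e_or_l` are assumptions, and the gener/commun/arsen exceptions to R1 are omitted.

```ruby
VOWEL     = "[aeiouy]"
NON_VOWEL = "[^aeiouy]"
# Assumed short-syllable pattern, kept as a string so it can be interpolated.
SHORT_SYLLABLE =
  "(?:#{NON_VOWEL}#{VOWEL}[^aeiouywxY]|\\A#{VOWEL}#{NON_VOWEL})"

# Text after the first vowel directly followed by a non-vowel, or "".
def after_first_vc(text)
  m = text.match(/#{VOWEL}#{NON_VOWEL}(.*)\z/)
  m ? m[1] : ""
end

def strip_final_e_or_l(word)
  r1 = after_first_vc(word) # prefix exceptions omitted in this sketch
  r2 = after_first_vc(r1)
  if word.end_with?("ll") && r2.end_with?("l")
    word.chop # l in R2 and preceded by l
  elsif word.end_with?("e") &&
        (r2.end_with?("e") ||
         (r1.end_with?("e") && !word.match?(/#{SHORT_SYLLABLE}e\z/)))
    word.chop # e in R2, or in R1 and not preceded by a short syllable
  else
    word
  end
end

%w[probate rate cease controll roll].each do |w|
  puts "#{w} -> #{strip_final_e_or_l(w)}"
end
```

"rate" keeps its e because "rat" ends in a short syllable, whereas "cease" loses it because "ceas" does not.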