Stemmable

Porter stemmer in Ruby.

This is the Porter 2 stemming algorithm, as described at snowball.tartarus.org/algorithms/english/stemmer.html. The original paper is:

  Porter, 1980, An algorithm for suffix stripping, Program, Vol. 14,
  no. 3, pp 130-137

Constants

C: A non-vowel
V: A vowel
CW: A non-vowel other than w, x, or Y
Double: Doubled consonants created when a suffix is added; these are undoubled when stemming
Valid_LI: A valid letter that can come before ‘li’
SHORT_SYLLABLE: A specification for a short syllable
STEP_2_MAPS: Suffix transformations used in Step 2. (ogi, li endings dealt with in procedure)
STEP_3_MAPS: Suffix transformations used in Step 3. (ative ending dealt with in procedure)
STEP_4_MAPS: Suffix transformations used in Step 4.
SPECIAL_CASES: Special-case stemmings
STEP_1A_SPECIAL_CASES: Special case words to ignore after step 1a.
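
The values of these constants are not shown on this page. As a rough sketch only, the character-class constants might be defined as regular-expression fragments along the following lines. The values are assumptions reconstructed from the descriptions above and from the Snowball specification, and the module name Porter2Sketch is hypothetical; the library's actual definitions may differ.

```ruby
# Hypothetical definitions; not copied from lib/porter2.rb.
module Porter2Sketch
  V  = "[aeiouy]"       # a vowel
  C  = "[^aeiouy]"      # a non-vowel (includes the marker Y)
  CW = "[^aeiouywxY]"   # a non-vowel other than w, x, or Y
  Double   = "(bb|dd|ff|gg|mm|nn|pp|rr|tt)"  # doubled consonants
  Valid_LI = "[cdeghkmnrt]"                  # letters allowed before 'li'
  # Either C-V-CW, or a vowel at the start of the word followed by a non-vowel.
  SHORT_SYLLABLE = "((#{C}#{V}#{CW})|(^#{V}#{C}))"
end

# The fragments are interpolated into regular expressions where needed.
puts "trap".match?(/#{Porter2Sketch::SHORT_SYLLABLE}$/)
puts "quickli".match?(/#{Porter2Sketch::Valid_LI}li$/)
```

Storing the fragments as strings rather than Regexp objects keeps them easy to combine by interpolation, which is how the methods below use them.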

Public Instance Methods

porter2_ends_with_short_syllable?()

A short syllable in a word is either

a vowel followed by a non-vowel other than w, x or Y and preceded by a non-vowel, or

a vowel at the beginning of the word followed by a non-vowel.

  # File lib/porter2.rb, line 155
  def porter2_ends_with_short_syllable?
    self =~ /#{SHORT_SYLLABLE}$/ ? true : false
  end
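
The two cases can be tried out in isolation with the sketch below. It is a standalone approximation: the helper name ends_with_short_syllable? and the inlined character classes are assumptions, not the library's SHORT_SYLLABLE constant.

```ruby
# Standalone sketch; the character classes are assumed, not taken from the library.
def ends_with_short_syllable?(word)
  vowel             = "[aeiouy]"
  non_vowel         = "[^aeiouy]"
  non_vowel_not_wxy = "[^aeiouywxY]"
  # Case 1: non-vowel, vowel, non-vowel other than w, x or Y, at the end.
  # Case 2: the whole word is a vowel followed by a non-vowel.
  pattern = /(#{non_vowel}#{vowel}#{non_vowel_not_wxy}|^#{vowel}#{non_vowel})$/
  word.match?(pattern)
end

%w[rap trap ow at uproot bestow disturb].each do |w|
  puts "#{w}: #{ends_with_short_syllable?(w)}"
end
```

The words rap, trap, ow and at end in a short syllable; uproot, bestow and disturb do not.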

porter2_is_short_word?()

A word is short if it ends in a short syllable, and if R1 is null

  # File lib/porter2.rb, line 160
  def porter2_is_short_word?
    self.porter2_ends_with_short_syllable? and self.porter2_r1.empty?
  end
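
As an illustration of the rule, the sketch below combines a simplified R1 with the short-syllable test. The helper names r1_of and short_word? are hypothetical, the character classes are assumed, and the special prefixes handled by porter2_r1 are left out.

```ruby
# Simplified R1: everything after the first non-vowel that follows a vowel.
# (The gener/commun/arsen prefixes are deliberately ignored in this sketch.)
def r1_of(word)
  m = word.match(/[aeiouy][^aeiouy](.*)$/)
  m ? m[1] : ""
end

# A word is short if it ends in a short syllable and its R1 is empty.
def short_word?(word)
  short_syllable = /([^aeiouy][aeiouy][^aeiouywxY]|^[aeiouy][^aeiouy])$/
  word.match?(short_syllable) && r1_of(word).empty?
end

%w[bed shed shred bead embed beds].each do |w|
  puts "#{w}: #{short_word?(w)}"
end
```

Here bed, shed and shred are short words; bead and beds do not end in a short syllable, and embed has a non-empty R1.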

porter2_postprocess()

  # File lib/porter2.rb, line 311
  def porter2_postprocess
    self.gsub(/Y/, 'y')
  end

porter2_preprocess()

  # File lib/porter2.rb, line 122
  def porter2_preprocess
    w = self.dup

    # remove any initial apostrophes
    w.gsub!(/^'*(.)/, '\1')

    # set initial y, or y after a vowel, to Y
    w.gsub!(/^y/, "Y")
    w.gsub!(/(#{V})y/, '\1Y')

    w
  end
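
The effect of marking consonant-like y as Y, and of undoing that mark afterwards, can be seen in this standalone sketch. The helper names mark_consonant_y and unmark_y are hypothetical, and the vowel class [aeiouy] is an assumption about the library's V constant.

```ruby
# Sketch of the preprocessing step: drop leading apostrophes, then mark
# an initial y, or a y that follows a vowel, as Y.
def mark_consonant_y(word)
  w = word.sub(/\A'+/, "")
  w = w.sub(/\Ay/, "Y")
  w.gsub(/([aeiouy])y/, '\1Y')
end

# Sketch of the postprocessing step: turn every marker Y back into y.
def unmark_y(word)
  word.tr("Y", "y")
end

%w[youth boyish cry 'tis].each do |w|
  marked = mark_consonant_y(w)
  puts "#{w} -> #{marked} -> #{unmark_y(marked)}"
end
```

Marking these letters lets the later steps treat Y as a non-vowel without any extra context checks.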

porter2_r1()

R1 is the part of the word after the first non-vowel that follows a vowel, or the empty string if there is no such non-vowel. Words beginning with gener, commun or arsen are special cases: R1 is whatever follows that prefix.

  # File lib/porter2.rb, line 136
  def porter2_r1
    if self =~ /^(gener|commun|arsen)(?<r1>.*)/
      Regexp.last_match(:r1)
    else
      self =~ /#{V}#{C}(?<r1>.*)$/
      Regexp.last_match(:r1) || ""
    end
  end

porter2_r2()

R2 is the part of R1 after the first non-vowel that follows a vowel, or the empty string if there is no such non-vowel.

  # File lib/porter2.rb, line 146
  def porter2_r2
    self.porter2_r1 =~ /#{V}#{C}(?<r2>.*)$/
    Regexp.last_match(:r2) || ""
  end
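
The regions are easiest to understand from examples. The following standalone sketch mirrors the two methods above using plain functions; the helper names are hypothetical and the vowel class [aeiouy] is an assumption about the library's V and C constants.

```ruby
# Text after the first non-vowel that follows a vowel, or "" if there is none.
def region_after_vowel_non_vowel(text)
  m = text.match(/[aeiouy][^aeiouy](.*)\z/m)
  m ? m[1] : ""
end

# R1, including the special prefixes handled by porter2_r1.
def r1_region(word)
  m = word.match(/\A(?:gener|commun|arsen)(.*)\z/m)
  m ? m[1] : region_after_vowel_non_vowel(word)
end

# R2 is the same rule applied to R1.
def r2_region(word)
  region_after_vowel_non_vowel(r1_region(word))
end

%w[beautiful beauty beau animadversion generous].each do |w|
  puts "#{w}: R1=#{r1_region(w).inspect} R2=#{r2_region(w).inspect}"
end
```

For beautiful this gives R1 "iful" and R2 "ul"; for beau both regions are empty.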

porter2_stem(gb_english = false)

  # File lib/porter2.rb, line 316
  def porter2_stem(gb_english = false)
    preword = self.porter2_tidy
    return preword if preword.length <= 2

    word = preword.porter2_preprocess

    if SPECIAL_CASES.has_key? word
      SPECIAL_CASES[word]
    else
      w1a = word.step_0.step_1a
      if STEP_1A_SPECIAL_CASES.include? w1a
        w1a
      else
        w1a.step_1b(gb_english)
           .step_1c
           .step_2(gb_english)
           .step_3(gb_english)
           .step_4(gb_english)
           .step_5
           .porter2_postprocess
      end
    end
  end

Also aliased as: stem

porter2_stem_verbose(gb_english = false)

  # File lib/porter2.rb, line 334
  def porter2_stem_verbose(gb_english = false)
    preword = self.porter2_tidy
    puts "Preword: #{preword}"
    return preword if preword.length <= 2

    word = preword.porter2_preprocess
    puts "Preprocessed: #{word}"

    if SPECIAL_CASES.has_key? word
      puts "Returning #{word} as special case #{SPECIAL_CASES[word]}"
      SPECIAL_CASES[word]
    else
      r1 = word.porter2_r1
      r2 = word.porter2_r2
      puts "R1 = #{r1}, R2 = #{r2}"

      w0 = word.step_0 ; puts "After step 0:  #{w0} (R1 = #{w0.porter2_r1}, R2 = #{w0.porter2_r2})"
      w1a = w0.step_1a ; puts "After step 1a: #{w1a} (R1 = #{w1a.porter2_r1}, R2 = #{w1a.porter2_r2})"

      if STEP_1A_SPECIAL_CASES.include? w1a
        puts "Returning #{w1a} as 1a special case"
        w1a
      else
        w1b = w1a.step_1b(gb_english) ; puts "After step 1b: #{w1b} (R1 = #{w1b.porter2_r1}, R2 = #{w1b.porter2_r2})"
        w1c = w1b.step_1c ; puts "After step 1c: #{w1c} (R1 = #{w1c.porter2_r1}, R2 = #{w1c.porter2_r2})"
        w2 = w1c.step_2(gb_english) ; puts "After step 2:  #{w2} (R1 = #{w2.porter2_r1}, R2 = #{w2.porter2_r2})"
        w3 = w2.step_3(gb_english) ; puts "After step 3:  #{w3} (R1 = #{w3.porter2_r1}, R2 = #{w3.porter2_r2})"
        w4 = w3.step_4(gb_english) ; puts "After step 4:  #{w4} (R1 = #{w4.porter2_r1}, R2 = #{w4.porter2_r2})"
        w5 = w4.step_5 ; puts "After step 5:  #{w5}"
        wpost = w5.porter2_postprocess ; puts "After postprocess: #{wpost}"
        wpost
      end
    end
  end

porter2_tidy()

Tidy up the word before we get down to the algorithm

  # File lib/porter2.rb, line 112
  def porter2_tidy
    preword = self.to_s.strip.downcase

    # map apostrophe-like characters to apostrophes
    preword.gsub!(/‘/, "'")
    preword.gsub!(/’/, "'")

    preword
  end
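
A quick way to see what tidying does is the standalone sketch below. The helper name tidy is hypothetical; it follows the same three operations as the method above (strip, downcase, normalise typographic apostrophes).

```ruby
# Sketch of the tidy step as a plain function.
def tidy(word)
  # \u2018 and \u2019 are the left and right single quotation marks.
  word.to_s.strip.downcase.gsub(/[\u2018\u2019]/, "'")
end

puts tidy("  Dog\u2019s ")
puts tidy(:Running)
```

Because of the to_s call, non-string input such as a symbol is accepted as well.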

stem(gb_english = false)

Alias for: porter2_stem

step_0()

Search for the longest among the suffixes

  '
  's
  's'

and remove it if found. The suffix is only removed when at least one character precedes it.

  # File lib/porter2.rb, line 169
  def step_0
    # sub (rather than sub!) leaves the receiver unmodified
    self.sub(/(.)('s'|'s|')$/, '\1')
  end
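
The suffix removal can be exercised on its own with the sketch below. The helper name strip_possessive is hypothetical; the pattern is the one used by step_0, anchored with \z instead of $.

```ruby
# Sketch of step 0 as a plain function: remove a trailing ', 's or 's'
# as long as at least one character precedes it.
def strip_possessive(word)
  word.sub(/(.)('s'|'s|')\z/, '\1')
end

["dog's", "dogs'", "dog's'", "dog", "'"].each do |w|
  puts "#{w} -> #{strip_possessive(w)}"
end
```

Listing the longest alternative first in the pattern is what makes the longest suffix win when more than one could match at the same position.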

step_1a()

Remove plural suffixes

  # File lib/porter2.rb, line 174
  def step_1a
    if self =~ /sses$/
      self.sub(/sses$/, 'ss')
    elsif self =~ /..(ied|ies)$/
      self.sub(/(ied|ies)$/, 'i')
    elsif self =~ /(ied|ies)$/
      self.sub(/(ied|ies)$/, 'ie')
    elsif self =~ /(us|ss)$/
      self
    elsif self =~ /s$/
      if self =~ /(#{V}.+)s$/
        self.sub(/s$/, '')
      else
        self
      end
    else
      self
    end
  end
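
The order of the tests matters, which is easier to see with a few inputs. The following standalone sketch restates the same rules as a plain function; the helper name strip_plural is hypothetical and the vowel class [aeiouy] is an assumption about the library's V constant.

```ruby
# Sketch of step 1a: the first matching rule wins.
def strip_plural(word)
  case word
  when /sses\z/        then word.sub(/sses\z/, "ss")
  when /..(ied|ies)\z/ then word.sub(/(ied|ies)\z/, "i")   # two or more letters before
  when /(ied|ies)\z/   then word.sub(/(ied|ies)\z/, "ie")  # at most one letter before
  when /(us|ss)\z/     then word
  when /[aeiouy].+s\z/ then word.chomp("s")  # a vowel, then at least one letter, then s
  else word
  end
end

%w[caresses cries ties focus gas gaps kiwis].each do |w|
  puts "#{w} -> #{strip_plural(w)}"
end
```

Note that gas keeps its final s because no letter separates the vowel from it, while gaps and kiwis lose theirs.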

step_1b(gb_english = false)

  # File lib/porter2.rb, line 194
  def step_1b(gb_english = false)
    if self =~ /(eed|eedly)$/
      if self.porter2_r1 =~ /(eed|eedly)$/
        self.sub(/(eed|eedly)$/, 'ee')
      else
        self
      end
    else
      w = self.dup
      if w =~ /#{V}.*(ed|edly|ing|ingly)$/
        w.sub!(/(ed|edly|ing|ingly)$/, '')
        if w =~ /(at|lb|iz)$/
          w += 'e'
        elsif w =~ /is$/ and gb_english
          w += 'e'
        elsif w =~ /#{Double}$/
          w.chop!
        elsif w.porter2_is_short_word?
          w += 'e'
        end
      end
      w
    end
  end

step_1c()

  # File lib/porter2.rb, line 220
  def step_1c
    if self =~ /.+#{C}(y|Y)$/
      self.sub(/(y|Y)$/, 'i')
    else
      self
    end
  end

step_2(gb_english = false)

  # File lib/porter2.rb, line 229
  def step_2(gb_english = false)
    r1 = self.porter2_r1
    s2m = STEP_2_MAPS.dup
    if gb_english
      s2m["iser"] = "ise"
      s2m["isation"] = "ise"
    end
    step_2_re = Regexp.union(s2m.keys.map {|r| Regexp.new(r + "$")})
    if self =~ step_2_re
      if r1 =~ /#{$&}$/
        self.sub(/#{$&}$/, s2m[$&])
      else
        self
      end
    elsif r1 =~ /li$/ and self =~ /(#{Valid_LI})li$/
      self.sub(/li$/, '')
    elsif r1 =~ /ogi$/ and self =~ /logi$/
      self.sub(/ogi$/, 'og')
    else
      self
    end
  end

step_3(gb_english = false)

  # File lib/porter2.rb, line 253
  def step_3(gb_english = false)
    if self =~ /ative$/ and self.porter2_r2 =~ /ative$/
      self.sub(/ative$/, '')
    else
      s3m = STEP_3_MAPS.dup
      if gb_english
        s3m["alise"] = "al"
      end
      step_3_re = Regexp.union(s3m.keys.map {|r| Regexp.new(r + "$")})
      r1 = self.porter2_r1
      if self =~ step_3_re and r1 =~ /#{$&}$/
        self.sub(/#{$&}$/, s3m[$&])
      else
        self
      end
    end
  end

step_4(gb_english = false)

  # File lib/porter2.rb, line 272
  def step_4(gb_english = false)
    if self.porter2_r2 =~ /ion$/ and self =~ /(s|t)ion$/
      self.sub(/ion$/, '')
    else
      s4m = STEP_4_MAPS.dup
      if gb_english
        s4m["ise"] = ""
      end
      step_4_re = Regexp.union(s4m.keys.map {|r| Regexp.new(r + "$")})
      r2 = self.porter2_r2
      if self =~ step_4_re
        # the suffix must lie within R2 (anchored, as in steps 2 and 3)
        if r2 =~ /#{$&}$/
          self.sub(/#{$&}$/, s4m[$&])
        else
          self
        end
      else
        self
      end
    end
  end

step_5()

  # File lib/porter2.rb, line 295
  def step_5
    if self =~ /ll$/ and self.porter2_r2 =~ /l$/
      self.sub(/ll$/, 'l')
    elsif self =~ /e$/ and self.porter2_r2 =~ /e$/
      self.sub(/e$/, '')
    else
      r1 = self.porter2_r1
      if self =~ /e$/ and r1 =~ /e$/ and not self =~ /#{SHORT_SYLLABLE}e$/
        self.sub(/e$/, '')
      else
        self
      end
    end
  end
