String

Implementation of the Porter 2 stemmer. String#porter2_stem is the main stemming procedure.

Public Instance Methods

porter2_ends_with_short_syllable?()

Returns true if the word ends with a short syllable

    # File lib/porter2stemmer/implementation.rb, line 59
    def porter2_ends_with_short_syllable?
      self =~ /#{Porter2::SHORT_SYLLABLE}$/ ? true : false
    end

porter2_is_short_word?()

A word is short if it ends in a short syllable and R1 is empty.

    # File lib/porter2stemmer/implementation.rb, line 65
    def porter2_is_short_word?
      self.porter2_ends_with_short_syllable? && self.porter2_r1.empty?
    end
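
The two predicates above can be tried outside the gem with a minimal sketch. The constants below are assumptions taken from the published Porter2 description (a short syllable is a non-vowel, a vowel, then a non-vowel other than w, x or Y, or a word-initial vowel followed by a non-vowel); the gem's own Porter2::SHORT_SYLLABLE is not shown on this page and may be written differently. The helper names are hypothetical, and the R1 helper leaves out the gener/commun/arsen exceptions.

```ruby
# Assumed character classes (not copied from the gem).
VOWEL = "[aeiouy]"
NON_VOWEL = "[^aeiouy]"
# (a) non-vowel, vowel, non-vowel other than w, x or Y; or
# (b) a vowel at the start of the word followed by a non-vowel.
SHORT_SYLLABLE = "(?:#{NON_VOWEL}#{VOWEL}[^aeiouywxY]|^#{VOWEL}#{NON_VOWEL})"

def ends_with_short_syllable?(word)
  !!(word =~ /#{SHORT_SYLLABLE}$/)
end

# Simplified R1: everything after the first non-vowel that follows a vowel.
def r1_of(word)
  m = word.match(/#{VOWEL}#{NON_VOWEL}(.*)$/)
  m ? m[1] : ""
end

def short_word?(word)
  ends_with_short_syllable?(word) && r1_of(word).empty?
end

%w[bed shed shred bead embed beds].each do |w|
  puts "#{w}: #{short_word?(w)}"
end
```

Under these assumptions bed, shed and shred count as short words, while bead, embed and beds do not.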

porter2_postprocess()

Turn all Y letters into y

    # File lib/porter2stemmer/implementation.rb, line 261
    def porter2_postprocess
      self.gsub(/Y/, 'y')
    end

porter2_preprocess()

Preprocess the word. Remove any initial ’, if present. Then set initial y, or y after a vowel, to Y.

(The instruction to ‘establish the regions R1 and R2’ in the original description is an implementation optimisation that records where the regions start. Because no modifications made to the word affect those positions, an implementation could cache them at this point. This implementation doesn’t do that.)

    # File lib/porter2stemmer/implementation.rb, line 25
    def porter2_preprocess
      w = self.dup

      # remove any initial apostrophe
      w.gsub!(/^'*(.)/, '\1')

      # set initial y, or y after a vowel, to Y
      w.gsub!(/^y/, "Y")
      w.gsub!(/(#{Porter2::V})y/, '\1Y')

      w
    end
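
As a rough illustration of the Y marking, the sketch below reuses the same regular expressions in two standalone functions. The function names and the VOWEL constant are assumptions for this example (the gem interpolates Porter2::V, which is not shown on this page).

```ruby
VOWEL = "[aeiouy]" # assumed vowel class

def preprocess(word)
  w = word.gsub(/^'*(.)/, '\1')  # drop leading apostrophes
  w = w.gsub(/^y/, "Y")          # an initial y acts as a consonant
  w.gsub(/(#{VOWEL})y/, '\1Y')   # so does a y that follows a vowel
end

def postprocess(word)
  word.gsub(/Y/, "y")            # undo the marking once stemming is done
end

%w[youth boy boyish fly syzygy].each do |w|
  puts "#{w} -> #{preprocess(w)}"
end
```

Here youth, boy and boyish become Youth, boY and boYish, while fly and syzygy are left alone because none of their y letters is word-initial or preceded by a vowel.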

porter2_r1()

R1 is the portion of the word after the first non-vowel after the first vowel (with words beginning ‘gener-’, ‘commun-’, and ‘arsen-’ treated as special cases).

    # File lib/porter2stemmer/implementation.rb, line 41
    def porter2_r1
      if self =~ /^(gener|commun|arsen)(?<r1>.*)/
        Regexp.last_match(:r1)
      else
        self =~ /#{Porter2::V}#{Porter2::C}(?<r1>.*)$/
        Regexp.last_match(:r1) || ""
      end
    end

porter2_r2()

R2 is the portion of R1 (porter2_r1) after the first non-vowel after the first vowel

    # File lib/porter2stemmer/implementation.rb, line 52
    def porter2_r2
      self.porter2_r1 =~ /#{Porter2::V}#{Porter2::C}(?<r2>.*)$/
      Regexp.last_match(:r2) || ""
    end
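
The R1 and R2 definitions are easier to follow with a few worked words. This is a standalone sketch, not the gem's code: the helper names are hypothetical, and the vowel and non-vowel classes are assumed to be [aeiouy] and its complement.

```ruby
VOWEL = "[aeiouy]"      # assumed
NON_VOWEL = "[^aeiouy]" # assumed

# Text after the first non-vowel that follows a vowel, or "" if there is none.
def region_after_first_vc(text)
  m = text.match(/#{VOWEL}#{NON_VOWEL}(.*)$/)
  m ? m[1] : ""
end

def r1_of(word)
  special = word.match(/^(?:gener|commun|arsen)(.*)/)
  special ? special[1] : region_after_first_vc(word)
end

# R2 applies the same rule again, this time inside R1.
def r2_of(word)
  region_after_first_vc(r1_of(word))
end

%w[beautiful beauty beau animadversion sprinkled eucharist generous].each do |w|
  puts "#{w}: R1=#{r1_of(w).inspect} R2=#{r2_of(w).inspect}"
end
```

For example, beautiful has R1 "iful" and R2 "ul", while generous takes the special-case branch and has R1 "ous" with an empty R2.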

porter2_stem(gb_english = false)

Perform the stemming procedure. If gb_english is true, treat British English ‘-ise’ and similar suffixes the same way as their American English ‘-ize’ equivalents.

    # File lib/porter2stemmer/implementation.rb, line 269
    def porter2_stem(gb_english = false)
      preword = self.porter2_tidy
      return preword if preword.length <= 2

      word = preword.porter2_preprocess

      if Porter2::SPECIAL_CASES.has_key? word
        Porter2::SPECIAL_CASES[word]
      else
        w1a = word.porter2_step0.porter2_step1a
        if Porter2::STEP_1A_SPECIAL_CASES.include? w1a
          w1a
        else
          # run the remaining steps in order, then undo the Y marking
          w1a.porter2_step1b(gb_english)
             .porter2_step1c
             .porter2_step2(gb_english)
             .porter2_step3(gb_english)
             .porter2_step4(gb_english)
             .porter2_step5
             .porter2_postprocess
        end
      end
    end
Also aliased as: stem

porter2_stem_verbose(gb_english = false)

A verbose version of porter2_stem that prints the output of each stage to STDOUT

    # File lib/porter2stemmer/implementation.rb, line 288
    def porter2_stem_verbose(gb_english = false)
      preword = self.porter2_tidy
      puts "Preword: #{preword}"
      return preword if preword.length <= 2

      word = preword.porter2_preprocess
      puts "Preprocessed: #{word}"

      if Porter2::SPECIAL_CASES.has_key? word
        puts "Returning #{word} as special case #{Porter2::SPECIAL_CASES[word]}"
        Porter2::SPECIAL_CASES[word]
      else
        r1 = word.porter2_r1
        r2 = word.porter2_r2
        puts "R1 = #{r1}, R2 = #{r2}"

        w0 = word.porter2_step0 ; puts "After step 0:  #{w0} (R1 = #{w0.porter2_r1}, R2 = #{w0.porter2_r2})"
        w1a = w0.porter2_step1a ; puts "After step 1a: #{w1a} (R1 = #{w1a.porter2_r1}, R2 = #{w1a.porter2_r2})"

        if Porter2::STEP_1A_SPECIAL_CASES.include? w1a
          puts "Returning #{w1a} as 1a special case"
          w1a
        else
          w1b = w1a.porter2_step1b(gb_english) ; puts "After step 1b: #{w1b} (R1 = #{w1b.porter2_r1}, R2 = #{w1b.porter2_r2})"
          w1c = w1b.porter2_step1c ; puts "After step 1c: #{w1c} (R1 = #{w1c.porter2_r1}, R2 = #{w1c.porter2_r2})"
          w2 = w1c.porter2_step2(gb_english) ; puts "After step 2:  #{w2} (R1 = #{w2.porter2_r1}, R2 = #{w2.porter2_r2})"
          w3 = w2.porter2_step3(gb_english) ; puts "After step 3:  #{w3} (R1 = #{w3.porter2_r1}, R2 = #{w3.porter2_r2})"
          w4 = w3.porter2_step4(gb_english) ; puts "After step 4:  #{w4} (R1 = #{w4.porter2_r1}, R2 = #{w4.porter2_r2})"
          w5 = w4.porter2_step5 ; puts "After step 5:  #{w5}"
          wpost = w5.porter2_postprocess ; puts "After postprocess: #{wpost}"
          wpost
        end
      end
    end

porter2_step0()

Search for the longest among the suffixes

  • ’

  • ’s

  • ’s’

and remove it if found.

    # File lib/porter2stemmer/implementation.rb, line 75
    def porter2_step0
      # use sub rather than sub! so the receiver is never modified in place
      self.sub(/(.)('s'|'s|')$/, '\1')
    end
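
A small standalone sketch of this step, written as a plain function (the name step0 is hypothetical) that returns a new string and leaves its argument untouched, so it also works on frozen strings:

```ruby
# Remove a trailing ', 's or 's' when at least one character precedes it.
def step0(word)
  word.sub(/(.)('s'|'s|')$/, '\1')
end

original = "cat's".freeze
puts step0(original)   # the frozen input is not modified
puts step0("cats'")
puts step0("cat's'")
```

Because the regular expression is matched from the left, the longest of the three suffixes is the one that gets removed.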

porter2_step1a()

Search for the longest among the following suffixes, and perform the action indicated.

  • sses: replace by ss

  • ied, ies: replace by i if preceded by more than one letter, otherwise by ie

  • s: delete if the preceding word part contains a vowel not immediately before the s

  • us, ss: do nothing

    # File lib/porter2stemmer/implementation.rb, line 85
    def porter2_step1a
      if self =~ /sses$/
        self.sub(/sses$/, 'ss')
      elsif self =~ /..(ied|ies)$/
        self.sub(/(ied|ies)$/, 'i')
      elsif self =~ /(ied|ies)$/
        self.sub(/(ied|ies)$/, 'ie')
      elsif self =~ /(us|ss)$/
        self
      elsif self =~ /s$/
        if self =~ /(#{Porter2::V}.+)s$/
          self.sub(/s$/, '')
        else
          self
        end
      else
        self
      end
    end
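
The rules above can be exercised with a compact standalone sketch. It follows the same branch order as the method but is written as a case expression; the function name and the VOWEL constant are assumptions for the example.

```ruby
VOWEL = "[aeiouy]" # assumed vowel class

def step1a(word)
  case word
  when /sses$/          then word.sub(/sses$/, "ss")
  when /..(?:ied|ies)$/ then word.sub(/(?:ied|ies)$/, "i")   # more than one letter before
  when /(?:ied|ies)$/   then word.sub(/(?:ied|ies)$/, "ie")
  when /(?:us|ss)$/     then word
  when /#{VOWEL}.+s$/   then word.sub(/s$/, "")              # vowel not right before the s
  else word
  end
end

%w[caresses ties cries gas gaps kiwis this focus].each do |w|
  puts "#{w} -> #{step1a(w)}"
end
```

Note how gas and this keep their s because their only vowel sits immediately before it, whereas gaps and kiwis lose it.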

porter2_step1b(gb_english = false)

Search for the longest among the following suffixes, and perform the action indicated.

  • eed, eedly: replace by ee if the suffix is in R1

  • ed, edly, ing, ingly: delete if the preceding word part contains a vowel and, after the deletion:

      • if the word ends in at, bl or iz: add e, or

      • if the word ends with a double: remove the last letter, or

      • if the word is short: add e

(If gb_english is true, treat the ‘is’ suffix as ‘iz’ above.)

    # File lib/porter2stemmer/implementation.rb, line 115
    def porter2_step1b(gb_english = false)
      if self =~ /(eed|eedly)$/
        if self.porter2_r1 =~ /(eed|eedly)$/
          self.sub(/(eed|eedly)$/, 'ee')
        else
          self
        end
      else
        w = self.dup
        if w =~ /#{Porter2::V}.*(ed|edly|ing|ingly)$/
          w.sub!(/(ed|edly|ing|ingly)$/, '')
          # 'bl', as in the description above (the listing had 'lb')
          if w =~ /(at|bl|iz)$/
            w += 'e'
          elsif w =~ /is$/ and gb_english
            w += 'e'
          elsif w =~ /#{Porter2::Double}$/
            w.chop!
          elsif w.porter2_is_short_word?
            w += 'e'
          end
        end
        w
      end
    end
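
To see the three repair rules in action, here is a self-contained sketch of this step. All constants and helper names are assumptions: the character classes, the list of doubles and the short-syllable pattern follow the published Porter2 description rather than the gem's Porter2 module, which is not shown on this page, and the R1 helper skips the gener/commun/arsen exceptions.

```ruby
VOWEL = "[aeiouy]"
NON_VOWEL = "[^aeiouy]"
DOUBLE = "(?:bb|dd|ff|gg|mm|nn|pp|rr|tt)"
SHORT_SYLLABLE = "(?:#{NON_VOWEL}#{VOWEL}[^aeiouywxY]|^#{VOWEL}#{NON_VOWEL})"

def r1_of(word)
  m = word.match(/#{VOWEL}#{NON_VOWEL}(.*)$/)
  m ? m[1] : ""
end

def short_word?(word)
  !!(word =~ /#{SHORT_SYLLABLE}$/) && r1_of(word).empty?
end

def step1b(word, gb_english = false)
  if word =~ /(?:eed|eedly)$/
    # only shorten when the suffix lies inside R1
    if r1_of(word) =~ /(?:eed|eedly)$/
      return word.sub(/(?:eed|eedly)$/, "ee")
    end
    return word
  end
  return word unless word =~ /#{VOWEL}.*(?:ed|edly|ing|ingly)$/

  stem = word.sub(/(?:ed|edly|ing|ingly)$/, "")
  if stem =~ /(?:at|bl|iz)$/ || (gb_english && stem.end_with?("is"))
    stem + "e"
  elsif stem =~ /#{DOUBLE}$/
    stem.chop
  elsif short_word?(stem)
    stem + "e"
  else
    stem
  end
end

%w[agreed feed luxuriated hopping hoping troubled sing].each do |w|
  puts "#{w} -> #{step1b(w)}"
end
```

The pair hopping and hoping shows why the order of the rules matters: the first loses its doubled p, the second regains an e because hop is a short word.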

porter2_step1c()

Replace a suffix of y or Y by i if it is preceded by a non-vowel which is not the first letter of the word.

    # File lib/porter2stemmer/implementation.rb, line 143
    def porter2_step1c
      if self =~ /.+#{Porter2::C}(y|Y)$/
        self.sub(/(y|Y)$/, 'i')
      else
        self
      end
    end
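
A short sketch of this rule, assuming the non-vowel class is [^aeiouy] (so a marked Y counts as a non-vowel); the function name is hypothetical:

```ruby
NON_VOWEL = "[^aeiouy]" # assumed

def step1c(word)
  # .+ forces at least one letter before the non-vowel that precedes the y
  word =~ /.+#{NON_VOWEL}[yY]$/ ? word.sub(/[yY]$/, "i") : word
end

%w[cry happy by saY].each do |w|
  puts "#{w} -> #{step1c(w)}"
end
```

Here cry and happy end in i afterwards, by is too short to qualify, and saY (say after preprocessing) is unchanged because its Y follows a vowel.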

porter2_step2(gb_english = false)

Search for the longest among the suffixes listed in the keys of Porter2::STEP_2_MAPS. If one is found and that suffix occurs in R1, replace it with the value found in STEP_2_MAPS.

(Suffixes ‘ogi’ and ‘li’ are treated as special cases in the procedure.)

(If gb_english is true, replace the ‘iser’ and ‘isation’ suffixes with ‘ise’, similarly to how ‘izer’ and ‘ization’ are treated.)

    # File lib/porter2stemmer/implementation.rb, line 160
    def porter2_step2(gb_english = false)
      r1 = self.porter2_r1
      s2m = Porter2::STEP_2_MAPS.dup
      if gb_english
        s2m["iser"] = "ise"
        s2m["isation"] = "ise"
      end
      step_2_re = Regexp.union(s2m.keys.map {|r| Regexp.new(r + "$")})
      if self =~ step_2_re
        # keep the matched suffix: later matches overwrite $&
        suffix = $&
        if r1 =~ /#{suffix}$/
          self.sub(/#{suffix}$/, s2m[suffix])
        else
          self
        end
      elsif r1 =~ /li$/ and self =~ /(#{Porter2::Valid_LI})li$/
        self.sub(/li$/, '')
      elsif r1 =~ /ogi$/ and self =~ /logi$/
        self.sub(/ogi$/, 'og')
      else
        self
      end
    end
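
The interplay between "longest suffix" and "in R1" can be shown with a reduced sketch. STEP_2_SAMPLE is a hypothetical five-entry subset, not the contents of Porter2::STEP_2_MAPS, and the suffix is found by sorting keys by length instead of the Regexp.union used above. The ‘li’ and ‘ogi’ special cases and the R1 prefix exceptions are left out.

```ruby
# Hypothetical subset of the suffix map, for illustration only.
STEP_2_SAMPLE = {
  "ational" => "ate",
  "tional"  => "tion",
  "ization" => "ize",
  "izer"    => "ize",
  "fulness" => "ful"
}.freeze

def r1_of(word)
  m = word.match(/[aeiouy][^aeiouy](.*)$/)
  m ? m[1] : ""
end

def step2(word, gb_english = false)
  map = STEP_2_SAMPLE.dup
  map.merge!("iser" => "ise", "isation" => "ise") if gb_english
  # pick the longest listed suffix first, then test it against R1
  suffix = map.keys.sort_by { |k| -k.length }.find { |k| word.end_with?(k) }
  return word unless suffix && r1_of(word).end_with?(suffix)
  word[0...-suffix.length] + map[suffix]
end

puts step2("relational")       # "ational" lies in R1
puts step2("rational")         # "ational" is found but is not in R1
puts step2("organiser", true)  # British spelling enabled
```

The word rational stays as it is: its longest matching suffix is "ational", which does not fit in R1, and the shorter "tional" is not tried as a fallback.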

porter2_step3(gb_english = false)

Search for the longest among the suffixes listed in the keys of Porter2::STEP_3_MAPS. If one is found and that suffix occurs in R1, replace it with the value found in STEP_3_MAPS.

(Suffix ‘ative’ is treated as a special case in the procedure.)

(If gb_english is true, replace the ‘alise’ suffix with ‘al’, similarly to how ‘alize’ is treated.)

    # File lib/porter2stemmer/implementation.rb, line 192
    def porter2_step3(gb_english = false)
      if self =~ /ative$/ and self.porter2_r2 =~ /ative$/
        self.sub(/ative$/, '')
      else
        s3m = Porter2::STEP_3_MAPS.dup
        if gb_english
          s3m["alise"] = "al"
        end
        step_3_re = Regexp.union(s3m.keys.map {|r| Regexp.new(r + "$")})
        r1 = self.porter2_r1
        if self =~ step_3_re and r1 =~ /#{$&}$/
          self.sub(/#{$&}$/, s3m[$&])
        else
          self
        end
      end
    end

porter2_step4(gb_english = false)

Search for the longest among the suffixes listed in the keys of Porter2::STEP_4_MAPS. If one is found and that suffix occurs in R2, replace it with the value found in STEP_4_MAPS.

(Suffix ‘ion’ is treated as a special case in the procedure.)

(If gb_english is true, delete the ‘ise’ suffix if found.)

    # File lib/porter2stemmer/implementation.rb, line 218
    def porter2_step4(gb_english = false)
      if self.porter2_r2 =~ /ion$/ and self =~ /(s|t)ion$/
        self.sub(/ion$/, '')
      else
        s4m = Porter2::STEP_4_MAPS.dup
        if gb_english
          s4m["ise"] = ""
        end
        step_4_re = Regexp.union(s4m.keys.map {|r| Regexp.new(r + "$")})
        r2 = self.porter2_r2
        if self =~ step_4_re
          # the suffix must end R2, so anchor the check like the other steps
          if r2 =~ /#{$&}$/
            self.sub(/#{$&}$/, s4m[$&])
          else
            self
          end
        else
          self
        end
      end
    end
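
The R2 condition and the ‘ion’ special case can be illustrated with a reduced sketch. STEP_4_SAMPLE is a hypothetical handful of suffixes, not the gem's Porter2::STEP_4_MAPS; every one of them is simply deleted here. Helper names are assumptions, and the R1 prefix exceptions are omitted.

```ruby
STEP_4_SAMPLE = %w[ement ment ent able ize].freeze # illustrative subset

def after_first_vc(text)
  m = text.match(/[aeiouy][^aeiouy](.*)$/)
  m ? m[1] : ""
end

def r2_of(word)
  after_first_vc(after_first_vc(word))
end

def step4(word)
  r2 = r2_of(word)
  # 'ion' is removed only when it is in R2 and follows s or t
  return word.sub(/ion$/, "") if r2.end_with?("ion") && word =~ /[st]ion$/

  suffix = STEP_4_SAMPLE.sort_by { |s| -s.length }.find { |s| word.end_with?(s) }
  return word unless suffix && r2.end_with?(suffix)
  word[0...-suffix.length]
end

%w[adoption station adjustment element].each do |w|
  puts "#{w} -> #{step4(w)}"
end
```

In this sketch adoption and adjustment are shortened, station is kept because its R2 is empty, and element is kept because its longest suffix "ement" does not fit in R2 even though "ent" would.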

porter2_step5()

Search for the following suffixes and, if found, perform the action indicated.

  • e: delete if in R2, or in R1 and not preceded by a short syllable

  • l: delete if in R2 and preceded by l

    # File lib/porter2stemmer/implementation.rb, line 244
    def porter2_step5
      if self =~ /ll$/ and self.porter2_r2 =~ /l$/
        self.sub(/ll$/, 'l')
      elsif self =~ /e$/ and self.porter2_r2 =~ /e$/
        self.sub(/e$/, '')
      else
        r1 = self.porter2_r1
        if self =~ /e$/ and r1 =~ /e$/ and not self =~ /#{Porter2::SHORT_SYLLABLE}e$/
          self.sub(/e$/, '')
        else
          self
        end
      end
    end
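
The conditions for dropping a final e or l depend on R1, R2 and the short-syllable test together, which a few sample words make concrete. The sketch below is standalone; its constants and helper names are assumptions based on the published Porter2 description, and the R1 prefix exceptions are omitted.

```ruby
VOWEL = "[aeiouy]"
NON_VOWEL = "[^aeiouy]"
SHORT_SYLLABLE = "(?:#{NON_VOWEL}#{VOWEL}[^aeiouywxY]|^#{VOWEL}#{NON_VOWEL})"

def after_first_vc(text)
  m = text.match(/#{VOWEL}#{NON_VOWEL}(.*)$/)
  m ? m[1] : ""
end

def step5(word)
  r1 = after_first_vc(word)
  r2 = after_first_vc(r1)
  if word.end_with?("ll") && r2.end_with?("l")
    word.chop
  elsif word.end_with?("e") &&
        (r2.end_with?("e") ||
         (r1.end_with?("e") && word !~ /#{SHORT_SYLLABLE}e$/))
    word.chop
  else
    word
  end
end

%w[relate hope house controll bell].each do |w|
  puts "#{w} -> #{step5(w)}"
end
```

With these definitions relate loses its e because the e is in R2, house loses it because the e is in R1 and "hous" does not end in a short syllable, and hope keeps it because "hop" does.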

porter2_tidy()

Tidy up the word before we get down to the algorithm

    # File lib/porter2stemmer/implementation.rb, line 7
    def porter2_tidy
      preword = self.to_s.strip.downcase

      # map apostrophe-like characters to apostrophes
      preword.gsub!(/‘/, "'")
      preword.gsub!(/’/, "'")

      preword
    end

stem(gb_english = false)
Alias for: porter2_stem


Generated with the Darkfish Rdoc Generator 1.1.6.