X-Git-Url: https://git.njae.me.uk/?a=blobdiff_plain;f=doc%2FString.html;fp=doc%2FString.html;h=0000000000000000000000000000000000000000;hb=c8c08d5fafba205c5a8e4138edb0df059a63de36;hp=f04ae9aa7771b3591944e48aa30427256ac4f35d;hpb=aa5699112116010af665e83447bd7d1a17d0e2f7;p=porter2stemmer.git diff --git a/doc/String.html b/doc/String.html deleted file mode 100644 index f04ae9a..0000000 --- a/doc/String.html +++ /dev/null @@ -1,1144 +0,0 @@ - - - -
- - -Object
- --Implementation of the Porter 2 stemmer. String#porter2_stem is the -main stemming procedure. -
- --Returns true if the word ends with a short syllable -
- - - -- # File lib/porter2_implementation.rb, line 59 -59: def porter2_ends_with_short_syllable? -60: self =~ /#{Porter2::SHORT_SYLLABLE}$/ ? true : false -61: end-
-A word is short if it ends in a short syllable, and R1 is null -
- - - -- # File lib/porter2_implementation.rb, line 65 -65: def porter2_is_short_word? -66: self.porter2_ends_with_short_syllable? and self.porter2_r1.empty? -67: end-
-Turn all Y letters into y -
- - - -- # File lib/porter2_implementation.rb, line 261 -261: def porter2_postprocess -262: self.gsub(/Y/, 'y') -263: end-
-Preprocess the word. Remove any initial ’, if present. Then, set -initial y, or y after a vowel, to Y -
--(The comment to ‘establish the regions R1 and R2’ in the -original description is an implementation optimisation that identifies -where the regions start. As no modifications are made to the word that -affect those positions, you may want to cache them now. This implementation -doesn’t do that.) -
- - - -- # File lib/porter2_implementation.rb, line 25 -25: def porter2_preprocess -26: w = self.dup -27: -28: # remove any initial apostrophe -29: w.gsub!(/^'*(.)/, '\1') -30: -31: # set initial y, or y after a vowel, to Y -32: w.gsub!(/^y/, "Y") -33: w.gsub!(/(#{Porter2::V})y/, '\1Y') -34: -35: w -36: end-
-R1 is the portion of the word after the first non-vowel after the first -vowel (with words beginning ‘gener-’, ‘commun-’, -and ‘arsen-’ treated as special cases -
- - - -- # File lib/porter2_implementation.rb, line 41 -41: def porter2_r1 -42: if self =~ /^(gener|commun|arsen)(?<r1>.*)/ -43: Regexp.last_match(:r1) -44: else -45: self =~ /#{Porter2::V}#{Porter2::C}(?<r1>.*)$/ -46: Regexp.last_match(:r1) || "" -47: end -48: end-
-R2 is the portion of R1 (porter2_r1) after the first -non-vowel after the first vowel -
- - - -- # File lib/porter2_implementation.rb, line 52 -52: def porter2_r2 -53: self.porter2_r1 =~ /#{Porter2::V}#{Porter2::C}(?<r2>.*)$/ -54: Regexp.last_match(:r2) || "" -55: end-
-Perform the stemming procedure. If gb_english is true, treat -’-ise’ and similar suffixes as ’-ize’ in American -English. -
- - - -- # File lib/porter2_implementation.rb, line 269 -269: def porter2_stem(gb_english = false) -270: preword = self.porter2_tidy -271: return preword if preword.length <= 2 -272: -273: word = preword.porter2_preprocess -274: -275: if Porter2::SPECIAL_CASES.has_key? word -276: Porter2::SPECIAL_CASES[word] -277: else -278: w1a = word.porter2_step0.porter2_step1a -279: if Porter2::STEP_1A_SPECIAL_CASES.include? w1a -280: w1a -281: else -282: w1a.porter2_step1b(gb_english).porter2_step1c.porter2_step2(gb_english).porter2_step3(gb_english).porter2_step4(gb_english).porter2_step5.porter2_postprocess -283: end -284: end -285: end-
-A verbose version of porter2_stem that prints the -output of each stage to STDOUT -
- - - -- # File lib/porter2_implementation.rb, line 288 -288: def porter2_stem_verbose(gb_english = false) -289: preword = self.porter2_tidy -290: puts "Preword: #{preword}" -291: return preword if preword.length <= 2 -292: -293: word = preword.porter2_preprocess -294: puts "Preprocessed: #{word}" -295: -296: if Porter2::SPECIAL_CASES.has_key? word -297: puts "Returning #{word} as special case #{Porter2::SPECIAL_CASES[word]}" -298: Porter2::SPECIAL_CASES[word] -299: else -300: r1 = word.porter2_r1 -301: r2 = word.porter2_r2 -302: puts "R1 = #{r1}, R2 = #{r2}" -303: -304: w0 = word.porter2_step0 ; puts "After step 0: #{w0} (R1 = #{w0.porter2_r1}, R2 = #{w0.porter2_r2})" -305: w1a = w0.porter2_step1a ; puts "After step 1a: #{w1a} (R1 = #{w1a.porter2_r1}, R2 = #{w1a.porter2_r2})" -306: -307: if Porter2::STEP_1A_SPECIAL_CASES.include? w1a -308: puts "Returning #{w1a} as 1a special case" -309: w1a -310: else -311: w1b = w1a.porter2_step1b(gb_english) ; puts "After step 1b: #{w1b} (R1 = #{w1b.porter2_r1}, R2 = #{w1b.porter2_r2})" -312: w1c = w1b.porter2_step1c ; puts "After step 1c: #{w1c} (R1 = #{w1c.porter2_r1}, R2 = #{w1c.porter2_r2})" -313: w2 = w1c.porter2_step2(gb_english) ; puts "After step 2: #{w2} (R1 = #{w2.porter2_r1}, R2 = #{w2.porter2_r2})" -314: w3 = w2.porter2_step3(gb_english) ; puts "After step 3: #{w3} (R1 = #{w3.porter2_r1}, R2 = #{w3.porter2_r2})" -315: w4 = w3.porter2_step4(gb_english) ; puts "After step 4: #{w4} (R1 = #{w4.porter2_r1}, R2 = #{w4.porter2_r2})" -316: w5 = w4.porter2_step5 ; puts "After step 5: #{w5}" -317: wpost = w5.porter2_postprocess ; puts "After postprocess: #{wpost}" -318: wpost -319: end -320: end -321: end-
-Search for the longest among the suffixes, -
--‘ -
--’s -
--’s’ -
--and remove if found. -
- - - -- # File lib/porter2_implementation.rb, line 75 -75: def porter2_step0 -76: self.sub!(/(.)('s'|'s|')$/, '\1') || self -77: end-
-Search for the longest among the following suffixes, and perform the action -indicated. -
-sses | -replace by ss - - |
ied, ies | -replace by i if preceded by more than one letter, otherwise by ie - - |
s | -delete if the preceding word part contains a vowel not immediately before -the s - - |
us, ss | -do nothing - - |
- # File lib/porter2_implementation.rb, line 85 - 85: def porter2_step1a - 86: if self =~ /sses$/ - 87: self.sub(/sses$/, 'ss') - 88: elsif self =~ /..(ied|ies)$/ - 89: self.sub(/(ied|ies)$/, 'i') - 90: elsif self =~ /(ied|ies)$/ - 91: self.sub(/(ied|ies)$/, 'ie') - 92: elsif self =~ /(us|ss)$/ - 93: self - 94: elsif self =~ /s$/ - 95: if self =~ /(#{Porter2::V}.+)s$/ - 96: self.sub(/s$/, '') - 97: else - 98: self - 99: end -100: else -101: self -102: end -103: end-
-Search for the longest among the following suffixes, and perform the action -indicated. -
-eed, eedly | -replace by ee if the suffix is also in R1 - - |
ed, edly, ing, ingly | -delete if the preceding word part contains a vowel and, after the -deletion: - -
|
-(If gb_english is true, treat the ‘is’ suffix as -‘iz’ above.) -
- - - -- # File lib/porter2_implementation.rb, line 115 -115: def porter2_step1b(gb_english = false) -116: if self =~ /(eed|eedly)$/ -117: if self.porter2_r1 =~ /(eed|eedly)$/ -118: self.sub(/(eed|eedly)$/, 'ee') -119: else -120: self -121: end -122: else -123: w = self.dup -124: if w =~ /#{Porter2::V}.*(ed|edly|ing|ingly)$/ -125: w.sub!(/(ed|edly|ing|ingly)$/, '') -126: if w =~ /(at|lb|iz)$/ -127: w += 'e' -128: elsif w =~ /is$/ and gb_english -129: w += 'e' -130: elsif w =~ /#{Porter2::Double}$/ -131: w.chop! -132: elsif w.porter2_is_short_word? -133: w += 'e' -134: end -135: end -136: w -137: end -138: end-
-Replace a suffix of y or Y by i if it is preceded by a non-vowel which is -not the first letter of the word. -
- - - -- # File lib/porter2_implementation.rb, line 143 -143: def porter2_step1c -144: if self =~ /.+#{Porter2::C}(y|Y)$/ -145: self.sub(/(y|Y)$/, 'i') -146: else -147: self -148: end -149: end-
-Search for the longest among the suffixes listed in the keys of -Porter2::STEP_2_MAPS. If one is found and that suffix occurs in R1, -replace it with the value found in STEP_2_MAPS. -
--(Suffixes ‘ogi’ and ‘li’ are treated as special -cases in the procedure.) -
--(If gb_english is true, replace the ‘iser’ and -‘isation’ suffixes with ‘ise’, similarly to how -‘izer’ and ‘ization’ are treated.) -
- - - -- # File lib/porter2_implementation.rb, line 160 -160: def porter2_step2(gb_english = false) -161: r1 = self.porter2_r1 -162: s2m = Porter2::STEP_2_MAPS.dup -163: if gb_english -164: s2m["iser"] = "ise" -165: s2m["isation"] = "ise" -166: end -167: step_2_re = Regexp.union(s2m.keys.map {|r| Regexp.new(r + "$")}) -168: if self =~ step_2_re -169: if r1 =~ /#{$&}$/ -170: self.sub(/#{$&}$/, s2m[$&]) -171: else -172: self -173: end -174: elsif r1 =~ /li$/ and self =~ /(#{Porter2::Valid_LI})li$/ -175: self.sub(/li$/, '') -176: elsif r1 =~ /ogi$/ and self =~ /logi$/ -177: self.sub(/ogi$/, 'og') -178: else -179: self -180: end -181: end-
-Search for the longest among the suffixes listed in the keys of -Porter2::STEP_3_MAPS. If one is found and that suffix occurs in R1, -replace it with the value found in STEP_3_MAPS. -
--(Suffix ‘ative’ is treated as a special case in the procedure.) -
--(If gb_english is true, replace the ‘alise’ suffix -with ‘al’, similarly to how ‘alize’ is treated.) -
- - - -- # File lib/porter2_implementation.rb, line 192 -192: def porter2_step3(gb_english = false) -193: if self =~ /ative$/ and self.porter2_r2 =~ /ative$/ -194: self.sub(/ative$/, '') -195: else -196: s3m = Porter2::STEP_3_MAPS.dup -197: if gb_english -198: s3m["alise"] = "al" -199: end -200: step_3_re = Regexp.union(s3m.keys.map {|r| Regexp.new(r + "$")}) -201: r1 = self.porter2_r1 -202: if self =~ step_3_re and r1 =~ /#{$&}$/ -203: self.sub(/#{$&}$/, s3m[$&]) -204: else -205: self -206: end -207: end -208: end-
-Search for the longest among the suffixes listed in the keys of -Porter2::STEP_4_MAPS. If one is found and that suffix occurs in R2, -replace it with the value found in STEP_4_MAPS. -
--(Suffix ‘ion’ is treated as a special case in the procedure.) -
--(If gb_english is true, delete the ‘ise’ suffix if -found.) -
- - - -- # File lib/porter2_implementation.rb, line 218 -218: def porter2_step4(gb_english = false) -219: if self.porter2_r2 =~ /ion$/ and self =~ /(s|t)ion$/ -220: self.sub(/ion$/, '') -221: else -222: s4m = Porter2::STEP_4_MAPS.dup -223: if gb_english -224: s4m["ise"] = "" -225: end -226: step_4_re = Regexp.union(s4m.keys.map {|r| Regexp.new(r + "$")}) -227: r2 = self.porter2_r2 -228: if self =~ step_4_re -229: if r2 =~ /#{$&}/ -230: self.sub(/#{$&}$/, s4m[$&]) -231: else -232: self -233: end -234: else -235: self -236: end -237: end -238: end-
-Search for the the following suffixes, and, if found, perform the action -indicated. -
-e | -delete if in R2, or in R1 and not preceded by a short syllable - - |
l | -delete if in R2 and preceded by l - - |
- # File lib/porter2_implementation.rb, line 244 -244: def porter2_step5 -245: if self =~ /ll$/ and self.porter2_r2 =~ /l$/ -246: self.sub(/ll$/, 'l') -247: elsif self =~ /e$/ and self.porter2_r2 =~ /e$/ -248: self.sub(/e$/, '') -249: else -250: r1 = self.porter2_r1 -251: if self =~ /e$/ and r1 =~ /e$/ and not self =~ /#{Porter2::SHORT_SYLLABLE}e$/ -252: self.sub(/e$/, '') -253: else -254: self -255: end -256: end -257: end-
-Tidy up the word before we get down to the algorithm -
- - - -- # File lib/porter2_implementation.rb, line 7 - 7: def porter2_tidy - 8: preword = self.to_s.strip.downcase - 9: -10: # map apostrophe-like characters to apostrophes -11: preword.gsub!(/â/, "'") -12: preword.gsub!(/â/, "'") -13: -14: preword -15: end-
Disabled; run with --debug to generate this.
- -Generated with the Darkfish - Rdoc Generator 1.1.6.
-