# coding: utf-8

# ==The Porter 2 stemmer
# This is the Porter 2 stemming algorithm, as described at
# http://snowball.tartarus.org/algorithms/english/stemmer.html
# The original paper is:
#
# Porter, 1980, "An algorithm for suffix stripping", _Program_, Vol. 14,
# no. 3, pp 130-137
#
# ==Features of this implementation
# This stemmer is written in pure Ruby, making it easy to modify for language variants.
# For instance, the original Porter stemmer only works for American English and does
# not recognise British English's '-ise' as an alternate spelling of '-ize'. This
# implementation has been extended to handle British English correctly.
#
# This stemmer also features a comprehensive test set of over 29,000 words, taken from the
# {Porter 2 stemmer website}[http://snowball.tartarus.org/algorithms/english/stemmer.html].
#
# ==Files
# Constants for the stemmer are in the Porter2 module.
#
# Methods that implement the stemmer are added to the String class.
#
# The stemmer algorithm is implemented in the String#porter2_stem method.
#
# ==Internationalisation
# There isn't much, as this is a stemmer that only works for English.
#
# The +gb_english+ flag to the various methods allows the stemmer to treat the British
# English '-ise' the same as the American English '-ize'.
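#
# One way such a flag could work is sketched below. This is an assumption made for
# illustration, not the code used by this implementation: the +sketch_replace_suffix+
# method and its small suffix tables are hypothetical, and only show how a flag can
# add '-is-' spellings to the suffixes that are already recognised.

```ruby
# Hypothetical sketch, not the library's code: rewrite one suffix of +word+,
# optionally recognising the British '-is-' spellings as well.
def sketch_replace_suffix(word, gb_english = false)
  # Illustrative tables only; the real algorithm knows many more suffixes.
  table = { "ization" => "ize", "izer" => "ize" }
  table = table.merge({ "isation" => "ize", "iser" => "ize" }) if gb_english

  # Try longer suffixes first so that the longest matching one wins.
  table.keys.sort_by { |suffix| -suffix.length }.each do |suffix|
    next unless word.end_with?(suffix)
    return word[0...-suffix.length] + table[suffix]
  end
  word
end

sketch_replace_suffix("nationalization")        # => "nationalize"
sketch_replace_suffix("nationalisation")        # => "nationalisation"
sketch_replace_suffix("nationalisation", true)  # => "nationalize"
```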
#
# ==Longest suffixes
# Several places in the algorithm require matching the longest suffix of a word. The
# regexp engine in Ruby 1.9 seems to handle alternatives in regexps by scanning from the
# start of the string and returning the alternative that matches at the earliest position.
# As the suffix patterns can only match at the end of a word, that earliest match is also
# the longest suffix. If the regexp engine changes, this behaviour may change and break
# the stemmer.
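#
# The matching behaviour described above can be checked with a small sketch. The
# +longest_suffix+ helper and its suffix list are hypothetical, chosen only for
# illustration, and are not part of this library. The alternatives are deliberately
# listed shortest first, to show that their order does not decide the result when the
# pattern is anchored to the end of the word.

```ruby
# Hypothetical helper, not part of the stemmer's API.
# Returns the part of +word+ matched by an end-anchored alternation, or nil.
def longest_suffix(word)
  # Shortest alternative first: the leftmost match is still the longest suffix,
  # because every alternative has to run to the end of the word (\z).
  word[/(?:al|tional|ational)\z/]
end

longest_suffix("relational")   # => "ational"
longest_suffix("conditional")  # => "tional"
longest_suffix("final")        # => "al"
longest_suffix("stem")         # => nil
```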
#
# ==Usage
# Call String#porter2_stem or String#stem on a string to return its stem. As the last
# example shows, passing +true+ makes '-ise' stem in the same way as '-ize':
#   "consistency".stem        # => "consist"
#   "knitting".stem           # => "knit"
#   "articulated".stem        # => "articul"
#   "nationalize".stem        # => "nation"
#   "nationalise".stem        # => "nationalis"
#   "nationalise".stem(true)  # => "nation"
#
# ==Author
# The Porter 2 stemming algorithm was developed by
# {Martin Porter}[http://snowball.tartarus.org/algorithms/english/stemmer.html].
# This implementation is by {Neil Smith}[http://www.njae.me.uk].