The Porter 2 stemmer
====================
This is the Porter 2 stemming algorithm, as described at
http://snowball.tartarus.org/algorithms/english/stemmer.html

The original Porter stemmer, which Porter 2 revises, was published in:

Porter, 1980, "An algorithm for suffix stripping", _Program_, Vol. 14,
no. 3, pp 130-137

Features of this implementation
===============================
This stemmer is written in pure Ruby, making it easy to modify for language variants.
For instance, the original Porter stemmer only works for American English and does
not recognise British English's '-ise' as an alternate spelling of '-ize'. This
implementation has been extended to correctly handle British English.

This stemmer also features a comprehensive test set of over 29,000 words, taken from the
[Porter 2 stemmer website](http://snowball.tartarus.org/algorithms/english/stemmer.html).

Files
=====
Constants for the stemmer are in the `Porter2` module.

Methods that implement the stemmer are added to the `String` class.

The stemmer algorithm is implemented in the `String#porter2_stem` method.
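
The layout described above (constants in a module, behaviour added by reopening `String`) can be sketched as follows. This is only an illustration of the pattern: `ToyStemmer`, `PLURAL_SUFFIX` and `String#toy_strip_plural` are hypothetical names invented for the example and are not part of this library.

```ruby
# Hypothetical names throughout: ToyStemmer, PLURAL_SUFFIX and
# String#toy_strip_plural only illustrate the layout, not the real stemmer.
module ToyStemmer
  # Keeping constants in a module avoids cluttering the global namespace.
  PLURAL_SUFFIX = /s\z/
end

# Reopening String makes the new method available on every string.
class String
  def toy_strip_plural
    sub(ToyStemmer::PLURAL_SUFFIX, "")
  end
end

puts "stems".toy_strip_plural  # prints "stem"
```

Adding methods to `String` is what allows the stemmer to be called directly on a word, at the cost of extending a core class.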

Internationalisation
====================
There isn't much, as this is a stemmer that only works for English.

The `gb_english` flag to the various procedures allows the stemmer to treat the British
English '-ise' the same as the American English '-ize'.

Longest suffixes
================
Several places in the algorithm require matching the longest suffix of a word. The
regexp engine in Ruby 1.9 seems to handle alternatives in regexps by finding the
alternative that matches at the earliest position in the string. As we're only matching
suffixes, that earliest match is also the longest suffix. If the regexp engine changes,
this behaviour may change and break the stemmer.
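
A small sketch of this behaviour, using a pattern and word chosen for illustration rather than taken from the stemmer's source (the `SUFFIXES` constant and the `\z` end-of-string anchor are assumptions of the example). The alternatives are listed shortest first, yet the match that comes back is the longest suffix, because it starts earliest in the string.

```ruby
# Illustrative pattern only; alternatives are deliberately listed shortest first.
SUFFIXES = /(tion|ation|ization)\z/

word = "nationalization"
match = word.match(SUFFIXES)

# The engine tries start positions left to right, so the first position from
# which any alternative reaches the end of the string wins.
puts match[0]        # prints "ization"
puts match.begin(0)  # prints 8
```

On an engine with this leftmost-match behaviour, reordering the alternatives should give the same result for an anchored suffix pattern.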

Usage
=====
Call the `String#porter2_stem` or `String#stem` method on a string to return its stem.
The optional argument in the last example is the `gb_english` flag:

    "consistency".stem       # => "consist"
    "knitting".stem          # => "knit"
    "articulated".stem       # => "articul"
    "nationalize".stem       # => "nation"
    "nationalise".stem       # => "nationalis"
    "nationalise".stem(true) # => "nation"

Author
======
The Porter 2 stemming algorithm was developed by
[Martin Porter](http://snowball.tartarus.org/algorithms/english/stemmer.html).
This implementation is by [Neil Smith](http://www.njae.me.uk).