summaryrefslogtreecommitdiff
path: root/rt/docs/full_text_indexing.pod
blob: 6b0025d623189db166e799c465e9aaed25fe1732 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
=head1 NAME

Full text indexing in RT

=head1 LIMITATIONS

While all of the below solutions can search for Unicode characters, they
are not otherwise Unicode aware, and do no case folding, normalization,
or the like.  That is, a string that contains C<U+0065 LATIN SMALL
LETTER E> followed by C<U+0301 COMBINING ACUTE ACCENT> will not match a
search for C<U+00E9 LATIN SMALL LETTER E WITH ACUTE>.  They also only
know how to tokenize C<latin-1>-ish languages where words are separated
by whitespace or similar characters; as such, support for searching for
Japanese and Chinese content is extremely limited.

=head1 POSTGRES

=head2 Creating and configuring the index

Postgres 8.3 and above support full-text searching natively; to set up
the required C<ts_vector> column, and create either a C<GiN> or C<GiST>
index on it, run:

    sbin/rt-setup-fulltext-index

If you have a non-standard database administrator username or password,
you may need to pass the C<--dba> or C<--dba-password> options:

    sbin/rt-setup-fulltext-index --dba postgres --dba-password secret

This will also output an appropriate C<%FullTextSearch> configuration to
add to your F<RT_SiteConfig.pm>; you will need to restart your webserver
after making these changes.  However, the index will also need to be
filled before it can be used.  To update the index initially, run:

    sbin/rt-fulltext-indexer --all

This will tokenize and index all existing attachments in your database;
it may take quite a while if your database already has a large number of
tickets in it.

=head2 Updating the index

To keep the index up-to-date, you will need to run:

    sbin/rt-fulltext-indexer

...at regular intervals.  By default, this will only tokenize up to 100
tickets at a time; you can adjust this upwards by passing
C<--limit 500>.  Larger batch sizes will take longer and
consume more memory.

If there is already an instances of C<rt-fulltext-indexer> running, new
ones will exit abnormally (with exit code 1) and the error message
"rt-fulltext-indexer is already running."  You can suppress this message
and end those processes normally (with exit code 0) using the C<--quiet>
option; this is particularly useful when running the command via
C<cron>:

    sbin/rt-fulltext-indexer --quiet

=head1 MYSQL

MySQL does not support full-text indexing natively.  However, it does
integrate with the external Sphinx engine, available from
L<http://sphinxsearch.com>.  Unfortunately, Sphinx integration (using
SphinxSE) does require that you recompile MySQL from source.  Most
distribution-provided packages for MySQL do not include SphinxSE
integration, merely the external Sphinx tools; these are not sufficient
for RT's needs.

=head2 Compiling MySQL and SphinxSE

SphinxSE requires MySQL 5.0 or 5.1; later versions of MySQL have not
been tested at this time.  Sphinx version 2.0.1 has been tested to work,
but version 0.9.9 may work as well.  Compilation and installation
instructions for MySQL with SphinxSE can be found at
L<http://sphinxsearch.com/docs/current.html#sphinxse-installing>.

=head2 Creating and configuring the index

Once MySQL has been recompiled with SphinxSE, and Sphinx itself is
installed, you may create the required SphinxSE communication table via:

    sbin/rt-setup-fulltext-index

If you have a non-standard database administrator username or password,
you may need to pass the C<--dba> or C<--dba-password> options:

    sbin/rt-setup-fulltext-index --dba root --dba-password secret

This will also provide you with the appropriate C<%FullTextSearch>
configuration to add to your F<RT_SiteConfig.pm>; you will need to
restart your webserver after making these changes.  It will also print a
sample Sphinx configuration, which should be placed in
F</etc/sphinx.conf>, or equivalent.

To fill the index, you will need to run the C<indexer> command-line tool
provided by Sphinx:

    indexer rt

Finally, start the Sphinx search daemon:

    searchd

=head2 Updating the index

To keep the index up-to-date, you will need to run:

    indexer rt --rotate

...at regular intervals in order to pick up new and updated attachments
from RT's database.  Failure to do so will result in stale data.

=head2 Caveats

Sphinx only returns a finite number of matches to any query; this number
is controlled by C<max_matches> in F</etc/sphinx.conf> and
C<%FullTextSearch>'s C<MaxMatches> in C<RT_SiteConfig.pm>, which must be
kept in sync.  The default, set during C<rt-setup-fulltext-index>, is
10000.  This limit may lead to false negatives in search results if the
maximum number of matches is reached but the results returned do not
match RT's other criteria.

Take, for example, the instance where Sphinx is configured to return a
maximum of three results, and tickets 1, 2, 3, 4, and 5 contain the
string "target", but only ticket 5 is in status "Open".  A search for
C<Content LIKE 'target' AND Status = 'Open'> may return no results,
despite ticket 5 matching those criteria, as Sphinx will only return
tickets 1, 2, and 3 as possible matches.

After index creation, altering C<MaxMatches> in C<RT_SiteConfig.pm> is
insufficient to adjust this limit; both C<max_matches> in
F</etc/sphinx.conf> and C<%FullTextSearch>'s C<MaxMatches> in
C<RT_SiteConfig.pm> must be updated.

=head1 ORACLE

=head2 Creating and configuring the index

Oracle supports full-text indexing natively using the Oracle Text
package.  Once Oracle Text is installed and configured, run:

    sbin/rt-setup-fulltext-index

If you have a non-standard database administrator username or password,
you may need to pass the C<--dba> or C<--dba-password> options:

    sbin/rt-setup-fulltext-index --dba sysdba --dba-password secret

This will create an Oracle CONTEXT index on the Content column in the
Attachments table, as well as several preferences, functions and
triggers to support this index.  The script will also output an
appropriate C<%FullTextSearch> configuration to add to your
F<RT_SiteConfig>.

=head2 Updating the index

To update the index, you will need to run the following at regular
intervals:

    sbin/rt-fulltext-indexer

This, in effect, simply runs:

    begin
    ctx_ddl.sync_index('rt_fts_index', '2M');
    end;

The amount of memory used for the sync can be controlled with the
C<--memory> option:

    rt-fulltext-indexer --memory 10M

If there is already an instance of C<rt-fulltext-indexer> running, new
ones will exit abnormally (with exit code 1) and the error message
"rt-fulltext-indexer is already running."  You can suppress this message
and end those processes normally (with exit code 0) using the C<--quiet>
option; this is particularly useful when running the command via
C<cron>:

    sbin/rt-fulltext-indexer --quiet

Instead of being run via C<cron>, this may instead be run via a
DBMS_JOB; read the B<Managing DML Operations for a CONTEXT Index>
chapter of Oracle's B<Text Application Developer's Guide> for details
how to keep the index optimized, perform garbage collection, and other
tasks.

=cut