Integrating Mailman with a Swish-e search engine

Mailman is a very popular mailing list manager.

Unfortunately, Mailman does not provide a way to search its archives. Mailman can be integrated with Google search, but this method is discouraged: it normally takes several weeks until Googlebot crawls your new posts.

However, Mailman can easily be integrated with existing open-source indexing systems such as Swish-e, which is what this HOWTO documents.

Prerequisites

This HOWTO assumes:

  • both Mailman and Swish-e will be installed on the same machine,
  • your Mailman setup/list is already running,
  • Swish-e is already installed (if your system doesn't provide a packaged version of Swish-e, see the Swish-e documentation for more info),
  • this setup reflects the Mailman/Swish-e installation on http://lists.wpkg.org.

Apache configuration

Swish-e uses Perl; Mailman uses Python. This means that we probably need to tell Apache how to handle .cgi files. Your Apache needs mod_perl. To run Perl .cgi files, add +ExecCGI to the options of the Mailman directory. Apache also needs to know that it has to parse some or all HTML files for Server Side Includes (later, you will decide whether you want the search form on all Mailman pages, or just on thread.html, subject.html, author.html and date.html):


See http://httpd.apache.org/docs/2.2/howto/ssi.html#configuring for more info on configuring Server Side Includes (SSI) in Apache.

Swish-e configuration

Indexing configuration

  • The first step is to create a working directory for Swish-e. This is where you will keep its configuration and index files.


  • Now we can index our Mailman archive

Web search configuration

Indexing is done; now it's time to set up search on your Mailman pages.

  • First, copy swish.cgi to Mailman's cgi-bin dir:
  • Then, create a config file for swish.cgi - save it as /srv/www/vhosts/wpkg.org/swish/swishcgi.conf - I didn't want such an advanced search form, so I've hidden some entries:
  • I also didn't want to search by date: my Mailman archive was first created from an mbox file, so the dates of the .html files did not match the dates when the posts were sent to the list. Here's a patch that removes the date search, basically just commenting out the date range fields:

(Note that if you do not comment that code out, and date options still don't show up on the search page, you may be missing the Date::Calc module required by swish.cgi - see http://swish-e.org/docs/swish.cgi.html - you can test this from the command line with perl -e 'require Date::Calc' which should produce no output.)


  • swish.cgi needs to know where to look for the configuration file - open it with your favourite editor, and change the location of $DEFAULT_CONFIG_FILE:

Integrating the search with Mailman's pages

If search works - congratulations. Now it's time to integrate the search form with some of Mailman's pages. We will do it with a simple Server Side Include (SSI): <!--#include virtual="/swish_mm.cgi" --> added to Mailman pages. Did you notice swish_mm.cgi here? It is there for a reason.

swish.cgi generates a complete HTML page, including the <html>, <body> and similar tags. Because Mailman's pages already contain these tags, we have to make sure swish.cgi does not add them again.
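
To illustrate what the modified script has to achieve, here is a minimal Python sketch that reduces a complete HTML page to the fragment between its <body> tags. It is not how swish_mm.cgi works (the actual change is a patch to the Perl code); the function name body_fragment and the sample markup are made up for this illustration.

```python
import re

# Matches the opening <body ...> tag, its content, and the closing tag.
BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.IGNORECASE | re.DOTALL)


def body_fragment(page: str) -> str:
    """Return the markup between <body> and </body>.

    If the page has no body element it is returned unchanged, because it
    is then already a fragment that can be included as-is.
    """
    match = BODY_RE.search(page)
    return match.group(1).strip() if match else page


if __name__ == "__main__":
    # Hypothetical full page, standing in for what a search CGI might emit.
    full_page = (
        "<html><head><title>Search</title></head>"
        '<body bgcolor="#ffffff"><form action="/swish.cgi">...</form></body>'
        "</html>"
    )
    print(body_fragment(full_page))
```

Only a fragment like this can be dropped into a Mailman page that already has its own <html> and <body> tags.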

Copy swish.cgi to swish_mm.cgi and make these changes:


  • As you probably noticed, you will also need to edit one more file (make a copy of the original first, just in case):

  • In this patch, we comment out all the unnecessary <body>, <html> and similar tags, rename the Perl module to SWISH::TemplateDefault_MM, and send the results to a separate swish.cgi (we can't use swish_mm.cgi for that, because it doesn't output the <body>, <html> and similar tags):

Mailman configuration

Now it's time to edit Mailman template files so that Mailman pages include a search form. If you just want a search form on thread.html, subject.html, author.html and date.html, you need to add <!--#include virtual="/swish_mm.cgi" --> to three Mailman templates: archidxhead.html, archtoc.html and archtocnombox.html. It is very important that you do NOT edit the templates in MAILMANDIR/templates/en (because you would lose your changes later if you upgraded Mailman). Instead, create a directory at MAILMANDIR/templates/site/en, copy the templates you want to update to this new directory and edit the site files.

If you use the default English language in Mailman, you will find the original files in the templates/en directory of your Mailman installation. The change is simple; an example is below:

If you also want a search form on every archived message page, make a similar change in article.html.
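
As a rough sketch of this step, the following Python script copies the three index templates into templates/site/en and adds the include line to each copy. Placing the line directly after the opening <body> tag is an assumption, as is the helper name install_site_templates; adjust both to your setup.

```python
import re
import shutil
from pathlib import Path

SSI_INCLUDE = '<!--#include virtual="/swish_mm.cgi" -->'
INDEX_TEMPLATES = ("archidxhead.html", "archtoc.html", "archtocnombox.html")


def install_site_templates(mailman_dir, names=INDEX_TEMPLATES, lang="en"):
    """Copy templates to templates/site/<lang> and add the SSI include.

    The originals in templates/<lang> are never modified.
    Returns the names of the site templates that were changed.
    """
    src_dir = Path(mailman_dir) / "templates" / lang
    dst_dir = Path(mailman_dir) / "templates" / "site" / lang
    dst_dir.mkdir(parents=True, exist_ok=True)

    changed = []
    for name in names:
        dst = dst_dir / name
        if not dst.exists():
            shutil.copy2(src_dir / name, dst)
        # latin-1 round-trips every byte, so the file is not re-encoded.
        text = dst.read_text(encoding="latin-1")
        if SSI_INCLUDE in text:
            continue  # already patched
        body = re.search(r"<body\b[^>]*>", text, re.IGNORECASE)
        if body is None:
            continue  # no <body> tag found: edit this template by hand
        # Assumption: the search form goes right after the opening <body>.
        pos = body.end()
        dst.write_text(
            text[:pos] + "\n" + SSI_INCLUDE + "\n" + text[pos:],
            encoding="latin-1",
        )
        changed.append(name)
    return changed
```

Call it with your MAILMANDIR as the argument. Running it a second time changes nothing, because templates that already contain the include line are skipped.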

Once you have made the changes to the templates, you MUST restart the Mailman process, since ArchRunner keeps a cache of the templates.

Recreating Mailman's archive

If you already have a list archive, you will need to recreate it to apply all these changes. To do this, you need the mbox file created by Mailman. An example is below:


If you executed the above command as root, make sure to restore proper permissions:


That's it! Now check if search is integrated with your Mailman pages.
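
A browser only shows a page after Apache has expanded the include, so one quick way to confirm that the regenerated archive picked up the new templates is to look at the files on disk. A minimal Python sketch, assuming the four index pages are the ones that should carry the include line; the function name pages_missing_search is hypothetical:

```python
from pathlib import Path

SSI_INCLUDE = '<!--#include virtual="/swish_mm.cgi" -->'
INDEX_PAGES = ("thread.html", "subject.html", "author.html", "date.html")


def pages_missing_search(archive_root, names=INDEX_PAGES):
    """Return the index pages under archive_root without the SSI include."""
    missing = []
    for path in Path(archive_root).rglob("*.html"):
        if path.name not in names:
            continue  # individual message pages are not checked here
        # latin-1 decodes any byte sequence, so odd encodings cannot fail.
        if SSI_INCLUDE not in path.read_text(encoding="latin-1"):
            missing.append(path)
    return sorted(missing)
```

An empty result means every index page found under the given directory contains the include line.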

Adding crontab entries

You will want to crawl your archive periodically. Also, if you only want to have the search form on the thread.html, subject.html, author.html and date.html pages, you have to add the execute bit to them.

How often you do it will depend on the size of your list and the traffic it gets.
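
Adding the execute bit can be sketched in Python as follows. The sketch assumes that the execute bit is what marks a page for SSI parsing in your Apache setup (as with the XBitHack directive) and that the whole archive lives under a single directory; the function name mark_index_pages is hypothetical.

```python
import os
import stat
from pathlib import Path

INDEX_PAGES = {"thread.html", "subject.html", "author.html", "date.html"}


def mark_index_pages(archive_root):
    """Add the owner-execute bit to every index page under archive_root.

    Pages that already have the bit are left alone, so the function is
    safe to run repeatedly. Returns the paths that were changed.
    """
    changed = []
    for path in Path(archive_root).rglob("*.html"):
        if path.name not in INDEX_PAGES or not path.is_file():
            continue
        mode = stat.S_IMODE(path.stat().st_mode)  # permission bits only
        if not mode & stat.S_IXUSR:
            os.chmod(path, mode | stat.S_IXUSR)
            changed.append(path)
    return sorted(changed)
```

Because Mailman rewrites the index pages whenever new posts are archived, this has to be repeated periodically rather than done once.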

I run these two commands every hour (note: these are not crontab entries, just the commands you need to run from cron):

Also, you will probably need to add such entries to Mailman's default cron file - otherwise: