Integrating Mailman with a Swish-e search engine

Mailman is a very popular mailing list manager.

Unfortunately, one feature Mailman doesn't provide is a way to search its archives. Although Mailman can be integrated with Google search, this method is discouraged: it normally takes several weeks until Googlebot crawls your new posts.

However, Mailman can be easily integrated with existing open-source indexing systems, like Swish-e, which we will document here.


This HOWTO assumes:

  • both Mailman and Swish-e will be installed on the same machine,
  • your Mailman setup/list is already running,
  • Swish-e is already installed (if your system doesn't provide a packaged version of Swish-e, see the Swish-e documentation for more info),
  • this setup reflects Mailman/Swish-e installation on

Apache configuration

Swish-e uses Perl; Mailman uses Python. This means that we probably need to tell Apache how to run Perl .cgi files. Your Apache needs mod_perl. To execute Perl .cgi files, add +ExecCGI to the options of the Mailman directory. Apache also needs to know that it has to parse some or all HTML files for Server Side Includes (later, you will decide whether you want the search form on all Mailman pages, or just on thread.html, subject.html, author.html and date.html):

Options -Indexes +FollowSymLinks +ExecCGI +Includes

### You can comment out XBitHack if you want a search form on all Mailman pages/messages
XBitHack on

###  Uncomment these if you want to have a search form on all pages
# AddHandler server-parsed .html # for Apache 1.3
# AddOutputFilter INCLUDES .html   # for Apache 2.x

See the Apache documentation for more info on configuring Server Side Includes (SSI).

Swish-e configuration

Indexing configuration

  • The first step is to create a working directory for Swish-e. This is where you will keep its configuration and index files.
# mkdir /srv/www/vhosts/

  • Next, create a Swish-e configuration file in that directory with the following contents:

# Index file - this is what Swish will create
IndexFile /srv/www/vhosts/

# Root of our Mailman archives - everything under here will be indexed
IndexDir /srv/www/vhosts/

# We want to index .html files only
IndexOnly .html

# Don't index summary pages: author.html, date.html etc.
FileRules filename is (author\.html|date\.html|index\.html|subject\.html|thread\.html)

# Replace local (physical) path with the web-accessible path
ReplaceRules replace "/srv/www/vhosts/" "pipermail/"

# Store description in search results
IndexContents HTML .html
StoreDescription HTML <pre> 200000

# Look at the title, too
MetaNames swishtitle

FollowSymLinks yes

  • Now we can index our Mailman archive
# swish-e -c /srv/www/vhosts/
Indexing Data Source: "File-System"
Indexing "/srv/www/vhosts/"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 10,755 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
10,755 unique words indexed.
5 properties sorted.
2,283 files indexed.  11,588,318 total bytes.  699,952 total words.
Elapsed time: 00:00:02 CPU time: 00:00:02
Indexing done!
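
To get a feel for which pages the FileRules line in the configuration above excludes, the same alternation can be tried against sample file names with grep. This is only a rough sketch: it assumes Swish-e matches the pattern against the whole file name as a regular expression, the is_skipped helper is made up for this illustration, and 000123.html is just an example of a message page name.

```shell
# Approximate the FileRules exclusion with grep.
# -E enables extended regular expressions, -x anchors the pattern to the whole name.
pattern='(author\.html|date\.html|index\.html|subject\.html|thread\.html)'

# Hypothetical helper: succeeds if the given file name would be skipped.
is_skipped() {
    printf '%s\n' "$1" | grep -Eqx "$pattern"
}

for name in thread.html date.html 000123.html; do
    if is_skipped "$name"; then
        echo "$name: skipped"
    else
        echo "$name: indexed"
    fi
done
```

The two summary pages are reported as skipped, while the message page is reported as indexed.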

Web search configuration

Indexing is done - now, it's time to set up a search on your Mailman pages.

  • First, copy swish.cgi to Mailman's cgi-bin dir:
# cp /usr/lib/swish-e/swish.cgi /srv/www/vhosts/
# chmod 755 /srv/www/vhosts/

  • Then, create a config file for swish.cgi and save it as /srv/www/vhosts/ (I didn't want such an advanced search form, so I've hidden some entries):
return {
    title        => 'Search WPKG mailing lists',
    swish_binary => '/usr/bin/swish-e',
    swish_index  => '/srv/www/vhosts/',

# I wanted to hide some fields I didn't use - compare it with the values in swish.cgi.
# Default values are commented out.

#   secondary_sort  => [qw/swishlastmodified desc/],
    secondary_sort  => [qw/swishtitle/],
#   sorts           => [qw/swishrank swishlastmodified swishtitle swishdocpath/],
    sorts           => [qw/swishrank swishtitle swishdocsize/],
#   metanames       => [qw/ swishdefault swishtitle swishdocpath /],
    metanames       => [qw/ swishdefault swishtitle /],
#   display_props   => [qw/swishlastmodified swishdocsize swishdocpath/],
    display_props   => [qw/swishdocsize/],
};

  • I also didn't want to search by date: my Mailman archive was first created from an mbox file, so the dates of the .html files did not match the dates when the posts were sent to the list. Here's a patch that removes the date search, basically by commenting out the date range fields:
--- swish.cgi.orig      2007-11-25 16:16:39.000000000 +0100
+++ swish.cgi   2007-11-29 23:13:18.000000000 +0100
@@ -1679,14 +1679,14 @@

     # Set the layout:

-    my $string = '<br>Limit to: '
-                 . ( $fields{buttons} ? "$fields{buttons}<br>" : '' )
-                 . ( $fields{date_range_button} || '' )
-                 . ( $fields{date_range_low}
-                     ? " $fields{date_range_low} through $fields{date_range_high}"
-                     : '' );
-    return $string;
+#    my $string = '<br>Limit to: '
+#                 . ( $fields{buttons} ? "$fields{buttons}<br>" : '' )
+#                 . ( $fields{date_range_button} || '' )
+#                 . ( $fields{date_range_low}
+#                     ? " $fields{date_range_low} through $fields{date_range_high}"
+#                     : '' );
+#    return $string;

(Note that if you do not comment that code out and the date options still don't show up on the search page, you may be missing the Date::Calc module required by swish.cgi. You can test this from the command line with perl -e 'require Date::Calc', which should produce no output.)
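
The Date::Calc check can also be wrapped in a small shell function that prints an explicit result instead of relying on silence. This is a minimal sketch; the function name check_perl_module is made up for this example, and it reports a module as missing when perl itself is not installed.

```shell
# Hypothetical helper: report whether a Perl module can be loaded.
# Prints exactly one line either way.
check_perl_module() {
    if perl -e "require $1" 2>/dev/null; then
        echo "$1: available"
    else
        echo "$1: missing"
    fi
}

check_perl_module Date::Calc
```

If the module is reported as missing, installing it through your distribution's packages or CPAN should bring the date fields back.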

  • swish.cgi needs to know where to look for the configuration file - open it with your favourite editor, and change the location of $DEFAULT_CONFIG_FILE:
my $DEFAULT_CONFIG_FILE = '/srv/www/vhosts/';

Integrating the search with Mailman's pages

If the search works, congratulations. Now it's time to integrate the search form with some of Mailman's pages. We will do it with a simple Server Side Include (SSI): <!--#include virtual="/swish_mm.cgi" --> added to Mailman pages. Did you notice swish_mm.cgi here? It is there for a reason.

swish.cgi generates a whole HTML page, complete with <html>, <body> and similar tags. As Mailman's pages already include these tags, we have to make sure swish.cgi does not add them again.

Copy swish.cgi to swish_mm.cgi and make these changes:

--- swish.cgi   2007-11-29 23:19:45.000000000 +0100
+++ swish_mm.cgi        2007-11-29 23:33:07.000000000 +0100
@@ -451,7 +451,7 @@
         # TemplateDefault is the default

         xtemplate => {
-            package     => 'SWISH::TemplateDefault',
+            package     => 'SWISH::TemplateDefault_MM',

         xtemplate => {
@@ -770,7 +770,7 @@

     # load the templating module
-    my $template = $conf->{template} || { package => 'SWISH::TemplateDefault' };
+    my $template = $conf->{template} || { package => 'SWISH::TemplateDefault_MM' };
     load_module( $template->{package} );

  • As you probably noticed, you will also need to edit one more file (make a copy of the original first, just in case):
# cp /usr/lib/swish-e/perl/SWISH/ /usr/lib/swish-e/perl/SWISH/
# cp /usr/lib/swish-e/perl/SWISH/ /usr/lib/swish-e/perl/SWISH/

  • In this patch, we comment out all unnecessary <body>, <html> etc. tags, change the Perl package name to SWISH::TemplateDefault_MM and send the search results to a separate swish.cgi (we can't use swish_mm.cgi for that, as its output doesn't contain the <body>, <html> etc. tags):
---  2005-06-19 00:52:52.000000000 +0200
+++       2007-11-29 23:41:14.000000000 +0100
@@ -2,7 +2,7 @@
 # These routines format the HTML output.
 #    $Id:,v 1.3 2003/05/13 06:11:33 whmoseley Exp $
-package SWISH::TemplateDefault;
+package SWISH::TemplateDefault_MM;
 use strict;

 use CGI;
@@ -63,14 +63,14 @@
                : $results->config('logo') || $default_logo;

     return <<EOF;
-<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
+<!-- <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
-    <body>
+    <body>-->
         $logo$title $message
@@ -124,7 +124,7 @@

     return <<EOF;
-    <form method="get" action="$form" enctype="application/x-www-form-urlencoded" class="form">
+    <form method="get" action="/swish.cgi" enctype="application/x-www-form-urlencoded" class="form">
         <input maxlength="200" value="$query" size="32" type="text" name="query"/>
         <input value="Search!" type="submit" name="submit"/><br>
@@ -337,11 +337,11 @@

-    <small>Powered by <em>Swish-e</em> <a href=""></a></small>
+<!--    <small>Powered by <em>Swish-e</em> <a href=""></a></small>

Mailman configuration

Now it's time to edit Mailman template files so that Mailman pages include a search form. If you just want a search form on thread.html, subject.html, author.html and date.html, you need to add <!--#include virtual="/swish_mm.cgi" --> to three Mailman templates: archidxhead.html, archtoc.html and archtocnombox.html. It is very important that you do NOT edit the templates in MAILMANDIR/templates/en (because you would lose your changes later if you upgraded Mailman). Instead, create a directory at MAILMANDIR/templates/site/en, copy the templates you want to update to this new directory and edit the site files.
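
The copy into the site directory might look roughly like the sketch below. It runs against a scratch directory with dummy template files so that it can be tried safely; on a real system MAILMANDIR would be your actual Mailman installation directory, and the first loop that creates the dummy files would be dropped.

```shell
# A scratch directory stands in for MAILMANDIR (assumption for this sketch).
MAILMANDIR=$(mktemp -d)
mkdir -p "$MAILMANDIR/templates/en"
for t in archidxhead.html archtoc.html archtocnombox.html; do
    echo "<!-- dummy $t -->" > "$MAILMANDIR/templates/en/$t"
done

# Copy the three templates into the site override directory.
mkdir -p "$MAILMANDIR/templates/site/en"
for t in archidxhead.html archtoc.html archtocnombox.html; do
    cp "$MAILMANDIR/templates/en/$t" "$MAILMANDIR/templates/site/en/$t"
done

# The copies under templates/site/en are the ones to edit.
ls "$MAILMANDIR/templates/site/en"
```

The originals in templates/en stay untouched, so a Mailman upgrade can replace them without losing the search form.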

If you use Mailman's default English language, you will find the original files in the templates/en directory of your Mailman installation. The change is simple, as the example below shows:

--- archidxhead.html.orig       2007-11-29 23:51:43.000000000 +0100
+++ archidxhead.html    2007-11-29 00:15:03.000000000 +0100
@@ -8,6 +8,7 @@
   <BODY BGCOLOR="#ffffff">
       <a name="start"></A>
       <h1>%(archive)s Archives by %(archtype)s</h1>
+   <!--#include virtual="/swish_mm.cgi" -->
          <li> <b>Messages sorted by:</b>

If you also want a search form on every archived message page, make a similar change in article.html.

Once you have made the changes to the templates, you MUST restart the Mailman process, since ArchRunner keeps a cache of the templates.

Recreating Mailman's archive

If you already have a list archive, you will need to recreate it to apply all these changes. To do this, you need the mbox file created by Mailman. For example:

# /srv/www/vhosts/ --wipe wpkg-users wpkg-users.mbox

If you executed the above command as root, make sure to restore proper permissions:

# chown -R mailman:mailman /srv/www/vhosts/
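
One way to double-check the ownership afterwards is to list everything under the archive tree that is not owned by the expected user. The sketch below is only an illustration: the list_not_owned_by function is made up, and the demonstration runs on a scratch directory owned by the current user. For a real archive the arguments would be your archive directory and mailman.

```shell
# Hypothetical helper: print every path under a directory
# that is NOT owned by the given user (name or numeric ID).
list_not_owned_by() {
    find "$1" ! -user "$2"
}

# Demonstration on a scratch directory owned by the current user.
demo=$(mktemp -d)
: > "$demo/wpkg-users.mbox"
list_not_owned_by "$demo" "$(id -u)"
echo "check finished"
```

No paths listed before the final message means every file has the expected owner.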

That's it! Now check if search is integrated with your Mailman pages.

Adding crontab entries

You will want to re-index your archive periodically. Also, if you only want to have the search form on the thread.html, subject.html, author.html and date.html pages, you have to set the execute bit on them.

How often you do it will depend on the size of your list and the traffic it gets.

I run these two commands every hour (note: these are not crontab entries, just the commands you need to schedule with cron):

# Crawl the archive
swish-e -c /srv/www/vhosts/ >/dev/null 2>&1

# If you use "XBitHack on", Apache should parse only these files
find /srv/www/vhosts/ -name thread.html -or -name index.html \
     -or -name date.html -or -name subject.html -or -name author.html | xargs chmod 755
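
With XBitHack on, Apache parses only those HTML files that have the user execute bit set, which is why the summary pages get mode 755. The effect can be tried on a scratch tree first. The sketch below assumes a made-up directory layout (2007-November, 000001.html) and uses find's -exec instead of xargs, which behaves the same for these file names.

```shell
# Scratch tree standing in for the archive directory (hypothetical layout).
archive=$(mktemp -d)
mkdir -p "$archive/2007-November"
for f in thread.html index.html date.html subject.html author.html 000001.html; do
    : > "$archive/2007-November/$f"
    chmod 644 "$archive/2007-November/$f"
done

# Set the execute bit on the summary pages only; message pages stay untouched.
find "$archive" -type f \( -name thread.html -o -name index.html \
     -o -name date.html -o -name subject.html -o -name author.html \) \
     -exec chmod 755 {} +

[ -x "$archive/2007-November/thread.html" ] && echo "thread.html: execute bit set"
[ -x "$archive/2007-November/000001.html" ] || echo "000001.html: execute bit not set"
```

Only the summary pages end up executable, so only they are parsed for the search form include.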

Also, you will probably need to add such entries to Mailman's default cron file - otherwise:


Without HOME, it didn't work with my cron.