REL="nofollow" and Mailman

From lxadm | Linux administration tips, tutorials, HOWTOs and articles

Has your Mailman mailing list ever been devastated by spambots? If so, consider adding REL="nofollow" attributes to the links in all postings stored in your list archives, so that spammers gain nothing from the links they manage to post.

Usage:

  • copy the script to your server
  • edit $searchpath so that it points to Mailman's "archives" directory
  • if you want to whitelist any domains, add them to @excludedomains
  • run the script by hand once - it prints the name of every file it converts
  • add a cronjob in /var/spool/mailman:
# add nofollow
48 * * * * perl /path/to/this/script/add_nofollow.pl >/dev/null 2>&1

That's it! The script is Mailman-specific (it only looks for links written as <A HREF="http..."> in numbered message files), but it can easily be modified to add REL="nofollow" attributes to other HTML files, too.
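As a rough illustration of such a modification, the sketch below handles links in either case and with attributes in any order. The helper names add_nofollow and fix_tag and the sample host names are hypothetical; they are not part of the script on this page. It is still regex-based, so it assumes reasonably regular, double-quoted HTML.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: decide what to do with a single <a ...> tag.
sub fix_tag {
    my ($tag, @whitelist) = @_;

    # Leave tags alone that already carry a rel attribute
    return $tag if $tag =~ /\srel\s*=/i;

    # Only absolute http(s) links are of interest; capture the host name
    return $tag unless $tag =~ m{\shref\s*=\s*"https?://([^/":?#]+)}i;
    my $host = lc $1;

    # Whitelisted domains (and their subdomains) stay untouched
    for my $domain (@whitelist) {
        return $tag if $host eq $domain || $host =~ /\.\Q$domain\E\z/;
    }

    $tag =~ s/>\z/ rel="nofollow">/;
    return $tag;
}

# Hypothetical helper: rewrite every <a ...> tag in a chunk of HTML.
sub add_nofollow {
    my ($html, @whitelist) = @_;
    $html =~ s{(<a\s[^>]*>)}{ fix_tag($1, @whitelist) }gie;
    return $html;
}

my $html = '<a href="http://spam.example.net/x">s</a> '
         . '<A HREF="https://lists.example.org/">ok</A> '
         . '<a href="/local">l</a>';

# Only the first link gets rel="nofollow"; the whitelisted and the
# relative link are left as they are.
print add_nofollow($html, 'example.org'), "\n";
```

Running the result through add_nofollow a second time changes nothing, because tags that already have a rel attribute are skipped. For arbitrary, less regular HTML a real HTML parser would likely be a safer choice than regular expressions.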


add_nofollow.pl:

#!/usr/bin/perl

# This script adds REL="nofollow" attributes to links in Mailman's HTML files

use strict;
use warnings;
use File::Copy;

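# This is the directory containing the HTML files to which we want to add
# REL="nofollow" (Mailman's "archives" directory)
my $searchpath = "/srv/www/example.com/mailman/archives/private";

# Domains we want to exclude (whitelist) - don't add REL="nofollow" there.
# Each entry is a regex that is matched right after HREF=". Single quotes
# keep the backslashes intact, and the pattern is tied to the host name so
# that a URL merely containing "example.org" elsewhere is not whitelisted.
my @excludedomains = (
    'https?://(?:[^/"]*\.)?example\.org(?=[/":])',
    'https?://(?:[^/"]*\.)?example\.com(?=[/":])',
);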
my $excludedomain;

my $htmlfiles; # all HTML files we find
my @content;   # content of a single HTML file

# Numbered message pages only (e.g. 000123.html), not the index pages
$htmlfiles = `find "$searchpath" -type f -name "[0-9]*.html"`;

my @htmlfiles = split /\n/, $htmlfiles;
my $htmlfile;
foreach $htmlfile (@htmlfiles) {

    # Read the whole file; skip it (instead of aborting) if it cannot be opened
    open(my $in, '<', $htmlfile) or do {
        warn "Cannot read $htmlfile: $!\n";
        next;
    };
    @content = <$in>;
    close($in);

    my $newhtml = '';   # new HTML contents
    my $oldhtml = '';   # old HTML contents
    my $line;           # each line of an HTML file
    foreach $line (@content) {
        $oldhtml .= $line;

        # This is where we do all replacing
        if ( $line =~ m/<A HREF="http/i ) {
            # Remove any REL="nofollow" so that it is never added twice
            $line =~ s/(<A HREF="http)([^"]*")(\sREL="nofollow")/$1$2/gi;

            # Add REL="nofollow"
            $line =~ s/(<A HREF="http)([^"]*")/$1$2 REL="nofollow"/gi;

            # Remove REL="nofollow" from excluded domains
            foreach $excludedomain (@excludedomains) {
                $line =~ s/(<A HREF="($excludedomain))([^"]*")(\sREL="nofollow")/$1$3/gi;
            }
        }
        $newhtml .= $line;
    }

    # If these variables differ, it means we changed REL="nofollow" - commit it to a file
    if ( $oldhtml ne $newhtml ) {
        # Write to a temporary file first, then replace the original
        open(my $out, '>', "$htmlfile.tmp") or do {
            warn "Cannot write $htmlfile.tmp: $!\n";
            next;
        };
        print $out $newhtml;
        close($out);
        if ( move("$htmlfile.tmp", $htmlfile) ) {
            print "Added REL=\"nofollow\": $htmlfile\n";
        } else {
            warn "Cannot replace $htmlfile: $!\n";
        }
    }
}
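
To see what the three substitutions do, they can be tried on a single sample line. The sketch below wraps them in a hypothetical helper, rewrite_line, and compares two sample whitelist entries; the host names and both entries are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: the script's three substitutions applied to one line.
sub rewrite_line {
    my ($line, @excludedomains) = @_;
    $line =~ s/(<A HREF="http)([^"]*")(\sREL="nofollow")/$1$2/gi;   # 1. strip old REL
    $line =~ s/(<A HREF="http)([^"]*")/$1$2 REL="nofollow"/gi;      # 2. add REL everywhere
    foreach my $excludedomain (@excludedomains) {                   # 3. undo for whitelist
        $line =~ s/(<A HREF="($excludedomain))([^"]*")(\sREL="nofollow")/$1$3/gi;
    }
    return $line;
}

# Sample whitelist entries (assumptions): one tied to the host name, one loose
my $strict = 'https?://(?:[^/"]*\.)?example\.org(?=[/":])';
my $loose  = '.*?example\.org';

my $line = '<A HREF="http://spam.test/?x=example.org">a</A> '
         . '<A HREF="http://www.example.org/list">b</A>' . "\n";

# Strict entry: only the real example.org link loses REL="nofollow"
print rewrite_line($line, $strict);

# Loose entry: the spam link is whitelisted too, because its query
# string happens to contain "example.org"
print rewrite_line($line, $loose);
```

Because step 1 strips any existing attribute before step 2 adds it again, feeding the output back into rewrite_line returns the same text, which is why the hourly cron run rewrites only files that have actually changed.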