JunkEmail Education Project

Getting Started

(updated April 28 2002) This page describes what you need to operate procmail, how it works, and supplies some information on the structure of e-mails, and on regular expressions.

How procmail works
E-mail structure
Regular expressions

I will assume that, at least on the local machine, you have available (and know how to use) Sendmail, Pine, Lynx, Perl, and any shell except "Csh." If you use mail handling utilities other than Pine or Sendmail, I am sure appropriate changes can be made to the procmail script.

How procmail Works

The first thing you need to do is to install a '.forward' file in your home directory with directions for starting up procmail. The '.forward' file should read as follows, exactly, including the quotes and spaces.
                "|IFS=' '&&exec /usr/bin/procmail -f-"
On my local machine I do not need this, since procmail is the local mail delivery agent (MDA), so all incoming e-mail is handed off by Sendmail to procmail anyway -- which is exactly what the '.forward' file is meant to achieve.
In either case, procmail looks for a '.procmailrc' file in your home directory to figure out what to do with the e-mail it has just been handed.
It is important to understand that the '.procmailrc' does not simply set conditions for the program procmail (as would be expected of a true 'rc' file), but is a script which is followed line by line for every e-mail which is handed to procmail and which is abandoned as soon as an e-mail is considered delivered.
Environmental Variables

At the top of the '.procmailrc' file you need to set a few environmental variables (others can be set elsewhere in the file, as you need them).
	SHELL=/bin/sh
	PATH=$HOME:/usr/local/bin:/usr/bin:/bin
	MAILDIR=$HOME/mail
	DEFAULT=/var/spool/mail/jno
	LOGFILE=$HOME/maillog.`date +%y-%m-%d`
	VERBOSE=no
	LOG="
	"
Here is a list of the less obvious:

As you can see from the use of "$HOME", procmail inherits some global environmental variables.
"MAILDIR" as shown above is the directory Pine uses. Other mail readers use different names. When you send e-mail to a file by supplying just a name, procmail will assume the file is a 'folder' located in "$MAILDIR", and will create it if need be.
"DEFAULT" is the mail spooler where Pine expects to see its 'INBOX'. You could as well set this to one of your mail folders.
The "LOGFILE" format shown above places the logfile in your home directory, and appends the date to the filename, so that there is a new logfile every day. But you can use any name or location.
"VERBOSE" defines the expansiveness of the error messages which are generated in your log file. Set it to 'yes' when you are testing the '.procmailrc' script.
The "LOG=" could say anything. As shown above it generates a blank line between log entries. Whatever you have will be entered into the logfile with each e-mail (at the point where the "LOG=" is encountered in the script).

At any point you can also set other environmental variables. If for example you wanted to know the size of each e-mail, you could specify..
	SIZE=`wc -c`
..which simply runs the Unix 'wc' (word count) program against the current e-mail, and writes the result (with the '-c' option, the number of bytes) to SIZE, which can then be used in any following recipes as ${SIZE} to write this same information out in a header or as e-mail text.
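As a rough sketch of what that assignment computes, the same command can be tried in a plain shell. The sample message and the example.com address are made up for illustration; inside '.procmailrc' the e-mail arrives on standard input rather than from a file.

```shell
# Build a tiny stand-in message (hypothetical content, not a real e-mail).
msg=$(mktemp)
printf 'From: a@example.com\nSubject: hi\n\nbody\n' > "$msg"

# wc -c counts bytes; some systems pad the number with spaces, so strip them.
SIZE=$(wc -c < "$msg" | tr -d ' ')
echo "message size: ${SIZE} bytes"

rm -f "$msg"
```

The count covers headers and body together, since the whole e-mail is fed to 'wc'.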
Rest of the .procmailrc script

The remainder of the '.procmailrc' file will be a series of recipes, each consisting of..
First, a starting line, which always starts with ':0' followed by a series of letters (flags) designating what is to be done with the e-mail at that point. For example, "c" denotes "make a (c)opy" -- the recipe works on a carbon copy, and the original e-mail carries on to the next recipe even if this one delivers. A "w" tells procmail to "(w)ait for completion" of the program in the action line, and to check its exit code, before going on to the next recipe. There are a dozen others, see "man procmailrc."
Secondly, a set of zero or more conditions which are the tests for completing the third item, the 'action' line. Test conditions start with an '*' as a label, followed by a regular expression to be matched. (See "man 7 regex" for the definition of POSIX regexes.) For example,
	* ^To:.*me
.. will test true for any e-mail containing a header line starting (^) with 'To:' followed by any (.*) characters whatsoever, until the phrase 'me' is found.
If any condition tests False, the recipe will be abandoned and the script will continue to the start of the next recipe.
As long as the * conditions test True, the recipe will continue, and reach the third part: an action to be performed. The action can be:

send to an e-mail address (! you at domain.com),
write to a mail folder (importantmail),
deliver to the mail spool file ($DEFAULT),
pipe to another program (| $HOME/program),
delete entirely (/dev/null),
non-delivery, ({ }) -- an empty block which does nothing, so that processing continues with the next recipe.
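The three parts of a recipe, and each kind of action listed above, might look as follows. This is only a sketch: the conditions, the example.com address, and the program name are placeholders, and the recipes are written to a scratch file through a here-document so that nothing touches a live '.procmailrc'. The ':0:' form (a trailing colon, asking for a lockfile) is not covered in the text above; it is added here because it is the usual precaution when writing to a mail folder.

```shell
# Write the sample recipes to a scratch file for inspection.
rc=$(mktemp)
cat > "$rc" <<'EOF'
# Forward: '!' followed by an address (placeholder address).
:0
* ^Subject:.*urgent
! you@example.com

# Folder: a bare name is a file in $MAILDIR; ':0:' requests a lockfile.
:0:
* ^From:.*boss
importantmail

# Pipe: hand the e-mail to a program (hypothetical program name).
:0
* ^Subject:.*report
| $HOME/program

# Delete entirely.
:0
* ^X-Loop: digest
/dev/null

# Non-delivery: the empty block does nothing, so processing continues.
:0
* ^Precedence: junk
{ }

# No conditions at all: whatever got this far goes to the spool file.
:0:
$DEFAULT
EOF

recipes=$(grep -c '^:0' "$rc")
echo "wrote $recipes recipes to $rc"
```

Because an e-mail is abandoned as soon as it is delivered, the order of the recipes matters: the catch-all at the bottom only sees what the earlier recipes did not deliver.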

Delivery to a program includes the possibility of delivering (piping) to formail. Formail might be used to alter the e-mail by adding or rewriting headers, or rewriting the e-mail body.
Formail also uses a number of leading option letters, which are significantly different from those used by procmail and which can add to the confusion of the scripting. For example..
	| (formail -r -A "X-SpamFlag: Warning")
.. generates a (r)eply (which means it writes a 'To:' header based on the content of the original 'From:' or 'Reply-To:' header, and adds a new 'From:' header), and (A)dds the header 'X-SpamFlag'. In this case the destination line would need to be completed by further piping the formail action to Sendmail. All the information is available in "man formail".
Creating Headers with formail

I almost always add a few headers to returned e-mail. These are (A)dded or (I)nserted by piping an e-mail through formail, using (as an example) the following form...
	| formail \
		-A "X-Loop: digest" \
		-I "Precedence: junk" \
		-I "Reply-To: help at domain.zone" \
		-I "From: digest at domain.zone" \
		-I "To: digest at domain.zone" \
		-I "Subject: Weekly Digest"
If (A)dded, a header might be duplicated. If (I)nserted, it will replace the previous header.
Some common added or inserted headers (in use by me)...
"X-Loop:" -- which could hold a phrase, or some reference to the script doing the sending, or the location which is generating the e-mail, but is mainly used to prevent local mail looping by suppressing any incoming email with this header.
"X-Bogus-To:" -- which shows the original "To:" label. The original "To:" label can be obtained with formail, as ..
	BOGUS=`formail -x"To:"`
This could be (I)nserted with..
	-I "X-Bogus-To: ${BOGUS}"
"From:" or "Reply-To:" -- used on occasion to overwrite the "from" labels which are auto-generated on a reply or forward, like
        -I "From: Procmail Daemon <errors at domain.zone>"
"Bcc: my_account at localhost" -- used to send copies of some outgoing e-mail to myself.
"Precedence: junk" -- a standard way to signify that the email is not important.
You are free to add any "X-anything" labels you want. Microsoft has added an "X-Message-Flag:" label, whose contents show up in bold at the top of the Microsoft Outlook e-mail reader. I often add,
	"X-Message-Flag: Are you not worried about Outlook viruses?"
E-mail Headers

Let me present 'headers' in the context of the overall structure of an e-mail. E-mails are made up of three or four parts..
1: An envelope: Or many envelopes.

Every machine involved in the transfer of e-mail adds another envelope, in the form of a "Received:" line. These lines constitute the top of an e-mail if you ask to have the headers expanded in Pine ("h").
The series of envelopes (some might be missing or faked) will show the real "to" to which delivery is made. Together these will show the path of transmission of the e-mail from origin to final destination.
The outermost (most recent) "envelope" comes first; the first (oldest) "envelope" comes last. Here is a typical example:
Received: from pop.Outflux.net
	by localhost with POP3 (fetchmail-5.1.0) for jno at localhost
	(single-drop); Mon, 01 Apr 2002 15:41:34 -0600 (CST)
Received: from hotmail.com (f152.law11.hotmail.com [64.4.17.152]) by
	mailhost.Outflux.net (/) with ESMTP id g31L3jw27286 for
	<meathead at domain.zone>; Mon, 1 Apr 2002 15:03:45 -0600
Received: from 63.62.195.78 by lw11fd.law11.hotmail.msn.com with HTTP;
	Mon, 01 Apr 2002 21:03:43 GMT
The envelopes may not be as neatly formatted as shown above. From the top, in this case, you can read the following...
The message was fetched by "fetchmail" and sent to a local account. This is the action on the 'home' computer. "Fetchmail" got the e-mail from the pop account at Outflux.net.
Outflux.net in turn received the e-mail from Hotmail.com, from one of their machines known as "f152.law11".
Hotmail in turn received this email with one of their other machines known as "lw11fd.law11". Hotmail also notes the IP address of the source and notes the fact that the e-mail was submitted from a browser ("with HTTP"). Note too that Hotmail is on GMT. We are on CST.
Tracking the time (and converting for the time zones), it looks like the e-mail was delivered in 2 seconds, but I did not fetch it until 38 minutes later. Mea culpa.
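The arithmetic behind those two figures can be checked with a short shell sketch. The helper name 'to_utc_seconds' is made up for this illustration, and the sketch assumes all three time stamps fall on the same UTC day (true for the example above), so only the time of day and the zone offset are used.

```shell
# to_utc_seconds HH:MM:SS OFFSET   (OFFSET as in the header: -0600, +0000)
# Prints the time of day as seconds since midnight UTC.
to_utc_seconds() {
    echo "$1 $2" | awk '{
        split($1, t, ":")
        sign = (substr($2, 1, 1) == "-") ? -1 : 1
        off = sign * (substr($2, 2, 2) * 3600 + substr($2, 4, 2) * 60)
        # Local time minus the zone offset gives UTC.
        print (t[1] * 3600 + t[2] * 60 + t[3]) - off
    }'
}

hotmail=$(to_utc_seconds 21:03:43 +0000)   # accepted by Hotmail (GMT)
outflux=$(to_utc_seconds 15:03:45 -0600)   # received at Outflux.net (CST)
fetched=$(to_utc_seconds 15:41:34 -0600)   # fetched to the local machine

transit=$((outflux - hotmail))
waiting=$((fetched - outflux))
echo "transit: ${transit}s, waiting: $((waiting / 60))m $((waiting % 60))s"
```

The waiting time comes out at just under 38 minutes, which is where the rounded figure in the text comes from.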
2: The headers proper.

Most of these stay the same, although any mail handling program can add or delete some of them.
There are over 50 'headers' in use. They all start on a separate line with something useful like "To" followed immediately by a colon. So expect "To:" and "From:" and "Subject:" -- these are more or less consistent, but, ahum, need not be there.
A typical set of headers, and notes, below...
	X-Originating-IP: [63.62.195.78]
	From: "Pedro Velez" <pedroace at hotmail.com>
	To: meathead at domain.zone
	Subject: FGA #1
	Date: Mon, 01 Apr 2002 15:03:43 -0600
	Mime-Version: 1.0
	Content-Type: text/plain; format=flowed
	Message-ID: <F152HnOZK5QJ3smv4Jf00008e78 at hotmail.com>
	X-OriginalArrivalTime: 01 Apr 2002 21:03:44.0568 (UTC)
    		FILETIME=[B588E380:01C1D9C0]
	X-Spam-Status: No, hits=0.6 required=5.0 tests=NO_REAL_NAME 
		version=2.11
The "X-Originating-IP" and "X-OriginalArrivalTime" were added by Hotmail. I have no idea why; it must be some sort of tracking information.
Almost all e-mail will also contain a "Message-ID", which usually consists of a unique number. Supposedly it is for tracking purposes, but it is only of use on listservs and on UseNet.
The "From: Pedro etc .." header will usually be there, but may be missing, and can be forged.
The "To:", "Subject:", and "Date:" are what they seem to be, and you could also expect "Cc:" on occasion. If your name was on a "Bcc" list, the initial mail handling program would have deleted it and written individual envelopes instead. The "To:" line could thus show an address completely different from yours. Or "To:" could be blank.
The "Mime-Version:" and "Content-Type:" were probably added by an e-mail composer program, and help the e-mail reader program figure out how to present the message locally.
Similarly something like "X-Mailer:" might be added by AOL, or another e-mail program. "X-Spam-Status:" was added by SpamAssassin (in this case, a local email pre-processor which checks for spam).
Often you can figure out what e-mail program and operating system the sender is using, for some e-mail programs can't help themselves and have to write headers like..
   User-Agent: Microsoft-Outlook-Express-Macintosh-Edition/5.02.2022
You will also find "Reply-To:" headers, and most e-mail programs will insist on sending replies to the content of "Reply-To:" rather than the content of the "From:" header.
Obviously, you could add even more headers (certainly easy to do with formail).. But that is not why you are reading this. What you want to know is which headers you can expect. Answer: Anybody's guess. Most certainly the "Date:" header, but that can be (and often is) forged, and on Macs it is often wrong by decades. The headers are simply not used by the machinery which sends, transports, and receives e-mail. They look only at the envelopes.
Hopefully you can learn things from the headers, but not if they are forged (trivial to do), and if you are dealing with spam, even the envelopes can be forged. So now what? Well, do the best you can, even if that is not good enough.
3: the body or message of the e-mail

You will be astounded to find out that even the textual contents of the email get rewritten by your e-mail reader -- to fold lines properly for the screen size you are using, to translate some other character set to whatever you are using, to decompose HTML formats, to decode encoded texts.. Can't win.
4: the "attachments"

If there are any, these could be image files, other email text, HTML text by Outlook, sound files, PDFs, or viruses and worms. In the raw e-mail they may be composed entirely of unreadable garbage, which will go on for pages and pages, separated by "Boundary" lines (the "Boundary" is identified in the "Content-Type:" header), but this raw form will not show up in your email reader under any conditions. This is "MIME" stuff.
Learn more about the headers by looking up RFC 822 (and a number of following RFCs), and check out D. J. Bernstein's website.

Regular Expressions

Procmail reads through the e-mail header (unless the body is specified), one line at a time. At each line procmail stops to apply the regular expression of the current 'condition' to the e-mail header line which is being inspected. So if the current condition reads..
	* ^To:.*Outflux.Net
..it will return 'true' if any header line starts (^) with 'To:' followed by anything whatsoever (.*) up to 'Outflux.Net'. The test is case insensitive, so it would also return 'true' if 'outflux.net' is found. (Strictly, the unescaped '.' in 'Outflux.Net' matches any single character, not just a period.)
A 'line' is not what you see when you inspect your e-mail with Pine, but it is all the characters between the last 'end-of-line' marker (or the beginning of the e-mail) and the next 'end-of-line' marker. (This is a working description of what happens, and will do.)
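The 'To:' condition above can be tried outside procmail. In this sketch 'grep -E -i' stands in for procmail's matcher, which is an assumption (the two are similar but not identical), and the header lines are made up.

```shell
# The regular expression from the condition, without the leading '*'.
pattern='^To:.*Outflux.Net'

# Prints yes or no for one header line.
match() {
    if printf '%s\n' "$1" | grep -E -i -q "$pattern"; then
        echo yes
    else
        echo no
    fi
}

lower=$(match 'To: someone at outflux.net')   # case differs, still matches
cc=$(match 'Cc: someone at outflux.net')      # line does not start with To:
loose=$(match 'To: someone at outfluxXnet')   # unescaped '.' accepts the 'X'
echo "lower=$lower cc=$cc loose=$loose"
```

The last case is the reason to write 'Outflux\.Net' when only a real period should be accepted.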
Negated conditions can be specified with the bang symbol. For example..
	* ! ^From:.*Fred 
..will test 'true' if the word 'Fred' could not (!) be found on the line starting (^) with 'From:' -- but would also test 'true' if the 'From:' line were missing.
If the result is 'true' the next condition is tested, and eventually the 'action' is executed. As soon as a condition fails, procmail abandons that particular recipe, and continues on to the next recipe in the .procmailrc file. A series of conditions is thus logically ANDed -- all the conditions have to be met to reach the 'action' line.
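OR'ed conditions can be specified on a single condition line by separating words with the pipe symbol. Put the alternatives in parentheses, otherwise the '|' splits the whole expression and only the first alternative stays tied to 'To:'. For example..
	* ^To:.*(me|you|us|them)
..will match true on an e-mail which includes any of me, you, us, or them, on a line starting (^) with 'To:'.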
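How parentheses change the reach of the '|' can be checked with 'grep -E -i', used here as a stand-in for procmail's matcher (an assumption; the header line is made up). Without parentheses each alternative is a complete pattern on its own, so 'you' matches anywhere in any header line.

```shell
ungrouped='^To:.*me|you|us|them'
grouped='^To:.*(me|you|us|them)'

# hit PATTERN LINE: prints yes or no.
hit() {
    if printf '%s\n' "$2" | grep -E -i -q "$1"; then
        echo yes
    else
        echo no
    fi
}

line='Subject: thank you'           # not a To: line at all
u=$(hit "$ungrouped" "$line")       # matches: 'you' stands alone
g=$(hit "$grouped" "$line")         # does not match: tied to ^To:
echo "ungrouped=$u grouped=$g"
```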
A series of OR conditions can also be constructed by negating an ANDed series of negated conditions; in Boolean: a+b+c = (a'.b'.c')', as for example ..
	:0
	* ! ^To:.*me
	* ! ^To:.*you
	* ! ^To:.*us
	* ! ^To:.*them
	{ }
	:0 E
	action
Here procmail will drop out of the series of conditions as soon as one is not true (meaning there is at least one of 'me ... them'), and execute the next recipe, where the 'E' flag means (E)lse.
If all of the conditions are true (meaning there are none of 'me ... them') the non-action ({ }) is executed, and the bottom line is not executed.
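The logic of that pair of recipes can be traced in a small shell sketch. The function name 'decide' and the sample addresses are made up, and 'grep -E -i' again stands in for procmail's matcher.

```shell
# Prints "action" when at least one name is found on the To: line,
# "nothing" otherwise, following the two recipes step by step.
decide() {
    for name in me you us them; do
        # A negated condition fails as soon as its pattern IS found...
        if printf '%s\n' "$1" | grep -E -i -q "^To:.*$name"; then
            # ...so the first recipe is abandoned and the 'E' recipe runs.
            echo action
            return
        fi
    done
    # Every negated condition held: { } was executed and 'E' is skipped.
    echo nothing
}

a=$(decide 'To: us at domain.zone')
b=$(decide 'To: nobody at domain.zone')
echo "a=$a b=$b"
```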
The following are most of the regular expression meta characters not mentioned above. A few more convolutions may be found on the "man procmail", "man procmailrc", and "man procmailex" pages.

the sign '.' stands for 'any single character whatsoever, including a blank, but not an end-of-line marker'
the sign '*' modifies whatever comes before it (like . or any single character, or a group of characters in parentheses) to mean 'zero or more of these'
the sign '+' means 'one or more of..'
the sign '?' means 'zero or one of..'
the sign '|' means 'or' as in 'this|that'.
the sign '^' means 'at the beginning' of a line.
the sign '$' means 'at the end', of a line.
the sign '\' means 'escape', that is, treat the following single character literally rather than as a regular expression meta character.

From the last it is thus obvious that if you were trying to match a line which contained a dollar sign (for example 'costs $14.99') you would want to escape the '$' sign..
	* costs \$14\.99
.. and you would also want to escape the '.' above because it is also one of the regular expression meta characters.
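The escaped pattern can be tried with 'grep -E' (a stand-in for procmail's matcher, which is an assumption; the sample text is made up). The pattern is kept in single quotes so that the shell leaves the backslashes alone.

```shell
pattern='costs \$14\.99'

# Prints yes or no for one line of text.
hit() {
    if printf '%s\n' "$1" | grep -E -q "$pattern"; then
        echo yes
    else
        echo no
    fi
}

literal=$(hit 'It costs $14.99 today')   # the exact text is found
other=$(hit 'It costs $14x99 today')     # escaped '.' no longer accepts 'x'
echo "literal=$literal other=$other"
```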
The character-eating behaviour of the meta characters is not greedy, unlike in Perl: a search for a match stops at the first opportunity, rather than the last. Supposedly the procmail regex matches that of the Unix egrep command, but that is not entirely true.
Testing procmail

Testing is easy; you will not have to find remote sites to e-mail to and from. Put your Sendmail Daemon to sleep, or delay making e-mail connections (if you connect via a modem, unplug it). The point of this is to be able to inspect e-mails you create before they zip off into cyberspace.
To pipe a sample e-mail into procmail you will first of all need to make a sample e-mail. Do that with Pine. Select any short e-mail and export it to your home directory (in expanded header format) as the file 'foo' -- or whatever name you like.
The file 'foo' can be edited with Pico; you can remove headers, or add others, or change the information any way to accomplish whatever testing you are doing. To hand the test e-mail to procmail requires only
	'cat foo | procmail' 
Once piped to procmail, you can inspect the spooler directory /var/spool/mqueue where sendmail collects e-mail for later transport. Sendmail will store the header and body separately.
Go to /var/spool/mqueue and start up the Lynx browser locally ('lynx .'). You can inspect the e-mail headers (and the body), and delete them if you like.
The header section will be in some sort of sendmail pre-processing format, but it is mostly readable. Sendmail might add additional headers in the process of delivery, but at least the headers you see will remain.
The advantage of using Lynx is that you can move with a few clicks to the spooler directory where your Pine INBOX is located, and look at e-mail there also. This will be the file
/var/spool/mail/{your_account}.

All your incoming e-mail is appended to this file, with a line starting with the word "From" (without the colon) used to designate the start of another e-mail.
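Since every e-mail in the spool file starts with such a "From" line, counting those lines gives the number of e-mails. A sketch on a made-up two-message file (the addresses and text are placeholders; on a real system the file would be the spool file named above):

```shell
# Build a small stand-in spool file with two e-mails.
mbox=$(mktemp)
cat > "$mbox" <<'EOF'
From alice@example.com Mon Apr  1 15:41:34 2002
From: alice@example.com
Subject: first

Hello.

From bob@example.com Mon Apr  1 16:02:10 2002
From: bob@example.com
Subject: second

Hi again.
EOF

# '^From ' with a trailing space matches the separator lines only;
# the 'From:' headers have a colon where the space would be.
count=$(grep -c '^From ' "$mbox")
echo "e-mails in file: $count"

rm -f "$mbox"
```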
With Lynx you can now proceed with a few clicks to your home directory, and edit the .procmailrc file (if Lynx is set up to show dot files).
Simple? Even simpler when you consider that Pine or Pico can be started up as a shell ('!') from Lynx. Just keep track of where you are.
If you start getting broken e-mails during testing (like e-mails with missing headers) just delete (with Lynx, edit, and Pico) all of this spooler file after the message part of the first pseudo e-mail (which is generated by Pine).

ISP: Counterpoint Networking,
Website Provider: Outflux.net, www.Outflux.net
URL:http://jnocook.net/start.htm
