392 lines
18 KiB
HTML
392 lines
18 KiB
HTML
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
|
||
|
"http://www.w3.org/TR/REC-html40/loose.dtd">
|
||
|
<html>
|
||
|
<head>
|
||
|
<meta http-equiv="Content-Type" content="text/html">
|
||
|
<title>rfc822 - RFC 822 parsing library</title>
|
||
|
<!-- $Id$ -->
|
||
|
<!-- Copyright 2000 Double Precision, Inc. See COPYING for -->
|
||
|
<!-- distribution information. -->
|
||
|
<!-- SECTION 3X -->
|
||
|
</head>
|
||
|
|
||
|
<body text="#000000" bgcolor="#FFFFFF" link="#0000EE" vlink="#551A8B"
|
||
|
alink="#FF0000">
|
||
|
<h1>rfc822 - RFC 822 parsing library</h1>
|
||
|
|
||
|
<h2>SYNOPSIS</h2>
|
||
|
|
||
|
<p><code>#include <rfc822.h></code></p>
|
||
|
|
||
|
<p><code>#include <rfc2047.h></code></p>
|
||
|
|
||
|
<p><code>cc ... -lrfc822</code></p>
|
||
|
|
||
|
<h2>DESCRIPTION</h2>
|
||
|
|
||
|
<p>The rfc822 library provides functions for parsing E-mail headers in the RFC
|
||
|
822 format. This library also includes some functions to help with encoding
|
||
|
and decoding 8-bit text, as defined by RFC 2047.</p>
|
||
|
|
||
|
<p>The format used by E-mail headers to encode sender and recipient
|
||
|
information is defined by RFC 822. The format allows the actual E-mail
|
||
|
address and the sender/recipient name to be expressed together, for example:
|
||
|
<code>John Smith <jsmith@example.com></code></p>
|
||
|
|
||
|
<p>The main purposes of the rfc822 library is to:</p>
|
||
|
|
||
|
<p>1) Parse a text string containing a list of RFC 822-formatted address into
|
||
|
its logical components: names and E-mail addresses.</p>
|
||
|
|
||
|
<p>2) Access those individual components.</p>
|
||
|
|
||
|
<p>3) Allow some limited modifications of the parsed structure, and then
|
||
|
convert it back into a text string.</p>
|
||
|
|
||
|
<h3>Tokenizing an E-mail header</h3>
|
||
|
<pre>struct rfc822t *tokens=rfc822t_alloc(const char *header,
|
||
|
void (*err_func)(const char *, int));
|
||
|
|
||
|
void rfc822t_free(tokens);</pre>
|
||
|
|
||
|
<p>The rfc822t_alloc() function accepts an E-mail <i>header</i>, and parses it
|
||
|
into individual tokens. This function allocates and returns a pointer to a
|
||
|
<i>rfc822t</i> structure, which is later used by the <i>rfc822a_alloc</i>
|
||
|
function to extract individual addresses from these tokens.</p>
|
||
|
|
||
|
<p>If <i>err_func</i> argument, if not NULL, is a pointer to a callback
|
||
|
function. The function is called in the event that the E-mail header is
|
||
|
corrupted to the point that it cannot even be parsed. This is a rare
|
||
|
instances -- most forms of corruption are still valid at least on the lexical
|
||
|
level. The only time this error is reported is in the event of mismatched
|
||
|
parenthesis, angle brackets, or quotes. The callback function receives the
|
||
|
<i>header</i> pointer, and an index to the syntax error in the header
|
||
|
string.</p>
|
||
|
|
||
|
<p>The semantics of <i>err_func</i> are subject to change. It is recommended
|
||
|
to leave this argument as NULL in the current version of the library.</p>
|
||
|
|
||
|
<p>rfc822t_alloc() returns a pointer to a dynamically-allocated <i>rfc822t</i>
|
||
|
structure. A NULL pointer is returned if there's insufficient memory to
|
||
|
allocate this structure. The rfc822t_free() function is used to destroy the
|
||
|
<i>rfc822t</i> structure and to free all the dynamically allocated memory.</p>
|
||
|
|
||
|
<p>NOTE: Until rfc822t_free() is called, the contents of <i>header</i> MUST
|
||
|
NOT be destroyed or altered in any way. The contents of <i>header</i> are not
|
||
|
modified by rfc822t_alloc(), however the <i>rfc822t</i> structure contains
|
||
|
pointers to portions of the supplied <i>header</i>.</p>
|
||
|
|
||
|
<h3>Extracting E-mail addresses</h3>
|
||
|
<pre>struct rfc822a *addrs=rfc822a_alloc(struct rfc822t *tokens);
|
||
|
|
||
|
void rfc822a_free(addrs);</pre>
|
||
|
|
||
|
<p>The rfc822a_alloc() function returns a dynamically-allocated <i>rfc822a</i>
|
||
|
structure, that contains individual addresses that were logically extracted
|
||
|
from a <i>rfc822t</i> structure. The rfc822a_alloc() function returns NULL if
|
||
|
there was insufficient memory to allocate the <i>rfc822a</i> structure. The
|
||
|
rfc822a_free() function destroys the <i>rfc822a</i> function, and frees all
|
||
|
associated dynamically-allocated memory. The <i>rfc822t</i> structure passed
|
||
|
to rfc822a_alloc() must not be destroyed before rfc822a_free() destroys the
|
||
|
<i>rfc822a</i> structure.</p>
|
||
|
|
||
|
<p>The <i>rfc822a</i> structure has the following fields:</p>
|
||
|
<pre>struct rfc822a {
|
||
|
struct rfc822addr *addrs;
|
||
|
int naddrs;
|
||
|
} ;</pre>
|
||
|
|
||
|
<p>The <i>naddrs</i> field gives the number of <i>rfc822addr</i> structures
|
||
|
that are pointed to by <i>addrs</i>, which is an array. Each <i>rfc822addr</i>
|
||
|
structure represents either an address found in the original E-mail header,
|
||
|
<i>or the contents of some legacy "syntactical sugar"</i>. For example, the
|
||
|
following is a valid E-mail header:</p>
|
||
|
<pre>To: recipient-list: tom@example.com, john@example.com;</pre>
|
||
|
|
||
|
<p>Typically, all of this, sans the "To:" part, is tokenized by
|
||
|
rfc822t_alloc(), then parsed by rfc822a_alloc(). The "recipient-list:" and
|
||
|
the trailing semicolon is a legacy mailing list specification that is no
|
||
|
longer in widespread use, but still must be accounted for. The resulting
|
||
|
<i>rfc822a</i> structure will have four <i>rfc822addr</i> structures, one for
|
||
|
"recipient-list:", one for each address, and one for the trailing semicolon.
|
||
|
Each <i>rfc822a</i> structure has the following fields:</p>
|
||
|
<pre>struct rfc822addr {
|
||
|
struct rfc822token *tokens;
|
||
|
struct rfc822token *name;
|
||
|
} ;</pre>
|
||
|
|
||
|
<p>If <i>tokens</i> is a null pointer, this structure represents some
|
||
|
non-address portion of the original header, such as "recipient-list:" or a
|
||
|
semicolon. Otherwise it points to a structure that represents the E-mail
|
||
|
address in tokenized form.</p>
|
||
|
|
||
|
<p><i>name</i> either points to the tokenized form of a non-address portion of
|
||
|
the original header, or to a tokenized form of the recipient's name.
|
||
|
<i>name</i> will be NULL if the recipient name was not provided. For the
|
||
|
following address: <code>Tom Jones <tjones@example.com></code> The
|
||
|
<i>tokens</i> field will point to the tokenized form of "tjones@example.com",
|
||
|
and <i>name</i> points to the tokenized form of "Tom Jones".</p>
|
||
|
|
||
|
<p>Each <i>rfc822token</i> structure contains the following fields:</p>
|
||
|
<pre>struct rfc822token {
|
||
|
struct rfc822token *next;
|
||
|
int token;
|
||
|
const char *ptr;
|
||
|
int len;
|
||
|
} ;</pre>
|
||
|
|
||
|
<p>The <i>next</i> pointer builds a linked list of all tokens in this name or
|
||
|
address. The possible values for the <i>token</i> field are:</p>
|
||
|
<ul>
|
||
|
<li>0x00 - this is a simple atom - a sequence of non-special characters that
|
||
|
is delimited by whitespace or special characters (see below).<br>
|
||
|
<br>
|
||
|
</li>
|
||
|
<li>0x22 - the value of the ascii quote - this is a quoted string.<br>
|
||
|
<br>
|
||
|
</li>
|
||
|
<li>'(' - this is an old style comment. A deprecated form of E-mail
|
||
|
addressing uses, as an example, "john@example.com (John Smith)" instead of
|
||
|
"John Smith <john@example.com>". This old-style notation defined
|
||
|
parenthesized content as arbitrary comments. The <i>rfc822token</i> with
|
||
|
<i>token</i> set to '(' is created for the contents of the entire
|
||
|
comment.<br>
|
||
|
<br>
|
||
|
</li>
|
||
|
<li>'<', '>', '@', and many others - the remaining possible values of
|
||
|
<i>token</i> include all the characters in RFC 822 headers that have
|
||
|
special significance.</li>
|
||
|
</ul>
|
||
|
|
||
|
<p>When a <i>rfc822token</i> structure does not represent a special character,
|
||
|
the <i>ptr</i> field points to a text string giving its contents. The contents
|
||
|
are NOT null-terminated, the <i>len</i> field contains the number of
|
||
|
characters included. The macro rfc822_is_atom(token) indicates whether
|
||
|
<i>ptr</i> and <i>len</i> are used for the given <i>token</i>. Currently
|
||
|
rfc822_is_atom() returns true if <i>token</i> is a zero byte, '"', or '('.</p>
|
||
|
|
||
|
<p>Note that it's possible that <i>len</i> might be zero. This will be the
|
||
|
case for null addresses used as return addresses for delivery status
|
||
|
notifications.</p>
|
||
|
|
||
|
<h3>Working with E-mail addresses</h3>
|
||
|
<pre>void rfc822_deladdr(struct rfc822a *addrs, int index);
|
||
|
|
||
|
void rfc822tok_print(const struct rfc822token *list,
|
||
|
void (*func)(char, void *), void *func_arg);
|
||
|
|
||
|
void rfc822_print(const struct rfc822a *addrs,
|
||
|
void (*print_func)(char, void *),
|
||
|
void (*print_separator)(const char *, void *), void *callback_arg);
|
||
|
|
||
|
void rfc822_addrlist(const struct rfc822a *addrs,
|
||
|
void (*print_func)(char, void *),
|
||
|
void *callback_arg);
|
||
|
|
||
|
void rfc822_namelist(const struct rfc822a *addrs,
|
||
|
void (*print_func)(char, void *),
|
||
|
void *callback_arg);
|
||
|
|
||
|
void rfc822_praddr(const struct rfc822a *addrs,
|
||
|
int index,
|
||
|
void (*print_func)(char, void *),
|
||
|
void *callback_arg);
|
||
|
|
||
|
void rfc822_prname(const struct rfc822a *addrs,
|
||
|
int index,
|
||
|
void (*print_func)(char, void *),
|
||
|
void *callback_arg);
|
||
|
|
||
|
void rfc822_prname_orlist(const struct rfc822a *addrs,
|
||
|
int index,
|
||
|
void (*print_func)(char, void *),
|
||
|
void *callback_arg);
|
||
|
|
||
|
char *rfc822_gettok(const struct rfc822token *list);
|
||
|
char *rfc822_getaddrs(const struct rfc822a *addrs);
|
||
|
char *rfc822_getaddr(const struct rfc822a *addrs, int index);
|
||
|
char *rfc822_getname(const struct rfc822a *addrs, int index);
|
||
|
char *rfc822_getname_orlist(const struct rfc822a *addrs, int index);
|
||
|
|
||
|
char *rfc822_getaddrs_wrap(const struct rfc822a *, int);</pre>
|
||
|
|
||
|
<p>These functions are used to work with individual addresses that are parsed
|
||
|
by rfc822a_alloc().</p>
|
||
|
|
||
|
<p>rfc822_deladdr() removes a single <i>rfc822addr</i> structure, whose
|
||
|
<i>index</i> is given, from the address array in <i>rfc822addr</i>.
|
||
|
<i>naddrs</i> is decremented by one.</p>
|
||
|
|
||
|
<p>rfc822tok_print() converts a tokenized <i>list</i> of <i>rfc822token</i>
|
||
|
objects into a text string. The callback function, <i>func</i>, is called one
|
||
|
character at a time, for every character in the tokenized objects. An
|
||
|
arbitrary pointer, <i>func_arg</i>, is passed unchanged as the additional
|
||
|
argument to the callback function. rfc822tok_print() is not usually the most
|
||
|
convenient and efficient function, but it has its uses.</p>
|
||
|
|
||
|
<p>rfc822_print() takes an entire <i>rfc822a</i> structure, and uses the
|
||
|
callback functions to print the contained addresses, in their original form,
|
||
|
separated by commas. The function pointed to by <i>print_func</i> is used to
|
||
|
print each individual address, one character at a time. Between the
|
||
|
addresses, the <i>print_separator</i> function is called to print the address
|
||
|
separator, usually the string ", ". The <i>callback_arg</i> argument is passed
|
||
|
along unchanged, as an additional argument to these functions.</p>
|
||
|
|
||
|
<p>The functions rfc822_addrlist() and rfc822_namelist() also print the
|
||
|
contents of the entire <i>rfc822a</i> structure, but in a different way.
|
||
|
rfc822_addrlist() prints just the actual E-mail addresses, not the recipient
|
||
|
names or comments. Each E-mail address is followed by a newline character.
|
||
|
rfc822_namelist() prints just the names or comments, followed by newlines.</p>
|
||
|
|
||
|
<p>The functions rfc822_praddr() and rfc822_prname() are just like
|
||
|
rfc822_addrlist() and rfc822_namelist(), except that they print a single name
|
||
|
or address in the <i>rfc822a</i> structure, given its <i>index</i>. The
|
||
|
functions rfc822_gettok(), rfc822_getaddrs(), rfc822_getaddr(), and
|
||
|
rfc822_getname() are equivalent to rfc822tok_print(), rfc822_print(),
|
||
|
rfc822_praddr() and rfc822_prname(), but, instead of using a callback function
|
||
|
pointer, these functions write the output into a dynamically allocated buffer.
|
||
|
That buffer must be destroyed by free(3) after use. These functions will
|
||
|
return a null pointer in the event of a failure to allocate memory for the
|
||
|
buffer.</p>
|
||
|
|
||
|
<p>rfc822_prname_orlist() is similar to rfc822_prname(), except that it will
|
||
|
also print the legacy RFC822 group list syntax (which are also parsed by
|
||
|
rfc822a_alloc()). rfc822_praddr() will print an empty string for an index
|
||
|
that corresponds to a group list name (or terminated semicolon).
|
||
|
rfc822_prname() will also print an empty string. rfc822_prname_orlist() will
|
||
|
instead print either the name of the group list, or a single string ";".
|
||
|
rfc822_getname_orlist() will instead save it into a dynamically allocated
|
||
|
buffer.</p>
|
||
|
|
||
|
<p>The function rfc822_getaddrs_wrap() is similar to rfc822_getaddrs(), except
|
||
|
that the generated text is wrapped on or about the 73rd column, using newline
|
||
|
characters.</p>
|
||
|
|
||
|
<h3>Working with dates</h3>
|
||
|
<pre>time_t timestamp=rfc822_parsedt(const char *datestr)
|
||
|
const char *datestr=rfc822_mkdate(time_t timestamp);
|
||
|
void rfc822_mkdate_buf(time_t timestamp, char *buffer);</pre>
|
||
|
|
||
|
<p>These functions convert between timestamps and dates expressed in the Date:
|
||
|
E-mail header format.</p>
|
||
|
|
||
|
<p>rfc822_parsedt() returns the timestamp corresponding to the given date
|
||
|
string (0 if there was a syntax error).</p>
|
||
|
|
||
|
<p>rfc822_mkdate() returns a date string corresponding to the given timestamp.
|
||
|
rfc822_mkdate_buf() writes the date string into the given buffer instead,
|
||
|
which must be of sufficient size to accommodate it.</p>
|
||
|
|
||
|
<h3>Working with 8-bit MIME-encoded headers</h3>
|
||
|
<pre>int error=rfc2047_decode(const char *text,
|
||
|
int (*callback_func)(const char *, int, const char *, void *),
|
||
|
void *callback_arg);
|
||
|
|
||
|
extern char *str=rfc2047_decode_simple(const char *text);
|
||
|
|
||
|
extern char *str=rfc2047_decode_enhanced(const char *text,
|
||
|
const char *charset);
|
||
|
|
||
|
void rfc2047_print(const struct rfc822a *a,
|
||
|
const char *charset,
|
||
|
void (*print_func)(char, void *),
|
||
|
void (*print_separator)(const char *, void *), void *);
|
||
|
|
||
|
|
||
|
char *buffer=rfc2047_encode_str(const char *string,
|
||
|
const char *charset);
|
||
|
|
||
|
int error=rfc2047_encode_callback(const char *string,
|
||
|
const char *charset,
|
||
|
int (*func)(const char *, size_t, void *),
|
||
|
void *callback_arg);
|
||
|
|
||
|
char *buffer=rfc2047_encode_header(const struct rfc822a *a,
|
||
|
const char *charset);</pre>
|
||
|
|
||
|
<p>These functions provide additional logic to encode or decode 8-bit content
|
||
|
in 7-bit RFC 822 headers, as specified in RFC 2047.</p>
|
||
|
|
||
|
<p>rfc2047_decode() is a basic RFC 2047 decoding function. It receives a
|
||
|
pointer to some 7bit RFC 2047-encoded text, and a callback function. The
|
||
|
callback function is repeatedly called. Each time it's called it receives a
|
||
|
piece of decoded text. The arguments are: a pointer to a text fragment, number
|
||
|
of bytes in the text fragment, followed by a pointer to the character set of
|
||
|
the text fragment. The character set pointer is NULL for portions of the
|
||
|
original text that are not RFC 2047-encoded.</p>
|
||
|
|
||
|
<p>The callback function also receives <i>callback_arg</i>, as its last
|
||
|
argument. If the callback function returns a non-zero value, rfc2047_decode()
|
||
|
terminates, returning that value. Otherwise, rfc2047_decode() returns 0 after
|
||
|
a successful decoding. rfc2047_decode() returns -1 if it was unable to
|
||
|
allocate sufficient memory.</p>
|
||
|
|
||
|
<p>rfc2047_decode_simple() and rfc2047_decode_enhanced() are alternatives to
|
||
|
rfc2047_decode() which forego a callback function, and return the decoded text
|
||
|
in a dynamically-allocated memory buffer. The buffer must be free(3)-ed after
|
||
|
use. rfc2047_decode_simple() discards all character set specifications, and
|
||
|
merely decodes any 8-bit text. rfc2047_decode_enhanced() is a compromise to
|
||
|
discarding all character set information. The local character set being used
|
||
|
is specified as the second argument to rfc2047_decode_enhanced(). Any RFC
|
||
|
2047-encoded text in a different character set will be prefixed by the name of
|
||
|
the character set, in brackets, in the resulting output.</p>
|
||
|
|
||
|
<p>rfc2047_decode_simple() and rfc2047_decode_enhanced() return a null pointer
|
||
|
if they are unable to allocate sufficient memory.</p>
|
||
|
|
||
|
<p>The rfc2047_print() function is equivalent to rfc822_print(), followed by
|
||
|
rfc2047_decode_enhanced() on the result. The callback functions are used in
|
||
|
an identical fashion, except that they receive text that's already
|
||
|
decoded.</p>
|
||
|
|
||
|
<p>The function rfc2047_encode_str() takes a <i>string</i> and <i>charset</i>
|
||
|
being the name of the local character set, then encodes any 8-bit portions of
|
||
|
<i>string</i> using RFC 2047 encoding. rfc2047_encode_str() returns a
|
||
|
dynamically-allocated buffer with the result, which must be free(3)-ed after
|
||
|
use, or NULL if there was insufficient memory to allocate the buffer.</p>
|
||
|
|
||
|
<p>The function rfc2047_encode_callback() is similar to rfc2047_encode_str()
|
||
|
except that the callback function is repeatedly called to received the
|
||
|
encoding string. Each invocation of the callback function receives a pointer
|
||
|
to a portion of the encoded text, the number of characters in this portion,
|
||
|
and <i>callback_arg</i>.</p>
|
||
|
|
||
|
<p>The function rfc2047_encode_header() is basically equivalent to
|
||
|
rfc822_getaddrs(), followed by rfc2047_encode_str();</p>
|
||
|
|
||
|
<h3>Working with subjects</h3>
|
||
|
<pre>char *basesubj=rfc822_coresubj(const char *subj)
|
||
|
|
||
|
char *basesubj=rfc822_coresubj_nouc(const char *subj)</pre>
|
||
|
|
||
|
<p>This function takes the contents of the subject header, and returns the
|
||
|
"core" subject header that's used in the specification of the IMAP THREAD
|
||
|
function. This function is designed to strip all subject line artifacts that
|
||
|
might've been added in the process of forwarding or replying to a message.
|
||
|
Currently, rfc822_coresubj() performs the following transformations:</p>
|
||
|
<ul>
|
||
|
<li>Whitespace - leading and trailing whitespace is removed. Consecutive
|
||
|
whitespace characters are collapsed into a single whitespace character.
|
||
|
All whitespace characters are replaced by a space.<br>
|
||
|
<br>
|
||
|
</li>
|
||
|
<li>Re:, (fwd) [foo] - these artifacts (and several others) are removed from
|
||
|
the subject line.</li>
|
||
|
</ul>
|
||
|
|
||
|
<p>Note that this function does NOT do MIME decoding. In order to implement
|
||
|
IMAP THREAD, it is necessary to call something like rfc2047_decode() before
|
||
|
calling rfc822_coresubj().</p>
|
||
|
|
||
|
<p>This function returns a pointer to a dynamically-allocated buffer, which
|
||
|
must be free(3)-ed after use.</p>
|
||
|
|
||
|
<p>rfc822_coresubj_nouc() is like rfc822_coresubj(), except that the subject
|
||
|
is not converted to uppercase.</p>
|
||
|
|
||
|
<h2>SEE ALSO</h2>
|
||
|
<a href="rfc2045.html">rfc2045(3)</a>, <a
|
||
|
href="reformime.html">reformime(1)</a>, <a
|
||
|
href="reformail.html">reformail(1)</a>.</body>
|
||
|
</html>
|