AspEmail Manual Chapter 6: Unicode and Non-ASCII Support

6.1 Quoted-Printable Format

AspEmail can send messages in alphabets other than US-ASCII by supporting the "Quoted-Printable" format, which is described in RFC 2045. In this format, characters with codes less than 33 or greater than 126 are represented by an "=" followed by a two-digit hexadecimal representation of the character's value. For example, the decimal value 12 (US-ASCII form feed) is represented as =0C. Because "=" serves as the escape character, the decimal value 61 (US-ASCII "=") is itself represented as =3D.
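
The escaping rule can be sketched in VBScript as follows. QPEncodeByte is a hypothetical helper written for illustration only; it is not an AspEmail method, since AspEmail performs this encoding internally. The sketch also ignores the line-length and trailing-whitespace rules of RFC 2045.

```vbscript
<%
' Illustration only: QPEncodeByte is a hypothetical helper, not an AspEmail method.
' It returns the Quoted-Printable form of a single byte value (0 to 255).
Function QPEncodeByte(ByVal b)
    If b < 33 Or b > 126 Or b = 61 Then
        ' Escape as "=" followed by two hexadecimal digits
        QPEncodeByte = "=" & Right("0" & Hex(b), 2)
    Else
        ' Printable US-ASCII characters pass through unchanged
        QPEncodeByte = Chr(b)
    End If
End Function

Response.Write QPEncodeByte(12) & "<BR>"   ' form feed
Response.Write QPEncodeByte(61) & "<BR>"   ' equals sign
Response.Write QPEncodeByte(65) & "<BR>"   ' letter A, left as is
%>
```

The three calls produce =0C, =3D and A respectively.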

AspEmail encodes the message body in the Quoted-Printable format automatically if the ContentTransferEncoding property is set to the string "Quoted-Printable" (letter case is immaterial). You may also set the Charset property to the appropriate character set. The following code snippet sends a message in Russian:

<% @codepage=1251 %>

<%
...
Mail.Charset = "Windows-1251"
Mail.Body = "Сообщение по-русски."
Mail.ContentTransferEncoding = "Quoted-Printable"
%>

The directive <% @codepage=1251 %> instructs the ASP interpreter to treat the hard-coded characters in the script as Cyrillic symbols (1251 is the Windows Cyrillic code page). As a result, the Body property receives a Unicode string containing the Russian text.

6.2 Non-ASCII Characters in Headers

If you wish to send a message in which mail headers such as Subject:, To: or From: contain non-US-ASCII characters, you should use the method Mail.EncodeHeader to encode your character string according to RFC 1522 (since superseded by RFC 2047). The method takes one required parameter, the header string, and one optional parameter, the character set, which is "ISO-8859-1" by default. For example:

<% @codepage=1251 %>

<%
Mail.Subject = Mail.EncodeHeader("Тема По-русски", "Windows-1251")
Mail.FromName = Mail.EncodeHeader("Иван", "Windows-1251")
Mail.AddAddress "stein@somecompany.no", Mail.EncodeHeader("Штейн", "Windows-1251")
%>

6.3 Unicode and UTF-8

From MSDN: "Unicode is a 16-bit, fixed-width character encoding standard that encompasses virtually all of the characters commonly used on computers today. This includes most of the world's written languages, plus publishing characters, mathematical and technical symbols, and punctuation marks."

From Unicode.org: "Computers ... store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters... Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language."

For example, the basic Latin letter "A" has the code Hex 0041 (65), the Russian letter "Ж" has the code Hex 0416 (1046), and the Chinese character "㊥" has the code Hex 32A5 (12965).

UTF-8 (Unicode Transformation Format, 8-bit encoding form) is the recommended format for sending Unicode-based data across networks, in particular the Internet. UTF-8 represents a Unicode value as a sequence of 1 to 4 bytes; values up to Hex FFFF need at most 3 bytes.

Unicode characters in the range Hex 0000 to 007F are encoded simply as bytes 00 to 7F. This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8. Therefore, the Unicode 0041 ("A") in UTF-8 is Hex 41.

Unicode characters in the range Hex 0080 to 07FF are encoded as a sequence of two bytes. For example, the Unicode 0416 ("Ж") is encoded as Hex D0 96. Unicode characters in the range Hex 0800 to FFFF are encoded as a sequence of three bytes. For example, the Unicode 32A5 ("㊥") is encoded as Hex E3 8A A5.
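
The one-, two- and three-byte rules can be checked with a short VBScript sketch. Utf8Hex is a hypothetical helper written for illustration and is not part of AspEmail; it handles only values from 0 to 65535 and does not reject the surrogate range Hex D800 to DFFF.

```vbscript
<%
' Illustration only: Utf8Hex is a hypothetical helper, not an AspEmail method.
' It returns the UTF-8 bytes of a value in the range 0 to 65535
' as a string of hexadecimal byte values separated by spaces.
Function Utf8Hex(ByVal cp)
    cp = CLng(cp)
    If cp < &H80 Then
        ' One byte: the value itself
        Utf8Hex = Right("0" & Hex(cp), 2)
    ElseIf cp < &H800 Then
        ' Two bytes: 110xxxxx 10xxxxxx
        Utf8Hex = Hex(&HC0 Or (cp \ &H40)) & " " & _
                  Hex(&H80 Or (cp And &H3F))
    Else
        ' Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        Utf8Hex = Hex(&HE0 Or (cp \ &H1000)) & " " & _
                  Hex(&H80 Or ((cp \ &H40) And &H3F)) & " " & _
                  Hex(&H80 Or (cp And &H3F))
    End If
End Function

Response.Write Utf8Hex(65) & "<BR>"      ' Hex 0041
Response.Write Utf8Hex(1046) & "<BR>"    ' Hex 0416
Response.Write Utf8Hex(12965) & "<BR>"   ' Hex 32A5
%>
```

The three calls produce 41, D0 96 and E3 8A A5, matching the values given above. The arguments are written in decimal because VBScript reads four-digit hexadecimal literals of Hex 8000 and above as negative 16-bit integers.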

As of Unicode 2.0, characters are no longer limited to the Hex 0000 to Hex FFFF range, referred to as the Basic Multilingual Plane (BMP). Characters in the range Hex 10000 to Hex 10FFFF, referred to as Supplementary code points, are also supported. Among the latter are the Emoji characters such as these:

Icon    Code Point (Hex)    Encoding (Hex)
😂      1F602               D83D DE02
🌹      1F339               D83C DF39

Since the Emoji characters (and other supplementary code points) exceed Hex FFFF, they must be represented by two 16-bit numbers instead of one, as shown in the right column of the table above. These two numbers are referred to as the high surrogate and the low surrogate. The formulas to convert a supplementary code point (cp) to its high and low surrogates are as follows, where / denotes integer division and % the remainder:

hi = (cp - 0x10000) / 0x400 + 0xD800
lo = (cp - 0x10000) % 0x400 + 0xDC00
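
In VBScript, these formulas might be written as shown below. HighSurrogate and LowSurrogate are hypothetical helper names chosen for this sketch; they are not AspEmail methods. Integer division is spelled \ and the remainder operator is Mod.

```vbscript
<%
' Illustration only: hypothetical helpers, not AspEmail methods.
' Decimal constants are used because VBScript reads four-digit hexadecimal
' literals such as &HD800 as negative 16-bit integers.
Function HighSurrogate(ByVal cp)
    HighSurrogate = (cp - 65536) \ 1024 + 55296     ' 55296 = Hex D800
End Function

Function LowSurrogate(ByVal cp)
    LowSurrogate = (cp - 65536) Mod 1024 + 56320    ' 56320 = Hex DC00
End Function

emoji = &H1F602   ' laughing face; five-digit hexadecimal literals are read as Long
Response.Write Hex(HighSurrogate(emoji)) & " " & Hex(LowSurrogate(emoji))
%>
```

For the laughing face (Hex 1F602) this writes D83D DE02, and for the rose (Hex 1F339) the same functions give D83C DF39, in agreement with the table above.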

These numbers come in handy when Emojis need to be included in an email subject, as shown in the next section.

6.4 UTF-8 Support in AspEmail

AspEmail 5.0 offers full UTF-8 support in both a message body and headers. To send a UTF-8 encoded message, you must set the CharSet property to the string "UTF-8" (case is immaterial), and ContentTransferEncoding to "Quoted-Printable". You should also pass "UTF-8" as the second argument to EncodeHeader.

The following code sample demonstrates the UTF-8 usage:

<%
' change to address of your own SMTP server
strHost = "smtp.myisp.net"

' Enable UTF-8 -> Unicode translation for form items
Session.CodePage = 65001 ' UTF-8 code page

If Request("Send") <> "" Then
   Set Mail = Server.CreateObject("Persits.MailSender")
   ' enter valid SMTP host
   Mail.Host = strHost

   Mail.From = "info@aspemail.com" ' From address
   Mail.FromName = Mail.EncodeHeader(Request("FromName"), "utf-8")
   Mail.AddAddress Request("To")

   ' message subject
   Mail.Subject = Mail.EncodeHeader( Request("Subject"), "utf-8")

   ' message body
   Mail.Body = Request("Body")

   ' UTF-8 parameters
   Mail.CharSet = "UTF-8"
   Mail.ContentTransferEncoding = "Quoted-Printable"
   Mail.Send ' send message
   ' HTML-encode the user-supplied address before echoing it back to the page
   Response.Write "Message sent to " & Server.HTMLEncode(Request("To"))
End If
%>

<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" content="text/html; charset=utf-8">
<TITLE>AspEmail: Unicode.asp</TITLE>
</HEAD>
<BODY>

<FORM METHOD="POST" ACTION="Unicode.asp">
<TABLE CELLSPACING=0 CELLPADDING=0>
<TR><TD>Enter email:</TD><TD><INPUT TYPE="TEXT" NAME="To"></TD></TR>
<TR><TD>Enter your name:</TD><TD><INPUT TYPE="TEXT" NAME="FromName"></TD></TR>
<TR><TD>Enter Subject:</TD><TD><INPUT TYPE="TEXT" NAME="Subject"></TD></TR>
<TR><TD>Enter Body:</TD><TD><TEXTAREA cols="50" rows="10" NAME="Body"></TEXTAREA></TD></TR>
<TR><TD COLSPAN=2><INPUT TYPE=SUBMIT NAME="Send" VALUE="Send"></TD></TR>
</TABLE>
</FORM>
</BODY>
</HTML>

This code sample has several important elements you must not overlook:

<META HTTP-EQUIV="Content-Type" content="text/html; charset=utf-8">

This META tag specifies the character set for this page to be UTF-8. This, among other things, instructs the browser to UTF8-encode all form items when the form is submitted.

Session.CodePage = 65001

This line instructs our ASP script to convert UTF8-encoded form items (returned by the Request.Form collection) back to regular Unicode strings. The number 65001 is the UTF-8 code page.

Mail.Subject = Mail.EncodeHeader( Request("Subject"), "utf-8")

The optional second argument is set to "UTF-8" for proper encoding of the header.

Mail.CharSet = "UTF-8"
Mail.ContentTransferEncoding = "Quoted-Printable"

These two lines ensure proper UTF-8 encoding of the message body.

The EncodeHeader method can also be used to include Emoji characters in the message subject. The Emoji's two-number encoding must be used along with VBScript's built-in ChrW function, which converts a number to a 2-byte (Unicode) character. The following code snippet appends two Emojis, the laughing face and the rose mentioned in the previous section, to the subject:

...
Mail.Subject = Mail.EncodeHeader("Emoji Test: " & ChrW(&HD83D) & ChrW(&HDE02) & ChrW(&HD83C) & ChrW(&HDF39), "utf-8")
...
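
If you prefer to work with code points rather than precomputed surrogates, a small wrapper can apply the surrogate formulas from Section 6.3 before calling ChrW. The sketch below assumes that Mail has already been created as in the sample above; CodePointToString is a hypothetical helper written for illustration and is not an AspEmail method.

```vbscript
<%
' Illustration only: CodePointToString is a hypothetical helper, not an AspEmail method.
' It returns a VBScript string holding the character for the given code point.
Function CodePointToString(ByVal cp)
    If cp < 65536 Then
        ' A value up to Hex FFFF fits in a single 16-bit character
        CodePointToString = ChrW(cp)
    Else
        ' Supplementary code point: high surrogate followed by low surrogate
        CodePointToString = ChrW((cp - 65536) \ 1024 + 55296) & _
                            ChrW((cp - 65536) Mod 1024 + 56320)
    End If
End Function

Mail.Subject = Mail.EncodeHeader("Emoji Test: " & _
    CodePointToString(&H1F602) & CodePointToString(&H1F339), "utf-8")
%>
```

This should build the same subject string as the ChrW-based snippet above, with the code points taken directly from the table in Section 6.3.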

6.5 Valid CharSet Values

You may specify the following string values for the CharSet property, as well as for the optional second argument of the EncodeHeader method:

Value                                           Meaning
"UTF-8"                                         UTF-8
"UTF-7"                                         UTF-7
"Windows-1250", "cp1250"                        ANSI - Central Europe
"Windows-1251", "cp1251"                        ANSI - Cyrillic
"Windows-1252", "cp1252", "ascii", "us-ascii"   Latin I
"Windows-1253", "cp1253"                        ANSI - Greek
"Windows-1254", "cp1254"                        ANSI - Turkish
"Windows-1255", "cp1255"                        ANSI - Hebrew
"Windows-1256", "cp1256"                        ANSI - Arabic
"Windows-1257", "cp1257"                        ANSI - Baltic
"Windows-1258", "cp1258"                        ANSI - Vietnamese
"ISO-8859-1"                                    Latin I (default value)
"ISO-8859-2"                                    Central Europe
"ISO-8859-3"                                    Latin 3
"ISO-8859-4"                                    Baltic
"ISO-8859-5"                                    Cyrillic
"ISO-8859-6"                                    Arabic
"ISO-8859-7"                                    Greek
"ISO-8859-8"                                    Hebrew
"ISO-8859-9"                                    Latin 5
"ISO-8859-15"                                   Latin 9
"cp866"                                         Russian DOS
"koi8-r"                                        Russian
"koi8-u"                                        Ukrainian
"shift_jis"                                     Japanese Windows
"ks_c_5601-1987", "korean"                      Korean
"EUC-KR", "korean"                              EUC - Korean
"BIG5"                                          Traditional Chinese Windows
"GB2312", "chinese"                             Simplified Chinese
"HZ-GB-2312"                                    Simplified Chinese HZ
"EUC-JP"                                        EUC - Japanese
"X-EUC-TW"                                      EUC - Traditional Chinese