<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<!--
  == Copyright (c) 2001 Ronald Garcia
  ==
  == Permission to use, copy, modify, distribute and sell this software
  == and its documentation for any purpose is hereby granted without fee,
  == provided that the above copyright notice appears in all copies and
  == that both that copyright notice and this permission notice appear
  == in supporting documentation.  Ronald Garcia makes no
  == representations about the suitability of this software for any
  == purpose.  It is provided "as is" without express or implied warranty.
  -->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link rel="stylesheet" type="text/css" href="../../../boost.css">
<link rel="stylesheet" type="text/css" href="style.css">
<head>
<title>UTF-8 Codecvt Facet</title>

</head>

<body bgcolor="#ffffff" link="#0000ee" text="#000000"
      vlink="#551a8b" alink="#ff0000">
<img src="../../../boost.png" alt="C++ Boost"
     width="277" height="86"> <br clear="all">


<a name="sec:utf8-codecvt-facet-class"></a>


<h1><code>utf8_codecvt_facet</code></h1>


<pre>
template&lt;
    typename InternType = wchar_t,
    typename ExternType = char
&gt; utf8_codecvt_facet
</pre>


<h2>Rationale</h2>


UTF-8 is a method of encoding Unicode text in environments
where data is stored as 8-bit characters and some ASCII characters
are considered special (e.g. in Unix filesystem filenames) and tend
to appear more commonly than other characters. While
UTF-8 is convenient and efficient for storing data on filesystems,
it was not meant to be manipulated in memory by
applications. Some applications (such as Unix's 'cat') can
simply ignore the encoding of data, but others should convert
from UTF-8 to UCS-4 (the more canonical representation of Unicode)
when reading from a file, and reverse the process when writing out to
a file.
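
<p>To make the two representations concrete, the following is a minimal
sketch of the UCS-4 to UTF-8 mapping for a single code point and its
inverse. It is an illustration only, not the facet's implementation: the
names <tt>ucs4_t</tt>, <tt>encode_utf8</tt> and <tt>decode_utf8</tt> are
hypothetical, and the decoder assumes well-formed input and performs no
validation.</p>

```cpp
#include <cstddef>
#include <string>

// Hypothetical code point type: any integer with at least 21 value bits.
typedef unsigned long ucs4_t;

// Append the UTF-8 encoding of one code point (up to U+10FFFF) to out.
void encode_utf8(ucs4_t cp, std::string& out) {
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Decode the code point that starts at s[pos] and advance pos past it.
// Assumes well-formed UTF-8; malformed input is not detected.
ucs4_t decode_utf8(const std::string& s, std::size_t& pos) {
    unsigned char lead = static_cast<unsigned char>(s[pos++]);
    ucs4_t cp;
    int extra;  // number of continuation bytes that follow the lead byte
    if (lead < 0x80)      { cp = lead;        extra = 0; }
    else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
    else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
    else                  { cp = lead & 0x07; extra = 3; }
    for (int i = 0; i < extra; ++i)
        cp = (cp << 6) | (static_cast<unsigned char>(s[pos++]) & 0x3F);
    return cp;
}
```

<p>ASCII characters occupy a single byte under this scheme, which is why
tools that only care about ASCII delimiters can pass UTF-8 data through
untouched.</p>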

<p>The C++ Standard IOStreams library provides the <tt>std::codecvt</tt>
facet to handle exactly these cases. On reading from or
writing to a file, the <tt>std::basic_filebuf</tt> can call out to
the codecvt facet to convert data representations from external
format (e.g. UTF-8) to internal format (e.g. UCS-4) and
vice versa. <tt>utf8_codecvt_facet</tt> is a specialization of
<tt>std::codecvt</tt> specifically designed to handle the case
of translating between UTF-8 and UCS-4.</p>
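
<p>The kind of call a file buffer makes on a codecvt facet can be sketched
directly with the standard <tt>in()</tt> member. The helper name
<tt>to_internal</tt> is hypothetical, and the sketch only uses the
facet type that every standard library supplies; with a locale that
carries a UTF-8 facet under the same facet type, the same call would be
expected to perform UTF-8 decoding, but that is an assumption about how
the facet is installed.</p>

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Convert external bytes to internal wide characters using whatever
// codecvt<wchar_t, char, mbstate_t> facet the given locale carries.
std::wstring to_internal(const std::string& bytes, const std::locale& loc) {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(loc);

    if (bytes.empty()) return std::wstring();

    std::mbstate_t state = std::mbstate_t();
    // Assumption: the conversion never yields more wide characters than
    // input bytes, so this buffer is large enough.
    std::wstring out(bytes.size(), L'\0');
    const char* from_next = 0;
    wchar_t* to_next = 0;

    std::codecvt_base::result r = cvt.in(
        state,
        bytes.data(), bytes.data() + bytes.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);

    if (r == std::codecvt_base::error)
        throw std::runtime_error("invalid external sequence");
    if (r == std::codecvt_base::noconv)   // facet reports no conversion needed
        return std::wstring(bytes.begin(), bytes.end());

    out.resize(to_next - &out[0]);        // keep only the characters produced
    return out;
}
```

<p>A result of <tt>partial</tt> is treated here like <tt>ok</tt>; a real
stream buffer would keep the unconsumed bytes and retry once more input
arrives.</p>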


<h2>Template Parameters</h2>

<table border summary="template parameters">
<tr>
<th>Parameter</th><th>Description</th><th>Default</th>
</tr>

<tr>
<td><tt>InternType</tt></td>
<td>The internal type used to represent UCS-4 characters.</td>
<td><tt>wchar_t</tt></td>
</tr>

<tr>
<td><tt>ExternType</tt></td>
<td>The external type used to represent UTF-8 octets.</td>
<td><tt>char</tt></td>
</tr>
</table>


<h2>Requirements</h2>

<tt>utf8_codecvt_facet</tt> defaults to using <tt>char</tt> as
its external data type and <tt>wchar_t</tt> as its internal
data type, but on some architectures <tt>wchar_t</tt> is
not large enough to hold UCS-4 characters. To use
another internal type, you must also specialize <tt>std::codecvt</tt>
to handle your internal and external types
(<tt>std::codecvt&lt;wchar_t,char,std::mbstate_t&gt;</tt> is required to be
supplied by any standard-conforming implementation).
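
<p>Whether a candidate internal type is wide enough can be checked before
choosing it. The following is a small sketch, not part of the library:
the helper name <tt>can_hold_ucs4</tt> is hypothetical, and it assumes
that 21 value bits (enough for code points up to U+10FFFF) is the
relevant threshold.</p>

```cpp
#include <limits>

// True if T has at least 21 value bits, enough for code points up to
// U+10FFFF. numeric_limits<T>::digits excludes the sign bit.
template <typename T>
bool can_hold_ucs4() {
    return std::numeric_limits<T>::digits >= 21;
}
```

<p>On platforms where <tt>wchar_t</tt> is 16 bits wide, such as Windows,
<tt>can_hold_ucs4&lt;wchar_t&gt;()</tt> returns false, which is the
situation this section describes.</p>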


<h2>Example Use</h2>
The following is a simple example of using this facet:

<pre>
  //...
  // Needs &lt;algorithm&gt;, &lt;fstream&gt;, &lt;iterator&gt;, &lt;locale&gt; and &lt;vector&gt;

  // My encoding type
  typedef wchar_t ucs4_t;

  std::locale old_locale;
  std::locale utf8_locale(old_locale,new utf8_codecvt_facet&lt;ucs4_t&gt;);

  // Set a new global locale
  std::locale::global(utf8_locale);

  // Send the UCS-4 data out, converting to UTF-8
  // (ucs4_data is assumed to be a container of ucs4_t filled in earlier)
  {
    std::wofstream ofs("data.ucd");
    ofs.imbue(utf8_locale);
    std::copy(ucs4_data.begin(),ucs4_data.end(),
              std::ostream_iterator&lt;ucs4_t,ucs4_t&gt;(ofs));
  }

  // Read the UTF-8 data back in, converting to UCS-4 on the way in
  std::vector&lt;ucs4_t&gt; from_file;
  {
    std::wifstream ifs("data.ucd");
    ifs.imbue(utf8_locale);
    ucs4_t item = 0;
    // Note: operator&gt;&gt; skips whitespace characters by default
    while (ifs &gt;&gt; item) from_file.push_back(item);
  }
  //...
</pre>


<h2>History</h2>

This code was originally written as an iterator adaptor over
containers for use with UTF-8 encoded strings in memory.
Dietmar Kuehl suggested that it would be better provided as a
codecvt facet.

<h2>Resources</h2>

<ul>
<li> <a href="http://www.unicode.org">Unicode Homepage</a>
<li> <a href="http://home.CameloT.de/langer/iostreams.htm">Standard
     C++ IOStreams and Locales</a>
<li> <a href="http://www.research.att.com/~bs/3rd.html">The C++
     Programming Language Special Edition, Appendix D.</a>
</ul>

<br>
<hr>
<table summary="Copyright information">
<tr valign="top">
<td nowrap>Copyright © 2001</td>
<td><a href="http://www.osl.iu.edu/~garcia">Ronald Garcia</a>,
Indiana University
(<a href="mailto:garcia@cs.indiana.edu">garcia@osl.iu.edu</a>)<br>
<a href="http://www.osl.iu.edu/~lums">Andrew Lumsdaine</a>,
Indiana University
(<a href="mailto:lums@osl.iu.edu">lums@osl.iu.edu</a>)</td>
</tr>
</table>
<p><i>© Copyright <a href="http://www.rrsd.com">Robert Ramey</a> 2002-2004.
Distributed under the Boost Software License, Version 1.0. (See
accompanying file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
</i></p>
</body>
</html>