litehtml/document_createFromString.md at cc

Yuri Kobets 37bbcc6f18 Added documentation

2025-05-23 03:22:45 +03:00

static document::ptr  document::createFromString(
	const estring&       str,
	document_container*  container,
	const string&        master_styles = litehtml::master_css,
	const string&        user_styles = "");

Terminology:

BOM encoding is the encoding suggested by the byte-order-mark (BOM). Can be UTF-8, UTF-16LE, or UTF-16BE. Cannot be UTF-32 because it is not a valid HTML encoding. See bom_sniff.
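As an illustration of what BOM sniffing involves, here is a minimal, self-contained sketch. It is not litehtml's bom_sniff; the enum and the function name sniff_bom are hypothetical and exist only for this example. The byte sequences are the standard BOMs for UTF-8 (EF BB BF), UTF-16LE (FF FE) and UTF-16BE (FE FF).

```cpp
#include <cstddef>
#include <string>

// Hypothetical enum for this sketch; not litehtml's own encoding type.
enum class bom_encoding { none, utf_8, utf_16le, utf_16be };

// Returns the encoding suggested by a leading byte-order mark, if any.
// Note: a UTF-32LE BOM (FF FE 00 00) starts with the UTF-16LE BOM, so this
// sketch reports it as UTF-16LE; UTF-32 is never returned.
inline bom_encoding sniff_bom(const std::string& bytes)
{
    auto b = [&](std::size_t i) { return static_cast<unsigned char>(bytes[i]); };

    if (bytes.size() >= 3 && b(0) == 0xEF && b(1) == 0xBB && b(2) == 0xBF)
        return bom_encoding::utf_8;
    if (bytes.size() >= 2 && b(0) == 0xFF && b(1) == 0xFE)
        return bom_encoding::utf_16le;
    if (bytes.size() >= 2 && b(0) == 0xFE && b(1) == 0xFF)
        return bom_encoding::utf_16be;
    return bom_encoding::none;
}
```

Input without a BOM yields bom_encoding::none, in which case the later steps described below decide the encoding.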

meta encoding is an HTML encoding suggested by a valid charset tag.

valid charset tag:

must be inside <head>
must have one of these forms:
- <meta charset="utf-8"> or
- <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
the encoding name must be one of the encoding labels listed at https://encoding.spec.whatwg.org/#names-and-labels (see get_encoding)
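To show what matching an encoding label means in practice, here is a rough sketch of the lookup described by the Encoding Standard: strip leading and trailing ASCII whitespace, compare case-insensitively, and map the label to an encoding name. It is not litehtml's get_encoding; the function name encoding_from_label is hypothetical, and the table is only a small excerpt of the full label list.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

// Returns the encoding name for a label, or "" if the label is unknown.
// Hypothetical helper for illustration; the table is a tiny excerpt.
inline std::string encoding_from_label(std::string label)
{
    static const std::map<std::string, std::string> labels = {
        {"utf-8", "UTF-8"}, {"utf8", "UTF-8"}, {"unicode-1-1-utf-8", "UTF-8"},
        {"big5", "Big5"}, {"big5-hkscs", "Big5"}, {"cn-big5", "Big5"},
        {"latin1", "windows-1252"}, {"iso-8859-1", "windows-1252"},
    };

    // Strip leading and trailing ASCII whitespace (tab, LF, FF, CR, space).
    const char* ws = "\t\n\f\r ";
    std::size_t first = label.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    std::size_t last = label.find_last_not_of(ws);
    label = label.substr(first, last - first + 1);

    // Labels are matched ASCII case-insensitively.
    std::transform(label.begin(), label.end(), label.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    auto it = labels.find(label);
    return it == labels.end() ? "" : it->second;
}
```

Several labels can name the same encoding, and a name such as utf-32 is not a label at all, so a charset tag using it would not count as valid.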

HTTP encoding is the encoding specified in HTTP Content-Type header https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Type

user override encoding is the encoding chosen manually by the user for a particular page or site, if your program offers such an option

Call without specifying encoding:

createFromString(string, container);

where string is std::string or char*

if BOM is present, BOM encoding will be used
otherwise, if a valid <meta> charset tag is present, meta encoding will be used
otherwise, UTF-8 will be used

Call with encoding, confidence is certain:

createFromString({string, encoding::big5}, container)

if BOM is present, BOM encoding will be used
otherwise, Big5 will be used

NOTE: the encoding from a <meta> charset tag will be ignored

Call with encoding, confidence is tentative (very rare, you probably don't need this):

createFromString({string, encoding::big5, confidence::tentative}, container)

if BOM is present, BOM encoding will be used
otherwise, if a valid <meta> charset tag is present, meta encoding will be used
otherwise, Big5 will be used

User override encoding and HTTP encoding must be passed with confidence certain. If both are present, the user override encoding should take precedence.

If both the user override encoding and the HTTP encoding are unspecified, your program may guess the encoding, for example from the encoding of the page when it was last visited, by frequency analysis, from the URL's domain, or from the current user locale. Any such guess should be passed with confidence tentative. The precedence of these guesses is specified in the encoding sniffing algorithm; see litehtml::encoding_sniffing_algorithm and https://html.spec.whatwg.org/multipage/parsing.html#encoding-sniffing-algorithm
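The rules above for choosing what to pass to createFromString can be sketched as a small helper. Everything here is hypothetical (the encoding_hint struct, the pick_hint function, and the use of plain strings for encoding names); it only illustrates the precedence: user override, then HTTP encoding, both certain, then a guess passed as tentative.

```cpp
#include <string>

// Hypothetical result type for this sketch: which encoding to pass and
// whether to pass it as certain. An empty name means "pass no encoding".
struct encoding_hint {
    std::string name;
    bool        certain;
};

// Picks the encoding hint from the sources the application knows about.
// Empty strings stand for "not available".
inline encoding_hint pick_hint(const std::string& user_override,
                               const std::string& http_charset,
                               const std::string& guess)
{
    if (!user_override.empty()) return {user_override, true};  // wins over HTTP
    if (!http_charset.empty())  return {http_charset, true};
    if (!guess.empty())         return {guess, false};         // tentative only
    return {"", false};                                        // let litehtml decide
}
```

An application would then map the result onto one of the three call forms shown earlier: no encoding, encoding with confidence certain, or encoding with confidence tentative.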

litehtml implements only three steps of this algorithm:

set encoding to BOM encoding if BOM is present
prescan the input to determine its encoding - this step runs only if no BOM was found and the user either did not specify an encoding or specified a tentative one
return an implementation-defined default character encoding (UTF-8) - this step runs only if there is no BOM, the user did not specify an encoding, and the prescan failed.
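The three steps, combined with the three call forms described earlier, can be modelled by the following sketch. It is a simplified model, not litehtml's implementation: the sniff_input struct, the sniff_confidence enum and effective_encoding are hypothetical names, and encodings are represented as plain strings, with an empty string meaning "absent".

```cpp
#include <string>

// Hypothetical mirror of the confidence levels used in this document.
enum class sniff_confidence { tentative, certain };

// Hypothetical inputs to the decision; empty strings mean "absent".
struct sniff_input {
    std::string      bom_encoding;    // from the byte-order mark
    std::string      meta_encoding;   // from a valid charset tag (prescan)
    std::string      user_encoding;   // passed by the caller
    sniff_confidence user_confidence; // ignored when user_encoding is empty
};

// Models which encoding ends up being used.
inline std::string effective_encoding(const sniff_input& in)
{
    // Step 1: a BOM always wins.
    if (!in.bom_encoding.empty()) return in.bom_encoding;

    // A certain caller-supplied encoding skips the prescan entirely.
    if (!in.user_encoding.empty() && in.user_confidence == sniff_confidence::certain)
        return in.user_encoding;

    // Step 2: prescan result (no encoding passed, or a tentative one).
    if (!in.meta_encoding.empty()) return in.meta_encoding;

    // A tentative encoding is the fallback when the prescan finds nothing.
    if (!in.user_encoding.empty()) return in.user_encoding;

    // Step 3: implementation-defined default.
    return "UTF-8";
}
```

For example, with no BOM, a charset tag naming windows-1251 and Big5 passed as certain, the model returns Big5; with Big5 passed as tentative, it returns windows-1251.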

If your program displays HTML files from the web, it is recommended to detect the HTTP encoding: it is not unusual for web pages to specify the encoding only in the HTTP header, or for the meta encoding to differ from the HTTP encoding (the HTTP encoding takes precedence in this case).
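Detecting the HTTP encoding amounts to reading the charset parameter of the Content-Type header value. The following is a minimal sketch of that extraction, assuming the header value is already available as a string; charset_from_content_type and trim_ws are hypothetical helpers, not part of litehtml, and the parser is deliberately simple (it does not handle a ';' inside a quoted parameter value).

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Removes leading and trailing spaces and tabs.
inline std::string trim_ws(const std::string& s)
{
    const char* ws = " \t";
    std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Extracts the charset parameter from a Content-Type header value,
// e.g. "text/html; charset=utf-8". Returns "" if there is none.
inline std::string charset_from_content_type(const std::string& header_value)
{
    std::size_t start = header_value.find(';');  // skip the media type
    while (start != std::string::npos) {
        std::size_t end = header_value.find(';', start + 1);
        std::string param = trim_ws(header_value.substr(
            start + 1, end == std::string::npos ? std::string::npos : end - start - 1));

        std::size_t eq = param.find('=');
        if (eq != std::string::npos) {
            std::string name = trim_ws(param.substr(0, eq));
            // Parameter names are case-insensitive.
            std::transform(name.begin(), name.end(), name.begin(),
                           [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
            if (name == "charset") {
                std::string value = trim_ws(param.substr(eq + 1));
                // Strip optional surrounding double quotes.
                if (value.size() >= 2 && value.front() == '"' && value.back() == '"')
                    value = value.substr(1, value.size() - 2);
                return value;
            }
        }
        start = end;
    }
    return "";
}
```

The returned name still has to be matched against the encoding labels before it can be passed to createFromString with confidence certain.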
