bug #10983 [DomCrawler] Fixed charset detection in html5 meta charset tag (77web)

This PR was squashed before being merged into the 2.3 branch (closes #10983).

Discussion
----------

[DomCrawler] Fixed charset detection in html5 meta charset tag

| Q             | A
| ------------- | ---
| Bug fix?      | yes
| New feature?  | no
| BC breaks?    | no
| Deprecations? | no
| Tests pass?   | yes
| Fixed tickets | N/A
| License       | MIT

This may seem minor for languages written in ASCII, but it is critical for Japanese.
Many Japanese websites encoded in SJIS declare "Shift_JIS" as their encoding. The previous pattern did not accept the underscore, so the charset was captured as "Shift".

Commits
-------

172e752 [DomCrawler] Fixed charset detection in html5 meta charset tag
Fabien Potencier 2014-05-27 00:15:18 +02:00
commit cff410507f
2 changed files with 7 additions and 1 deletion

@@ -108,8 +108,10 @@ class Crawler extends \SplObjectStorage
             }
         }

+        // http://www.w3.org/TR/encoding/#encodings
+        // http://www.w3.org/TR/REC-xml/#NT-EncName
         if (null === $charset &&
-            preg_match('/\<meta[^\>]+charset *= *["\']?([a-zA-Z\-0-9]+)/i', $content, $matches)) {
+            preg_match('/\<meta[^\>]+charset *= *["\']?([a-zA-Z\-0-9_:.]+)/i', $content, $matches)) {
             $charset = $matches[1];
         }
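To illustrate the effect of the pattern change, here is a minimal standalone sketch that runs both patterns from the hunk above against an HTML5 meta charset tag. The helper `detectMetaCharset()` is hypothetical and exists only for this illustration; it is not part of DomCrawler.

```php
<?php
// Hypothetical helper (not part of DomCrawler): returns the first capture
// group of $pattern in $content, or null when the pattern does not match.
function detectMetaCharset(string $pattern, string $content): ?string
{
    return preg_match($pattern, $content, $matches) ? $matches[1] : null;
}

// Pattern before the fix: letters, digits and hyphen only.
$old = '/\<meta[^\>]+charset *= *["\']?([a-zA-Z\-0-9]+)/i';
// Pattern after the fix: additionally accepts "_", ":" and ".".
$new = '/\<meta[^\>]+charset *= *["\']?([a-zA-Z\-0-9_:.]+)/i';

$html = '<html><head><meta charset="Shift_JIS"></head><body></body></html>';

var_dump(detectMetaCharset($old, $html)); // string(5) "Shift"
var_dump(detectMetaCharset($new, $html)); // string(9) "Shift_JIS"
```

With the old pattern the capture stops at the underscore, so the detected name is the truncated "Shift" rather than the declared "Shift_JIS".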

@@ -232,6 +232,10 @@ EOF
         $crawler = new Crawler();
         $crawler->addContent('<html><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><span>中文</span></html>');
         $this->assertEquals('中文', $crawler->filterXPath('//span')->text(), '->addContent() guess wrong charset');
+
+        $crawler = new Crawler();
+        $crawler->addContent(mb_convert_encoding('<html><head><meta charset="Shift_JIS"></head><body>日本語</body></html>', 'SJIS', 'UTF-8'));
+        $this->assertEquals('日本語', $crawler->filterXPath('//body')->text(), '->addContent() can recognize "Shift_JIS" in html5 meta charset tag');
     }

     /**