[DomCrawler] fix handling of schemes by Link::getUri()

A link (an anchor tag with an href attribute) crawled by the Crawler
can contain any valid URI, including mailto: links.

Currently this is not correctly supported by Link::getUri.
URIs that do not start with 'http' are treated as relative URIs
and appended to the base URI. This leads to strange URIs like this:
http://foo.com/mailto:foo@bar.com

Fixed Link::getUri to treat any URI with a scheme part as an
absolute URI. Updated the unit tests to cover this.
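
The difference between the two checks can be sketched in isolation. The
function names hasScheme() and startsWithHttp() are hypothetical and not part
of DomCrawler. The sketch assumes PHP 7 or later for the type declarations,
and it uses is_string() rather than the commit's `null !==` comparison so that
a false return from parse_url() on a badly malformed URI is also rejected.

```php
<?php

/**
 * Hypothetical helper (not part of DomCrawler): reports whether a URI
 * carries its own scheme and can therefore be returned unchanged.
 */
function hasScheme(string $uri): bool
{
    // parse_url() returns null when the requested component is absent and
    // false when the URI cannot be parsed; only a string is a real scheme.
    return is_string(parse_url($uri, PHP_URL_SCHEME));
}

/**
 * The previous check, reproduced for comparison: a plain prefix test.
 */
function startsWithHttp(string $uri): bool
{
    return 0 === strpos($uri, 'http');
}

// Sample URIs taken from the test data in this commit.
foreach (array('mailto:foo@bar.com', 'http://login.foo.com/foo', '?a=b') as $uri) {
    printf(
        "%-26s old: %-5s new: %s\n",
        $uri,
        var_export(startsWithHttp($uri), true),
        var_export(hasScheme($uri), true)
    );
}
```

With the prefix test, mailto:foo@bar.com is classified as relative and gets
appended to the base URI. With the scheme test it is classified as absolute,
while ?a=b stays relative under both checks.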
Matthijs van den Bos 2013-02-28 12:27:10 +01:00
parent 83382bc798
commit 8f8ba380d6
2 changed files with 3 additions and 1 deletions

@@ -89,7 +89,7 @@ class Link
         $uri = trim($this->getRawUri());
         // absolute URL?
-        if (0 === strpos($uri, 'http')) {
+        if (null !== parse_url($uri, PHP_URL_SCHEME)) {
             return $uri;
         }

@@ -93,6 +93,8 @@ class LinkTest extends \PHPUnit_Framework_TestCase
             array('?a=b', 'http://localhost/bar/', 'http://localhost/bar/?a=b'),
             array('http://login.foo.com/foo', 'http://localhost/bar/', 'http://login.foo.com/foo'),
             array('https://login.foo.com/foo', 'https://localhost/bar/', 'https://login.foo.com/foo'),
+            array('mailto:foo@bar.com', 'http://localhost/foo', 'mailto:foo@bar.com'),
             array('?foo=2', 'http://localhost?foo=1', 'http://localhost?foo=2'),
             array('?foo=2', 'http://localhost/?foo=1', 'http://localhost/?foo=2'),