[COMPOSER] Add new php-ffmpeg package
This commit is contained in:
470
vendor/zetacomponents/document/tests/files/xhtml/xpath/s_004_detect_url_in_texts.html
vendored
Normal file
470
vendor/zetacomponents/document/tests/files/xhtml/xpath/s_004_detect_url_in_texts.html
vendored
Normal file
@@ -0,0 +1,470 @@
|
||||
<?xml version="1.0"?>
|
||||
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 Strict//EN"
|
||||
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
|
||||
<html
|
||||
xmlns="http://www.w3.org/1999/xhtml"
|
||||
xml:lang="en"
|
||||
lang="en">
|
||||
<head>
|
||||
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
|
||||
|
||||
<meta name="description" content="Personal homepage of Kore Nordmann. Contains information about his mainly PHP related projects with some political rants in his blog." />
|
||||
<meta name="keywords" content="Image_3D, Image_Curve, Image_Turtle, Rays, k.Portal, k.bot, irc, irc bot, graphics, 3d" />
|
||||
<meta name="language" content="en" />
|
||||
<meta name="author" content="Kore Nordmann" />
|
||||
<meta name="date" content="Sat, 28 Jun 2008 21:17:17 +0200" />
|
||||
<meta name="robots" content="all" />
|
||||
|
||||
<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
|
||||
<meta name="DC.title" content="Detecting URLs with PCRE" />
|
||||
<meta name="DC.creator" content="Kore Nordmann" />
|
||||
<meta name="DC.date" content="Sat, 28 Jun 2008 21:17:17 +0200" />
|
||||
<meta name="DC.rights" content="CC by-sa" />
|
||||
|
||||
<link rel="Stylesheet" type="text/css" href="/styles/screen.css" media="screen" />
|
||||
<link rel="Stylesheet" type="text/css" href="/styles/print.css" media="print" />
|
||||
|
||||
<link rel="icon" href="/favicon.png" type="image/png" />
|
||||
|
||||
<link rel="alternate" type="application/rss+xml" title="Detecting URLs with PCRE" href="/blog/detect_url_in_texts.rss" />
|
||||
|
||||
<title>Detecting URLs with PCRE - Kore Nordmann - PHP / Projects / Politics</title>
|
||||
</head>
|
||||
<body>
|
||||
<div class="frame">
|
||||
<h1>
|
||||
<a href="/" title="Kore Nordmann - PHP / Projects / Politics">
|
||||
Kore Nordmann - PHP / Projects / Politics
|
||||
</a>
|
||||
</h1>
|
||||
|
||||
<ul class="mainnav">
|
||||
<li class="selected">
|
||||
<a href="/blog.html" title="Blog">
|
||||
Blog
|
||||
</a>
|
||||
</li>
|
||||
<li >
|
||||
<a href="/projects.html" title="Projects">
|
||||
Projects
|
||||
</a>
|
||||
</li>
|
||||
<li >
|
||||
<a href="/portfolio.html" title="Portfolio">
|
||||
Portfolio
|
||||
</a>
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
<div class="box" id="navigation">
|
||||
<h2>Contents</h2>
|
||||
<ul>
|
||||
<li class="requested">
|
||||
<a href="/blog.html" title="Blog">Blog</a>
|
||||
<ul>
|
||||
<li >
|
||||
<a href="/blog/php.html" title="PHP">PHP</a>
|
||||
</li>
|
||||
<li >
|
||||
<a href="/blog/phpug.html" title="PHP Usergroup">PHP Usergroup</a>
|
||||
</li>
|
||||
<li >
|
||||
<a href="/blog/projects.html" title="Project news">Project news</a>
|
||||
</li>
|
||||
<li >
|
||||
<a href="/blog/articles.html" title="Articles">Articles</a>
|
||||
</li>
|
||||
<li >
|
||||
<a href="/blog/irc_bot.html" title="k.Bot - PHP IRC Bot">k.Bot - PHP IRC Bot</a>
|
||||
</li>
|
||||
<li >
|
||||
<a href="/blog/busimess.html" title="busimess.org">busimess.org</a>
|
||||
</li>
|
||||
<li >
|
||||
<a href="/blog/personal.html" title="Personal & Website">Personal & Website</a>
|
||||
</li>
|
||||
<li >
|
||||
<a href="/blog/receipts.html" title="Rezepte">Rezepte</a>
|
||||
</li>
|
||||
<li >
|
||||
<a href="/blog/tagging.html" title="Tagging">Tagging</a>
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</li>
|
||||
<li >
|
||||
<a href="/projects/projects.html" title="Projects">Projects</a>
|
||||
<ul>
|
||||
<li >
|
||||
<a href="/projects/phpillow/index.html" title="PHPillow">PHPillow</a>
|
||||
</li>
|
||||
<li >
|
||||
<a href="/projects/image_3d/index.html" title="Image 3D">Image 3D</a>
|
||||
</li>
|
||||
<li >
|
||||
<a href="/projects/kaforkl/index.html" title="KaForkL">KaForkL</a>
|
||||
</li>
|
||||
<li >
|
||||
<a href="/projects/k_bot/index.html" title="k.Bot">k.Bot</a>
|
||||
</li>
|
||||
<li >
|
||||
<a href="/projects/code_snippets/index.html" title="Code snippets">Code snippets</a>
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</li>
|
||||
<li >
|
||||
<a href="/portfolio.html" title="Portfolio">Portfolio</a>
|
||||
</li>
|
||||
<li >
|
||||
<a href="/about_me.html" title="About me">About me</a>
|
||||
</li>
|
||||
<li >
|
||||
<a href="/photos/index.html" title="Photos">Photos</a>
|
||||
<ul>
|
||||
<li >
|
||||
<a href="/photos/experiments.html" title="Experimental pictures">Experimental pictures</a>
|
||||
</li>
|
||||
<li >
|
||||
<a href="/photos/norway.html" title="Norway">Norway</a>
|
||||
</li>
|
||||
<li >
|
||||
<a href="/photos/industriekultur.html" title="Industriekultur">Industriekultur</a>
|
||||
</li>
|
||||
<li >
|
||||
<a href="/photos/misc.html" title="Miscellaneous pictures">Miscellaneous pictures</a>
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</li>
|
||||
<li >
|
||||
<a href="/impressum.html" title="Legal info">Legal info</a>
|
||||
</li>
|
||||
|
||||
</ul>
|
||||
</div>
|
||||
|
||||
<div
|
||||
class="content"
|
||||
xml:lang="en"
|
||||
lang="en">
|
||||
|
||||
<div class="path">
|
||||
<div class="breadcrumbs">
|
||||
|
||||
|
||||
|
||||
<a href="/.html">Start</a>
|
||||
» <a href="/blog.html">Blog</a>
|
||||
» <a href="/blog/detect_url_in_texts.html">Detecting URLs with PCRE</a>
|
||||
</div>
|
||||
|
||||
<a href="/blog/detect_url_in_texts.rdf">
|
||||
RDF
|
||||
</a>
|
||||
|
|
||||
<a href="/blog/detect_url_in_texts.html">
|
||||
HTML
|
||||
</a>
|
||||
|
|
||||
<a href="/blog/detect_url_in_texts.txt">
|
||||
Text
|
||||
</a>
|
||||
|
|
||||
<a href="/blog/detect_url_in_texts.rss">
|
||||
Feed
|
||||
</a>
|
||||
</div>
|
||||
|
||||
|
||||
<div id="detect_url_in_texts">
|
||||
<h2>Detecting URLs with PCRE</h2><p>From time to time I experience the issue that I should detect URLs in some text, while neither the URLs are standard conform (regarding the used characters), nor the URLs are strictly separated from other stuff by whitespaces or something. Now <a href="http://www.derickrethans.nl/">Derick</a> asked me to provide him with a regular expression for that, and I finally wrote some, which should work in most cases:</p><code>(
|
||||
(?:^|[\s,.!?])
|
||||
(?# Ignore matching braces around the URL)
|
||||
(<)?
|
||||
(\[)?
|
||||
(\()?
|
||||
(?# Ignore quoting around the URL)
|
||||
([\'"]?)
|
||||
|
||||
(?# Actually match the URL)
|
||||
(?P<url>https?://[^\s]*?)
|
||||
\4
|
||||
(?(3)\))
|
||||
(?(2)\])
|
||||
(?(1)>)
|
||||
|
||||
(?# Ignore common punctuation after the URL)
|
||||
[.,?!]?(?:\s|$)
|
||||
)xm</code><p>Sadly invalid characters are not always encoded, and also you can't expect to have only matching braces in the URLs, but still user like to write something like:</p><blockquote>Check out my Blog ($url)!</blockquote><p>In which case the braces are obviously not part of the actual URL so you should skip them, the same for the other brace types.</p><p>The regular expression uses conditional subpatterns to check for those matching braces before and after an URL and ignores them, when they are found. The same for quotes. Often URLs are followed by some markup, which also shouldn't be included in the actual URL, which is also ignored by this regular expression, but still - even not valid - characters like commas are included in the URL, if used there.</p><a name="issues"></a><h3>Issues</h3><p>There are two issues, which are still not really solveable by a regular expression I think, but additions and suggestions would be really welcome:</p><ol><li><p>PCRE does not reuse the end markers (?:\s|$) as start markers for the next URL, and I see no way to get the regular expression working without them. This means, that two URLs, only separated by one whitespace, would be detected when calling preg_match_all. You can still call preg_match() in a while-loop, though and remove all URLs from the text, after you found them.</p></li><li><p>Some users tend to use braces for subsentences, where one brace may end right after the URL, like this:</p><blockquote>Hi there (Check out my blog at $url)!</blockquote><p>Where the closing brace after the URL won't be removed, because there is no opening URL right before the URL.</p><p>I don't think this is fixable, because you can't expect the user to have only matching braces in his sentences, nor can you expect that for URLs itself. So we can just guess, what will be there more common problem - ignoring closing braces at the end of URLs, or users writing such sentences...</p></li></ol><p>Still I think this regular expression might be useful to you, feel free to use it where ever you might find it useful. As a german I am not allowed to put something under public domain, but I grant anyone the right to use this for any purpose, without any conditions, unless such conditions are required by law.</p>
|
||||
|
||||
</div>
|
||||
|
||||
<ul class="bookmark">
|
||||
<li>
|
||||
<a href="http://del.icio.us/post?url=http://kore-nordmann.de/blog/detect_url_in_texts.txt">
|
||||
<img src="/images/favicons/delicious.png" width="16" height="16"
|
||||
alt="Bookmark at del.icio.us"
|
||||
title="Bookmark blog entry at del.icio.us" />
|
||||
</a>
|
||||
<a href="http://digg.com/submit?phase=2&url=http://kore-nordmann.de/blog/detect_url_in_texts.txt">
|
||||
<img src="/images/favicons/digg.png" width="16" height="16"
|
||||
alt="Digg it!"
|
||||
title="Digg blog entry" />
|
||||
</a>
|
||||
<a href="http://yigg.de/neu?exturl=http://kore-nordmann.de/blog/detect_url_in_texts.txt">
|
||||
<img src="/images/favicons/yiggit.png" width="16" height="16"
|
||||
alt="Yigg it!"
|
||||
title="Yigg blog entry" />
|
||||
</a>
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
<h3>Comments</h3>
|
||||
<ul class="comments">
|
||||
<li>
|
||||
<a name="comment_1"></a>
|
||||
<h4>
|
||||
<a href="http://enobrev.com/" title="Mark Armendariz">
|
||||
<strong>Mark Armendariz</strong>
|
||||
</a>
|
||||
at Sat, 21 Jun 2008 23:19:47 +0200
|
||||
</h4>
|
||||
<p>
|
||||
I've been using this for years, which has been incredibly successful for me:<br />
|
||||
<br />
|
||||
'/(?P<protocol>(?:(?:f|ht)tp|https):\/\/)?<br />
|
||||
(?P<domain>(?:(?!-)<br />
|
||||
(?P<sld>[a-zA-Z\d\-]+)(?<!-)<br />
|
||||
[\.]){1,2}<br />
|
||||
(?P<tld>(?:[a-zA-Z]{2,}\.?){1,}){1,}<br />
|
||||
|<br />
|
||||
(?P<ip>(?:(?(?<!\/)\.)(?:25[0-5]|2[0-4]\d|[01]?\d?\d)){4})<br />
|
||||
)<br />
|
||||
(?::(?P<port>\d{2,5}))?<br />
|
||||
(?:\/<br />
|
||||
(?P<script>[~a-zA-Z\/.0-9-_]*)?<br />
|
||||
(?:\?(?P<parameters>[=a-zA-Z+%&0-9,.\/_ -]*))?<br />
|
||||
)?<br />
|
||||
(?:\#(?P<anchor>[=a-zA-Z+%&0-9._]*))?/x';<br />
|
||||
<br />
|
||||
it has an optional protocol (which you can make mandatory by removing the ? at the end of the 1st line), and names all the parts (protocol, domain, sld, tld, ip, port, script, parameters, anchor).<br />
|
||||
<br />
|
||||
You can include internal ones using a 'servername' like so:<br />
|
||||
'|(?P<servername>[a-zA-Z\d\-]*[a-zA-Z\d][^:\/]?)'<br />
|
||||
<br />
|
||||
after the '<ip>' line.<br />
|
||||
<br />
|
||||
Mark<br />
|
||||
|
||||
</p>
|
||||
<a class="permalink" href="#comment_1">
|
||||
Link to comment
|
||||
</a>
|
||||
</li>
|
||||
<li>
|
||||
<a name="comment_2"></a>
|
||||
<h4>
|
||||
<a href="http://kore-nordmann.de" title="Kore">
|
||||
<strong>Kore</strong>
|
||||
</a>
|
||||
at Sat, 21 Jun 2008 23:43:46 +0200
|
||||
</h4>
|
||||
<p>
|
||||
I wrote similar ones following the specification of relevant RFC, but this is actually not the point of the regular expression mentioned above.<br />
|
||||
<br />
|
||||
The above one does not try to detect the parts of regular expressions, I found the PHP function parse_url() more useful (and more readable) for that task, but from filtering URLs out of random text. Your regular expression misses that part and does not accept quite common URL chrarcters like () and ;.<br />
|
||||
<br />
|
||||
But anyways - thanks for sharing that regular expression.
|
||||
</p>
|
||||
<a class="permalink" href="#comment_2">
|
||||
Link to comment
|
||||
</a>
|
||||
</li>
|
||||
<li>
|
||||
<a name="comment_3"></a>
|
||||
<h4>
|
||||
<a href="http://enobrev.com/" title="Mark Armendariz">
|
||||
<strong>Mark Armendariz</strong>
|
||||
</a>
|
||||
at Mon, 23 Jun 2008 03:05:04 +0200
|
||||
</h4>
|
||||
<p>
|
||||
Good call about the extra characters. I had recently added semicolons and commas, but hadn't thought to add parentheses (thanks for the suggestion!). The regex I gave can be used to filter them out, but now i realize what you're showing in your post. I'd originally misread it (sorry).<br />
|
||||
<br />
|
||||
As for punctuation surrounding a url, I imagine you could get rid of anything that is "touching" a url. Any surrounding text that is not a \s (or even \b) would likely be associated with that url and would likely do well to be filtered out as well.
|
||||
</p>
|
||||
<a class="permalink" href="#comment_3">
|
||||
Link to comment
|
||||
</a>
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
<form action="/blog/detect_url_in_texts.dyn" method="post">
|
||||
<fieldset>
|
||||
<label for="blog_author"><strong>Author:</strong></label>
|
||||
<input name="author" type="text" id="blog_author" />
|
||||
<label for="blog_homepage">Homepage:</label>
|
||||
<input name="homepage" type="text" id="blog_homepage" />
|
||||
<!-- This input field is only for cactching spam bots, should be hidden
|
||||
using CSS -->
|
||||
<input name="url" type="text" id="blog_url" />
|
||||
<label for="blog_comment"><strong>Comment:</strong></label>
|
||||
<textarea name="comment" id="blog_comment" rows="10" cols="50"></textarea>
|
||||
<label for="blog_submit">Submit:</label>
|
||||
<input name="submit" type="submit" id="blog_submit" value="Send comment" />
|
||||
|
||||
<legend>Add new comment</legend>
|
||||
</fieldset>
|
||||
</form>
|
||||
|
||||
<p>
|
||||
Fields with bold names are mandatory.
|
||||
</p>
|
||||
|
||||
</div>
|
||||
|
||||
<div class="box" id="search">
|
||||
<h2>Search</h2>
|
||||
<form action="/search.html" method="post">
|
||||
<fieldset>
|
||||
<legend>Search</legend>
|
||||
<label for="search_phrase">Search phrase</label>
|
||||
<input type="text" id="search_phrase" name="phrase" />
|
||||
<input type="submit" name="search" value="Search" />
|
||||
</fieldset>
|
||||
</form>
|
||||
</div>
|
||||
|
||||
<div class="box" id="info">
|
||||
|
||||
<ul>
|
||||
<li class="revision">
|
||||
Revision 4<br />
|
||||
|
||||
<ul>
|
||||
<li>
|
||||
<p>
|
||||
Revision 488
|
||||
by kore
|
||||
at Sat, 28 Jun 2008 21:17:17 +0200
|
||||
</p>
|
||||
<p class="message">
|
||||
- s/butt/but/<br />
|
||||
|
||||
</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>
|
||||
Revision 478
|
||||
by kore
|
||||
at Sat, 21 Jun 2008 23:46:36 +0200
|
||||
</p>
|
||||
<p class="message">
|
||||
- Fixed typo. changed license remark<br />
|
||||
|
||||
</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>
|
||||
Revision 477
|
||||
by kore
|
||||
at Sat, 21 Jun 2008 20:38:48 +0200
|
||||
</p>
|
||||
<p class="message">
|
||||
- Replaced tabs by whitespaces<br />
|
||||
|
||||
</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>
|
||||
Revision 476
|
||||
by kore
|
||||
at Sat, 21 Jun 2008 20:37:20 +0200
|
||||
</p>
|
||||
<p class="message">
|
||||
- Published blog article about regular expression for URL detection<br />
|
||||
|
||||
</p>
|
||||
</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li class="spacer"> </li>
|
||||
|
||||
<li>By <a href="" onclick="returnMailLink( this, 'kore-nordmann', 'de', 'website' );">Kore Nordmann</a></li>
|
||||
|
||||
<li>At Sat, 28 Jun 2008 21:17:17 +0200</li>
|
||||
|
||||
<li>License: <a href="http://creativecommons.org/licenses/by-sa/3.0/deed.de" title="Creative Commons Attribution-ShareAlike 3.0">CC by-sa</a></li>
|
||||
</ul>
|
||||
</div>
|
||||
|
||||
<div class="box" id="books">
|
||||
<h2>eZ Components</h2>
|
||||
<a id="book_ezc" href="http://www.amazon.de/gp/product/3836210738?ie=UTF8&tag=korenordmap07-21&linkCode=as2&camp=1638&creative=6742&creativeASIN=3836210738" title="eZ Components">
|
||||
<img src="/images/ez_components.png" width="67" height="90" alt="eZ Components" />
|
||||
</a>
|
||||
<h2>Exploring PHP</h2>
|
||||
<a id="book_expl" href="http://www.amazon.de/gp/product/3935042957?ie=UTF8&tag=korenordmap07-21&linkCode=as2&camp=1638&creative=6742&creativeASIN=3935042957" title="Exploring PHP">
|
||||
<img src="/images/exploring_php.png" width="62" height="90" alt="Exploring PHP" />
|
||||
</a>
|
||||
</div>
|
||||
|
||||
<div class="box">
|
||||
<a href="http://norro.de/Spam/Pot1/+/hp.honeypot.simpleformspam=" title="Spam honeypot - NOT for humans">Spam honeypot</a>
|
||||
</div>
|
||||
|
||||
<div class="box" id="thanks">
|
||||
<h2>Amazon wishlist</h2>
|
||||
<ul>
|
||||
<li>
|
||||
<a id="t_amazon" href="http://www.amazon.de/gp/registry/?type=wishlist&id=4MA8PIIHYC7E" title="Amazon wishlist">
|
||||
<img src="/images/thanks/amazon_wishlist.png" alt="Amazon wishlist" />
|
||||
</a>
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
<h2>Powered by</h2>
|
||||
<ul>
|
||||
<li>
|
||||
<a id="t_lighttpd" href="http://lighttpd.net" title="Lighttpd">
|
||||
<img src="/images/thanks/lighttpd.png" alt="Powered by Lighttpd" />
|
||||
</a>
|
||||
</li>
|
||||
<li>
|
||||
<a id="t_wcv" href="http://wcv.kore-nordmann.de" title="Web Content Viewer">
|
||||
<img src="/images/thanks/wcv.png" alt="Powered by Web Content Viewer" />
|
||||
</a>
|
||||
</li>
|
||||
<li>
|
||||
<a id="t_php" href="http://php.net" title="The PHP Group">
|
||||
<img src="/images/thanks/php.png" alt="Powered by PHP" />
|
||||
</a>
|
||||
</li>
|
||||
<li>
|
||||
<a id="t_svn" href="http://subversion.tigris.org" title="Subversion">
|
||||
<img src="/images/thanks/svn.png" alt="Powered by Subversion" />
|
||||
</a>
|
||||
</li>
|
||||
<li>
|
||||
<a id="t_gentoo" href="http://gentoo.org" title="Gentoo Foundation">
|
||||
<img src="/images/thanks/gentoo.png" alt="Powered by Gentoo" />
|
||||
</a>
|
||||
</li>
|
||||
</ul>
|
||||
</div>
|
||||
|
||||
<div class="clear"></div>
|
||||
|
||||
<div class="footer">
|
||||
Revision <strong>4</strong> by
|
||||
<strong><a href="" onclick="returnMailLink( this, 'kore-nordmann', 'de', 'website' );">Kore Nordmann</a></strong>
|
||||
at <strong>Sat, 28 Jun 2008 21:17:17 +0200</strong>
|
||||
licensed under <strong><a href="http://creativecommons.org/licenses/by-sa/3.0/deed.de" title="Creative Commons Attribution-ShareAlike 3.0">CC by-sa</a></strong>
|
||||
<br />
|
||||
Displayed with <a href="http://web-content-viewer.org/" title="Web Content Viewer">WCV</a>
|
||||
licensed under <a href="http://www.gnu.org/licenses/gpl.txt" title="GNU General Public License - Version 3">GPLv3</a>
|
||||
</div>
|
||||
</div>
|
||||
</body>
|
||||
</html>
|
||||
162
vendor/zetacomponents/document/tests/files/xhtml/xpath/s_004_detect_url_in_texts.xml
vendored
Normal file
162
vendor/zetacomponents/document/tests/files/xhtml/xpath/s_004_detect_url_in_texts.xml
vendored
Normal file
@@ -0,0 +1,162 @@
|
||||
<?xml version="1.0" encoding="utf-8"?>
|
||||
<article xmlns="http://docbook.org/ns/docbook">
|
||||
<section>
|
||||
<sectioninfo>
|
||||
<abstract>
|
||||
<para>Personal homepage of Kore Nordmann. Contains information about his mainly PHP related projects with some political rants in his blog.</para>
|
||||
</abstract>
|
||||
<author>Kore Nordmann</author>
|
||||
<date>Sat, 28 Jun 2008 21:17:17 +0200</date>
|
||||
<title>Detecting URLs with PCRE</title>
|
||||
<author>Kore Nordmann</author>
|
||||
<date>Sat, 28 Jun 2008 21:17:17 +0200</date>
|
||||
<copyright>CC by-sa</copyright>
|
||||
<title>Detecting URLs with PCRE - Kore Nordmann - PHP / Projects / Politics</title>
|
||||
</sectioninfo>
|
||||
<para><ulink url="/.html">Start</ulink> » <ulink url="/blog.html">Blog</ulink> » <ulink url="/blog/detect_url_in_texts.html">Detecting URLs with PCRE</ulink></para>
|
||||
<para><ulink url="/blog/detect_url_in_texts.rdf"> RDF </ulink> | <ulink url="/blog/detect_url_in_texts.html"> HTML </ulink> | <ulink url="/blog/detect_url_in_texts.txt"> Text </ulink> | <ulink url="/blog/detect_url_in_texts.rss"> Feed </ulink></para>
|
||||
<section>
|
||||
<title>Detecting URLs with PCRE</title>
|
||||
<para>From time to time I experience the issue that I should detect URLs in some text, while neither the URLs are standard conform (regarding the used characters), nor the URLs are strictly separated from other stuff by whitespaces or something. Now <ulink url="http://www.derickrethans.nl/">Derick</ulink> asked me to provide him with a regular expression for that, and I finally wrote some, which should work in most cases:</para>
|
||||
<literallayout>(
|
||||
(?:^|[\s,.!?])
|
||||
(?# Ignore matching braces around the URL)
|
||||
(<)?
|
||||
(\[)?
|
||||
(\()?
|
||||
(?# Ignore quoting around the URL)
|
||||
([\'"]?)
|
||||
|
||||
(?# Actually match the URL)
|
||||
(?P<url>https?://[^\s]*?)
|
||||
\4
|
||||
(?(3)\))
|
||||
(?(2)\])
|
||||
(?(1)>)
|
||||
|
||||
(?# Ignore common punctuation after the URL)
|
||||
[.,?!]?(?:\s|$)
|
||||
)xm</literallayout>
|
||||
<para>Sadly invalid characters are not always encoded, and also you can't expect to have only matching braces in the URLs, but still user like to write something like:</para>
|
||||
<blockquote>
|
||||
<para>Check out my Blog ($url)!</para>
|
||||
</blockquote>
|
||||
<para>In which case the braces are obviously not part of the actual URL so you should skip them, the same for the other brace types.</para>
|
||||
<para>The regular expression uses conditional subpatterns to check for those matching braces before and after an URL and ignores them, when they are found. The same for quotes. Often URLs are followed by some markup, which also shouldn't be included in the actual URL, which is also ignored by this regular expression, but still - even not valid - characters like commas are included in the URL, if used there.</para>
|
||||
<section>
|
||||
<title>Issues</title>
|
||||
<para>There are two issues, which are still not really solveable by a regular expression I think, but additions and suggestions would be really welcome:</para>
|
||||
<orderedlist>
|
||||
<listitem>
|
||||
<para>PCRE does not reuse the end markers (?:\s|$) as start markers for the next URL, and I see no way to get the regular expression working without them. This means, that two URLs, only separated by one whitespace, would be detected when calling preg_match_all. You can still call preg_match() in a while-loop, though and remove all URLs from the text, after you found them.</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>Some users tend to use braces for subsentences, where one brace may end right after the URL, like this:</para>
|
||||
<blockquote>
|
||||
<para>Hi there (Check out my blog at $url)!</para>
|
||||
</blockquote>
|
||||
<para>Where the closing brace after the URL won't be removed, because there is no opening URL right before the URL.</para>
|
||||
<para>I don't think this is fixable, because you can't expect the user to have only matching braces in his sentences, nor can you expect that for URLs itself. So we can just guess, what will be there more common problem - ignoring closing braces at the end of URLs, or users writing such sentences...</para>
|
||||
</listitem>
|
||||
</orderedlist>
|
||||
<para>Still I think this regular expression might be useful to you, feel free to use it where ever you might find it useful. As a german I am not allowed to put something under public domain, but I grant anyone the right to use this for any purpose, without any conditions, unless such conditions are required by law.</para>
|
||||
</section>
|
||||
</section>
|
||||
<itemizedlist>
|
||||
<listitem>
|
||||
<mediaobject>
|
||||
<imageobject>
|
||||
<imagedata fileref="/images/favicons/delicious.png" width="16" depth="16"/>
|
||||
</imageobject>
|
||||
<textobject>
|
||||
<para>Bookmark at del.icio.us</para>
|
||||
</textobject>
|
||||
</mediaobject>
|
||||
<mediaobject>
|
||||
<imageobject>
|
||||
<imagedata fileref="/images/favicons/digg.png" width="16" depth="16"/>
|
||||
</imageobject>
|
||||
<textobject>
|
||||
<para>Digg it!</para>
|
||||
</textobject>
|
||||
</mediaobject>
|
||||
<mediaobject>
|
||||
<imageobject>
|
||||
<imagedata fileref="/images/favicons/yiggit.png" width="16" depth="16"/>
|
||||
</imageobject>
|
||||
<textobject>
|
||||
<para>Yigg it!</para>
|
||||
</textobject>
|
||||
</mediaobject>
|
||||
</listitem>
|
||||
</itemizedlist>
|
||||
<section>
|
||||
<title>Comments</title>
|
||||
<itemizedlist>
|
||||
<listitem>
|
||||
<section>
|
||||
<title><ulink url="http://enobrev.com/"><emphasis Role="strong">Mark Armendariz</emphasis></ulink> at Sat, 21 Jun 2008 23:19:47 +0200 </title>
|
||||
<literallayout class="normal"> I've been using this for years, which has been incredibly successful for me:
|
||||
|
||||
'/(?P<protocol>(?:(?:f|ht)tp|https):\/\/)?
|
||||
(?P<domain>(?:(?!-)
|
||||
(?P<sld>[a-zA-Z\d\-]+)(?<!-)
|
||||
[\.]){1,2}
|
||||
(?P<tld>(?:[a-zA-Z]{2,}\.?){1,}){1,}
|
||||
|
|
||||
(?P<ip>(?:(?(?<!\/)\.)(?:25[0-5]|2[0-4]\d|[01]?\d?\d)){4})
|
||||
)
|
||||
(?::(?P<port>\d{2,5}))?
|
||||
(?:\/
|
||||
(?P<script>[~a-zA-Z\/.0-9-_]*)?
|
||||
(?:\?(?P<parameters>[=a-zA-Z+%&0-9,.\/_ -]*))?
|
||||
)?
|
||||
(?:\#(?P<anchor>[=a-zA-Z+%&0-9._]*))?/x';
|
||||
|
||||
it has an optional protocol (which you can make mandatory by removing the ? at the end of the 1st line), and names all the parts (protocol, domain, sld, tld, ip, port, script, parameters, anchor).
|
||||
|
||||
You can include internal ones using a 'servername' like so:
|
||||
'|(?P<servername>[a-zA-Z\d\-]*[a-zA-Z\d][^:\/]?)'
|
||||
|
||||
after the '<ip>' line.
|
||||
|
||||
Mark
|
||||
</literallayout>
|
||||
<para>
|
||||
<link linked="comment_1"> Link to comment </link>
|
||||
</para>
|
||||
</section>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<section>
|
||||
<title><ulink url="http://kore-nordmann.de"><emphasis Role="strong">Kore</emphasis></ulink> at Sat, 21 Jun 2008 23:43:46 +0200 </title>
|
||||
<literallayout class="normal"> I wrote similar ones following the specification of relevant RFC, but this is actually not the point of the regular expression mentioned above.
|
||||
|
||||
The above one does not try to detect the parts of regular expressions, I found the PHP function parse_url() more useful (and more readable) for that task, but from filtering URLs out of random text. Your regular expression misses that part and does not accept quite common URL chrarcters like () and ;.
|
||||
|
||||
But anyways - thanks for sharing that regular expression. </literallayout>
|
||||
<para>
|
||||
<link linked="comment_2"> Link to comment </link>
|
||||
</para>
|
||||
</section>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<section>
|
||||
<title><ulink url="http://enobrev.com/"><emphasis Role="strong">Mark Armendariz</emphasis></ulink> at Mon, 23 Jun 2008 03:05:04 +0200 </title>
|
||||
<literallayout class="normal"> Good call about the extra characters. I had recently added semicolons and commas, but hadn't thought to add parentheses (thanks for the suggestion!). The regex I gave can be used to filter them out, but now i realize what you're showing in your post. I'd originally misread it (sorry).
|
||||
|
||||
As for punctuation surrounding a url, I imagine you could get rid of anything that is "touching" a url. Any surrounding text that is not a \s (or even \b) would likely be associated with that url and would likely do well to be filtered out as well. </literallayout>
|
||||
<para>
|
||||
<link linked="comment_3"> Link to comment </link>
|
||||
</para>
|
||||
</section>
|
||||
</listitem>
|
||||
</itemizedlist>
|
||||
<emphasis Role="strong">Author:</emphasis>
|
||||
<para>Homepage:</para>
|
||||
<para><emphasis Role="strong">Comment:</emphasis>Submit:</para>
|
||||
<para>Add new comment</para>
|
||||
<para> Fields with bold names are mandatory. </para>
|
||||
</section>
|
||||
</section>
|
||||
</article>
|
||||
Reference in New Issue
Block a user