PHP Function to extract anything between two tags

Here’s a small snippet of code that I use very frequently when parsing web pages or other content for specific items. For example, suppose a web page contains the data you need in a form like this:

<html><body><h1>ABC</h1>
.... <!-- a lot of code -->
<div id="myNewsItem">This is my news, and I am interested in extracting this out</div>
.... <!-- and the HTML code continues on -->
</body></html>

I would like to extract the text inside the DIV whose id is “myNewsItem”.

Here’s the PHP function to do the extraction:

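function SimMyExtract($string, $openingTag, $closingTag)
{
    $string = trim($string);
    // strpos() returns false (not 0) when the tag is absent, so compare strictly.
    $start = strpos($string, $openingTag);
    if ($start === false) {
        return false; // opening tag not found
    }
    $start += strlen($openingTag);
    // Search for the closing tag only after the opening tag.
    $end = strpos($string, $closingTag, $start);
    if ($end === false) {
        return false; // closing tag not found
    }
    return substr($string, $start, $end - $start);
}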

Usage for above example:

SimMyExtract( $content, '<div id="myNewsItem">', '</div>' );

You can call it repeatedly (or recursively) to extract items from a list of similar tags, i.e. when the same tag is used a number of times on the same page. To get more power, I use it in conjunction with regular expressions. I will spare you the details of RegEx here, but it is extremely powerful, and I love the way RegEx is implemented in PHP (both the Perl-compatible preg functions and the POSIX ereg functions)…
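As a rough illustration of extracting every item from a run of similar tags, here is a minimal sketch that loops with strpos() and a moving offset. The helper name SimMyExtractAll and the sample markup are made up for this example and are not part of the original function:

```php
<?php
// Hypothetical helper (not from the original post): collect every piece of
// text found between $openingTag and $closingTag, in document order.
function SimMyExtractAll($string, $openingTag, $closingTag)
{
    $items  = array();
    $offset = 0;

    // Guard against empty tags, which would never advance the offset.
    if ($openingTag === '' || $closingTag === '') {
        return $items;
    }

    while (($start = strpos($string, $openingTag, $offset)) !== false) {
        $start += strlen($openingTag);
        $end = strpos($string, $closingTag, $start);
        if ($end === false) {
            break; // unterminated tag: stop scanning
        }
        $items[] = substr($string, $start, $end - $start);
        // Continue searching after the closing tag just consumed.
        $offset = $end + strlen($closingTag);
    }

    return $items;
}

// Sample markup invented for illustration.
$html  = '<ul><li>First</li><li>Second</li><li>Third</li></ul>';
$items = SimMyExtractAll($html, '<li>', '</li>');
print_r($items);
```

Like the single-item version, this does plain string matching, so it assumes the tags are not nested inside one another.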

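For instance, the body of the same function could be reduced to a single regular-expression match. This version uses preg_match() rather than ereg(), because the ereg family was removed in PHP 7; preg_quote() escapes the tags so they are matched literally, and the capture group returns only the text between them:

preg_match('/' . preg_quote($openingTag, '/') . '(.*?)' . preg_quote($closingTag, '/') . '/s', $content, $result);
return isset($result[1]) ? $result[1] : false;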

The point is that RegEx can capture many occurrences in one pass and extract them, but you need to master it first. As an interesting exercise, try extracting all URLs (the contents of the HREF attributes) from a web page.
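One possible starting point for that exercise is sketched below with preg_match_all(). The function name SimExtractHrefs, the pattern, and the sample markup are assumptions made for illustration; the pattern only handles quoted href values on anchor tags and is not a full HTML parser:

```php
<?php
// Hypothetical helper: return the values of quoted href attributes
// found on <a> tags, in the order they appear.
function SimExtractHrefs($html)
{
    // (["\']) captures the opening quote, \1 requires the same closing quote.
    // \shref requires whitespace before the attribute name, so attributes
    // such as data-href are not picked up.
    $pattern = '/<a\b[^>]*?\shref\s*=\s*(["\'])(.*?)\1/is';
    preg_match_all($pattern, $html, $matches);
    return $matches[2]; // second capture group: the URL itself
}

// Sample markup invented for illustration.
$page  = '<p><a href="https://example.com/">One</a> and '
       . '<a class="x" href=\'/about\'>Two</a></p>';
$links = SimExtractHrefs($page);
print_r($links);
```

For messy real-world pages, PHP's DOM extension is usually a sturdier choice than a regular expression, but the pattern above keeps the exercise in RegEx territory.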
