PHP Function to extract anything between two tags

Here’s a small snippet of code that I use very frequently when parsing web pages or other content for specific items. For example, suppose a web page contains the data you need in a form like this:

<html>
<body>
<h1>ABC</h1>
.... <!-- A long list of code -->
<div id="myNewsItem">This is my news, and I am interested in extracting this out</div>
.... <!-- and the HTML code continues on -->
</body>
</html>

I would like to extract the text inside the DIV tag with the id “myNewsItem”.

Here’s the PHP function to do the extraction:

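function SimMyExtract($string, $openingTag, $closingTag)
{
    $string = trim($string);

    // strpos() returns false when nothing is found, so compare with ===.
    $open = strpos($string, $openingTag);
    if ($open === false) {
        return false; // opening tag not found
    }
    $start = $open + strlen($openingTag);

    // Search for the closing tag only after the opening tag.
    $end = strpos($string, $closingTag, $start);
    if ($end === false) {
        return false; // closing tag not found
    }

    return substr($string, $start, $end - $start);
}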

Usage for the above example:

SimMyExtract( $content, '<div id="myNewsItem">', '</div>' );

You can use it recursively, calling it again on the rest of the string after each match, to extract the items in a list of similar tags (i.e. when the same tag is used a number of times on the same page). For more power, I use it in conjunction with regular expressions. I will spare you the details of RegEx, but it is extremely powerful, and I love the way it is implemented in PHP (both the Perl-compatible PREG functions and the POSIX EREG functions)…

For instance the same function could be reduced to:

$pattern = '/' . preg_quote($openingTag, '/') . '(.*?)' . preg_quote($closingTag, '/') . '/s';
return preg_match($pattern, $content, $result) ? $result[1] : false;

The point is that RegEx can capture many occurrences in one pass and extract them, so it is worth mastering. Beyond that, an interesting exercise is to extract all URLs (the contents of HREF attributes) from a webpage.
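As a starting point for that exercise, here is a minimal sketch using preg_match_all(). The function name extractHrefs and the sample markup are assumptions of mine, not part of the original snippet, and the pattern only handles quoted href values on <a> tags. Unquoted attributes or unusual markup would need a real HTML parser.

```php
<?php
// Hypothetical helper (not from the original post): collect the value of
// every quoted href attribute found on <a> tags in $html.
function extractHrefs($html)
{
    // (["\']) captures the opening quote; \1 requires the same closing quote.
    $pattern = '/<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1/is';
    if (!preg_match_all($pattern, $html, $matches)) {
        return array(); // no links found, or the pattern failed
    }
    return $matches[2]; // the second capture group holds the URLs
}

// Sample markup, made up for illustration.
$sample = '<a href="http://example.com/a">A</a> <a class="x" href=\'/b\'>B</a>';
print_r(extractHrefs($sample));
```

The modifiers matter here: "i" makes HREF and href equivalent, and "s" lets a tag span several lines.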

6 Responses to PHP Function to extract anything between two tags

greyfade says:

February 19, 2009 at 9:41 pm

You could simplify this even more by not passing the closing tag at all:

$closingTag = preg_replace("/\\<\\s*(\\S*)[^>]*\\>/", "</\\1>", $openingTag);
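To make that concrete, here is a small sketch that wraps the expression above in a helper. The name closingTagFor is hypothetical, and it assumes $openingTag holds a single well-formed opening tag.

```php
<?php
// Hypothetical helper: derive "</div>" from '<div id="myNewsItem">'.
function closingTagFor($openingTag)
{
    // Capture the tag name (the run of non-space characters after "<")
    // and rebuild the whole tag as a closing tag.
    return preg_replace("/\\<\\s*(\\S*)[^>]*\\>/", "</\\1>", $openingTag);
}

echo closingTagFor('<div id="myNewsItem">'); // prints </div>
```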

Dave says:

April 30, 2009 at 1:19 pm

It would seem that this is not going to work with nested tags … as in the case of
<html>
<body>
<h1>ABC</h1>
.... <!-- A long list of code -->
<div id="myNewsItem">This is my news,
<div class="pullquote">the rain in spain</div>
and I am
interested in extracting this out</div>
.... <!-- and the HTML code continues on -->
</body>
</html>

It would appear that the regex will extract
This is my news,
<div class="pullquote">the rain in spain</div>
and miss
and I am
interested in extracting this out

Am I mistaken? If not, is there a way to make this work as intended?

Asim says:

April 30, 2009 at 1:26 pm

Perhaps. But when you have a list of nodes that you want to traverse and extract, it’s better to use the DOM extension with XPath queries.
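A minimal sketch of that approach, assuming PHP's bundled DOM extension (DOMDocument and DOMXPath) is available. The helper name innerHtmlById is hypothetical, and it assumes the id contains no double quotes. Because the parser builds a tree, the nested pullquote DIV from the example above no longer cuts the result short.

```php
<?php
// Hypothetical helper: return the inner HTML of the first element whose id
// attribute equals $id, or false when no such element exists.
function innerHtmlById($html, $id)
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//*[@id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return false; // not found
    }

    // Serialise each child node so that nested tags are kept intact.
    $inner = '';
    foreach ($nodes->item(0)->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

// Sample markup, adapted from the nested example above.
$html = '<html><body><h1>ABC</h1>'
      . '<div id="myNewsItem">This is my news, '
      . '<div class="pullquote">the rain in spain</div> '
      . 'and I am interested in extracting this out</div>'
      . '</body></html>';

echo innerHtmlById($html, 'myNewsItem');
```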

ishika says:

July 28, 2009 at 9:32 pm

Please tell me how you used it recursively. I am using it in that manner, but it returns only one result. I need all the data that appears between the tags, each time they occur.

Please help.

Below is my code:

<?php

function SimMyExtract($string, $openingTag, $closingTag)
{
$string = trim($string);
$start = intval(strpos($string,$openingTag)
+ strlen($openingTag));
$end = intval(strpos($string,$closingTag));

if($start == 0 || $end ==0)
return false; // not found

$mytext = substr($string,$start, $end - $start);
return $mytext;
}

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,"http://www.lonare.com");
curl_setopt($ch, CURLOPT_TIMEOUT, 30); //timeout after 30 seconds
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
$result = curl_exec ($ch);
curl_close ($ch);

$text = strip_tags($result);

$text = str_replace("Today,", "<div class=\"mydata\">", $text);

$text = str_replace("FML#", " </div> ", $text);

$text1 = SimMyExtract($text, '<div class="mydata">', '</div>');

echo $text1."<br>";

?>
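One way to get every match rather than only the first is to keep searching from the end of the previous match. This is a sketch under assumptions: the name extractAllBetween and the sample string are made up for illustration, and nested tags are not handled.

```php
<?php
// Hypothetical helper: return every substring found between $openingTag and
// $closingTag, scanning the string from left to right.
function extractAllBetween($string, $openingTag, $closingTag)
{
    $results = array();
    if ($openingTag === '' || $closingTag === '') {
        return $results; // empty delimiters would never advance the scan
    }

    $offset = 0;
    while (($open = strpos($string, $openingTag, $offset)) !== false) {
        $start = $open + strlen($openingTag);
        $end   = strpos($string, $closingTag, $start);
        if ($end === false) {
            break; // unmatched opening tag: stop scanning
        }
        $results[] = substr($string, $start, $end - $start);
        $offset    = $end + strlen($closingTag); // continue after this match
    }

    return $results;
}

// Sample input, made up for illustration.
$text = '<div class="mydata">first</div> x <div class="mydata">second</div>';
print_r(extractAllBetween($text, '<div class="mydata">', '</div>'));
```

The third argument to strpos() is what makes this work: each search starts where the previous match ended instead of at the beginning of the string.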

Dan says:

March 8, 2010 at 8:53 am

I can’t seem to make this piece of code work…
This is my PHP file:

<?PHP
function SimMyExtract($string, $openingTag, $closingTag)
{
$string = trim($string);
$start = intval(strpos($string,$openingTag) + strlen($openingTag));
$end = intval(strpos($string,$closingTag));

if($start == 0 || $end ==0)
return false; // not found

$mytext = substr($string,$start, $end - $start);
return $mytext;
}
?>

<html>
<body>
<h1>ABC</h1>
.... <!-- A long list of code -->
<div id="myNewsItem">This is my news, and I am
interested in extracting this out</div>
.... <!-- and the HTML code continues on -->

<?PHP
$exdata = SimMyExtract($text, '<div id="myNewsItem">', '</div>');
echo $exdata;
?>

</body>
</html>

Any ideas?

