Automation always beats manual labor. So, it's been a while since I've posted anything to the old GameSpot blog, but I thought now was as good a time as any to jump back in with another mind-numbingly long soliloquy... When last we left our intrepid hero, he had started a new gig at an Internet marketing firm in the foothills of Appalachia. I'm happy to report that things have gone brilliantly these last five months - I've made some fast friends, and the job is a blast.

I ran across a fairly interesting problem at work this afternoon: I needed to generate an XML sitemap for a site to be crawled by Google. Prior to today, I'd never had a project that required this sort of thing, so I asked around. Basically, I was told that people just recycled the same class over and over, which meant manually copying and pasting the URLs into a function that would generate the XML. That just didn't seem like a great method to me, so I tried to come up with something a little more flexible.

Essentially, my thought was this - we already have a nice HTML sitemap on practically every site, so why couldn't I just use that file to generate the XML version? Well, unfortunately it isn't that straightforward, because the HTML sitemap actually uses PHP to dynamically generate the links for data-driven content, and those links don't even exist until the page is loaded into a browser and the scripting is evaluated. After an hour or so of thought, I think I came up with a fairly elegant solution. Code now, description later...
ob_start();
$ch = curl_init(URL_BASE.'/'.$this->sitemapFile);
curl_exec($ch); // the fetched page is echoed into the buffer, not the browser
curl_close($ch);
$response = ob_get_contents();
ob_end_clean();
// keep only the markup between the sitemap's unique begin and end markers
$begin = strpos($response, $this->sitemapBegin) + strlen($this->sitemapBegin);
$end = strpos($response, $this->sitemapEnd);
$response = substr($response, $begin, $end - $begin);
// capture the href value of every anchor tag in that slice
preg_match_all('/<a\s[^>]*href=["\']([^"\']+)["\']/i', $response, $links, PREG_PATTERN_ORDER);
foreach ($links[1] as $link) {
    // test conditions for generating xml
}
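A quick note on why the loop iterates over $links[1]: with PREG_PATTERN_ORDER, $links[0] holds the full matched anchor tags and $links[1] holds the first capture group - the href values themselves. A toy run with an ordinary href-capturing pattern (the sample HTML and URLs here are made up for illustration):

```php
<?php
// Toy run showing preg_match_all's PREG_PATTERN_ORDER layout:
// $links[0] = full matches, $links[1] = first capture group (the hrefs).
$html = '<a href="/about.php">About</a> <a href="/contact.php">Contact</a>';
preg_match_all('/<a\s[^>]*href=["\']([^"\']+)["\']/i', $html, $links, PREG_PATTERN_ORDER);
// $links[1] is array('/about.php', '/contact.php')
```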
So... I remember now how much I dislike using this editor - it cripples quite a few HTML tags and keywords... Anyway, my solution initializes PHP's output buffering, uses curl to execute the page and generate the dynamic sitemap links (which are written to the buffer instead of the browser), saves the output buffer's contents to a string, and then cleans and closes the buffer. Next, I define the beginning and ending points of the sitemap. I made this decision because I figured that 99% of the time, the sitemap links will be inside of a div defining their style (the beginning point) and immediately followed by some other div like the page footer (the ending point). Using this convention, I can strip out just the HTML containing the sitemap links and throw away the rest. E.g.,
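The capture step is easy to see in miniature - anything echoed between ob_start() and ob_end_clean() lands in the buffer rather than being sent to the browser, and curl_exec() echoes the fetched page the same way by default. A minimal sketch (the echoed markup is a stand-in for the real fetched page):

```php
<?php
// Toy demonstration of the capture step: output produced between
// ob_start() and ob_end_clean() is held in the buffer, not sent out.
ob_start();
echo '<div id="sitemap"><a href="/about.php">About</a></div>';
$response = ob_get_contents();
ob_end_clean();
// $response now holds the echoed markup as a string
```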
...blah,blah,blah [div id="sitemap"] (aka the beginning string)
STUFF I NEED
[/div] [div id="something_else"] (aka the ending string) blah,blah,blah...
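That extraction is just two strpos() calls and a substr(). Here's a self-contained sketch of it against a hypothetical page (the div ids and URLs are invented for the example):

```php
<?php
// Hypothetical page following the convention: the sitemap links sit
// between a unique opening div and whatever markup comes right after.
$page  = '<p>header stuff</p><div id="sitemap">'
       . '<a href="/about.php">About</a><a href="/contact.php">Contact</a>'
       . '</div><div id="footer">footer stuff</div>';
$sitemapBegin = '<div id="sitemap">';
$sitemapEnd   = '</div><div id="footer">';

// start just past the begin marker, stop at the end marker
$begin = strpos($page, $sitemapBegin) + strlen($sitemapBegin);
$end   = strpos($page, $sitemapEnd);
$slice = substr($page, $begin, $end - $begin);
// $slice now holds only the anchor tags - the "STUFF I NEED"
```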
Finally, I find all matches to a regular expression that captures the contents of each link's href attribute and store them in an array. The last step - and really the only one that should require modification in the future - is defining the cases that determine what sort of flags to use in the XML for each link. In this particular situation, there were 5 cases out of over 120 links. It might have taken a couple of hours of thought and coding time, but I think this algorithm will prove very reusable for subsequent XML sitemap generation, so long as the page conforms to the convention of surrounding the HTML/PHP sitemap links with unique HTML tags. At the very least, I'm glad I spent my time trying to improve upon the status quo, and I hope it turns out that I've helped save a lot of time in the future.
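The post never shows the XML-generation half of the loop, so here is a hedged sketch of what the body might look like, mapping each extracted link to a <url> entry in the sitemaps.org format. The priority rule (boost the homepage, default everything else) and the domain are invented stand-ins - the real version had five site-specific cases:

```php
<?php
// Hypothetical final step: turn the extracted links into sitemap XML.
// The $links array, the priority rule, and example.com are all assumed
// for illustration; the real code had 5 site-specific cases.
$links = array('/index.php', '/about.php', '/products/widgets.php');

$xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
$xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
foreach ($links as $link) {
    // one case per kind of page; here, just "homepage vs everything else"
    $priority = ($link === '/index.php') ? '1.0' : '0.5';
    $xml .= "  <url>\n";
    $xml .= '    <loc>http://www.example.com' . htmlspecialchars($link) . "</loc>\n";
    $xml .= "    <priority>$priority</priority>\n";
    $xml .= "  </url>\n";
}
$xml .= '</urlset>';
```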