Chunk Data for Easier Scraping

Before you spend an hour writing some elaborate regular expression, try chunking the data and then matching several short expressions against it, which makes for a much simpler (and faster) scrape. Assuming you're using PHP, after you've pulled the data (e.g. with file_get_contents()), use preg_replace() with the following regex to strip out the whitespace and chunk the data into a much easier "soup" to work with.
[ \t\r\n]
Here's an example of how to use this with PHP:
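<?php
// Fetch the raw HTML of the page to scrape.
$data = file_get_contents("http://www.adammoro.com/");
if ($data === false) {
    exit("Could not fetch the page.\n");
}
// Strip every space, tab, carriage return, and newline.
$data = preg_replace('~[ \t\r\n]~', '', $data);
print_r($data);
?>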
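To make the "chunk, then match" idea concrete, here is a minimal sketch. The sample markup and the helper name chunk_and_match() are assumptions made up for illustration; they are not part of the script above. Once the whitespace is gone, tags sit directly next to each other, so a short pattern is usually enough.

```php
<?php
// Hypothetical sample markup; a real scrape would use file_get_contents().
$html = "<ul>\n  <li>\n    Apple Pie\n  </li>\n  <li>\n\tPlum Tart\n  </li>\n</ul>";

// Hypothetical helper: strip whitespace, then run one simple pattern.
function chunk_and_match(string $html, string $pattern): array
{
    // Same character class as above: spaces, tabs, CRs, and newlines.
    $soup = preg_replace('~[ \t\r\n]~', '', $html);
    preg_match_all($pattern, $soup, $matches);
    // Index 1 holds the first capture group of every match (if any).
    return $matches[1] ?? [];
}

$items = chunk_and_match($html, '~<li>(.*?)</li>~');
print_r($items);
```

Note that the spaces inside the text are stripped as well, so the items come back as "ApplePie" and "PlumTart". That is fine for numbers, IDs, and URLs, but probably not what you want for free text.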
Um, yeah, except if you're going to parse HTML, you'd be better off using a library like phpQuery (jQuery for PHP).
Or you could replace /\W+/ with a single space instead of the empty string, to leave delimiters intact.
and … PHP? really?
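A rough sketch of the single-space variation, in PHP to match the post. It assumes the goal is to collapse whitespace and so uses \s+; the \W+ pattern is shown alongside for comparison, since it also replaces punctuation and angle brackets. The sample string is made up for illustration.

```php
<?php
// Made-up sample input for illustration.
$html = "<p>\n  Hello,\t scraping   world!\r\n</p>";

// Collapse each run of whitespace into one space so words stay separated.
$spaced = trim(preg_replace('~\s+~', ' ', $html));

// For comparison: \W+ replaces every run of non-word characters,
// which also removes the tags' angle brackets and the punctuation.
$wordsOnly = trim(preg_replace('~\W+~', ' ', $html));

echo $spaced, "\n";    // <p> Hello, scraping world! </p>
echo $wordsOnly, "\n"; // p Hello scraping world p
```

Which of the two is more useful depends on whether the tags are still needed for matching afterwards.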
Hi Kip, thanks for the tip. The only other language I’ve used for scraping jobs is Python, which was much faster performance-wise; it just took me a lot longer to write the scripts. Which language(s) would you suggest?
Python is my favourite. I switched from PHP to Python as well. It was a little bit awkward to learn at first (significant whitespace, and that ‘self’ parameter in class methods everywhere), but after two or three quick scripts I got the hang of it. I wouldn’t go back to PHP; I think Python is much “cleaner” and more readable, and to be fair, the string processing functions are much more straightforward.
Plus, you can use PyQuery for doing jQuery-style manipulation and extraction of HTML documents.
Another language, which was basically built for this kind of work, is of course Perl. It’s a pretty good language, and has the added bonus of making you feel like an oldskool elite hacker when coding.
But Perl is an even older language than PHP, and it shows. The field of scripting languages has come a long way since then, and we now know better how to let you “simply tell the computer what to do” as much as possible, which is why I prefer Python.
I’ll definitely be looking into the libraries you suggested for the next ones. Thanks again for pointing them out.
Perhaps one day I’ll make a full switch to Python for these types of jobs. Recently, however, the majority of my work has been almost entirely marketing-related, so that day will likely come later rather than sooner. But hey, at least that’s a choice I’ve made and not an order fulfilled to satisfy “the first three.”