
OP here: I don't necessarily disagree with what you've said here. The 'Computational Journalism' class is an elective at Stanford, and while some of the students are from the journalism program, others come from more technical fields such as CS or MSE. The programming itself is not a huge challenge for them...but beyond the exposure to civic issues and data policy, for some of them this is the first time they've worked with things like web scraping and public-facing APIs (as was the case for me in my computer engineering degree program, though that was years ago).

So there's a decent-sized group of technically apt students at Stanford who are interested in journalism. My advice to them would be to at least intern as traditional reporters, since there's no better way to learn the work of developing access and sources (as well as interviewing and writing on deadline!).

That said, there are opportunities to quickly explore a domain if you're skilled at data collection and analysis. One of the best examples I can think of is this writeup by a couple of data reporters about their investigation into Florida cops:

> This was a case where the government had this wonderful, informative dataset and they weren't using it at all except to compile the information. I remember talking to one person at an office and saying: 'How could you guys not know some of this? In five minutes of (SQL) queries you know everything about these officers?' They basically said it wasn't their job. That left a huge opportunity for us.

This scenario -- in which the data is freely available but no one thinks to simply collect it into a spreadsheet -- is just the tip of the iceberg of data work that needs to be done...but I'd be lying if I said that this kind of low-hanging fruit was rare...There's plenty of information out there that's just begging for efficient examination...to paraphrase a classic adage, the problem today is not that we lack information, but we lack ways of filtering and understanding it.

I'll leave aside the debate over how worthwhile it is to try to teach programming to traditional journalists -- it's definitely not easy work...but there's a great deal of potential in teaching CS students about civic and journalistic issues and specifically how to apply their skills. I turned out OK after first spending a few years as a newspaper reporter, but I think I missed some opportunities to hit bigger...back then, I had no concept of mixing my programming background with my journalism.

Yesterday a friend of mine linked me to a fictional web serial that he was reading and enjoying, but said he could be enjoying it more if it were available as a Kindle book. The author hasn't yet made one available and has asked that fan-made versions not be linked publicly. That said, it's a very long story and would be much easier to read in a dedicated reading app, so I built my own Kindle version to enjoy. This post is the story of how I built it.

Step 1: Source Analysis

The first step of any kind of web scraping is to understand your target. Here's what the first blog post looks like (with different content):

[Screenshot: an example chapter page]

After browsing around I found a table of contents, but since all of the posts were linked together with 'Next Chapter' pointers it seemed easier to just walk those. The other interesting thing here is that there's a comment section that I didn't really care about.

Step 2: Choose Your Tools


The next stage of web scraping is to choose the appropriate tools. I started with just curl and probably could have gotten pretty far with it, but I knew the DOM futzing I wanted to do would require something more powerful later on. At the moment Ruby is where I turn for most things, so naturally I picked Nokogiri. The first example on the Nokogiri docs page is actually a web scraping example, and that's basically what I cribbed from. Here's the initial version of the scraping function:
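
(What follows is a minimal sketch rather than the exact code; the scrape_chapter name and the div.entry-content selector are illustrative stand-ins for whatever the real script used.)

```ruby
require 'open-uri'
require 'nokogiri'

# Fetch one chapter, pull out the post body, and find the 'Next Chapter' link.
# 'div.entry-content' is a stand-in for whatever element wraps the post body.
def scrape_chapter(url)
  doc = Nokogiri::HTML(URI.open(url))

  content   = doc.css('div.entry-content')
  next_link = doc.at_css("a[title='Next Chapter']")

  [content, next_link && next_link['href']]
end
```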

Ruby has a built-in capability for opening URLs as readable files with the open-uri standard library module. Because of various problems with Nokogiri's unicode handling I learned about in previous web scraping experiences, the best thing to do is to pass a string to Nokogiri instead of passing it the actual IO handle. Setting the encoding explicitly is also a best practice.
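
In code, that means reading the body into a string first and spelling out the encoding (assuming the site serves UTF-8), something like:

```ruby
# Read the response into a String and parse with an explicit encoding,
# rather than handing Nokogiri the IO object that open-uri returns.
html = URI.open(url).read
doc  = Nokogiri::HTML(html, nil, 'UTF-8')
```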

Then it's a simple matter of using Nokogiri's css selector method to pick out the nodes we're interested in and return them to the caller. The idea is that, since each page is linked to its successor, we can just follow the links.

Step 3: The Inevitable Bugfix Iteration

Of course, it's never that easy. It turns out these links are written by hand, and across hundreds of blog posts there are bound to be some inconsistencies. At some point the author stopped using the title attribute, so instead of my super clever CSS selector a[title='Next Chapter'] I had to switch to grabbing all of the anchor tags and selecting based on their text:
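
(A sketch of that change; variable names are illustrative.)

```ruby
# The title attribute isn't reliable, so look at every anchor's text instead.
next_link = doc.css('a').find { |a| a.text == 'Next Chapter' }
```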

This works great, except that in a few cases there's some whitespace in the text of the anchor node, so I had to switch to a regex:
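
(Again a sketch; the regex just has to tolerate stray whitespace around the words.)

```ruby
# Some anchors have extra whitespace in their text, so match with a regex
# instead of comparing for exact equality.
next_link = doc.css('a').find { |a| a.text =~ /Next\s+Chapter/ }
```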

Another sticking point was that sometimes (but not always) the author used non-ASCII in their URLs. The trick for dealing with possibly-escaped URLs is to check to see if decoding does anything. If it does, it's already escaped and shouldn't be messed with:
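
(A sketch of that check using the standard library's URI parser; the normalize_url name is mine, and the original may well have used the older URI.escape/URI.unescape calls.)

```ruby
require 'uri'

# If unescaping changes the URL, it already contains percent-escapes and
# should be left alone; otherwise escape any non-ASCII characters in it.
def normalize_url(url)
  if URI::DEFAULT_PARSER.unescape(url) == url
    URI::DEFAULT_PARSER.escape(url)
  else
    url
  end
end
```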

Step 4: Repeat As Necessary

Now that we can reliably scrape one URL, it's time to actually follow the links:
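
(A sketch of the driver loop, reusing the helpers sketched above; START_URL is a placeholder for the first chapter's address and 'chapters' is an arbitrary directory name.)

```ruby
require 'fileutils'

START_URL = 'https://example.com/chapter-1' # placeholder for the first chapter's URL

# Follow the chain of 'Next Chapter' links, writing each chapter's body to a
# zero-padded four-digit file name so lexicographic order matches reading order.
FileUtils.mkdir_p('chapters')

url   = START_URL
index = 1

while url
  content, next_url = scrape_chapter(normalize_url(url))
  File.write(format('chapters/%04d.html', index), content.to_html)

  url    = next_url
  index += 1
end
```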

This is pretty simple. Set some initial state, make a directory to put the scraped pages in, then follow each link in turn and write the interesting content out to sequential files. Note that the file names are all four-digit numbers so that the sequence is preserved even with lexicographical sorting.

Step 5: Actually Build The Book

At first I wanted to use Docverter, my project that mashes up pandoc and calibre for building rich documents (including ebooks) out of plain text files. I tried the demo installation first, but that runs on Heroku and repeatedly ran out of memory, so I tried a local installation. That timed out (did I mention that this web serial is also very long?), so instead I just ran pandoc and ebook-convert directly:
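
(Sketched here as a tiny Ruby driver; the file names and flags are illustrative rather than the exact invocation.)

```ruby
# Glue the chapters into a single input file, then shell out to pandoc and
# calibre's ebook-convert.
book_html = Dir.glob('chapters/*.html').sort.map { |f| File.read(f) }.join("\n")
File.write('book.html', book_html)

system('pandoc', 'book.html',
       '--css', 'style.css',
       '--epub-metadata', 'metadata.xml',
       '-o', 'book.epub') or abort('pandoc failed')

system('ebook-convert', 'book.epub', 'book.mobi') or abort('ebook-convert failed')
```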

Pandoc can take multiple input files, but it was easier to manage a single input file on the command line. The stylesheet and metadata XML files are lifted directly from the mmp-builder project that I use to build Mastering Modern Payments, with the authorship information changed appropriately.

In Conclusion, Please Don't Violate Copyright

Making your own ebooks is not hard with the tools that are out there. It's really just a matter of gluing them together with an appropriate amount of duct tape and baling twine.

That said, distributing content that isn't yours without permission directly affects authors, and platform shifting like this is sort of a gray area. The author of this web serial seems to be fine with fan-made ebook editions as long as they don't get distributed, which is why I anonymized this post.

Posted in: Software

Tagged: Programming




