Simple Web Scraping with Ruby: Using Nokogiri and SQLite3
We will review basic HTML and DOM concepts to understand how browsers and other programs interpret an HTML page. Knowing this makes it easier to search for specific elements and to extract both the visible and the invisible text on a webpage.
We will use the following exercise to illustrate these concepts: look at the newest stories on Reddit and, for the top 10 of them, insert into a local database the author, the number of comments and points, the category, the title, and the date it was posted.
We will assume you have a basic understanding of Ruby constructs such as arrays, string manipulation, and methods.
Feel free to email us beforehand at team AT railsschool DOT org if you have read the material and have any questions about it. The material covered in the class is available online:
- What is HTML and the DOM: Part 1: Understanding HTML and Part 2: Converting HTML to a DOM Tree
- Basic SQL queries in SQLite (and Ruby)
If there is a webpage you're dying to scrape, email us about it ahead of time too, and we'll try to build a parser customized for your needs.