GitHub - amykhar/udacity-recipe-search: A project for the Udacity CS101 Class

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
static		static
templates		templates
venv		venv
Procfile		Procfile
README		README
crawler.py		crawler.py
db_lib.pyc		db_lib.pyc
recipe_search.db		recipe_search.db
recipe_search.py		recipe_search.py
requirements.txt		requirements.txt
schema.sql		schema.sql

Repository files navigation

My entry is two parts. The first is primarily the web crawler that we built for the class, with a few modifications.
1. I use BeautifulSoup to help me parse the html for my search engine.
2. Initially I used sqlite to rough out the app. But, I modified it to use MySQL via the tornado.database library
3. Because this search engine is intended to crawl a single site, I am interested primarily in the relative links. To prevent the crawler from heading off to follow FoodNetwork.com's affiliate and twitter links, I have excluded external links from the crawler.
4. My search engine is interested in the ingredients of the recipes. Therefore, I found the html tags used to delineate the ingredients on the page, and I index only those words. However, the crawler does look at the entire page to get links. Because web sites notoriously change these markers, I put them in config variables at the top of the script for easy modification.

The second part is the search engine. I had no experience using Python to show web pages; so there was a bit of a learning curve on this side. In order to keep things simple, I used Flask (http://flask.pocoo.org/) Flask allowed me to quickly get some pages connected to my lookup function. As with the crawler, the lookup function uses Mysql to access the index. The lookup function handles multiple terms, but unlike the homework assignment, order does not matter and the terms do not have to be adjacent.

While I have many years of experience with PHP, MySQL and Apache, I knew nothing about Python before this class. So, deploying the app ended up being the most challenging aspect. I finally ended up putting it on Amazon EC2 using Nginx and wsgi. These directions were the most helpful: https://github.com/d5/docs/wiki/Installing-Flask-on-Amazon-EC2

Todo:
1. Handle boolean operators for the search. Chicken AND Rice NOT Peas, etc.
2. Better display of the links. Have the crawler get the page title and store it in the database, and then render the links using the page title rather than the raw URL.

The code: https://github.com/amykhar/udacity-recipe-search