changes
TheAhmadOsman committed Jan 13, 2019
1 parent 7a496a7 commit 6848a99
Showing 270 changed files with 4,239 additions and 28,184 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -1 +1,2 @@
.DS_Store
*.csv
8 changes: 8 additions & 0 deletions 1-Craigslist Scraper/.gitignore
@@ -0,0 +1,8 @@
*.xml
*.db
*.db-journal
*.pyc
*.pyo
*.csv
*.zip
*.txt
92 changes: 92 additions & 0 deletions 1-Craigslist Scraper/README.md
@@ -0,0 +1,92 @@
# Craigslist Filter
#### In Development...

Craigslist Filter (working title) is a web application built with Python and Flask. It scrapes data about vehicles for sale from every Craigslist region in America and lets users filter the listings by criteria such as city, price, manufacturer, and odometer. This project is currently in development and is not yet complete.

## Installing

This application requires Python 3.6 or greater and several installable modules.

You will need to install Flask, setuptools, pysqlite3, WTForms, lxml, requests_html, Flask-WTF, Flask-Bootstrap, and geopy:

```
pip3 install Flask
pip3 install setuptools
pip3 install pysqlite3
pip3 install wtforms
pip3 install lxml
pip3 install requests_html
pip3 install flask_wtf
pip3 install flask_bootstrap
pip3 install geopy
```

## Deploying

To run this application locally, first run both crawlCities.py and scrapeVehicles.py (or download a cached version [here](https://files.fm/u/p5z4fbkn)) to generate the databases the application uses.

Once those scripts have completed, simply run app.py and paste the address printed in the terminal into your browser.

## Specific Future Implementations

* Remain on the form page when a search yields no results.

* Allow users to specify a search radius and return more specific results when searching by location.

* Add a Google Maps API feature to allow users to browse sales in specific areas.

## Broad Future Implementations

* Integrate the visualization project with the filter so users can generate graphs to help narrow decisions when purchasing cars, e.g. a line graph of the average price of Ford pickups by odometer reading (this code has already been written with pandas; rewriting it with SQL will take some time).

* Login/Logout functionality which allows users to save certain filter combinations and search results.

* Better site layout, less bootstrap-esque and more creative.

* Improved security.

* Frequent automated database updates.

* User-specific sale tracking (price has changed, listing has been removed, etc.).

## Blocked

* Pivot to multiprocessing so many requests can be made at once, dramatically speeding up the scraper (I am worried about Craigslist blacklisting IPs).
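
Until then, a throttled worker pool is one way to parallelize without hammering the site. Below is a minimal sketch, not the project's code: the fetch function is a stand-in (a real version would call something like `s.get("https://{}.craigslist.com".format(cityCode))`), the city codes are illustrative, and threads are used because the work is I/O-bound.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetchCity(cityCode):
    #stand-in for an HTTP request; the sleep simulates network latency
    time.sleep(0.01)
    return cityCode, "<html>listings for {}</html>".format(cityCode)

def fetchAll(cityCodes, workers=4):
    #a small, fixed pool bounds the request rate, lowering the risk of an IP blacklist
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(fetchCity, cityCodes))

results = fetchAll(["kansascity", "chicago", "denver"])
```

Capping `workers` (and optionally sleeping between batches) trades some speed for a much smaller request footprint.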

## Completed Tasks

* Filter implemented.

* Improved filter form including dropdown lists automatically generated by column entries in the database.

* Scraped the map on the listing page to extract more specific location (lat/long) instead of just the region.

* Added a message that alerts a user when their search yields no results.

* Allowed for filtering between two values for fields such as price and odometer.

* Added photos to results page.

* Allowed users to search by any city using latitude and longitude instead of specific Craigslist regions.

* Tracked which cities have been scraped recently to add order to the scraping process.

## Contributors

This application is being developed by Austin Reese.
60 changes: 60 additions & 0 deletions 1-Craigslist Scraper/app.py
@@ -0,0 +1,60 @@
from errHandle import errHandle

errHandle()

#this is where the magic happens; we currently have only one route but that will change

from flask import Flask, render_template
from flask_wtf import FlaskForm
from flask_bootstrap import Bootstrap
from wtforms import StringField, IntegerField, SelectField, validators
from wtforms.validators import Length
from queryForm import queryForm
from queryDropdowns import queryDropdowns
from datetime import datetime

app = Flask(__name__)
app.config['SECRET_KEY'] = "CraigsistFilter"
bootstrap = Bootstrap(app)


class FilterForm(FlaskForm):
    #set up the form; dropdowns is a dictionary of unique column values used to populate the select fields
    dropdowns = queryDropdowns()

    year = datetime.now().year

    city = StringField("City", validators=[Length(max=40)])
    state = SelectField("State", choices=dropdowns["states"], validators=[validators.optional()])
    manufacturer = SelectField("Manufacturer", choices=dropdowns["manufacturer"], validators=[validators.optional()])
    make = StringField("Make", validators=[Length(max=40)])
    condition = SelectField("Condition", choices=dropdowns["condition"], validators=[validators.optional()])
    cylinders = SelectField("Cylinders", choices=dropdowns["cylinders"], validators=[validators.optional()])
    fuel = SelectField("Fuel", choices=dropdowns["fuel"], validators=[validators.optional()])
    transmission = SelectField("Transmission", choices=dropdowns["transmission"], validators=[validators.optional()])
    titleStatus = SelectField("Title Status", choices=dropdowns["titleStatus"], validators=[validators.optional()])
    vin = StringField("VIN", validators=[Length(max=40)])
    drive = SelectField("Drive", choices=dropdowns["drive"], validators=[validators.optional()])
    size = SelectField("Size", choices=dropdowns["size"], validators=[validators.optional()])
    vehicleType = SelectField("Vehicle Type", choices=dropdowns["vehicleType"], validators=[validators.optional()])
    paintColor = SelectField("Paint Color", choices=dropdowns["paintColor"], validators=[validators.optional()])
    priceStart = IntegerField("Minimum Price", validators=[validators.optional(), validators.NumberRange(min=0, max=10000000, message="Please enter a value between 0 and 10,000,000")])
    priceEnd = IntegerField("Maximum Price", validators=[validators.optional(), validators.NumberRange(min=0, max=10000000, message="Please enter a value between 0 and 10,000,000")])
    yearStart = IntegerField("Minimum Year", validators=[validators.optional(), validators.NumberRange(min=1880, max=year + 1, message="Please enter a year between 1880 and {}".format(year + 1))])
    yearEnd = IntegerField("Maximum Year", validators=[validators.optional(), validators.NumberRange(min=1880, max=year + 1, message="Please enter a year between 1880 and {}".format(year + 1))])
    odometerStart = IntegerField("Minimum Odometer", validators=[validators.optional(), validators.NumberRange(min=0, max=10000000, message="Please enter a value between 0 and 10,000,000")])
    odometerEnd = IntegerField("Maximum Odometer", validators=[validators.optional(), validators.NumberRange(min=0, max=10000000, message="Please enter a value between 0 and 10,000,000")])


@app.route('/', methods=['GET', 'POST'])
def index():
    #render index.html with the form passed through as a variable
    form = FilterForm()
    #validate_on_submit() is True when the form is posted and passes validation;
    #we then render search.html with the data fetched by queryForm.py
    if form.validate_on_submit():
        data = queryForm(form)
        return render_template("search.html", data=data)
    return render_template("index.html", form=form)


if __name__ == '__main__':
    app.run(debug=True)
70 changes: 70 additions & 0 deletions 1-Craigslist Scraper/crawlCities.py
@@ -0,0 +1,70 @@
#crawlCities grabs every city on Craigslist

from lxml import html
from datetime import datetime
import requests
import sqlite3

db = sqlite3.connect("cities.db")
curs = db.cursor()
curs.execute("DROP TABLE IF EXISTS cities")
curs.execute("CREATE TABLE IF NOT EXISTS cities(cityId STRING PRIMARY KEY, cityTitle STRING)")

s = requests.Session()


def cityLooper(baseCase):
    start = datetime.now()
    try:
        origin = s.get("https://{}.craigslist.com".format(baseCase))
    except requests.RequestException:
        print("Could not reach {}.craigslist.com, is this link broken?".format(baseCase))
        return None

    tree = html.fromstring(origin.content)
    #each city page on Craigslist has a recommended-cities section; we grab each recommended city
    #from the current city and store it in cityQueue (a set, so we can't have duplicates)
    cityQueue = set(tree.xpath('//li[@class="s"]//a'))
    crawled = set()

    while len(cityQueue) != 0:
        city = cityQueue.pop()
        moreCities, crawled, updated = cityCrawler(city, crawled)
        if updated:
            cityQueue.update(moreCities)
            #difference_update removes entries from cityQueue that are already in crawled
            cityQueue.difference_update(crawled)
            print("Added {}. {} regions crawled through, {} regions in the queue.".format(city.text.title(), len(crawled), len(cityQueue)))
    db.commit()
    db.close()
    end = datetime.now()
    print("Program complete. Run time: {} seconds. File cities.db contains entries for {} regions on craigslist.com".format(int((end - start).total_seconds()), len(crawled)))


def cityCrawler(city, crawled):
    cityCode = city.attrib["href"][2:city.attrib["href"].index(".")]

    if cityCode in crawled:
        #we've already checked this city out, no need to execute anything
        return set(), crawled, False

    #otherwise put the city in the db and fetch the recommended cities from the current target
    curs.execute("INSERT INTO cities(cityId, cityTitle) VALUES(?,?)", (cityCode, city.text))

    try:
        newOrigin = s.get("https://{}.craigslist.com".format(cityCode))
    except requests.RequestException:
        print("Could not reach {}.craigslist.com, is this link broken?".format(cityCode))
        return set(), crawled, False

    crawled.add(cityCode)
    tree = html.fromstring(newOrigin.content)
    #newCities is the set of recommended cities featured on the current city's page
    newCities = set(tree.xpath('//li[@class="s"]//a'))
    return newCities, crawled, True


def main():
    cityLooper("kansascity")


if __name__ == "__main__":
    main()
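
The queue/crawled bookkeeping in cityLooper is a standard graph crawl; in miniature, with a hypothetical recommended-cities map standing in for the links scraped from live pages:

```python
#hypothetical neighbor map: each city points to the cities its page recommends
neighbors = {
    "kansascity": {"stlouis", "omaha"},
    "stlouis": {"kansascity"},
    "omaha": {"kansascity", "denver"},
    "denver": {"omaha"},
}

cityQueue, crawled = {"kansascity"}, set()
while cityQueue:
    city = cityQueue.pop()
    crawled.add(city)
    cityQueue.update(neighbors[city])
    #difference_update drops cities we have already crawled, exactly as in cityLooper
    cityQueue.difference_update(crawled)
```

Because cityQueue is a set and crawled entries are subtracted each pass, every reachable city is visited exactly once.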
12 changes: 12 additions & 0 deletions 1-Craigslist Scraper/errHandle.py
@@ -0,0 +1,12 @@
#confirm the necessary tables exist to run this app

import sqlite3

def errHandle():
    try:
        db = sqlite3.connect("cities.db")
        curs = db.cursor()
        curs.execute("SELECT 1 FROM vehicles LIMIT 1")
        db.close()
    except sqlite3.Error:
        raise EnvironmentError("Please install cities.db from https://files.fm/u/yw247cuc and place it in the current directory")
135 changes: 135 additions & 0 deletions 1-Craigslist Scraper/queryDropdowns.py
@@ -0,0 +1,135 @@
#grab unique column values from the database to populate dropdown menus instead of text boxes in the search form

import sqlite3

#dropdown key -> the vehicles table column it is built from
COLUMNS = {
    "cylinders": "cylinders",
    "fuel": "fuel",
    "titleStatus": "title_status",
    "drive": "drive",
    "vehicleType": "type",
    "paintColor": "paint_color",
    "year": "year",
    "manufacturer": "manufacturer",
    "condition": "condition",
    "size": "size",
    "transmission": "transmission",
}

STATES = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA",
          "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
          "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
          "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
          "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]


def queryDropdowns():
    dropdowns = {}
    db = sqlite3.connect("cities.db")
    curs = db.cursor()
    for key, column in COLUMNS.items():
        curs.execute("SELECT DISTINCT {} FROM vehicles".format(column))
        values = sorted(row[0] for row in curs.fetchall() if row[0] is not None)
        #a leading blank entry lets the user leave the dropdown unfiltered
        dropdowns[key] = [("", "")] + [(value, value) for value in values]
    db.close()
    dropdowns["states"] = [("", "")] + [(state, state) for state in STATES]
    return dropdowns
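
The tuples queryDropdowns builds are the (value, label) pairs that WTForms SelectField choices expect; a tiny illustration with made-up values (the real list comes from the vehicles table):

```python
#made-up fuel choices in the shape queryDropdowns produces
dropdowns = {"fuel": [("", ""), ("diesel", "diesel"), ("electric", "electric"), ("gas", "gas")]}

#the blank pair comes first so an untouched dropdown means "no filter"
values = [value for value, label in dropdowns["fuel"]]
```

Keeping value and label identical is fine here because the raw column strings are already human-readable.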

