Scraping sample text from automated.pythonanywhere.com
install selenium
in the Repl:
you need to update the settings if you didn't fork the author's Repl. Show hidden files:
open replit.nix
add two new lines:
pkgs.chromium
pkgs.chromedriver
close the file, then hide those files again
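For reference, after those edits replit.nix ends up looking something like this (a sketch assuming the default Python template; the exact package list in the forked Repl may differ):

{ pkgs }: {
  deps = [
    pkgs.python38Full   # assumed: whatever Python package the template already lists
    pkgs.chromium
    pkgs.chromedriver
  ];
}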
now you can go ahead and import selenium in your main.py
from selenium import webdriver
then create a driver variable
driver = webdriver.Chrome()
Add options for Selenium and put it all in a function, get_driver:
from selenium import webdriver

def get_driver():
    # Set options to make browsing easier:
    options = webdriver.ChromeOptions()
    # browsers show info bars that may interfere with the script, so disable them
    options.add_argument("disable-infobars")
    # start the browser maximized
    options.add_argument("start-maximized")
    # avoid some issues often encountered on Linux boxes
    options.add_argument("disable-dev-shm-usage")
    # disable Chrome's sandbox (often needed in containers like Replit)
    options.add_argument("no-sandbox")
    # this experimental option helps selenium avoid detection when websites try to detect scripts
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_argument("disable-blink-features=AutomationControlled")
    driver = webdriver.Chrome(options=options)  # pass the options in, or they have no effect
    driver.get("https://automated.pythonanywhere.com/")
    return driver
create a new function, main, to get the elements:

def main():
    driver = get_driver()
    element = driver.find_element_by_xpath("")

find_element_by_xpath finds and returns the element identified by the XPath (note: this method is deprecated and later removed in Selenium 4; the code further down uses find_element(by="xpath", value=...) instead)
to get the XPath for that web element, open Inspect in the browser, right-click the element, then choose Copy > Copy XPath:
result:
/html/body/div[1]/div/h1[1]
error (truncated traceback):
  ...", line 81, in start
    raise WebDriverException(
selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://chromedriver.chromium.org/home
response from instructor
Ardit — Instructor
Hi Michael, the solution is to not create a new Repl, but fork my Repl instead. You can find my Repl links attached as lecture resources. For your convenience, below is link to the Repl which contains the complete code for the lecture "Scraping Simple Text with Selenium":
https://replit.com/@ArditS/Scrape-Simple-Text-with-Selenium-done
my working code in the instructor's forked Repl (no PATH error):
from selenium import webdriver

def get_driver():
    # Set options to make browsing easier:
    options = webdriver.ChromeOptions()
    # browsers show info bars that may interfere with the script, so disable them
    options.add_argument("disable-infobars")
    # start the browser maximized
    options.add_argument("start-maximized")
    # avoid some issues often encountered on Linux boxes
    options.add_argument("disable-dev-shm-usage")
    # disable Chrome's sandbox (often needed in containers like Replit)
    options.add_argument("no-sandbox")
    # this experimental option helps selenium avoid detection when websites try to detect scripts
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_argument("disable-blink-features=AutomationControlled")
    # note: this line is a no-op; executable_path is not a Chrome flag, so add_argument ignores it
    options.add_argument("executable_path='/home/runner/Scrape-Simple-Text-with-Selenium'")
    driver = webdriver.Chrome(options=options)
    driver.get("https://automated.pythonanywhere.com/")
    return driver

def main():
    driver = get_driver()
    element = driver.find_element(by="xpath", value="/html/body/div[1]/div/h1[1]")
    return element.text

print(main())
File "06_scrape_with_selenium.py", line 2, in from selenium import webdriver ModuleNotFoundError: No module named 'selenium'
run
pip install selenium
now you get a local error:
    self.service.start()
  File "C:\ProgramData\Anaconda3\lib\site-packages\selenium\webdriver\common\service.py", line 81, in start
    raise WebDriverException(
selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://chromedriver.chromium.org/home
go to https://chromedriver.chromium.org/downloads and download the chromedriver that matches your Chrome version
to find out what version of Chrome you have:
search "what is my browser" and go to https://www.whatismybrowser.com/
add:
from selenium.webdriver.chrome.service import Service
#location of the chromedriver.exe you just downloaded:
service = Service('C:\\Users\\Matus1976\\Downloads\\chromedriver_win32\\chromedriver.exe')
change
driver = webdriver.Chrome(options=options)
to
driver = webdriver.Chrome(service=service, options=options)
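Putting the local fixes together, get_driver might look like this (a sketch combining the pieces above; the chromedriver path is the download location from the notes, and the no-op executable_path argument is dropped):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

def get_driver():
    options = webdriver.ChromeOptions()
    options.add_argument("disable-infobars")
    options.add_argument("start-maximized")
    options.add_argument("disable-dev-shm-usage")
    options.add_argument("no-sandbox")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_argument("disable-blink-features=AutomationControlled")
    # point Selenium at the chromedriver.exe downloaded above
    service = Service('C:\\Users\\Matus1976\\Downloads\\chromedriver_win32\\chromedriver.exe')
    driver = webdriver.Chrome(service=service, options=options)
    driver.get("https://automated.pythonanywhere.com/")
    return driver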
if you right-click this changing element and Inspect > Copy XPath, you may get only /div, so sometimes this doesn't work
the previous xpath was:
"/html/body/div[1]/div/h1[1]"
/html is the root of the document
/body is the main tag and contains the div tag you grabbed
there are two div tags directly under body, though
so html/body/div[1] was getting the first instance of div under body
and /div/h1[1] the first div and first header under that
so now you want the 2nd header, h1[2]
you can also right-click in Chrome's inspector and choose Copy full XPath, which sometimes works better
so now you are using:

def main():
    driver = get_driver()
    element = driver.find_element(by="xpath", value="/html/body/div[1]/div/h1[2]")
    return element.text
but the returned text doesn't include the number, only:
Average World Temperature Now:
this is because the value doesn't appear for a few seconds; it is filled in by JavaScript after the page loads
so add a pause

import time
...

def main():
    driver = get_driver()
    time.sleep(2)  # pause for 2 seconds
    element = driver.find_element(by="xpath", value="/html/body/div[1]/div/h1[2]")
    return element.text
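A fixed sleep works but is fragile; a more robust alternative (my own addition, not from the lecture) is Selenium's explicit wait, which polls until a condition is true. Here, wait until the heading actually contains a digit:

from selenium.webdriver.support.ui import WebDriverWait

def main():
    driver = get_driver()
    xpath = "/html/body/div[1]/div/h1[2]"
    # wait up to 10 seconds for JavaScript to fill the number into the heading
    WebDriverWait(driver, 10).until(
        lambda d: any(ch.isdigit() for ch in d.find_element(by="xpath", value=xpath).text)
    )
    return driver.find_element(by="xpath", value=xpath).text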
result:
Average World Temperature Now: 28

Let's make a new function that grabs just that number, using split:

>>> "average : 22".split(": ")
['average ', '22']
new program:

from selenium import webdriver
import time

def get_driver():
    # Set options to make browsing easier:
    options = webdriver.ChromeOptions()
    # browsers show info bars that may interfere with the script, so disable them
    options.add_argument("disable-infobars")
    # start the browser maximized
    options.add_argument("start-maximized")
    # avoid some issues often encountered on Linux boxes
    options.add_argument("disable-dev-shm-usage")
    # disable Chrome's sandbox (often needed in containers like Replit)
    options.add_argument("no-sandbox")
    # this experimental option helps selenium avoid detection when websites try to detect scripts
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_argument("disable-blink-features=AutomationControlled")
    # note: this line is a no-op; executable_path is not a Chrome flag
    options.add_argument("executable_path='/home/runner/Scrape-Simple-Text-with-Selenium'")
    driver = webdriver.Chrome(options=options)
    driver.get("https://automated.pythonanywhere.com/")
    return driver

def clean_text(text):
    """Extract only the temperature from the text."""
    # split at ": ", take the second item, and convert it to float
    output = float(text.split(": ")[1])
    return output

def main():
    driver = get_driver()
    time.sleep(2)  # pause for 2 seconds
    element = driver.find_element(by="xpath", value="/html/body/div[1]/div/h1[2]")
    return clean_text(element.text)

print(main())
works
next, get this working locally
your script will log in to this form: https://automated.pythonanywhere.com/login/
the script logs in and then clicks through to the home page
so the script will need the url https://automated.pythonanywhere.com/login/
the username is automated
the pw is automatedautomated
update get_driver to open the login page:

    driver = webdriver.Chrome(options=options)
    driver.get("https://automated.pythonanywhere.com/login/")
    return driver
update the element
we don't need to use XPath; you can use the id if it's there
id="id_username": whenever that's present you can use it

def main():
    driver = get_driver()
    element = driver.find_element(by="id", value="id_username")
    return element.text
also we want to send keys:

def main():
    driver = get_driver()
    element = driver.find_element(by="id", value="id_username").send_keys("automated")
    # note: element is None here, because send_keys() returns None
    return element.text
don't need to assign it or return anything either:

def main():
    driver = get_driver()
    driver.find_element(by="id", value="id_username").send_keys("automated")
    # return element.text

works
now send the password too:

def main():
    driver = get_driver()
    driver.find_element(by="id", value="id_username").send_keys("automated")
    driver.find_element(by="id", value="id_password").send_keys("automatedautomated")
works, but add a sleep to slow things down:

import time

def main():
    driver = get_driver()
    driver.find_element(by="id", value="id_username").send_keys("automated")
    time.sleep(2)
    driver.find_element(by="id", value="id_password").send_keys("automatedautomated")
and send a RETURN:

from selenium.webdriver.common.keys import Keys

def main():
    driver = get_driver()
    driver.find_element(by="id", value="id_username").send_keys("automated")
    time.sleep(2)
    driver.find_element(by="id", value="id_password").send_keys("automatedautomated" + Keys.RETURN)
so the whole program now looks like:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

def get_driver():
    # Set options to make browsing easier
    options = webdriver.ChromeOptions()
    options.add_argument("disable-infobars")
    options.add_argument("start-maximized")
    options.add_argument("disable-dev-shm-usage")
    options.add_argument("no-sandbox")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_argument("disable-blink-features=AutomationControlled")
    driver = webdriver.Chrome(options=options)
    driver.get("https://automated.pythonanywhere.com/login/")
    return driver

def main():
    driver = get_driver()
    driver.find_element(by="id", value="id_username").send_keys("automated")
    time.sleep(2)
    driver.find_element(by="id", value="id_password").send_keys("automatedautomated" + Keys.RETURN)

print(main())
lastly, you can print the current URL to confirm the login worked:

def main():
    driver = get_driver()
    driver.find_element(by="id", value="id_username").send_keys("automated")
    time.sleep(2)
    driver.find_element(by="id", value="id_password").send_keys("automatedautomated" + Keys.RETURN)
    print(driver.current_url)  # <---
Next we want to click on the Home button
we want to use the XPath since it has no id:
/html/body/nav/div/a
find that element and then click on it:

def main():
    driver = get_driver()
    driver.find_element(by="id", value="id_username").send_keys("automated")
    time.sleep(2)
    driver.find_element(by="id", value="id_password").send_keys("automatedautomated" + Keys.RETURN)
    driver.find_element(by="xpath", value="/html/body/nav/div/a").click()
    print(driver.current_url)
works; it prints:
https://automated.pythonanywhere.com/
None
(None is printed because main() has no return statement, so print(main()) prints None)
next we will extract the world temp value from the page after we have logged in
datetime in the REPL:
>>> from datetime import date
>>> today = date.today()
>>> today
datetime.date(2022, 6, 18)
>>> print(today)
2022-06-18
>>> from datetime import datetime
>>> now = datetime.now()
>>> now
datetime.datetime(2022, 6, 18, 2, 30, 41, 997621)
>>> print(now)
2022-06-18 02:30:41.997621
>>> time_string = now.strftime("%H:%M:%S")
>>> time_string
'02:30:41'
>>> print(time_string)
02:30:41
on Replit:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from datetime import date
from datetime import datetime

def get_driver():
    # Set options to make browsing easier
    print(" running get_driver")
    options = webdriver.ChromeOptions()
    options.add_argument("disable-infobars")
    options.add_argument("start-maximized")
    options.add_argument("disable-dev-shm-usage")
    options.add_argument("no-sandbox")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_argument("disable-blink-features=AutomationControlled")
    driver = webdriver.Chrome(options=options)
    driver.get("https://automated.pythonanywhere.com/")
    return driver

def clean_text(text):
    """Extract only the temperature from the text."""
    print(" running clean_text")
    # split at ": ", take the second item, and convert it to float
    output = float(text.split(": ")[1])
    #print(output)
    return output

def get_current_temp(driver):
    print(" running get_current_temp")
    """
    # login
    driver = get_driver()
    driver.find_element(by="id", value="id_username").send_keys("automated")
    time.sleep(2)
    driver.find_element(by="id", value="id_password").send_keys("automatedautomated" + Keys.RETURN)
    time.sleep(2)
    driver.find_element(by="xpath", value="/html/body/nav/div/a").click()
    """
    """
    # open web page
    driver = get_driver()
    """
    # scrape text
    time.sleep(2)
    element = driver.find_element(by="xpath", value="/html/body/div[1]/div/h1[2]")
    current_temp = clean_text(element.text)
    print(" ", current_temp)
    return current_temp

def write_file(current_temp, file_name):
    print(" running write_file")
    file = open(file_name, 'w')
    #print(file)
    file.write(current_temp)
    file.close()

def date_time_name():
    print(" running date_time_name")
    # get date
    today = date.today()
    print(" date is: ", today)
    # get time
    now = datetime.now()
    str_time = now.strftime("%H:%M:%S")
    print(" time is: ", str_time)
    # concatenate date and time into a string for the filename
    ## e.g. 2022-2-10.14-33-59.txt
    file_name = str(today) + "." + str(str_time)
    print(" filename is: ", file_name)
    return file_name

def main(driver):
    print(" running main")
    current_temp = str(get_current_temp(driver))
    print("current temp is: ", current_temp)
    file_name = date_time_name()
    write_file(current_temp, file_name)

driver = get_driver()
keep_going = True
while keep_going == True:
    main(driver)
    time.sleep(2)
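Since keep_going is never changed, the loop is equivalent to the more idiomatic:

while True:
    main(driver)
    time.sleep(2)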
error message when running locally:
Matus1976@DESKTOP-98E2DP4 MINGW64 /d/Files - Google Drive/Files - New Merged/MFD Personal/Programming/python/Udemy/Build-Practical-Programs-with-Python (main)
$ python ./local/10_scrape_dynamic_save.py
current temp is: 28.0
today is: 2022-06-18
str_time: 15:57:37
file_name: 2022-06-18.15:57:37.txt
file_name is: '2022-06-18.15:57:37.txt'
Traceback (most recent call last):
File "./local/10_scrape_dynamic_save.py", line 79, in <module>
main()
File "./local/10_scrape_dynamic_save.py", line 76, in main
write_file(temp, file_name)
File "./local/10_scrape_dynamic_save.py", line 52, in write_file
file = open(file_name, 'w')
OSError: [Errno 22] Invalid argument: '2022-06-18.15:57:37.txt'
Windows doesn't allow : in filenames
change to:

    # get time
    now = datetime.now()
    str_time = now.strftime("%H-%M-%S")
    print("str_time: ", str_time)
rm *.txt
my solution is above
notes from instructor:
gets driver for home page:

def get_driver():
    # Set options to make browsing easier
    print(" running get_driver")
    options = webdriver.ChromeOptions()
    options.add_argument("disable-infobars")
    options.add_argument("start-maximized")
    options.add_argument("disable-dev-shm-usage")
    options.add_argument("no-sandbox")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_argument("disable-blink-features=AutomationControlled")
    driver = webdriver.Chrome(options=options)
    driver.get("https://automated.pythonanywhere.com/")
    return driver
move the element lookup into the loop:
strftime generates a useful string from the datetime object
using datetime:

>>> from datetime import datetime
>>> datetime.now()
datetime.datetime(2022, 6, 19, 18, 17, 16, 170593)
>>> datetime.now().strftime("%Y")
'2022'
>>> datetime.now().strftime("%Y-%M")
'2022-17'
>>> datetime.now().strftime("%Y-%M-%D")
'2022-17-06/19/22'
>>> datetime.now().strftime("%Y-%M-%d")
'2022-18-19'

full string:

>>> datetime.now().strftime("%Y-%M-%d.%H-%M-%S")
'2022-18-19.18-18-49'

(careful: %M is minutes, not month; note how '2022-17' above has the minute where the month should be. Lowercase %m is month, so the filename string should really be "%Y-%m-%d.%H-%M-%S")
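For reference, the corrected format string with lowercase %m (output is illustrative, for the same moment as above):

>>> datetime.now().strftime("%Y-%m-%d.%H-%M-%S")
'2022-06-19.18-18-49'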
titan22.com
sad.monday.funky
//*[@id="shopify-section-header"]/div[1]/div/div[3]/a[2]/span/svg - nope
//*[@id="shopify-section-header"]/div[1]/div/div[3]/a[2]/span/svg -
/html/body/header/div[1]/div[1]/div/div[3]/a[2]/span/svg - copy full xpath
//*[@id="shopify-section-header"]/div[1]/div/div[3]/a[2]/span/svg
/html/body/header/div[1]/div[1]/div/div[3]/a[2]/span/svg
just land right on the login page, log in, then click Contact Us
/html/body/footer/div/section/div/div[1]/div[1]/div[1]/nav/ul/li[1]/a
code update - $ git commit -m "14-15 solution accessing an online retailer"
download historical stock data via python
example url
https://query1.finance.yahoo.com/v7/finance/download/AAPL?period1=345427200&period2=1655683200&interval=1d&events=history&includeAdjustedClose=true
get started
to download files with python, use the requests library
import requests

url = "https://query1.finance.yahoo.com/v7/finance/download/AAPL?period1=345427200&period2=1655683200&interval=1d&events=history&includeAdjustedClose=true"
content = requests.get(url).content  # requests.get() fetches the URL; .content is the response body
print(content)

content is a bytes object
$ python local/02-16_download_stock_data.py
b'Forbidden'
it's forbidden; we need headers
Yahoo Finance doesn't allow scraping scripts to download files
you can get around this by providing headers that impersonate a browser:
import requests

url = "https://query1.finance.yahoo.com/v7/finance/download/AAPL?period1=345427200&period2=1655683200&interval=1d&events=history&includeAdjustedClose=true"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"}
content = requests.get(url, headers=headers).content  # same request, now with browser-like headers
print(content)
works...
Matus1976@DESKTOP-98E2DP4 MINGW64 /d/Files - Google Drive/Files - New Merged/MFD Personal/Programming/python/udemy/Build-Practical-Programs-with-Python (main)
$ python local/02-16_download_stock_data.py
b'Date,Open,High,Low,Close,Adj Close,Volume\n1980-12-12,0.128348,0.128906,0.128348,0.128348,0.100178,469033600\n1980-12-15,0.122210,0.122210,0.121652,0.121652,0.094952,175884800\n1980-12-16,0.113281,0.113281,0.112723,0.112723,0.087983,105728000\n1980-12-17,0.115513,0.116071,0.115513,0.115513,0.090160,86441600\n1980-1
...etc
but now we need to write these to a file on disk
requests.get().content returns bytes, so open the file in binary mode ('wb') instead of text mode ('w'):

content = requests.get(url, headers=headers).content
with open('aapl.csv', 'wb')

then write the file:

with open('aapl.csv', 'wb') as file:
    file.write(content)
let's further improve the script by letting the user choose the period and the ticker symbol
from_date = input('Enter start from date in format (YYYY/MM/DD): ')
to_date = input('Enter end date in YYYY/MM/DD format: ')
convert human date format to epoch
>>> from_date = '2022/1/2'
>>> print(from_date)
2022/1/2
then convert to datetime object with a specific format
>>> from datetime import datetime
>>> d = datetime.strptime(from_date, '%Y/%m/%d')
>>> d
datetime.datetime(2022, 1, 2, 0, 0)
then convert d to epoch time using the time module:
>>> import time
>>> epoch = time.mktime(d.timetuple())
>>> epoch
1641110400.0
>>>
so in your program it will look like:

import requests
from datetime import datetime
import time

from_date = input('Enter start from date in format (YYYY/MM/DD): ')
to_date = input('Enter end date in YYYY/MM/DD format: ')

from_datetime = datetime.strptime(from_date, '%Y/%m/%d')
to_datetime = datetime.strptime(to_date, '%Y/%m/%d')

from_epoch = time.mktime(from_datetime.timetuple())
to_epoch = time.mktime(to_datetime.timetuple())
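The two-step conversion can be wrapped in a small helper (my own refactor, not from the lecture):

def to_unix_epoch(date_string):
    # parse 'YYYY/MM/DD' and return integer Unix epoch seconds (local time)
    return int(time.mktime(datetime.strptime(date_string, '%Y/%m/%d').timetuple()))

from_epoch = to_unix_epoch(from_date)
to_epoch = to_unix_epoch(to_date)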
then substitute from_epoch and to_epoch into your Yahoo Finance URL:

url = "https://query1.finance.yahoo.com/v7/finance/download/AAPL?period1={from_epoch}&period2={to_epoch}&interval=1d&events=history&includeAdjustedClose=true"
run the program and you'll get:

b'{"finance":{"result":null,"error":{"code":"Internal Server Error","description":"The template variable \'from_epoch\' has no value"}}}'

this is because the url string is missing the f prefix, so {from_epoch} is sent to Yahoo literally instead of being substituted. While fixing that, also convert the epoch values from float to int so no decimal point ends up in the URL:

from_epoch = int(time.mktime(from_datetime.timetuple()))
to_epoch = int(time.mktime(to_datetime.timetuple()))

url = f"https://query1.finance.yahoo.com/v7/finance/download/{ticker}?period1={from_epoch}&period2={to_epoch}&interval=1d&events=history&includeAdjustedClose=true"

the f prefix makes it an f-string and allows the {variable} substitution syntax
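Putting the pieces together, the finished downloader might look like this (the ticker prompt and the output filename are my additions; the notes only show {ticker} inside the URL):

import requests
from datetime import datetime
import time

ticker = input('Enter ticker symbol: ')  # assumed prompt; the notes never show where ticker is set
from_date = input('Enter start from date in format (YYYY/MM/DD): ')
to_date = input('Enter end date in YYYY/MM/DD format: ')

from_epoch = int(time.mktime(datetime.strptime(from_date, '%Y/%m/%d').timetuple()))
to_epoch = int(time.mktime(datetime.strptime(to_date, '%Y/%m/%d').timetuple()))

url = f"https://query1.finance.yahoo.com/v7/finance/download/{ticker}?period1={from_epoch}&period2={to_epoch}&interval=1d&events=history&includeAdjustedClose=true"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"}
content = requests.get(url, headers=headers).content  # bytes

with open(f'{ticker.lower()}.csv', 'wb') as file:  # assumed filename scheme
    file.write(content)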
learn how to use Beautiful Soup, a great web scraping library
selenium is more targeted toward browser automation: logging in and out, clicking buttons, scraping values, etc.
with beautiful soup, if you just want to extract data, you don't have to open a browser. Selenium is kind of a hybrid between web scraping and browser scripting.
beautiful soup is just a library that parses the HTML of pages you fetch directly.
start with x-rates.com
URL for a conversion in the x-rates currency converter:
https://www.x-rates.com/calculator/?from=EUR&to=INR&amount=1
-
install beautiful soup
pip install beautifulsoup4
from bs4 import BeautifulSoup
-
example program:

from bs4 import BeautifulSoup

def GetCurrency(input_currency, output_currency):
    url = f"https://www.x-rates.com/calculator/?from={input_currency}&to={output_currency}&amount=1"
    print(url)

GetCurrency("USD", "EUR")

produces a usable URL
-
now with BeautifulSoup you want to get the source code of the web page
It'll be HTML / CSS / JavaScript
you want to access all that code from python
using requests
pip install requests
import requests
you will use requests to get the contents:

content = requests.get(url).text
print(content)  # prints out the HTML source code
current program now:

from bs4 import BeautifulSoup
import requests

def GetCurrency(input_currency, output_currency):
    url = f"https://www.x-rates.com/calculator/?from={input_currency}&to={output_currency}&amount=1"
    content = requests.get(url).text
    print(url)
    print(content)

GetCurrency("USD", "EUR")
out of all that source code we only want the conversion result
you can search for it in the source, or use inspect element
we need to parse that source code, and that's where beautiful soup is great:
it understands HTML structure, with knowledge of elements, div tags, and nested elements
use the html parser to look through the content:
soup = BeautifulSoup(content, 'html.parser')
if you print out soup here, it prints out basically the same thing.
code looks like this now:

from bs4 import BeautifulSoup
import requests

def GetCurrency(input_currency, output_currency):
    url = f"https://www.x-rates.com/calculator/?from={input_currency}&to={output_currency}&amount=1"
    content = requests.get(url).text
    print(url)
    #print(content)
    soup = BeautifulSoup(content, 'html.parser')
    print(soup)

GetCurrency("USD", "EUR")
We want the result span (class ccOutputRslt):

from bs4 import BeautifulSoup
import requests

def GetCurrency(input_currency, output_currency):
    url = f"https://www.x-rates.com/calculator/?from={input_currency}&to={output_currency}&amount=1"
    content = requests.get(url).text
    print(url)
    #print(content)
    soup = BeautifulSoup(content, 'html.parser')
    #print(soup)
    currency = soup.find("span", class_="ccOutputRslt").get_text()
    print(currency)

GetCurrency("USD", "EUR")
run that, you get
D:\Files - Google Drive\Files - New Merged\MFD Personal\Programming\python\Udemy\Build-Practical-Programs-with-Python\local>python 02-17_scrape-real-time-currency.py
https://www.x-rates.com/calculator/?from=USD&to=EUR&amount=1
0.944916 EUR
let's clean it up and just get that float. we want to remove the last four characters (the ' EUR' suffix):
from bs4 import BeautifulSoup
import requests

def GetCurrency(input_currency, output_currency):
    url = f"https://www.x-rates.com/calculator/?from={input_currency}&to={output_currency}&amount=1"
    content = requests.get(url).text
    print(url)
    #print(content)
    soup = BeautifulSoup(content, 'html.parser')
    #print(soup)
    raw_currency = soup.find("span", class_="ccOutputRslt").get_text()
    print(raw_currency)
    raw_currency_length = len(raw_currency)
    #print(raw_currency_length)
    # slice off the last four characters (the currency code suffix)
    currency = raw_currency[0:raw_currency_length - 4]
    #print(repr(currency))
    currency = float(currency)
    #print(type(currency))

GetCurrency("EUR", "INR")
complete program:

from bs4 import BeautifulSoup
import requests

def GetCurrency(input_currency, output_currency):
    """Get the current currency exchange rate and return the rate."""
    url = f"https://www.x-rates.com/calculator/?from={input_currency}&to={output_currency}&amount=1"
    content = requests.get(url).text
    print(url)
    #print(content)
    soup = BeautifulSoup(content, 'html.parser')
    #print(soup)
    raw_currency = soup.find("span", class_="ccOutputRslt").get_text()
    print(raw_currency)
    raw_currency_length = len(raw_currency)
    #print(raw_currency_length)
    currency = raw_currency[0:raw_currency_length - 4]
    #print(repr(currency))
    currency = float(currency)
    #print(type(currency))
    rate = currency
    return rate

GetCurrency("EUR", "AUD")