Web scraping is a technique for automatically accessing and extracting large amounts of information from a website, which can save a huge amount of time and effort.
What we need to do is figure out where the links to the files we want to download are located within the multiple levels of HTML tags (right-clicking the page and selecting "Inspect" shows the underlying HTML). On this page, each data file is linked from an <a> tag like this one:
<a href="data/nyct/turnstile/turnstile_180922.txt">Saturday, September 22, 2018</a>
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
url = 'http://web.mta.info/developers/turnstile.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
soup.findAll('a')
This code gives us every <a> tag in the page's HTML.
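To find which position holds the first data-file link, it helps to print each tag alongside its index (a small sketch; the index 38 used in the next step is specific to how this page was laid out at the time of writing):

# Print the position and link target of every <a> tag on the page
for i, tag in enumerate(soup.findAll('a')):
    print(i, tag.get('href'))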
one_a_tag = soup.findAll('a')[38]
link = one_a_tag['href']
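To sanity-check that single link, it can be joined to the site's base URL and downloaded on its own (a one-file sketch using the same calls as the full loop below):

# Build the absolute URL from the relative href and fetch just that one file
download_url = 'http://web.mta.info/developers/' + link
urllib.request.urlretrieve(download_url, './' + link[link.find('/turnstile_')+1:])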
The full code should look like this:
# Import libraries
import requests
import urllib.request
import time
from bs4 import BeautifulSoup

# Set the URL you want to webscrape from
url = 'http://web.mta.info/developers/turnstile.html'

# Connect to the URL
response = requests.get(url)

# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")

# To download the whole data set, let's do a for loop through all a tags
line_count = 1  # variable to track what line you are on
for one_a_tag in soup.findAll('a'):  # 'a' tags are for links
    if line_count >= 36:  # code for text files starts at line 36
        link = one_a_tag['href']
        download_url = 'http://web.mta.info/developers/' + link
        urllib.request.urlretrieve(download_url, './' + link[link.find('/turnstile_')+1:])
        time.sleep(1)  # pause the code for a sec
    # add 1 for next line
    line_count += 1
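Counting lines is brittle if the page layout ever changes. An alternative sketch filters on the href pattern instead, assuming the data files are the only links whose href contains 'turnstile_':

# Download every link whose href matches the turnstile file pattern
for a_tag in soup.findAll('a', href=True):
    href = a_tag['href']
    if 'turnstile_' in href:
        download_url = 'http://web.mta.info/developers/' + href
        urllib.request.urlretrieve(download_url, './' + href.split('/')[-1])
        time.sleep(1)  # pause between downloads to be polite to the server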
Web spiders should ideally follow the robots.txt file of a website while scraping. It specifies rules for good behavior, such as how frequently you can scrape, which pages allow scraping, and which ones you can't.
If these lines appear in a site's robots.txt, it means the website doesn't want to be scraped:
User-agent: *
Disallow: /
There are also methods that can help with checking robots.txt files programmatically.
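For example, Python's built-in urllib.robotparser module can read a site's robots.txt and report whether a given URL may be fetched (a minimal sketch, using the MTA site from above):

import urllib.robotparser

# Load and parse the site's robots.txt
rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://web.mta.info/robots.txt')
rp.read()

# Check whether our crawler may fetch the turnstile page
print(rp.can_fetch('*', 'http://web.mta.info/developers/turnstile.html'))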