Reading-Notes


Web Scrape with Python

Overview

Web scraping is a technique for automatically accessing and extracting large amounts of information from a website, which can save a great deal of time and effort.

More about web scraping

Inspecting the Website

The first thing we need to do is figure out where the links to the files we want to download sit within the multiple levels of HTML tags.
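
To make the idea of locating links inside nested tags concrete, here is a minimal sketch that walks a small HTML fragment and collects the href of every <a> tag. It uses only the standard library's html.parser rather than BeautifulSoup, and the fragment, its class names, and its file paths are made up for illustration; the real page's markup may differ.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: the nesting, class names, and paths are assumptions.
SAMPLE_HTML = """
<div class="container">
  <div class="content">
    <a href="data/nyct/turnstile/turnstile_180922.txt">Saturday, September 22, 2018</a>
    <a href="data/nyct/turnstile/turnstile_180915.txt">Saturday, September 15, 2018</a>
  </div>
  <a>anchor without an href</a>
</div>
"""


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag, however deeply nested."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors that carry no link
                self.links.append(href)


def collect_links(html):
    """Return the hrefs of all <a> tags in the given HTML, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


links = collect_links(SAMPLE_HTML)
print(links)
```

In a browser, the same information can be found by right-clicking a link and choosing "Inspect", which shows the <a> tag and the tags that enclose it.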

Python Code

The soup.findAll('a') call in the code below gives us every <a> tag on the page; <a> tags are the ones that hold links.

The full code should look like this:

  
# Import libraries
import requests
import urllib.request
import time
from bs4 import BeautifulSoup

# Set the URL you want to webscrape from
url = 'http://web.mta.info/developers/turnstile.html'

# Connect to the URL
response = requests.get(url)

# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")

# To download the whole data set, let's do a for loop through all a tags
line_count = 1 #variable to track what line you are on
for one_a_tag in soup.findAll('a'):  #'a' tags are for links
    if line_count >= 36: #code for text files starts at line 36
        link = one_a_tag['href']
        download_url = 'http://web.mta.info/developers/'+ link
        urllib.request.urlretrieve(download_url,'./'+link[link.find('/turnstile_')+1:]) 
        time.sleep(1) #pause the code for a sec
    #add 1 for next line
    line_count +=1
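
The slicing expression in the download line can be hard to read at a glance, so here is a small sketch of what it does to a single link. The sample href is an assumption about what the page's links look like, and urljoin is swapped in here for the plain string concatenation used above; it is not part of the original code.

```python
from urllib.parse import urljoin

BASE_URL = "http://web.mta.info/developers/"


def local_filename(link):
    """Keep everything after the slash that precedes 'turnstile_'.

    str.find returns -1 when the marker is missing, so adding 1 gives 0
    and the whole link is returned unchanged in that case.
    """
    return link[link.find("/turnstile_") + 1:]


# Hypothetical href, assumed to resemble the links on the page
sample_link = "data/nyct/turnstile/turnstile_180922.txt"

download_url = urljoin(BASE_URL, sample_link)  # absolute URL to fetch
filename = local_filename(sample_link)         # name of the file saved locally

print(download_url)
print(filename)
```

With this sample link, the file would be saved in the current directory as turnstile_180922.txt.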
  


How to Scrape Websites Without Getting Blocked

Respect Robots.txt

Web spiders should ideally follow a website's robots.txt file while scraping. It sets out specific rules for good behavior, such as how frequently you can scrape, which pages you are allowed to scrape, and which ones you are not.

If the robots.txt file contains Disallow rules that cover a page, it means the website doesn't want that page to be scraped.
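
As a rough illustration of checking these rules before scraping, the sketch below uses Python's standard urllib.robotparser module. The robots.txt contents, the "my-scraper" user agent, and the example.com URLs are all made up for the example; a real scraper would load the site's actual file, for instance with set_url() and read().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one folder is off limits and a crawl delay is requested.
PARTIAL_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

# Hypothetical robots.txt that asks every crawler to stay away entirely.
BLOCK_ALL_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


partial = build_parser(PARTIAL_ROBOTS_TXT)
allowed = partial.can_fetch("my-scraper", "https://example.com/developers/turnstile.html")
blocked = partial.can_fetch("my-scraper", "https://example.com/private/report.html")
delay = partial.crawl_delay("my-scraper")  # seconds to wait between requests

block_all = build_parser(BLOCK_ALL_ROBOTS_TXT)
anything = block_all.can_fetch("my-scraper", "https://example.com/any/page.html")

print(allowed, blocked, delay, anything)
```

When a crawl delay is present, it can be passed to time.sleep() between requests so the scraper stays within the rate the site asks for.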

Methods that could help scraping robots.txt files
