
Lab Scrape Emails

Update:

If too many requests reach the Ridgewater URL with the same User-Agent, anti-bot measures appear to kick in and the server returns a 403 Forbidden error.

Websites that describe how to overcome this:
How to customize Your User-Agent with Python Requests
How to Effectively Use User Agents for Web Scraping

The fix is to use your own User-Agent.

In Chrome, open Developer Tools, switch to the Console tab, and type navigator.userAgent at the > prompt.
The console returns your browser's User-Agent string. Use it in your Python code.

Here’s how you can check and get the user agent using your browser’s console:
Open the developer tools in Google Chrome, Microsoft Edge, Mozilla Firefox, Safari, or any other browser. You can use F12 or Ctrl+Shift+I on Windows/Linux, or Cmd(⌘)+Option+I on macOS.
Switch to the Console tab.
Type navigator.userAgent in the console and press Enter (or Ctrl+Enter). The console will return a string which is your browser’s user agent.
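
As a rough illustration of where that string goes, the sketch below attaches a custom User-Agent to a urllib.request.Request. The USER_AGENT value, the example.com URL, and the helper names build_request and fetch are placeholders invented for this sketch, not part of the lab; substitute the string your own console returned. The fetch helper is defined but not called, so running the sketch makes no network request.

```python
import ssl
import urllib.request

# Placeholder value (an assumption): replace it with the string that
# navigator.userAgent printed in your own browser console.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries a custom User-Agent header."""
    headers = {"User-Agent": user_agent}
    return urllib.request.Request(url, headers=headers)


def fetch(url):
    """Open the URL with a default SSL context and return the page text."""
    context = ssl.create_default_context()
    with urllib.request.urlopen(build_request(url), context=context) as response:
        return response.read().decode("utf-8", errors="replace")


# Build a request for a placeholder URL and show the header it will send.
# urllib stores header names capitalized, hence "User-agent" in the lookup.
request = build_request("https://example.com/")
print(request.get_header("User-agent"))
```

If the server still answers with 403 Forbidden, the first thing to check is that the header string matches what the browser console reported.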

This lab is to be done on your own.

Create a Python program named scrape_emails.py that does the following:
Comments at the top of the module with your name, date, and description. 1 point.
Put this website URL into a variable: http://ridgewater.edu/contact-us/staff-directory/ 1 point.
Set up the SSL context that is needed to read HTTPS (secure) pages. 2 points.
Set up the headers dictionary. 1 point.
Use urllib.request.Request to request the website. 1 point.
Use urllib.request.urlopen to open the website. 1 point.
Use
'(mailto:)([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+)'
or
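r'(mailto:)([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)'
as the regular expression to find all email addresses. The plain string needs the backslash before the dot doubled; the raw string takes a single backslash. 2 points.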
Open a new file named 'email_addresses.txt'. 1 point.
Iterate through the regular expression matches to extract the individual email addresses. 2 points.
Write the email addresses to the screen. 2 points.
Write the email addresses to the text file. 5 points.
Close the text file. 1 point.
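
The matching and file-writing steps can be tried out on a small in-memory string before pointing the program at the real page. The following is only a sketch under stated assumptions: sample_html and its two addresses are made up, and the output goes to a hypothetical sample_addresses.txt in a temporary folder rather than to the email_addresses.txt file the lab asks for.

```python
import os
import re
import tempfile

# Raw-string form of the lab's pattern (single backslash before the dot).
EMAIL_PATTERN = r'(mailto:)([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)'

# Made-up page content standing in for the downloaded HTML.
sample_html = (
    '<a href="mailto:jane.doe@example.edu">Jane</a> '
    '<a href="mailto:john_smith@example.edu">John</a>'
)

# The pattern has two groups, so findall returns (prefix, address) tuples.
matches = re.findall(EMAIL_PATTERN, sample_html)

# Hypothetical output location, chosen so the sketch leaves the lab folder alone.
output_path = os.path.join(tempfile.mkdtemp(), "sample_addresses.txt")

output_file = open(output_path, "w")
for _prefix, address in matches:
    print(address)                     # write the address to the screen
    output_file.write(address + "\n")  # write the address to the text file
output_file.close()
```

Each match is a tuple because the pattern contains two parenthesized groups; the second element is the address itself.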

Submit all these files to the D2L dropbox:

The scrape_emails.py code file.
The email_addresses.txt file.
A screenshot in .jpg or .png format showing the start of the run as output.
A screenshot in .jpg or .png format showing the end of the run as output.

The image below shows an example start run of the program.

The image below shows an example end run of the program.