Python - Mid-Level Protocols: The urllib2 Module

Mid-Level Protocols: The `urllib2` Module
	Chapter 36. Programs: Clients, Servers, the Internet and the World Wide Web

A central piece of the design for the World Wide Web is the concept of a Uniform Resource Locator (URL) and Uniform Resource Identifier (URI). A URL provides the pieces of information needed to get at a piece of data located somewhere on the internet. Here's an example URL: https://www.python.org/download/. It has the following data elements.

A protocol (https)
A server (www.python.org)
A port number (443 is implied for https, and 80 for http, if no other port number is given)
A path (/download/)
An operation, supplied by the client rather than written in the URL (browsers use GET or POST; some web services also use PUT and DELETE)
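The elements listed above can be pulled out of a URL with the standard library's URL parser. What follows is a minimal sketch, not a definitive implementation: `url_elements` and `DEFAULT_PORTS` are hypothetical names introduced here, and the two-way import assumes the code may run on Python 2 (where the parser lives in `urlparse`) or Python 3 (where it moved to `urllib.parse`).

```python
# Sketch: split a URL into protocol, server, port and path.
try:
    from urlparse import urlsplit          # Python 2
except ImportError:
    from urllib.parse import urlsplit      # Python 3

# Well-known default ports, used when the URL does not name one.
# (Hypothetical table for this sketch; it covers only three protocols.)
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

def url_elements(url):
    """Return the protocol, server, port and path of a URL as a dict.

    url_elements is a hypothetical helper, not part of any library.
    """
    parts = urlsplit(url)
    port = parts.port                      # None when the URL gives no port
    if port is None:
        port = DEFAULT_PORTS.get(parts.scheme)
    return {
        "protocol": parts.scheme,
        "server": parts.hostname,
        "port": port,
        "path": parts.path,
    }

elements = url_elements("https://www.python.org/download/")
for name in ("protocol", "server", "port", "path"):
    print("%s: %s" % (name, elements[name]))
```

The operation does not appear in the result because it is not stored in the URL; the client chooses it when it makes the request.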

It turns out that we have a choice of several protocols, which makes URLs very pleasant to use. The protocols include:

FTP - the File Transfer Protocol. This will send a single file from an FTP server to our client. For example, ftp://aeneas.mit.edu/pub/gnu/dictionary/cide.a is the identifier for a specific file.
HTTP - the Hypertext Transfer Protocol. Amongst other things that HTTP can do, it can send a single file from a web server to our client. For example, https://www.crummy.com/software/BeautifulSoup/download/BeautifulSoup.py retrieves the current release of the Beautiful Soup module.
FILE - the local file protocol. We can use a URL beginning with file:/// to access files on our local computer.

HTTP Interaction. A great deal of information on the World Wide Web is available using simple URIs. In any well-designed web site, we can simply GET the resource that the URL identifies.

A large number of transactions are available through HTTP requests. Many web pages provide HTML that will be presented to a person using a browser.

In some cases, a web page provides an HTML form to a person. The person may fill in a form and click a button. This executes an HTTP POST transaction. The urllib2 module allows us to write Python programs which, in effect, fill in the blanks on a form and submit that request to a web server.
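The difference between fetching a page and submitting a form can be sketched with a request object. This is an illustration under stated assumptions: the address `http://www.example.com/search` and the field name `q` are placeholders rather than a real form, `form_request` is a hypothetical helper, and the two-way import assumes either Python 2 (`urllib2`, `urllib`) or Python 3 (`urllib.request`, `urllib.parse`). Nothing is sent over the network; the sketch only builds the requests and reports which operation each would use.

```python
# Sketch: a request without data is a GET; a request with data is a POST.
try:
    from urllib2 import Request            # Python 2
    from urllib import urlencode
except ImportError:
    from urllib.request import Request     # Python 3
    from urllib.parse import urlencode

def form_request(url, fields):
    """Build, but do not send, a POST request carrying form fields.

    form_request is a hypothetical helper, not part of any library.
    """
    # Encode the fields the way a browser encodes a simple form.
    body = urlencode(fields).encode("ascii")
    return Request(url, body)

# Placeholder address and field name, assumed for illustration only.
query = Request("http://www.example.com/search")
submit = form_request("http://www.example.com/search", {"q": "dictionary"})

print(query.get_method())
print(submit.get_method())
```

Passing either object to `urlopen()` in place of a URL string would send it to the server and return the response to read.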

Example. By using URLs in our programs, we can write software that reads local files as well as it reads remote files. We'll show just a simple situation where a file of content can be read by our application. In this case, we located a file provided by both an HTTP server and an FTP server. We can also download this file and read it from our own local computer.

As an example, we'll look at the Collaborative International Dictionary of English, CIDE. Here are three places that these files can be found, each using a different protocol. However, using the urllib2 module, we can read and process this file using any protocol and any server.

FTP: ftp://aeneas.mit.edu/pub/gnu/dictionary/cide.a This URL describes the aeneas.mit.edu server that has the CIDE files, and will respond to the FTP protocol.
HTTP: https://ftp.gnu.org/gnu/gcide/gcide-0.46/cide.a This URL names the ftp.gnu.org server that has the CIDE files, and responds to the HTTP protocol.
FILE: file:///Users/slott/Documents/dictionary/cide.a This URL names a file on my local computer.

Example 36.4. urlreader.py

#!/usr/bin/env python
"""Get the "A" section of the GNU CIDE Collaborative International Dictionary of English
"""
import urllib2

#baseURL= "ftp://aeneas.mit.edu/pub/gnu/dictionary/cide.a"
baseURL= "https://ftp.gnu.org/gnu/gcide/gcide-0.46/cide.a"
#baseURL= "file:///Users/slott/Documents/dictionary/cide.a"

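# urlopen() needs only the URL to read a resource; a second argument
# would be sent to the server as POST data, not used as a file mode.
dictXML= urllib2.urlopen( baseURL )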
print len(dictXML.read())
dictXML.close()

	We import the `urllib2` module.
	We name the URLs we'll be reading. In this case, any of these URLs will provide the file.
	When we open the URL, we can read the file.
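The local-file case can be tried without any server. The sketch below is an illustration rather than part of the example above: it writes a small stand-in file (not the real dictionary), names it with a `file:` URL, and reads it back through `urlopen()`. The helper `read_url` is hypothetical, and the two-way import assumes either Python 2 (`urllib2`, `urllib`) or Python 3 (`urllib.request`).

```python
# Sketch: read a local file through a file: URL.
import os
import tempfile
from contextlib import closing
try:
    from urllib2 import urlopen            # Python 2
    from urllib import pathname2url
except ImportError:
    from urllib.request import urlopen, pathname2url   # Python 3

def read_url(url):
    """Return everything the URL provides, as a byte string.

    read_url is a hypothetical helper, not part of any library.
    """
    # closing() guarantees close() is called, even if read() fails.
    with closing(urlopen(url)) as source:
        return source.read()

# Write a small stand-in file so the sketch needs no network access.
handle, path = tempfile.mkstemp(suffix=".a")
with os.fdopen(handle, "wb") as target:
    target.write(b"stand-in dictionary text\n")

# "file:" plus the converted path names the local file as a URL.
local_url = "file:" + pathname2url(os.path.abspath(path))
content = read_url(local_url)
print(len(content))

os.remove(path)
```

Because `read_url` takes nothing but a URL, the same call should also work with an `ftp:` or `http:` URL when the server is reachable.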

