Mid-Level Protocols: The urllib2 Module
A central piece of the design for the World Wide Web is the
concept of a Uniform Resource Locator (URL) and Uniform Resource
Identifier (URI). A URL provides the pieces of information needed to
get at a piece of data located somewhere on the internet. Here's an
example URL:
https://www.python.org/download/
It has the following data elements.
- A protocol (https)
- A server (www.python.org)
- A port number (443 is implied for https, and 80 for http, if no
  other port number is given)
- A path (/download/)
- An operation. This is not written in the URL itself; it is part of
  the request. Browsers use GET or POST; some web services also use
  PUT and DELETE.
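As a minimal sketch of how these elements can be pulled apart in a program, we can use the standard library's urlparse() function. The describe() function and the DEFAULT_PORTS table are hypothetical names introduced for this illustration; they are not part of any library. The guarded import assumes the code may run under Python 2 (the urlparse module) or Python 3 (urllib.parse). The operation is not part of the URL text, so it does not appear here.

```python
# Python 2 keeps urlparse() in the urlparse module; Python 3 moved it to urllib.parse.
try:
    from urlparse import urlparse
except ImportError:
    from urllib.parse import urlparse

# Illustrative table of implied ports; an assumption for this sketch, not a library feature.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

def describe(url):
    """Return the (protocol, server, port, path) elements of a URL."""
    parts = urlparse(url)
    # parts.port is None when the URL gives no explicit port number,
    # so we fall back to the port implied by the protocol.
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return parts.scheme, parts.hostname, port, parts.path

print(describe("https://www.python.org/download/"))
```

A URL with an explicit port, such as http://www.python.org:8080/download/, would report 8080 instead of the implied value.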
It turns out that we have a choice of several protocols, which makes
URLs very flexible. The protocols include the following.
- FTP - the File Transfer Protocol. This will send a single file
  from an FTP server to our client. For example,
  ftp://aeneas.mit.edu/pub/gnu/dictionary/cide.a
  is the identifier for a specific file.
- HTTP - the Hypertext Transfer Protocol. Among other things,
  HTTP can send a single file from a web server to our
  client. For example,
  https://www.crummy.com/software/BeautifulSoup/download/BeautifulSoup.py
  retrieves the current release of the Beautiful Soup module.
- FILE - the local file protocol. We can use a URL beginning
  with file:/// to access files on our local computer.
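The file:/// form can be tried without any network connection. The following is a sketch only: it writes a small scratch file, builds a file URL for it, and reads it back through urlopen(). It assumes a POSIX-style absolute path (one that starts with /), and the scratch file's content is invented for the illustration. The guarded imports assume either Python 2 (urllib2) or Python 3 (urllib.request).

```python
import os
import tempfile

try:
    from urllib2 import urlopen            # Python 2
    from urllib import quote
except ImportError:
    from urllib.request import urlopen     # Python 3
    from urllib.parse import quote

# Create a small scratch file so the example has something to read.
handle, path = tempfile.mkstemp(suffix=".txt")
os.write(handle, b"hello from a local file\n")
os.close(handle)

# Assumes a POSIX-style absolute path such as /tmp/example.txt,
# which gives a URL of the form file:///tmp/example.txt.
file_url = "file://" + quote(os.path.abspath(path))

source = urlopen(file_url)
content = source.read()
source.close()
os.remove(path)

print(len(content))
```

The same urlopen() call would accept an FTP or HTTP URL in place of file_url; only the string changes.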
HTTP Interaction. A great deal of information on the World Wide Web is available
using simple URIs. In any well-designed web site, we can simply GET the
resource that the URL identifies.
A large number of transactions are available through HTTP
requests. Many web pages provide HTML that will be presented to a person
using a browser.
In some cases, a web page provides an HTML form to a person. The
person may fill in a form and click a button. This executes an HTTP
POST transaction. The urllib2 module allows us to write Python programs
which, in effect, fill in the blanks on a form and submit that request
to a web server.
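To make the difference between GET and POST concrete, here is a minimal sketch that builds both kinds of request without sending either one. The form URL and the field names are hypothetical, chosen only for illustration. The guarded imports assume either Python 2 (urllib2 and urllib) or Python 3 (urllib.request and urllib.parse).

```python
try:
    from urllib2 import Request            # Python 2
    from urllib import urlencode
except ImportError:
    from urllib.request import Request     # Python 3
    from urllib.parse import urlencode

# Hypothetical form URL and field names, used only for illustration.
FORM_URL = "http://www.example.com/signup"
fields = [("name", "Ada"), ("language", "Python")]

# Encode the fields the way a browser encodes a submitted form.
body = urlencode(fields).encode("ascii")

get_request = Request(FORM_URL)            # no body: this is a GET request
post_request = Request(FORM_URL, body)     # with a body: this is a POST request

print(get_request.get_method())
print(post_request.get_method())
```

Nothing is transmitted here. Passing post_request to urlopen() is what would submit the form to the server.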
Example. By using URLs in our programs, we can write software that reads
remote files as easily as it reads local files. We'll show just a simple
situation where a file of content can be read by our application. In
this case, we located a file that is provided by both an HTTP server
and an FTP server. We can also download this file and read it from our
own local computer.
As an example, we'll look at the Collaborative
International Dictionary of English, CIDE. Here are three
places where these files can be found, each using a different protocol.
However, using the urllib2 module, we can read
and process this file using any protocol and any server.
- FTP
  ftp://aeneas.mit.edu/pub/gnu/dictionary/cide.a
  This URL describes the aeneas.mit.edu server
  that has the CIDE files, and will respond to the FTP
  protocol.
- HTTP
  https://ftp.gnu.org/gnu/gcide/gcide-0.46/cide.a
  This URL names the ftp.gnu.org server that
  has the CIDE files, and responds to the HTTP protocol.
- FILE
  file:///Users/slott/Documents/dictionary/cide.a
  This URL names a file on my local computer.
Example 36.4. urlreader.py
#!/usr/bin/env python
"""Get the "A" section of the GNU CIDE Collaborative International Dictionary of English
"""
import urllib2

# Any one of these URLs provides the file; uncomment the one to use.
#baseURL = "ftp://aeneas.mit.edu/pub/gnu/dictionary/cide.a"
baseURL = "https://ftp.gnu.org/gnu/gcide/gcide-0.46/cide.a"
#baseURL = "file:///Users/slott/Documents/dictionary/cide.a"

# urlopen() takes no file mode; a second argument would be sent as POST data.
dictXML = urllib2.urlopen(baseURL)
print(len(dictXML.read()))
dictXML.close()
- We import the urllib2 module.
- We name the URLs we'll be reading. In this case, any of
  these URLs will provide the file.
- When we open the URL, we can read the file.
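A server may be unreachable, or a local file may be missing, so a program that opens URLs usually wants to handle that case. The following is one possible sketch, not part of the original example: content_length() is a hypothetical helper that returns the number of bytes read, or None when the URL cannot be opened. It uses contextlib.closing() so the connection is closed even if the read fails. The guarded imports assume either Python 2 (urllib2) or Python 3 (urllib.request and urllib.error).

```python
from contextlib import closing

try:
    from urllib2 import urlopen, URLError      # Python 2
except ImportError:
    from urllib.request import urlopen         # Python 3
    from urllib.error import URLError

def content_length(url):
    """Return the number of bytes read from url, or None if it cannot be opened."""
    try:
        # closing() guarantees close() is called, even if read() fails.
        with closing(urlopen(url)) as source:
            return len(source.read())
    except URLError:
        return None

# A file URL that does not exist yields None instead of a traceback.
print(content_length("file:///no/such/directory/cide.a"))
```

Any of the three CIDE URLs shown earlier could be passed to content_length() in the same way.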