’ % name[len(‘User=’):] print ‘’
To make use of this functionality, you should read up on CGI (which is certainly not specific to Python). Although a complete discussion is outside the scope of this chapter, the following few hints will help get you started:

✦ CGIHTTPRequestHandler stores the user information (including form values) in environment variables. (Write a simple CGI script to print out all variables and their values to test this.)
✦ Anything you write to stdout (via print or sys.stdout.write) is returned to the client, and it can be text or binary data.
✦ CGIHTTPRequestHandler outputs some response headers for you, but you can add others if needed (such as the Content-type header in the example).
✦ After the headers, you must output a blank line before any data.
✦ On UNIX, external programs run with the nobody user ID.
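The first hint suggests writing a script that prints every environment variable. A minimal sketch of such a script follows. It assumes a modern Python 3 interpreter (the chapter itself targets an older Python), and the helper name render_environment is hypothetical, chosen only for this illustration. Note the blank line emitted between the header and the data.

```python
import os
import sys

def render_environment(environ):
    """Build a CGI response that lists every environment variable."""
    # One header, then the required blank line, then the data.
    lines = ["Content-type: text/plain", ""]
    for name in sorted(environ):
        lines.append("%s=%s" % (name, environ[name]))
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    # Whatever is written to stdout is what the client receives.
    sys.stdout.write(render_environment(os.environ))
```

Run through a CGI-capable server, the output should include variables such as QUERY_STRING, which is where form values from a GET request typically appear.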
Handling Multiple Requests Without Threads

Although threads can help the Web servers in the previous sections handle more than one connection simultaneously, the program usually sits around waiting for data to be transmitted across the network. (Instead of being CPU bound, the program is said to be I/O bound.) In situations where your program is I/O bound, a lot
4807-7 ch15.F
5/24/01
8:59 AM
Part III ✦ Networking and the Internet
of CPU time is wasted switching between threads that are just waiting until they can read or write more data to a file or socket. In such cases, it may be better to use the select and asyncore modules. These modules still let you process multiple requests at a time, but avoid all the senseless thread switching.

The select(inList, outList, errList[, timeout]) function in the select module takes three lists of objects that are waiting to perform input or output (or want to be notified of errors). select returns three lists, subsets of the originals, containing only those objects that can now perform I/O without blocking. If the timeout parameter is given (a floating-point number indicating the number of seconds to wait) and is non-zero, select returns when an object can perform I/O or when the time limit is reached (whereupon empty lists are returned). A timeout value of 0 does a quick check without blocking.

The three lists hold input, output, and error objects, respectively (objects that are interested in reading data, writing data, or in being notified of errors that occurred). Any of the three lists can be empty, and the objects can be integer file descriptors or filelike objects with a fileno() method that returns a valid file descriptor.

Cross-Reference
See “Working with File Descriptors” in Chapter 10 for more information.
By using select, you can start several read or write operations and, instead of blocking until you can read or write more, you can continue to do other work. This way, your I/O-bound program spends as much time as possible being driven by its performance-limiting factor (I/O), instead of a more artificial factor (switching between threads). With select, it is possible to write reasonably high-performance servers in Python.

Note
On Windows systems, select() works on socket objects only. On UNIX systems, however, it also works on other file descriptors, such as named pipes.
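As a rough illustration of the select call described above, the following sketch checks a socket once with a zero timeout and then again with a one-second timeout. It assumes a modern Python 3 interpreter, and it uses socket.socketpair() as a stand-in for a real network connection. The variable names are illustrative only.

```python
import select
import socket

# Two connected sockets stand in for the two ends of a network link.
left, right = socket.socketpair()

# Nothing has been sent yet, so a zero-timeout check finds nothing readable.
readable, writable, errored = select.select([right], [], [], 0)
nothing_ready = (readable == [])

left.sendall(b"ping")

# Wait up to one second for 'right' to become readable.
readable, writable, errored = select.select([right], [left], [], 1.0)
data = readable[0].recv(1024) if readable else b""

left.close()
right.close()
```

The lists returned by select are subsets of the lists passed in, so the program can read only from the sockets that are ready and keep doing other work for the rest.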
A slightly more efficient alternative to select is the select.poll() function, which returns a polling object (available on UNIX platforms). After you create a polling object, you call the register(fd[, eventmask]) method to register a particular file descriptor (or object with a fileno() method). The optional eventmask is constructed by bitwise OR-ing together any of the following: select.POLLIN (for input), select.POLLPRI (urgent input), select.POLLOUT (for output), or select.POLLERR. You can register as many file descriptors as needed, and you can remove them from the object by calling the polling object’s unregister(fd) method. Call the polling object’s poll([timeout]) method to see which file descriptors, if any, are ready to perform I/O without blocking. poll returns a possibly empty list of tuples of the form (fd, event), an entry for each file descriptor whose state has changed. The event will be a bitwise-OR of any of the eventmask flags as well as POLLHUP (hang up) or POLLNVAL (an invalid file descriptor).
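The register, poll, and unregister steps can be sketched as follows. This is an illustration under stated assumptions rather than a definitive recipe: it assumes Python 3 on a UNIX-like platform where select.poll() is available, and the helper name wait_for_input is hypothetical. Unlike select, the timeout passed to the polling object's poll method is measured in milliseconds.

```python
import select
import socket

def wait_for_input(sock, timeout_ms=1000):
    """Return True if sock has data to read within timeout_ms milliseconds."""
    poller = select.poll()
    poller.register(sock, select.POLLIN | select.POLLPRI)
    try:
        events = poller.poll(timeout_ms)  # list of (fd, event) tuples
    finally:
        poller.unregister(sock)
    for fd, event in events:
        if fd == sock.fileno() and event & (select.POLLIN | select.POLLPRI):
            return True
    return False

# socketpair() stands in for a real network connection.
sender, receiver = socket.socketpair()
before = wait_for_input(receiver, timeout_ms=0)  # nothing sent yet
sender.sendall(b"data")
after = wait_for_input(receiver)
sender.close()
receiver.close()
```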
Chapter 15 ✦ Networking
asyncore

If you’ve never used select or poll before, it may seem complicated or confusing. To help in creating select-based socket clients and servers, the asyncore module takes care of a lot of the dirty work for you.

asyncore defines the dispatcher class, a wrapper around a normal socket object that you subclass to handle messages about when the socket can be read or written without blocking. Because it is a wrapper around a socket, you can often treat a dispatcher object like a normal socket (it has the usual connect(addr), send(data), recv(bufsize), listen([backlog]), bind(addr), accept(), and close() methods).

Although the dispatcher is a wrapper around a socket, you still need to create the underlying socket (either the caller needs to or you can create it in the dispatcher’s constructor) by calling the create_socket(family, type) method:

    d = myDispatcher()
    d.create_socket(AF_INET, SOCK_STREAM)

create_socket creates the socket and sets it to nonblocking mode.

asyncore calls methods of a dispatcher object when different events occur. When the socket can be written to without blocking, for example, the handle_write() method is called. When data is available for reading, handle_read() is called. You can also implement handle_connect() for when a socket connects successfully, handle_close() for when it closes, and handle_accept() for when a call to socket.accept will not block (because an incoming connection is available and waiting).

asyncore calls the readable() and writable() methods of the dispatcher object to see if it is interested in reading or writing data, respectively (by default, both methods always return 1). You can override these so that, for example, asyncore doesn’t waste time checking for data if you’re not even trying to read any.
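To make the callback pattern concrete, here is a small sketch of how a select-based loop might drive objects that expose readable(), writable(), handle_read(), and handle_write(). It is not asyncore's actual code: MiniDispatcher and poll_once are hypothetical names invented for this illustration, and the sketch assumes Python 3 (recent Python releases no longer ship asyncore).

```python
import select
import socket

class MiniDispatcher:
    """Hypothetical stand-in for a dispatcher: wraps one nonblocking socket."""
    def __init__(self, sock):
        self.sock = sock
        self.sock.setblocking(False)
        self.outgoing = b""
        self.incoming = b""
        self.closed = False

    def fileno(self):
        return self.sock.fileno()  # lets select() accept this object directly

    def readable(self):
        return True

    def writable(self):
        return len(self.outgoing) > 0  # only ask for write events when needed

    def handle_read(self):
        data = self.sock.recv(4096)
        if data:
            self.incoming += data
        else:
            self.handle_close()  # an empty read means the peer closed

    def handle_write(self):
        sent = self.sock.send(self.outgoing)
        self.outgoing = self.outgoing[sent:]

    def handle_close(self):
        self.closed = True
        self.sock.close()

def poll_once(dispatchers, timeout=0.0):
    """One pass of the event loop: ask who is interested, then fire events."""
    live = [d for d in dispatchers if not d.closed]
    readers = [d for d in live if d.readable()]
    writers = [d for d in live if d.writable()]
    if not readers and not writers:
        return
    can_read, can_write, _ = select.select(readers, writers, [], timeout)
    for d in can_write:
        d.handle_write()
    for d in can_read:
        if not d.closed:
            d.handle_read()

a, b = socket.socketpair()
left, right = MiniDispatcher(a), MiniDispatcher(b)
left.outgoing = b"hello"
for _ in range(10):
    poll_once([left, right], timeout=0.1)
    if right.incoming == b"hello":
        break
left.handle_close()
right.handle_close()
```

Because writable() returns false once the outgoing buffer is empty, the loop stops asking select about write readiness for that socket, which is the same reason for overriding writable() in a real dispatcher subclass.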
In order for asyncore to fire events off to any dispatcher objects, you need to call asyncore.poll([timeout]) (on UNIX, you can also call asyncore.poll2([timeout]) to use poll instead of select) or asyncore.loop([timeout]). These functions use the select module to check for a change in I/O state and then fire off the appropriate events to the corresponding dispatcher objects. poll checks once (with a default timeout of 0 seconds), but loop keeps checking until no open dispatcher objects remain, passing its timeout (a default of 30 seconds) to each underlying check.

The best way to absorb all this is by looking at an example. Listing 15-4 is a very simple asynchronous Web page retrieval class that retrieves the index.html page from a Web site and writes it to disk (including the Web server’s response headers).
Listing 15-4: asyncget.py – Asynchronous HTML page retriever

    import asyncore, socket

    class AsyncGet(asyncore.dispatcher):
        def __init__(self, host):
            asyncore.dispatcher.__init__(self)
            self.host = host
            self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
            self.connect((host, 80))
            self.request = 'GET /index.html HTTP/1.0\r\n\r\n'
            self.outf = None
            print 'Requesting index.html from', host

        def handle_connect(self):
            print 'Connect', self.host

        def handle_read(self):
            if not self.outf:
                print 'Creating', self.host
                self.outf = open(self.host, 'wt')
            data = self.recv(8192)
            if data:
                self.outf.write(data)

        def writable(self):
            # Only ask for write events while request data remains unsent
            return len(self.request) > 0

        def handle_write(self):
            # Not all data might be sent, so track what did make it
            num_sent = self.send(self.request)
            self.request = self.request[num_sent:]

        def handle_close(self):
            asyncore.dispatcher.close(self)
            print 'Socket closed for', self.host
            if self.outf:
                self.outf.close()

    # Now retrieve some pages
    AsyncGet('www.yahoo.com')
    AsyncGet('www.cnn.com')
    AsyncGet('www.python.org')
    asyncore.loop() # Wait until all are done
Here’s some sample output:

    C:\temp>asyncget.py
    Requesting index.html from www.yahoo.com
    Requesting index.html from www.cnn.com
    Requesting index.html from www.python.org
    Connect www.yahoo.com
    Connect www.cnn.com
    Creating www.yahoo.com
    Connect www.python.org
    Creating www.cnn.com
    Creating www.python.org
    Socket closed for www.yahoo.com
    Socket closed for www.python.org
    Socket closed for www.cnn.com
Notice that the requests did not all finish in the same order they were started. Rather, they each made progress according to when data was available. By being event-driven, the I/O-bound program spends most of its time working on its greatest performance boundary (I/O), instead of wasting time with needless thread switching.
Summary

If you’ve done any network programming in other languages, you’ll find that doing the same thing in Python takes far less effort and produces fewer bugs. Python has full support for standard networking functionality, as well as utility classes that do much of the work for you. In this chapter, you:

✦ Converted IP addresses to registered names and back.
✦ Created sockets and sent messages between them.
✦ Used SocketServers to quickly build custom servers.
✦ Built a working Web server in only a few lines of Python code.
✦ Used select to process multiple socket requests without threads.

The next chapter looks at more of Python’s higher-level support for Internet protocols, including modules that hide the nasty details of “speaking” protocols such as HTTP, FTP, and telnet.
4807-7 ch16.F
Chapter 16
Speaking Internet Protocols
On the Internet, people use various protocols to transfer files, send e-mail, and request resources from the World Wide Web. Python provides libraries to help work with Internet protocols. This chapter shows how you can write Internet programs without having to handle lower-level TCP/IP details such as sockets. Supported protocols include HTTP, POP3, SMTP, FTP, and Telnet. Python also provides useful CGI scripting abilities.
In This Chapter

✦ Python’s Internet protocol support
✦ Retrieving Internet resources
✦ Sending HTTP requests
✦ Sending and receiving e-mail
✦ Transferring files via FTP
Python’s Internet Protocol Support

Python’s standard libraries make it easy to use standard Internet protocols such as HTTP, FTP, and Telnet. These libraries are built on top of the socket library, and enable you to write networked programs with a minimum of low-level code.

Each Internet protocol is documented in a numbered Request for Comments (RFC). The name is a bit misleading for established protocols such as POP and FTP, as these protocols are widely implemented, and are no longer under much discussion! These protocols are quite feature-rich; the RFCs for the protocols discussed here would fill several hundred printed pages. The standard Python modules provide a high-level client for each protocol. However, you may need to know more about the protocols’ syntax and meaning, and the RFCs are the best place to learn this information. One good online RFC repository is at http://www.rfc-editor.org/.

Cross-Reference
Refer to Chapter 15 for more information about the socket module and a quick overview of TCP/IP.
✦ Retrieving resources using Gopher
✦ Working with newsgroups
✦ Using the Telnet protocol
✦ Writing CGI scripts
Retrieving Internet Resources

The library urllib provides an easy mechanism for grabbing files from the Internet. It supports HTTP, FTP, and Gopher requests. Resource requests can take a long time to complete, so you may want to keep them out of the main thread in an interactive program. The simplest way to retrieve a URL is with one line:

    urlretrieve(url[,filename[,callback[,data]]])
The function urlretrieve retrieves the resource located at the address url and writes it to a file with name filename. For example:

    >>> import urllib
    >>> MyURL="http://www.pythonapocrypha.com"
    >>> urllib.urlretrieve(MyURL, "pample2.swf")
    >>> urllib.urlcleanup() # clean the cache!
If you do not pass a filename to urlretrieve, a temporary filename is generated for you. The function urlcleanup frees up resources used in calls to urlretrieve.

The optional parameter callback is a function to call after retrieving each block of a file. For example, you could use a callback function to update a progress bar showing download progress. The callback receives three arguments: the number of blocks already transferred, the size of each block (in bytes), and the total size of the file (in bytes). Some FTP servers do not return a file size; in this case, the third parameter is -1.

Normally, HTTP requests are sent as GET requests. To send a POST request, pass a value for the optional parameter data. This string should be encoded using urlencode.

To use a proxy on Windows or UNIX, set the environment variables http_proxy, ftp_proxy, and/or gopher_proxy to the URL of the proxy server. On a Macintosh, proxy information from Internet Config is used.
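A callback of the kind described above might look like the following sketch. The factory name make_progress_hook and the message formats are hypothetical, chosen only for illustration. The sketch assumes Python 3 and calls the hook by hand with sample values instead of performing a real download.

```python
def make_progress_hook(report):
    """Return a callback taking (blocks_done, block_size, total_size)."""
    def hook(blocks_done, block_size, total_size):
        transferred = blocks_done * block_size
        if total_size <= 0:
            # Size unknown (for example, -1 from some FTP servers).
            report("%d bytes so far" % transferred)
        else:
            # The last block may be partial, so cap at the total size.
            done = min(transferred, total_size)
            report("%d%%" % (done * 100 // total_size))
    return hook

messages = []
hook = make_progress_hook(messages.append)
hook(0, 8192, 16384)   # nothing transferred yet
hook(1, 8192, 16384)   # halfway
hook(3, 8192, 16384)   # block count overshoots; still reports 100%
hook(2, 8192, -1)      # unknown total size
```

A function with this three-argument shape could then be passed as the callback argument when retrieving a URL.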
Manipulating URLs

Special characters are encoded in URLs to ensure they can be passed around easily. Encoded characters take the form %##, where ## is the ASCII value of the character in hexadecimal. Use the function quote to encode a string, and unquote to translate it back to normal, human-readable form:

    >>> print urllib.quote("human:nature")
    human%3anature
    >>> print urllib.unquote("cello%23music")
    cello#music
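To show what the %## form means, here is a hand-rolled sketch of the encoding. It is a simplification and not the library's implementation: percent_encode and percent_decode are hypothetical helpers, the set of safe characters is an assumption, only ASCII input is handled, and the decoder assumes well-formed input. The sketch is written for Python 3.

```python
import string

# Characters left untouched by this sketch (an assumed, simplified set).
SAFE = string.ascii_letters + string.digits + "_.-"

def percent_encode(text):
    """Replace every character outside SAFE with %XX, its ASCII value in hex."""
    return "".join(ch if ch in SAFE else "%%%02X" % ord(ch) for ch in text)

def percent_decode(text):
    """Turn each %XX sequence back into the character it stands for."""
    pieces = text.split("%")
    result = pieces[0]
    for piece in pieces[1:]:
        result += chr(int(piece[:2], 16)) + piece[2:]
    return result

encoded = percent_encode("human:nature")
decoded = percent_decode("cello%23music")
```

The colon has ASCII value 58, which is 3A in hexadecimal, so it becomes %3A; hexadecimal digits in an escape are not case-sensitive, which is why %3a and %3A decode to the same character.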
Chapter 16 ✦ Speaking Internet Protocols
The function quote_plus does the encoding of quote, but also replaces spaces with plus signs, as required for form values. The corresponding function unquote_plus decodes such a string:

    >>> print urllib.quote_plus("bob+alice forever")
    bob%2balice+forever
    >>> print urllib.unquote_plus("where+are+my+keys?")
    where are my keys?
Data for an HTTP POST request must be encoded in this way. The function urlencode takes a dictionary of names and values, and returns a properly encoded string, suitable for HTTP requests:

    >>> print urllib.urlencode({"name":"Eric","species":"sea bass"})
    species=sea+bass&name=Eric

Cross-Reference
See the module urlparse, covered in Chapter 17, for more functions to parse and process URLs.
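For readers on a current interpreter, the same encoding steps can be sketched as follows. This assumes Python 3, where these functions live in the urllib.parse module rather than in urllib itself. The parse_qs call is included only to show the round trip.

```python
from urllib.parse import quote_plus, unquote_plus, urlencode, parse_qs

# Spaces become plus signs; a literal plus sign is percent-encoded.
encoded_value = quote_plus("bob+alice forever")
decoded_value = unquote_plus("where+are+my+keys?")

# urlencode builds a whole form body from names and values.
form_body = urlencode({"name": "Eric", "species": "sea bass"})

# parse_qs reverses urlencode; each name maps to a list of values.
round_trip = parse_qs(form_body)
```

Each value in the parsed result is a list, because a form may legitimately repeat the same field name.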
Treating a URL as a file

The function urlopen(url[,data]) creates and returns a filelike object for the corresponding address url. The source can be read like an ordinary file. For example, the following code reads a Web page and checks the length of the file (the full HTML text of the page):

    >>> Page=urllib.urlopen("http://www.python.org")
    >>> print len(Page.read())
    339
The data parameter, as for urlretrieve, is used to pass urlencoded data for a POST request. The filelike object returned by urlopen provides two bonus methods. The method geturl returns the real URL, which is usually the same as the URL you passed in, but possibly different if a Web page redirected you to another URL. The method info returns a mimetools.Message object describing the file.

Cross-Reference
Refer to Chapter 17 for more information about mimetools.
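The filelike behavior can be tried without a network connection. The sketch below assumes Python 3, where urlopen lives in urllib.request and info() returns a message object from the email package rather than a mimetools.Message. It uses a data: URL, which carries its own content, as a stand-in for a real Web address.

```python
import urllib.request

# A data: URL embeds the content itself, so no server is contacted.
url = "data:text/plain,Hello%20from%20a%20URL"

page = urllib.request.urlopen(url)
text = page.read()                              # read like an ordinary file
real_url = page.geturl()                        # the URL actually opened
content_type = page.info().get_content_type()   # from the message headers
page.close()
```

With an http:// address the same three calls apply, although geturl may then differ from the address passed in if the server redirects the request.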
URLopeners

The classes URLopener and FancyURLopener are what you actually build and use with calls to urlopen and urlretrieve. You may want to subclass them to handle new addressing schemes. You will probably always use FancyURLopener. It is a
subclass of URLopener that handles HTTP redirections (response codes 301 and 302) and basic authentication (response code 401). The opener constructor takes, as its first argument, a mapping of schemes (such as HTTP) to proxies. It also takes the keyword arguments key_file and cert_file, which, if supplied, allow you to request secure Web pages (using the HTTPS scheme).

Note
The default Python build does not currently include SSL support. You must edit Modules/Setup to include SSL, and then rebuild Python, in order to open https:// addresses with urllib.
Openers provide a method, open(url[,data]), that opens the resource with address url. The data parameter works as in urllib.urlopen. To open new url types, override the method open_unknown(url[,data]) in your subclass. By default, this method returns an “unknown url type” IOError.

Openers also provide a method retrieve(url[,filename[,hook[,data]]]), which functions like urllib.urlretrieve.

The HTTP header user-agent identifies a piece of client software to a Web server. Normally, urllib tells the server that it is Python-urllib/1.13 (where 1.13 is the current version of urllib). If you subclass the openers, you can override this by setting the version attribute before calling the parent class’s constructor.
Extended URL opening

The module urllib2 is a new and improved version of urllib. urllib2 provides a wider array of features, and is easier to extend. The syntax for opening a URL is the same: urlopen(url[,data]). Here, url can be a string or a Request object.

The Request class gathers HTTP request information (it is very similar to the class httplib.HTTP). Its constructor has syntax Request(url[,data[,headers]]). Here, headers must be a dictionary. After constructing a Request, you can call add_header(name,value) to send additional headers, and add_data(data) to send data for a POST request. For example:

    >>> # Request constructor is picky: "http://" and the
    >>> # trailing slash are both required here:
    >>> MyRequest=urllib2.Request("http://www.python.org/")
    >>> MyRequest.add_header("user-agent","Testing 1 2 3")
    >>> URL=urllib2.urlopen(MyRequest)
    >>> print URL.readline() # read just a little bit
The module urllib2 can handle some fancier HTTP requests, such as basic authentication. For further details, consult the module documentation.

New Feature
The module urllib2 is new in Python Version 2.1.
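A Request object can be built and inspected without sending anything, which makes it easy to see what headers and data do. The sketch below assumes Python 3, where the class lives in urllib.request and POST data is attached through the data attribute as bytes. The form body shown is only sample data.

```python
import urllib.request

request = urllib.request.Request("http://www.python.org/")
request.add_header("User-agent", "Testing 1 2 3")

# Header names are stored capitalized, so look them up in that form.
agent = request.get_header("User-agent")
method_without_data = request.get_method()

# Attaching a body turns the request into a POST.
request.data = b"name=Eric&species=sea+bass"
method_with_data = request.get_method()
```

Passing the finished object to urlopen would send it; the sketch stops short of that so it runs without a network connection.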
Sending HTTP Requests

HyperText Transfer Protocol (HTTP) is a format for requests that a client (usually a browser) sends to a server on the World Wide Web. An HTTP request includes various headers. Headers include information such as the URL of a requested resource, file formats accepted by the client, and cookies (parameters used to cache user-specific information); see RFC 2616 for details.

The httplib module lets you build and send HTTP requests and receive server responses. Normally, you retrieve Web pages using the urllib module, which is simpler. However, httplib enables you to control headers, and it can handle POST requests.
Building and using request objects

The constructor HTTP([host[,port]]) builds and returns an HTTP request object. The parameter host is the name of a host (such as www.yahoo.com). The port number can be passed via the port parameter, or parsed from the host name; otherwise, it defaults to 80. If you construct an HTTP object without providing a host, you must call its connect(host[,port]) method to connect to a server before sending a request.

To start a Web request, call the method putrequest(action,URL). Here, action is the request method, such as GET or POST, and URL is the requested resource, such as /stuff/junk/index.html.

After starting the request, you can (and usually will) send one or more headers, by calling putheader(name, value[, anothervalue,...]). Then, whether you sent headers or not, you call the endheaders method. For example, the following code informs the server that HTML files are accepted (something most Web servers will assume anyway), and then finishes off the headers:

    MyHTTP.putheader('Accept', 'text/html')
    MyHTTP.endheaders()
You can pass multiple values for a header in one call to putheader. After setting up any headers, you may (usually on a POST request) send additional data to the server by calling send(data).

Now that you have built the request, you can get the server’s reply. The method getreply returns the server’s response in a 3-tuple: (replycode, message, headers). Here, replycode is the HTTP status code (200 for success, or perhaps the infamous 404 for “resource not found”).

The body of the server’s reply is returned (as a file object with read and close methods) by the method getfile. This is where the request object finally receives what it asks for.
For example, the following code retrieves the front page from www.yahoo.com:

    >>> import httplib
    >>> Request=httplib.HTTP("www.yahoo.com")
    >>> Request.putrequest("GET","/")
    >>> Request.endheaders()
    >>> Request.getreply()
    (200, 'OK', <mimetools.Message instance at 0085EBD4>)
    >>> ThePage=Request.getfile()
    >>> print ThePage.readline()[:50]
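The same putrequest, putheader, and endheaders sequence can be sketched on a current interpreter. This is an approximation under stated assumptions, not the book's API: it assumes Python 3, where the closest counterpart is http.client.HTTPConnection and the reply comes from getresponse() rather than getreply() and getfile(). To stay self-contained, it talks to a tiny local server (the hypothetical HelloHandler) instead of a real Web site.

```python
import http.client
import http.server
import threading

class HelloHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical local server that answers every GET with a fixed page."""
    def do_GET(self):
        body = b"<html>hello</html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet

# Port 0 asks the operating system for any free port.
server = http.server.HTTPServer(("127.0.0.1", 0), HelloHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
conn.putrequest("GET", "/")              # start the request
conn.putheader("Accept", "text/html")    # send a header
conn.endheaders()                        # finish the headers
response = conn.getresponse()
status, reason = response.status, response.reason
body = response.read()                   # the body, read like a file
conn.close()

server.shutdown()
server.server_close()
```

The status and reason values correspond to the first two items of the tuple that getreply returns, and the response object plays the role of the file returned by getfile.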
” traceback.print_exc()
Listing 16-6: Feedback.html
Feedback form
Submit your comments