VOLUME III - ISSUE 4
APRIL 2004
The Magazine For PHP Professionals
Internationalization! Multi-lingual websites made easy
Advanced Regular Expressions
Improve your 404 error handling
Writing web services with SOAP
Interpreting PDF Files
www.phparch.com
Exclusive: Zend on the new WinEnabler™
Plus:
Tips & Tricks, Security Corner, Product Reviews and much more...
TABLE OF CONTENTS

php|architect

Departments
5 Editorial
6 What’s New!
32 Product Review: Lumenation Report Builder
55 Security Corner: SQL Injection, by Chris Shiflett
59 Tips & Tricks, by John W. Holmes
63 exit(0); Want to Share? Come to Canada!, by Marco Tabini

Features
9 Four, oh four!, by Ilia Alshanetsky
16 Meet Your Match: Advanced Regular Expressions, by George Schlossnagle
24 A Trip Down PDF Lane, by Marco Tabini
38 Zend Does it Again, by Marco Tabini
41 Smarty and Internationalization, by John Coggeshall
48 Creating Web Services with PHP and SOAP, by Alessandro Sfondrini
EDITORIAL
Spring is here, and so is another issue of php|architect. In fact, it seems to me that, while this issue of php|a will, hopefully, reach you on time, the postal service must have somehow forgotten to deliver spring here in Southern Ontario, where the air is still quite on the crisp side. On the plus side, colder weather means that one doesn’t mind having to sit inside and work that much, at least in theory. In practical terms, I would much rather be writing this editorial sitting on the edge of a swimming pool in the company of a piña colada, but, unless I crank up the heating, buy a sunlamp and fill the bathtub, it doesn’t look like that is going to happen any time soon. Oh, well.

This month’s spotlight falls on the topic of website internationalization. As John justly notes in his article, developers sometimes lack the foresight needed to write websites that can easily serve people from different locales; or, as he puts it, the “works for me” syndrome is in full effect. Internationalization is a complex topic that is not easy to address using automated tools, although a large portion of it can be handled via software.

For some countries, internationalization is a reality that must be dealt with even if you are not expecting your site to attract visitors from outside your borders. Canada, for example, is a bilingual country where both English and French (Canadian French, to be accurate) are spoken. Thus, many vendors require that all their materials be produced in both languages, and websites are no exception.

The problem with producing a website for multiple locales is that there is more to it than simply translating every sentence. For example, numbers are formatted differently in English and French: the former uses a comma to separate the thousands and a dot to indicate the beginning of the decimal part, while the latter does the exact opposite.
Interestingly enough, PHP offers a wide variety of functions to handle localization, and John’s article fits in with them by providing a mechanism for Smarty developers to provide content in several different languages. Now, if we could only find a way to make computers render proper translations automatically, our troubles would really be over. Alas, we are not quite there, as you probably know if you have ever tried to use one of the free translation tools available on the Net (at least, the translation tools are good for a laugh).

Elsewhere among these pages, George Schlossnagle once again delves into the intricacies of regular expressions, uncovering some great techniques, such as comments, that would probably benefit most regex users. As I mentioned last month, I consider regular expressions to be one of the great tools in the programmer’s arsenal, and I just can’t get enough of
php|architect
Volume III - Issue 4, April 2004

Publisher: Marco Tabini
Editorial Team: Arbi Arzoumani, Peter MacIntyre, Eddie Peloke
Graphics & Layout: Arbi Arzoumani
Managing Editor: Emanuela Corso
Director of Marketing: J. Scott Johnson
Account Executive: Shelley Johnston
Authors: Ilia Alshanetsky, John Coggeshall, George Schlossnagle, Alessandro Sfondrini, Marco Tabini

php|architect (ISSN 1709-7169) is published twelve times a year by Marco Tabini & Associates, Inc., P.O. Box 54526, 1771 Avenue Road, Toronto, ON M5M 4N5, Canada. Although all possible care has been placed in assuring the accuracy of the contents of this magazine, including all associated source code, listings and figures, the publisher assumes no responsibilities with regards of use of the information contained herein or in all associated material.

Copyright © 2003-2004 Marco Tabini & Associates, Inc. All Rights Reserved
Continued on page 8... April 2004
●
PHP Architect
●
www.phparch.com
NEW STUFF

What’s New!

PHP 5 RC1
PHP.net has announced the release of PHP 5 RC1: “The move from Beta stage to RC stage means that PHP 5 is now feature complete, and is quite stable - stable enough for everyone to start playing with. Note that it is still not recommended for mission-critical use. Some of the key features of PHP 5 include:
• The Zend Engine II with a new object model and dozens of new features.
• XML support has been completely redone in PHP 5, all extensions are now focused around the excellent libxml2 library (www.xmlsoft.org).
• A new MySQL extension named MySQLi for developers using MySQL 4.1 and later. This new extension includes an object-oriented interface in addition to a traditional interface; as well as support for many of MySQL’s new features, such as prepared statements.
• SQLite has been bundled with PHP. For more information on SQLite, please visit their website.
• A brand new built-in SOAP extension for interoperability with Web Services.
• A new SimpleXML extension for easily accessing and manipulating XML as PHP objects. It can also interface with the DOM extension and vice-versa.
• Streams have been greatly improved, including the ability to access low-level socket operations on streams.”
PHP 4.3.6 RC1
This release addresses two major bugs introduced in the 4.3.5 release. One of these bugs caused problems when loading dynamic extensions on Windows and thread-safe (ZTS) builds, and the other involves incorrect handling of daylight saving time. A few other minor bugs were fixed as well. For more information visit: www.php.net

ZEND WinEnabler
If you run PHP on Windows, the ZEND WinEnabler is worth a look. ZEND announces: “The Zend WinEnabler is a new product specifically designed to enable PHP to run in Windows environments. In terms of stability and performance, WinEnabler is the only solution that brings PHP on Windows up to par with PHP on Linux. Zend WinEnabler finally lets Windows PHP applications deliver as much as Linux PHP applications do.” Get more information from Zend.com, or read more about the WinEnabler in this issue’s exclusive interview with Zend’s Rinat Gersch.

Open PHPNuke 2.0.0
Open PHPNuke announces the 2.0.0 release. Changes include better overall performance, while new modules include a spam honeypot and jobs. Get more information from openphpnuke.com
Top PHP Studio v1.19.6
Top Systems is proud to announce the release of Top PHP Studio v1.19.6. Top PHP Studio is an Integrated Development Environment for PHP. It offers an internal HTTP server, internal browser, built-in FTP client, syntax highlighting, file/server explorer, function/parameter/tag completion, code snippets and more. Key features:
• Executing PHP scripts through the internal HTTP server
• Preview of scripts in the internal browser (requires MSIE 4.0+)
• Built-in FTP client
• Syntax check
• PHP function and parameter completion
• HTML tag completion
• Visual file comparison utility
• Configurable syntax highlighting for PHP, HTML, SQL, CSS...
• Integration with PHP documentation
• Server explorer
• Local file system view
• Code snippets
• Advanced code editing capabilities
• Tabbed multiple-document interface
For more information visit: www.top-systems.net

PHPXref 0.5
PHPXref is a utility to cross-reference large PHP projects and generate pure HTML documentation from them. No web server is required to view the resulting docs. This version features numerous new features, bug fixes and usability enhancements, including cross-referencing of constants, search history, inclusion of PHP5 functions, etc. Some of the features are:
• Minimal requirements, minimal setup
• No web server required to view output
• Cross-references PHP classes, functions, variables, constants and require/include usage
• Extracts phpdoc style documentation from source files
• JavaScript enhanced output provides: mouse-over information for classes and functions in the source view; hot-jump to the source of any class/function definition; instant lookup of classes, functions, constants and tables by name; search/lookup history
• Version 0.5 has the biggest changelog so far, including cross-referencing of constants, a search history so you can get back to that function you were looking at, a link to the original plain-text version of the source, variable highlighting (mouse over a variable name that occurs on screen multiple times in the source view in IE or Mozilla), plus various updates and bug fixes
For more information visit: www.phpxref.net
Looking for a new PHP Extension? Check out some of the latest offerings from PECL.

Fann 0.1.0
Fann (fast artificial neural network library) implements multilayer feedforward networks with support for both fully connected and sparse connected networks.

imagick 0.9.9
Provides a wrapper to the ImageMagick/GraphicsMagick library. It’s a native PHP extension. You need the ImageMagick libraries from www.imagemagick.org to get it running.

mdbtools 0.1
mdbtools provides read access to MDB data files as used by Microsoft Access and its underlying JetEngine. It is based on libmdb from the mdbtools package available at http://mdbtools.sourceforge.net/

uuid 1.0
This extension provides functions to generate and analyse universally unique identifiers (UUIDs). It depends on the external libuuid. This library is available on most Linux systems; its source is bundled with the ext2fs tools.

memcache 0.4
Memcached is a caching daemon designed especially for dynamic web applications to decrease database load by storing objects in memory. This extension allows you to work with memcached through handy OO and procedural interfaces.
Check out some of the hottest new releases from PEAR.

MP3_ID 1.1.1
The class offers methods for reading and writing information tags (version 1) in MP3 files.

PECL_Gen 0.8.3
PECL_Gen (formerly known as ext_skel_ng) is a pure PHP replacement for the ext_skel shell script that comes with the PHP 4 source. It reads in configuration options, function prototypes and code fragments from an XML description file and generates a complete ready-to-compile PECL extension.

Cache 1.5.4
With the PEAR Cache you can cache the result of certain function calls, as well as the output of a whole script run or share data between applications.

XML_sql2xml 0.3.2
This class takes a PEAR::DB-Result Object, a sql-query-string, an array and/or an xml-string/file and returns an XML representation of it. It relies on the DOMXML extension of PHP.
Editorial: Continued from page 5

them, and I think George is doing a great job at demystifying them for us all.

On the web services front, Alessandro Sfondrini officially brings his trilogy to an end with an article about creating your own service, either ex novo or with the help of NuSOAP. If you have had the chance to read the last two issues of php|a, you should now have all the elements you will ever need to handle web services in PHP from either the client or server side. And, if you didn’t get a chance to read the last two issues of php|a... there is still time to get your hands on them through our website.

Every time your server registers a 404 error (Page Not Found), you are, potentially, missing out on a visitor. For example, on an average day the php|a website registers something like 45,000 of them, and they are caused by the most varied reasons: a misspelled link, a pointer to a resource that no longer exists on our site, and so on. It seems fitting, therefore, that Ilia Alshanetsky should write an article for us on how to manage these errors and turn them to your advantage. The techniques that Ilia uses show, once again, the flexibility and power of both Apache and PHP: with a couple of commands and a well-written script, 404 errors become a thing of the past, and users will never get lost on your site.

Finally, after a long hiatus, yours truly is back at the old typewriter for an article on the structure of PDF files. As you can imagine, we do a lot of work with files formatted using Adobe’s brainchild specification here at php|a headquarters, and I figured it was time to share some of the lessons I have learnt with you. Naturally, the PDF format is so complex that it would be impossible to provide a complete guide to all of its features (unless, of course, we wanted to call this magazine “pdf|architect”), but over the course of the next couple of months I hope to provide you with enough information to do what few libraries can: modify the contents of a PDF file.

As you can see, there’s lots waiting for you in the next sixty pages: not just the feature articles, but all our great columns, as well as an in-depth look at Zend’s new WinEnabler product. Therefore, don’t let me keep you. Another issue of php|a is waiting, and I know you don’t want to hear (or read) me rambling on too long. Until next month, happy reading!
php|a
Four, oh four!
FEATURE
by Ilia Alshanetsky
One of the most common errors encountered on the net is the dreaded “404”, which indicates that the requested page is not found. To make things worse, this error has so many potential causes that it is virtually impossible to completely eliminate all of them. In this article, we will review some of the preventative steps that can be used to avoid 404s and, in the instances where they cannot be avoided, handle them as gracefully as possible.
REQUIREMENTS
PHP: Any
OS: Any
Other software: LinkChecker, LinkStatus, gURLChecker
Code Directory: 404

While it is easy to blame users for carelessness in entering website URLs, a fairly large portion of page-not-found (404) errors can be laid squarely on the shoulders of developers. Web developers are not infallible; like everyone else, they tend to make typos when writing links or forget to modify links after a page has been renamed.

Perhaps the most common problem that leads to a 404 message is a misspelled page name. For example, Developer A creates a new page that is linked to from several other pages, but, unfortunately, the page has a spelling error in its name. In comes Developer B, who notices the problem and decides to correct it by fixing the spelling. Since it is likely that the second developer is not aware of all the links leading to the newly renamed page, some of these links will be left unmodified, which results in 404 errors when users try to follow the “hardcoded” links on the other pages. The road to Hell is paved with good intentions.

So, can this problem be resolved? Ideally, in a multi-developer environment it is a good idea to contact the original developer, make them aware of the problem and, if possible, suggest a reasonable solution. The person who wrote the code should be more familiar with it, and is less likely than a third party who happened to notice the problem to overlook a page affected by the change. Alternatively, after renaming a page, take a moment to use a few command-line tools to check whether any pages link to the renamed page and, if any are found, adjust them accordingly. For example, after renaming wlcome.php to welcome.php, I use the grep command, which is available on any UNIX-based system (as well as for Windows), to go through my web files looking for instances of wlcome.php so that they can be adjusted to match the new name, as in the following example:

    mv wlcome.php welcome.php
    grep -r -I -n wlcome.php *

Since it is extremely likely that more than one directory contains web-viewable files, I always append the -r switch to make grep search recursively through the entire directory structure looking for the old file name. At the same time, I do not want to search through binary files like images or videos, as they would not only make the search process extremely slow but may also result in false positives. To that end, I add another switch, -I, that tells grep to first check whether the file is plain text or binary and, in the case of the latter, to not scan the file for instances of wlcome.php. For the files that do have instances of wlcome.php, we do not want to spend a lot of time looking for the broken link in question, so we add another flag, -n. This flag will make grep show the line numbers in addition to the file names where wlcome.php was found, thus simplifying the process of fixing the links with a text editor later on.

Please do keep in mind that I’m leaving the task of actually replacing the links for myself, just so that I have complete control over what gets changed. Still, in some cases, you may find yourself in a situation where a large number of files need to be modified, and performing the changes by hand could take hours of frustrating labor that can end up producing even more broken links. To accelerate the link correction process, we turn to Perl, which can be used to quickly and efficiently perform a find and replace operation on any number of files.
“Despite all of the possible reasons for a broken page to happen, the vast majority of not found pages can be blamed on users...”

    perl -p -i -e 's/wlcome\.php/welcome.php/g' *.html *.php

The example above will make Perl go through all of the HTML and PHP files inside the current directory, replacing the old page name with the new one. As convenient as that may be, there are a few problems with that approach. Because we give Perl a list of all the files inside a directory, rather than just those that need to be modified, all files matching the specified masks will be modified, and that is something we want to avoid, particularly if our project is under source control. Additionally, unlike grep, our Perl command is not recursive, meaning that it will need to be executed for every directory inside of the web tree. This, however, can be easily fixed by using the find command, which is recursive, to find all of the files that may contain instances of an old page name.

    perl -p -i -e 's/wlcome\.php/welcome.php/g' \
        `find . -name \*.html -o -name \*.php | xargs`

The modified example will now find all PHP and HTML files and, using the xargs utility, convert the output from one file per line to a single line containing all of the file names, so that they can be safely passed to Perl for processing. In an ideal situation, however, we would, as I mentioned, only work with the files that actually contain the string wlcome.php, thus avoiding pointless modification of all files.

    perl -p -i -e 's/wlcome\.php/welcome.php/g' \
        `grep -rI wlcome.php * | awk -F : '{ print $1 }' \
        | sort | uniq | xargs`

In the latest revision, we go back to grep, which will be used to find all of the files with the old file name. We pipe the output to awk, which separates the data based on a colon character and prints the first portion of the resulting string; this happens to be the filename. The output is then piped, yet again, to sort, which arranges the files by their names; this, in turn, is necessary for our next utility, uniq, which will remove duplicate filenames from the list and can only do so if the input data is sorted. Once we have our list of files, we once again use xargs to arrange the data in a format Perl can understand and pass the list of files to be modified to the Perl interpreter, which then performs the replacement operation.

This whole process should take just a few seconds. In fact, it may take you longer to key in the commands correctly than it would take to run them! However, if you have several thousand files to work with, this would be a perfect time to grab a cup of coffee, as the process may stretch into several minutes.
Taking Care of Orphans

Of course, not all broken links are the result of bad renaming operations. Most websites go through a series of redesigns, feature enhancements and improvements that often result in orphaned links leading to 404s and/or orphaned pages that are no longer referenced from any point on the site, giving the users who access those pages (for example, after following links from a search engine) outdated and possibly inaccurate information. It would be clearly impractical to go through every link, image tag and other external reference on all your pages by hand, so, once again, automated tools are needed.

Fortunately, link checking is not a new concept and there is a large variety of applications designed to help in this task, the majority of which are free to use. My current favorite is a utility written in Python, aptly called “LinkChecker,” that can be downloaded from linkchecker.sourceforge.net/. You can run this utility from both UNIX and Windows machines, making it an excellent choice regardless of your operating system preference. Because this utility is a command-line tool, it can be executed on the server itself, making the process that much faster, since running a link checker from a remote source can be quite slow, especially for large sites with hundreds if not thousands of pages. The application can produce its output in several formats, such as plain text, XML, SQL, HTML, and so on. In addition, LinkChecker has plenty of options that allow ample freedom for advanced users and yet are simple enough to be used by novices.

    linkchecker -v -r10 -s -ohtml \
        http://mysite.com/ > linkchk.html
The command above will make LinkChecker validate the specified site and output an HTML report into linkchk.html for future review. The -v parameter causes the utility to print valid pages as well, which would then allow for the comparison of the page names on the hard drive with the ones LinkChecker came across, making it possible to identify orphaned documents. The -r10 option will make LinkChecker go up to 10 directory levels deep in every URL (thus preventing infinite loops caused by dynamic pages), and -s will ensure it won’t wander off from the specified domain; you really don’t want to end up validating the entire Internet.

Of course, you don’t have to use command-line utilities whose plethora of options may seem a little scary; there are also plenty of GUI link checkers that can do the job just as well. One such utility is called “LinkStatus” (linkstatus.paradigma.co.pt/) and, like LinkChecker, it, too, can work on both Windows and UNIX. For Windows users, the developers offer a downloadable executable and a series of win32 DLLs that you will need to install, while UNIX users will need to spend a few minutes compiling the sources (which require the Qt library). The overall functionality of LinkStatus is, in my opinion, a tad limited compared to LinkChecker, but more than sufficient for the majority of web sites.

Another GUI link checker, “gURLChecker” (www.nongnu.org/gurlchecker/), is built around the GTK widget set. It gives you equivalent functionality to LinkChecker, but, unfortunately, won’t work on Windows, so at this time its user base is limited to UNIX users.

Catching the Uncatchable

As great as all these automated tools are, there will still be some situations where invalid URLs could slip through. These instances occur, for example, when the URL is generated by JavaScript or is located inside a tag that the link checkers do not yet recognize as a valid container of URLs.
In such situations, you need a second line of defense: a logging mechanism that allows you to detect and track 404 errors as soon as they occur and provides graceful error handling in the eyes of the end user. The generated log should provide sufficient information to allow the culprit behind the error to be tracked, thus providing an easy solution to the problem.

The first step in this process is getting rid of the default 404 message provided by your web server and creating a 404 handler of your own. This is a very important step, since in addition to gathering information, you will be able to present users with a custom page that helps them locate the page they are really looking for. Not only does this enhance the user experience, but it also prevents some browsers and browser modules, when presented with a 404 error, from taking the user away from your website to a search engine of their choice with the given URL as the search query. The following Apache directive will allow you to take a wayward visitor to a 404 page of your own design.

    # Apache 404 Error Handler
    ErrorDocument 404 /404.php
So what should a 404 handler do? The first step is to gather as much information as possible and write it to a log file, or a database if possible. The advantage of a database is the ability to easily query the errors via SQL, as well as to keep count of how many times a particular error occurred, rather than blindly storing many duplicates of the same problem. This will allow you to sort errors by their frequency and focus on the more common problems, rather than working on a first-come, first-served basis.

You may also want to rate the errors based on the nature of the request. A POST request that leads to a 404 is probably far more critical than a GET request, since the former implies loss of data that a user has sent. This is especially nasty if the form had some user data, since you will not only lose the information, but also end up with users upset at you because of all the time they wasted filling out a form that leads nowhere. Furthermore, POST requests usually originate from your own site, meaning that your pages themselves are broken, whereas GET request failures could have occurred for any number of reasons and are not necessarily your fault.

It is also a good idea to break down 404 errors into types, based on the extension of the missing files. This will further improve prioritization of the errors, allowing for the separation between missing static and dynamic pages. In most instances, missing dynamic pages are more important than their static counterparts, so you’ll probably want to address them first, even if they are accessed less frequently.

Another critical component of error logging is checking the referring page. This piece of information will allow you to determine where the user came from and what page has a broken link. If the referral is a local page, you can quickly address the problem, while if the request has a remote site as its source you will need to communicate with the webmaster of that site and get them to update their references to your site (or, alternatively, you can provide a stub redirect that points to the right resource). When communicating with other webmasters, always include the source page on their server containing the broken link to help them resolve the problem. Generic messages like “fix your links ASAP” tend to attract the attention of spam filters rather than the people who are supposed to fix your problems.

Listing 1 contains a working example of a 404-handler script that will gather all available information and log it to an SQLite database for future analysis.

You may discover that not all 404s are actual errors. In today’s world of worms, viruses and vulnerability-seeking scripts, a certain number of the 404 events that you will see on your server will be the result of these entities. The majority of those can be attributed to still-vulnerable IIS servers that keep trying to find more IIS servers to infect. These are quite harmless to all but unpatched IIS servers. Still, when these requests are encountered, it is better to add a special handler to identify the vulnerable servers and, if possible, deny them access to your server via your firewall. This will prevent them from wasting your bandwidth with bogus requests and populating your logs with garbage data.

Some of these requests will come from automated utilities seeking to determine what type of software is running on your server, with the ultimate goal of determining whether it is vulnerable to exploits available for common web packages. When faced with such requests, it is imperative to identify them. If you see that they are originating from one source, deny that IP access to the server, since it is possible that it may eventually stumble across a piece of vulnerable software that you are using. While this certainly does not mean you can skip reviewing your applications and upgrading them when appropriate, denying access to malicious users does give you a small measure of added security.
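The triage ideas in this section (frequency, request method, file type and referring page) can also be tried out on an ordinary web server log before a handler is in place. The following is a minimal sketch, not the article's Listing 1: it assumes Apache's "combined" log format, and the sample log lines and the host name www.example.com are made up for the illustration.

```shell
# Hypothetical sample log in Apache "combined" format.
log_file=$(mktemp)
cat > "$log_file" <<'EOF'
192.0.2.10 - - [12/Apr/2004:10:00:01 -0400] "GET /wlcome.php HTTP/1.1" 404 209 "http://www.example.com/index.php" "Mozilla/4.0"
192.0.2.11 - - [12/Apr/2004:10:00:05 -0400] "GET /wlcome.php HTTP/1.1" 404 209 "-" "Mozilla/4.0"
192.0.2.12 - - [12/Apr/2004:10:01:10 -0400] "POST /signup.php HTTP/1.1" 404 209 "http://www.example.com/join.html" "Mozilla/4.0"
192.0.2.13 - - [12/Apr/2004:10:02:30 -0400] "GET /index.php HTTP/1.1" 200 5120 "-" "Mozilla/4.0"
192.0.2.14 - - [12/Apr/2004:10:03:00 -0400] "GET /logo.gif HTTP/1.1" 404 209 "http://other.example.net/links.html" "Mozilla/4.0"
EOF

# Field 9 is the status code, field 6 the quoted method, field 7 the path.
# Most frequent missing URLs first, printed as "count method path".
by_frequency=$(awk '$9 == 404 { gsub(/"/, "", $6); print $6, $7 }' "$log_file" \
    | sort | uniq -c | sort -rn | awk '{ print $1, $2, $3 }')

# POST requests that ended in a 404 imply lost form data, so list them apart.
post_failures=$(awk '$9 == 404 && $6 == "\"POST" { print $7 }' "$log_file")

# Break the misses down by file extension (query strings are stripped first).
by_extension=$(awk '$9 == 404 {
        path = $7
        sub(/\?.*/, "", path)
        n = split(path, parts, ".")
        print (n > 1 ? parts[n] : "(none)")
    }' "$log_file" | sort | uniq -c | awk '{ print $2, $1 }')

# Classify the referring page (field 11) as local, remote or missing.
site_host="www.example.com"   # assumption: the host name of your own site
by_referrer=$(awk -v host="$site_host" '$9 == 404 {
        ref = $11
        gsub(/"/, "", ref)
        if (ref == "-" || ref == "")          kind = "none"
        else if (index(ref, "://" host "/"))  kind = "local"
        else                                  kind = "remote"
        count[kind]++
    }
    END { for (k in count) print k, count[k] }' "$log_file" | sort)

printf '%s\n' "$by_frequency" "$post_failures" "$by_extension" "$by_referrer"
```

A log-based summary like this only sees what the web server records, so it complements a dedicated handler rather than replacing one. Field positions would need adjusting for a custom log format.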
Blame the User

Despite all of the possible reasons for a broken page to happen, the vast majority of not-found pages can be blamed on users, who make typos while typing a web address or have their URLs somehow mangled by their e-mail, IRC, or other client tools. In these situations, the job of the error handling mechanism is twofold. First of all, the error should be logged, allowing you to determine what kinds of spelling mistakes your users make and how often they make them. If you see that a particular typo is extremely common, perhaps you should make a symlink to the real page so that users get the requested page right away, or set up a redirect if the operating system you are using does not support symlinks. An example follows:

ln -s extremelylongpagename.php extrimelylongpagename.php

The logging approach is also useful for research purposes. If a particular typo is extremely common, you can reconsider the URL’s length or spelling and come up with a different name that is easier to key in. Moreover, many applications (e-mail programs are particularly good at this) can break long URLs through their wrapping mechanisms or other alterations. This is particularly common for URLs with long query strings containing a vast range of characters, as many applications will confuse some of these characters with the end of a URL, effectively breaking the URL. Shorter URLs are also easier to remember, and users have a much smaller window of opportunity to make a typo. If the URL must be long, try to use short, well-known words that are written phonetically, as that will dramatically reduce the likelihood of typos and misspellings.

How to identify user errors?

The majority of user errors have one easily identifiable property: they lack the referring URL. Since these URLs often originate from a non-HTTP source, like e-mail or instant messages, the browser has no back history to refer to and does not send the referring field. The same situation applies to a manually-entered URL: even if the user’s browser does have a back history, it is not relevant to this visit, since the user did not follow a link from the previous site to yours. Therefore, the rule of thumb regarding URLs without a referrer is that they are manually-entered URLs. There are only two exceptions to this rule, one being overly paranoid browsers and proxies that try to protect the user by not sending the referring URL even if there is one. Fortunately, if there is a genuine broken link you will likely see a lot more than one error regarding that particular document, and at least one user will have their browser send a referring URL, thus allowing you to track down the error. The other exception is bookmarks; they, too, will not provide a referring URL, although some browsers will try to be helpful and send ‘-’ to assist you in identifying them. This is not a big issue, since broken bookmarks usually are the result of site alterations where pages have been renamed or removed.

In such situations, you may want to leave a pointer to the new page for a period of time to allow your user base to adjust to the changes. Consider the following example:

The page you’ve requested has been moved to: http://somesite.com/new/url. Please update your bookmarks; you will be redirected to the new page in 5 seconds.
Listing 1: 404-handler script that logs request details to an SQLite database (listing body not recovered)
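Since the body of Listing 1 is not available here, the following is only a minimal sketch of the same idea, not the author's script. It assumes a modern PHP with the PDO SQLite driver; the table name, column names and the log_404() helper are hypothetical.

```php
<?php
// Sketch only: record the details of a 404 hit in an SQLite table.
// Table and column names are illustrative assumptions.
function log_404(PDO $db, array $server): void
{
    $db->exec('CREATE TABLE IF NOT EXISTS not_found (
        hit_time TEXT, uri TEXT, referer TEXT, ip TEXT, agent TEXT)');

    $stmt = $db->prepare('INSERT INTO not_found VALUES (?, ?, ?, ?, ?)');
    $stmt->execute([
        date('c'),                          // when the request happened
        $server['REQUEST_URI'] ?? '',       // what was asked for
        $server['HTTP_REFERER'] ?? '',      // empty for typed or e-mailed URLs
        $server['REMOTE_ADDR'] ?? '',       // who asked
        $server['HTTP_USER_AGENT'] ?? '',   // helps to spot worms and scanners
    ]);
}

// An in-memory database keeps the sketch self-contained;
// a real handler would point the DSN at a file instead.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
log_404($db, $_SERVER);
```

An empty referer column is what later lets you separate typed or mangled URLs from genuinely broken links.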
Even with all these precautions, you will still encounter 404s, which goes to show that the creativity of the end user is not to be underestimated. This brings us to the second portion of the error handler, whose job is to present users with some helpful information, hopefully allowing them to find the page they are looking for with a minimum of effort. After all, a happy user is more likely to return to your site, buy your wares and tell all their friends what a great developer you are (or not). The first thing to do is to make your error page helpful, rather than displaying a “milk carton” image with a picture of your page on it, which is surprisingly common. If possible, try to include either a sitemap of your site or, better yet, a search engine that will allow users to locate the page they are looking for. If your site has a search engine, you may be able to grab the page name and pass it to the search. That way, you can present the user with a list of similar pages right away, rather than having the user execute the search manually. If you do not have a search engine, you can use the power of existing search engines, such as Google, that have probably indexed your site. These search engines may even have an older copy of the page (if it has been moved), allowing the user to find the location of the new page via the old copy. Additionally, if the error occurred due to a typo, the search engine may present a correctly spelled alternative, which would then lead the user to the page they are looking for. The following example shows one way to accomplish this:
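The original example did not survive in this copy, so what follows is a hedged sketch rather than the author's code. It assumes Google's public search URL with a site: restriction; the site_search_url() helper is hypothetical.

```php
<?php
// Sketch only: build a Google search link restricted to our own host,
// using the missing page's file name (without extension) as the query.
function site_search_url(string $host, string $requestUri): string
{
    $path = (string) parse_url($requestUri, PHP_URL_PATH);
    $term = pathinfo($path, PATHINFO_FILENAME);

    return 'https://www.google.com/search?q=' . urlencode("site:$host $term");
}

echo site_search_url('www.example.com', '/docs/instalation.php?x=1'), "\n";
// https://www.google.com/search?q=site%3Awww.example.com+instalation
```

The resulting link can be printed on the error page or used as the source of an embedded frame.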
You can embed the above example in your error page, which will display a Google results page for the missing document on the domain that was accessed. It is very important that you limit the results to just your domain so that you don’t end up losing or confusing visitors with search results not relevant to your site. You can also access Google directly through the Google Web Services API, which Alessandro Sfondrini explored in the January 2004 issue of php|a. Another remote site that may offer helpful information is the Wayback Machine, which can be found at http://archive.org. This site’s whole purpose is to periodically archive any websites its spider can get its mandibles on. If your website has been around for a while, it is very likely that this archive has a copy of it. When a page cannot be found, you can display another window linking to this archive in an attempt to retrieve an older version of the page if one is available; perhaps the older page will lead the user to the new version. The syntax is much simpler than Google’s: simply pass the missing URL to http://archive.org and it will attempt to display the latest copy of that page it has available. For example:
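The original snippet is missing here. As a rough sketch, and assuming the archive's web.archive.org/web/ prefix form (the exact URL layout is an assumption, as is the wayback_url() helper), the link could be built like this:

```php
<?php
// Sketch only: link to the Wayback Machine's copy of a missing page.
// The "web.archive.org/web/<url>" layout is assumed, not taken from the article.
function wayback_url(string $host, string $requestUri): string
{
    return 'http://web.archive.org/web/' . 'http://' . $host . $requestUri;
}

echo wayback_url('www.example.com', '/old/page.php'), "\n";
// http://web.archive.org/web/http://www.example.com/old/page.php
```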
As we have seen, in many situations 404 errors are due to very simple typos in the URL, caused by the user spelling portions of the URL phonetically or pressing a key adjacent to the one needed. The latter, in particular, is a big issue for laptop users, who are often forced to type on tiny keyboards whose layout is slightly different from their full-size desktop counterparts. In these situations, it may be possible to use PHP to determine what page the user is looking for and redirect the user to that page. As Listing 2 demonstrates, this is a fairly simple thing to implement and it makes for a much more pleasant user experience, as typos are handled transparently. This functionality is made possible by PHP’s metaphone() function, which generates a key based on how a particular string would be pronounced. The key of a missing file can then be compared to how valid files in the specified directory would sound and, if there is a match, you can transparently take the user to the matching page. If there happens to be more than one match, the user can be presented with a list of possibilities and choose one of the alternatives.

Listing 2: metaphone()-based lookup of the intended page (listing body not recovered)

To help the user determine which page they are looking for, you can use the get_meta_tags() function to retrieve the description and title of the page from any meta tags that are present in it. The following example demonstrates this thought:
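The example itself is not available in this copy. The sketch below only covers the static-file case and is an illustration, not the author's code; the describe_page() helper and its fallback to the file name are assumptions.

```php
<?php
// Sketch only: pull a human-readable description out of a static page.
// get_meta_tags() returns an array keyed by lower-cased meta names,
// or false if the file cannot be read.
function describe_page(string $file): string
{
    $tags = @get_meta_tags($file);
    if ($tags === false) {
        return basename($file);
    }

    // Fall back to the file name when no description meta tag exists.
    return $tags['description'] ?? basename($file);
}
```

Called on each candidate file, this gives you one line of text to show next to every link in the list of suggestions.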
You may wonder why there is a distinction made between dynamic PHP files and static files. With dynamic files, the meta tags may be generated by the script and may vary based on the input or may possibly be originating from some included file. Therefore, we need to emulate a request and fetch the data from the generated HTML. For static files, there is no such problem and, therefore, the meta information can be retrieved directly from the file on the drive without making a much slower web request. If metaphone() does not produce any results, you can try another PHP function, similar_text(), which will allow you to compare the similarity of two strings based on the characters contained within them. In most cases, this is a much less accurate comparison than the one performed by metaphone(), but in a situation where metaphone() fails, it is better than nothing.
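The two-stage lookup described above, metaphone() first and similar_text() as a fallback, can be sketched as follows. This is an illustration under stated assumptions rather than the article's listing: find_candidates() is a hypothetical helper and it compares bare file names without extensions.

```php
<?php
// Sketch only: suggest existing pages for a missing one.
// Stage 1 returns files that sound the same; stage 2 ranks by similarity.
function find_candidates(string $missing, array $files): array
{
    $key = metaphone($missing);
    $matches = [];
    foreach ($files as $file) {
        if (metaphone($file) === $key) {
            $matches[] = $file;
        }
    }
    if ($matches) {
        return $matches;
    }

    // No phonetic match: rank every file by its similar_text() percentage.
    $scores = [];
    foreach ($files as $file) {
        similar_text($missing, $file, $percent);
        $scores[$file] = $percent;
    }
    arsort($scores);

    return array_keys($scores);
}

print_r(find_candidates('extrimelylongpagename', ['index', 'extremelylongpagename']));
```

Because metaphone() ignores vowels after the first letter, the e/i typo above still produces the same key, while a transposed consonant would fall through to the similar_text() ranking.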
Now that you have a list of files sorted by their similarity to the requested file, you can redirect the user to the matching pages. To prevent confusion, you would only want to redirect the user transparently if the matching file has a high match percentage to the requested file. This information can be retrieved by passing a third argument to similar_text(), which will cause it to return the similarity in the form of a percentage. If the percentage is 90% or higher, you can probably safely redirect the user directly to the file. Otherwise, you can present the user with a list of all matching files whose match percentage is 80% or higher. The minimum percentage should depend on the length of the filename: the longer the filename, the higher the limit, since when there are more letters, each mismatched letter has a smaller impact. For shorter filenames with only a few characters, each letter carries a greater weight, so you need to allow for lower match percentages. At the same time, you probably would want to avoid automatic redirections if many of your files have similar names, to prevent sending the user to the wrong page. Instead, I’d recommend offering a list of possibilities and displaying meta information about each of them to help the user decide which URL they should pick.
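As a sketch of the decision rule just described, the following uses the 90% and 80% figures from the text as defaults. The choose_action() helper and its return format are hypothetical, and it redirects only when exactly one file clears the list threshold, which is one possible reading of the advice about similarly named files.

```php
<?php
// Sketch only: decide between a silent redirect, a list of suggestions,
// or nothing, based on similar_text() percentages.
function choose_action(
    string $missing,
    array $files,
    float $redirectAt = 90.0,
    float $listAt = 80.0
): array {
    $scores = [];
    foreach ($files as $file) {
        similar_text($missing, $file, $percent);
        if ($percent >= $listAt) {
            $scores[$file] = $percent;
        }
    }
    arsort($scores);
    $names = array_keys($scores);

    // Redirect only when there is a single, very close candidate.
    if (count($names) === 1 && $scores[$names[0]] >= $redirectAt) {
        return ['redirect', $names[0]];
    }

    return $names ? ['list', $names] : ['none', []];
}

print_r(choose_action('instalation.php', ['installation.php', 'index.php']));
```

The thresholds are parameters so that they can be lowered for short file names, as suggested above.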
Four-oh-four No More

If you add the preventative measures during site design to an intelligent 404 handler, a majority, if not all, of the 404 errors that occur on your site will be a thing of the past. The few that do slip through the cracks will be promptly detected and come with sufficient information to allow a rapid resolution of the problem. That said, an ounce of prevention is worth a pound of cure, so when designing a site always try to consider the possibility of typos and wrapping of URLs by various utilities, and take the required steps to minimize the risk by keeping your URLs short and your query strings simple.
About the Author
Ilia Alshanetsky is an active member of the PHP development team and is the current release manager of PHP 4.3.X. Ilia is also the principal developer of FUDforum (http://fud.prohost.org/forum/), an open source bulletin board and a contributor to several other projects.
To Discuss this article: http://forums.phparch.com/136
Meet Your Match: Advanced Regular Expressions by George Schlossnagle
Increase your regex mojo with techniques to create more robust and powerful regular expressions with the PCRE extension.
Perl Compatible Regular Expressions (PCREs) are a powerful tool for text processing in PHP. While basic usage of regular expressions is relatively straightforward, there are a number of advanced regular expression features that are both poorly documented and syntactically obscure. In this article, you will see how to write maintainable and documented regular expressions (regexes), how to alter basic regex behavior and how to perform lookaround assertions.
Canceling the Noise: Creating Readable Regexes
A frequent (and valid) complaint about regular expressions is that they are concise to the point of obtuseness and difficult to read. Their highly concise syntax is largely responsible for Perl code being described as ‘line noise’. As with all code, it’s important to be able to document what a regular expression does so that as it becomes more complex it can still be modified, debugged and maintained. The principal difficulty with documenting regexes is that whitespace has semantic meaning, so they are generally required to be one long string. To allow for inline documentation, PCREs support the x (or extended regex) pattern modifier, which allows you to add whitespace and comments to your regular expressions. You can take a regular expression like the following (which performs weak RFC 822 email validation):
    $regex = '/(\S*?)@((?:[\w-]+\.)+\w+)/';

and add comments to it as follows:

    $regex = '/
        (\S*?)            # capture 1, (email localpart)
        @
        (                 # begin capture 2, the domain
          (?:[\w-]+\.)+   # match one or more subdomains
          \w+             # match the TLD (com|net|etc.)
        )                 # end capture 2
    /x';

Notice that whitespace no longer has any semantic meaning and everything following an octothorpe (#) is a comment. If you want to have literal whitespace or octothorpes in your pattern, you will need to either escape them or use their hexadecimal ASCII representations. The preceding example was a bit basic, but to truly convince you of the usefulness of inline comments, consider the following regex, which performs the same URL decomposition as the PHP function parse_url:

    $regex = '!^(?:([^:/?#]+):)?(?://(?:([^@:/]+):([^:/@]+)@)?([^/:?#]*)(?::(\d+))?)?([^?#]*)(?:\\?([^#]*))?(?:#(.*))?!';
REQUIREMENTS PHP: ANY OS: Any Applications: N/A Code Directory: regex
Deciphering that without comments is nearly impossible. Contrast that with a fully commented version, which you can see in Listing 1. Although the commented version is quite long, the use of meaningful indentation and clear comments helps to mentally decompose the regex.

Another issue with long regexes is keeping track of which offsets in the match array the various sub-patterns occupy. While closely maintained inline comments can help maintain clarity, it is still easy to lose track of which offsets are which, especially if you are only a consumer of the regex (for instance, if its capture set is returned to you from a function). To help you keep track of your captures, PHP versions after 4.3.0 support naming sub-patterns so that they can be looked up by label in the capture array. To label a sub-pattern, you use the following syntax:

    $regex = '/(?P<everything>.*)/';

This associates the label everything with the sub-pattern (.*). To see this in a larger example, take a look at Listing 2, which shows the URL-parsing regex with labels added so that it returns a result set fully compatible with the built-in parse_url. As an added bonus, the labels serve as natural documentation. Now, if we apply the regex to a sample URL, we see the result set as an associative array with keys for the labels:

    $url = 'http://testuser:testpass@www.example.com/index.php?q=home';
    preg_match($regex, $url, $m);
    print_r($m);

    Array
    (
        [0] => http://testuser:testpass@www.example.com/index.php?q=home
        [scheme] => http
        [1] => http
        [user] => testuser
        [2] => testuser
        [pass] => testpass
        [3] => testpass
        [host] => www.example.com
        [4] => www.example.com
        [port] =>
        [5] =>
        [path] => /index.php
        [6] => /index.php
        [query] => q=home
        [7] => q=home
    )

The result set still contains the numeric offsets, but it also contains lookups by key, exactly mimicking the parse_url return API.

Listing 1: the fully commented URL regex (listing body not recovered)
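Neither listing body is available in this copy, so the following is a hedged sketch that combines both ideas, inline comments via the x modifier and labelled sub-patterns, using the one-line URL regex shown earlier as its starting point. The layout and the fragment label are my own; note that a literal # outside a character class must be escaped in extended mode.

```php
<?php
// Sketch only: the URL regex from the text, spread out with comments
// and labels. Not the author's Listing 1 or Listing 2.
$regex = '!^
    (?: (?P<scheme> [^:/?\#]+ ) : )?                          # optional scheme, e.g. http
    (?: //
        (?: (?P<user> [^@:/]+ ) : (?P<pass> [^:/@]+ ) @ )?    # optional user:pass@
        (?P<host> [^/:?\#]* )                                 # host name
        (?: : (?P<port> \d+ ) )?                              # optional :port
    )?
    (?P<path> [^?\#]* )                                       # path
    (?: \? (?P<query> [^\#]* ) )?                             # optional ?query
    (?: \# (?P<fragment> .* ) )?                              # optional fragment
!x';

$url = 'http://testuser:testpass@www.example.com/index.php?q=home';
preg_match($regex, $url, $m);

echo $m['host'], ' ', $m['path'], ' ', $m['query'], "\n";
// www.example.com /index.php q=home
```

Consumers of $m can now ignore the numeric offsets entirely, which also means that adding a new group later does not silently shift the indexes they rely on.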
returns the following:

    Array
    (
        [0] => Heading One
        [1] => Heading One
    )

However, even a slightly more complex example shows that the regex is flawed. If you change $text to have multiple tags as follows:

    $text = 'Heading OneFoo.Heading TwoBar.';

you will see the following result:

    Array
    (
        [0] => Heading OneFoo.Heading Two
        [1] => Heading OneFoo.Heading Two
    )

The problem here is that the capturing pattern (.*) is greedy. The matching logic of the pattern works roughly as shown in Figure 1. The solution is to make the regex non-greedy. As the name implies, this reverses the behavior of quantifiers. Instead of matching the longest possible subject, the quantifier will match the shortest possible subject. To convert a quantifier to non-greedy, you add a ? to the end of it. For example, in the current example, you would rewrite the regex as

    $regex = '!(.*?)!s';

Now the pattern-matching semantic is to match an open tag, then capture as little text as possible before matching a closing tag. See how this now behaves for extracting all the tag texts:

Executing that code returns:

    Array
    (
        [0] => Array
            (
                [0] => Heading One
                [1] => Heading Two
            )
        [1] => Array
            (
                [0] => Heading One
                [1] => Heading Two
            )
    )

Instead of aggressively matching to the end of the subject, the non-greedy quantifier advances a single character at a time, checking the subsequent match at every step.

Backtracking

To understand what backtracking is, you first need to understand how the regular expression engine actually handles quantifier matches. As a simple example, consider the pattern

    \d+aa
Consider how this behaves in a negative match situation, for example on the subject 123ab (brackets mark the text matched so far):

    [1]23ab     match one digit
    [12]3ab     match another digit
    [123]ab     match another digit
    [123a]b     match a
    [123a]b     fail to match second a, give up one digit
    [12]3ab     fail to match a, give up one more digit
    [1]23ab     fail to match a
    123ab       match failure
This act of giving up the greedy-matched subjects is called backtracking, since the regular expression engine is required to backtrack over its previous work to reevaluate the pattern. With your global view, you can see that if the match failed with \d+ matching 123, then it would similarly fail if \d+ only matched 1 or 12, but the regex engine lacks that insight. You, however, can instruct the regex engine not to backtrack on quantifiers by using possessive quantifiers. A possessive quantifier will never give up data for backtracking. To make a quantifier possessive, you append a + to it, as shown here:

    \d++aa
Now when \d++ matches 123, it will match as follows:

    [1]23ab     match one digit
    [12]3ab     match another digit
    [123]ab     match another digit
    [123a]b     match a
    123ab       failure
You can see that the backtracking is not performed at all. For many cases, this can result in significant performance improvements. A more flexible form of possessiveness is atomic grouping. An atomic grouping is a grouping modifier that disables backtracking inside that group. To convert the above possessive quantifier into an atomic grouping you use the following syntax:

    (?>\d+)aa
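A quick way to see the difference is a subject where backtracking is actually needed for a match. The sketch below uses my own patterns, not ones from the article, and assumes a PCRE version that supports possessive quantifiers.

```php
<?php
// Sketch only: \d+ must give back the final "5" for the plain pattern to match.
$subject = '12345';

$plain      = preg_match('/\d+5/', $subject);      // backtracks one digit, matches
$possessive = preg_match('/\d++5/', $subject);     // never gives digits back
$atomic     = preg_match('/(?>\d+)5/', $subject);  // same effect via an atomic group

var_dump($plain, $possessive, $atomic);
// int(1) int(0) int(0)
```

The same property that makes the last two patterns fail here is what saves work on subjects that could never match anyway.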
This example is rather contrived—but there are more practical applications of this feature that we can examine. In a wiki application, for example, you often want to find so-called StudlyCaps words for automatic linking to internal documents. StudlyCaps phrases are composed of concatenated words with the start letter capitalized. A regex to identify these is: ^([A-Z]?[a-z]+)+$
Figure 1: Greedy matching against the subject 'Heading OneFoo.Heading TwoBar.'

    1. The opening tag matches.
    2. (.*) matches all of the remaining text.
    3. (.*) gives up [.]
    4. (.*) gives up [r.]
    5. ...
    6. (.*) gives up [Bar.]
    7. The closing tag matches. Success!
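The greedy behaviour traced in Figure 1 is easy to reproduce. In the sketch below the h1 tag is an assumption, standing in for whatever markup the original example used.

```php
<?php
// Sketch only: greedy versus non-greedy capture between tags.
// <h1> is a stand-in; the tag in the original example is not known.
$text = '<h1>Heading One</h1>Foo.<h1>Heading Two</h1>Bar.';

preg_match('!<h1>(.*)</h1>!s', $text, $greedy);     // runs to the LAST closing tag
preg_match_all('!<h1>(.*?)</h1>!s', $text, $lazy);  // stops at the FIRST closing tag

echo $greedy[1], "\n";
// Heading One</h1>Foo.<h1>Heading Two
print_r($lazy[1]);
// Heading One, Heading Two
```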
Listing 2: the URL-parsing regex with labelled sub-patterns (listing body not recovered)
Look Up, Look Down, Look Allaround

Most regex users are familiar with the regex metacharacters \b and \B, which describe a ‘word boundary’ and ‘non-word boundary’ respectively. Both of these are known as zero-width assertions, since they don’t actually match any characters, but instead describe a boundary between characters. In your own regexes, you can create four kinds of lookarounds:

• zero-width positive look-ahead
• zero-width negative look-ahead
• zero-width negative look-behind
• zero-width positive look-behind

As you will see below, assertions are useful for many things, but since they do not appear at all in the match results, they are particularly useful where the match results are used directly (for example in preg_split or preg_replace).

Positive Look-Ahead (?= )

A positive look-ahead allows you to assert that the following sub-pattern occurs, without including it in the match itself. The easiest illustration of this is the regex “match foo followed by bar, without capturing bar in the match”:

    foo(?=bar)

A more complex example is adding commas to numbers. To perform this via a regular expression, you want to match as few digits as possible such that the number of remaining digits is a multiple of 3, then insert a comma between the two parts. Naively, you might assume that a regex which matches both parts, the leading digits and the groups of three that follow them, will do the trick. The problem with such a regular expression is that it only performs the first substitution, since the pattern successfully matches the entire text. A positive look-ahead assertion solves this problem by causing the right-hand part not to be included in the match, which causes a natural iterative action given the global replacement behavior of preg_replace. Here is the modified regex:

    $regex = '/(\d+?)(?=(\d{3})+$)/';

Now the right-hand side of the match (the remaining digits to the right of the rightmost comma) is not matched by the pattern, but instead simply asserted to be there.

Negative Look-Ahead (?! )

It’s important to remember that all of these assertions are zero-width. The canonical mistake involving negative look-aheads is using this regex:

    (?!foo)bar

to avoid foobar. Unfortunately this regex is actually just equivalent to:

    bar

That’s because it first asserts that the next three characters are not foo, then requires that those same three characters are bar, which, of course, matches foobar starting at the b. To correctly handle this case, you will want to use a negative look-behind, which inspects the previous characters instead of the following ones.

Negative Look-Behind

A negative look-behind assertion guarantees that the text immediately preceding a sub-pattern does not match a given pattern. Continuing the example above of identifying “bar not preceded by foo”, you can correctly write it as:

    (?<!foo)bar

[...]

    array(3) {
      [0]=> string(25) "This is a test paragraph."
      [1]=> string(22) "It contains sentences."
      [2]=> string(10) "Cool, huh?"
    }

As with the negative look-behinds, positive look-behind patterns must be of fixed length.
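The three kinds of assertion discussed above can be tried out with a short script. The comma regex is the one given in the text; the sentence-splitting pattern and all function names are my own assumptions, chosen so that the split yields the three sentences shown in the surviving output.

```php
<?php
// Sketch only: lookarounds in preg_replace, preg_match and preg_split.

// Positive look-ahead: the digits to the right are asserted, not consumed,
// so preg_replace can keep matching further along the number.
function add_commas(string $digits): string
{
    return preg_replace('/(\d+?)(?=(\d{3})+$)/', '$1,', $digits);
}

// Negative look-behind: "bar" that is not immediately preceded by "foo".
function has_bar_without_foo(string $text): bool
{
    return preg_match('/(?<!foo)bar/', $text) === 1;
}

// Positive look-behind (assumed pattern): split on whitespace that
// follows sentence-ending punctuation, keeping the punctuation.
function split_sentences(string $text): array
{
    return preg_split('/(?<=[.?!])\s+/', $text);
}

echo add_commas('1234567'), "\n";               // 1,234,567
var_dump(has_bar_without_foo('foobar'));        // bool(false)
var_dump(preg_match('/(?!foo)bar/', 'foobar')); // int(1), the canonical mistake
print_r(split_sentences('This is a test paragraph. It contains sentences. Cool, huh?'));
```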
Conditional Sub-Patterns
One of my personal favorite (and lesser-known) regex constructs is the conditional sub-pattern. Conditional sub-patterns allow you to conditionally match a pattern based on your current match state. Conditional sub-patterns can look like either

    (?(condition)yes-pattern)

or

    (?(condition)yes-pattern|no-pattern)

If condition is a number, the conditional checks whether the sub-pattern with that index has captured anything. If it has, yes-pattern is evaluated; if it has not and the optional no-pattern was specified, then no-pattern is evaluated instead. The standard example for this is capturing a phone number area code that may or may not be in parentheses. Without conditional sub-patterns, you must actually use two sub-patterns in an alternation and determine which one matched:

    $regex = '/(?:
        \((\d{3})\)    # match 1 - if the area code is in ()s
      |
        (\d{3})        # match 2 - if it is not
    )/x';
    if (preg_match($regex, $phone, $m)) {
        $area_code = $m[1] ? $m[1] : $m[2];
    }
This is awkward. With a conditional sub-pattern, the logic is much cleaner:

    $regex = '/
        (\()?          # capture 1 : the optional open paren
        (\d{3})        # match the digits
        (?(1) \) )     # if capture 1 matched, require the closing paren
    /x';
    if (preg_match($regex, $phone, $m)) {
        $area_code = $m[2];
    }
Now the end result is only stored in one place, with no looking around for it. Note that since $m[1] is used for tracking whether a leading ( was matched or not, the actual results are stored in $m[2]. A more complex example is extracting all the fields from a comma-separated string, allowing for some of the strings to be contained in double quotes. The logic path you can take for this goes as follows:

• Try to match a starting "
• If that match succeeds, capture everything up to the next ", and require that ending "
• Otherwise, match up to the next comma, or to the end of the string.
To do this, we can use a conditional regex with both a yes- and a no-pattern. Here’s the regex:

    $regex = '/
        (")?              # capture 1: the optional opening quote
        (                 # capture 2: the data
          (?(1)           # IF capture 1 matched a quote
            [^"]*+        # THEN match as many non-quotes as possible
            |
            [^,]*+        # ELSE match up to the next comma
          )               # ENDIF
        )                 # end capture 2
        (?(1)")           # IF capture 1 matched, then require an end quote
        (?:,|$)           # match either a comma or the end of the string
    /x';
    preg_match_all($regex, $data, $m);
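Both conditional patterns can be exercised with a few lines of PHP. This is a sketch under my own assumptions: the sample inputs and variable names are invented, the field terminator is written as (?:,|$) because a $ inside a character class is a literal dollar sign, and the clean-up of a possible empty match at the very end of the subject is my addition.

```php
<?php
// Sketch only: conditional sub-patterns for area codes and CSV-like fields.
$areaCode = '/ (\()? (\d{3}) (?(1) \) ) /x';

preg_match($areaCode, '(416) 555-1212', $a);
preg_match($areaCode, '416-555-1212', $b);
echo $a[2], ' ', $b[2], "\n";   // 416 416

$csv = '/
    (")?                          # capture 1: optional opening quote
    ( (?(1) [^"]*+ | [^,]*+ ) )   # capture 2: quoted or unquoted data
    (?(1)")                       # closing quote, only if one was opened
    (?:,|$)                       # field ends at a comma or at the end
/x';

$data = 'a,"b,c",d';
preg_match_all($csv, $data, $m);
$fields = $m[2];

// The pattern can also match the empty string at the very end of the
// subject; drop that extra field unless the data really ends in a comma.
if ($fields && end($fields) === '' && substr($data, -1) !== ',') {
    array_pop($fields);
}
print_r($fields);   // a / b,c / d
```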
As above, $m[1] is wasted on capturing the leading quote, so all of your values will be stored in $m[2].

Wrapping it Up
Regular expressions are a surprisingly deep field: what seems like a basic tool is actually rich and nuanced. Hopefully this article has helped you gain insight into some of the more advanced features of regular expressions and expanded their possible uses in your own applications. Experiment with the techniques you’ve learned here and see how you can improve the elegance, performance and maintainability of your own regexes. In the end, though, the key to mastering any language is consistent ongoing practice. Mastering regular expressions is no different. When you have a complex text parsing task, ask yourself “can I solve this with a single elegant regex?” Even if it’s not the solution you end up using, strengthening your regex techniques will teach you how to spot the situations where a positive look-ahead assertion or an atomic grouping is the right choice.
About the Author
George Schlossnagle is a Principal at OmniTI Computer Consulting, a Maryland-based tech company specializing in high-volume web and email systems. Before joining OmniTI, George led technical operations at several high-profile community web sites where he developed experience managing PHP in very large enterprise environments. George is a frequent contributor to the PHP community. His work can be found in the PHP core, as well as in the PEAR and PECL extension repositories. Before entering into information technology, George trained to be a mathematician and served a 2 year stint as a teacher in the Peace Corps. His experience has taught him to value an inter-disciplinary approach to problem solving that favors root-cause analysis of problems over simply addressing symptoms.
To Discuss this article: http://forums.phparch.com/137
A Trip Down PDF Lane by Marco Tabini
PDF files are notoriously complex to decipher. Or are they? This article tries to sort through the 1,000-plus page Adobe PDF documentation to provide you with the skinny on how a PDF file is structured and how you can go about reading or modifying it.
A few years ago, a friend of mine and I were working on a project that required, among other things, the ability to modify the contents of a number of PDF files and send them back to the end user. I remember downloading the PDF specifications manual from the Adobe website, taking a quick glance at it and immediately moving on to looking for another solution. Given the time constraints we were under, understanding how the format worked and making the release date of our product seemed to be two wildly orthogonal concepts. Fast forward a few years to October of 2002, and you’ll find me, along with the rest of the php|a team, labouring to get our first issue out the door. The spectre of having to manipulate PDF files, in this case to “personalize” each issue of the magazine with its licensee’s name and e-mail address, presented itself once again in all its ugliness, except that, this time, there really wasn’t another solution.

Understanding PDF
As you can imagine, I now have a much better understanding of how PDF works, and over time I’ve come to appreciate this format, which I once considered clumsy and poorly designed. Don’t get me wrong—it is clumsy and poorly designed, but only if one doesn’t take into consideration what the business goals behind its creation were. Let’s look at them quickly:
• A PDF file is, first of all, typographically accurate. Unlike HTML, whose primary goal is to ensure that a document can be displayed on any platform and at all possible resolutions in a way that best adapts itself to the platform’s capabilities, the goal of PDF is to ensure that a document will be reproduced in exactly the same way on any output device, including screens and printers.
• A PDF file is a collection of an arbitrary number of pages. As such, and particularly because of the possibility for high-resolution images, it can grow to be extremely large. Nonetheless, one of the design goals is to ensure that a viewer will be able to promptly access any part of the document without having to scan through its entire data stream.
• A PDF file must be modifiable without requiring a complete rewrite of the entire file.
REQUIREMENTS PHP: Any OS: Any Other Software: N/A Code Directory: pdf-lane
This may seem trivial, but once you’ve decided that you need structures that allow you to randomly access every single item in the document, be it a page, a text string or an image, making arbitrary changes to a document without affecting these structures and the remainder of the content becomes very challenging. Once one manages to figure out the impact that these design decisions have on PDF’s flexibility and capabilities, all claims of clumsiness and poor design are immediately dispelled and the format specification becomes nothing short of ingenious. The immense level of complexity remains, however, and a whole cottage industry has grown around the relatively simple concept of creating and altering documents rendered in PDF format. This is particularly true if you want to write or modify a PDF file programmatically from your applications: libraries for this purpose vary in cost from just over a hundred dollars to several thousand. Still, it’s not quite impossible to understand how a PDF file works. The real complexity lies in the fact that the PDF specification is written as a reference rather than a tutorial, so that one finds oneself continuously jumping around and trying to understand how a file is structured in a practical, rather than abstract, way. The goal of this article is to provide you with a guided tour of how the basics of a PDF file work by walking you through its structure with plenty of examples based on actual documents.

PDF Data Types
Let’s start with the basic data types that a PDF document can contain. There are quite a few of them; some are primitive, while others are composite. “Primitive” data types (my nomenclature, not official PDF-speak) are essential building blocks, that is, elements that are not created by combining pre-existing data types. The simplest primitives are the null object, identified by the keyword null, and the Boolean data type, which, just like in PHP, can assume the values true and false. Unlike PHP, PDF does not make any distinction between different types of numeric data: generally speaking, any number can be interpreted as floating-point. Numeric values are indicated by a string of characters composed of the digits from zero to nine and, optionally, by a minus prefix to indicate negative numbers and a decimal point to indicate fractional values. The PDF specification also includes two different types of strings: literal and hexadecimal. A literal string is the same as the strings we are all familiar with in PHP, except that, instead of using single or double quotes as delimiters, PDF uses parentheses. For example, the string “This is your lucky day\n”
is represented in PDF as (This is your lucky day\n)
Note that the \n special character remains the same both in PHP and PDF—in fact, the two languages support the same escape sequences, with the exception of single and double quotes, which do not need escaping in PDF, and parentheses, which, of course, need to be escaped in a PDF document. In fact, the escaping rules are a bit more complicated than that. The simplest way to escape a string that contains parentheses is to simply prefix each of them with a backslash: (This is a string \(just in case\))
However, if your string contains a balanced number of parentheses in the proper sequence (that is, if you have a number of open parentheses followed by an equal number of closed parentheses so that, at any position in the string, the number of closed parentheses is never higher than the number of open ones), you can avoid escaping them. Thus the previous string could have been written as follows: (This is a string (just in case))
As you can imagine, this makes parsing a PDF file all that much more fun. However, in practical terms I have never seen a balanced string that takes advantage of this feature—in all the files I’ve ever seen, all parentheses inside a string are always escaped. It is also possible to write a literary string over multiple lines by using a single slash before the end of each line: (This \ is a string \(\ just in case\))
Clearly, this notation is useful only for human readability purposes—and chances are that a human being will never be reading through a PDF file. Thus, from a practical perspective you will rarely see this notation used in a real-life PDF file. Hexadecimal strings are collections of binary data expressed as literary strings in hexadecimal format. They are delimited by angular brackets (< and >—the same ones you use to delimit your HTML tags) and show up in the stream like so:
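To make the two string notations concrete, here is a minimal PHP sketch that decodes the body of a literal string and a complete hexadecimal string. The function names pdfDecodeLiteral() and pdfDecodeHex() are hypothetical (they do not belong to any library), the sample hexadecimal string is made up for illustration, and the sketch assumes that a tokenizer has already isolated the string; octal escapes are left out.

```php
<?php
// Hypothetical helper: decode the text found between the outer parentheses
// of a PDF literal string. Octal escapes (\ddd) are deliberately not handled.
function pdfDecodeLiteral(string $body): string
{
    $map = ['n' => "\n", 'r' => "\r", 't' => "\t", 'b' => "\x08", 'f' => "\x0C",
            '(' => '(', ')' => ')', '\\' => '\\'];
    $out = '';
    $len = strlen($body);
    for ($i = 0; $i < $len; $i++) {
        if ($body[$i] !== '\\') {
            $out .= $body[$i];          // ordinary character, balanced parentheses included
            continue;
        }
        if (++$i >= $len) {
            break;                      // lone trailing backslash: nothing left to escape
        }
        $next = $body[$i];
        if ($next === "\r" || $next === "\n") {
            // Backslash at the end of a line: drop the backslash and the line break.
            if ($next === "\r" && $i + 1 < $len && $body[$i + 1] === "\n") {
                $i++;
            }
            continue;
        }
        $out .= $map[$next] ?? $next;   // unknown escape: keep the character, drop the backslash
    }
    return $out;
}

// Hypothetical helper: decode a hexadecimal string, angular brackets included.
function pdfDecodeHex(string $token): string
{
    $hex = preg_replace('/\s+/', '', trim($token, '<>'));
    if (strlen($hex) % 2 === 1) {
        $hex .= '0';                    // an odd number of digits is padded with a final 0
    }
    return (string) hex2bin($hex);
}

echo pdfDecodeLiteral('This is a string \(just in case\)'), "\n";
echo pdfDecodeHex('<48656C6C6F>'), "\n";
```

The first call prints the string with plain parentheses, and the second prints Hello. Finding where a literal string ends (which is where balanced parentheses become a nuisance) is a separate tokenizing problem that this sketch does not attempt.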
Literal and hexadecimal strings are interchangeable; in fact, the latter are often used to draw non-printable characters where the use of literal strings may be problematic.

The final primitive data type supported by PDF is called a named object. A named object is identified by a single token that starts with a slash and is followed by an arbitrary number of characters. For example, these are two valid named objects:

/Type
/Pages

Named objects are often used to identify particular items in a PDF dictionary, as we’ll see shortly.

It’s now time to move on to composite data types. The simplest of these is the array, which is a sequence of objects delimited by square brackets. For example, the following is an array of five numeric objects:

[10 15.2 123 11 -0.333]

As in PHP, arrays are heterogeneous, that is, they can contain data of arbitrary types mixed together:

[10 (A string) [123 1 (Another string)]]

In the previous example, the array contains a numeric value, a literal string and another array. The equivalent could be generated by the following PHP code:

$array = array(10, 'A string', array(123, 1, 'Another string'));

Note that arrays do not provide the possibility of having named elements (which in PHP can be achieved by writing something like $array['key'] = 'value'). Each element is simply assigned a sequential numeric key starting from zero. Named arrays do exist in PDF, however: they are called Dictionary Objects. A dictionary is delimited by double angular brackets (<< and >>) and contains a sequence of key/value pairs. The key of each element is always a named object. The value, on the other hand, can be of any type; it can, in fact, even be another dictionary.

The final composite data type that we will discuss is the Stream. As you can imagine, a stream represents a sequence of data; for example, it is often used as a container for image information or for the commands necessary to render the contents of a page. A stream object is composed of two parts: a dictionary and the data proper. The dictionary serves to provide information about the data stream, such as its length and whether it is compressed or encrypted in any way. The stream data is enclosed between the keywords stream and endstream. For example, here’s a simple plain-text stream that draws a line of text:

<< /Length 29 >>
stream
(This is a string of text) Tj
endstream
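Since stream data can contain arbitrary bytes, a reader is better off trusting /Length than scanning for the endstream keyword. The following PHP sketch shows the idea for the simplest case only: pdfReadSimpleStream() is a hypothetical name, the object text is typed in by hand, and a /Length given as an indirect reference is simply reported as unsupported.

```php
<?php
// Illustrative only: extract the data of a stream whose /Length is a direct number.
function pdfReadSimpleStream(string $object): ?string
{
    // Reject "/Length 2 0 R": the lookahead refuses a number that starts a reference.
    if (!preg_match('/\/Length\s+(\d+)\b(?!\s+\d+\s+R)/', $object, $m)) {
        return null;
    }
    $length = (int) $m[1];

    // The data begins right after the end-of-line that follows the "stream" keyword.
    if (!preg_match('/stream(\r\n|\n)/', $object, $s, PREG_OFFSET_CAPTURE)) {
        return null;
    }
    $start = $s[0][1] + strlen($s[0][0]);

    $data = substr($object, $start, $length);
    return strlen($data) === $length ? $data : null;
}

$object = "<< /Length 29 >>\nstream\n(This is a string of text) Tj\nendstream";
$data = pdfReadSimpleStream($object);
echo strlen($data), ': ', $data, "\n";
```

Run against the example stream, the sketch reports 29 bytes, which matches the value stored in the dictionary.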
In the stream above, the dictionary only contains the /Length key/value pair, which indicates that the data in the stream is twenty-nine bytes long. Because of the verbosity of PDF commands, however, you will rarely find streams this simple in a real-life file. Most of the time, the stream data is compressed using one of a number of different algorithms and is completely unreadable to the naked eye, although it is possible to write a relatively simple parser that is capable of “extracting” the actual data from the stream blob.

Most often, dictionaries have a key/value pair whose key is /Type and whose value is a named object that indicates the type of data the dictionary contains. For example, a page object has a /Type of /Page, and so on. The /Type key/value pair is usually present in those objects that, logically, could contain several different types of information, so that a reader can determine how to handle the data inside them properly. We’ll see an example of this later on, once we get into the structure of a PDF file.

Pure Evil: Indirect Objects and References
Now that we know what data types are supported by PDF, it’s time to move on to the real killer: indirect objects. An indirect object is an object that can be referenced from another object. The difference between a normal object and an indirect object is akin to the difference between a direct value and a variable: while you can only use a direct object where it is declared, an indirect object is declared separately and can be referenced from multiple places within the document. An indirect object is delimited by an object header and a footer:

144 0 obj
1322
endobj

The example above describes indirect object number 144, generation 0, which, in turn, contains a numeric object whose value is 1,322. The numeric ID associated with every indirect object uniquely identifies it within the document. Theoretically, the generation number is set to zero for all objects when a PDF file is created, and it should be changed only if you modify an object later on as part of an update to the document itself. In practical terms, all the updated PDF files I have seen simply create a new object with its own ID and use it to replace the old one; after all, it’s not easy to run out of object IDs.

Figure 1
xref
0 22
0000000000 65535 f
0000000017 00000 n
0000005769 00000 n
0000005796 00000 n
0000006443 00000 n
0000048094 00000 n
0000006469 00000 n
0000006576 00000 n
0000006683 00000 n
0000017252 00000 n
0000017278 00000 n
0000017537 00000 n
0000017884 00000 n
0000018130 00000 n
0000046635 00000 n
0000046662 00000 n
0000046912 00000 n
0000047485 00000 n
0000047970 00000 n
0000048027 00000 n
0000048249 00000 n
0000048309 00000 n

Indirect objects can be referenced using a structure called (with a stretch of the imagination) an indirect object reference, which looks like the following:

144 0 R

As you can imagine, the example above points to object 144, generation 0: that’s the object we just declared up there. Indirect objects are the lifeblood of a PDF document, and the bane of your existence as a developer of applications that manipulate PDF files. This is because any object, at any point of the document stream, can be substituted by a reference to an indirect object. Take, for example, the stream we declared in the previous section:

<< /Length 29 >>
stream
(This is a string of text) Tj
endstream
This object could have also been written as follows:

1 0 obj
<< /Length 2 0 R >>
stream
(This is a string of text) Tj
endstream
endobj

2 0 obj
29
endobj
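Resolving a reference like this one boils down to a table lookup that is repeated until a direct object comes out. Here is a minimal PHP sketch of that idea; the $objects table is written by hand rather than parsed from a file, the ['ref' => ...] representation of a reference is my own convention, and pdfResolve() is a hypothetical name.

```php
<?php
// Toy object table keyed by "number generation". A real reader would fill it
// by parsing the file; here it mirrors the two objects shown above.
$objects = [
    '1 0' => ['/Length' => ['ref' => '2 0']],
    '2 0' => 29,
];

// Follow indirect references until a direct object is reached.
function pdfResolve($value, array $objects, int $maxHops = 32)
{
    while (is_array($value) && isset($value['ref'])) {
        if ($maxHops-- <= 0) {
            throw new RuntimeException('Reference chain too long (possible loop)');
        }
        if (!array_key_exists($value['ref'], $objects)) {
            return null;    // a reference to a missing object behaves like the null object
        }
        $value = $objects[$value['ref']];
    }
    return $value;
}

echo pdfResolve($objects['1 0']['/Length'], $objects), "\n";
```

The call prints 29, the stream length stored in object (2, 0). The hop limit is only a safety net against a damaged file whose references point at each other.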
In this version, the value of the /Length parameter of the stream object is indicated by a reference to object 2, generation 0, which, in turn, contains nothing more than a direct numeric object whose value is 29. This particular arrangement with streams, which is extremely common within real-life PDF files, may look like a waste of valuable bytes, but it actually serves a very useful purpose. As I mentioned earlier, streams are often compressed, and it’s not always possible to predict how big a block of data will be once it has been compressed. With an indirect /Length, an application can start writing a stream object knowing that it will be able to record the size later in the file, once it has determined it, without having to go back and replace the value.

Why is this useful? Keep in mind that PDF can deal with some very large files. A full-page, letter-size (8.5” x 11”) image at 300 DPI in CMYK format, the resolution one commonly uses for 1:1 printing, is about 33.7MB uncompressed (2,550 x 3,300 pixels at four bytes each) and that, by far, is not the biggest file a PDF distiller will ever have to deal with. As you can imagine, therefore, a PDF creation utility must have the opportunity to process each stream in chunks, reading from and writing to the storage medium directly, rather than having to load all of it and process it in memory in one pass, or run the risk of rapidly depleting the memory available on the system it’s running on.

Thus, while indirect objects are extremely useful, they are also rather complicated to deal with, particularly if you’re modifying a PDF file. If you’re writing the file, you have complete control over what is happening and you can use objects as liberally as you feel appropriate. Similarly, if you’re writing a PDF reader, it’s not too difficult to write a function that resolves all the references until you end up with all the direct objects you need. If you want to change the contents of a PDF file, however, you’re in for a fun ride because, on top of resolving every reference, you also have to keep track of those references in all the new objects you create. Believe me, when you first start working on a PDF modification utility, this aspect of the format is quite startling.

Since we’ll be talking about a lot of objects throughout the rest of this article, let’s establish a simple convention (again, something out of my personal dictionary rather than official PDF lore). Instead of referring to an object as “object x generation y”, I’ll just say object (x, y). It’s shorter and easier to read. Also, from now on I will use a sample document, containing my editorial from the March issue of php|a, as the guinea pig for all our examples. In my opinion, it’s a lot easier to follow examples that are based on real-life data, rather than trying to explain things in a completely abstract way. You can find the file in this month’s code file under the name editorial.pdf.

Examining the Structure of a Document
We now have all the elements we need to understand the basic structure of a PDF file. As I mentioned earlier, a PDF file is a hierarchical collection of objects. How are they organized? Let’s take a look. At the top of the structure is the Root object. This is an object of /Type /Catalog and contains a few important pieces of information about the PDF file. At a minimum, it should contain a pointer to the pages that are part of the document. For example:

20 0 obj
<< /Type /Catalog /Pages 5 0 R >>
endobj

5 0 obj
<< /Type /Pages /Resources 19 0 R /MediaBox [ 0 0 595 842 ] /Kids [ 6 0 R 7 0 R ] /Count 2 >>
endobj

In this case, the Catalog is telling us that information about the pages is to be found in object (5, 0), which, in turn, contains a whole lot of data that we’ll look at in a moment. Note that, in this particular case, the objects are out of sequence; this is perfectly legal and, in fact, quite common. If you’re thinking that this makes parsing a PDF file a little more difficult (as if it weren’t difficult enough), you’ll see later on that this is only true if you don’t parse the file properly (which is what most people tend to do the first time around).

PDF Manipulation: Your Options

If you want to manipulate a PDF file from within PHP, you have a good number of different options. For example, if you want to create a new PDF file, you can use either the PDFLib extension (www.php.net/pdf) or the FPDF library (www.fpdf.org). The former is a C extension to the PHP language based on the PDFLib library; because it is written in C, it is faster than FPDF, which is written entirely in PHP. The downside is that, while FPDF is entirely free, PDFLib is a commercial product, and you’ll have to buy a license if you want to use it, and licenses cost $450 US per CPU.

If your goal is to modify a PDF file, then you may want to look into PDFLib+PDI, an extension to the PDFLib library that costs $900 US per CPU. You can also look at the applications sold by PDF-Tools, a Swiss company that makes an excellent series of PDF-manipulation tools. Finally, we at Mta (the company that publishes php|a) have also been working on a PDF manipulation library; at the time of this writing, it is still in beta, but it can be downloaded from my blog at blogs.phparch.com/s9y/archives/20040327.html

If you look at the /Pages dictionary, which is the next step in our hierarchical structure, you’ll notice that it provides a number of different elements. The /MediaBox array indicates the size of the physical device on which a default page is intended to be displayed. Smart viewing applications can use this data to determine the appropriate display method for the document, or whether displaying it is even possible. The /Resources key/value pair contains information about the typographical resources used by the default page. Let’s take a look at that object:

19 0 obj
<< ... >>
endobj
As you can see, this is a dictionary that has no /Type pair. In fact, the dictionary object itself could have been written directly into the /Resources pair of the /Pages dictionary above:

5 0 obj
<< /Type /Pages /Resources << ... >> /MediaBox [ 0 0 595 842 ] /Kids [ 6 0 R 7 0 R ] /Count 2 >>
endobj
In this case, however, the application that created the PDF file chose, for its own reasons, to use an indirect object instead. The /Resources dictionary, in this case, contains two different values. The /ProcSet pair indicates the type of PostScript commands that the page or pages contain. This parameter, which only really matters when the PDF file is sent over to a PostScript printer, is now considered obsolete, but the PDF specification recommends that it be included anyway for backwards compatibility. The /Font pair indicates an arbitrary number of font resources that a specific page, or group of pages, uses:

18 0 obj
<< ... >>
endobj
As you can see, the font specification is nothing more than a dictionary that contains a key/value pair for every font used by the application. If you look at the PDF file with a viewer, however, you’ll notice that the document actually only contains one font: Times New Roman. Why are there two font entries? Simple: if you look closely, you’ll notice that some of the text in my editorial is typeset in italics; from a word processor’s point of view, this is just a logical concept that does not necessarily have anything to do with the shape of the font itself, but not so from a typographical perspective.

Let me give another example. The text you’re reading now is typeset in a font that comes from a family called “Garamond”. Now look at the difference between the font in its regular form (“example”) and in its italic form (“example”). As you can see, the italic version of the text looks entirely different from its regular counterpart, almost like... a different font altogether. In fact, part of the problem is that we often confuse “italic” with “slanted”; in reality these are two completely different things. If you follow the references in the font dictionary, therefore, you’ll notice that the document does need two fonts:

17 0 obj
<< ... >>
endobj

12 0 obj
<< ... >>
endobj
We don’t have much of an opportunity to get into the intricacies of how font descriptors work here (there is, after all, a reason why the PDF specification document is over 1,000 pages long), but even by just glancing at them you should be able to make out the basics: each of these two objects describes a TrueType font, the subset of its symbols that is available in the document, the width of each character and a pointer to the font data. Even though embedding a font inside a PDF file is very useful, you can also take advantage of a number of built-in fonts that every PDF viewer must be able to display without their glyphs being embedded in a file.

Let’s now go back to our /Pages dictionary; as you can see, it’s easy to get lost inside the maze of objects that constitute a PDF file (and, trust me, we could have gone much deeper into the file before reaching the bottom!). The last two pairs that we have to look at are /Kids and /Count. The former is an array that contains pointers to the page objects that depend on the /Pages dictionary, while the latter provides the total number of page objects found below the dictionary (which matches the size of the array only when every child is an actual page). Now, here’s the kink: each “page” to which the /Kids array points could actually turn out to be another /Pages dictionary, and that, in turn, could point to a mix of page objects and other /Pages dictionaries, and so on. This particular arrangement exists to allow PDF writing applications to manage the internal structure of very large documents comfortably, but can make reading the page structure of a PDF file challenging, to say the least. Moving on, let’s take a look at one of the children of our dictionary:

6 0 obj
<< ... >>
endobj
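Because a child of /Kids may be either a page or another /Pages dictionary, collecting the pages of a document in order is naturally recursive. The PHP sketch below assumes that the objects have already been parsed and that references have been reduced to "number generation" keys; the table is invented for illustration (it is not the structure of editorial.pdf) and pdfCollectPages() is a hypothetical name.

```php
<?php
// Invented page tree: object (5, 0) has one direct page and one nested /Pages node.
$objects = [
    '5 0'  => ['/Type' => '/Pages', '/Kids' => ['6 0', '40 0'], '/Count' => 3],
    '6 0'  => ['/Type' => '/Page'],
    '40 0' => ['/Type' => '/Pages', '/Kids' => ['41 0', '42 0'], '/Count' => 2],
    '41 0' => ['/Type' => '/Page'],
    '42 0' => ['/Type' => '/Page'],
];

// Return the IDs of all page objects below $id, in document order.
function pdfCollectPages(string $id, array $objects, array &$seen = []): array
{
    if (isset($seen[$id]) || !isset($objects[$id])) {
        return [];                       // already visited (loop) or missing object
    }
    $seen[$id] = true;
    $node = $objects[$id];
    if (($node['/Type'] ?? null) === '/Page') {
        return [$id];                    // a leaf: an actual page
    }
    $pages = [];
    foreach ($node['/Kids'] ?? [] as $kid) {
        foreach (pdfCollectPages($kid, $objects, $seen) as $page) {
            $pages[] = $page;
        }
    }
    return $pages;
}

echo implode(', ', pdfCollectPages('5 0', $objects)), "\n";
```

For this invented tree the three pages come out as 6 0, 41 0, 42 0, and their number agrees with the /Count of the top-level node. The $seen array is there only to stop a malformed file from sending the function around in circles.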
Object (6, 0) turns out to be a /Page dictionary, which specifies an actual page in the document. Notice that the /MediaBox array is present here again, while the /Resources dictionary isn’t. They are both inheritable objects: if they are not available, they are inherited from the parent object (which also explains why there is a /Parent reference, so that you can go back up the hierarchy if needed). This leaves us with the /Contents object, which points to the indirect object (1, 0):

1 0 obj
<< /Length 2 0 R /Filter /FlateDecode >>
stream
[stream data]
endstream
endobj

2 0 obj
5667
endobj

We’ve finally reached the bottom of the chain: this stream contains the actual commands that will cause a PDF viewer to render the page. Note how the /Length pair uses an indirect reference in the stream dictionary. The /Filter pair is used to indicate that the stream data is encoded using a particular algorithm; the /FlateDecode value in this case tells us that the stream has been compressed using the zlib library. If you uncompress the stream, you’ll find that it contains a series of seemingly unreadable commands. Here’s an excerpt:

q 0 -0.1 612.1 792.1 re W* n q 0 0 0 rg BT 90.1 709.3 Td /F1 12 Tf Tj ET

How Do You Find the Catalog?

This is all great, but how does one find the Catalog, or any other object in a PDF file, for that matter? As I mentioned at the beginning of the article, a PDF file contains the necessary facilities for randomly accessing every object in the stream. These are called the cross-reference table and the trailer. The most current trailer is always found at the end of the file. I say “the most current” here because, if a document is altered after its creation, any new objects, together with a new cross-reference table and trailer, are added at the end of the file. Now, if you look at the last three lines in our editorial.pdf file, you’ll see the following:
startxref
48475
%%EOF

The last line (%%EOF) is the end-of-file marker. Note that the file does not end with it: there is a newline character right after. The first line indicates that the next line will provide us with the byte offset of the actual cross-reference table inside the file; in this case, the table starts at offset 48,475. If we move there, we’ll find the text shown in Figure 1. This looks rather abstract, but is really easy to interpret. The first line (xref) tells us that the cross-reference table is about to begin. The next one (0 22) tells us that this portion of the cross-reference table contains information about 22 objects, starting from object 0. The subsequent twenty-two lines contain the position of each object in the stream, the generation number and a marker that indicates whether the object is in use (n) or not (f). For example, the second of these lines contains the following information:

0000000017 00000 n

This means that object (1, 0) starts at position 17 and is in use. Note that the offset and generation number are always 10 and 5 digits long, and should be padded with zeros as necessary. Together with a Windows-style newline (\r\n), every line in the cross-reference table contains exactly 20 bytes, which makes for very easy parsing.

At the end of the cross-reference table, you can find one of two things: the trailer or another portion of the cross-reference table. How is this possible? Well, consider the situation in which you are making an update to a file, and you only changed, say, objects (2, 0) and (15, 0). If you remember from earlier in the article, as part of your update exercise you’ll also have to add a new cross-reference table. Now, since the entries in a cross-reference table have to be consecutive, you might think you are forced to create a new one with all the objects between 2 and 15, and that also means that you have to know where all the other objects are, right? Well, not exactly. In fact, you could write the cross-reference table as follows (feel free to ignore the actual entries; they are here for illustration purposes only):

2 1
0000000010 00001 n
15 1
0000001234 00003 n
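The layout described above is regular enough that a few lines of PHP can turn it into a lookup table. This is a line-based sketch under a couple of assumptions: it receives the text of the table starting at the xref keyword, it matches entries with a regular expression instead of reading fixed 20-byte records, and it does not check the count given in each header. The name pdfParseXref() is hypothetical.

```php
<?php
// Parse a cross-reference table into
// [objectNumber => ['offset' => int, 'gen' => int, 'inUse' => bool]].
function pdfParseXref(string $text): array
{
    $lines = preg_split('/\r\n|\r|\n/', trim($text));
    if (array_shift($lines) !== 'xref') {
        throw new InvalidArgumentException('Not a cross-reference table');
    }
    $table = [];
    $next = 0;                                   // object number of the next entry
    foreach ($lines as $line) {
        $line = trim($line);
        if (preg_match('/^(\d+) (\d+)$/', $line, $m)) {
            $next = (int) $m[1];                 // header: first object number (count ignored)
        } elseif (preg_match('/^(\d{10}) (\d{5}) ([nf])$/', $line, $m)) {
            $table[$next++] = [
                'offset' => (int) $m[1],
                'gen'    => (int) $m[2],
                'inUse'  => $m[3] === 'n',
            ];
        } else {
            break;                               // "trailer" or anything else ends the table
        }
    }
    return $table;
}

$table = pdfParseXref("xref\n2 1\n0000000010 00001 n\r\n15 1\n0000001234 00003 n\r\ntrailer");
print_r(array_keys($table));
```

Fed the two-part table from the example, the function returns entries for objects 2 and 15 only, which is exactly the saving that splitting the table is meant to achieve.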
With the table split in this way, the reader will parse through the first part of the cross-reference table and then, because it won’t find a marker for the trailer, it will identify the second part and parse that as well. This way, only two entries are created, thus saving a whole lot of disk space and computing time that would otherwise be required to recreate all the entries. The trailer itself is nothing more than a dictionary prefixed by the keyword trailer:

trailer
<< ... >>

The /Size pair indicates that, in total, the file contains twenty-two active objects, while the /Info pair points to an object that provides some information about the file. The /Root pair, finally, points to our catalog, so that we can now finally find it without having to parse the entire file! Now, the trailer dictionary can contain a few additional items. For example, if the file has been altered, a /Prev pair will provide you with the address of the previous cross-reference table. And, if our file were encrypted, an /Encrypt pair would point us in the direction of an indirect object with all the information needed to authenticate the user and decrypt the document’s contents.

Next Time: PDF Parsing Techniques

Although we’ll discuss the actual techniques that make it easy to parse and modify a PDF file from PHP in next month’s issue (even the publisher of a magazine can take up only so much space), here’s a quick overview of how the algorithm could work:

1. Start by finding the end-of-file marker
2. Get the address of the cross-reference table
3. Read the cross-reference table
4. Look for the trailer marker. If it is not found, continue reading the cross-reference table (go to step 3)
5. Read the trailer
6. If a previous cross-reference table is present, move the file pointer to it and go back to step 3

From this point on, accessing every object is a simple matter, relatively speaking, since there are plenty of other gotchas to keep in mind.

%%EOF

Well, this is the end of the line for this article. I hope that this quick walk-through of how a PDF file is structured will help you appreciate that, as complicated as the format is, it’s not that difficult to understand. In the next issue, we’ll delve a bit deeper into how PDF files can be manipulated directly through PHP, not by using one of the many available libraries to create one, but by using the language itself to interpret a file’s contents.
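As a small preview of the overview given under “Next Time”, the skeleton of that loop (find the newest cross-reference table through startxref, then follow /Prev to the older ones) might look like the sketch below. Everything here is an assumption made for illustration: the function names are hypothetical, the trailer is searched with plain string functions rather than a real dictionary parser, and the two-revision “file” is a toy built in memory that is not a valid PDF document.

```php
<?php
// Return the offset that follows the last "startxref" keyword, or null.
function pdfFindStartXref(string $pdf): ?int
{
    $pos = strrpos($pdf, 'startxref');
    if ($pos === false) {
        return null;
    }
    return preg_match('/\Gstartxref\s+(\d+)\s+%%EOF/', $pdf, $m, 0, $pos) ? (int) $m[1] : null;
}

// List the offsets of all cross-reference tables, newest first, via /Prev.
function pdfXrefOffsets(string $pdf): array
{
    $offsets = [];
    $offset = pdfFindStartXref($pdf);
    while ($offset !== null && $offset < strlen($pdf) && !in_array($offset, $offsets, true)) {
        $offsets[] = $offset;
        $start = strpos($pdf, 'trailer', $offset);
        $end = $start === false ? false : strpos($pdf, '>>', $start);
        if ($end === false) {
            break;                       // no trailer dictionary after this table
        }
        $dict = substr($pdf, $start, $end - $start);
        $offset = preg_match('/\/Prev\s+(\d+)/', $dict, $m) ? (int) $m[1] : null;
    }
    return $offsets;
}

// Toy file, first revision: one object, one table, one trailer.
$v1 = "%PDF-1.4\n1 0 obj\n(first)\nendobj\n";
$xref1 = strlen($v1);
$v1 .= "xref\n0 2\n0000000000 65535 f\r\n0000000009 00000 n\r\n"
     . "trailer\n<< /Size 2 >>\nstartxref\n$xref1\n%%EOF\n";

// Second revision: a replacement object plus a new table and trailer, appended.
$newObject = strlen($v1);
$v2 = $v1 . "1 0 obj\n(second)\nendobj\n";
$xref2 = strlen($v2);
$v2 .= "xref\n1 1\n" . sprintf("%010d 00000 n\r\n", $newObject)
     . "trailer\n<< /Size 2 /Prev $xref1 >>\nstartxref\n$xref2\n%%EOF\n";

print_r(pdfXrefOffsets($v2));
```

The result lists the offset of the appended table first and the original one second, which is the order in which a reader would want to apply them so that newer entries win. A trailer containing a nested dictionary would defeat the simple search for >>, which is one of the gotchas a real parser has to handle.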
About the Author

Marco is the Publisher of (and a frequent contributor to) php|architect. When not posting bogus bugs to the PHP website just for the heck of it, he can be found trying to hack his computer into submission. You can write to him at [email protected].

To Discuss this article: http://forums.phparch.com/138
Lumenation Report Builder

PRODUCT REVIEW
www.lumensoftware.com
by Eddie Peloke

In September, we brought you a review of Lumenation and Light Bulb. Lumenation, if you remember, is an enterprise framework written in PHP which provides, through the Light Bulb SDK, an application suite consisting of:

• User Management
• User Application access management
• User Data Access Management
• Application GUI Interface
• Database Access Management
• User Report Generator
• EzHelp Builder
This month, I had the pleasure of using an application within the SDK: the Report Builder. If this application is any indication of the rest of the Lumen development suite, they have something cool. As PHP developers, we do not necessarily find it difficult to write queries and code to present data to our clients. The thought, however, of writing a PHP-driven report building application is impressive. Regardless of your application, the ability to present your clients with data in a logical and informative way is a welcome and often required feature. Whether your application requires a monthly expenditure report, purchase report, or client medical chart, reports are an integral part of most systems. If you have ever had the opportunity to use a commercial reporting system, such as Actuate or Crystal Reports, you know there is sometimes a steep learning curve involved. In fact, many companies, mine included, have developers whose main task is report generation. Many of the large commercial reporting tools will work with many different applications, but often come with a high price and overhead. Well, say hello to the Lumenation Report Builder! Report Builder strives to allow developers, regardless of PHP or SQL experience, to create professional-looking reports quickly and easily. If data reports are a must-have for your application, read on.

Requirements

The Lumenation middleware requires the PostgreSQL database engine and can be operated on Linux, Windows, Unix, WebSphere, Novell, and more. If you are merely trying to connect to Lumenation, all that is required is a browser. Mozilla 1.4 is the recommended browser, but you can also use Netscape 7.1 or FireBird/FireFox (based on Mozilla 1.4 or higher).

First Impressions

I had the opportunity to meet with a member of the Lumen team for several sessions to review Report Builder. My first impression of the Lumenation middleware was ‘WOW’. It is truly an impressive-looking application. You quickly forget you are accessing a web page, as it wonderfully simulates a desktop application. As with many desktop operating systems, the applications available are easily found from the main menu, providing a comfortable feel and experience. The speed of the system was also impressive. I tested the system using a
connection from a T-1 line as well as through my home cable modem and had very quick page loads and processing speeds (see Figure 1).

Let’s Code

Now that we have all the formalities and first impressions out of the way, let’s actually get our hands dirty and create some reports. Lumenation’s report builder contains a few specific modules to help you get your report written quickly and easily. The Data Dictionary (DD) will more than likely be your first stop. It allows you to set up your data structures for use with the report builder as well as some of Lumenation’s other development tools. One nice feature you will discover in the DD is the ability to use various database engines and structures. The DD can automatically learn your company’s database structure, allowing you to choose all, or only selected, tables to include in the data dictionary (see Figure 2). Once your data structures are defined and in place, you will want to move to the Query Builder (QB). The QB allows you to, as the name implies, create queries, which will be used by your reports. One of the benefits of the QB is that all of the data structures have already been defined, so there is little worry of pulling the wrong data. The QB allows the developer to quickly create a query even with very little SQL experience. The QB allows the developer to add conditions, select which tables to add, sort, add functions, etc. Don’t worry if you are a more hands-on developer: the QB does have a manual mode which presents you with a text area where you can simply type your query. If data security is an issue, Lumenation easily allows for the separation of work. Your DBA can set up all the data structures in the data dictionary, handing off the query building to the developer. This will ensure the developer only has access to the data necessary for the reports (see Figure 3).

Now that you have all the groundwork complete, it is time to actually put together the report. The report builder contains three main areas: Data Model, Layout Module, and Generate Report. The Data Model is where you will pull in all your tables and queries from the Data Dictionary and Query Builder. You also have the option to add another level of sorting, grouping, and so on. The Layout Module, as the name implies, is where most of the nitty-gritty of the form is accomplished. From headers to fonts, bullets, and colors, this is where the report gets the ‘look and feel’ you desire. If you are experienced with reporting applications such as
Crystal Reports or Actuate, you may find the Layout Module missing some of the functionality to which you are accustomed. It does have most of the basic and some of the advanced functionality of the other commercial products, but lacks the ability to write custom functions. This is understandable, as many of the other reporting applications are written in Visual Basic or other languages, which makes adding custom VB-type functions easier. Also, the ability to add custom PHP functions may not be a desirable addition to a web product, as security would be a serious concern. Once your report is complete, all that’s left is to hit the compile or “Generate Report” button, which presents you with your finished report (see Figures 4-7).

Figure 1

Figure 2

Figure 3

So, your report is complete. Now what? I must admit, I was skeptical and thought I would have to send people from my site to Lumen’s site just to view the report, but that is not the case. The Report Builder gives you a few nice options on how to save your report: HTML, PHP, and PHP Standalone are just a few of them. The HTML output, in my opinion, isn’t really helpful. I am sure there are scenarios in which it could be useful, but it merely generates an HTML page with the static report data. The PHP and PHP Standalone options are the ones most developers will use. The PHP option generates all the PHP files necessary to run the report dynamically through Lumenation. This is cool but, in my opinion, PHP Standalone is the best. It creates all the necessary PHP and library files so I can download the report and stick it on my own server. Even if I knew very little PHP, I could use report builder to create my report and allow it to do all the PHP coding for me. This isn’t entirely foolproof and may require you to dig into the database file a bit, changing passwords and other connection information, but that is a small price to pay for a nice clean report.

What I Liked

As I have stated several times already, this system is cool. After my first introduction to the system by the Lumen team, I quickly showed the application to another developer at work, proud that it was written with PHP. Building the reports is also quite easy. I had the pleasure of a tutorial with a Lumen developer, but this should not be a taxing task even without one. The ability to download the reports to run on your
Figure 4
April 2004
●
PHP Architect
●
www.phparch.com
35
PRODUCT REVIEW
Lumenation Report Builder
Figure 5
Figure 6
April 2004
●
PHP Architect
●
www.phparch.com
36
PRODUCT REVIEW servers is also a very nice touch. One huge feature of the report builder is the ability to user several databases—not only can I use two different MySQL databases in a report, but the same report can use an Oracle db, PostgreSQL, MySQL, or anything else supported by PHP.
What I Didn’t Like There really isn’t much that I don’t like about the system. Being a Windows guy, I do have the same complaint mentioned in our September review—the product doesn’t work as well with Internet Explorer. While the system as a whole supports Microsoft browsers, Lumen suggests Mozilla for a better ‘user experience’. This isn’t a big gripe, as Mozilla works fine and is easily used on Windows. I understand the strong support for Mozilla, as clients such as schools and other organizations can utilize the power of Linux and free software, thus keeping their overall costs low. Another, albeit small, complaint I have is with the help system. There is a help menu for most of the modules, but, for the most
Lumenation Report Builder
part, it requires a flash plug-in. It would also be a nice touch to add ‘tool tips’ or the inclusion of context sensitive help. Conclusion Overall, I love this product. Just the looks of it are enough to encourage developers who are uncertain of PHP’s potential. There are many areas of this product that are worthy of being mentioned, but are beyond the scope of this small review. From the HIPAA (Health Insurance Portability and Accountability Act) compliance features, to the array of other applications such as Page Builder, Website Design, and so on, this is definitely a well rounded product.
php|a
Figure 7
Zend Does it Again
F E A T U R E
by Marco Tabini
REQUIREMENTS
PHP: 4.x
OS: Any
Applications: N/A

While PHP has definitely become the platform of choice for developers who work under Unix-like operating systems, it has failed to do so under Windows. The reason for this is probably twofold. On one hand, people who go to the extent of purchasing a Windows license may find it more convenient to stick with an all-Microsoft solution. On the other, Windows and most Unix-like environments differ on how they manage processes and threads, and most of the libraries on which PHP is based work with the Unix model, thus causing reentrancy problems on Windows systems, which, in turn, make PHP under ISAPI unstable.

Until now, this meant that, in order to have a properly working PHP system under Windows, you either had to use the Apache web server, which is not considered stable (at least for production purposes) under Microsoft platforms, or use a CGI solution under Internet Information Services (IIS), which caused the overall system to perform much worse than a comparable native solution.

While from a technical perspective the lack of a proper Windows solution for PHP developers has been rather downplayed by the community, from a business-strategy point of view the Windows market is not something that is worth ignoring. For one thing, Windows adoption is significant, particularly in the corporate world where the existence of a large business entity behind a product is seen as an advantage. In addition, companies that have adopted Windows subscribe to a business model that puts a monetary value on software as well as know-how, whereas open-source users are typically less receptive to the concept of commercial software.

Making PHP a wholly viable choice for the Windows market, therefore, would open up two interesting possibilities. First, Windows adopters can take advantage of the lower cost of development and ownership that comes with a PHP solution, due in part to the powerful language and in part to the relatively lower cost of PHP developers. Second, producers of PHP software gain access to a market whose players are used to having to pay a price for the software they acquire.

It is, therefore, not surprising (although pleasant) that, in late March, Zend Technologies announced the release of their new WinEnabler™ product. WinEnabler works by creating a level of insulation between IIS and the individual PHP processes, thereby protecting the web server from crashes in the PHP executable (which, although very rare, on a multi-threaded system like Windows can bring down the entire server) and, at the same time, dramatically improving the performance of the interpreter by creating a cached pool of several PHP instances.

From a technical perspective, there isn’t much to say about the WinEnabler: it “just works”. The concept behind it is ingenious and yet simple: the WinEnabler Server Plugin interfaces directly with IIS on one end and with a specially-compiled version of PHP (with thread safety enabled) on the other. The plugin creates a pool of persistent PHP processes that are never destroyed, so that the performance hit that, under normal CGI conditions, comes from having to load up and initialize the PHP interpreter only takes place once, when the server is started. At the same time, each PHP interpreter is instantiated as a separate process, so that, in the event of a crash, the overall web server’s stability is not affected. Naturally, PHP remains its own good old self: you can add any extensions, accelerators and cache add-ons to it just as you would under normal circumstances.

From a business and strategic point of view, however, the introduction of the WinEnabler has far-reaching implications and, therefore, the php|a team unleashed its investigative hounds all the way to our nearest phone to ask Rinat Gersch, Online Marketing Manager at Zend, a few pointed questions on the ins and outs of this fascinating new product.

Q: Can you explain the business motivation behind this new product?
WinEnabler was specifically designed in response to demand from enterprises, software vendors and solution providers. These wanted to expand their offerings from Linux to the Windows arena, thereby expanding their potential client pool. Additionally, a viable solution was requested to enable PHP to run on Windows with stability and performance that is up to par with PHP on Linux/Unix. WinEnabler supports PHP’s promise of true multi-platform compatibility by positioning PHP as a viable enterprise-grade technology for every environment: open source or proprietary.
Q: What licensing model have you adopted for it?
A simple and straightforward one: one license per machine, regardless of the number of CPUs installed.

Q: What is your target market? Are you after the SME industry, or the enterprise sector?
The target is the PHP on Windows market, whether for small, medium or large enterprises. WinEnabler licensing starts at $195, well within everyone’s budget! WinEnabler is ideal for those who want to enter this market, as well as those already providing PHP solutions on Windows, but desperately require a PHP-on-Windows solution they can rely on and market to their customers with assurance.

Q: Do you really think that PHP under Windows is a viable platform? Why should developers prefer it to .Net?
Absolutely! There was never a doubt that PHP on Windows has commercial value. The problem was that until Zend’s WinEnabler, there simply was no solution that brought PHP on Windows to be on par with Linux and UNIX in terms of stability and performance. Incidentally, Zend ran a survey during the WinEnabler’s beta cycle, asking users what they thought were the advantages of PHP over other scripting languages. Their answers included:
• ease of use
• quick to learn
• flexibility
• rapid development of code
• continuous development and evolution of the language
• wide and rich features
• Open source preference
• Linux-compatibility
• lower hosting costs
• cross platform support
• C-like syntax
• DB support
and many more.
Q: What would be the motivation, from a strategic perspective, for a decision-maker to choose Windows + PHP over an all-native Windows implementation?
Choosing PHP as a strategic direction helps the customer to preserve the investment in his application. PHP is an open and portable platform that is available for a large selection of web servers, hardware platforms and operating systems. It provides the decision maker with the ability to choose the Hardware/Web Server/Operating System combination that provides the best Cost/Performance value for his/her application without having to worry about expensive software porting costs when the infrastructure matures.

Q: How big do you estimate the Windows market for PHP is? Do you think that the Enabler will increase it?
According to Netcraft, 7% of the PHP sites run on Windows. This figure has doubled over the last year, making PHP the most popular non-Microsoft scripting language used on Windows. We believe that the WinEnabler will support the natural growth of this trend. As PHP gains in popularity, more Windows users will recognize its ease, simplicity and robustness. WinEnabler will boost this conversion level because it allows PHP to run on Windows as reliably as on Linux.

Q: Do you plan to use the Enabler in conjunction with .Net in the future?
PHP in general, and especially PHP 5, provides the user with full capabilities to integrate with Microsoft .Net and access .Net classes and objects in a very convenient and easy way. At the architecture level, PHP does not use the .Net CLR, and the WinEnabler is no different in this case. We’ve evaluated the possibility of implementing a version of PHP on top of the .Net CLR, but that proved to be impractical because of the dynamic nature of PHP, which is incompatible with the architecture of CLR.

Q: Do you expect the introduction of the Enabler to improve the overall PHP market? In which way?
The introduction of WinEnabler supports and enhances PHP’s promise of multi-platform compatibility. By improving PHP’s usability on Windows, we expect to see more Windows-based enterprises and SMEs choosing PHP for their web applications. WinEnabler also supports solutions providers and ISVs by providing them with the business opportunity to expand their market. The existence of high-quality PHP applications for the Windows market will be a second driving force in the adoption of PHP in this environment.

Oops, they did it again!

With the release of the WinEnabler, Zend once again demonstrates that it has a firm understanding of the business side of the PHP market. The WinEnabler should allow existing PHP developers access to a whole new market, and a whole new set of opportunities.

To Discuss this article: http://forums.phparch.com/139
Smarty and Internationalization
F E A T U R E
by John Coggeshall
Internationalization is one of those topics that no one really wants to deal with. Implementing a single web site in four different languages can be a very daunting task. Considering the “Works for me” mentality is in full effect (after all, you can read your own site), there is little discussion of it in the Web development arena. PHP supports a number of different tools to assist in the development of multi-lingual sites, and, in this article, I’ll borrow some ideas from many of them to develop a version of the Smarty templating engine which supports a fairly robust internationalization architecture.
Designing Internationalization for Smarty
The concept of developing an internationalization layer for Smarty was born out of the website being developed for the PHP Community project (www.phpcommunity.org/). As a project which aims to serve the PHP community, it was decided early on that content would be presented in a number of different languages on the site itself. Considering the web site was using Smarty for its presentation layer, the popular templating system seemed like a very logical choice for the implementation of an internationalization system. In this article, I’ll explain the way I have implemented internationalization in Smarty (which I have renamed IntSmarty for the occasion) and explain its use within templates to generate multi-lingual sites quickly and easily.
REQUIREMENTS
PHP: 5
OS: Linux
Applications: Smarty
Code Directory: smarty

IntSmarty Template functionality

One of the primary goals of the IntSmarty library was maintainability. Thus, it was very important that IntSmarty be implemented in such a way that it required no modification to the underlying Smarty library itself. Instead, IntSmarty is designed to be an extension of a previously existing Smarty class coupled with a series of custom plugins which provide the internationalization functionality that was desired. Although there are many different ways to implement internationalization, for the IntSmarty library it was decided that a gettext-like approach would be used. In this approach, all strings in a template which are to be translated into a different language are extracted from the template, assigned a unique ID and stored in a table for the language they were written in. This table (if available, the one for the desired language) is then referenced any time the given string is used within the template.

String extraction and translation

In the IntSmarty library, internationalization strings are wrapped within the {l} block function within a template as shown below:

    This is not going to be translated.
    {l}This text will be translated{/l}
When this template is processed by IntSmarty, any text within {l} tags is automatically extracted, given a unique ID, and stored in a language table file. In each future request for that template, the value from the language table is used rather than the string provided in the template itself. Since all strings which are to be internationalized exist within a single file on the server per language, this file can be passed along to translators for localization to another language quickly and easily. To provide our example template in three additional languages (German, French, and Spanish), four language table files must be created: en-us.php (the default English), de.php, fr.php and es.php, as shown in Listings 1, 2, 3 and 4. IntSmarty determines which language table it will use for the template based on the primary language accepted by the browser visiting the web site, which is extracted from the $_SERVER superglobal, as we’ll see later on.
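As a rough illustration of how a language table and the browser's preferences might fit together, here is a minimal sketch. It is not the article's code: the function names, the in-memory `$tables` array standing in for the table files, and the German text are all hypothetical. It assumes each table maps the md5 hash of the original string (the ID scheme the article describes later) to the text in that language, and it assumes the browser lists languages in order of preference, so the `;q=` weights are simply dropped.

```php
<?php
// Sketch only: hypothetical helpers and data, not the article's listings.

// Turn an Accept-Language header into an ordered list of language codes.
// Assumption: the header already lists languages in order of preference,
// so the ";q=..." weights are simply dropped.
function sketch_preferred_languages($header, $default)
{
    $langs = [];
    foreach (explode(',', $header) as $part) {
        $pieces = explode(';', $part);
        $code = strtolower(trim($pieces[0]));
        if ($code !== '') {
            $langs[] = $code;
        }
    }
    $langs[] = $default; // the default is always the last resort
    return $langs;
}

// Assumption: each table maps md5(original string) => text in that language.
$source = 'This text will be translated';
$tables = [
    'en-us' => [md5($source) => $source],
    'de'    => [md5($source) => 'Dieser Text wird übersetzt'],
];

// Return the text from the first preferred language that has an entry.
function sketch_lookup(array $tables, array $preferred, $text)
{
    $id = md5($text);
    foreach ($preferred as $lang) {
        if (isset($tables[$lang][$id])) {
            return $tables[$lang][$id];
        }
    }
    return $text; // nothing found: fall back to the original string
}

$preferred = sketch_preferred_languages('de,en;q=0.5', 'en-us');
echo implode(' > ', $preferred), "\n";
echo sketch_lookup($tables, $preferred, $source), "\n";
```

Keeping the default language at the end of the preference list means a lookup can always fall through to something readable, which mirrors the fallback behaviour the article describes.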
Listing 1
Listing 2
Listing 3
Listing 4

File translation

Although useful for strings, the {l} block function is not very suitable for translation when files such as images are used in menus. For this purpose, I devised a second function, which I called i18nfile. This function takes a maximum of two parameters. The first of these parameters is file, the path of the file to load; the function returns an internationalized version of that path by prefixing its last portion with the current language code. Optionally, a lang parameter can also be provided, which should be set to the language you’d like the file to be represented as. Thus, for a given path /gfx/mymenuitem.jpg, the i18nfile function would translate that file name to /gfx/en/mymenuitem.jpg automatically (Listing 5).

Listing 5

How IntSmarty works internally

Now that you have an idea of how the IntSmarty class works from the side of a template designer, let’s take a look at how it works internally. As stated before, the IntSmarty class is an extension of the standard Smarty class and, as such, it has access to all of its methods and properties. However, the IntSmarty class does add a number of properties and methods to the base Smarty class (as well as some new plugins), to which we’ll turn our attention now. The first addition to the standard Smarty is the introduction of internationalization-related properties to the class. You can see them in Table 1. Of all of these new properties, only two ($default_lang and $lang_path) should be hardcoded and configured before using the IntSmarty class. The remainder will be used throughout the library itself.

Table 1
$default_lang       The default language code which IntSmarty will use if none was provided.
$lang_path          The path where IntSmarty will store its language tables. Must be writable by the web server.
$a_languages        An indexed array created at run time which lists in order of preference the languages accepted by the browser.
$cur_language       A string representing the current language being used by the IntSmarty class.
$translation        An array representing the translation tables.
$translation_size   The size of the translation table.

Determining the language

The first task of the IntSmarty class is to determine which language the browser would like the page to be returned in. To do this, we must rely on the browser providing that information to the web server and retrieve it using the HTTP_ACCEPT_LANGUAGE element of the $_SERVER superglobal array. This variable is provided to PHP by the web server in the following fashion:

    en-us,en;q=0.5

where the primary language is listed first, followed by other accepted languages. To extract this information into our PHP code, I created a helper function and called it _determineLangs(). You can see it in Listing 6. As you can see, the _determineLangs() function is basically a wrapper for a call to preg_match_all(). This function uses the provided regular expression to return an array that contains each language accepted by the browser. If, for whatever reason, preg_match_all() fails to extract the desired information, an array with the value from the $default_lang property is instead returned and stored into the $a_languages property for future reference.

Listing 6

    function _determineLangs() {
        // Collect each language code that is followed by a "," or ";" separator.
        @preg_match_all("/([a-z\-]*)?[,;]+/i",
                        $_SERVER['HTTP_ACCEPT_LANGUAGE'],
                        $matches);
        // The default language is always appended as the last choice.
        $matches[1][] = $this->default_lang;
        return $matches[1];
    }

Language String Tables

Once the language the browser would like to use is determined, the IntSmarty class must next locate and load the most suitable language for the browser. This is done by cycling through the $a_languages property and attempting to load a language table for each language specified. Because the $a_languages property will always have as its last item the value of the $default_lang property, if no suitable language was found the default will be used. The process of actually attempting to load a language table is handled through the loadLanguageTable() method, whose source is shown in Listing 7. This method accepts a single parameter (the language to load the table for), and returns a boolean True or False indicating if the table was loaded successfully. This method is fairly straightforward: it constructs the filename where the language file is expected to be found and attempts to load it using the require_once() function. As shown earlier in the article, language tables are stored as PHP scripts which define the $__LANG array, so the process of populating the $translation property is reduced to simply checking for and assigning this array. As I just mentioned, a boolean value is returned indicating if the language table for the specified language was loaded successfully. Although used internally by the IntSmarty class to load the default language table, this function can also be used from within your PHP scripts to force a particular language to be displayed by passing the desired language code.

Listing 7

Listing 8

    ...
            $fr = fopen($filename, "w");

            if(!$fr) {
                return false;
            }

            fputs($fr, $code);
            fclose($fr);

        }

        return true;
    }
    ?>

Saving a Language Table

The sister method to loadLanguageTable() is saveLanguageTable(), which, as its name implies,
saves the language table back to the file system and must be called before script termination. As you can see in Listing 8, saveLanguageTable() accepts no parameters. Assuming the translation table has changed in size in the course of the script’s execution (which should only happen if a string with no known translation is found), a new translation table is created and written to the appropriate file. For the sake of simplicity, these tables are stored as valid PHP scripts using the var_export() function to generate an array. When the method is executed, it returns a boolean value indicating a success or failure status.

Displaying the correct language strings
Now that I have introduced to you the rather boring methods which support the IntSmarty class, let’s take a look at where this class gets the majority of its power by discussing the implementation of the template block function {l}. Although Smarty itself provides the necessary functionality to create block functions using the plugin system, they are unsuitable for the implementation of this function. Consider the following code fragment:

    This isn’t translated
    {l}This is translated, the value of foo is {$foo}{/l}
Using standard block-level functions, the value of the {$foo} variable will be inserted prior to the {l} block-level function being executed. Thus, instead of getting the following string in your translation table:

    This is translated, the value of foo is {$foo}

the following would be stored, assuming the value of the {$foo} variable was 10:

    This is translated, the value of foo is 10
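To see why this matters for a hash-keyed table, consider the small sketch below. It is only an illustration: it assumes the ID of a string is its md5 hash (as the article describes later for the prefilter), and it uses `str_replace()` as a stand-in for Smarty's variable substitution.

```php
<?php
// Assumption: a string's ID in the translation table is its md5 hash.
// Single quotes keep {$foo} literal, exactly as it appears in the template.
$template = 'This is translated, the value of foo is {$foo}';

// What an ordinary block function would see: {$foo} already substituted.
// str_replace() stands in for Smarty's own variable handling here.
$seen = [];
foreach ([10, 11, 12] as $foo) {
    $rendered = str_replace('{$foo}', (string) $foo, $template);
    $seen[md5($rendered)] = $rendered;
}

// Every distinct value of foo produces a different ID, so the table would
// gain one entry per value, and none of them matches the raw template.
echo count($seen), " entries when hashing after substitution\n";
echo isset($seen[md5($template)]) ? "raw template found\n" : "raw template not found\n";
```

Hashing the text before any variable is substituted keeps one stable entry per template string, which is the property a translator needs.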
In order to properly implement the {l} block-level function, its execution must take place before Smarty attempts to compile any other tags within the template. For this purpose, I designed a “prefilter” plugin. When an IntSmarty object is initialized, a prefilter called smarty_lang_prefilter() is registered internally using the register_prefilter() method. While I was developing the IntSmarty class, I quickly found that even registering a prefilter to handle the {l} function was not enough. To understand why, let’s take a look at the function signature of the smarty_lang_prefilter() function:

    function smarty_lang_prefilter($content, &$smarty)

As one might expect, two parameters are passed into this function. The first is the contents of the template itself, while the second is a reference to the Smarty class. Although this makes sense, the problem is that the $smarty variable passed into the prefilters is not the same instance of the Smarty class that you use within your PHP scripts! Thus, it will contain no information regarding the translation tables that we previously discussed. The reason for this can be found in the way Smarty was designed, specifically in how Smarty compiles templates and when prefilter plugins are called internally by Smarty.

Smarty doesn’t mean it’s Smarty!
Ignoring configuration files, Smarty itself functions on two different classes. The first is the Smarty class, which provides all of the methods and configuration values actually used within your PHP scripts. However, a second Smarty class, Smarty_Compiler, is used internally by the Smarty class to compile your templates into PHP scripts. Because of the way the engine works, when compiling a script a new instance of Smarty_Compiler is created and assigned relevant values from the real Smarty class (namely configuration options). Because prefilters are compiler plugins (not called from the context of the Smarty class itself), when the smarty_lang_prefilter() function is called it only has access to the instance of the Smarty_Compiler class which was created. In order for the IntSmarty class to work, it must somehow provide an instance of itself to the smarty_lang_prefilter() function.

After a little research into the Smarty library itself, I determined that the function ultimately responsible for the creation of the Smarty_Compiler class that compiles our templates was the _compile_template() function. Within this function, the Smarty_Compiler class is created and all of the configuration values stored within the originating instance of the Smarty class are passed to it. Because there is no reference to the Smarty compiler outside the scope of this method, in order for our prefilter to work we must implement our own _compile_template() method within IntSmarty. This method is a direct copy and paste from the internal Smarty version with one minor difference: the addition of the following line of code:

    $smarty_compiler->parent_inst = &$this;

This line creates a single additional property within the Smarty_Compiler class, the $parent_inst variable. This variable, in turn, will always reference the IntSmarty instance that created and called the Smarty_Compiler class, thus providing a means for our prefilter to access its methods and properties.

Listing 9
Listing 10

The smarty_lang_prefilter() function
With the problem of accessing the appropriate instance of Smarty solved, let’s now take a look at the smarty_lang_prefilter() function. This function is responsible for the extraction and replacement of all strings within the template that use the {l} template function. Listing 9 provides the source code. As I mentioned previously, because the $smarty parameter passed to this function is not the instance we need to access, the first step is to assign the parent instance we need to the $inst variable. Looking at this code, it is clear that it relies heavily on regular expressions to accomplish its goals. Since regular expressions are beyond the scope of this article (you can read up on them through the excellent series that George Schlossnagle is writing for php|a), I will limit myself to providing only a basic description of each regex.

The goal of this prefilter is to take any string wrapped in the {l} template function and extract its literal contents for translation. The first step in this process is the creation of two variables, $ldq and $rdq, using the preg_quote() function. These variables will represent the “quoted” versions of our function delimiters { and }, which will be used in future regular expressions. The first real step of the prefilter function is to extract all of the strings which are wrapped within {l} template functions by using the preg_match_all() PHP function, which matches a regular expression against a particular string as many times as possible and returns each result in the $match array. The template is then processed in the following fashion:
• A unique hash of the string is created using md5()
• The hash value is searched for in the $translation table to see if a valid translation for it already exists
• If a valid translation exists, the preg_replace() function is used to replace all instances of that particular string with its translation
• If no valid translation exists, the string and its hash are stored in the $translation table to be written later
• All {l} and {/l} tags are removed from the template, leaving only the translated strings (if translations were found)
• The modified template is returned back to the compiler for further processing.

Listing 11

Listing 13

    {l}My international web site - {$subtitle}{/l}
    {l}Welcome to my first IntSmarty web site!{/l}
    Here is a picture of a flag representing your language:

Listing 14

The i18nfile function

Earlier in the article I introduced the i18nfile function as a function useful for providing a language context within your templates when files are used. Unlike the {l} template function, which required special consideration, the i18nfile function is a simple Smarty plugin function which is registered in the IntSmarty constructor using the register_function() method. This method registers the i18nfile function and ties it to the smarty_function_i18nfile() function, which is shown in Listing 10. In Smarty, function plugins accept two parameters. The first is the $params parameter, an associative array which provides each parameter passed to the function and its value. The second is the instance of the Smarty class that called the plugin. Because our plugin in this case is not a compiler plugin, the instance of Smarty provided is the correct one and, therefore, no magic is required to access important information such as translation tables. As the Smarty manual indicates, this function returns a string representing its output (which is then used in place of the function call in the template).
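A function plugin of this shape can be sketched in a few lines. This is not the article's Listing 10: the function name `sketch_function_i18nfile()` is hypothetical, and a plain `stdClass` object stands in for the Smarty instance, exposing only the `cur_language` property described in Table 1. The path handling assumes Unix-style `/` separators.

```php
<?php
// Sketch only: not the article's Listing 10. $smarty is any object that
// exposes the cur_language property described in Table 1.
function sketch_function_i18nfile($params, $smarty)
{
    if (empty($params['file'])) {
        return '';
    }
    // An explicit lang parameter wins over the current language.
    $lang = !empty($params['lang']) ? $params['lang'] : $smarty->cur_language;
    $dir  = dirname($params['file']);
    $base = basename($params['file']);
    // Insert the language code in front of the last path component.
    return rtrim($dir, '/') . '/' . $lang . '/' . $base;
}

// A stand-in for the Smarty instance that would be passed to the plugin.
$smarty = new stdClass();
$smarty->cur_language = 'en';

echo sketch_function_i18nfile(['file' => '/gfx/mymenuitem.jpg'], $smarty), "\n";
echo sketch_function_i18nfile(['file' => '/gfx/mymenuitem.jpg', 'lang' => 'de'], $smarty), "\n";
```

In a real template the plugin's return value would be dropped in wherever the function call appears, for example as the source path of an image.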
The assignLang method

The final method that I will discuss provides a means of using the IntSmarty translation services from within the backend PHP scripts that power the template. This method is called assignLang() and is essentially identical to the assign() method, which you should already be familiar with if you use Smarty in your day-to-day work. However, unlike assign(), the assignLang() method will automatically translate the passed string just as if it had existed within {l} tags within the template itself. The code associated with this function is found in Listing 9. The assignLang() method uses a technique to replace the string with the translation (or at least store it within the translation table for translation at a later time) similar to the one employed by the prefilter method that handles the {l} template function. As you can see, once the translation process has been completed, the method cascades down to the traditional assign() function to assign the variable.

Putting it all together

Now that all the pieces of the puzzle are ready, it’s time to place them on the board and try them out. Listing 12 provides the complete IntSmarty class for your reference (it can be obtained from this month’s code package at code.phparch.com/22). Using the internationalization functionality from your own Smarty templates is very simple. For example, consider Listing 13: as you can see, this simple HTML template contains both internationalized text (which, in turn, includes a dynamic variable) and internationalized file names. In Listing 14, which shows the source code for the backend to Listing 13, I create an instance of the MyIntSmarty class, assign it a language and show the template. Note how, at the end, I always save the language table.

Once the internationalization features are in place, you should have no problem making your sites multilingual. One thing you may want to consider is building a simple web-based localization system that would allow your content to be translated by personnel who have little or no understanding of PHP code, thus keeping with the Smarty philosophy of separating the presentation layer from the application’s logic.
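The extraction and replacement steps described for the {l} prefilter can be sketched as a standalone function. This is an approximation under stated assumptions, not the article's code: the function name and the `$table` argument are hypothetical, the table is assumed to map md5 hashes to translated text, and `preg_replace_callback()` is used in place of the separate `preg_match_all()` and `preg_replace()` calls the article describes.

```php
<?php
// Sketch only: a standalone approximation of the steps the article lists
// for smarty_lang_prefilter(); it is not the article's own code.
// $table maps md5(original string) => translation and is updated in place.
function sketch_lang_prefilter($content, array &$table)
{
    // Quote the delimiters so they are safe inside a regular expression.
    $ldq = preg_quote('{', '!');
    $rdq = preg_quote('}', '!');
    $pattern = "!{$ldq}l{$rdq}(.*?){$ldq}/l{$rdq}!s";

    return preg_replace_callback($pattern, function ($m) use (&$table) {
        $id = md5($m[1]);
        if (!isset($table[$id])) {
            $table[$id] = $m[1]; // unknown string: remember it for translators
        }
        return $table[$id];      // the {l} tags are dropped, the text swapped in
    }, $content);
}

$text  = 'Welcome to my first IntSmarty web site!';
$table = [md5($text) => 'Willkommen auf meiner ersten IntSmarty-Website!'];

// Single quotes keep {$foo} literal, as it would be before compilation.
$template = 'Plain text. {l}' . $text . '{/l} {l}Value of foo is {$foo}{/l}';
$out = sketch_lang_prefilter($template, $table);

echo $out, "\n";
echo count($table), " entries would now be saved\n";
```

Because the untranslated string is added to the table as it is encountered, the table grows during the request, which is why the backend script has to save it before it finishes.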
●
PHP Architect
●
www.phparch.com
About the Author
John Coggeshall is a PHP consultant and author who started losing sleep over PHP around five years ago. Lately you'll find him losing sleep meeting deadlines for books or online columns on a wide range of PHP topics. You can find his work online at O'Reilly Network's onlamp.com and Zend Technologies, or at his website http://www.coggeshall.org/. John has also contributed to Apress' Professional PHP4 and is currently in the process of writing the PHP Developer's Handbook published by Sams Publishing.
To Discuss this article: http://forums.phparch.com/140
Creating Web Services with PHP and SOAP
F E A T U R E
by Alessandro Sfondrini

We already know how to use SOAP and PHP together to write a client able to communicate with a Web Service. In this article we will learn how to write our own Web Service, both on our own and aided by a useful tool such as NuSOAP, the library we used last time to write our Amazon.com Web Services client.
REQUIREMENTS
PHP: 4.01
OS: Any
Applications: NuSOAP 0.6.4, MySQL
Code Directory: soap

Introduction
You can think of a Web Service as an application designed to make itself available over a network (typically, but not always, the Internet) and capable of sending and receiving data encoded in XML documents. Since the main goal of Web Services is to allow software running on different platforms or written in different languages to communicate, a common language has to be used for this purpose, and XML (eXtensible Markup Language) is the best candidate out there, given the widespread availability of libraries for accessing documents written in it.
More precisely, there is an XML-based protocol, called SOAP (Simple Object Access Protocol), that has been expressly created for sharing data and invoking methods remotely, regardless of what platform or server they run on, as long as they are accessible through a network protocol like HTTP.
In two previous issues of php|a, I wrote a couple of articles that showed how to perform a Remote Procedure Call (RPC) and fetch its result using PHP and SOAP. We built two PHP applications that use SOAP to access the Web Services provided by Google and Amazon.com, and found out that there are some tools that can help our coding, such as NuSOAP, a PHP library useful for writing both client and server applications.
This time, we'll write our own Web Service, that is, the server application which receives the calls and sends back the responses. In the first part of this article, we will analyze the main, common structure of Web Services, looking closer at their peculiarities: this will enable us to design our own Web Service, which won't be a very complex application but will, nonetheless, contain every typical Web Service feature. Next, we will write and test our application, taking a particular interest in error management and security issues.

How Do Web Services Work?
Our purpose is to allow an application designed to perform SELECT and INSERT queries against a database table to be executed by an external party. This means that it has to be able to receive a request from other systems and return a response that those systems can understand, regardless of what platform and architecture they are running on.
We will use the HTTP protocol and only process requests sent using the POST method. This means that, for example, when a user tries to open the page using a browser, an error will be displayed. We could also decide to process only requests received from a certain IP (or from a set of IPs), but, in our example, we won't apply any filter of this kind.
Next, we have to fetch the request and parse it; we will also check that the parameters the client has sent are the ones required for the remote procedure call to function properly (for example, we will check username and password). If everything works out, we will perform the appropriate SELECT or INSERT query and return the results: the records matching the query in the first case, a success message in the second.
Thus, we can now write the general operation outline of a web service:
• Fetch the SOAP-encoded request
• Parse the request
• Check the parameters passed as part of the request
• Perform the operation(s) required
• Send a SOAP-encoded response
Each of these operations requires an accurate error management strategy: the request may not be correctly encoded or may come from a non-acceptable client, the parsing process may fail, the parameter count may be
wrong, and, of course, the operation itself may encounter a problem. In any of these situations, we must be prepared to stop the execution of the script and send the client an error message explaining what happened.

Designing the Application
Based on the step-by-step procedure above, we will write our application and split it into two files:
• insert.php (Listing 1), designed to perform the INSERT query. We will write this file without using any SOAP library, to understand how a SOAP server works;
• select.php (Listing 2), which will perform the SELECT query. In this case we will take advantage of NuSOAP, to see how helpful such libraries can be.
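The five-step outline described earlier, together with the rule that any failure must stop processing and report a fault, can be sketched roughly as follows. This is only an illustration and not the code of Listing 1: the names FaultException and handleRequest() are hypothetical, exceptions require PHP 5 or later, and SimpleXML is used here purely for brevity.

```php
<?php
// Hypothetical exception type used to abort processing with a fault.
class FaultException extends Exception
{
    public $faultCode;   // "Client" or "Server"

    public function __construct($faultCode, $message)
    {
        parent::__construct($message);
        $this->faultCode = $faultCode;
    }
}

// Walks through the five steps. $required lists the parameter names that
// must be present; $operation is any callable that performs the real work.
function handleRequest($method, $rawBody, array $required, $operation)
{
    try {
        // 1. Fetch: only POST requests are accepted.
        if ($method !== 'POST') {
            throw new FaultException('Client', 'Method POST is required');
        }

        // 2. Parse: collect the text of every leaf element.
        $doc = @simplexml_load_string($rawBody);
        if ($doc === false) {
            throw new FaultException('Client', 'Malformed XML');
        }
        $leaves = $doc->xpath('//*[not(*)]');
        if (!is_array($leaves)) {
            throw new FaultException('Client', 'Unreadable request');
        }
        $params = array();
        foreach ($leaves as $leaf) {
            $params[strtoupper($leaf->getName())] = (string) $leaf;
        }

        // 3. Check: every required parameter must be present and non-empty.
        foreach ($required as $name) {
            if (!isset($params[$name]) || $params[$name] === '') {
                throw new FaultException('Client', "Missing parameter $name");
            }
        }

        // 4. Perform the operation.
        $result = call_user_func($operation, $params);

        // 5. Hand back the data to be encoded in the response.
        return array('ok' => true, 'data' => $result);
    } catch (FaultException $e) {
        // Any failure ends up here and becomes a fault description.
        return array(
            'ok'          => false,
            'faultcode'   => $e->faultCode,
            'faultstring' => $e->getMessage(),
        );
    }
}
```

Returning an array instead of printing directly keeps the sketch easy to test; a real server would turn either result into a SOAP-encoded document before sending it.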
Listing 1
$type$actor $msg END; exit; } }
Remember that, in general, splitting a web service into two or more separate files is a mistake, especially if each file's purpose is quite similar to the others'. However, for the purposes of this article, we will work in this way so that you can see how the application can be developed with and without NuSOAP.
The table which both applications will work on is called people; on my server, it resides in the database test and has the following structure:
CREATE TABLE people (
  id int (10) unsigned NOT NULL auto_increment,
  name varchar (30) NOT NULL,
  age tinyint (3) unsigned NOT NULL,
  city varchar (30) NOT NULL,
  PRIMARY KEY (id)
) TYPE = MyISAM;
Both files will require login information (contained in the USER and PWD tag parameters). Insert.php will also require NAME, AGE and CITY and, after inserting a record in the people table, will return two values: STATUS, which will be “OK” if the query has been successful, and TIME, which will contain the current UNIX timestamp. Select.php, on the other hand, will require, besides the login information, only ID, an integer that will be used to perform the query. This file will return NAME, AGE and CITY (all taken from the record selected from the people table) and the current timestamp TIME.
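As an illustration of the return values just described for insert.php, the sketch below builds a SOAP 1.1 response envelope carrying STATUS and TIME. It is an assumption-laden example rather than the code from Listing 1: the function name buildResponse() and the wrapper element name insertResponse are hypothetical, and only the envelope namespace is the standard SOAP 1.1 one.

```php
<?php
// Builds a SOAP 1.1 response envelope from an associative array.
// $wrapper is a hypothetical element name, not taken from Listing 1.
function buildResponse($wrapper, array $values)
{
    $body = '';
    foreach ($values as $name => $value) {
        // Escape the text so that characters such as < and & stay valid XML.
        $body .= "<$name>" . htmlspecialchars((string) $value) . "</$name>";
    }

    return '<?xml version="1.0" encoding="UTF-8"?>'
        . '<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">'
        . "<SOAP-ENV:Body><$wrapper>$body</$wrapper></SOAP-ENV:Body>"
        . '</SOAP-ENV:Envelope>';
}

// Example: the two values insert.php is described as returning.
echo buildResponse('insertResponse', array('STATUS' => 'OK', 'TIME' => time()));
```

A response for select.php could be produced the same way by passing NAME, AGE, CITY and TIME instead.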
Coding Without NuSOAP
In Listing 1, you can see a class named SOAP, which we will use to parse the XML data and to send error messages and responses. The methods of this class (ParseStart(), ParseEnd(), SOAPparse() and doParsing()) are quite self-explanatory, and are used to interface with PHP's XML functionality.
The main method is SOAPparse(), which requires the XML document $xml, in textual form, to be passed to it. The function works by first creating the XML parser $parser and linking it to the current class by using xml_set_object(); this is always required if you want to use an XML parser inside a class. In the following line we use xml_set_element_handler() to define the functions we want to use as the start element handler (for the opening tag) and the end element handler (for the closing tag) of every XML element in the document. These two functions (ParseStart() and ParseEnd()) can have any name, but their declaration always has to follow this template:
function ParseStart($parser, $name, $attribs)
function ParseEnd($parser, $name)
Any different declaration will prevent the application from working. In our listing ParseStart() and ParseEnd() are declared as methods of the class SOAP.
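To make the handler mechanism more concrete, here is a minimal, self-contained sketch of a class that registers start and end handlers following the template above and collects the text of each element. It is not the SOAP class from Listing 1: the class name TagCollector and the parseData() handler are hypothetical, and the handlers are registered as array callbacks, which is an alternative to calling xml_set_object().

```php
<?php
// Hypothetical class; collects "element name => text content" pairs.
class TagCollector
{
    public $req = array();
    private $current = '';

    // Start element handler: remembers which element is open.
    public function parseStart($parser, $name, $attribs)
    {
        $this->current = $name;
    }

    // End element handler: no element is open any more.
    public function parseEnd($parser, $name)
    {
        $this->current = '';
    }

    // Character data handler: stores non-blank text under the open element.
    public function parseData($parser, $data)
    {
        if ($this->current !== '' && trim($data) !== '') {
            if (!isset($this->req[$this->current])) {
                $this->req[$this->current] = '';
            }
            $this->req[$this->current] .= $data;
        }
    }

    // Parses $xml and returns true on success.
    public function parse($xml)
    {
        $parser = xml_parser_create();
        xml_set_element_handler(
            $parser,
            array($this, 'parseStart'),
            array($this, 'parseEnd')
        );
        xml_set_character_data_handler($parser, array($this, 'parseData'));

        return xml_parse($parser, $xml, true) == 1;
    }
}

$c = new TagCollector();
$c->parse('<req><user>myuser</user><pwd>mypwd</pwd></req>');
print_r($c->req);
```

Because PHP's XML parser folds element names to upper case by default, the collected keys come out as USER and PWD, which matches the upper-case keys used with $s->req in Listing 1.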
Listing 1 (continued)
$s = new SOAP;

// Checks the method
if ($_SERVER["REQUEST_METHOD"] != "POST")
    $s->fault("Client", "Bad request method", "Method POST is required");

// Parses the data
$s->SOAPparse($HTTP_RAW_POST_DATA);

// Checks the user/pass
if ($s->req["USER"] != "myuser" || $s->req["PWD"] != "mypwd")
    $s->fault("Client", "Login incorrect", "Bad value of params 'user' or 'pwd'");

// Checks the params
if (empty($s->req["NAME"]) || empty($s->req["AGE"]) || empty($s->req["CITY"]))
    $s->fault("Client", "Bad request", "'name', 'age' and 'city' can't be empty");

$query = "INSERT INTO people (name, age, city) VALUES ('" . $s->req["NAME"]
       . "', " . $s->req["AGE"] . ", '" . $s->req["CITY"] . "')";

if (($conn = @mysql_connect("localhost", "root", "")) === FALSE)
    $s->fault("Server", "MySQL", mysql_error());
if ((@mysql_select_db("test", $conn)) === FALSE)
    $s->fault("Server", "MySQL", mysql_error());
if ((@mysql_query($query, $conn)) === FALSE)
    $s->fault("Server", "MySQL", mysql_error());
@mysql_close($conn);

$now = time();
header("HTTP/1.1 200 OK");
header("Content-Type: text/xml");
echo
response) and, inside of it, we store STATUS, which is set to "OK", and TIME, which contains the timestamp value in Unix format (obtained from the $now variable).

Using NuSOAP
In writing the second application, we will take advantage of the NuSOAP PHP library, a group of classes designed to allow developers to manage SOAP web services, which will speed up our coding. Since I've used it in my two previous articles for php|architect (which appeared in the January and March 2004 issues), I won't dwell too much on NuSOAP's generalities here other than to note that it is distributed under the LGPL and that you can download it at dietrich.ganx4.com/nusoap/.
To include the NuSOAP classes, all we have to do is add a simple require('nusoap.php') statement to our PHP files.
Our application is shown in Listing 2. As you can see, after including NuSOAP we create $s, an instance of soap_server; we then use the register() function to inform the server instance of the name of the method we want to allow the client to use. In the previous example, we skipped this step (that is, we didn't define the name of the method) because our script was designed to perform only one operation, the INSERT query. The NuSOAP library, on the other hand, is designed to support multiple methods and, therefore, we must first register the method name with the server object, and then declare a PHP function with the same name. This function, in turn, contains the actual instructions we want to execute when the remote procedure call is performed by a client. In fact, in this second application the function itself contains almost everything that was outside our class in the previous example.
From a functional perspective, the application is quite similar to its NuSOAP-free counterpart: first, we validate the login data and the input variables; if anything is wrong, we return a fault (in this case we use NuSOAP's soap_fault() function, but, as you can see, this method isn't very different from our own SOAP::fault()). Note that we don't have to worry about the request method used by the client: NuSOAP will automatically take care of this for us.
Next, we build the query and store it in $query. Once that is taken care of, we perform all the database operations (connecting to the server, selecting the DB, executing the query and closing the connection), making sure, as usual, that no errors occurred. This time, we also check whether a record matching the query we executed has been found and, if that is not the case, return a fault. Finally, we store the database results in the $resp array and return it.
The only other line of code needed to complete the script is the one that actually executes the Remote Procedure Call through NuSOAP: by using the $s->service() method, we cause all the POST data to be passed to the library, which will first parse the request and then look for the method that should be executed among the ones registered (people_select in our case). If it finds it, NuSOAP executes the corresponding function and renders its output using the appropriate XML code.
As was the case in the previous articles, NuSOAP serves the remarkable purpose of hiding essentially all of the XML parsing and generation from the application, allowing our code to focus exclusively on the functionality. On the other hand, as is often the case with external libraries, it introduces a certain amount of overhead that you must take into consideration.

Testing and Debugging the Applications
To test our applications, we have to write two SOAP clients.
As we have done in the past, we can use NuSOAP again for this very purpose, as you can see in Listing 3. Creating a SOAP client using NuSOAP (as you may remember from the article about Amazon.com Web Services that appeared in the last issue of php|a) is very easy. Obviously, we first need to include nusoap.php, then create an instance ($ins) of the soapclient class, passing our web service's URI as an argument. The associative array $param contains the parameters to be passed to the remote procedure. Finally, we invoke the $ins->call() method, which requires two arguments:
• The remote method name: in our example, we used people_insert as a method name, but any string would work, because we didn't really worry about the method name in insert.php. In fact, the server ignores this value altogether, because our server script provides only one method.
• The associative array containing the parameters.
The results of the RPC are stored in $results, which is an associative array. Since this is nothing more than a test script, we get away with just outputting its contents by calling print_r(). In the second part of Listing 3, these operations are repeated for the second SOAP server script (select.php), using the instance $sel of the soapclient class.
We can now test our applications: if a correct call to insert.php is performed, the output will be like this:
Array(
    [status] => OK
    [time] => 1079265466
)
If an error occurs, for example if the login information is not correct, the response will be a fault code containing these values:
Array(
    [faultcode] => Client
    [faultactor] => Login incorrect
    [faultstring] => Bad value of params 'user' or 'pwd'
    [detail] => Array(
        [soapVal] =>
    )
)
This is the SOAP fault message content. As we have seen, faultcode can be set either to Client or Server, depending on the error type; faultactor indicates what caused the error, while faultstring contains a description of the error. As you can see, there is a fourth key, detail, which may contain further information about the error, but we haven't used it in our fault message and, therefore, it is empty.
Next, we'll test select.php. The server's response to a well-formed request will be like this one:
Array(
    [name] => Robert White
    [age] => 28
    [city] => Boston
    [time] => 1079271398
)
It contains the requested record and the Unix timestamp. Now, a quick test for fault conditions. For example, the following values are the ones contained in a fault document caused by a MySQL error; in particular, this will be the response caused by an error on mysql_select_db() (Line 20 of Listing 2):
Array(
    [faultcode] => Server
    [faultactor] => MySQL
    [faultstring] => Unknown database 'xyz'
    [detail] => Array(
        [soapVal] =>
    )
)
This is a "Server" fault, caused by MySQL. In order to obtain this error, I had to change the argument of mysql_select_db() to a non-existent database (database xyz), just so that I could simulate a proper failure. As in the previous fault example, there are no details specified.
This is a simple way to make sure that our servers work as we intend them to. If you find the output that print_r() generates in the case of a fault a little too confusing to be helpful, you may want to intercept errors and print out something more comprehensible. For example, we can add these lines to the client application (Listing 3) right before the call to print_r():
if ($ins->fault)
    die("FAULT:Code: {$ins->faultcode}"
        . "Actor: {$ins->faultactor}"
        . "String: {$ins->faultstring}");
This will stop the execution of the application and print a fault report.

To NuSOAP or not to NuSOAP?
In this article, we have seen two different ways of creating a Web Service. Using a library like NuSOAP brings, in my opinion, plenty of benefits: first, we don't have to write a class to parse the XML or to report a fault; we can simply use the ones included in the library.
The real advantage, however, becomes evident when we have to write a complex Web Service: imagine if you had to join the "insert" and the "select" applications in the same file. Doing this manually may be tricky (especially if you have to join more than two applications). You would have to write your own parser, identify the structure of the SOAP payload, and identify the name of the root structure of the request (the structure contained in the body of the SOAP message which contains all the parameters), which, in turn, will contain the method name. Once you have obtained this last datum, you'd have to use an if-else control block (or a switch statement, if many methods are provided by your web service) to recognize the method required. Of course, you can do it. It would take you a long while, and the chances of introducing bugs would be many.
Using NuSOAP, we simply had to register each method we want to make remotely accessible and declare a function with its name. When the library parses the incoming request, the right function is automatically called (NuSOAP uses eval() for this purpose).
Therefore, in my opinion, using NuSOAP or an equivalent library can help you write Web Services by allowing you to skip the complex and boring part of coding your own SOAP engine and concentrate on what really counts: the functionality.

At Last...
By now, you should have a pretty good understanding of how you can write a Web Service, although you may still be wondering whether it makes any sense to. If you are, I think that's a good thing. The purpose of any Web Service is to share data and services on the Net, and that means that you should only really go through the trouble (and overhead) of introducing a Web Service if you have no other alternative. Typically, you want to use SOAP if there are external applications that do not know anything about the internals of your system and to which you want to provide access to your services under tightly controlled conditions, since a Web Service allows you to perform all sorts of validations and authentication processes.
This article, together with the articles "Exploring the Google API with SOAP" and "Connecting to Amazon.com Web Services with NuSOAP", which were published in the January and March 2004 issues of php|a respectively, should have given you a good walkthrough of both SOAP clients and servers. If you want to learn more about the SOAP syntax, you can also check out the World Wide Web Consortium's notes about SOAP at www.w3.org/TR/SOAP.
About the Author
Alessandro Sfondrini is a young Italian PHP programmer from Como. He has already written some on-line PHP tutorials and published scripts on the most important Italian web portals. You can contact him at [email protected].
To Discuss this article: http://forums.phparch.com/141
S E C U R I T Y
C O R N E R
Security Corner: SQL Injection
by Chris Shiflett
Welcome to another edition of Security Corner. This month's topic is SQL injection, a style of attack that frequents the minds of PHP developers, but for which there is a shortage of good documentation. Most Web applications interact with a database, and the data stored therein frequently originates from users. Thus, when creating an SQL statement, a developer may use client data in its construction. A typical SQL injection attack exploits this scenario by attempting to send valid SQL as unexpected values of GET and POST data. This is why an SQL injection vulnerability is almost always the fault of poor data filtering, and this fact cannot be stressed enough. This article explains SQL injection by looking at a few example attacks and then introducing some simple and effective methods for prevention. By applying these best practices, you can practically eliminate SQL injection from your list of security concerns.

For a moment, place yourself in the role of an attacker. Your goal is initially simple: to get any unexpected SQL statement executed by the database. You're only looking to get something to work, because that will reveal the fact that the application either completely fails to filter data or that there are weaknesses in the data filtering logic. You have as many chances as you want, and you have a lot of information to work with. For example, consider the simple registration form shown in Figure 1. In order to get more information about this form, you view the source:
Username:
Email:
You can already make a very educated guess about the type of SQL statement that this application will construct. It will most likely be an INSERT statement that uses $_POST['reg_username'] and $_POST['reg_email']. You can also make a guess about the naming convention used in the database, because it possibly matches the names used in the HTML form. Because this form is for registration, there is also likely to be a password generated and included in the query. From all of this, you guess that the following construction is performed:
$sql = "INSERT INTO users (reg_username, reg_password, reg_email)
        VALUES ('{$_POST['reg_username']}', '$reg_password', '{$_POST['reg_email']}')";
Assuming this guess is correct, what can you do to manipulate this query? Imagine sending the following username:
bad_guy', 'mypass', ''), ('good_guy
If [email protected] is given for reg_email, and the password generated by the application is 12345, then the SQL statement becomes the following:
INSERT INTO users (reg_username, reg_password, reg_email)
VALUES ('bad_guy', 'mypass', ''), ('good_guy', '12345', '[email protected]')
This statement creates two accounts: good_guy with a valid email address and bad_guy with no email address. Because reg_email is valid, if the application emails the password for the good_guy account, it will arrive safely. You already know the password for the bad_guy account, because you set it yourself. Thus, by sending a specially crafted username, you have created two accounts that you can perhaps use for further malicious activity. You can use the good_guy account to investigate the application and learn how it works (a valid account might be required to access certain parts). With the bad_guy account (which is also a valid account), you can launch additional attacks with your heightened privilege without the risk of losing your real account if something goes wrong (the bad_guy account is disposable). More importantly, if this is successful (no error is given by the application, and you can log in as bad_guy), it sufficiently proves that there is very poor data filtering, if any at all.
You may be wanting more examples of SQL injection attacks, so I will demonstrate another style. Keep in mind that creativity plays a large role, as is the case for most styles of attack. In the example just explained, the attack is limited by the type of query (INSERT) and the placement of the client data. Other types of queries present new opportunities, and the best practices mentioned in this article prevent practically all SQL injection attacks.

WHERE Hacking
The WHERE clause is used to restrict the records that a particular query matches. For a SELECT statement, it determines the records that are returned. For an UPDATE statement, it determines the records that are altered. For a DELETE statement, it determines the records that are deleted. If a user can manipulate the WHERE clause, there are a lot of opportunities to make drastic changes: selecting, updating, and deleting arbitrary records in the database. Imagine a SELECT statement intended to fetch all credit card numbers for the current user:
$sql = "SELECT card_num, card_name, card_expiry
        FROM credit_cards
        WHERE username = '{$_GET['username']}'";
In this particular case, the application might not even be soliciting the username, but rather providing it in a link: View Credit Card(s) for Your Account
Because a user can have multiple cards, the application loops through the results, displaying the card number, the name on the card, and the card's expiration date for each one. Imagine a user who visits the following URL:
/account.php?username=shiflett%27+OR+username+%3D+%27lerdorf
This submits the following value for the username:
shiflett' OR username = 'lerdorf
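The decoding step can be checked with a short sketch. It uses parse_str(), which decodes a query string in the same way PHP populates $_GET, and then interpolates the value without any filtering; the variable names are only illustrative.

```php
<?php
// The query string from the URL above, as the server would receive it.
$queryString = 'username=shiflett%27+OR+username+%3D+%27lerdorf';

// parse_str() decodes %27 to a single quote and + to a space.
parse_str($queryString, $get);

// Unfiltered interpolation: the quotes in the value close and reopen
// the string literal inside the SQL statement.
$sql = "SELECT card_num, card_name, card_expiry FROM credit_cards "
     . "WHERE username = '{$get['username']}'";

echo $get['username'], "\n";
echo $sql, "\n";
```

The point of the sketch is that nothing in the decoding or interpolation treats the quote characters specially, so the client decides where the string literal ends.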
If used in the previous SQL statement, the following is the result:
SELECT card_num, card_name, card_expiry
FROM credit_cards
WHERE username = 'shiflett' OR username = 'lerdorf'
Now the user sees a list of all credit cards belonging to either shiflett or lerdorf. This is a pretty major security vulnerability. Of course, a larger vulnerability exists in this particular example, because a user can arbitrarily pass any username on the URL. In addition, a username that causes the WHERE clause to match all records can be used:
shiflett' or username = username
Imagine if this particular username were actually stored in the database (using a previous SQL injection attack) and used as the attacker's username by the application. Every WHERE clause that is meant to restrict a query to the user's own record could then match additional records (or all records). This is not only extremely dangerous, but it also makes further attacks very convenient.

Data Filtering
Note: The methods to be described assume that magic_quotes is disabled. If magic_quotes is enabled, you can use the fix_magic_quotes() function that is listed at: http://phundamentals.nyphp.org/storingretrieving
There are best practices that you should follow to prevent SQL injection attacks, and these offer a very high level of protection. The most important step is to filter all data that comes from the client. This includes $_GET, $_POST, $_COOKIE, and $_FILES. To help clarify this, consider the following HTML form: red green blue
Clearly, the expected values are red, green, and blue. So, the data filtering should verify this:
$clean = array();
$valid_colors = array('red', 'green', 'blue');
if (!in_array($_POST['color'], $valid_colors)) {
    /* Display error */
} else {
    $clean['color'] = $_POST['color'];
}
This code uses a separate array ($clean) to store the filtered data. It is a good idea to choose a naming convention that will help you identify potentially tainted data. In this example, $clean['color'], if it exists, can be trusted to contain a valid color, because $clean is first initialized and then only set to $_POST['color'] if that value passes the validation. Another option for a set of expected values is to use a switch statement:
$clean = array();
switch ($_POST['color']) {
    case 'red':
    case 'green':
    case 'blue':
        $clean['color'] = $_POST['color'];
        break;
    default:
        /* Display error */
        break;
}
For numeric data, the is_numeric() function is a good choice. Filtering other types of data can be more difficult, but regular expressions can be very helpful. For example, my favorite validation logic for an email address is as follows:
$email = '[email protected]';
$clean = array();
$email_pattern = '/^[^@\s]+@([-a-z0-9]+\.)+[a-z]{2,}$/i';
if (!preg_match($email_pattern, $email)) {
    /* Display error */
} else {
    $clean['email'] = $email;
}
The two most important points for data filtering are:
1. Only accept valid data rather than trying to prevent invalid data.
2. Choose a naming convention that helps you distinguish tainted data from filtered data.
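Both points can be combined in one small routine, sketched below. This is only an illustration: the function name filterInput() and the id field are hypothetical, and ctype_digit() is used as a stricter alternative to is_numeric() for values that must be whole numbers.

```php
<?php
// Hypothetical helper: returns array($clean, $errors) for a request that
// carries an integer id, a color and an email address.
function filterInput(array $input)
{
    $clean  = array();   // only validated values ever land here
    $errors = array();   // names of the fields that failed validation

    // Integer id: digits only (stricter than is_numeric()).
    if (isset($input['id']) && is_string($input['id']) && ctype_digit($input['id'])) {
        $clean['id'] = (int) $input['id'];
    } else {
        $errors[] = 'id';
    }

    // Color: must be one of the expected values.
    $valid_colors = array('red', 'green', 'blue');
    if (isset($input['color']) && in_array($input['color'], $valid_colors, true)) {
        $clean['color'] = $input['color'];
    } else {
        $errors[] = 'color';
    }

    // Email: the same kind of pattern as in the example above.
    $email_pattern = '/^[^@\s]+@([-a-z0-9]+\.)+[a-z]{2,}$/i';
    if (isset($input['email']) && is_string($input['email'])
        && preg_match($email_pattern, $input['email'])) {
        $clean['email'] = $input['email'];
    } else {
        $errors[] = 'email';
    }

    return array($clean, $errors);
}

// Example call with made-up values; a real script would pass $_POST.
list($clean, $errors) = filterInput(array(
    'id'    => '42',
    'color' => 'red',
    'email' => 'user@example.org',
));
```

Keeping the validated values in $clean and never reading the raw input again afterwards is what makes the naming convention useful in practice.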
Escaping Data
With properly filtered data, you're already pretty well protected against malicious attacks. The only remaining step is to escape it such that the format of the data doesn't accidentally interfere with the format of the SQL statement. If you are using MySQL, this simply requires you to pass all user input through mysql_escape_string() prior to use:
$clean['color'] = mysql_escape_string($clean['color']);
$sql = "... {$clean['color']} ...";
In this case, assuming $clean['color'] comes from the previous example, we can be sure that the data only contains alphabetic characters. However, it is a good habit to always escape data. This practice will help you avoid forgetting this crucial step.

Until Next Time...
Preventing SQL injection is easy, but it is one of the most common PHP application vulnerabilities. Hopefully you will now always perform the following two steps:
1. Filter data from the client
2. Escape data used in SQL
Of course, you should always filter client data, so the only new step is to escape data before you use it in an SQL statement. If you use MySQL, this only requires a function call to mysql_escape_string(). There is a helpful resource located at phundamentals.nyphp.org/storingretrieving that explains this second step in much more detail (focus on the section regarding data storage). I hope that you are now protected against SQL injection attacks and can prevent such vulnerabilities in your applications. Until next month, be safe.
Figure 1
About the Author
Chris Shiflett is a frequent contributor to the PHP community and one of the leading security experts in the field. His solutions to security problems are often used as points of reference, and these solutions are showcased in his talks at conferences such as ApacheCon and the O'Reilly Open Source Convention, his answers to questions on mailing lists such as PHP-General and NYPHP-Talk, and his articles in publications such as PHP Magazine and php|architect. Security Corner, his new monthly column for php|architect, is the industry's first and foremost PHP security column. Chris is the author of the HTTP Developer's Handbook (Sams Publishing) and is currently writing PHP Security (O'Reilly and Associates). In order to help bolster the strength of the PHP community, he is also leading an effort to create a PHP community site at PHPCommunity.org. You can contact him at [email protected] or visit his Web site at http://shiflett.org/.
T I P S
&
T R I C K S
Tips & Tricks By John W. Holmes
Eleven Tips to a Successful Technical Presentation

This month’s column is going to be a little different from what I’ve done in the past. I know this will be published well after php|cruise has taken place, but this tip concerns something I, personally, got out of the cruise (one of many things, actually).

Giving a presentation can be difficult. Some people have a natural knack for it and can make any topic interesting and keep the audience involved and, more importantly, awake. For the rest of us, though, talking in front of a crowd can be a really painful experience. Giving a technical presentation, such as something about PHP features, can be even more difficult, as you now have more pieces to work with. You have to worry about how and where to show your code and working examples of it, how you are going to talk and type at the same time, etc.

So, I just wanted to give a few presentation tips to those of you (hopefully) preparing to speak at the next PHP conference. php|cruise #2 to Alaska will be here before you know it, and you can never start to prepare too early.

Lastly, this certainly isn’t a jab at any php|cruise presenters, since I’m sure you’re out there reading this. Everyone enjoyed all of the presentations on the cruise, but it never hurts to go over things like this again.

1) Rehearse: This is a must. You have to run through your presentation a couple of times, preferably in front of other people so they can provide feedback. Some people are really good at impromptu speaking and can wing it, but it still doesn’t hurt to run through it just to see how long it takes. You want to ensure you’re speaking at a moderate pace and not talking too fast. You don’t want to run out of material halfway through your presentation time. Running over your time limit can be just as bad. When you run over, it delays everything after you. If you’re the one planning a conference, you can’t necessarily plan for extra time between presentations, either. If you think you may run over on time, let the planner know. Maybe you can go last so you don’t mess anyone else up?

2) Dry Run: This is basically the last rehearsal, but the key is that it should be done where you’re actually going to give the presentation. Yes, I know this isn’t always possible, but you should try to make it happen. If you didn’t bring your own equipment, this will give you a chance to ensure that what you have will work on the equipment provided. Obviously, it’s better to identify problems as soon as possible. What looks good on your CRT monitor doesn’t always look good from a projector (or whatever display system they have). Do you have all of the cables you need to show your presentation? Do the colors show up (especially yellows and oranges)? Can you read the fonts from the front and back of the room? Do you need a microphone? If you have a quiet voice, then one may be required even though you may be in a small conference area.

3) Typing Code: You have to be very careful any time
you find yourself typing during a presentation. Your eyes are now on the keyboard or screen instead of the audience. You’re making the audience wait on you as you show them how to do something that they already know how to do. You’re bound to make errors as you type, also, which results in even more time spent typing and not concentrating on the audience. If you need to show code during your presentation, either work it into your presentation slides, or have it already typed and just load the files into a text editor when it’s time to show them. The flow of your presentation will not be interrupted except for the time it takes for the file to load.

4) Text Editors: If you’re going to use one, make sure you can adjust the text size. What looks good and easy to read on your monitor may not look that way during the actual presentation. Also, take note of how long your lines and entire scripts are. Scrolling left and right or many pages down can be a pain during the middle of a presentation. If possible, have your code already opened in an editor and learn or make shortcuts to get back and forth between it and your presentation. This way your audience is not waiting on you.

5) Have an Assistant: This kind of goes along with all of the above suggestions; try to have an assistant that can help you give your presentation. I realize this isn’t always possible, but there can be a lot of benefits if you make this happen. The simplest benefit is that you can have your assistant flipping slides and scrolling at the right times without you having to say “next slide” or anything else. This keeps your eyes and attention on the audience instead of the computer.

If you really need to type (see tips 3 and 4 again), like filling out a form during a live demo, then have your assistant do it. You can continue explaining concepts or what the results of the changes are going to be while your assistant types.

You can even make your assistant the bad guy that keeps you on track and on time and cuts off questions when necessary, but that’s up to you.

6) Questions: Speaking of questions, try to pause between each slide and at least look up to see if anyone has any questions. Too many people get their nose so caught up in the computer or in their slides that they miss people who want to ask a question. Realize that a large number of questions may push you over your time limit, also. Don’t be afraid to cut questions off when your time is up or when you really need to get to your next slide. If it’s important, they’ll find you after the presentation or ask when there is more time. Trying to “hold all questions” until the end isn’t really going to work, either, especially if the audience is small. It’s better to go over the questions while the pertinent slide is being displayed rather than having to flip back and forth to find the relevant slide. Also, learning what’s interesting or confusing to the audience through their questions can allow you to adapt your topics and maybe speed up or slow down for certain topics.

7) Time: Time is your enemy, well, one of them, at least. This is why you rehearse and try to stay on schedule, even with questions. If you realize that you have more info to present than planned, keep in touch with whoever is planning the conference and try to request more time. You may or may not get it. Maybe you can be moved to last in the daily schedule, so that, if you run over, you’re not affecting presenters after you. Here again is where an assistant comes in handy, as they can keep an eye on the clock and attempt to keep you on schedule.

8) Graphics and Transitions: I’m going to backpedal on my stance for this one a little bit. I was originally going to say that I didn’t see a need for pretty graphics and transitions if they didn’t actually pertain to the topic. However, after seeing a few examples of how random graphics or transitions can keep the audience interested and awaiting the next slides, I can see where they will be useful. Hopefully, you’ll be an interesting enough speaker or have an interesting enough topic to keep the audience’s attention. If not, however, I guess it can’t hurt to throw in the random picture of your kid or cat.

The only thing to keep in mind is that the additional graphics are going to increase the size of the file overall. This may be an issue if you make your slides available for download. Consider having a graphic-free version available for download that people on low-speed connections can acquire. This leads us right into the next tip.

9) Make Slides Available: This one should be a no-brainer, but make sure people can get to your slides after the presentation, especially if they contain code. Another thing to realize, also, is that not everyone runs whatever program you are running to make and display your slides. You should make an attempt to make them available in a basic format like HTML or PDF. Remember that, at this point, you’re just putting the info out there, you’re not presenting it. It doesn’t have to look the same; just show the information so people can read it. If you’re the one organizing a conference, try to have everything available at one location so people don’t have to look around for it. Put an easy-to-remember link in your presentation that people can write down quickly, if possible.

10) Uhmmmm: Try to avoid excessive “uhmms” and “ahhs” and other noise words when you’re in front of an audience. Sometimes this is really hard for people, especially if you’re nervous, so rehearsing and experience are going to help. You may not even realize you’re doing this until you do a rehearsal in front of someone.

11) Font Size: This tip is added for Chris Shiflett, who pointed it out a couple of times to me: make your font size huge. This includes your slides, obviously, but also text editors, command lines, browsers, etc. Do this before your conference, also. You may not plan on opening up a text editor or command line, but the questions from the audience may force you to. Be prepared and adjust the font sizes beforehand.

These tips certainly aren’t all inclusive and are mostly
from the perspective of an audience member. Feel free to join the Tips & Tricks forum at http://www.phparch.com/discuss to add your own recommendations or offer comments about these eleven suggestions. This isn’t a wholly original topic, either, as many other people have discussed what it takes to give a good presentation. Chris pointed out a link to me for a great presentation by M. J. Dominus on “Conference Presentation Judo” at: http://perl.plover.com/yak/presentation/
The presentation shows how you can balance content, humor and graphics in order to give an awesome presentation. The included notes give even more suggestions about how to cover each slide the “correct way” and should be required reading. Enjoy the show.
php|a
Want to Share? Come to Canada!
exit(0);
By Marco Tabini
For all their geographical proximity, Canada and the United States are, and remain, two very different countries. Ask most Canadians, and they’ll tell you that, although they consider Americans their brothers, they’re happy to have their own place to call home, and proud of their societal customs and traditions. Ask most Americans (or, for that matter, most people outside Canada) and they’ll likely tell you that Canada is that perennially cold place up there where it snows all the time and people go around on dog sleighs and end all their sentences with “eh?”.

I will have you know, my dear readers, that nothing could be further from the truth. We use dog sleighs for no more than three or four months out of every year (at least from what I can tell when I get on the highway in the morning here in Toronto) and, as I write this column (on April 5th, 2004; there, so much for not dating your material), there is barely a foot of snow on the ground, a bright sun shines outside and we’re well above ten degrees below zero, eh?
But I digress; let me go back to the real reason why I am here. As you probably know, given that you follow this column religiously from month to month, a few issues back I reported that the Recording Industry Association of America (which we’ll refer to as RIAA from now on, lest we run out of paper over the next three sentences) sued several hundred people in the US for sharing music files over the Internet. Here in Canada, the RIAA’s cousin, called the Canadian Recording Industry Association, or CRIA for those of us who like to save trees, took the first steps to do the same to a couple dozen Canadians last month, and was run through the shredder by a federal judge.

The CRIA was not actually suing anybody for sharing digital music; they were seeking to force several large ISPs, among them the biggest in Canada, to give up personal information about people who, according to them, had illegally shared music over the Internet. Justice Konrad von Finckenstein, who, despite his name, is a thoroughly Canadian figure, ruled against the request, stating in his findings that the CRIA did not present even one shred of evidence that could be considered usable in a court of law.

Justice von Finckenstein must have been on a roll, because he did not stop there, ruling that sharing files per se does not violate Canadian copyright laws. A computer that is sharing files is, in his view, akin to a photocopier left at a public library: the library itself is not responsible for people violating copyright laws by photocopying other people’s work.

On the surface, this is very good news for the file sharers. Personally, and for what it’s worth, I think that Justice von Finckenstein rendered an excellent ruling that finally sets the recording industry, with all the arrogance that American copyright laws have afforded it, back in its place. The methodologies that the RIAA and CRIA have used to identify file sharers are deeply flawed and have produced more than one false positive in the past. Naturally, the RIAA doesn’t care, because it has the money to pursue as many lawsuits as it sees fit, and to bully the average Joe into any kind of settlement it wants.

Being on the receiving end of a lawsuit is a nasty business, the kind of business that costs a lot of money just to defend yourself, regardless of whether you’re completely innocent or guilty as charged (although those terms are normally applied to criminal cases and file-sharing is normally a matter for the civil courts). As a result, a certain number of fail-safe measures have to be built into the system. Otherwise, a party with bottomless pockets will invariably overcome a
weaker party by the simple exercise of what lawyers refer to as “papering to death”, and what I refer to as “ruining your weekends” (it is, apparently, a longstanding practice that any legal paperwork must be served on an opposing party on a Friday afternoon, just so that you can ruin their weekends). If these fail-safe measures were not in place, “justice for all” would become “justice for the highest bidder”.

To be sure, the judicial system is, even in the eyes of this layman, not perfect. A financially stronger party has the upper hand in that it can always force you into litigation, no matter how ridiculously frivolous its claim is, and “bleed” you out of whatever money you have before any useful conclusion can be reached in a court of law by simply forcing your lawyer to spend time considering frivolous matters.

In the case of the RIAA, the American government has, under the guise of the Digital Millennium Copyright Act (DMCA), relaxed some of the fail-safe measures designed to protect the privacy of individuals in favour of the copyright holders’ right to protect their work. In Canada, however, there is no DMCA, and our privacy laws are extremely strong. This, coupled with what seems to have been a lack of good, old-fashioned research and several strategic faux pas on the CRIA’s part, has resulted in the recent ruling.

As much as I am grateful for Justice von Finckenstein’s ruling, I have my reservations. My worry is that people will take it as a free pass for file sharing, which is not, in my opinion, what it means. As I have said, file sharing is wrong, but, on a large enough scale, it is also a form of expression for people who want to protest the music industry’s market strategies. The RIAA/CRIA have tried to quash that form of expression by brute force, simply because it is less expensive to do so than to give up their business model of selling you a CD for more than a tape cassette, even though the former actually costs less to manufacture.

We are, after all, talking about the same industry group that gets a levy from each blank CD-R and cassette tape sold here in Canada as “compensation” for intellectual property piracy, regardless of whether you use that CD-R to store your personal documents, a family photo album or (God forbid) a backup copy of some software program you own.

On the other hand, by opening the issue up so dramatically, Justice von Finckenstein’s ruling may finally force the CRIA to look at the problem from a different angle and start implementing fair business practices. Sadly, they are now working on their appeal... so we’ll have to put that thought on hold for a little while longer.

php|a