Jan's Guide to HTML

CGI & Perl

Contents: The Basics - CGI & SSI - Who Is Using My Script? - Handling Forms - Getting the User's Input - Verify the Input - Sending Email - Writing To A File - What Can Go Wrong?

Chapters

Table of Contents
Introduction
Basic HTML
Text
Images
Lists
Anchors
Tables
Frames
Forms
Javascript
SSI
CGI & Perl
Style Sheets
Special characters
Colour examples
HTML resources
Software
Recent Updates
Index

The Basics

This chapter gives some information about making CGI scripts using Perl. You should check first if you are allowed to make your own CGI scripts. Not all web hosting companies let you do that. You should also check if you have Perl on your server. For Unix machines this is pretty much standard but if your server is on a Windows machine, Perl may not be available.

This page is not an extensive guide to Perl, CGI or Unix. It discusses one concrete example and that should be enough to get you going. Perl is easy to learn and fairly tolerant towards the programmer. You will find links to helpful Perl sites on the resources page.

CGI & SSI

Basically, there are two ways you can call a CGI script. One is through a server side include and the other is by using a form. Calling a script from a server side include is simple:

<!--#exec cgi="ssi.demo.cgi"-->

"If people wish to eat meat and run the risk of dying a horrible, lingering, hormone-induced death after sprouting extra breasts and large amounts of hair it is, of course, entirely up to them."
Tony Banks.

The CGI script (which is in Perl 5) opens a file with a bunch of quotes, selects one at random and prints it to the page. You can let such a script do pretty much anything, including printing certain text or graphics depending, for example, on the user's IP address or the browser used by your readers. Remember that, on most servers, server side includes only work if your file has a .shtml extension rather than .html.
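A quote script of this kind could be sketched roughly as follows. This is not the actual ssi.demo.cgi; the file name quotes.txt (one quote per line) and the helper pick_quote are assumptions made for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical file name: a plain text file with one quote per line.
my $quote_file = 'quotes.txt';

# Return one randomly chosen element from the list of quotes,
# or an empty string if the list is empty.
sub pick_quote {
    my @quotes = @_;
    return '' unless @quotes;
    return $quotes[ int rand @quotes ];
}

# A CGI script always starts its output with a content type header.
print "Content-type: text/html\n\n";

if ( open( my $fh, '<', $quote_file ) ) {
    chomp( my @quotes = <$fh> );
    close($fh);
    print pick_quote(@quotes), "\n";
}
else {
    # Fail quietly so the page that includes the script still renders.
    print "(no quote available)\n";
}
```

The random choice is kept in its own function so it can be tried out at home without a web server.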

Handling Forms

However, the main reason why you will want to learn CGI scripting is to handle the forms on your web site. The following sections of this page explain how to collect the data from forms, minimum security measures you should take and how to send the input on to yourself by email or to write it to a file. Unfortunately, it is much harder here to show examples, so you should eventually create a script for your own server and start experimenting with that.

This chapter is based on the link suggestion form and the related script. You can see them in action on my wicked web pages.

<FORM METHOD="POST" ACTION="/cgibin/linksuggest.cgi">

<P>Category (select one)
<SELECT NAME="category">
<OPTION SELECTED>Airlines
<OPTION>Beer
<OPTION>Internet
<OPTION>Jan's Guide to HTML
<OPTION>London Links
<OPTION>Macintosh
<OPTION>Macintosh Dealers
<OPTION>Macintosh Software
<OPTION>News (Dutch)
<OPTION>News (English)
<OPTION>News (German)
<OPTION>News (other)
<OPTION>Politics
<OPTION>Search Engines
<OPTION>Travel
<OPTION>other...
</SELECT>
</P>

<P>Location of site (http://www.somewhere/etc/)<BR>
<INPUT NAME="location" TYPE="text" SIZE="50" MAXLENGTH="70" VALUE="http://"></P>

<P>Description<BR>
<INPUT NAME="description" TYPE="text" SIZE="50" MAXLENGTH="70"></P>

<P>Your name (optional):<BR>
<INPUT NAME="yourname" TYPE="text" SIZE="50" MAXLENGTH="70"></P>

<P><INPUT TYPE="submit" VALUE="Submit"> <INPUT TYPE="Reset" VALUE="Clear form"></P>

</FORM>

The link in the ACTION attribute may need to be different in your case. You should check the documentation of your web server or ask your support person. On some servers, the filename may need to be linksuggest.pl rather than linksuggest.cgi.

There is more information about how to design forms elsewhere in this guide.

Who Is Using My Script?

Your first security measure is to check who is using the script. Since this script is called from one web page only, the environment variable $ENV{'HTTP_REFERER'} should hold the address of that page (a number of environment variables are available whenever a script is called; HTTP_REFERER is one of them). The script checks whether it is the proper page (http://www.weijers.net/great/suggest.shtml or possibly http://weijers.net/great/suggest.shtml). If it isn't, the script reports an error and exits:

# check that the referer is our own form page; refuse anything else
if ( !( $ENV{'HTTP_REFERER'} =~ /^http:\/\/(www\.)?weijers\.net\/great\/suggest\.shtml/i ) ) {
    print "Content-type: text/html\n\n";
    print "<HTML><HEAD><TITLE>Forbidden!</TITLE></HEAD><BODY>\n";
    print "<H1>Forbidden</H1>\n";
    print "<P>$ENV{'HTTP_REFERER'} is not allowed to use this script.</P></BODY></HTML>\n";
    exit;
}

If the script was called directly, $ENV{'HTTP_REFERER'} will usually be empty. If it is called from another page (which a malicious hacker might have set up on some other site) it will contain the address of that page. In principle you could also do something with HTTP_REFERER if it is not your site, for example email it to yourself and then try to find out who is abusing your script, but I have chosen just to ignore it. If you make a script that is called from more than one page on your site, take care to make the test flexible enough that all those pages can use it.
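A test that allows several pages could look something like the sketch below. The second page name (feedback.shtml) and the function name referer_ok are made up for the example; only suggest.shtml comes from the script described here.

```perl
use strict;
use warnings;

# Hypothetical list of pages on the site that may call the script.
my @allowed_pages = ( '/great/suggest.shtml', '/great/feedback.shtml' );

# Return 1 if the referer is one of our own pages (with or without "www."),
# and 0 otherwise, including when there is no referer at all.
sub referer_ok {
    my ($referer) = @_;
    return 0 unless defined $referer;
    foreach my $page (@allowed_pages) {
        # \Q...\E makes sure dots and slashes in the page name are taken literally.
        return 1 if $referer =~ m{^http://(www\.)?weijers\.net\Q$page\E\z}i;
    }
    return 0;
}

print referer_ok('http://www.weijers.net/great/suggest.shtml')
    ? "allowed\n"
    : "forbidden\n";
```

In the script itself you would call referer_ok( $ENV{'HTTP_REFERER'} ) and print the error page when it returns 0.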

Getting the User's Input

If your form uses the POST method, like the example above, the data is read from STDIN (a Unix term for standard input). The first thing to do is check for suspiciously long input. A hacker might try to break into your server by feeding your script overly long content. In this example the script rejects anything longer than 500 bytes; you may need to set a higher value if you have a long, complicated form. The length of the input is stored in the environment variable $ENV{'CONTENT_LENGTH'}. If the content is too long the script prints a message to that effect and then exits. Please note that using the MAXLENGTH attribute in your form is no guarantee that your script will only get short input. It is possible to create a form elsewhere without the MAXLENGTH attribute or to call the script directly. Checking the HTTP_REFERER makes that harder, but the referer is supplied by the client and can be forged, so better safe than sorry.

# Get the input and check for suspicious size
if ( $ENV{'CONTENT_LENGTH'} > 500 ) {
    print "Content-type: text/html\n\n";
    print "<HTML><HEAD><TITLE>Error!</TITLE></HEAD><BODY>\n";
    print "<H1>Error</H1>\n";
    print "<P>$ENV{'CONTENT_LENGTH'} bytes is way too much input for me.</P></BODY></HTML>\n";
    exit;
}
else {
    # read if everything seems okay
    read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
}

# Split the name-value pairs
@input_pairs = split(/&/, $buffer);

If the length of the input is acceptable, the read function is used to copy the input into the variable $buffer. The bits of input are tied together with the & sign and spaces have been replaced by + signs, for example "email=jan@weijers.net&realname=Jan+Weijers&location=http%3A%2F%2Fwww.weijers.net%2F". The split() function is used to cut $buffer into pieces, breaking $buffer up at every & sign, and load the resulting pieces in the @input_pairs list. To stay with this example, the first element of @input_pairs will now contain "email=jan@weijers.net" and the second one "realname=Jan+Weijers".

In the next section of the script, it goes through a loop for each element of the list @input_pairs. Each pair is split into a name and a value. For example, a name could be "email" and a value "jan@weijers.net". Next, the tr/// function is called to replace all plusses with spaces. This has to be done both for $value and $name. The tr/// function has this simple format: tr/[character to find]/[character to replace with]/. In order to understand the use of the tr/// and s/// functions you will need to learn something about pattern matching.

Some characters will be encoded as hexadecimal values when the input of your form is received, for example "/" will be encoded as "%2F". The s/// function is used to look for occurrences of that and then uses the pack() and hex() functions to translate them back to regular characters. This also has to be done for both $name and $value. The s/// function has a similar format to the tr/// function, s/[string to find]/[string to replace with]/, but works with strings rather than just one character.

The next step is to call the s/// function twice to remove any server side includes and all HTML from $name and $value. This is for security reasons as well as shortening the input. Then the s/// function is called three times: once to remove all whitespace (spaces, tabs, etc) from the beginning of $value, then to remove all whitespace from the end of $value and finally to remove double spaces and tabs from all of $value and replace them with single spaces.

foreach $pair (@input_pairs) {
    ($name, $value) = split(/=/, $pair);

    # translate plusses into spaces and decode %XX encoded characters
    $name =~ tr/+/ /;
    $name =~ s/%([A-F0-9]{2})/pack("C", hex($1))/ieg;
    $value =~ tr/+/ /;
    $value =~ s/%([A-F0-9]{2})/pack("C", hex($1))/ieg;

    # delete any html and server side includes for security
    $value =~ s/<(.|\n)*?>//g;
    $name =~ s/<(.|\n)*?>//g;

    # use substitute to clear whitespace at beginning and end (\s means any whitespace character)
    $value =~ s/^\s+//;
    $value =~ s/\s+$//;
    # and then also remove double spaces etc
    $value =~ s/\s+/ /g;

    # keep the result only if the name is one of the four form fields
    if ($name =~ /^(category|location|description|yourname)$/i) {
        # trim overly long values to 70 characters
        if ( length($value) > 70 ) {
            $value = substr($value, 0, 70);
        }
        $form_input{$name} = $value;
    }
}

The final step of loading the input is copying everything into a hash %form_input. When this is done, for example, $form_input{'category'} might have the value "Beer". You can see that the result is only used if $name matches one of the four names used in the form that refers to this script (category, location, description or yourname). This is an additional safety feature. If someone managed to break in this far, the culprit will still be limited in what input he can feed to the script.

In this example, I cut all cases of $value that are longer than 70 characters down to 70 (this corresponds to the MAXLENGTH attribute used in the form). Depending on your form, you can make this longer or shorter. Cutting excessively long input improves security and helps avoid errors. Now the data is ready for the script to get to work. Everything we have done until now is standard for any script that processes forms; from here on, what the script does depends on your own form.
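To see the decoding steps at work outside a web server, they can be applied to the sample input string from earlier in this section. The sketch below is a simplified variant, not the linksuggest.cgi code: the helper name decode_form is made up, and the filtering on field names and the trimming to 70 characters are left out.

```perl
use strict;
use warnings;

# Decode one URL-encoded form string into a hash of name => value.
# Hypothetical helper: it shows the decoding steps only.
sub decode_form {
    my ($buffer) = @_;
    my %fields;
    foreach my $pair ( split /&/, $buffer ) {
        next unless length $pair;    # skip empty pieces such as "a=1&&b=2"
        my ( $name, $value ) = split /=/, $pair, 2;
        $value = '' unless defined $value;
        foreach ( $name, $value ) {
            tr/+/ /;                                      # plusses become spaces
            s/%([A-Fa-f0-9]{2})/pack("C", hex($1))/eg;    # %2F becomes /
        }
        $fields{$name} = $value;
    }
    return %fields;
}

my %in = decode_form(
    'email=jan@weijers.net&realname=Jan+Weijers&location=http%3A%2F%2Fwww.weijers.net%2F'
);
print "$in{realname}\n";    # prints "Jan Weijers"
print "$in{location}\n";    # prints "http://www.weijers.net/"
```

Running a small test like this on your own computer is an easy way to check a pattern before uploading the real script.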

Verify the Input

Before the script mails the contents of the form (or writes them to a file) it makes sense to verify what the user typed. This is the great advantage of using a CGI script to process forms. It enables you to verify the input before it is sent to you and when the user is still around to make corrections.

The next bit of this example checks first if the user typed a description and a location. If there is a location, the script checks if it is a URL in the proper format (this is not a watertight check but it will catch the worst errors). If anything is wrong, an error report is added to the list @wrong by using the push function.

#description
if (!($form_input{'description'})) {
    push(@wrong,'the description of the site is missing');
}

#verify if they filled in their address and if it is valid
if (!($form_input{'location'})) {
    push(@wrong,'The location (url) is missing');
}
else {
    if (!($form_input{'location'} =~ /(https?\:\/\/)?(www\.)?.*\..*/)) {
        push(@wrong,'The location (url) is invalid or incomplete');
    }
}
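The pattern above accepts almost anything that contains a dot. If you want something stricter, a check along the following lines is one possibility. It is only a sketch: the function name looks_like_url is made up, and it assumes that you want to insist on http:// or https:// at the start, which the original script does not.

```perl
use strict;
use warnings;

# Return 1 if the string looks like an http or https URL with a dotted
# host name, an optional port and an optional path; otherwise return 0.
sub looks_like_url {
    my ($url) = @_;
    return 0 unless defined $url;
    return $url =~ m{^https?://[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+(:\d+)?(/\S*)?\z}i
        ? 1
        : 0;
}

foreach my $candidate ( 'http://www.weijers.net/', 'http://', 'hello.world' ) {
    print "$candidate: ", ( looks_like_url($candidate) ? 'ok' : 'rejected' ), "\n";
}
```

This is still not a complete URL validator, but it does reject the untouched default value "http://" as well as text that merely contains a dot.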

Of course, the information you need to verify will depend on your form. No two scripts will handle this the same way. You have to decide for yourself what the minimum of information is you need before you can accept the form. If there are errors (if there is anything in the list @wrong), the script prints those errors and then exits:

if (@wrong) {
    print "Content-type: text/html\n\n";
    print "<HTML><HEAD><TITLE>Sorry, there is a problem</TITLE>";
    print "</HEAD>\n <BODY BGCOLOR=white TEXT=midnightblue Link=midnightblue VLINK=midnightblue>";

    print "<H1>Sorry!</H1>";

    print "<P>Your form could not be accepted because:</P><UL>";

    foreach $wrong_thing (@wrong) {
        print "<li>$wrong_thing";
    }
    print "</UL><P>Use the Back Button on your browser to return to the ";
    print "<a href=\"http://www.weijers.net/great/suggest.shtml\">Wicked Web Form</a>";
    print " and try again.</P></BODY></HTML>";
    exit;
}

As you can see, the first thing that is always printed is "Content-type: text/html", which tells the browser that it is receiving HTML. For the rest, what is printed is standard HTML. The exit command ends the script and returns control to the user. You should take care with printing special characters like " and @ as these have special meanings in Perl. To strip them of their special meaning and print them, precede them with a backslash, like \" and \@.
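The effect of the backslashes is easy to try out. The small sketch below (the variable names are made up) builds the same piece of HTML twice: once in double quotes with backslashes, and once in single quotes, where Perl does not give " and @ any special meaning.

```perl
use strict;
use warnings;

# Inside double quotes, \" gives a literal quote and \@ a literal at sign.
my $link = "<a href=\"mailto:jan\@weijers.net\">jan\@weijers.net</a>";

# Inside single quotes nothing is interpolated, so no backslashes are needed.
my $same = '<a href="mailto:jan@weijers.net">jan@weijers.net</a>';

print "$link\n";
print $link eq $same ? "identical\n" : "different\n";    # prints "identical"
```

Single quotes are convenient for fixed text, but you need double quotes as soon as the text should contain a variable such as $wrong_thing.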

Sending Email

The first thing you want to do is print a message to the user. This is done along the same lines as the two error messages you have seen above. Instead of reporting an error you let the user know that the input was successful and what you will do with it.

Secondly, of course the input should reach you. One way to do that is to send it as email. Below is how the script linksuggest.cgi does that:

# Open The Mail Program

open(MAIL,"|/usr/lib/sendmail -t");

print MAIL "To: webmaster\@weijers.net\n";
print MAIL "From: $form_input{'yourname'}\n";
print MAIL "Subject: Wicked Web submission ($form_input{'category'})\n\n";

print MAIL "<A HREF=\"$form_input{'location'}\">";
print MAIL "<IMG SRC=\"arrow1.gif\" BORDER=\"0\" WIDTH=\"13\" HEIGHT=\"12\" ALT=\"*\" ALIGN=\"bottom\"> ";
print MAIL "$form_input{'description'}</A><BR>\n\n\n";

print MAIL "Submitted by: $form_input{'yourname'}\n\n";

print MAIL "IP Address: $ENV{'REMOTE_ADDR'} ($ENV{'CONTENT_LENGTH'} bytes input)\n\n\n";

close (MAIL);

The first thing this bit of the script does is open the email programme using a handle MAIL. You need to check the location of the email software on the server you are using. After that, it is simply a matter of printing to MAIL. Note again the use of \" and \@ to strip certain characters of their Perl special meaning and print them to mail as regular text. You should also note that \n is a newline. You need to print newlines here to format the email properly; the empty line after the Subject line separates the headers from the body of the message.

Another interesting thing to see here is that I format the user input in HTML. That means that when it arrives in my mailbox and I want to include this link on my site, I can just copy and paste it without doing any more work on it. This is another example of how handling forms through CGI scripts is superior to just mailing them.

At the end of the mail I include the IP address from where the form was submitted and how many bytes total input was given. The IP address might help trace someone who is trying to abuse your script although any hacker worth his salt will be able to conceal his real IP address.
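The script above does not check whether the mail programme could be opened. One way to add that check, and to test the message at home without sending anything, is to put the printing in a function that writes to whatever filehandle it is given. This is a sketch only: the function name write_mail and the plain text layout of the body are assumptions, and the sendmail path is the one used above, which may differ on your server.

```perl
use strict;
use warnings;

# Print the headers and body of the notification to any open filehandle.
# The field names are those of the form.
sub write_mail {
    my ( $fh, %input ) = @_;
    print {$fh} "To: webmaster\@weijers.net\n";
    print {$fh} "Subject: Wicked Web submission ($input{category})\n\n";
    print {$fh} "Site: $input{location}\n";
    print {$fh} "Description: $input{description}\n";
    print {$fh} "Submitted by: $input{yourname}\n";
}

# On the server you would open the mail programme and check that it worked:
# open( my $mail, '|-', '/usr/lib/sendmail', '-t' )
#     or die "Cannot start sendmail: $!";
# write_mail( $mail, %form_input );
# close($mail) or warn "sendmail reported a problem\n";

# Demonstration with an in-memory filehandle instead of a real mail programme.
my $message = '';
open( my $fh, '>', \$message ) or die "Cannot open in-memory handle: $!";
write_mail(
    $fh,
    category    => 'Beer',
    location    => 'http://www.example.com/',
    description => 'An example site',
    yourname    => 'Jan',
);
close($fh);
print $message;
```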

Writing To A File

Alternatively, you could write the input to a file on your web server. This means you could just download the file occasionally and then decide what to do with the input. For non-urgent matters, this may be preferable to email. Here is how you could do it:

if ( open(OUTFILE, '>>/some/path/linksuggest.txt') ) {
    print OUTFILE "Name: $form_input{'yourname'}\n";
    print OUTFILE "Site: $form_input{'location'}\n";
    print OUTFILE "Description: $form_input{'description'}\n";
    print OUTFILE "Category: $form_input{'category'}\n";
    print OUTFILE "IP Address: $ENV{'REMOTE_ADDR'}\n\n\n";
    close(OUTFILE);
}

You have to check which path to use on your particular server. The rest is simple. Open the file with file handle OUTFILE (or any other name you prefer). Then print the input of the user to that file and close it. In the example above you can see that the script only attempts to print if the file is opened successfully.
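If two people submit the form at the same moment, two copies of the script may try to write to the file at once. The sketch below shows one way to guard against that with Perl's flock function. It is a variation on the example above rather than part of linksuggest.cgi, and the function name append_record is made up.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);    # provides the LOCK_EX constant

# Append one record to the file at $path.
# Returns 1 if the record was written and 0 if anything went wrong.
sub append_record {
    my ( $path, %input ) = @_;
    open( my $out, '>>', $path ) or return 0;
    flock( $out, LOCK_EX ) or return 0;    # wait until no one else is writing
    print {$out} "Name: $input{yourname}\n";
    print {$out} "Site: $input{location}\n";
    print {$out} "Description: $input{description}\n";
    print {$out} "Category: $input{category}\n\n";
    return close($out) ? 1 : 0;            # closing the file releases the lock
}

# Usage (the path is a placeholder, as in the example above):
# append_record( '/some/path/linksuggest.txt', %form_input )
#     or warn "Could not save the suggestion\n";
```

Returning 0 instead of stopping the script leaves it up to you whether to show the user an error or to fall back on email.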

What Can Go Wrong?

You can see that this script has a number of security features. First of all, it only accepts input that appears to come from the proper form on my own site. Secondly, it rejects suspiciously long input. Thirdly, it strips HTML and server side includes from the input. Fourthly, it only accepts input that has the proper name. The fifth and last security item is that it trims all input to 70 characters, corresponding to the MAXLENGTH attributes in the form. If you take the same or similar steps in your script, you will have covered the most obvious risks, although no script is ever completely safe.

It can be notoriously difficult to find and fix errors in CGI scripts. Ask your web hosting company if you can get access to the error logs, which may help. Secondly, it is useful to get a copy of Perl for your own computer so you can test the script and bits of it at home where it is easier to see what goes wrong. See the software page for information on where to get Perl.

The final thing to take into account is that after uploading the script to your server you will need to change the permissions on it so that any user is allowed to run it. If you do not do this, the server will give an error any time someone submits a form and tries to use the script. The way to do this is to Telnet to the server and then type:

cd /directory/where/thescript/is
chmod 755 linksuggest.cgi

See the software page for information on where to get Telnet software.


Please send comments and suggestions to me at jan@weijers.net.
I will also try to answer any questions you may have.

©1995-2002 Jan C.H. Weijers. All rights reserved.

I'd be pleased if you would visit my home page.
