Secure programmer: Validating input
By David A. Wheeler - 2004-01-23

Strings

Again, with strings you need to identify what is legal and reject any other string. Often the easiest tool for specifying legal strings is the regular expression: write a pattern that describes which string values are legal, and throw away data that doesn't match the pattern. For example, ^[A-Za-z0-9]+$ specifies that the string must be at least one character long and may include only upper-case letters, lower-case letters, and the digits 0 through 9 (in any order). You can use regular expressions both to limit which characters are allowed and to be more specific (for example, you can often limit even further what the first character can be). Just about all languages have libraries that implement regular expressions; Perl builds them into the language itself, and for C, the functions regcomp(3) and regexec(3) are part of the POSIX.2 standard and are widely available.
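As a minimal sketch of this approach in Python (the article itself names only Perl and C), the whitelist check might look like the following. The function names and the stricter "must start with a letter, at most 32 characters" rule are assumptions made up for illustration; only the alphanumeric pattern comes from the text above.

```python
import re

# The pattern from the text: one or more ASCII letters or digits.
ALNUM_RE = re.compile(r"[A-Za-z0-9]+")

# Hypothetical stricter rule (an assumption for illustration): the first
# character must be a letter and the total length is 1 to 32 characters.
NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9]{0,31}")


def is_alphanumeric(value: str) -> bool:
    """Return True only if the entire string is letters and digits."""
    # fullmatch() succeeds only when the pattern covers the whole string,
    # which plays the role of the ^ and $ anchors.
    return ALNUM_RE.fullmatch(value) is not None


def is_legal_name(value: str) -> bool:
    """Return True only if the entire string satisfies the stricter rule."""
    return NAME_RE.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["alice42", "42alice", "alice; rm -rf /", ""]:
        print(repr(candidate), is_alphanumeric(candidate), is_legal_name(candidate))
```

Anything that fails the check is simply rejected; the sketch makes no attempt to repair bad input.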

If you use regular expressions, be sure to indicate that you want to match the beginning (usually symbolized by ^) and end (usually symbolized by $) of the data in your match. If you forget to include ^ or $, the pattern can match a legal substring anywhere in the data, so an attacker could include legal text inside their attack to bypass your check. If you're using Perl and you use its multi-line option (m), watch out: you must use \A for the beginning and \Z for the end instead, because the multi-line option changes ^ and $ to match at the start and end of every line. (In Perl, both $ and \Z also match just before a trailing newline; use \z if a trailing newline must be rejected too.)
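The effect of missing anchors and of multi-line mode can be seen in a short Python sketch. The helper name and the attack string are invented for illustration. Note one difference from Perl: in Python's re module, \Z matches only at the very end of the string (like Perl's \z), while $ still tolerates one trailing newline.

```python
import re


def matches(pattern: str, text: str, flags: int = 0) -> bool:
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text, flags) is not None


# Hypothetical hostile input: a legal-looking first line, then a payload.
attack = "abc123\n; rm -rf /"

# No anchors: the legal run "abc123" inside the attack satisfies the check.
no_anchors = matches(r"[A-Za-z0-9]+", attack)

# ^ and $ without multi-line mode: the whole string must be legal.
anchored = matches(r"^[A-Za-z0-9]+$", attack)

# ^ and $ in multi-line mode: they now match at every line boundary,
# so the first line alone is enough to pass.
multiline = matches(r"^[A-Za-z0-9]+$", attack, re.MULTILINE)

# \A and \Z keep their whole-string meaning even in multi-line mode.
absolute = matches(r"\A[A-Za-z0-9]+\Z", attack, re.MULTILINE)

# A trailing newline slips past $ but not past Python's \Z.
dollar_newline = matches(r"^[A-Za-z0-9]+$", "abc123\n")
z_newline = matches(r"\A[A-Za-z0-9]+\Z", "abc123\n")

if __name__ == "__main__":
    print(no_anchors, anchored, multiline, absolute, dollar_newline, z_newline)
```

Only the anchored, non-multi-line check and the \A...\Z check reject the attack string here.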

The biggest problem is figuring out exactly what should be legal in the string. In general, you should be as restrictive as possible. A large number of characters can cause special problems; where possible, you don't want to allow characters that have a special meaning to the program internals or the eventual output. That turns out to be really difficult, because so many characters can cause problems in some cases.

Here is a partial list of the kinds of characters that often cause trouble:

  • Normal control characters (characters with values less than 32): This especially includes character 0, traditionally called NUL; I call it NIL to distinguish it from C's NULL pointer. NIL marks the end of strings in C; even if you don't use C directly, many libraries call C routines indirectly and can get confused if given NIL. Another problem is line-ending characters, which can be interpreted as command endings. Unfortunately, there are several line-ending encodings: UNIX-based systems use the linefeed character (0x0a), DOS-based systems (including Windows) use the CP/M marking of carriage return followed by linefeed (0x0d 0x0a), the classic Apple Mac OS uses carriage return (0x0d), many IBM mainframes (like OS/390) use next line (0x85), and some programs even (incorrectly) use the reverse CP/M marking (0x0a 0x0d).
  • Characters with values higher than 127: These are used for international characters, but the problem is that they can have many possible meanings, and you need to make sure that they're properly interpreted. Often these are UTF-8 encoded characters, which have their own complications; see the UTF-8 discussion later in this article.
  • Metacharacters: Metacharacters are characters that have special meanings to programs or libraries you depend on, such as the command shell or SQL.
  • Characters that have a special meaning in your program: For instance, characters used as delimiters. Many programs store data in text files and separate the data fields with commas, tabs, or colons; you'll need to reject or encode user data containing those characters. Today, a common problem is the less-than sign (<), because XML and HTML use it to begin tags.

This isn't an exhaustive list, and you often must accept some of these characters. Later articles will discuss how to deal with these characters, if you must accept them. The point of this list is to convince you to try to accept as few characters as possible, and to think carefully before accepting another. The fewer characters you accept, the more difficult you make it for an attacker.
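To make the categories in the list above concrete, here is a rough Python sketch that reports which risky characters a string contains. The function name and the delimiter set are hypothetical, chosen for an imagined program that stores colon- or comma-separated fields and later emits HTML. It covers only the categories listed, and a whitelist pattern remains the preferred check; a scan like this is mainly useful for explaining why input was rejected.

```python
# Hypothetical delimiter set (an assumption for illustration); adjust it
# to the formats your own program reads and writes.
DELIMITERS = frozenset(":,<")


def find_risky_characters(value: str, delimiters=DELIMITERS) -> list:
    """Return one description per risky character, in string order."""
    problems = []
    for index, ch in enumerate(value):
        code = ord(ch)
        if code < 32:
            # Control characters, including NIL (0), CR (0x0d) and LF (0x0a).
            problems.append(f"control character 0x{code:02x} at index {index}")
        elif code > 127:
            # Anything outside 7-bit ASCII needs careful interpretation.
            problems.append(f"non-ASCII character U+{code:04X} at index {index}")
        elif ch in delimiters:
            # Characters with special meaning to this particular program.
            problems.append(f"delimiter {ch!r} at index {index}")
    return problems


if __name__ == "__main__":
    for sample in ["hello", "a\x00b", "x\r\ny", "caf\u00e9", "a:b<c"]:
        print(repr(sample), find_risky_characters(sample))
```

An empty result means only that none of these particular categories was found, not that the string is safe.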




First published by IBM developerWorks


Article copyright and all rights retained by the author.