Cultured Perl: Perl 5.6 for C and Java programmers

By Teodor Zlatanov - 2004-09-30

Regular expression mayhem

Regular expressions are scary to the uninitiated. They look like a mish-mash of characters and exclamations. Many of us believe that regular expressions were actually invented by Kalahari bushmen who infiltrated computer science programs throughout the world's universities years ago.

Perl's regular expression heritage comes from shell scripting and the awk/grep tools. The language's capabilities, however, far exceed its original models.

Basic regular expressions are easy to write, but somewhat hard to read. For example, "con\w+" will match "contra" and "contrary" but not "pro" or "con". With Perl 5.6.0, however, regular expressions were put on steroids. Unicode character class specifiers, arbitrary code execution inside a pattern, flag toggles, conditional expressions, and many other features were added to the regular expression engine.
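As a small sketch of the basic pattern above, the following tries "con\w+" against the four words from the example. The variable names are chosen here for illustration only.

```perl
use strict;
use warnings;

# Words taken from the paragraph above.
my @words = ("contra", "contrary", "pro", "con");

# \w+ needs at least one word character after "con",
# so a bare "con" does not match.
my @matched = grep { /con\w+/ } @words;

print "@matched\n";    # prints "contra contrary"
```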

The best advice for the beginner is: learn the basics of regular expressions (see Resources or the "perldoc perlre" manual page), but stay away from the advanced features for a little while. Regular expressions are tricky beasts. They are significantly harder to read than all other Perl code, because they are usually written tightly and without comments (comments are not only possible, they are also highly recommended for anyone writing production code).
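One way to comment a regular expression is the /x modifier, which lets whitespace and comments sit inside the pattern. The input string and variable names below are made up for this sketch.

```perl
use strict;
use warnings;

my $line = "error code: 404";   # made-up input

# With /x, whitespace in the pattern is ignored and "#" starts a comment.
my ($code) = $line =~ m/
    code        # the literal word "code"
    \s* : \s*   # a colon, with optional whitespace around it
    (\d+)       # capture one or more digits
/x;

print "$code\n";                # prints 404
```

The same pattern written tightly would be /code\s*:\s*(\d+)/, which is shorter but gives the reader nothing to hold on to.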

Regular expressions are available for C/C++/Java as external packages, but Perl is by far the best tool available today to do regular expression searches and substitutions. In rare cases, it may be slower than a pure C approach, but Perl should be the first tool considered for purely regular expression-oriented problems.

Scalars, arrays, and hashes: my oh my!

Unlike C, C++, or Java's variables, Perl's variables are auto-instantiated and typed by name. This is jarring to new Perl programmers, but very useful once understood.

I recommend the "use strict" pragma in all production code. Among other things, it ensures that variables are declared before they are used. This avoids bugs caused by typos, both common and annoying.

Without "use strict" you can encounter the following kinds of problems:


$i = 5;
print $j;                               # typo: $i was intended

The programmer made a typo: he hit j when he wanted i. Perl considers that fine, and prints nothing, which is the value of $j. Sometimes auto-instantiation is nice, but in my experience it is best turned off with "use strict" for all code that is to be shared with others.
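To see what "use strict" does with the same typo, the sketch below compiles the faulty code from a string, so that the surrounding script can still run and report the error. The $snippet, $ok, and $error names are just for this illustration.

```perl
use strict;
use warnings;

# The buggy code is kept in a string so this script itself still compiles.
my $snippet = 'use strict; my $i = 5; print $j;';

# eval compiles the string; under "use strict" the undeclared $j is a
# compile-time error, and the message ends up in $@.
my $ok    = eval "$snippet; 1";
my $error = $@;

print "strict rejected the typo: $error" unless $ok;
```

In a normal script the error would simply stop compilation, which is exactly the point: the typo is caught before the program does anything.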

Perl variables are either scalars, arrays, or hashes (there are more, but you will rarely encounter them directly). They can also be references, which are just scalars. Scalar names begin with "$", array names begin with "@", and hash names begin with "%".

Scalars are the regular Joes. They hold a single value, which will be a string, a number, or a reference. Perl converts between strings and numbers as necessary, which is, to say the least, surprising to new Perl programmers. Take a look at this, for instance:


$i = "hi there";
print 1+$i;                             # prints 1

The scalar $i contains the string "hi there", which has the numeric value 0. Thus, 1 + "hi there" yields 1. With warnings enabled, Perl also reports that the argument isn't numeric.

Don't think of it as strings vs. numbers. There is only a scalar in memory, and it contains a scalar value. The value can be a number in numeric context (addition), or it can be a string in string context (printing). But there is still only one value.
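A short sketch of one value seen in two contexts. The variable names are illustrative.

```perl
use strict;
use warnings;

my $n = "42";            # a scalar holding the characters "42"

my $sum    = $n + 1;     # numeric context: 43
my $joined = $n . "1";   # string context: "421"

print "$sum $joined\n";  # prints "43 421"
```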

Undefined scalars contain the "undef" value. You shouldn't compare anything to undef, the way you would compare things to null in C/C++/Java. Instead, use the defined() function, like so:


$i = "hi there";
print $i if defined $i;                 # prints "hi there"
undef $i;                               # set $i to be undef
print $i if defined $i;                 # prints nothing

Arrays are lists of scalars. They automatically resize as needed, much like the Vector class in Java. C and C++ arrays are fixed in size, so those languages have no built-in equivalent, but libraries such as the STL provide similar functionality. An interesting property of arrays is that in scalar context they yield the number of elements in the array:



@a = ("hi there", "nowhere");
print scalar @a;                        # prints 2
push @a, "hello";                       # add "hello" at the end
print scalar @a;                        # prints 3

Hashes are like arrays, but the scalars are not ordered by position. They are indexed by another scalar (a unique key). For example, a list of names indexed by social security number (a fairly unique key) can be a hash. Insertion of keys in the hash expands the hash automatically. Hashes are similar to the Java HashMap and Hashtable classes.
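A minimal sketch of such a hash follows. The identifiers and names are invented for the example and are not real data.

```perl
use strict;
use warnings;

# Made-up identifiers and names, for illustration only.
my %name_for = (
    "000-00-0001" => "Alice",
    "000-00-0002" => "Bob",
);

# Assigning to a new key grows the hash automatically.
$name_for{"000-00-0003"} = "Carol";

# Keys come back in no particular order, so sort them for stable output.
foreach my $id (sort keys %name_for) {
    print "$id => $name_for{$id}\n";
}

my $count = scalar keys %name_for;   # number of entries, here 3
```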

References are held inside scalars, and they can point to anything. Thus, it is possible to have an array of hashes or a hash of arrays or a hash of hashes or an array of arrays (N-dimensional array). There are several ways to access reference contents, either by explicit dereferencing or with the "->" operator. See the "perldoc perlref" manual page for further information, as this is a pretty broad topic.
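As a rough illustration, here is a hash of arrays accessed both by explicit dereferencing and with the "->" operator. The data and names are hypothetical.

```perl
use strict;
use warnings;

# A hash whose values are references to anonymous arrays.
my %langs = (
    compiled => [ "C", "C++" ],
    scripted => [ "Perl" ],
);

push @{ $langs{scripted} }, "awk";          # explicit dereference

my $first = $langs{compiled}->[0];          # arrow operator
my $count = scalar @{ $langs{scripted} };   # array length via dereference

my $ref  = \%langs;                         # a reference is just a scalar
my $same = $ref->{compiled}[1];             # arrows between subscripts are optional

print "$first $same $count\n";              # prints "C C++ 2"
```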

C and C++ have no built-in resizable arrays or hashes. This forces programmers to jump through many hoops when they want them, including relying on libraries such as the STL.

Java has built-in types that correspond to arrays and hashes, but they are not as implicit in the language itself. Iterating over the values of a hash, for example, takes about three times as much typing in Java as it does in Perl:


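import java.util.Enumeration;
import java.util.Hashtable;

public class HashWalk {
    public static void main(String[] args) {
        Hashtable<String, String> hi = new Hashtable<String, String>();
        // fill in hi's values
        hi.put("a", "hi");
        hi.put("b", "hello");
        // we could use an Iterator instead, but it is still a lot of typing
        // ("enum" is a reserved word since Java 5, so the variable is named e)
        for (Enumeration<String> e = hi.elements(); e.hasMoreElements();) {
            String o = e.nextElement();
            // do something with o
        }
    }
}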


# note that this even includes the definition and initialization of
# the hash, and still is more compact than the Java code!

# parentheses build the list of pairs; braces would create a hash reference
%hash = ( a => "hi", b => "hello" );

foreach (values %hash)
{
 # do something with $_
}

The missing pieces

Perl lacks many of C, C++, and Java's features. It is a different language, after all. Some of those features directly conflict with each other, such as Java's single inheritance model and C++'s multiple inheritance model. In such cases, it's obviously not possible to have both, and Perl comes up with its own way of doing things.

Because Perl programs can be linked to C libraries (and in fact, this is how much of Perl's functionality is implemented), there is almost nothing that C or C++ code can do that Perl cannot accomplish by linking. Here we'll try to limit our discussion to the language's built-in functionality, without external linking.

Compared to C and C++, Perl sometimes lacks execution speed. This can be a problem, but is often easily overcome with good programming and good use of Perl's built-in functionality.

Perl also lacks direct use of C and C++ libraries. Their constants and functionality have to be adapted into Perl through modules and various bindings, which can delay development and slow down execution. This is less of a problem lately, with a huge number of bindings released on CPAN to date.

As a programmer skill, Perl is not as well accepted as C and C++. It is a young language, growing in popularity but not yet universally spoken. It is installed on most UNIX systems, however, and there are few operating systems to which Perl has not been ported.

Perl supports single or multiple inheritance hierarchies, encapsulation, and polymorphism, but only through external modules or programmer agreement. In other words, the language itself does not enforce strict OOP rules; it is up to the programmers to abide by the rules. This can be good and bad, depending on the programmers and on the project.
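The sketch below shows roughly what "programmer agreement" means in practice. The Animal and Dog classes are hypothetical. Inheritance is declared through the @ISA package array (listing several parents there would give multiple inheritance), and nothing in the language prevents outside code from touching an object's internals.

```perl
use strict;
use warnings;

package Animal;                   # hypothetical base class
sub new {
    my ($class, %args) = @_;
    my $self = { name => $args{name} };
    return bless $self, $class;   # bless ties the hash to the class
}
sub speak { my $self = shift; return "$self->{name} makes a sound" }

package Dog;                      # hypothetical subclass
our @ISA = ("Animal");            # inheritance is just a package array
sub speak { my $self = shift; return "$self->{name} barks" }

package main;

my $dog  = Dog->new(name => "Rex");   # new() is inherited from Animal
my $said = $dog->speak;               # Dog::speak overrides Animal::speak

# Encapsulation is by agreement only: this direct access is not blocked.
my $peek = $dog->{name};

print "$said ($peek)\n";              # prints "Rex barks (Rex)"
```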

Perl's threads and Unicode support lag significantly behind Java, and a little behind C/C++. Java was designed to support threads and Unicode from the start, while C/C++ have had many more years than Perl to get it right, and more need for those features. Threads and Unicode support are still at the experimental stage in Perl, but this should change with the next stable release after 5.6.0.

The best of Perl

For the C/C++/Java programmer, Perl is invaluable for what it does better than those languages. Regular expressions, for instance, are trivial in Perl but quite hard to do in C, C++, or Java. Implicit function arguments, loose syntax, and liberal program structures add to Perl's charm.

Perl is not for everyone. It requires readiness to adapt, acceptance of all its faults, and of course a useful application. Don't use Perl just because it's cool; use Perl because it is the better tool. Use C, C++, or Java when they are better. A good programmer always has several tools at the ready.

Perl has some small deficiencies, which are always being ironed out by its tireless developers. Depending on the need for threads, Unicode support, or strict OOP practices, you may want to consider a language other than Perl to accommodate those specific needs.

Perl is a generic language. It is a flexible language that can act as glue between many disparate modules. It can implement any procedural or functional algorithm. Perl cuts the development cycle significantly, because there is less code to do common things, such as iterating over hash elements. But most importantly, programming in Perl is fun and always a learning experience.

First published by IBM developerWorks