When answering questions on StackOverflow, I'm often surprised at the number of people who hard-code file names in their programs. I'm starting to suspect that there's a generation of programmers who haven't been introduced to the Unix Filter Model. In order to have somewhere that explains what the filter model is and how (and, most importantly, why) you would use it, here's an extract from my (out of print, but available for free on the web) book Data Munging in Perl. The code samples are in Perl, but I think they're readable and the ideas (or something very similar) will work in just about any programming language.
Overview of the filter model
Many operating systems, principally Unix and its variants, support a feature called I/O redirection. This feature is also supported in Microsoft Windows, although as it is a command line feature, it is not used as much as it is in Unix. I/O redirection gives the user great flexibility over where a program gets its input and sends its output. This is achieved by treating all program input and output as file input and output. The operating system opens two special file handles called STDIN and STDOUT, which, by default, are attached to the user’s keyboard and monitor. This means that anything typed by the user on the keyboard appears to the program to be read from STDIN and anything that the program writes to STDOUT appears on the user’s monitor.
For example, if a user runs the Unix command
ls
then a list of files in the current directory will be written to STDOUT and will appear on the user’s monitor.
There are, however, a number of special character strings that can be used to redirect these special files. For example, if our user runs the command
ls > files.txt
then anything that would have been written to STDOUT is, instead, written to the file files.txt. Similarly, STDIN can be redirected using the < character. For example,
sort < files.txt
would sort our previously created file in lexical order (since we haven’t redirected the output, it will go to the user’s monitor).
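The two redirections above can be tried safely in a scratch directory. This is only a minimal sketch: the three file names are made up for illustration, and mktemp is assumed to be available, as it is on most Unix-like systems.

```shell
#!/bin/sh
# Work in a scratch directory so the demo cannot touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Three empty files with illustrative names.
touch cherry.txt apple.txt banana.txt

# Redirect STDOUT: the listing goes into files.txt, not to the monitor.
# The shell creates files.txt before ls runs, so files.txt lists itself too.
ls > files.txt

# Redirect STDIN: sort reads the listing from files.txt.
# Its STDOUT is not redirected, so the sorted names appear on the monitor.
sort < files.txt
```

Neither ls nor sort knows that a file is involved; the shell sets up the file handles before either program starts.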
Another, more powerful, concept is I/O pipes. This is where the output of one process is connected directly to the input of another. This is achieved using the | character. For example, if our user runs the command
ls | sort
then anything written to the STDOUT of the ls command (i.e., the list of files in the current directory) is written directly to the STDIN of the sort command. The sort command processes the data that appears on its STDIN, sorts that data, and writes the sorted data to its STDOUT. The STDOUT for the sort command has not been redirected and therefore the sorted list of files appears on the user’s monitor.
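One way to see what a pipe buys us is to compare it with the temporary file it replaces. In this sketch printf stands in for ls so that the input is fixed and repeatable; the words themselves are arbitrary.

```shell
#!/bin/sh
# Route 1: go through a temporary file using > and <.
tmp=$(mktemp)
printf '%s\n' pear apple fig > "$tmp"
via_file=$(sort < "$tmp")
rm -f "$tmp"

# Route 2: connect the two processes directly with a pipe.
via_pipe=$(printf '%s\n' pear apple fig | sort)

# Both routes hand sort the same bytes on its STDIN.
if [ "$via_file" = "$via_pipe" ]; then
    echo "same result"
fi
```

The pipe version needs no file name, no cleanup, and no decision about where the intermediate data should live.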
A summary of the character strings used in basic I/O redirection is given in the table below. More complex features are available in some operating systems, but the characters listed are available in all versions of Unix and Windows.
Common I/O redirection
String | Usage | Description |
---|---|---|
> | cmd > file | Runs cmd and writes the output to file, overwriting whatever was in file. |
>> | cmd >> file | Runs cmd and appends the output to the end of file. |
< | cmd < file | Runs cmd, taking input from file. |
\| | cmd1 \| cmd2 | Runs cmd1 and passes any output as input to cmd2. |
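The difference between overwriting and appending is the one most likely to bite, so here is a small sketch that exercises all four strings from the table on a scratch file. The words being echoed are arbitrary.

```shell
#!/bin/sh
out=$(mktemp)            # scratch file

echo first   > "$out"    # >  creates the file or empties it first
echo second  > "$out"    # >  again: "first" is gone
echo third  >> "$out"    # >> adds to the end, keeping "second"

# < feeds the file to tr, and | passes tr's output on to sort.
result=$(tr '[:lower:]' '[:upper:]' < "$out" | sort -r)
echo "$result"           # prints THIRD, then SECOND

rm -f "$out"
```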
Advantages of the filter model
The filter model is a very useful concept and is fundamental to the way that Unix works. It means that Unix can supply a large number of small, simple utilities, each of which does one task and does it well. Many complex tasks can be carried out by plugging a number of these utilities together. For example, if we needed to list all of the files in a directory with a name containing the string “proj01” and wanted them sorted in alphabetical order, we could use a combination of ls, sort, and grep like this:
ls -1 | grep proj01 | sort
Most Unix utilities are written to support this mode of usage. They are known as filters as they read their input from STDIN, filter the data in a particular way, and write what is left to STDOUT.
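To make that pipeline repeatable, it can be run against a scratch directory whose contents we control. The file names here are invented for the example.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two files that match "proj01" and two that do not.
touch proj01_plan.txt notes.txt proj01_budget.txt readme.md

# ls lists every name, grep keeps the matching lines, sort orders them.
matches=$(ls -1 | grep proj01 | sort)
echo "$matches"
```

Each stage only reads STDIN and writes STDOUT, so any one of them could be swapped out without the others noticing.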
This is a concept that we can make good use of in our data munging programs. If we write our programs so that they make no assumptions about the files that they are reading and writing (or, indeed, whether they are even reading from and writing to files) then we will have written a useful generic tool, which can be used in a number of different circumstances.
Example: I/O independence
Suppose, for example, that we had written a program called data_munger which munged data from one system into data suitable for use in another. Originally, we might take data from a file and write our output to another. It might be tempting to write a program that is called with two arguments which are the names of the input and output files. The program would then be called like this:
data_munger input.dat output.dat
Within the script we would open the files and read from the input, munge the data, and then write to the output file. In Perl, the program might look something like:
#!/usr/bin/perl
use strict;
use warnings;

# Input and output file names come from the command line.
# munge_data() is assumed to be defined elsewhere in the script.
my ($input, $output) = @ARGV;

open my $in_fh, '<', $input
    or die "Can't open $input for reading: $!";
open my $out_fh, '>', $output
    or die "Can't open $output for writing: $!";

while (<$in_fh>) {
    print $out_fh munge_data($_);
}

close $in_fh  or die "Can't close $input: $!";
close $out_fh or die "Can't close $output: $!";
This will certainly work well for as long as we receive our input data in a file and are expected to write our output data to another file. Perhaps at some point in the future, the programmers responsible for our data source will announce that they have written a new program called data_writer, which we should now use to extract data from their system. This program will write the extracted data to its STDOUT. At the same time the programmers responsible for our data sink announce a new program called data_reader, which we should use to load data into their system and which reads the data to be loaded from STDIN.
In order to use our program unchanged we will need to write some extra pieces of code in the script which drives our program. Our program will need to be called with code like this:
data_writer > input.dat
data_munger input.dat output.dat
data_reader < output.dat
This is already looking a little kludgy, but imagine if we had to make these changes across a large number of systems. Perhaps there is a better way to write the original program.
If we had assumed that the program reads from STDIN and writes to STDOUT, the program actually gets simpler and more flexible. The rewritten program looks like this:
#!/usr/bin/perl
use strict;
use warnings;

while (<STDIN>) {
    print munge_data($_);
}
Note that we no longer have to open the input and output files explicitly, as Perl arranges for STDIN and STDOUT to be opened for us. Also, the default file handle to which the print function writes is STDOUT; therefore, we no longer need to pass a file handle to print. This script is therefore much simpler than our original one.
When we’re dealing with input and output data files, our program is called like this:
data_munger < input.dat > output.dat
and once the other systems want us to use their data_writer and data_reader programs, we can call our program like this:
data_writer | data_munger | data_reader
and everything will work exactly the same without any changes to our program. As a bonus, if we have to cope with the introduction of data_writer before data_reader or vice versa, we can easily call our program like this:
data_writer | data_munger > output.dat
or this:
data_munger < input.dat | data_reader
and everything will still work as expected.
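The four calling styles can be tried without the real programs. In this sketch all three commands are hypothetical shell functions standing in for data_writer, data_munger and data_reader: the record format, the upper-casing "munge" and the "loaded" message are invented purely so there is something to observe.

```shell
#!/bin/sh
# Hypothetical stand-ins, not the real programs from the text.
data_writer() { printf '%s\n' 'widget,3' 'gadget,7'; }    # source: writes STDOUT
data_munger() { tr '[:lower:]' '[:upper:]'; }             # filter: STDIN to STDOUT
data_reader() {                                           # sink: reads STDIN
    while IFS= read -r rec; do
        echo "loaded $rec"
    done
}

workdir=$(mktemp -d)
data_writer > "$workdir/input.dat"

# 1. Files at both ends.
data_munger < "$workdir/input.dat" > "$workdir/output.dat"

# 2. Pipes at both ends.
data_writer | data_munger | data_reader

# 3. Pipe in, file out.
data_writer | data_munger > "$workdir/output2.dat"

# 4. File in, pipe out.
data_munger < "$workdir/input.dat" | data_reader
```

The body of data_munger is identical in all four cases; only the plumbing around it changes.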
Rather than using the STDIN file handle, Perl allows you to make your program even more flexible with no more work, by reading input from the null file handle like this:
#!/usr/bin/perl
use strict;
use warnings;

while (<>) {
    print munge_data($_);
}
In this case, Perl will give your program each line of every file that is listed on your command line. If there are no files on the command line, it reads from STDIN. This is exactly how most Unix filter programs work. If we rewrote our data_munger program using this method we could call it in the following ways:
data_munger input.dat > output.dat
data_munger input.dat | data_reader
in addition to the methods listed previously.
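The same "named files, or else STDIN" behaviour is easy to get outside Perl. As a rough shell analogue (not a translation of the Perl above), a filter can hand its arguments to cat, which reads the files it is given and falls back to STDIN when there are none. The data_munger function and its upper-casing are hypothetical stand-ins.

```shell
#!/bin/sh
# Hypothetical filter: cat reads the named files, or STDIN if "$@" is empty.
data_munger() {
    cat -- "$@" | tr '[:lower:]' '[:upper:]'
}

workdir=$(mktemp -d)
printf 'alpha\n' > "$workdir/a.dat"
printf 'beta\n'  > "$workdir/b.dat"

data_munger "$workdir/a.dat" "$workdir/b.dat"   # files named as arguments
data_munger < "$workdir/a.dat"                  # redirected STDIN
printf 'gamma\n' | data_munger                  # piped STDIN
```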
Example: I/O chaining
Another advantage of the filter model is that it makes it easier to add new functionality into your processing chain without having to change existing code. Suppose that a system is sending you product data. You are loading this data into the database that drives your company’s web site. You receive the data in a file called products.dat and have written a script called load_products. This script reads the data from STDIN, performs various data munging processes, and finally loads the data into the database. The command that you run to load the file looks like this:
load_products < products.dat
What happens when the department that produces products.dat announces that because of a reorganization of their database they will be changing the format of your input file? For example, perhaps they will no longer identify each product with a unique integer, but with an alphanumeric code. Your first option would be to rewrite load_products to handle the new data format, but do you really want to destabilize a script that has worked very well for a long time? Using the Unix filter model, you don’t have to. You can write a new script called translate_products which reads the new file format, translates the new product code to the product identifiers that you are expecting, and writes the records in the original format to STDOUT. Your existing load_products script can then read records in the format that it accepts from STDIN and can process them in exactly the same way that it always has. The command line would look like this:
translate_products < products.dat | load_products
This method of working is known as chain extension and can be very useful in a number of areas.
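Here is a rough sketch of what such a translation filter could look like. Everything specific in it is an assumption made for illustration: the comma-separated record layout, the product codes, the integer identifiers, and the stand-in for load_products, which just prints what it would load.

```shell
#!/bin/sh
# Stand-in for the existing loader. It expects "integer_id,name" on STDIN
# and is left exactly as it was.
load_products() {
    while IFS=, read -r id name; do
        echo "loading product $id ($name)"
    done
}

# New filter: turns alphanumeric codes back into the old integer IDs.
translate_products() {
    awk -F, -v OFS=, '
        BEGIN { id["WID-A"] = 1001; id["GAD-B"] = 1002 }  # made-up lookup table
        $1 in id { $1 = id[$1]; print; next }             # emit the old format
        { print "unknown product code: " $1 > "/dev/stderr" }
    '
}

workdir=$(mktemp -d)
printf '%s\n' 'WID-A,Widget' 'GAD-B,Gadget' > "$workdir/products.dat"

translate_products < "$workdir/products.dat" | load_products
```

Records with an unrecognised code are reported on the error stream rather than passed along, so the loader only ever sees the format it already understands.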
In general, the Unix filter model is very powerful and often actually simplifies the program that you are writing, as well as making your programs more flexible. You should therefore consider using it as often as possible.