
Duplication in Software

Tuesday, 26 March 2013

Much has been said about the importance of reducing duplication in software. For example, J. B. Rainsberger has "minimizes duplication" as the second of his four "Elements of Simple Design", and lots of the teachings of the Agile community stress the importance of reducing duplication when refactoring code.

Inspired by Kevlin Henney's tweet last week, where he laments that programmers trying to remove duplication often take it literally, I wanted to talk about the different kinds of duplication in software. I've just mentioned "literal" duplication, so let's start with that.

Basic Literal Duplication

This is the most obvious form of duplication: sections of code which are completely identical. It most often arises from copy-and-paste programming, but can also arise from repetitive patterns, such as a simple for loop that is repeated in multiple places with the same body.

Removing Literal Duplication

The easiest to create, literal duplication is also the easiest to remove: just extract a function that does the necessary operation.
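As a minimal sketch of that extraction (the function names and the scenario are hypothetical, chosen only for illustration), a summing loop that had been pasted into two callers moves into a single named function:

```cpp
#include <vector>

// Hypothetical helper: the loop that used to be pasted at every call site
// now lives in exactly one place.
int sum_of(std::vector<int> const& values){
    int sum=0;
    for(std::vector<int>::const_iterator it=values.begin();it!=values.end();++it){
        sum+=*it;
    }
    return sum;
}

// Both callers previously contained their own copy of the loop above.
int order_total(std::vector<int> const& item_prices){
    return sum_of(item_prices);
}

int total_score(std::vector<int> const& round_scores){
    return sum_of(round_scores);
}
```

A fix to the summing logic now only needs to be made once.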

Sometimes, though the code is identical, the types involved are different. You cannot address this by extracting a simple function, so we have a new class of duplication.

Parametric Literal Duplication

Parametric literal duplication can also arise from copy-and-paste programming. The key feature is that the types of the variables are different so you cannot just reuse the code from one place in another, even if it was a nicely self-contained function. If you eliminate all the basic literal duplication, parametric literal duplication will give you sets of functions with identical structure but different types.

With the lack of a portable is_ready() function for std::future, it is common to test whether a future f is ready by writing f.wait_for(std::chrono::seconds(0))==std::future_status::ready. Since std::future is a class template, the types of the various futures that you may wish to check for readiness may vary, so you cannot extract a simple function. If you write this in multiple places you therefore have parametric literal duplication.

Removing Parametric Literal Duplication

There are various ways to remove parametric literal duplication. In C++ the most straightforward is probably to use a template, e.g.

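#include <chrono>
#include <future>

// std::future is move-only, so take it by reference rather than by value
template<typename T>
inline bool is_ready(std::future<T> const& f){
    return f.wait_for(std::chrono::seconds(0))==std::future_status::ready;
}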
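
A usage sketch, assuming the template takes the future by reference to const (std::future cannot be copied into a by-value parameter). The helper becomes_ready_after_set is hypothetical, and std::promise is used only to make the moment of readiness deterministic:

```cpp
#include <chrono>
#include <future>
#include <string>

// One template covers every std::future<T>. The future is taken by reference
// to const because std::future cannot be copied.
template<typename T>
bool is_ready(std::future<T> const& f){
    return f.wait_for(std::chrono::seconds(0))==std::future_status::ready;
}

// Hypothetical helper for illustration: checks readiness before and after
// the promise is fulfilled, for any value type T.
template<typename T>
bool becomes_ready_after_set(T value){
    std::promise<T> p;
    std::future<T> f=p.get_future();
    bool const ready_before=is_ready(f);   // nothing set yet, so not ready
    p.set_value(value);
    bool const ready_after=is_ready(f);    // the value is now available
    return !ready_before && ready_after;
}
```

The same is_ready works unchanged for std::future<int> and std::future<std::string>, which is exactly the duplication the template removes.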

In other languages you might choose to use generics, or rely on duck-typing. You might also do it by extracting an interface and using virtual function calls, but that requires that you can modify the types of the objects, or are willing to write a facade.

Parametric literal duplication is closely related to what I call Structural Duplication.

Structural Duplication

This is where the overall pattern of some code is the same, but the details differ. For example, a for loop that iterates over a container is a common structure, but the loop body varies from loop to loop, e.g.

std::vector<int> v;

int sum=0;
for(std::vector<int>::iterator it=v.begin();it!=v.end();++it){
    sum+=*it;
}
for(std::vector<int>::iterator it=v.begin();it!=v.end();++it){
    std::cout<<*it<<std::endl;
}

You can't just extract the whole loop into a separate function because the loop body is different, but that doesn't mean you can't do anything about it.

Removing Structural Duplication

One common way to remove such duplication is to extract the commonality with the template method pattern, or create a parameterized function where the details are passed in as a function to call.

For simple loops like the ones above, we have std::for_each, and the new-style C++11 for loops:

std::for_each(v.begin(),v.end(),[&](int x){sum+=x;});
std::for_each(v.begin(),v.end(),[](int x){std::cout<<x<<std::endl;});

for(int x:v){
    sum+=x;
}
for(int x:v){
    std::cout<<x<<std::endl;
}

Obviously, if your repeated structure doesn't match the standard library algorithms then you must write your own, but the idea is the same: take a function parameter which is a callable object and which captures the variable part of the structure. For a loop, this is the loop body. For a sort algorithm it is the comparison, and so forth.
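
As a sketch of writing your own, suppose the repeated structure is "visit each pair of neighbouring elements". The algorithm name for_each_adjacent_pair and the two callers are hypothetical, invented for this illustration:

```cpp
#include <vector>

// Hypothetical algorithm: calls f(a,b) for each pair of neighbouring elements.
// The iteration is the fixed structure; f is the variable part.
template<typename Iterator,typename Function>
Function for_each_adjacent_pair(Iterator first,Iterator last,Function f){
    if(first==last){
        return f;
    }
    Iterator next=first;
    for(++next;next!=last;++first,++next){
        f(*first,*next);
    }
    return f;
}

// Two uses with the same structure but different bodies.
int count_increases(std::vector<int> const& v){
    int increases=0;
    for_each_adjacent_pair(v.begin(),v.end(),
        [&](int a,int b){ if(b>a) ++increases; });
    return increases;
}

int largest_gap(std::vector<int> const& v){
    int gap=0;
    for_each_adjacent_pair(v.begin(),v.end(),
        [&](int a,int b){ int const d=b>a?b-a:a-b; if(d>gap) gap=d; });
    return gap;
}
```

The fiddly part (handling empty ranges and advancing two iterators in step) is written once, and each caller supplies only what differs.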

Temporal Duplication

This is where some code only appears once in the source code, but is executed repeatedly, and the only desired outcome is the computed result, which is the same for each invocation. For example, consider the call to v.size() or v.end() that finds the upper bound of an iteration through a container.

std::vector<int> v;
for(unsigned i=0;i<v.size();++i)
{
    do_stuff(v[i]);
}

It doesn't just happen in loops, though. For example, in a function that inserts data into a database table you might build a query object, run it to insert the data, and then destroy it. If this function is called repeatedly then you are repeatedly building the query object and destroying it. If your database library supports parameterization then you may well be able to avoid this duplication.

Removing Temporal Duplication

The general process for removing temporal duplication is to use some form of caching or memoization — the value is computed once and then stored, and this stored value is used in place of the computation for each subsequent use. For loops, this can be as simple as extracting a variable to hold the value:

for(unsigned i=0,end=v.size();i!=end;++i){
    do_stuff(v[i]);
}

For other things it can be more complex. For example, with the database query example above, you may need to switch to using a parameterized query so that on each invocation you can bind the new values to the query parameters, rather than building the query around the specific parameters to insert.
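
The database case can be sketched without committing to any particular library. Everything here is an assumption made for illustration: insert_query is a hypothetical stand-in for a library's prepared-statement type, its constructor represents the expensive build step, and build_count exists only to make that cost visible:

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a database library's query object. Constructing
// it represents the expensive step (parsing and preparing the SQL).
class insert_query{
public:
    explicit insert_query(std::string const& sql):sql_(sql){
        ++build_count();
    }
    // Bind fresh values and "run" the query; here the rows are just recorded.
    void execute(std::string const& name,int value){
        rows_.push_back(std::make_pair(name,value));
    }
    std::string const& sql() const{
        return sql_;
    }
    std::vector<std::pair<std::string,int> > const& rows() const{
        return rows_;
    }
    // Counts how many query objects have been built.
    static int& build_count(){
        static int count=0;
        return count;
    }
private:
    std::string sql_;
    std::vector<std::pair<std::string,int> > rows_;
};

// Temporal duplication: the query is rebuilt and destroyed on every call.
void insert_rebuilding(std::string const& name,int value){
    insert_query q("INSERT INTO t(name,value) VALUES(?,?)");
    q.execute(name,value);
}

// Built once on first use, then reused with new bound values each time.
insert_query& cached_insert_query(){
    static insert_query q("INSERT INTO t(name,value) VALUES(?,?)");
    return q;
}

void insert_reusing(std::string const& name,int value){
    cached_insert_query().execute(name,value);
}
```

Three calls to insert_rebuilding build three query objects, whereas any number of calls to insert_reusing build one. Note that the shared query object in this sketch is not synchronized, so it is only suitable for single-threaded use as written.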

Duplication of Intent

Sometimes the duplication does not appear in the actual code, but in what the code is trying to achieve. This often occurs in large projects where multiple people have worked on the code base. One person writes some code to do something in one source file, and another writes some code to do the same thing in another source file, but different styles mean that the code is different even though the result is the same. This can also happen with a single developer if the different bits are written far enough apart in time that you cannot remember what you did before and your style has changed slightly. To beat the loop iteration example to death, you might have some code that loops through a container by index, and other code that loops through the same container using iterators. The structure is different, but the intent is the same.

Removing Duplication of Intent

This is one of the hardest types of duplication to spot and remove. The way to remove it is to refactor one or both of the pieces of code until they have the same structure, and are thus more obviously duplicates of one another. You can then treat them as literal duplication, parametric literal duplication or structural duplication as appropriate.
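
A small sketch of that process, using the index-versus-iterator loops mentioned earlier (all three function names are hypothetical):

```cpp
#include <vector>

// Written in one source file: loop by index.
int total_by_index(std::vector<int> const& v){
    int sum=0;
    for(std::vector<int>::size_type i=0;i<v.size();++i){
        sum+=v[i];
    }
    return sum;
}

// Written elsewhere: loop by iterator. Different structure, same intent.
int total_by_iterator(std::vector<int> const& v){
    int sum=0;
    for(std::vector<int>::const_iterator it=v.begin();it!=v.end();++it){
        sum+=*it;
    }
    return sum;
}

// Once both are rewritten in the same form they are plainly literal
// duplicates, so a single function can replace them.
int total(std::vector<int> const& v){
    int sum=0;
    for(int x:v){
        sum+=x;
    }
    return sum;
}
```

The refactoring to a common form is the step that makes the duplication visible; the final extraction is then the ordinary literal case.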

Incidental Duplication

This is where there is code that looks identical but has a completely different meaning in each place. The most obvious form of this is with "magic numbers": the constant "3" in one place typically has a completely different meaning to the constant "3" somewhere else.

Removing Incidental Duplication

You can't necessarily eliminate incidental duplication entirely, but you can minimize it by good naming. If you use symbolic constants instead of literals, it is clear that different uses are distinct because the names of the constants are distinct. There will still be duplication of the literal in the definitions of the constants, but this is now less problematic.

If the incidental duplication is more than just a constant, you can extract separate named functions that encapsulate the duplicate code and express the intent in each case. The duplication is now just between these function bodies rather than between the uses, and the naming of the functions makes it clear that this is just incidental duplication.
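
Both cases can be sketched briefly. The constants and functions below are hypothetical examples, not taken from any real code base:

```cpp
// Hypothetical constants: both happen to be 3, but they mean different things
// and can change independently.
constexpr int max_login_attempts=3;
constexpr int rgb_channel_count=3;

bool should_lock_account(int failed_attempts){
    return failed_attempts>=max_login_attempts;
}

int pixel_buffer_size(int width,int height){
    return width*height*rgb_channel_count;
}

// The same applies to code: these bodies are identical today, but each name
// states a separate intent, so the repetition is visibly incidental.
bool is_valid_percentage(int value){
    return value>=0 && value<=100;
}

bool is_valid_exam_mark(int value){
    return value>=0 && value<=100;
}
```

If exam marks were later scored out of 80, only is_valid_exam_mark would change, which is why merging the two bodies would be a mistake.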

Conclusion

There are quite a few types of duplication that you may get in your code. By eliminating them you will tend to make your code shorter, clearer, and easier to maintain.

If you can think of any types of duplication I've missed, please add a comment.

Posted by Anthony Williams
Tags: software design, refactoring, duplication