Blog Archive for / database /

Database Tip: Eliminate Duplicate Data

Friday, 25 January 2008

Storing duplicated data in your database is a bad idea for several reasons:

  1. The duplicated data occupies more space — if you store two copies of the same data in your database, it takes twice as much space.
  2. If duplicated data is updated, it must be changed in more than one place, which is more complex and may require more code than just changing it in one location.
  3. Following on from the previous point — if data is duplicated, then it is easy to miss one of the duplicates when updating, leading to different copies having different information. This may lead to confusion, and errors further down the line.

Coincidental Duplication

It is worth noting that some duplication is coincidental — it is worth checking out whether a particular instance of duplication is coincidental or not before eliminating it. For example it is common for a billing address to be the same as a delivery address, and it may be that for all existing entries in the table it is the same, but they are still different concepts, and therefore need to be handled as such (though you may manage to eliminate the duplicate storage where they are the same).

Duplication Between Tables

One of the benefits of Using an artificial primary key is that you can avoid duplication of data between the master table and those tables which have foreign keys linked to that table. This reduces the problems described above where the duplication is in the foreign key, but is only the first step towards eliminating duplication within a given table.

If there is duplication of data between tables that is not due to foreign key constraints, and is not coincidental duplication, then it is possibly worth deleting one of the copies, or making both copies reference the same row in a new table.

Duplication Between Rows Within a Table

Typically duplication between rows occurs through the use of a composite primary key, along with auxiliary data. For example, a table of customer orders might include the full customer data along with each order entry:
CUSTOMER_ORDERS
CUSTOMER_NAMECUSTOMER_ADDRESSORDER_NUMBERITEMQUANTITY
Sprockets LtdBooth Ind Est, Boston200804052Widget 23450
Sprockets LtdBooth Ind Est, Boston200804052Widget Connector900
Foobar IncBaz Street, London200708162Widget Screw size 5220
Foobar IncBaz Street, London200708162Widget 4255

In order to remove duplication between rows, the data needs to split into two tables: the duplicated data can be stored as a single row in one table, referenced by a foreign key from the other table. So, the above example could be split into two tables: a CUSTOMER_ORDERS table, and an ORDER_ITEMS table:
CUSTOMER_ORDERS
CUSTOMER_NAMECUSTOMER_ADDRESSORDER_NUMBER
Sprockets LtdBooth Ind Est, Boston200804052
Foobar IncBaz Street, London200708162
ORDER_ITEMS
ORDER_NUMBERITEMQUANTITY
200804052Widget 23450
200804052Widget Connector900
200708162Widget Screw size 5220
200708162Widget 4255

The ORDER_NUMBER column would be the primary key of the CUSTOMER_ORDERS table, and a foreign key in the ORDER_ITEMS table. This isn't the only duplication in the original table, though — what if one customer places multiple orders? In this case, not only are the customer details duplicated for every item on an order, they are duplicated for every item on every order by that customer. This duplication is still present in the new schema, but in this case it is a business decision whether to keep it — if a customer changes address, do you update the old orders with the new address, or do you leave those entries alone, since that was the address that order was delivered to? If the delivered-to address is important, then this is coincidental duplication as described above, if not, then it too can be eliminated by splitting the CUSTOMER_ORDERS table into two.

The Downsides of Eliminating Duplication

The benefits of eliminating duplication might seem obvious, but there are potential downsides too. For example:

  • If the application is already released, you need to provide upgrade code to change existing databases over to the new schema without losing any data.
  • If you split tables in order to reduce duplication, your SQL can get more complicated, as you need more table joins.

Conclusion

As with everything in software development it's a trade-off. However, as the database gets larger, and more data gets stored, the costs of storing duplicate data increase, as do the costs of changing the schema of an existing database. For this reason, I believe that it is worth designing your schema to eliminate duplication as soon as possible — preferably before there's any data in it!

Posted by Anthony Williams
[/ database /] permanent link
Tags: ,

| Stumble It! stumbleupon logo | Submit to Reddit reddit logo | Submit to DZone dzone logo

Comment on this post

If you liked this post, why not subscribe to the RSS feed RSS feed or Follow me on Twitter? You can also subscribe to this blog by email using the form on the right.

Database Tip: Use an Artificial Primary Key

Monday, 19 November 2007

If your data has a clear "master" field or combination of fields, which can uniquely identify each row (such as customer name in a table of customers or ISBN for a table of books), it is tempting to use that as the primary key of the table. However, my advice is: don't do that, use a separate, numeric, artificial primary key instead. Yes, it's an extra column on the table, requiring extra space, and you will have to generate it somehow, but that's not a big deal. Every database vendor provides some way of auto-generating unique key values (e.g. SEQUENCEs in Oracle, and AUTOINCREMENT fields in SQLite), so populating it is easy, and the complications it saves are more than worth the trade-off. You can still maintain the uniqueness of the master columns by applying a unique index to those columns.

Save Space on Foreign Keys

Firstly, if you have any tables that are associated with the master table, and therefore have foreign key columns that refer to the primary key of your master, then having a separate primary key can actually save space overall, as the data for the master columns doesn't have to be duplicated across all the linked tables. This is especially important if there is more than one "master column", such as customer_first_name and customer_last_name, or if the data for these columns is large.

Changing the master data

Secondly, if the "master columns" are actually the primary key of your table, changing the data in them is potentially problematic, especially if they are used as a foreign key in other tables. Many online services use a customer's email address as their "master column": each customer has one email address, and one email address refers to one customer. That's fine until a customer changes their email address. Obviously, you don't want to lose all data associated with a given customer just because they changed their email address, so you need to update the row rather than delete the old one and insert a new one. If the email_address column is the primary key of the table, and therefore used as the foreign key in other tables, then you've got to update the data not just in the master table, but in each dependent table too.

This is not impossible, but it's certainly more complex and time consuming. If you miss a table, the transaction may not complete due to foreign key constraint violations, or (worse) the transaction may complete, but some of the data may be orphaned. Also, in some database engines, the constraint violation will fire when you change either the master table or the dependent table, so you need to execute a special SQL statement to defer the constraint checking until COMMIT time. If you use an auto-generated primary key, then only the data in the master table needs changing.

Changing the master columns

Finally, if the primary key is auto-generated, then not only is it easy to change the data in the master columns, but you can actually change what the master columns are. For example, if you initially decide that customer_first_name and customer_last_name make an ideal primary key, then you're stuck if you then get another customer with the same name. OK, so you make customer_first_name, customer_last_name and customer_address the primary key. Oops — now you've got to duplicate the address information across all the dependent tables. Now you encounter two people with the same name at the same address (e.g. father and son), so you need to add a new designator to the key (e.g. Henry Jones Junior, Richard Wilkins III). Again, you need to update all the dependent tables. If the primary key is auto-generated, there's no problem — just update the unique constraint on the master table to include the appropriate columns, and all is well, with the minimum of fuss.

Simplify your code

It's not going to simplify your code much, but using an auto-generated numeric key means that this is all you need to store as an identifier inside your program to refer to a particular row: much easier than storing the data from a combination of columns. Also, it's much easier to write code to update the data on one table than on multiple tables.

Conclusion

Don't use real table data as the primary key for a table: instead, use a separate, numeric, auto-generated column as the primary key. This will simplify the connections between tables, and make your life easier if the structure of the database or the data in the key columns changes.

Related Posts

In previous posts on Database Design, I've talked about:

Posted by Anthony Williams
[/ database /] permanent link
Tags: , ,

| Stumble It! stumbleupon logo | Submit to Reddit reddit logo | Submit to DZone dzone logo

Comment on this post

If you liked this post, why not subscribe to the RSS feed RSS feed or Follow me on Twitter? You can also subscribe to this blog by email using the form on the right.

Database Tip: Use Parameterized Queries

Monday, 03 September 2007

This post is the third in a series of Database Tips.

When running an application with a database backend, a high percentage of SQL statements are likely to have variable data. This might be data obtained from a previous query, or it might be data entered by the user. In either case, you've got to somehow combine this variable data with the fixed SQL string.

String Concatenation

One possibility is just to incorporate the data into the SQL statement directly, using string concatenation, but this has two potential problems. Firstly, this means that the actual SQL statement parsed by the database is different every time. Many databases can skip parsing for repeated uses of the same SQL statement, so by using a different statement every time there is a performance hit. Secondly, and more importantly, this places the responsibility on you to ensure that the variable data will behave correctly as part of the statement. This is particularly important for web-based applications, as a common attack used by crackers is a "SQL injection" attack — by taking advantage of poor quoting by the application when generating SQL statements, it is possible to input data which will end the current SQL statement, and start a new one of the cracker's choosing. For example, if string data is just quoted in the SQL using plain quotes ('data') then data that contains a quote and a semicolon will end the statement. This means that if data is '; update login_table set password='abc'; then the initial '; will end the statement from the application, and the database will then run the next one, potentially setting everyone's password to "abc".

Parameterized Queries

A solution to both these problems can be found in the form of Parameterized Queries. In a parameterized query, the variable data in the SQL statement is replaced with a placeholder such as a question mark, which indicates to the database engine that this is a parameter. The database API can then be used to set the parameters before executing the query. This neatly solves the first problem with string concatenation — the query seen by the database engine is the same every time, so giving the database the opportunity to avoid parsing the statement every time. Most parameterized query APIs will also allow you to reuse the same query with multiple sets of parameters, thus explicitly caching the parsed query.

Parameterized queries also solve the SQL injection problem — most APIs can send the data directly to the database engine, marked as a parameter rather than quoting it. Even when the data is quoted within the API, this is then the database driver's responsibility, and is thus more likely to be reliable. In either case, the user is relinquished from the requirement of correctly quoting the data, thus avoiding SQL injection attacks.

A third benefit of parameterized queries is that data doesn't have to be converted to a string representation. This means that, for example, floating point numbers can be correctly transferred to the database without first converting to a potentially inaccurate string representation. It also means that the statement might run slightly faster, as the string representation of data often requires more storage than the data itself.

The Parameterized Query Coding Model

Whereas running a simple SQL statement consists of just two parts — execute the statement, optionally retrieve the results — using parameterized queries ofetn requires five:

  1. Parse the statement (often called preparing the statement.)
  2. Bind the parameter values to the parameters.
  3. Execute the statement.
  4. Optionally, retrieve the results
  5. Close or finalize the statement.

The details of each step depends on the particular database API, but most APIs follow the same outline. In particular, as mentioned above, most APIs allow you to run steps 2 to 4 several times before running step 5.

Placeholders

A parameterized query includes placeholders for the actual data to be passed in. In the simplest form, these placeholders can often be just a question mark (e.g. SELECT name FROM customers WHERE id=?), but most APIs also allow for named placeholders by prefixing an identifier with a marker such as a colon or an at-sign (e.g. INSERT INTO books (title,author) VALUES (@title,@author)). The use of named placeholders can be beneficial when the same data is needed in multiple parts of the query — rather than binding the data twice, you just use the same placeholder name. Named placeholders are also easier to get right in the face of SQL statements with large numbers of parameters or if the SQL statement is changed — it is much easier to ensure that the correct data is associated with a particular named parameter, than to ensure that it is associated with the correctly-numbered parameter, as it is easy to lose count of parameters, or change their order when changing the SQL statement.

Recommendations

Look up the API for parameterized queries for your database. In SQLite, it the APIs surrounding sqlite3_stmt, for MySQL it's the Prepared Statements API, and for Oracle the OCI parameterized statements API does the trick.

If your database API supports it, used named parameters, or at least explicit numbering (e.g. ?1,?2,?3 rather than just ?,?,?) to help avoid errors.

Posted by Anthony Williams
[/ database /] permanent link

| Stumble It! stumbleupon logo | Submit to Reddit reddit logo | Submit to DZone dzone logo

Comment on this post

If you liked this post, why not subscribe to the RSS feed RSS feed or Follow me on Twitter? You can also subscribe to this blog by email using the form on the right.

Database Tip: Use Transactions

Monday, 27 August 2007

Here's another Database Tip, to follow on from my previous one on creating appropriate indexes. This time the focus is transactions.

For any experienced database developer, using transactions might seem an obvious suggestion, but lightweight databases may require configuration in order to use transactions. For example, MySql tables use the MyISAM engine by default, which doesn't support transactions — in order to use transactions you need to set the storage engine of your tables to InnoDB or BDB. Also, whereas in Oracle, every statement occurs within a transaction, and you need an explicit COMMIT or ROLLBACK to end the transaction, in databases such as MySQL and SQLite every statement is its own transaction (and thus committed to the database immediately) unless you explicitly begin a transaction with a BEGIN statement, or other database configuration command.

Benefits of Transactions

The primary benefit of using transactions is data integrity. Many database uses require storing data to multiple tables, or multiple rows to the same table in order to maintain a consistent data set. Using transactions ensures that other connections to the same database see either all the updates or none of them. This also applies in the case of interrupted connections — if the power goes off in the middle of a transaction, the database engine will roll back the transaction so it is as-if it was never started. If each statement is committed independently, then other connections may see partial updates, and there is no opportunity for automatic rollback on error.

A secondary benefit of using transactions is speed. There is often an overhead associated with actually committing the data to the database. If you've got 1000 rows to insert, committing after every row can cause quite a performance hit compared to committing once after all the inserts. Of course, this can work the other way too — if you do too much work in a transaction then the database engine can consume lots of space storing the not-yet-committed data or caching data for use by other database connections in order to maintain consistency, which causes a performance hit. As with every optimisation, if you're changing the boundaries of your transactions to gain performance, then it is important to measure both before and after the change.

Using Transactions in Oracle

Oracle databases are always in transaction mode, so all that's needed is to decide where to put the COMMIT or ROLLBACK. When one transaction is finished, another is automatically started. There are some additional options that can be specified for advanced usage — see the Oracle documentation for these.

    INSERT INTO foo (a,b) VALUES (1,'hello');
    INSERT INTO foo (a,b) VALUES (2,'goodbye');
    COMMIT;
    INSERT INTO foo (a,b) VALUES (3,'banana');
    COMMIT;

Using Transactions in SQLite

In SQLite, if you wish a transaction to cover more than one statement, then you must use a BEGIN or BEGIN TRANSACTION statement. The transaction ends when you execute a COMMIT or ROLLBACK statement, and the database reverts to auto-commit mode where each statement has its own transaction.

    BEGIN;
    INSERT INTO foo (a,b) VALUES (1,'hello');
    INSERT INTO foo (a,b) VALUES (2,'goodbye');
    COMMIT;
    INSERT INTO foo (a,b) VALUES (3,'banana');
    -- implicit COMMIT at end of statement
    -- with no preceding BEGIN

Using Transactions in MySQL

As mentioned above, by default all tables use the MyISAM storage engine, so transaction support is disabled. By changing a table to use the InnoDB or BDB storage engines, transaction support can be enabled. For new tables, this is done using the ENGINE or TYPE parameters on the CREATE TABLE statement:

    CREATE TABLE foo
    (
        bar INTEGER PRIMARY KEY,
        baz VARCHAR(80) NOT NULL
    )
    ENGINE = InnoDB;

Existing tables can be changed to use the InnoDB storage engine using ALTER TABLE:

    ALTER TABLE foo ENGINE = InnoDB;

You can also change the default storage engine using the default-storage-engine server option on the server command line, or in the server configuration file.

Once all the tables you're using have storage engines that support transactions, you have two choices. For a given connection, you can set the AUTOCOMMIT session variable to 0, in which case every statement within that connection is part of a transaction, as for Oracle, or you can leave AUTOCOMMIT set to 1 and start transactions explicitly as for SQLite. In auto-commit mode for MySQL, transactions are started with BEGIN, BEGIN WORK or START TRANSACTION. To disable AUTOCOMMIT for the current transaction, use the SET statement:

    SET AUTOCOMMIT=0;

You can configure the database to run this statement immediately upon opening a connection using the init_connect server variable. This can be set in the configuration file, or using the following command:

    SET GLOBAL init_connect='SET AUTOCOMMIT=0';

MySQL also supports additional transaction options — check the documentation for details.

Automatic ROLLBACK and COMMIT

One thing to watch out for is code that causes an automatic ROLLBACK or COMMIT. Most databases cause an automatic ROLLBACK of any open transaction when a connection is closed, so it is important to make sure that all changes are committed before closing the connection.

Also worth watching out for are commands that cause an automatic COMMIT. The list varies depending on the database, but generally DDL statements such as CREATE TABLE will cause an automatic commit. It is probably best to avoid interleaving DML statement with any other type of statement in order to avoid surprises. Check your database documentation for details.

Posted by Anthony Williams
[/ database /] permanent link

| Stumble It! stumbleupon logo | Submit to Reddit reddit logo | Submit to DZone dzone logo

Comment on this post

If you liked this post, why not subscribe to the RSS feed RSS feed or Follow me on Twitter? You can also subscribe to this blog by email using the form on the right.

Database Tip: Create Appropriate Indexes

Monday, 20 August 2007

One of the simplest things you can do speed up database access is create appropriate indexes.

There are several aspects to this. Firstly, you need to identify which columns are used for queries on a given table; in particular, which columns appear in the WHERE clause of time consuming queries. If a query is only done once, or the table only has five rows in it so queries are always quick then there is no benefit to adding indexes. It's not just straight-forward SELECT statements that need checking — UPDATE and DELETE statements can have WHERE clauses too.

Having identified which columns are used in queries, it is important to also note which combinations are used. A database engine will tend to only use one index per table (though some can use more, depending on the query), so if your time-consuming queries use multiple columns from the same table in the WHERE clause, then it's probably better to have an index that covers all of them rather than separate indexes on each.

The cost of indexes

Adding an index to a table does have a downside. Any modification to a table with an index (such as inserts and deletes) will be slower, as the database will have to update the index, and the index will occupy space in the database. That can be a high price to pay for faster queries if the table is modified frequently. In particular, an index that is never used, or covers the same columns as another index is just dead weight. It is important to remember that PRIMARY KEY and UNIQUE columns automatically have indexes that cover them.

Timing is everything

As with every optimization, it is important to profile both before and after any change, and this includes checking the performance of the rest of the application too. Don't be afraid to remove an index if it isn't helping, and bear in mind that it's also possible to improve performance by rewriting queries, particularly where the are joins or subqueries involved.

Posted by Anthony Williams
[/ database /] permanent link

| Stumble It! stumbleupon logo | Submit to Reddit reddit logo | Submit to DZone dzone logo

Comment on this post

If you liked this post, why not subscribe to the RSS feed RSS feed or Follow me on Twitter? You can also subscribe to this blog by email using the form on the right.

Delay Using a Database

Wednesday, 08 August 2007

A client of ours makes hardware that generates data files, and a few years ago I wrote a piece of software for them to help manage those data files. Initially it only had to deal with a few data files, so it indexed them on start up. Then someone tried to use it with several thousand data files, and start-up times got too slow, so I modified the indexing code to dump the current index to an XML file on shutdown, which it then loaded at startup. This has worked well, but now they're using it to handle hundreds of thousands of files, and the startup and shutdown times are again becoming significant due to the huge size of the XML file. Also, the data file access times are now getting larger due to the size of the in-memory index. We've now been hired again to address the issue, so this time I'm going to use a SQLite database for the index — no start up delay, no shutdown delay, and faster index lookup.

What lessons can be learned from this experience? Should I have gone for SQLite in the first instance? I don't think so. Using a simple in-memory map for the initial index was the simplest thing that could possibly work, and it has worked for a few years. The XML index file was a small change, and it kept the application working for longer. Now the application does need a database, but the implementation is certainly more complex than the initial in-memory map. By using the simple implementation first, the application was completed quicker — not only did this save my client money in the development, but it meant they could begin using it sooner. It also meant that now I come to add the database code, the requirements are better-known and there are already a whole suite of tests for how the index should behave. It has taken less than a day to write the database indexing code, whereas it could easily have taken several days at the beginning.

I think people are often too keen to jump straight to using a database, when they could often get by for now with something far simpler. That doesn't mean that requirements won't evolve, and that a database won't be required in the end, but this time can often be put off for years, thus saving time and money. In this instance I happened to use SQLite, which is free, but many people jump straight to Oracle, or SQL Server, which have expensive licenses and are often overkill for the problem at hand. Just think how much money you could save by putting off the purchase of that SQL Server license for a year or two.

Don't be scared into buying a full-featured enterprise level RDBMS at the beginning of your project; simple in-memory maps or data files will often suffice for a long time, and when they're no longer sufficient you'll know more about what you do need from your RDBMS. Maybe SQLite will do, or maybe it won't — in any case you've saved money.

Posted by Anthony Williams
[/ database /] permanent link

| Stumble It! stumbleupon logo | Submit to Reddit reddit logo | Submit to DZone dzone logo

Comment on this post

If you liked this post, why not subscribe to the RSS feed RSS feed or Follow me on Twitter? You can also subscribe to this blog by email using the form on the right.