Monday, 03 September 2007
This post is the third in a series of Database Tips.
When running an application with a database backend, a high percentage of SQL statements are likely to have variable data. This might be data obtained from a previous query, or it might be data entered by the user. In either case, you've got to somehow combine this variable data with the fixed SQL string.
One possibility is just to incorporate the data into the SQL statement directly, using string concatenation, but this has two
potential problems. Firstly, this means that the actual SQL statement parsed by the database is different every time. Many databases
can skip parsing for repeated uses of the same SQL statement, so by using a different statement every time there is a performance
hit. Secondly, and more importantly, this places the responsibility on you to ensure that the variable data will behave correctly as
part of the statement. This is particularly important for web-based applications, as a common attack used by crackers is a "SQL
injection" attack — by taking advantage of poor quoting by the application when generating SQL statements, it is possible to
input data which will end the current SQL statement, and start a new one of the cracker's choosing. For example, if string data is
just quoted in the SQL using plain quotes (
'data') then data that contains a quote and a semicolon will end
the statement. This means that if data is
'; update login_table set password='abc'; then the initial
'; will end the statement from the application, and the database will then run the next one, potentially setting
everyone's password to "abc".
A solution to both these problems can be found in the form of Parameterized Queries. In a parameterized query, the variable data in the SQL statement is replaced with a placeholder such as a question mark, which indicates to the database engine that this is a parameter. The database API can then be used to set the parameters before executing the query. This neatly solves the first problem with string concatenation — the query seen by the database engine is the same every time, so giving the database the opportunity to avoid parsing the statement every time. Most parameterized query APIs will also allow you to reuse the same query with multiple sets of parameters, thus explicitly caching the parsed query.
Parameterized queries also solve the SQL injection problem — most APIs can send the data directly to the database engine, marked as a parameter rather than quoting it. Even when the data is quoted within the API, this is then the database driver's responsibility, and is thus more likely to be reliable. In either case, the user is relinquished from the requirement of correctly quoting the data, thus avoiding SQL injection attacks.
A third benefit of parameterized queries is that data doesn't have to be converted to a string representation. This means that, for example, floating point numbers can be correctly transferred to the database without first converting to a potentially inaccurate string representation. It also means that the statement might run slightly faster, as the string representation of data often requires more storage than the data itself.
The Parameterized Query Coding Model
Whereas running a simple SQL statement consists of just two parts — execute the statement, optionally retrieve the results — using parameterized queries ofetn requires five:
- Parse the statement (often called preparing the statement.)
- Bind the parameter values to the parameters.
- Execute the statement.
- Optionally, retrieve the results
- Close or finalize the statement.
The details of each step depends on the particular database API, but most APIs follow the same outline. In particular, as mentioned above, most APIs allow you to run steps 2 to 4 several times before running step 5.
A parameterized query includes placeholders for the actual data to be passed in. In the simplest form, these
placeholders can often be just a question mark (e.g.
SELECT name FROM customers WHERE id=?), but most APIs also allow
for named placeholders by prefixing an identifier with a marker such as a colon or an at-sign (e.g.
INSERT INTO books
(title,author) VALUES (@title,@author)). The use of named placeholders can be beneficial when the same data is needed in
multiple parts of the query — rather than binding the data twice, you just use the same placeholder name. Named placeholders
are also easier to get right in the face of SQL statements with large numbers of parameters or if the SQL statement is changed
— it is much easier to ensure that the correct data is associated with a particular named parameter, than to ensure that it is
associated with the correctly-numbered parameter, as it is easy to lose count of parameters, or change their order when changing the
Look up the API for parameterized queries for your database. In SQLite, it the APIs surrounding sqlite3_stmt, for MySQL it's the Prepared Statements API, and for Oracle the OCI parameterized statements API does the trick.
If your database API supports it, used named parameters, or at least explicit numbering (e.g.
?1,?2,?3 rather than
?,?,?) to help avoid errors.