SQLite and the Performance Implications of Indexes

Indexes make an enormous difference in the read speed of a databases. I have previously mentioned that adding proper indexes is a significant part of improving overall database performance. We can use SQLite and Python to demonstrate how much of a difference indexes make in certain cases.

For this purpose, we create a database with 5000 rows. Each row will have two columns, one with a random number and the other with a random string. Then we will test the speed with a query that finds rows where col1 is identical but col2 is different using the Python timeit library. Then we will see how long it takes to both create an index and execute the same query. Just to ensure that the results are consistent, we will drop the index and then repeat the test. All together, the code to do this looks like:

import sqlite3
import random
import timeit
from collections import OrderedDict

#Constants used as settings
alpha = 'abcdefghijklmnopqrstuvwxyz'
numRows = 5000

def createRandomDB():
    conn = sqlite3.connect('TestingIndex.py')
    curs = conn.cursor()
    curs.execute('drop table if exists IndexTestTbl')
    curs.execute('create table IndexTestTbl (col1, col2)')
    for i in range(numRows):
        col1 = random.randint(1, numRows)
        col2 = ''.join(random.choice(alpha) for x in range(25))
        curs.execute('insert into IndexTestTbl values (?, ?)', (col1, col2))
    conn.commit() #commit is expensive, to execute only after the loop

def readTable():
    #Does not actively destroy indexes. Assumes that the table lacks indexes
    conn = sqlite3.connect('TestingIndex.py')
    curs = conn.cursor()
    curs.execute('''select *
                    from IndexTestTbl t1
                    where exists (select * from IndexTestTbl t2
                                  where t1.col1 = t2.col1
                                  and t1.col2 != t2.col2)
    a = curs.fetchall()

def createIdxReadTable():
    conn = sqlite3.connect('TestingIndex.py')
    curs = conn.cursor()
    curs.execute('''create index if not exists
                     TestIdx on IndexTestTbl (col1)''')

if __name__ == '__main__':
    resultsDict = OrderedDict() #Using OrderedDict to keep the plots in order
    resultsDict['ReadNoIdx'] = timeit.timeit('readTable()',
            number = 1, setup='from __main__ import readTable')

    resultsDict['CreateIdxRead'] = timeit.timeit('createIdxReadTable()',
            number = 1, setup = 'from __main__ import createIdxReadTable')

    #drop the index
    conn = sqlite3.connect('TestingIndex.py')
    conn.execute('drop index TestIdx')

    resultsDict['ReadAfterDropIdx'] = timeit.timeit('readTable()',
            number = 1, setup='from __main__ import readTable')
    resultsDict['CreateIdxRead2'] = timeit.timeit('createIdxReadTable()',
            number = 1, setup = 'from __main__ import createIdxReadTable')

    #print the results
    print('Take one with no index required {} seconds'.format(resultsDict['ReadNoIdx']))
    print('Create Index and then read required {} seconds'.format(resultsDict['CreateIdxRead']))
    print('Read the table after dropping the Index {}'.format(resultsDict['ReadAfterDropIdx']))
    print('Create the index again and read takes {}'.format(resultsDict['CreateIdxRead2']))

    #graph the data
    import matplotlib.pyplot as plt
    width = .8
    fig = plt.figure()
    ax = plt.subplot(111)
    ax.bar(range(len(resultsDict)), resultsDict.values(), align = 'center', width = width)
    ax.set_xticklabels(resultsDict.keys(), rotation = 30, size = 'small')

The results:

Take one with no index required 1.088418062616674 seconds
Create Index and then read required 0.2158297379552767 seconds
Read the table after dropping the Index 1.0706404903284745
Create the index again and read takes 0.1789995581284507


Results Graph

Results Graph

In this case, it took longer to run the query without the index than it did to create the index and run the query, and the difference was quite substantial. Of course, this was deliberately set up as an extreme example. The rows in the heap were inserted in a random order and the query involved a correlated subquery. I also made sure that I ran the test from the traditional harddrive instead of the SSD. Still, I have seen similar cases in production environments where it took less time to create an index and run the query it was made to support than it did to run the query without bothering to create the index.

It is possible to become overly reliant on indexes, and when there are performance issues in practice I find that there is often more room for improvement and optimization in the queries and code than in the indexes. Still, under some circumstances the right indexes can make a dramatic difference in execution time.

The effects of an SSD on SQL Server Performance

I.                    Introduction

A little while ago, my wonderful wife Renee got a Solid State Drive (SSD) for me.  Being a DBA, one of my first thoughts was to wonder what kind of effect this would have on database performance, since I use my home computer for a lot of testing and my writing projects.  So, I installed the drive in my machine (a slightly more complicated process than I expected that wound up involve type III paracord…) and reinstalled my necessary software.

Then, being a father of a three as well as a database professional and a grad student, I proceeded to put that idea aside for some time.  But I wanted numbers and a direct performance comparison, and thought they would be of interest to the broader database community.  So, once I found some time, I began running some tests for direct comparisons.

II.                  A Brief Background on SSDs

Traditional harddrives store the data on a small array of platters using magnetic charges.  In order to read the platters or write to them, the movable read/write heads must be properly positioned over the platters.  Most solid state drives are based on flash memory and often involve no moving parts whatsoever.

These different techniques to store data create several broad differences between HDDs and SSDs:

  1. SSDs tend to have dramatically lower latencies since they can access the entire drive all the time without having to align the head with the part of the drive to be read.
  2. SSDs tend to be much more resistant to being moved around while doing read/write operations.
  3. HDDs tend to be substantially larger and have a much better size/price ratio.

I carefully hedged by using “tend” rather than a more definitive word.  Both SSD and HDD technologies are still evolving and how large the differences are, or even if they hold true, can vary when comparing different SSDs with different HDDs and with the different circumstances they are used under.  But those three broad trends will hold true the vast majority of the time.

SSDs, because of the way they write to the drive, can be subject to write-cycle exhaustion.  This happens as parts of the drive are overwritten too many times and eventually cannot be written to any more.  To dramatically oversimplify this topic:

  1. According to Tom’s Hardware, it is rarely something a consumer making standard use of their drive needs to worry about.
  2. Ars Technica points out that many SSDs targeted at the enterprise use SLC instead of the consumer grade MLC flash, and SLC is much less susceptible to write-cycle exhaustion, but more expensive.
  3. The risk of write-cycle exhaustion can be significant under certain circumstances though, such as using a consumer grade MLC drive in a write intensive enterprise-style environment

There are also hybrid drives which combine relatively large standard harddrives with a smaller cache of flash memory which is used to reduce latency.

As for my personal experience, the change in overall performance of my system when I went from using a HDD to using an SSD as my primary system drive with the HDD for bulk storage was dramatic.  While I have not measured it precisely, the system boot time is now substantially shorter, as are load times for all programs.  The shortened load time is particularly noticeable in Spyder, a python IDE I use that used to take nearly a minute to fully load and now takes just a few seconds.  The computer as a whole is subjectively more responsive and “snappy”.  I think that for many modern computers, moving to a SSD is perhaps one of the best ways to improve the overall performance of the system.

III.                A Word on Methodology

Here, I am essentially looking at performance difference as they pertain to databases between the two drives in my home computer.  While I think this generalizes relatively well to give an idea of the database performance implications of SSD’s generally, it is important to emphasize that in the end my sample size is one standard harddrive and one SSD.  More than that, these are both consumer grade drives sitting in desktop computer.  In short, while I think this comparison is instructive, it is not a rigorous comparison of enterprise class drives that would be used in corporate servers, much less a look at large scale SANs which can readily incorporate multiple types of storage medium in complex tiers.

I selected two slightly different tests to time.  First, I took the Python script I wrote to compare the performance of SQL Server and SQLite and ran it with just the target drive changed.  Then I ran some of the long running scripts from Jeff Moden’s excellent Performance Tuning: Concatenation Functions and Some Tuning Myths article.

In this setup, I am using the same installation of Windows and SQL Server 2012, both installed on the SSD, and only the location of the database is really changing.  The relevant system specifications for reference are:

System Alienware (Modified)
Processor Intel Core i7 3.4GHz
Ram 16GB
OS Windows 7 Home
SQL SQL Server 2012 Expres

IV.                The Tests

a.       Rerunning the SQL Server and SQLite Tests

A little while ago I wanted to do a direct comparison between SQL Server and SQLite in a single user, desktop environment.  Since it already had a system with speed measurements in place, I reran all the tests twice, once with the destinations for the database pointing to the HDD and once with them pointing to the SSD.  The full details of the tests are in the previous article, but to briefly summarize, the test program creates a SQL Server and a SQLite table and then repeatedly inserts into the tables, sending a commit after each statement.  Then it repeatedly inserts into the tables, sending a commit only at the end, and then attempts to read from both of the tables.  The results are:

0.917 secs for 5000 inserts SQL Server (commit each)10.386 secs for 5000 inserts SQLite (commit each)0.435 secs for 5000 inserts SQL Server (commit once)0.020 secs for 5000 inserts SQLite (commit once)1.556 secs for 5000 reads SQL Server

1.960 secs for 5000 reads SQLite

2.120 secs for 5000 inserts SQL Server (commit each)538.364 secs for 5000 inserts SQLite (commit each)0.402 secs for 5000 inserts SQL Server (commit once)0.459 secs for 5000 inserts SQLite (commit once)1.518 secs for 5000 reads SQL Server

1.890 secs for 5000 reads SQLite

As expected, with a commit after each insert, which forces the server to write after every single insert, the SSD performs dramatically better.  It surprised me somewhat that the HDD came out better on the SQL Server inserts with one commit and even slightly faster on the SQL server reads.  Although I am not certain, I suspect this is because of the sophisticated caching that both the harddrive and SQL Server itself do.  So, I cleared the cache before executing the reads and got:

SSD (Cache cleared) HDD (Cache cleared)
1.564 secs for 5000 reads SQL Server 1.655 secs for 5000 reads SQL Server

This increased the time for reads on both drives slightly and shifted the advantage to the SSD where it was expected.

b.      Jeff Moden’s Concatenation Tests

Next, I turned back to Jeff Moden’s concatenation article.  This article came to mind because it helped give me more insight into performance tuning when I was much earlier in my SQL career and because it included a nice long running “Hog” Test that in his article took over 2 minutes and 50 seconds on the hardware he was using at the time.  So, I ran his setup code, and then created the function in the article, each twice, once on a database sitting on the SSD and once on a database sitting on the HDD.  Then I cleared the cache[1] and ran

SELECT DISTINCT SomeID, dbo.fnConcatTest(SomeID) AS CSVString
FROM dbo.TestData

Against each of them to get:

SSD (Cache cleared) HDD (Cache Cleared)
7 Secs 22 Secs

Clearly the SSD helped substantially.  This should not in any way to be taken to mean that improving the hardware can replace tuning the code though.  The “STUFF” version he provided at the end of his article had sub-second response times on the HDD, which is a nice reminder that frequently you can get better improvements by optimizing the code than by improving the hardware.


V.                  Conclusion

As the numbers show, the SSD generally performs better.  Though how much better varied by the circumstances, with SQL Server’s sophisticated caching system letting the HDD perform or even top the SSD under certain limited circumstances.

In most consumer and workstation systems currently in use, I suspect switching from a HDD to a SSD is one of the most effective upgrades currently available.  Moving purely to SSDs is probably prohibitively expensive for most large database applications, however they can serve as an effective part of a SAN to hold the most active data or to hold selected parts of the database which particularly require low latency with the bulk data held on HDDs in a simpler storage configuration.

VI.                More References

  1. The SSD Revolution – A series of feature articles on SSDs by ArsTechnica.
  2. In-depth on how SSDs really work– By Lee Hutchinson.  This is part of ArsTechnica’s SSD Revolution series, but goes into great depth on SSDs work and what those inner workings imply.  This is well worth reading.
  3. Is Your SSD More Reliable Than a Hard Drive by Andrew Ku

[1] It was significant to clear the cache here, without doing so the SSD and HDD times were identical.