SQLite and the Performance Implications of Indexes

Indexes make an enormous difference in the read speed of a database. I have previously mentioned that adding proper indexes is a significant part of improving overall database performance. We can use SQLite and Python to demonstrate how much of a difference indexes make in certain cases.

For this purpose, we create a database with a single table of 5,000 rows. Each row has two columns, one holding a random number and the other a random string. We then use the Python timeit library to time a query that finds rows where col1 is identical but col2 is different. Next, we measure how long it takes to both create an index on col1 and execute the same query. Just to ensure that the results are consistent, we drop the index and repeat the test. Altogether, the code to do this looks like:

#imports
import sqlite3
import random
import timeit
from collections import OrderedDict

#Constants used as settings
alpha = 'abcdefghijklmnopqrstuvwxyz'
numRows = 5000

def createRandomDB():
    conn = sqlite3.connect('TestingIndex.py')
    curs = conn.cursor()
    curs.execute('drop table if exists IndexTestTbl')
    conn.commit()
    curs.execute('create table IndexTestTbl (col1, col2)')
    conn.commit()
    for i in range(numRows):
        col1 = random.randint(1, numRows)
        col2 = ''.join(random.choice(alpha) for x in range(25))
        curs.execute('insert into IndexTestTbl values (?, ?)', (col1, col2))
    conn.commit() #commit is expensive, so execute it only once, after the loop

def readTable():
    #Does not actively destroy indexes. Assumes that the table lacks indexes
    conn = sqlite3.connect('TestingIndex.py')
    curs = conn.cursor()
    curs.execute('''select *
                    from IndexTestTbl t1
                    where exists (select * from IndexTestTbl t2
                                  where t1.col1 = t2.col1
                                  and t1.col2 != t2.col2)
                        ''')
    a = curs.fetchall()

def createIdxReadTable():
    conn = sqlite3.connect('TestingIndex.py')
    curs = conn.cursor()
    curs.execute('''create index if not exists
                     TestIdx on IndexTestTbl (col1)''')
    conn.commit()
    readTable()

###########################################
if __name__ == '__main__':
    resultsDict = OrderedDict() #Using OrderedDict to keep the plots in order
    createRandomDB()
    resultsDict['ReadNoIdx'] = timeit.timeit('readTable()',
            number = 1, setup='from __main__ import readTable')

    resultsDict['CreateIdxRead'] = timeit.timeit('createIdxReadTable()',
            number = 1, setup = 'from __main__ import createIdxReadTable')

    #drop the index
    conn = sqlite3.connect('TestingIndex.py')
    conn.execute('drop index TestIdx')
    conn.commit()

    resultsDict['ReadAfterDropIdx'] = timeit.timeit('readTable()',
            number = 1, setup='from __main__ import readTable')
    resultsDict['CreateIdxRead2'] = timeit.timeit('createIdxReadTable()',
            number = 1, setup = 'from __main__ import createIdxReadTable')

    #print the results
    print('Take one with no index required {} seconds'.format(resultsDict['ReadNoIdx']))
    print('Create Index and then read required {} seconds'.format(resultsDict['CreateIdxRead']))
    print('Read the table after dropping the Index {}'.format(resultsDict['ReadAfterDropIdx']))
    print('Create the index again and read takes {}'.format(resultsDict['CreateIdxRead2']))

    #graph the data
    import matplotlib.pyplot as plt
    width = .8
    fig = plt.figure()
    ax = plt.subplot(111)
    ax.bar(range(len(resultsDict)), resultsDict.values(), align = 'center', width = width)
    ax.set_xticks(range(len(resultsDict)))
    ax.set_xticklabels(resultsDict.keys(), rotation = 30, size = 'small')
    plt.tight_layout()
    plt.savefig('SqliteIndexTest.png')

The results:

Take one with no index required 1.088418062616674 seconds
Create Index and then read required 0.2158297379552767 seconds
Read the table after dropping the Index 1.0706404903284745
Create the index again and read takes 0.1789995581284507

[Results graph: SqliteIndexTest.png]

In this case, it took longer to run the query without the index than it did to create the index and then run the query, and the difference was quite substantial. Of course, this was deliberately set up as an extreme example. The rows in the heap were inserted in a random order, and the query involves a correlated subquery. I also made sure to run the test from a traditional hard drive instead of an SSD. Still, I have seen similar cases in production environments where it took less time to create an index and run the query it was made to support than it did to run the query without bothering to create the index.

It is possible to become overly reliant on indexes, and when there are performance issues in practice I find that there is often more room for improvement in the queries and the code than in the indexes. Still, under some circumstances the right indexes can make a dramatic difference in execution time.
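
If you want to verify that SQLite is actually using the index, EXPLAIN QUERY PLAN will show the plan it chooses. Here is a minimal sketch against the database built above (the exact output format varies between SQLite versions):

import sqlite3

conn = sqlite3.connect('TestingIndex.py')
plan = conn.execute('''explain query plan
                       select * from IndexTestTbl t1
                       where exists (select * from IndexTestTbl t2
                                     where t1.col1 = t2.col1
                                     and t1.col2 != t2.col2)''')
for row in plan:
    print(row) #the plan mentions TestIdx when the index exists
conn.close()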

MS Access – Store a Different Value than Displayed in a Combo Box

It is occasionally useful to display a different value in an Access combo box than the one actually stored in the table.  You might do this when you store the primary key from another table to make references easier but want to display more meaningful data, or simply to save storage space when that is a concern.  One way to do this is to bind the combo box to one column but have it display two columns, with the bound column given a width of 0.

Naturally, this is easier to see with an example, so let us make a totally artificial and painfully simple database to store book ratings.  We will have a table that lists the rating options with both a number and a text description of the rating.

[Image: the RatingsOptions table]
Then we’ll make a table with three columns: ID, BookName, and BookRating.  We will use this to store our ratings, and we will make BookRating a number, storing the value of the rating rather than the text description.  In a large database, this could save some space.  Then we can make a form based on BookRatings, with BookRating presented as a combo box.  Since this is just an example, we will keep the form almost painfully simple.
[Image: the BookRatings form in design view]
We will make certain that the BookRating field is bound to the first column, the primary key, of the RatingsOptions table.  Then we will set it to display two columns: the first effectively invisible with a width of zero, and the second displayed with a width of 1”.

[Image: the combo box property sheet]
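
The same settings can also be applied in VBA while the form is open. This is just a sketch of that idea; the RatingsOptions field names here are placeholders, so substitute the real column names:

With Forms("BookRatings").Controls("BookRating")
    .RowSource = "SELECT RatingID, RatingText FROM RatingsOptions"
    .BoundColumn = 1          'store the numeric key
    .ColumnCount = 2
    .ColumnWidths = "0;1440"  'widths in twips: hide the key, show ~1" of text
End With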

And this will give us the text descriptions listed as the options, but it will only store the numeric primary key.
[Image: the form and the underlying table]

Python Distributions

Python is free and open source software.  A programmer could download the executables for Python directly from the official site, or even download the source code and compile it.  But for someone who wants to get straight to programming in Python, it is generally better to get a Python distribution.  Distributions generally include a selection of commonly used third party libraries that are not part of core Python, a solid IDE, and perhaps other development tools as well.  I take a look at three excellent contenders for Python distributions here.

WinPython

[Image: the Spyder IDE from WinPython]

WinPython is the version of Python I use at home and the one I personally recommend to anyone who does not want to spend a lot of money on a distribution.  It comes with a wide array of third party libraries, including SciPy, making it ready to do many types of analytic work immediately.

WinPython is also designed so it can be run as portable software.  This makes it convenient for those who need more than one installation of Python, such as a 3.x and a 2.x installation running side by side; conveniently, WinPython offers both 3.x and 2.x versions.  Being portable also means there is no need for administrator access to the machine to use Python on it.

WinPython comes with Spyder, my favorite lightweight IDE.  I use other options when working on large applications requiring numerous components, but for small Python scripts or small programs I think Spyder is highly conducive to my workflow.  WinPython also includes other nice development tools like Qt Designer, and it has Cython for performance.

Enthought Canopy Python

Enthought Canopy is a commercially supported version of Python.  It does offer a free version, but that version is severely limited compared to the full one.  Canopy also installs into the user folder and generally does not need local administrator privileges.

Enthought Canopy comes with a large set of included libraries, including SciPy, and it includes its own IDE.  While the IDE works quite well, I personally prefer Spyder.  The debugger is excellent, but it is only included with the paid version, which also adds other features such as online training courses.  Enthought is currently only available in a 3.x version.

Overall, I think Enthought is an excellent distribution if you will use, and are willing to pay for, its premium features such as the debugger and the training courses.  If you are looking for a free distribution, then I think WinPython is generally superior, but a large part of that is that I am partial to Spyder.

Python (X, Y)

Python (X, Y) describes itself as “the scientific Python distribution” and is maintained by Gabi Davar.  This used to be my preferred distribution, and it remains a well-made one, though it updates more slowly than some others; as I write this, the last update was posted in June 2015.  It is also available only for Python 2.x.  Much like WinPython, it uses Spyder as its default IDE and comes with a variety of scientific libraries and additional tools.  Given the slow update pace, I view WinPython as a graceful upgrade path from Python (X, Y).

Conclusions and Other Distributions

Of course, those are far from the only distributions available; I have only discussed the ones I have personally used.  I have also heard good things about the Anaconda distribution, though I will not comment on it since I have not tried it.  From what I have personally experienced, I am happy to recommend Enthought for anyone who wants the premium version and is willing to pay for it.  If you want a free distribution, or you will not use Enthought’s premium features, then WinPython has served me well.

Making Changes to Many Forms In Access

Generally, I prefer to create custom interfaces for my databases using fully developed programming languages like Python or C#, because I like having full access to their computational capabilities, and in particular their graphing, even when the initial plan does not call for them; projects have a tendency to expand over time. But when I need to quickly create a simple front end for a database, I sometimes turn to MS Access.

One issue I have faced occasionally is being asked to make the same change to all of the forms or all of the reports. Fortunately, the VBA built into Access makes this relatively simple for certain types of changes. You can iterate through all forms, make the desired change, and then save that change so it is permanent.

This is best described with an example, though I have only tested this with Access 2010 and 2013. For this example, I am just going to create a database with one table and two forms. I will put the VBA code itself into a module. The code will look like:

Public Function ChangeBkColor()
    Dim obj As AccessObject 'Entities in Allforms are AccessObjects
    Dim CurForm As Form
    
    'We want everything to be closed to avoid any problems
    'with reopening later
    For Each CurForm In Application.Forms
        DoCmd.Close acForm, CurForm.Name
    Next CurForm
    
    For Each obj In Application.CurrentProject.AllForms
        'If not opened as design, it may not save the change
        DoCmd.OpenForm obj.Name, acDesign
        'There should only be one open form now
        Set CurForm = Application.Forms(0)
        'Change the color to red, but could change others
        CurForm.Section(acDetail).BackColor = RGB(100, 0, 0)
        'Setting acSaveYes here is significant
        DoCmd.Close acForm, CurForm.Name, acSaveYes
    Next obj

End Function

I normally execute it from the VBA editor by pressing F5 with the cursor in the code, but it could also be tied to a macro if that is more convenient. Executing it opens each form, changes the property in question, and then closes it again, saving the change. Techniques like this have saved me a fair bit of time when standardizing certain things across numerous forms in a database.

[Images: sample results and the changed form]

Dealing with SQL Server LOCK Resource Errors

I was recently doing some testing that required a fairly large table. I created the test table and set it up to insert about 30 million rows. I got an error that read:

The instance of the SQL Server Database Engine cannot obtain a LOCK resource at this time. Rerun your statement when there are fewer active users. Ask the database administrator to check the lock and memory configuration for this instance, or to check for long-running transactions.

Naturally, having not encountered that error before, I started by Googling it and looking up Error 1204 and Microsoft’s support article. The support article helpfully suggested setting locks to 0 using sp_configure, but that is the default value and mine was already set that way.
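
For reference, checking that setting looks something like this ('locks' is an advanced option, so it stays hidden until show advanced options is enabled):

exec sp_configure 'show advanced options', 1
reconfigure

exec sp_configure 'locks' --a config_value of 0 means dynamic lock management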

So, I created a simplified setup that would reliably reproduce the error, based on Jeff Moden’s Tally Table article.

if OBJECT_ID('dbo.BigTallyTable', 'U') is not null
    drop table dbo.BigTallyTable

create table dbo.BigTallyTable
(n bigint primary key)

insert into dbo.BigTallyTable (n)
select top (30000000)
    row_number() over (order by s1.name) as n
from master.dbo.syscolumns s1,
     master.dbo.syscolumns s2

And I started playing with it. Since this was all tied to locks, one thing I tried was breaking the insert up into batches:

declare @maxValue int, @maxPerRound int, @last int
set @maxValue = 30000000 --30 million
set @maxPerRound = 500000

set @last = 0
while @last < @maxValue
begin
    if @maxPerRound > @maxValue - @last
        set @maxPerRound = @maxValue - @last

    insert into dbo.BigTallyTable (n)
    select top (@maxPerRound)
        row_number() over (order by s1.name) + @last as n
    from master.dbo.syscolumns s1,
         master.dbo.syscolumns s2

    print 'Finished a round with @last = ' + cast(@last as varchar) +
          ' and @maxPerRound = ' + cast(@maxPerRound as varchar)

    set @last = @last + @maxPerRound
end

This avoided the error and generated the results I was looking for. On a multi-user system (I was doing this on the test instance of SQL Server on my personal machine), it would also have allowed other queries that were waiting for locks to execute between batches, though whether that is desirable depends on the circumstances.

I also got it to work by directing SQL Server to take a table lock right from the beginning, so that it never has to track millions of individual row locks:

insert into dbo.BigTallyTable with (TABLOCK) (n)
select top (30000000)
    row_number() over (order by s1.name) as n
from master.dbo.syscolumns s1,
     master.dbo.syscolumns s2

This ran substantially faster than splitting the insert into batches, and it is clearly a simpler query. It does take a more aggressive table lock which, as the article on lock escalation points out, can cause more problems with blocking other queries if that is a concern.

Order of a Result Set in SQL

I have seen people build procedures that rely on a result set being returned in a certain order.  This is fine if an order by clause is used, but SQL Server will not guarantee the order of the results without one.

It is interesting to look at the way SQL Server normally returns a result set, though.  So, let’s make a small sample table:

create table orderTest (n varchar(15))

insert into orderTest (n)
values (2),
       (1),
       (3),
       (1)

If we run a simple “select n from orderTest”, the results I get back are 2, 1, 3, 1.  In other words, it returns the results in the order they were inserted.  This is relatively consistent; I have tried similar tests on multiple versions of SQL Server on multiple machines.  As I said at the beginning, I have seen some programs rely on the data being returned in the order it was inserted, and those programs have gone for extended periods of time without any problems.  But I want to emphasize that SQL Server will not guarantee the order without an order by clause.  If you want to guarantee that you can get the data back in the order it was inserted, you can add an identity column and order by that.
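
A minimal sketch of that approach (the table and column names here are my own):

create table orderTest2
(id int identity(1,1),
 n varchar(15))

insert into orderTest2 (n)
values (2),
       (1),
       (3),
       (1)

select n
from orderTest2
order by id --now the insertion order is guaranteed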

The tendency to return things in the order they were inserted seems fairly stable with simple where clauses, but asking SQL Server to return distinct results will change the order as it checks for duplicates.  For instance, “select distinct n from orderTest” gives me 1, 2, 3.

And SQL Server seems to return the results of a union all by just concatenating the results of the two queries.  So

select n
from orderTest
UNION ALL
select n
from orderTest

Gives 2, 1, 3, 1, 2, 1, 3, 1: essentially, the contents of the table repeated twice.  But union makes the results distinct, and so it changes the order; redoing that with union instead of union all gives 1, 2, 3.
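
For completeness, the union version looks like this:

select n
from orderTest
UNION
select n
from orderTest --returns 1, 2, 3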

Indexes will also affect the order.  You may notice the tests were all done on a heap with no indexes.  If I add an index (either clustered or nonclustered), the order for “select n from orderTest” becomes 1, 1, 2, 3.  Of course, you cannot rely on the data being returned in index order either, unless there is an order by clause.  Similarly, as Conor Cunningham pointed out, the size of the dataset can affect the results and break the apparent order as the optimizer seeks to increase parallelization.
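
A minimal sketch of that test (the index name here is my own):

create index ix_orderTest_n on orderTest (n)

select n
from orderTest --now typically returns 1, 1, 2, 3, though still not guaranteed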

In short, SQL Server has certain ways it will normally return data, and it can be helpful to be aware of them.  But even small changes to the query, or changes in the size of the data, can have a major impact on the order in which the results are returned.  Without an order by clause, SQL Server is not guaranteed to return the data in the way it normally does, and you cannot rely on that order.  If the order actually matters, use an order by clause.

SQL Server Three Valued Logic and Ansi Nulls

SQL Server uses three-valued logic, with True, False, and Unknown. And normally, SQL Server nulls are not comparable. Together, that means that this little script:

if Null = Null -- Results in Unknown
print 'Num 1'

if Null != Null -- Results in Unknown
print 'Num 2'

if not (Null = Null) or not (Null != Null) -- Results in Unknown
print 'Num 3'

normally produces no results. It simply prints the standard ‘Command(s) completed successfully.’ One significant implication of this is that if the where clause includes something like

[column_name] = null

It will filter out all rows. The best way to look for Nulls is to use

[column_name] is null
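
To make the difference concrete, here is a small self-contained sketch using a table variable:

declare @t table (c int)
insert into @t values (1), (null)

select * from @t where c = null  --returns nothing: the comparison is Unknown
select * from @t where c is null --returns the row holding the null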

But whether or not nulls are comparable can be controlled with the ANSI_NULLS session setting. The default for ANSI_NULLS is ON, and in fact the documentation warns that in a future version ANSI_NULLS may always be ON, with no way to turn it off. Until then, this test script:

Set Ansi_nulls off

if Null = Null -- Now Results in True
print 'Num 1'

if Null != Null -- Now False
print 'Num 2'

if not (Null = Null) or not (Null != Null) -- Now True
print 'Num 3'

Dutifully prints out:

Num 1
Num 3

It is generally best to leave ANSI_NULLS in its standards-compliant default setting and use “is null” or “is not null”, but occasionally it can be useful to change it.
