SQLite and the Performance Implications of Indexes

Indexes make an enormous difference in the read speed of a database. I have previously mentioned that adding proper indexes is a significant part of improving overall database performance. We can use SQLite and Python to demonstrate how much of a difference indexes can make in certain cases.

For this purpose, we create a database with 5000 rows. Each row has two columns, one holding a random number and the other a random string. We then use the Python timeit library to time a query that finds rows where col1 is identical but col2 is different. Next we measure how long it takes to create an index and execute the same query, and finally, to ensure that the results are consistent, we drop the index and repeat both tests. Altogether, the code to do this looks like:

#imports
import sqlite3
import random
import timeit
from collections import OrderedDict

#Constants used as settings
alpha = 'abcdefghijklmnopqrstuvwxyz'
numRows = 5000

def createRandomDB():
    conn = sqlite3.connect('TestingIndex.py')
    curs = conn.cursor()
    curs.execute('drop table if exists IndexTestTbl')
    conn.commit()
    curs.execute('create table IndexTestTbl (col1, col2)')
    conn.commit()
    for i in range(numRows):
        col1 = random.randint(1, numRows)
        col2 = ''.join(random.choice(alpha) for x in range(25))
        curs.execute('insert into IndexTestTbl values (?, ?)', (col1, col2))
    conn.commit() #commit is expensive, so execute it only once, after the loop

def readTable():
    #Does not actively destroy indexes. Assumes that the table lacks indexes
    conn = sqlite3.connect('TestingIndex.py')
    curs = conn.cursor()
    curs.execute('''select *
                    from IndexTestTbl t1
                    where exists (select * from IndexTestTbl t2
                                  where t1.col1 = t2.col1
                                  and t1.col2 != t2.col2)
                        ''')
    a = curs.fetchall()

def createIdxReadTable():
    conn = sqlite3.connect('TestingIndex.py')
    curs = conn.cursor()
    curs.execute('''create index if not exists
                     TestIdx on IndexTestTbl (col1)''')
    conn.commit()
    readTable()

###########################################
if __name__ == '__main__':
    resultsDict = OrderedDict() #Using OrderedDict to keep the plots in order
    createRandomDB()
    resultsDict['ReadNoIdx'] = timeit.timeit('readTable()',
            number = 1, setup='from __main__ import readTable')

    resultsDict['CreateIdxRead'] = timeit.timeit('createIdxReadTable()',
            number = 1, setup = 'from __main__ import createIdxReadTable')

    #drop the index
    conn = sqlite3.connect('TestingIndex.py')
    conn.execute('drop index TestIdx')
    conn.commit()

    resultsDict['ReadAfterDropIdx'] = timeit.timeit('readTable()',
            number = 1, setup='from __main__ import readTable')
    resultsDict['CreateIdxRead2'] = timeit.timeit('createIdxReadTable()',
            number = 1, setup = 'from __main__ import createIdxReadTable')

    #print the results
    print('Take one with no index required {} seconds'.format(resultsDict['ReadNoIdx']))
    print('Create Index and then read required {} seconds'.format(resultsDict['CreateIdxRead']))
    print('Read the table after dropping the Index {}'.format(resultsDict['ReadAfterDropIdx']))
    print('Create the index again and read takes {}'.format(resultsDict['CreateIdxRead2']))

    #graph the data
    import matplotlib.pyplot as plt
    width = .8
    fig = plt.figure()
    ax = plt.subplot(111)
    ax.bar(range(len(resultsDict)), resultsDict.values(), align = 'center', width = width)
    ax.set_xticks(range(len(resultsDict)))
    ax.set_xticklabels(resultsDict.keys(), rotation = 30, size = 'small')
    plt.tight_layout()
    plt.savefig('SqliteIndexTest.png')

The results:

Take one with no index required 1.088418062616674 seconds
Create Index and then read required 0.2158297379552767 seconds
Read the table after dropping the Index 1.0706404903284745
Create the index again and read takes 0.1789995581284507

[Figure: Results Graph, a bar chart of the four timings above]

In this case, running the query without the index took longer than creating the index and then running the query, and the difference was substantial. Of course, this was deliberately set up as an extreme example: the rows were inserted into the heap in random order, the query uses a correlated subquery, and I made sure to run the test from a traditional hard drive instead of an SSD. Still, I have seen similar cases in production environments where it took less time to create an index and run the query it was meant to support than it did to run the query without bothering to create the index.
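
If you want to confirm whether SQLite actually uses the index, EXPLAIN QUERY PLAN provides a quick check. A minimal sketch against the table created above (file, table, and index names taken from the script; the exact output text varies between SQLite versions) might look like:

import sqlite3

conn = sqlite3.connect('TestingIndex.py')
query = '''select *
           from IndexTestTbl t1
           where exists (select * from IndexTestTbl t2
                         where t1.col1 = t2.col1
                         and t1.col2 != t2.col2)'''
#without TestIdx the subquery shows up as a full table scan;
#with the index it should show a search using TestIdx on col1
for row in conn.execute('explain query plan ' + query):
    print(row)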

It is possible to become overly reliant on indexes; when there are performance issues in practice, I find that there is often more room for improvement and optimization in the queries and code than in the indexes. Still, under some circumstances the right indexes can make a dramatic difference in execution time.

Python Distributions

Python is free and open source software.  A programmer can download the Python executables directly from the official site, or even download the source code and compile it themselves.  But for someone who wants to get straight to programming in Python, it is generally better to get a Python distribution.  Distributions generally include a selection of commonly used third party libraries that are not part of the Python core, a solid IDE, and perhaps other development tools as well.  I take a look at three excellent contenders here.

WinPython

[Figure: the Spyder IDE from WinPython]

WinPython is the version of Python I use at home and the one I personally recommend to anyone who does not want to spend a lot of money on a distribution.  It comes with a wide array of third party libraries, including SciPy, making it ready for many types of analytic work immediately.

WinPython is also designed so it can be run as portable software.  This makes it convenient for those who need more than one installation of Python, such as a 3.x and a 2.x installation running side by side, and WinPython conveniently offers both 3.x and 2.x versions.  Being portable also means there is no need for administrator access to the machine in order to use Python on it.

WinPython comes with Spyder, my favorite lightweight IDE.  I use other options when working on large applications with numerous components, but for small Python scripts or small programs I find Spyder highly conducive to my workflow.  WinPython also includes other nice development tools like Qt Designer, and has Cython available for performance.

Enthought Canopy Python

Enthought Canopy is a commercially supported distribution of Python.  It does offer a free version, but that version is severely limited compared to the full one.  Enthought also installs into the user folder and generally does not need local administrator privileges.

Enthought Canopy comes with a large set of included libraries, including SciPy, and it includes its own IDE.  While that IDE works quite well, I personally prefer Spyder.  The debugger is excellent, but it is only included with the paid version, which also adds other features such as online training courses.  Enthought is currently only available in a 3.x version.

Overall, I think Enthought is an excellent distribution if you will use, and are willing to pay for, its premium features such as the debugger and the training courses.  If you are looking for a free distribution, then I think WinPython is generally superior, though a large part of that is that I am partial to Spyder.

Python (X, Y)

Python (X, Y) describes itself as “the scientific Python distribution” and is maintained by Gabi Davar.  It used to be my preferred distribution and it remains a well-made one, but it updates more slowly than some of the others; as I write this, the last update was posted in June 2015.  It is also available only for Python 2.x.  Much like WinPython, it uses Spyder as its default IDE and comes with a variety of scientific libraries and additional tools.  Given the slow update pace, I view WinPython as a graceful upgrade path from Python (X, Y).

Conclusions and Other Distributions

Of course, those are far from the only distributions available; I have only discussed the ones I have personally used.  I have also heard good things about the Anaconda distribution, though I will not comment on it since I have not tried it.  From what I have personally experienced, I am happy to recommend Enthought for anyone who wants, and is willing to pay for, the premium version.  If you want a free distribution, or you will not use Enthought’s premium features, then WinPython has served me well.

Playing with Cython

Intro

Recently, I came across Cython and started experimenting with it.  After some basic testing, I found several things of interest.

  1. Used properly, Cython is a fantastic way to speed up Python code.
  2. It is extremely liberal in what Python code it will accept, but works best when the code is tweaked for Cython.
  3. Cython works extremely well, but it is not magic and can reduce performance and make code more brittle if not used carefully.
  4. Really, use the profiler before doing any serious optimization work.

What is Cython?

Cython is a programming language that compiles to C or C++ code, which can then be compiled with a normal C/C++ compiler.  It supports optional static typing of variables with C types in order to improve performance.  One common use is to create optimized extension modules that are imported into Python.  Cython also appears to be able to create standalone executables, but I have not tried that yet.

The story and what I found.

When Cython caught my interest, I read about it on the web and played just a bit with some of the examples that were posted.  Then I decided I should play with it in something a bit closer to the real world and preferably with code that I had written.

I had a small project lying around that I had been toying with, and it seemed perfect.  The entire thing was under a thousand lines of actual working code, and I had it cleanly split into two files: one that did all the work and one that just supported a PyQt GUI.  The module with the working functions could be exercised completely separately from the GUI, I had unit tests for all the important functions, and I was unhappy with the performance.  This seemed like a perfect opportunity to test Cython.

I knew that under normal circumstances my first step in optimizing my simple program should be to run it through the profiler.  I’ve read that many times and I’ve seen good reasons for it in practice.  But this time my goal was to play with Cython; if it fixed the performance issues in the process, that would be a nice bonus.  So I was going to mess with Cython no matter what the profiler said and didn’t see a reason to bother with it at this point.

So I forked the project into a new folder and set aside the GUI, focusing on the working module for now.  I pulled nearly all of the code out into a new .pyx file and left the original .py file as a small shell that set a couple of parameters and called the main function.  Then, at the top of that file, I added:

import pyximport; pyximport.install()
import working_module  #the new .pyx module (name is a placeholder)

Then I ran it, and it took substantially longer than the original pure Python file.  It didn’t take much looking to see that the extra time came from compiling the new .pyx module.  So I ran it again, and this time it was only a fraction of a second slower than it originally was, though still slower than pure Python.  Just to play with it, I built a setup.py and made a .pyd file to take pyximport entirely out of the equation.  In this case it didn’t make any measurable difference, but I have not done extensive testing on it, and there are cases where a setup.py is needed to compile the file.
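
For reference, a minimal setup.py for compiling a .pyx file ahead of time generally looks something like the sketch below; working_module.pyx is a placeholder name, not the actual project file.

from setuptools import setup
from Cython.Build import cythonize

#build in place with: python setup.py build_ext --inplace
setup(ext_modules = cythonize("working_module.pyx"))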

These initial tests showed me two things immediately.  The first is that Cython is extremely liberal in what it accepts.  The Python code I ran through the Cython compiler included calls to import various libraries and other idiomatic Python features, and Cython handled it all without complaint.  I have seen references saying that Cython is not a full Python implementation (at least not yet) and that some types of valid Python will fail, but it certainly seems to handle most cases, and all of the unit tests continued to pass at this point.  The second is that, while many examples show Cython speeding up certain types of Python code with no changes at all, it is not magic, and certain types of Python programs can actually be slowed down by using it.

Along those lines, I played with passing in the “wrong” type; namely, I passed in a float where I had declared an int in the Cython code.  The Python version happily switched to outputting float values.  The Cython version executed without errors, but the output remained an int and was now substantially rounded compared to the results from the Python version.

So I pulled out just one class, which mostly did various kinds of arithmetic, and meticulously went through it changing def to cpdef and adding C types everywhere it made sense to do so.  I tried that, and there was still no speedup.
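
To illustrate the kind of change involved, here is a sketch of the pattern rather than the actual class; the function names and bodies are hypothetical.

#plain Python version, as it might appear in the original .py file
def scale_py(values, factor):
    return [v * factor for v in values]

#the .pyx version after changing def to cpdef and declaring C types
cpdef list scale_cy(list values, double factor):
    cdef int i
    cdef list result = []
    for i in range(len(values)):
        result.append(values[i] * factor)
    return result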

So I finally ran it through the profiler, something that under normal circumstances would have been my first action.  I found that the vast majority of the time was being spent in a PIL function that got called several times (though the functions from the converted class did seem to be minutely faster).  This was a solid reminder of the importance of profiling before doing any kind of optimization.  Here the time certainly wasn’t wasted, since it gave me a chance to play with Cython, but had I started with the profiler I might readily have decided this wasn’t a great test case and found a different one.
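
For anyone who has not used it, the standard library’s cProfile module makes this kind of check straightforward. A minimal sketch, where main() stands in for whatever entry point the script actually uses, is:

import cProfile
import pstats

#profile the entry point, save the stats, and print the ten most expensive calls
cProfile.run('main()', 'profile_stats')
pstats.Stats('profile_stats').sort_stats('cumulative').print_stats(10)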

To keep playing with it, I set up a new Python file that imported that class in both its pure Python version and its Cython version with type declarations, and ran some of the methods a few thousand times with timeit.  This showed the Cython version running in about a tenth of the time.
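
The comparison harness was essentially the same timeit pattern used in the SQLite test above; a sketch of the idea, with placeholder module, class, and method names, is:

import timeit

#pure_version and cython_version are placeholders for the two builds of the module
py_time = timeit.timeit('pure_version.MathClass().calculate()',
        number = 5000, setup = 'import pure_version')
cy_time = timeit.timeit('cython_version.MathClass().calculate()',
        number = 5000, setup = 'import cython_version')
print('Pure Python: {}  Cython: {}'.format(py_time, cy_time))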

Conclusions

Cython seems to have a great ability to increase the speed of Python code and will almost certainly be a tool I use repeatedly.  With that said, getting the full advantage of it requires some additional work.  Also, the first step in almost any sensible optimization plan should be to apply the profiler.

Class and instance variables in Python 2.7

The differences and interactions between class variables and instance variables are well described in the Python 2.x documentation, but they can be subtle in some cases, and I recently spent quite some time troubleshooting an issue with them even though I should have known better.

To put it simply, class variables are defined outside of any method of the class, normally right below the doc string.  They can be referenced directly from the class, which makes them one way to create an enumerated constant, or from any instance of the class.  Instance variables are defined inside a method, normally __new__ or __init__, and they are local to that instance.

For a simple, contrived example:

class TestClass(object):
    c = 299792458  #this is a class variable

    def __init__(self):
        self.e = 2.71 #instance variable

    def phiFunc(self):
        self.phi = 1.618 #will become an instance variable after phiFunc is called
        x=224 #this is a method variable and can't be accessed outside this method

assert TestClass.c == 299792458
try:
    print TestClass.e #Not defined
except AttributeError:
    pass

testInst = TestClass()
assert testInst.c == 299792458 #instance gets c, currently a ref to the class var
assert testInst.e == 2.71 #got e because of __init__
try:
    print testInst.phi #error since not defined yet
except AttributeError:
    pass

testInst.phiFunc()
assert testInst.phi == 1.618 #now it's here
try:
    testInst.x #still undefined
except AttributeError:
    pass

Class variables can be useful for constants that need to be used by all instances of the class, or that are meant to be accessed directly from the class.  They can also be used to set defaults for instance variables.  But it is important to remember that if the value of a class variable is changed, then all instances that refer back to the class variable will reflect that change.  So, to carry on with our contrived TestClass:

TestClass.c = 1
assert testInst.c == 1 #Referring back to the class, so it changed too.

But when an instance has an attribute with the same name as a class attribute, the instance attribute essentially hides the class attribute.  And assigning a value to an attribute of an instance assigns it to that instance only, creating the instance attribute if it does not already exist.  So:

class testClass2 (object):
    sqrttwo = 1.41
    sqrrtthree = 1.73

    def __init__(self):
        self.sqrttwo = 1

assert testClass2.sqrttwo == 1.41 #access class variable

testInst2 = testClass2()
assert testInst2.sqrttwo == 1 #the instance variable hides the class variable

testInst2.sqrrtthree = 2 #assigning creates an instance attribute
assert testClass2.sqrrtthree == 1.73 #So the value in the class is unchanged

This can get complicated when the class variable is a mutable type, like a list.  Python assignment binds the new name to the same object rather than making a copy, so the instance variable and the class variable end up referring to the same list.  For instance:

class classWList(object):
    defaultList = [1,]

    def __init__(self):
        self.instList = self.defaultList

instance = classWList()
assert instance.instList == [1,] #same as defaultList
instance.instList.append(2)
assert classWList.defaultList == [1, 2] #the class variable also has 2 now.
assert id(classWList.defaultList) == id(instance.instList) #they point to the same memory address

Of course, that could be handled by explicitly making a copy instead of a simple assignment.
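
A minimal sketch of that fix, using a hypothetical classWListCopy to stay parallel with the example above, is to copy the list in __init__ rather than assigning a reference to it:

class classWListCopy(object):
    defaultList = [1,]

    def __init__(self):
        #list() creates a new, independent copy of the class variable;
        #copy.deepcopy would be needed if the list held nested mutable objects
        self.instList = list(self.defaultList)

instance2 = classWListCopy()
instance2.instList.append(2)
assert classWListCopy.defaultList == [1,] #the class variable is unchanged
assert instance2.instList == [1, 2] #only the instance copy changed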

In short, both class variables and instance variables are useful, but there can be some subtleties to the way they interact that need to be remembered when working with them.