SQLite and the Performance Implications of Indexes

Indexes make an enormous difference in the read speed of a database. I have previously mentioned that adding proper indexes is a significant part of improving overall database performance. We can use SQLite and Python to demonstrate how much of a difference indexes make in certain cases.

For this purpose, we create a database with 5000 rows. Each row will have two columns, one with a random number and the other with a random string. Then we will use the Python timeit library to test the speed of a query that finds rows where col1 is identical but col2 is different. Then we will see how long it takes to both create an index and execute the same query. Just to ensure that the results are consistent, we will drop the index and then repeat the test. Altogether, the code to do this looks like:

#imports
import sqlite3
import random
import timeit
from collections import OrderedDict

#Constants used as settings
alpha = 'abcdefghijklmnopqrstuvwxyz'
numRows = 5000

def createRandomDB():
    conn = sqlite3.connect('TestingIndex.py')
    curs = conn.cursor()
    curs.execute('drop table if exists IndexTestTbl')
    conn.commit()
    curs.execute('create table IndexTestTbl (col1, col2)')
    conn.commit()
    for i in range(numRows):
        col1 = random.randint(1, numRows)
        col2 = ''.join(random.choice(alpha) for x in range(25))
        curs.execute('insert into IndexTestTbl values (?, ?)', (col1, col2))
    conn.commit() #commit is expensive, so only execute it after the loop

def readTable():
    #Does not actively destroy indexes. Assumes that the table lacks indexes
    conn = sqlite3.connect('TestingIndex.py')
    curs = conn.cursor()
    curs.execute('''select *
                    from IndexTestTbl t1
                    where exists (select * from IndexTestTbl t2
                                  where t1.col1 = t2.col1
                                  and t1.col2 != t2.col2)
                        ''')
    a = curs.fetchall()

def createIdxReadTable():
    conn = sqlite3.connect('TestingIndex.py')
    curs = conn.cursor()
    curs.execute('''create index if not exists
                     TestIdx on IndexTestTbl (col1)''')
    conn.commit()
    readTable()

###########################################
if __name__ == '__main__':
    resultsDict = OrderedDict() #Using OrderedDict to keep the plots in order
    createRandomDB()
    resultsDict['ReadNoIdx'] = timeit.timeit('readTable()',
            number = 1, setup='from __main__ import readTable')

    resultsDict['CreateIdxRead'] = timeit.timeit('createIdxReadTable()',
            number = 1, setup = 'from __main__ import createIdxReadTable')

    #drop the index
    conn = sqlite3.connect('TestingIndex.py')
    conn.execute('drop index TestIdx')
    conn.commit()

    resultsDict['ReadAfterDropIdx'] = timeit.timeit('readTable()',
            number = 1, setup='from __main__ import readTable')
    resultsDict['CreateIdxRead2'] = timeit.timeit('createIdxReadTable()',
            number = 1, setup = 'from __main__ import createIdxReadTable')

    #print the results
    print('Take one with no index required {} seconds'.format(resultsDict['ReadNoIdx']))
    print('Create Index and then read required {} seconds'.format(resultsDict['CreateIdxRead']))
    print('Read the table after dropping the Index {}'.format(resultsDict['ReadAfterDropIdx']))
    print('Create the index again and read takes {}'.format(resultsDict['CreateIdxRead2']))

    #graph the data
    import matplotlib.pyplot as plt
    width = .8
    fig = plt.figure()
    ax = plt.subplot(111)
    ax.bar(range(len(resultsDict)), list(resultsDict.values()), align = 'center', width = width)
    ax.set_xticks(range(len(resultsDict)))
    ax.set_xticklabels(resultsDict.keys(), rotation = 30, size = 'small')
    plt.tight_layout()
    plt.savefig('SqliteIndexTest.png')

The results:

Take one with no index required 1.088418062616674 seconds
Create Index and then read required 0.2158297379552767 seconds
Read the table after dropping the Index 1.0706404903284745
Create the index again and read takes 0.1789995581284507

 

Results Graph (SqliteIndexTest.png)

In this case, it took longer to run the query without the index than it did to create the index and run the query, and the difference was quite substantial. Of course, this was deliberately set up as an extreme example. The rows in the heap were inserted in a random order and the query involved a correlated subquery. I also made sure that I ran the test from a traditional hard drive instead of an SSD. Still, I have seen similar cases in production environments where it took less time to create an index and run the query it was made to support than it did to run the query without bothering to create the index.

It is possible to become overly reliant on indexes, and when there are performance issues in practice I find that there is often more room for improvement and optimization in the queries and code than in the indexes. Still, under some circumstances the right indexes can make a dramatic difference in execution time.
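For what it is worth, SQLite can report whether a query will actually use an index through its EXPLAIN QUERY PLAN statement. Here is a minimal sketch of how the query above could be checked against the test database; the file and table names are the ones used in the script, and this snippet is just illustrative rather than part of the original test:

import sqlite3

conn = sqlite3.connect('TestingIndex.py')
curs = conn.cursor()
#ask SQLite how it plans to run the correlated subquery
curs.execute('''explain query plan
                select * from IndexTestTbl t1
                where exists (select * from IndexTestTbl t2
                              where t1.col1 = t2.col1
                              and t1.col2 != t2.col2)''')
for row in curs.fetchall():
    print(row) #with TestIdx in place, the plan should mention the index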

Python Distributions


Python is free and open source software.  A programmer could download the executables for Python directly from the official site, or even download the source code and compile it themselves.  But for someone wanting to get straight to programming in Python, it is generally better to get a Python distribution.  Distributions generally include a selection of commonly used third-party libraries that are not included with the core of Python, a solid IDE, and perhaps other development tools as well.  I take a look at three excellent contenders for Python distributions here.

WinPython

 

Spyder IDE from WinPython

WinPython is the version of Python I use at home and the one I personally recommend to anyone who does not want to spend a lot of money on a distribution.  It comes with a wide array of third-party libraries, including SciPy, making it ready for many types of analytic work immediately.

WinPython is also designed so it can be run as portable software.  This makes it convenient for those who need more than one installation of Python, such as a 3.x and a 2.x installation running side by side.  Conveniently, WinPython offers both 3.x and 2.x installations.  Being portable also means there is no need for administrator access to the machine to use Python on it.

WinPython comes with Spyder, my favorite lightweight IDE.  I use other options when working on large applications requiring numerous components, but for small Python scripts or small programs I think Spyder is highly conducive to my workflow.  WinPython also includes other nice development tools like Qt Designer and has Cython for performance.

 

Enthought Canopy Python

 

Enthought Canopy is a commercially supported version of Python.  It does offer a free version, but that is severely limited compared to the full version.  Enthought is also installed in the user folder and generally does not need local Administrator privileges.

Enthought Canopy comes with a large set of included libraries, including SciPy.  It also includes its own IDE.  While the IDE works quite well, I personally prefer Spyder.  The debugger is excellent, but it is only included with the paid version.  The paid version also includes other additional features such as online training courses.  Enthought is currently only available in a 3.x version.

Overall, I think Enthought is an excellent distribution if you will use, and are willing to pay for, its premium features such as its debugger and its training courses.  If you are looking for a free distribution, then I think WinPython is generally superior, but a large part of that is that I am partial to Spyder.

 

Python (X, Y)

Python (X, Y) describes itself as “the scientific Python distribution” and is maintained by Gabi Davar.  This used to be my preferred distribution, and it remains a well-made Python distribution, though it has a slower update pace than some others.  As I write this, the last update was posted in June 2015.  It is also available only for Python 2.x.  Much like WinPython, it uses Spyder as its default IDE and comes with a variety of scientific libraries and additional tools.  Given the slow update pace, I view WinPython as a graceful upgrade path from Python(X, Y).

Conclusions and Other Distributions

Of course, those are far from the only distributions available; I only discussed the ones that I have personally used.  I have also heard good things about the Anaconda distribution, though I will not comment on it since I have not tried it.  From what I have personally experienced, I am happy to recommend Enthought for anyone who wants and is willing to pay for the premium version.  If you want a free distribution, or if you will not use Enthought’s premium features, then WinPython has served me well.


A really simple Multiprocessing Python Example

Purpose and introduction

A Python program will not be able to take advantage of more than one core or more than one CPU by default.  One way to get the program to take advantage of multiple cores is through the multiprocessing module.  There are lots of excellent references and tutorials available on the web (links are included at the bottom), but one thing I was not able to find back when I first started using multiprocessing was a detailed look at an extremely simple, but still practical, example.  Sometimes, it is useful when dealing with a new technique to see it in a very simple form, but not so simple as some of the completely contrived examples in the library documentation.

So, here is a look at an extremely simple example using an embarrassingly parallel problem: generating the Mandelbrot set.  The algorithm used is basically a direct adaptation of the one presented in pseudo-code on Wikipedia, grouping the pixels into rows to make it easier to pass off to multiprocessing.  Just to be clear, this is far from the fastest, best, or most elegant way to use Python to calculate the Mandelbrot set, but it does provide a fairly good springboard for using multiprocessing while still doing actual work.

Essentially, this provides a straightforward example, with explanations, of running a function against a list of arguments using multiprocessing and then gathering the results together into a list.

A look at a single processor version

Here is the code for a single processor version:

import matplotlib.pyplot as plt
from functools import partial

def mandelbrotCalcRow(yPos, h, w, max_iteration = 1000):
    y0 = yPos * (2/float(h)) - 1 #rescale to -1 to 1
    row = []
    for xPos in range(w):
        x0 = xPos * (3.5/float(w)) - 2.5 #rescale to -2.5 to 1
        iteration, z = 0, 0 + 0j
        c = complex(x0, y0)
        while abs(z) < 2 and iteration < max_iteration:
            z = z**2 + c
            iteration += 1
        row.append(iteration)

    return row

def mandelbrotCalcSet(h, w, max_iteration = 1000):
    partialCalcRow = partial(mandelbrotCalcRow, h=h, w=w, max_iteration = max_iteration)
    mandelImg = map(partialCalcRow, xrange(h))
    return mandelImg

mandelImg = mandelbrotCalcSet(400, 400, 1000)
plt.imshow(mandelImg)
plt.savefig('mandelimg.jpg')

The modifications needed to use multiprocessing

Obviously, to use multiprocessing, we need to import it, so towards the top, we add:

import multiprocessing

The mandelbrotCalcRow function can remain unchanged.  The main changes are to the mandelbrotCalcSet function, which now looks like:

def mandelbrotCalcSet(h, w, max_iteration = 1000):
    #make a helper function that better supports pool.map by using only 1 var
    #This is necessary since pool.map only accepts a single iterable of arguments
    partialCalcRow = partial(mandelbrotCalcRow, h=h, w=w, max_iteration = max_iteration)

    pool = multiprocessing.Pool() #creates a pool of worker processes
    #the pool.map only accepts one iterable, so use the partial function
    #so that we only need to deal with one variable.
    mandelImg = pool.map(partialCalcRow, xrange(h)) #make our results with a map call
    pool.close() #no more tasks will be submitted to this pool
    pool.join() #wait until all the worker processes are done before going on

    return mandelImg

Here, Pool creates the pool of worker processes and manages them.  It gets the environment ready to run multiple tasks.  One of the easiest ways to use the pool is through its map method, which takes a function and an iterable of parameters.  The function is then called for each parameter in the iterable and the results are put into a list, distributing the calls over the available worker processes.

One significant difference between pool.map and the built-in map, other than the fact that pool.map can take advantage of multiple processors, is that pool.map will only take a single iterable of arguments for processing.  That is why I created a partial function, which freezes the other arguments.

pool.close() then informs the pool that no new tasks will be added to it.  Either pool.close or pool.terminate needs to be called before pool.join can be called.  pool.join waits for all of the results to be finished and collected before proceeding with the rest of the program.  This gives a simple way to collect the results into a single list for use later.

The other significant change is that the main portion, the entry point of the script, needs to be wrapped in an "if __name__ == '__main__'" conditional on Windows.  This is because the main module needs to be safely importable by a new Python interpreter.  Not doing this can result in problems such as a RuntimeError, or even completely locking up the system in some of the tests I tried.  This, and a couple of other caveats, are mentioned in the Programming Guidelines section of the multiprocessing documentation.

So, the entry point now looks like:

if __name__=='__main__':
    mandelImg = mandelbrotCalcSet(400, 400, 1000)
    plt.imshow(mandelImg)
    plt.savefig('mandelimg.jpg')

In this example, the multiprocessing version only has 8 additional lines of code (it is 15 lines longer, but 7 of those lines are additional whitespace or comment lines I added to make it easier to read), yet it runs in less than a third of the time.
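If you want to check the timing difference on your own machine, a rough comparison can be made with the time module.  This is only a sketch: it assumes the single-processor and multiprocessing versions have been saved as modules named mandel_single and mandel_multi (those names are made up for illustration), and that their module-level plotting calls have been moved under their own "if __name__ == '__main__'" guards so that importing them does not trigger the calculation:

import time
import mandel_single #hypothetical module with the single-processor version
import mandel_multi  #hypothetical module with the multiprocessing version

if __name__ == '__main__': #still required on Windows because of multiprocessing
    start = time.time()
    mandel_single.mandelbrotCalcSet(400, 400, 1000)
    print('Single process: {} seconds'.format(time.time() - start))

    start = time.time()
    mandel_multi.mandelbrotCalcSet(400, 400, 1000)
    print('Multiprocessing: {} seconds'.format(time.time() - start))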

Of course, it is worth remembering the saying that “premature optimization is the root of all evil.”  It is normally smart to get the code working first, and then consider bringing in multiprocessing options.

And the result is the Mandelbrot image, saved as mandelimg.jpg.

Some related links.

  1. Multiprocessing Docs
  2. The examples in the documentation.
  3. Wiki.cython.org has an example of creating the Mandelbrot set using Cython.  For actually generating the set rather than just making examples for multiprocessing, that version is much better.
  4. SciPy.org has a good discussion of parallel programming with numpy and scipy.

 

{Edit 10 Jan 13 – Corrected a minor spelling error.}
{Edit 20 May 14 – Corrected typos.}

Playing with Cython

Intro

Recently, I came across Cython and started experimenting with it.  After some basic testing, I found several things of interest.

  1. Used properly, Cython is a fantastic way to speed up Python code.
  2. It is extremely liberal in what Python code it will accept, but works best when the code is tweaked for Cython.
  3. Cython works extremely well, but it is not magic and can reduce performance and make code more brittle if not used carefully.
  4. Really, use the profiler before doing any serious optimization work.

What is Cython?

Cython is a programming language that compiles to C or C++ code, which can then be compiled with a normal C/C++ compiler.  It supports optional static typing of variables with C types in order to improve performance.  One common usage is to create optimized extension modules to be imported into Python.  Cython also seems able to create standalone executables, but I have not tried that yet.

The story and what I found.

When Cython caught my interest, I read about it on the web and played just a bit with some of the examples that were posted.  Then I decided I should play with it in something a bit closer to the real world and preferably with code that I had written.

I had a small project lying around that I had been toying with, and it seemed perfect.  The entire thing was under a thousand lines of actual working code, and I had it cleanly split into two files: one that did all the work and one that just supported a PyQt GUI.  The module with the working functions could be exercised completely separately from the GUI, I had unit tests for all the important functions, and I was unhappy with the performance.  This seemed like a perfect opportunity to test Cython.

I knew that under normal circumstances my first step in optimizing my simple program should be to run it through the profiler.  I’ve read that many times and I’ve seen good reasons for it in practice.  But this time my goal was to play with Cython; if it fixed the performance issues in the process, that would be a nice bonus.  So, I was going to mess with Cython no matter what the profiler said and didn’t see a reason to bother with it at this point.

So, I forked the project into a new folder and set aside the GUI, focusing on the working module for now.  I pulled out nearly all of the code and put it into a new .pyx file.  I left the original .py file as a small shell that set a couple of parameters and called the main function.  Then, at the top of that file, I added:

import pyximport; pyximport.install()
import (the pyx module)

Then I ran it, and it took substantially longer than the original pure Python file.  It didn’t take much looking to see that this was because of compiling the new pyx module.  So, I ran it again, and the run time dropped to only a fraction of a second slower than it originally was, but it was still slower than pure Python.  Just to play with it, I built a setup.py and made a pyd file to take pyximport entirely out of the equation.  In this case, it didn’t make any measurable difference, but I have not done extensive testing on it, and there are cases where a setup.py is needed to compile the file.
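For reference, the setup.py needed to build the extension ahead of time is only a few lines.  This is a minimal sketch using the standard Cython.Build machinery; yourmodule.pyx is a placeholder name rather than the actual file from my project:

#setup.py - build in place with: python setup.py build_ext --inplace
from distutils.core import setup
from Cython.Build import cythonize

setup(ext_modules = cythonize('yourmodule.pyx')) #placeholder module name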

These initial tests showed me two things immediately.  The first is that Cython is extremely liberal in what it accepts.  The Python code I ran through the Cython compiler included calls to import various libraries and other idiomatic Python features, and Cython handled them without complaint.  I have seen references saying that Cython is not a full Python implementation (at least not yet) and that some types of valid Python will fail, but it certainly seems to handle most cases.  All of the unit tests continued to pass at this point.  The second is that while many examples show that Cython can speed up certain types of Python code with no changes at all, it is not magic, and certain types of Python programs can actually be slowed down by using Cython.

Along those lines, I played with it by passing in the “wrong” type; namely, I passed a float where I had declared an int in the Cython code.  The Python version happily switched to outputting float values.  The Cython version executed without errors, but the output remained an int and was now substantially rounded compared to the results from the Python version.

So I pulled out just one class, one that mostly did different types of arithmetic, and meticulously went through it changing def to cpdef and adding types everywhere it made sense to do so.  I tried that, and there was still no speedup.
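To give a sense of what that kind of tweaking looks like, here is a small, hypothetical illustration rather than code from the actual project.  The untyped function is ordinary Python; the cpdef version declares C types so that Cython can generate faster code:

#plain Python version
def weightedSum(data, weight):
    total = 0
    for x in data:
        total += x * weight
    return total

#Cython version with static types, placed in the .pyx file
cpdef double weightedSumTyped(list data, double weight):
    cdef double total = 0
    cdef double x
    for x in data:
        total += x * weight
    return total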

So I finally ran it through the profiler, something that under normal circumstances would have been my first action.  I found that the vast majority of the time was being spent in a PIL function that got called several times (though the functions from the class did seem to be minutely faster).  This proved a solid reminder of the importance of profiling before doing any kind of optimization.  Here, the time certainly wasn’t wasted, since it gave me a chance to play with Cython, but had I started with the profiler I might readily have decided this wasn’t a great test case and found a different one.

To keep playing with it, I set up a new Python file that imported that class in both its pure Python version and its Cython version with type declarations, and ran some of the methods a few thousand times with timeit.  This showed that the Cython version ran in about a tenth of the time.
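The comparison file itself was essentially just a pair of timeit calls.  A sketch of the approach, with placeholder module, class, and method names rather than the real ones, looks something like this:

import timeit

pySetup = 'from pure_module import MathClass; inst = MathClass()'   #placeholder names
cySetup = 'from cython_module import MathClass; inst = MathClass()' #placeholder names

pyTime = timeit.timeit('inst.someMethod(12345)', setup = pySetup, number = 5000)
cyTime = timeit.timeit('inst.someMethod(12345)', setup = cySetup, number = 5000)

print('Pure Python: {} seconds'.format(pyTime))
print('Cython: {} seconds'.format(cyTime))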

Conclusions

Cython seems to have a great ability to increase the speed of Python code and will almost certainly be a tool I use repeatedly.  With that said, getting the full advantage of it requires some additional work.  Also, the first step in almost any sensible optimization plan should be to apply the profiler.

Class and instance variables in Python 2.7

The differences and interactions between class variables and instance variables are well described in the Python 2.x documentation, but they can be slightly subtle in some cases and I recently spent quite some time troubleshooting an issue even though I should have known better.

To put it simply, class variables are defined outside of any method of the class, normally right below the docstring.  Class variables can be referenced directly from the class, which can be used as one way to make an enumerated constant, or they can be referenced from any instance of the class.  Instance variables are defined inside a method, normally __new__ or __init__, and they are local to that instance.

For a simple, contrived example:

class TestClass(object):
    c = 299792458  #this is a class variable

    def __init__(self):
        self.e = 2.71 #instance variable

    def phiFunc(self):
        self.phi = 1.618 #will become an instance variable after phiFunc is called
        x = 224 #this is a local variable and can't be accessed outside this method

assert TestClass.c == 299792458
try:
    print TestClass.e #Not defined
except AttributeError:
    pass

testInst = TestClass()
assert testInst.c == 299792458 #instance gets c, currently a ref to the class var
assert testInst.e == 2.71 #got e because of __init__
try:
    print testInst.phi #error since not defined yet
except AttributeError:
    pass

testInst.phiFunc()
assert testInst.phi == 1.618 #now it's here
try:
    testInst.x #still undefined
except AttributeError:
    pass

Class variables can be useful for constants that need to be used by all instances of the class, or that are meant to be accessed directly from the class.  They can also be used to set defaults for instance variables.  But it is important to remember that if the value of a class variable is changed, then all instances that refer to the class variable will reflect that change.  So, to carry on with our contrived TestClass:

TestClass.c = 1
assert testInst.c == 1 #Referring back to the class, so it changed too.

But when an instance has an attribute of the same name as a class attribute, the instance attribute essentially hides the class attribute.  And assigning a value to an attribute through an instance assigns it only to that instance, creating the instance attribute if it does not already exist.  So:

class testClass2 (object):
    sqrttwo = 1.41
    sqrrtthree = 1.73

    def __init__(self):
        self.sqrttwo = 1

assert testClass2.sqrttwo == 1.41 #access class variable

testInst2 = testClass2()
assert testInst2.sqrttwo == 1 #the instance variable hides the class variable

testInst2.sqrrtthree = 2 #assigning creates an instance attribute
assert testClass2.sqrrtthree == 1.73 #So the value in the class is unchanged

This can get complicated when the class variable is a mutable type, like a list.  Python handles assignment by reference, rather than by making a copy.  For instance:

class classWList(object):
    defaultList = [1,]

    def __init__(self):
        self.instList = self.defaultList

instance = classWList()
assert instance.instList == [1,] #same as defaultList
instance.instList.append(2)
assert classWList.defaultList == [1, 2] #the class variable also has 2 now.
assert id(classWList.defaultList) == id(instance.instList) #they point to the same memory address

Of course, that could be handled by explicitly making a copy instead of a simple assignment.
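For example, taking a copy in __init__ gives each instance its own list.  A minimal sketch, using a classWithCopy variation of the classWList example above:

class classWithCopy(object):
    defaultList = [1,]

    def __init__(self):
        self.instList = list(self.defaultList) #explicit copy rather than a reference

instance = classWithCopy()
instance.instList.append(2)
assert instance.instList == [1, 2]
assert classWithCopy.defaultList == [1,] #the class variable is unchanged this time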

In short, both class variables and instance variables are useful, but there can be some subtleties to the way they interact that need to be remembered when working with them.