Variable Scope in SQL Server

The way variable scoping works in SQL Server has been well documented on MSDN, but I have seen related issues come up on forums recently so I decided to go over it.  Generally, a variable in T-SQL is available between when it has been declared and the end of the batch, function, or stored procedure that it is defined in.  That is of course fairly straightforward, but it is easy to miss some of the subtleties on what constitutes a batch.

Go creates a new batch.

The most explicit way to end a batch is with the GO command.  So for instance:

declare @var varchar(10)

set @var = 'Hi'

print @var --This works

GO --End the batch

print @var --Generates an error

Gives results of:

Hi

Msg 137, Level 15, State 2, Line 2
Must declare the scalar variable "@var".

Each Execute is in a new batch.

When using SSMS, every call to execute generates a new batch.  So, for instance if you highlight:

declare @var varchar(10)

set @var = 'Hi'

print @var --This works

In SSMS and press F5 (or Execute under the Query Menu) then it is all one batch and the print statement will execute.  If you then immediately highlight that print statement again and press F5 again, it will give the “Must declare the scalar variable” because it is executing in a new batch.

A transaction can include multiple batches.

An explicit transaction can include multiple batches.  So, you can open an explicit transaction, declare a variable within it, start a new batch and so lose access to the variable before the transaction commits.  So, to re-use the first example with a slight tweak:

begin transaction

declare @var varchar(10)

set @var = 'Hi'

print @var --This works

GO --End the batch, but not the transaction

print @var --Generates an error since out of batch

--this particular error won't roll back the transaction, but ends the batch

Results in this message:

Hi

Msg 137, Level 15, State 2, Line 2

Must declare the scalar variable "@var".

It got that error because the GO command ended the batch.  The Level 15 error generated ends the execution but not the transaction, so looking @@trancount at this point will return one and we could proceed to commit or rollback the transactions (which doesn’t actually matter in this case since it didn’t affect anything).

A procedure has a different scope even if called within the same batch.

Even if it is called as part of a longer batch, a procedure has its own scope.  So a procedure can’t directly access other variables in a batch that aren’t passed to it as a parameter and it can’t affect variables outside of it except through output parameters. So if we create a procedure just for testing like:

alter procedure SillyProc

as

declare @sillyvar varchar(5)

set @sillyvar = 'Hi'

print @sillyvar

then when it is executed there won’t be an @sillyvar available even if it is in the same batch.  Thus this:

exec SillyProc

print @sillvar --generates an error

Would give the familiar “Must declare” error.

Each execution from an external program is in a new batch.

Generally, each call to execute from a separate program will be in a new batch.  For a python example:

import pyodbc

sqlConnStr = ('DRIVER={SQL Server};Server=(local);Database=TestDb;'+

'Trusted_Connection=YES')

sqlConn = pyodbc.connect(sqlConnStr)

curs = sqlConn.cursor()

sql = """

declare @pyvar varchar(10)

set @pyvar = 'HI!'"""

curs.execute(sql) #This works

sql = "set @pyvar = 'retry'"

curs.execute(sql) #this will give an error.

curs.commit() #Won't get here.

This will give a “Must declare the scalar variable “@pyvar”.” error when it reaches the second execute statement because that is in a new batch and that new batch does not have the variable declared.

Conclusion

The scope of a variable is generally straightforward, but it can have some subtleties.  Essentially, you can use a variable within the function, procedure, or batch it is declared in.

(Edited 17 Feb. 2013 to correct a type in a sourcecode tag.)

A really simple Multiprocessing Python Example

Purpose and introduction

A Python program will not be able to take advantage of more than one core or more than one CPU by default.  One way to get the program to take advantage of multiple cores is through the multiprocessing module.  There are lots of excellent references and tutorials available on the web (links are included at the bottom), but one thing I was not able to find back when I first started using multiprocessing was a detailed look at an extremely simple, but still practical, example.  Sometimes, it is useful when dealing with a new technique to see it in a very simple form, but not so simple as some of the completely contrived examples in the library documentation.

So, here is a look at an extremely simple example using an embarrassingly parallel issue: generating the Mandelbrot set.  The algorithm used is basically a direct adaptation of the one presented in pseudo-code on Wikipedia, grouping the pixels into rows to make it easier to pass off to the multiprocessing.  Just to be clear, this is far from the fastest or best or most elegant way to use Python to calculate the Mandelbrot set.  It does provide a fairly good springboard for using multiprocessing while still doing actual work.

Essentially this provides a straightforward example with explanations of processing a function against a list of arguments using multiprocessing and then gathering those together into a list.

A look at a single processor version

Here is the code for a single processor version:

import matplotlib.pyplot as plt
from functools import partial

def mandelbrotCalcRow(yPos, h, w, max_iteration = 1000):
    y0 = yPos * (2/float(h)) - 1 #rescale to -1 to 1
    row = []
    for xPos in range(w):
        x0 = xPos * (3.5/float(w)) - 2.5 #rescale to -2.5 to 1
        iteration, z = 0, 0 + 0j
        c = complex(x0, y0)
        while abs(z) < 2 and iteration < max_iteration:
            z = z**2 + c
            iteration += 1
        row.append(iteration)

    return row

def mandelbrotCalcSet(h, w, max_iteration = 1000):
    partialCalcRow = partial(mandelbrotCalcRow, h=h, w=w, max_iteration = max_iteration)
    mandelImg = map(partialCalcRow, xrange(h))
    return mandelImg

mandelImg = mandelbrotCalcSet(400, 400, 1000)
plt.imshow(mandelImg)
plt.savefig('mandelimg.jpg')

The modifications needed to use multiprocessing

Obviously, to use multiprocessing, we need to import it, so towards the top, we add:

Import multiprocessing

The mandelbrotCalcRow function can remain unchanged.  The main changes are to the mandelbrotCalcSet function, which now looks like:

def mandelbrotCalcSet(h, w, max_iteration = 1000):
    #make a helper function that better supports pool.map by using only 1 var
    #This is necessary since the version
    partialCalcRow = partial(mandelbrotCalcRow, h=h, w=w, max_iteration = max_iteration)

    pool =multiprocessing.Pool() #creates a pool of process, controls worksers
    #the pool.map only accepts one iterable, so use the partial function
    #so that we only need to deal with one variable.
    mandelImg = pool.map(partialCalcRow, xrange(h)) #make our results with a map call
    pool.close() #we are not adding any more processes
    pool.join() #tell it to wait until all threads are done before going on

return mandelImg

Here, Pool creates the pool of processes that controls the workers.  It gets the environment ready to run multiple tasks.  One of the easiest ways to use the pool is to use its map.  That takes a function and an iterable of parameters.  That function is then called for each parameter in the iterable and results are put into a list, distributing the calls over the available threads.

One significant difference between pool.map and the built-in map, other than the fact pool.map can take advantage of multiple processors, is that pool.map will only take a single iterable of arguments for processing.  That is why I created a partial function which freezes the other arguments.

Pool.close() then informs the processor that no new tasks will be added that pool.  Either pool.close or pool.terminate need to be called before pool.join can be called.  Pool.join stops and waits for all of the results to be finished and collected before proceeding with the rest of the program.  This gives a simple way to collect the results into a single list for use later.

The other significant change is that the main portion, the entry-point of the script, needs to be wrapped with a  “if __name__=’__main__’ conditional on Windows.  This is because the main module needs to be able to be safely imported by a new python interpreter.  Not doing this can result in problems such as a RuntimeError or completely locking up the system in some of the tests I tried.  This, and a couple of other caveats, are mentioned in the Programming Guidelines.

So, the entry point now looks like:

if __name__=='__main__':
    mandelImg = mandelbrotCalcSet(400, 400, 1000)
    plt.imshow(mandelImg)
    plt.savefig('mandelimg.jpg')

In this example, the multiprocessing version only has 8 additional lines of code (its 15 lines longer, but 7 of those lines are additional whitespace or comment lines I added to make it easier to read).  But it runs in less than a third of the time.

Of course, it is worth remembering the saying that “premature optimization is the root of all evil.”  It is normally smart to get the code working first, and then consider bringing in multiprocessing options.

And the results:

Some related links.

  1. Multiprocessing Docs
  2. The examples in the documentation.
  3. Wiki.cython.org has an example of creating the Mandelbrot set using Cython.  For actually generating the set rather than just making examples for multiprocessing, that version is much better.
  4. SciPy.org has a good discussion of parallel programming with numpy and scipy.

 

{Edit 10 Jan 13 – Corrected a minor spelling error.}

Playing with Cython

Intro

Recently, I came over Cython and started experimenting with it.  After some basic testing, I found several things of interest.

  1. Used properly, Cython is a fantastic way to speed up Python code.
  2. It is extremely liberal in what Python code it will accept, but works best when the code is tweaked for Cython.
  3. Cython works extremely well, but it is not magic and can reduce performance and make code more brittle if not used carefully.
  4. Really, use the profiler before doing any serious optimization work.

What is Cython?

Cython is a programming language that compiles to C or C++ code that can then be compiled with a normal C/C++ compiler.  It supports the optional static typing of variables with C-types in order to improve performance.   One common usage is to create optimized extension modules to be imported into python.  Cython also seems able to create standalone executables, but I have not tried that yet.

The story and what I found.

When Cython caught my interest, I read about it on the web and played just a bit with some of the examples that were posted.  Then I decided I should play with it in something a bit closer to the real world and preferably with code that I had written.

I had a small little project that I had been toying with lying around that seemed perfect.  The entire thing was under a thousand lines of actual working code and I had it cleanly split into two files, one that did all the work and the other that just supported a pyQt gui.  The module with the working functions could be exercised completely separately from the GUI, I had unit tests for all the important functions, and I was unhappy with the performance.  This seemed like a perfect opportunity to test Cython.

I knew that under normal circumstances my first step in optimizing my simple program should be to run it through the profiler.  I’ve read that many times and I’ve seen good reasons for it in practice.  But this time my goal was to play with Cython, if it fixed the performance issues in the process that would be a nice bonus.  So, I was going to mess with Cython no matter what the profiler said and didn’t see a reason to bother with it at this point.

So, I forked the project into a new folder and set aside the gui, focusing on the working module for now.  I pulled out nearly all of the code and put it into a new .pyx file.  I left the original .py file as a small shell that set a couple of parameters and called the main function.  Then at the top of that file I added:

Import pyximport; pyximort.install()
Import (the pyx module)

Then I ran it, and it took substantially longer than the original pure python file.  It didn’t take much looking to see that was because of compiling the new pyx module.  So, I ran it again and the time reduced to only a fraction of a second slower than it originally was, but it was still slower than pure Python.  Just to play with it, I built a setup.py and made a pyd file to take pyxImport entirely out of the quest.  In this case, it didn’t make any measurable difference, but I have not done extensive testing on it and there are cases where a setup.py is needed to compile the file.

These initial tests showed me two things immediately.  The first is that Cython is extremely liberal in what it accepts.  The python code I ran through the Cython compiler included calls to import various libraries and other idiomatic Python features and it handled it without complaint.  I saw references that Cython is not a full Python implementation (at least not yet) and that some types of valid Python will fail, but it certainly seems to handle most cases.  All of the unit tests, continued to pass at this point.  The second is that while many examples show that Cython can speed up certain types of Python code with no changes at all, it is not magic and certain types of Python programs can be slowed down by using Cython.

Along those lines, I played with it by passing in the “wrong” type.  Namely, I passed in a float where I had defined an int in the Cython code.  The Python version happily switched to outputting float values.  The Cython version executed without errors but the output remained an int and was now substantially rounded compared to the results from the Python version.

So I pulled out just one class that mostly did different types of arithmetic.  I went through and meticulously went through changing def to cpdef and putting in types everywhere it made any sense to do so.  I tried that, and there was still no speedup.

So I finally ran it through the profiler, something that under normal circumstances would have been my first action.  I found that the vast majority of the time was being spent in a function of PIL that got called several times (though the functions from the class did seem to be minutely faster).   This proved a solid reminder of the importance of profiling before doing any kind of optimization.  Here, the time certainly wasn’t wasted since it gave me a chance to play with Cython, but had I started with the profiler I might readily have decided this wasn’t a great test case and found a different one.

To keep playing with it, I setup a new python file that imported that class in both its pure python version and in the Cython version with type declarations and ran some of the methods a few thousand times with timeit.   This showed that the Cython version ran in about a tenth of the time.

Conclusions

Cython seems to have a great ability increase the speed of Python code and will almost certainly be a tool I use repeatedly.  With that said, to get full advantage of it require some additional work.  Also, the first step in almost any sensible optimization plan should be to apply the profiler.

Class and instance variables in Python 2.7

The differences and interactions between class variables and instance variables are well described in the Python 2.x documentation, but they can be slightly subtle in some cases and I recently spent quite some time troubleshooting an issue even though I should have known better.

To put it simply, class variables are defined outside of any method of the class, normally right below the doc string.  Class variables can be referenced directly from the class, and this can be used as one way to make an enumerated constant, or they can be referenced from any instance in the class.  Instance variables are defined inside a method, normally __new__ or __init__, and they are local to that instance.

For a simple, contrived example:

class TestClass(object):
    c = 299792458  #this is a class variable

    def __init__(self):
        self.e = 2.71 #instance variable

    def phiFunc(self):
        self.phi = 1.618 #will become an instance variable after phiFunc is called
        x=224 #this is a method variable and can't be accessed outside this method

assert TestClass.c == 299792458
try:
    print TestClass.e #Not defined
except AttributeError:
    pass

testInst = TestClass()
assert testInst.c == 299792458 #instance gets c, currently a ref to the class var
assert testInst.e == 2.71 #got e because of __init__
try:
    print testInst.phi #error since not defined yet
except AttributeError:
    pass

testInst.phiFunc()
assert testInst.phi == 1.618 #now its here
try:
    testInst.x #still undefined
except AttributeError:
    pass

Class variables can be useful for constants that will need to be used by all instances of the class, or that are meant to be accessed directly from the class.  They can also be used to set defaults for instance variables.  But there it is important to remember that if the value of a class variable is changed, then all instances that call to the class variable will reflect that change.  So, to carry on with our contrived TestClass:

TestClass.c = 1
assert testInst.c == 1 #Referring back to the class, so it changed too.

But when an instance has an attribute of the same name as an instance that essentially hides the class attribute. And assigning a value to an attribute of an instance will assign it just to that instance attribute, even if it needs to create it to do it. So:

class testClass2 (object):
    sqrttwo = 1.41
    sqrrtthree = 1.73

    def __init__(self):
        self.sqrttwo = 1

assert testClass2.sqrttwo == 1.41 #access class variable

testInst2 = testClass2()
assert testInst2.sqrttwo == 1 #the instance variable hides the class variable

testInst2.sqrrtthree = 2 #assigning creates an instance attribute
assert testClass2.sqrrtthree == 1.73 #So the value in the class is unchanged

This can get complicated when the class variable is a mutable type, like a list.  Python handles assignment by reference, rather than by making a copy.  For instance:

class classWList(object):
    defaultList = [1,]

    def __init__(self):
        self.instList = self.defaultList

instance = classWList()
assert instance.instList == [1,] #same as defaultList
instance.instList.append(2)
assert classWList.defaultList == [1, 2] #the class variable also has 2 now.
assert id(classWList.defaultList) == id(instance.instList) #they point to the same memory address

Of course, that could be handled by explicitly making a copy instead of a simple assignment.

In short, both class variables and instance variables are useful, but there can be some subtleties to the way they interact that need to be remembered when working with them.

Some Jargon from the SQL Server World

Not long ago I wrote about how to write well about SQL Server, and one major point was that jargon should be used sparingly and carefully.  Tied to that, I thought it might be beneficial to collect ten of the more commonly used jargon phrases in the world of SQL Server literature.  In doing so, I hoped to provide a concise list of terms that people joining this community might wish to be familiar with.  I also thought it would serve, as a reminder to the seasoned database professional that these were jargon and might not be understood outside our community, many even have an entirely different meaning in other communities.

To expand slightly on what I said in my previous post, jargon is not always a bad thing.  Sometimes there simply is no non-technical word to express a concept, or the technical term expresses it much more succinctly.  Some in sociology even say that jargon can help with group identity and group cohesion.  Yet, it can make communicating with those outside of the group, or new to the group, difficult.  Even those well versed in the terminology can find reading something which overuses jargon tiring and difficult to follow.  Jargon does not necessarily need to be avoided, especially when the target audience should be familiar with it, but it should be used sparingly and with care.

Here, I am focusing on terms which are relatively common within the SQL community, but which would likely be unfamiliar to those outside of it.  I am intentionally avoiding terms like IAM/PFS[1] that are relatively rarely used in the community.  I am also skipping words like table, row, and query that are amoungst the first things anyone working with SQL Server will learn.  Instead I am focusing on the words that are common but not elementary within the SQL Server community or those that are so common that people readily forget they are jargon, and I am not trying to be comprehensive even there.

1.    DBA  - The job title held by many who work with SQL Server is largely unknown outside the technical community.  Even spelling it out as “Database Administrator” has earned me quizzical looks when I talk to people outside of the computer professions.  I generally describe myself as a programmer when asked unless I know the other person actually has a technical background.  The abbreviated form also overlaps with multiple other terms, such as “doing business as” in the business/legal communities and “Doctor of Business Administration”.

2.    Normalization – Normalization is a way to organize tables and the relations between them to help reduce redundancy.  Under the right circumstances, normalization can help reduce the size of the data files, guarantee certain types of internal consistency, and make the database more flexible for future changes.  However, it can, under certain circumstances, come at a performance penalty and potentially some increased complexity in programming since it frequently involves creating more tables which must be joined together.  The term normalization is shared with many other fields such as sociology and statistics, where it refers to a way to remove certain types of statistical errors.

Normalization is a deep topic and is discussed in some detail in articles such as:

The Road to Professional Database Development: Database Normalization by Peter Larsson

Stairway to Database Design Level 9: Normalization by Joe Celko

3.    3NF – The Third Normal form is the normal form most often sought after for practical database implementations.  It was loosely summed up by E.F. Codd saying “Every Non-key attribute must provide a fact about the key, the whole key, and nothing but the key.”

4.    Codd – Speaking of E.F. Codd, many serious students of relational databases are likely to be familiar with his name.  He was a researcher for IBM who is credited with inventing the relational model for databases and created much of the foundations of the theory behind relational databases.  The phrase “so help me Codd” is also occasionally, jokingly, added to the description of 3NF.  His name is also attached to the Boyce-Codd Normal Form (BCNF, sometimes called 3.5NF).

5.    RDBMS – Relational Database Management System.  MS SQL Server is a prime example of this, but so are Oracle, MySQL, SQLite and others.  Relational databases are based, although sometimes loosely and with compromises, on the relational model first described by Codd.  Relational Databases can be contrasted with OODBMS[2] like ZODB and Key-Value Stores like BigTable.

6.    ACID – Atomicity, Consistency, Isolation, Durability.  ACID describes properties of a reliable database transaction.  To perhaps oversimplify, if a database is ACID compliant then either an entire transaction completes and is committed or none of that transaction is committed.  In SQL Server, the database engine provides mechanisms to ensure that transactions are ACID compliant.[3]   ACID is often contrasted with the BASE[4]consistency model, which describes some types of NoSQL databases.

More information at:
DBAs on ACID by Tony Davis

7.    PowerShell – PowerShell is a command line shell and scripting tool.  It permits access to the .NET framework including SMO, which permits relatively easy control of SQL Server.  There are many online resources for PowerShell, including:

Introduction to Windows PowerShell for the SQL Server DBA by Edwin Sarmiento

Exceptional PowerShell DBA Pt1 – by Laerte Junior

 

8.    CRUD – Create, Read, Update, and Delete.  They are four of the basic functions of any persistent storage system, including databases.  This comes up frequently in the context of describing an application that does CRUD work on a database, especially if that application does nothing but provide a convenient front end or GUI for users to perform CRUD operations on the database.

9.    GUI – Graphical User Interface.  This of course is a term that is used widely in tech circles, and programming circles in particular.  Most users, especially the technically sophisticated ones will already be familiar with this term, but not all.

10.  ETL– Extract, transform, and load.  When discussing databases, this is generally the process of preparing a data source and then loading it into the database or data warehouse.  I have occasionally seen this used in reference to removing information from a database and presenting it in a more user friendly format, but that seems to be a less common use.

Talking about SQL Server in particular, SSIS is one of the main ETL tools used.  But practically any programming language can be used for the purpose.  I tend to make extensive use of Python for ETL tasks that involve substantial preprocessing of the data.


[1] Index Allocation Map/Page Free Space.  I only really learned this one myself when I was investigating some surprising results on a performance test.

[2] Object Oriented Database Management System

[3] There are some subtleties here though.  Handling Constraint Violations and Errors in SQL Server by Phil Factor has details on some of the factors that affect rolling back transactions.

[4] Basically Available, Soft State, Eventual Consistency.

The effects of an SSD on SQL Server Performance

I.                    Introduction

A little while ago, my wonderful wife Renee got a Solid State Drive (SSD) for me.  Being a DBA, one of my first thoughts was to wonder what kind of effect this would have on database performance, since I use my home computer for a lot of testing and my writing projects.  So, I installed the drive in my machine (a slightly more complicated process than I expected that wound up involve type III paracord…) and reinstalled my necessary software.

Then, being a father of a three as well as a database professional and a grad student, I proceeded to put that idea aside for some time.  But I wanted numbers and a direct performance comparison, and thought they would be of interest to the broader database community.  So, once I found some time, I began running some tests for direct comparisons.

II.                  A Brief Background on SSDs

Traditional harddrives store the data on a small array of platters using magnetic charges.  In order to read the platters or write to them, the movable read/write heads must be properly positioned over the platters.  Most solid state drives are based on flash memory and often involve no moving parts whatsoever.

These different techniques to store data create several broad differences between HDDs and SSDs:

  1. SSDs tend to have dramatically lower latencies since they can access the entire drive all the time without having to align the head with the part of the drive to be read.
  2. SSDs tend to be much more resistant to being moved around while doing read/write operations.
  3. HDDs tend to be substantially larger and have a much better size/price ratio.

I carefully hedged by using “tend” rather than a more definitive word.  Both SSD and HDD technologies are still evolving and how large the differences are, or even if they hold true, can vary when comparing different SSDs with different HDDs and with the different circumstances they are used under.  But those three broad trends will hold true the vast majority of the time.

SSDs, because of the way they write to the drive, can be subject to write-cycle exhaustion.  This happens as parts of the drive are overwritten too many times and eventually cannot be written to any more.  To dramatically oversimplify this topic:

  1. According to Tom’s Hardware, it is rarely something a consumer making standard use of their drive needs to worry about.
  2. Ars Technica points out that many SSDs targeted at the enterprise use SLC instead of the consumer grade MLC flash, and SLC is much less susceptible to write-cycle exhaustion, but more expensive.
  3. The risk of write-cycle exhaustion can be significant under certain circumstances though, such as using a consumer grade MLC drive in a write intensive enterprise-style environment

There are also hybrid drives which combine relatively large standard harddrives with a smaller cache of flash memory which is used to reduce latency.

As for my personal experience, the change in overall performance of my system when I went from using a HDD to using an SSD as my primary system drive with the HDD for bulk storage was dramatic.  While I have not measured it precisely, the system boot time is now substantially shorter, as are load times for all programs.  The shortened load time is particularly noticeable in Spyder, a python IDE I use that used to take nearly a minute to fully load and now takes just a few seconds.  The computer as a whole is subjectively more responsive and “snappy”.  I think that for many modern computers, moving to a SSD is perhaps one of the best ways to improve the overall performance of the system.

III.                A Word on Methodology

Here, I am essentially looking at performance difference as they pertain to databases between the two drives in my home computer.  While I think this generalizes relatively well to give an idea of the database performance implications of SSD’s generally, it is important to emphasize that in the end my sample size is one standard harddrive and one SSD.  More than that, these are both consumer grade drives sitting in desktop computer.  In short, while I think this comparison is instructive, it is not a rigorous comparison of enterprise class drives that would be used in corporate servers, much less a look at large scale SANs which can readily incorporate multiple types of storage medium in complex tiers.

I selected two slightly different tests to time.  First, I took the Python script I wrote to compare the performance of SQL Server and SQLite and ran it with just the target drive changed.  Then I ran some of the long running scripts from Jeff Moden’s excellent Performance Tuning: Concatenation Functions and Some Tuning Myths article.

In this setup, I am using the same installation of Windows and SQL Server 2012, both installed on the SSD, and only the location of the database is really changing.  The relevant system specifications for reference are:

System Alienware (Modified)
Processor Intel Core i7 3.4GHz
Ram 16GB
OS Windows 7 Home
SQL SQL Server 2012 Expres

IV.                The Tests

a.       Rerunning the SQL Server and SQLite Tests

A little while ago I wanted to do a direct comparison between SQL Server and SQLite in a single user, desktop environment.  Since it already had a system with speed measurements in place, I reran all the tests twice, once with the destinations for the database pointing to the HDD and once with them pointing to the SSD.  The full details of the tests are in the previous article, but to briefly summarize, the test program creates a SQL Server and a SQLite table and then repeatedly inserts into the tables, sending a commit after each statement.  Then it repeatedly inserts into the tables, sending a commit only at the end, and then attempts to read from both of the tables.  The results are:

SSD HDD
0.917 secs for 5000 inserts SQL Server (commit each)10.386 secs for 5000 inserts SQLite (commit each)0.435 secs for 5000 inserts SQL Server (commit once)0.020 secs for 5000 inserts SQLite (commit once)1.556 secs for 5000 reads SQL Server

1.960 secs for 5000 reads SQLite

2.120 secs for 5000 inserts SQL Server (commit each)538.364 secs for 5000 inserts SQLite (commit each)0.402 secs for 5000 inserts SQL Server (commit once)0.459 secs for 5000 inserts SQLite (commit once)1.518 secs for 5000 reads SQL Server

1.890 secs for 5000 reads SQLite

As expected, with a commit after each insert, which forces the server to write after every single insert, the SSD performs dramatically better.  It surprised me somewhat that the HDD came out better on the SQL Server inserts with one commit and even slightly faster on the SQL server reads.  Although I am not certain, I suspect this is because of the sophisticated caching that both the harddrive and SQL Server itself do.  So, I cleared the cache before executing the reads and got:

SSD (Cache cleared) HDD (Cache cleared)
1.564 secs for 5000 reads SQL Server 1.655 secs for 5000 reads SQL Server

This increased the time for reads on both drives slightly and shifted the advantage to the SSD where it was expected.

b.      Jeff Moden’s Concatenation Tests

Next, I turned back to Jeff Moden’s concatenation article.  This article came to mind because it helped give me more insight into performance tuning when I was much earlier in my SQL career and because it included a nice long running “Hog” Test that in his article took over 2 minutes and 50 seconds on the hardware he was using at the time.  So, I ran his setup code, and then created the function in the article, each twice, once on a database sitting on the SSD and once on a database sitting on the HDD.  Then I cleared the cache[1] and ran


SELECT DISTINCT SomeID, dbo.fnConcatTest(SomeID) AS CSVString
FROM dbo.TestData

Against each of them to get:

SSD (Cache cleared) HDD (Cache Cleared)
7 Secs 22 Secs

Clearly the SSD helped substantially.  This should not in any way to be taken to mean that improving the hardware can replace tuning the code though.  The “STUFF” version he provided at the end of his article had sub-second response times on the HDD, which is a nice reminder that frequently you can get better improvements by optimizing the code than by improving the hardware.

 

V.                  Conclusion

As the numbers show, the SSD generally performs better.  Though how much better varied by the circumstances, with SQL Server’s sophisticated caching system letting the HDD perform or even top the SSD under certain limited circumstances.

In most consumer and workstation systems currently in use, I suspect switching from a HDD to a SSD is one of the most effective upgrades currently available.  Moving purely to SSDs is probably prohibitively expensive for most large database applications, however they can serve as an effective part of a SAN to hold the most active data or to hold selected parts of the database which particularly require low latency with the bulk data held on HDDs in a simpler storage configuration.

VI.                More References

  1. The SSD Revolution – A series of feature articles on SSDs by ArsTechnica.
  2. In-depth on how SSDs really work- By Lee Hutchinson.  This is part of ArsTechnica’s SSD Revolution series, but goes into great depth on SSDs work and what those inner workings imply.  This is well worth reading.
  3. Is Your SSD More Reliable Than a Hard Drive by Andrew Ku

[1] It was significant to clear the cache here, without doing so the SSD and HDD times were identical.

New articles posted.

My article Plotting SQL Server Data for Data Visualization is up on MSSQLTips now and discusses using PyQt and Matplotlib to plot data from SQL Server.

My article Single User Performance of SQLite v SQL Server was posted yesterday on SQLServerCentral.  This one does a comparison of the performance of the python implementation of SQLite and SQL Server under single user scenarios.   It also touches on some interesting performance aspects of primary keys that I came across while writing the article.

Some of the commenters have pointed out that this is a somewhat “apples-to-oranges” comparison, and they have a fair point.  Still, I occasionally run into cases where both are valid choices so I thought it worth writing about.  One comment mentioned that SQL SE would be a more valid comparison, so I may do a brief follow up when I get a chance.