Order of a Result Set in SQL

I have seen people build procedures that rely on a result set being returned in a certain order.  This is fine if an order by clause is used, but SQL Server will not guarantee the order of the results without one.

It is interesting to look at the way SQL Server normally returns a result set though.  So, lets make a small sample table:

create table orderTest (n varchar(15))

insert into orderTest (n)
values  (2),
(1),
(3),
(1)

If we run a simple “select n from orderTest”, the results I get back are 2, 1, 3, 1.  In other words, it returns the results in the order they were inserted.  This is relatively consistent; I have tried similar tests on multiple versions of SQL Server on multiple machines.  As I said at the beginning, I have seen programs rely on the data being returned in the order it was inserted, and those programs have run for extended periods without any problems.  But I want to emphasize that SQL Server will not guarantee the order without an order by clause.  If you need to be able to get the data back in the order it was inserted, you can add an identity column and order by that, as in the sketch below.
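For instance, here is a minimal sketch of that approach (the second table and the insertOrder column name are mine, just for illustration):

create table orderTest2
(
insertOrder int identity(1,1),
n varchar(15)
)

insert into orderTest2 (n)
values (2),
(1),
(3),
(1)

select n
from orderTest2
order by insertOrder --returns the rows in insertion order: 2, 1, 3, 1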

The tendency to return things in the order they were inserted seems fairly stable with simple where clauses, but asking SQL Server to return distinct results will change the order as it checks for duplicates.  For instance “select distinct n from orderTest” gives me 1, 2, 3.

And SQL Server seems to return the results of a union all by just concatenating the results of the two queries.  So

select n
from orderTest
UNION ALL
select n
from orderTest

Gives 2, 1, 3, 1, 2, 1, 3, 1; essentially, the contents of the table repeated twice.  Union, on the other hand, makes the results distinct, and so it changes the order: redoing that with union instead of union all gives 1, 2, 3.

Indexes will also affect the order.  You may notice the tests were all done on a heap with no indexes.  But if I add an index (either clustered or nonclustered) it makes the order for “select n from orderTest” become 1, 1, 2, 3.  Of course, you cannot always rely on the data being returned in index order either, unless there is an order by clause.  Similarly, as Conor Cunningham pointed out, the size of the dataset can affect it and break apparent order as the optimizer seeks to increase parallelization.
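As a quick sketch of that index test (the index name below is my own):

create nonclustered index IX_orderTest_n on orderTest (n)

select n
from orderTest --now typically comes back as 1, 1, 2, 3, but still not guaranteed without an order by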

In short, SQL Server has certain ways it will normally return data, and it can be helpful to be aware of those.  Even small changes to the query, or changes in the size of the data, can have a major impact on the order the results are returned in.  Without an order by clause, SQL Server is not guaranteed to return the data in the way it normally does, and you cannot rely on that order.  If the order actually matters, use an order by clause.

SQL Server Three Valued Logic and Ansi Nulls

SQL Server uses three-valued logic, with True, False, and Unknown. And, normally, SQL Server nulls are not comparable to each other. Together, that means that this little script:

if Null = Null -- Results in Unknown
print 'Num 1'

if Null != Null -- Results in Unknown
print 'Num 2'

if not (Null = Null) or not (Null != Null) -- Results in Unknown
print 'Num 3'

normally produces no results. It simply prints the standard ‘Command(s) completed successfully.’ One significant implication of this is that if the where clause includes something like

[column_name] = null

It will filter out all rows. The best way to look for Nulls is to use

[column_name] is null
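As a small sketch of the difference, assuming a hypothetical table testTable with a nullable column col1:

select *
from testTable
where col1 = null --filters out every row, even those where col1 is null

select *
from testTable
where col1 is null --returns the rows where col1 is null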

But, whether or not Nulls are comparable can be controlled by the use of the Ansi_Nulls session setting. The default for Ansi_Nulls is on, and in fact the documentation warns that ANSI_NULLS may always be ON, with no way to turn it off, in future versions. Until then, this test script:

Set Ansi_nulls off

if Null = Null -- Now Results in True
print 'Num 1'

if Null != Null -- Now False
print 'Num 2'

if not (Null = Null) or not (Null != Null) -- Now True
print 'Num 3'

Dutifully prints out:

Num 1
Num 3

It is generally best to leave Ansi_nulls in its standards-compliant default setting and use “is null” or “is not null”, but occasionally it can be useful to change that.

[20 Aug 13 - Updated to add tags and change category]

Variable Scope in SQL Server

The way variable scoping works in SQL Server has been well documented on MSDN, but I have seen related issues come up on forums recently so I decided to go over it.  Generally, a variable in T-SQL is available between when it has been declared and the end of the batch, function, or stored procedure that it is defined in.  That is of course fairly straightforward, but it is easy to miss some of the subtleties on what constitutes a batch.

Go creates a new batch.

The most explicit way to end a batch is with the GO command.  So for instance:

declare @var varchar(10)

set @var = 'Hi'

print @var --This works

GO --End the batch

print @var --Generates an error

Gives results of:

Hi

Msg 137, Level 15, State 2, Line 2
Must declare the scalar variable "@var".

Each Execute is in a new batch.

When using SSMS, every call to execute generates a new batch.  So, for instance if you highlight:

declare @var varchar(10)

set @var = 'Hi'

print @var --This works

In SSMS and press F5 (or Execute under the Query Menu) then it is all one batch and the print statement will execute.  If you then immediately highlight that print statement again and press F5 again, it will give the “Must declare the scalar variable” because it is executing in a new batch.

A transaction can include multiple batches.

An explicit transaction can include multiple batches.  So, you can open an explicit transaction, declare a variable within it, and then start a new batch and lose access to the variable before the transaction commits.  To re-use the first example with a slight tweak:

begin transaction

declare @var varchar(10)

set @var = 'Hi'

print @var --This works

GO --End the batch, but not the transaction

print @var --Generates an error since out of batch

--this particular error won't roll back the transaction, but ends the batch

Results in this message:

Hi

Msg 137, Level 15, State 2, Line 2

Must declare the scalar variable "@var".

It got that error because the GO command ended the batch.  The Level 15 error ends the execution but not the transaction, so checking @@trancount at this point will return one, and we could proceed to commit or roll back the transaction (which doesn’t actually matter in this case since nothing was changed).
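As a small sketch of that, continuing in the same session as the script above:

select @@trancount --returns 1; the transaction is still open

commit transaction --or rollback transaction; either works here since nothing was changed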

A procedure has a different scope even if called within the same batch.

Even if it is called as part of a longer batch, a procedure has its own scope.  So a procedure can’t directly access other variables in a batch that aren’t passed to it as a parameter and it can’t affect variables outside of it except through output parameters. So if we create a procedure just for testing like:

create procedure SillyProc

as

declare @sillyvar varchar(5)

set @sillyvar = 'Hi'

print @sillyvar

then when it is executed there won’t be an @sillyvar available even if it is in the same batch.  Thus this:

exec SillyProc

print @sillyvar --generates an error

Would give the familiar “Must declare” error.
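If a value does need to make it back out of a procedure, an output parameter is the usual route.  Here is a minimal sketch (the procedure and parameter names are mine, just for illustration):

create procedure SillyProcOut
@greeting varchar(5) output
as
set @greeting = 'Hi'
GO

declare @outsideVar varchar(5)
exec SillyProcOut @greeting = @outsideVar output
print @outsideVar --prints Hi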

Each execution from an external program is in a new batch.

Generally, each call to execute from a separate program will be in a new batch.  For a python example:

import pyodbc

sqlConnStr = ('DRIVER={SQL Server};Server=(local);Database=TestDb;'+
              'Trusted_Connection=YES')

sqlConn = pyodbc.connect(sqlConnStr)
curs = sqlConn.cursor()

sql = """
declare @pyvar varchar(10)
set @pyvar = 'HI!'"""
curs.execute(sql) #This works

sql = "set @pyvar = 'retry'"
curs.execute(sql) #this will give an error.
curs.commit() #Won't get here.

This will give a “Must declare the scalar variable “@pyvar”.” error when it reaches the second execute statement because that is in a new batch and that new batch does not have the variable declared.
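One way around that, sticking with the example above, is to send the declaration and everything that uses the variable in a single execute call so it all runs as one batch:

sql = """
declare @pyvar varchar(10)
set @pyvar = 'HI!'
set @pyvar = 'retry'"""
curs.execute(sql) #one batch, so the variable stays in scope the whole time
curs.commit()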

Conclusion

The scope of a variable is generally straightforward, but it can have some subtleties.  Essentially, you can use a variable within the function, procedure, or batch it is declared in.

(Edited 17 Feb. 2013 to correct a typo in a sourcecode tag.)

A really simple Multiprocessing Python Example

Purpose and introduction

A Python program will not be able to take advantage of more than one core or more than one CPU by default.  One way to get the program to take advantage of multiple cores is through the multiprocessing module.  There are lots of excellent references and tutorials available on the web (links are included at the bottom), but one thing I was not able to find back when I first started using multiprocessing was a detailed look at an extremely simple, but still practical, example.  Sometimes, it is useful when dealing with a new technique to see it in a very simple form, but not so simple as some of the completely contrived examples in the library documentation.

So, here is a look at an extremely simple example using an embarrassingly parallel issue: generating the Mandelbrot set.  The algorithm used is basically a direct adaptation of the one presented in pseudo-code on Wikipedia, grouping the pixels into rows to make it easier to pass off to the multiprocessing.  Just to be clear, this is far from the fastest or best or most elegant way to use Python to calculate the Mandelbrot set.  It does provide a fairly good springboard for using multiprocessing while still doing actual work.

Essentially this provides a straightforward example with explanations of processing a function against a list of arguments using multiprocessing and then gathering those together into a list.

A look at a single processor version

Here is the code for a single processor version:

import matplotlib.pyplot as plt
from functools import partial

def mandelbrotCalcRow(yPos, h, w, max_iteration = 1000):
    y0 = yPos * (2/float(h)) - 1 #rescale to -1 to 1
    row = []
    for xPos in range(w):
        x0 = xPos * (3.5/float(w)) - 2.5 #rescale to -2.5 to 1
        iteration, z = 0, 0 + 0j
        c = complex(x0, y0)
        while abs(z) < 2 and iteration < max_iteration:
            z = z**2 + c
            iteration += 1
        row.append(iteration)

    return row

def mandelbrotCalcSet(h, w, max_iteration = 1000):
    partialCalcRow = partial(mandelbrotCalcRow, h=h, w=w, max_iteration = max_iteration)
    mandelImg = map(partialCalcRow, xrange(h))
    return mandelImg

mandelImg = mandelbrotCalcSet(400, 400, 1000)
plt.imshow(mandelImg)
plt.savefig('mandelimg.jpg')

The modifications needed to use multiprocessing

Obviously, to use multiprocessing, we need to import it, so towards the top, we add:

import multiprocessing

The mandelbrotCalcRow function can remain unchanged.  The main changes are to the mandelbrotCalcSet function, which now looks like:

def mandelbrotCalcSet(h, w, max_iteration = 1000):
    #make a helper function that better supports pool.map by using only 1 var
    #This is necessary since pool.map only accepts one iterable
    partialCalcRow = partial(mandelbrotCalcRow, h=h, w=w, max_iteration = max_iteration)

    pool = multiprocessing.Pool() #creates a pool of processes and controls the workers
    #the pool.map only accepts one iterable, so use the partial function
    #so that we only need to deal with one variable.
    mandelImg = pool.map(partialCalcRow, xrange(h)) #make our results with a map call
    pool.close() #we are not adding any more processes
    pool.join() #wait until all the worker processes are done before going on

    return mandelImg

Here, Pool creates the pool of processes that controls the workers.  It gets the environment ready to run multiple tasks.  One of the easiest ways to use the pool is through its map.  That takes a function and an iterable of parameters.  The function is then called for each parameter in the iterable and the results are put into a list, with the calls distributed over the available worker processes.

One significant difference between pool.map and the built-in map, other than the fact pool.map can take advantage of multiple processors, is that pool.map will only take a single iterable of arguments for processing.  That is why I created a partial function which freezes the other arguments.

pool.close() then informs the pool that no new tasks will be added to it.  Either pool.close or pool.terminate needs to be called before pool.join can be called.  pool.join waits for all of the workers to finish before proceeding with the rest of the program.  This gives a simple way to collect the results into a single list for use later.

The other significant change is that the main portion, the entry point of the script, needs to be wrapped in an if __name__ == '__main__': conditional on Windows.  This is because the main module needs to be able to be safely imported by a new Python interpreter.  Not doing this can result in problems such as a RuntimeError, or completely locking up the system in some of the tests I tried.  This, and a couple of other caveats, are mentioned in the Programming Guidelines.

So, the entry point now looks like:

if __name__=='__main__':
    mandelImg = mandelbrotCalcSet(400, 400, 1000)
    plt.imshow(mandelImg)
    plt.savefig('mandelimg.jpg')

In this example, the multiprocessing version only has 8 additional lines of code (it's 15 lines longer, but 7 of those lines are additional whitespace or comment lines I added to make it easier to read).  But it runs in less than a third of the time.

Of course, it is worth remembering the saying that “premature optimization is the root of all evil.”  It is normally smart to get the code working first, and then consider bringing in multiprocessing options.

And the results: the rendered Mandelbrot image, saved as mandelimg.jpg.

Some related links.

  1. Multiprocessing Docs
  2. The examples in the documentation.
  3. Wiki.cython.org has an example of creating the Mandelbrot set using Cython.  For actually generating the set rather than just making examples for multiprocessing, that version is much better.
  4. SciPy.org has a good discussion of parallel programming with numpy and scipy.


{Edit 10 Jan 13 – Corrected a minor spelling error.}

Playing with Cython

Intro

Recently, I came across Cython and started experimenting with it.  After some basic testing, I found several things of interest.

  1. Used properly, Cython is a fantastic way to speed up Python code.
  2. It is extremely liberal in what Python code it will accept, but works best when the code is tweaked for Cython.
  3. Cython works extremely well, but it is not magic and can reduce performance and make code more brittle if not used carefully.
  4. Really, use the profiler before doing any serious optimization work.

What is Cython?

Cython is a programming language that compiles to C or C++ code that can then be compiled with a normal C/C++ compiler.  It supports the optional static typing of variables with C-types in order to improve performance.   One common usage is to create optimized extension modules to be imported into python.  Cython also seems able to create standalone executables, but I have not tried that yet.
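To give a sense of what that static typing looks like, here is a small sketch of a .pyx module in the style of the standard Cython tutorial examples (this is not code from my project):

#cdef declares C-typed variables; cpdef makes a function callable from both Python and C
cpdef double integrate_f(double a, double b, int n):
    cdef int i
    cdef double s = 0
    cdef double dx = (b - a) / n
    for i in range(n):
        s += (a + i * dx) ** 2 #integrate x squared as a stand-in workload
    return s * dx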

The story and what I found.

When Cython caught my interest, I read about it on the web and played just a bit with some of the examples that were posted.  Then I decided I should play with it in something a bit closer to the real world and preferably with code that I had written.

I had a small project lying around that I had been toying with, and it seemed like a good candidate.  The entire thing was under a thousand lines of actual working code, and I had it cleanly split into two files: one that did all the work and another that just supported a PyQt GUI.  The module with the working functions could be exercised completely separately from the GUI, I had unit tests for all the important functions, and I was unhappy with the performance.  This seemed like a perfect opportunity to test Cython.

I knew that under normal circumstances my first step in optimizing my simple program should be to run it through the profiler.  I’ve read that many times and I’ve seen good reasons for it in practice.  But this time my goal was to play with Cython; if it fixed the performance issues in the process, that would be a nice bonus.  So, I was going to mess with Cython no matter what the profiler said and didn’t see a reason to bother with it at this point.

So, I forked the project into a new folder and set aside the gui, focusing on the working module for now.  I pulled out nearly all of the code and put it into a new .pyx file.  I left the original .py file as a small shell that set a couple of parameters and called the main function.  Then at the top of that file I added:

import pyximport; pyximport.install()
import (the pyx module)

Then I ran it, and it took substantially longer than the original pure Python file.  It didn’t take much looking to see that was because of compiling the new .pyx module.  So, I ran it again and the time dropped to only a fraction of a second slower than it originally was, but it was still slower than pure Python.  Just to play with it, I built a setup.py and made a .pyd file to take pyximport entirely out of the equation.  In this case, it didn’t make any measurable difference, but I have not done extensive testing on it, and there are cases where a setup.py is needed to compile the file.

These initial tests showed me two things immediately.  The first is that Cython is extremely liberal in what it accepts.  The Python code I ran through the Cython compiler included calls to import various libraries and other idiomatic Python features, and it handled them without complaint.  I have seen references saying that Cython is not a full Python implementation (at least not yet) and that some types of valid Python will fail, but it certainly seems to handle most cases.  All of the unit tests continued to pass at this point.  The second is that while many examples show Cython speeding up certain types of Python code with no changes at all, it is not magic, and certain types of Python programs can actually be slowed down by using Cython.

Along those lines, I played with it by passing in the “wrong” type.  Namely, I passed in a float where I had defined an int in the Cython code.  The Python version happily switched to outputting float values.  The Cython version executed without errors but the output remained an int and was now substantially rounded compared to the results from the Python version.

So I pulled out just one class that mostly did different types of arithmetic.  I meticulously went through it, changing def to cpdef and putting in types everywhere it made any sense to do so.  I tried that, and there was still no speedup.

So I finally ran it through the profiler, something that under normal circumstances would have been my first action.  I found that the vast majority of the time was being spent in a function of PIL that got called several times (though the functions from the class did seem to be minutely faster).   This proved a solid reminder of the importance of profiling before doing any kind of optimization.  Here, the time certainly wasn’t wasted since it gave me a chance to play with Cython, but had I started with the profiler I might readily have decided this wasn’t a great test case and found a different one.

To keep playing with it, I set up a new Python file that imported that class in both its pure Python version and the Cython version with type declarations, and ran some of the methods a few thousand times with timeit.  This showed that the Cython version ran in about a tenth of the time.
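A rough sketch of that kind of comparison (the module and class names here are placeholders, not the real project):

import timeit

pure_time = timeit.timeit('calc.crunch()',
        setup='from pure_module import Calc; calc = Calc()', number=5000)
cython_time = timeit.timeit('calc.crunch()',
        setup='from cython_module import Calc; calc = Calc()', number=5000)
print pure_time, cython_time #in my case the typed Cython version came in at about a tenth of the time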

Conclusions

Cython seems to have a great ability to increase the speed of Python code and will almost certainly be a tool I use repeatedly.  With that said, getting the full advantage of it requires some additional work.  Also, the first step in almost any sensible optimization plan should be to run the profiler.

Class and instance variables in Python 2.7

The differences and interactions between class variables and instance variables are well described in the Python 2.x documentation, but they can be slightly subtle in some cases and I recently spent quite some time troubleshooting an issue even though I should have known better.

To put it simply, class variables are defined outside of any method of the class, normally right below the doc string.  Class variables can be referenced directly from the class, and this can be used as one way to make an enumerated constant, or they can be referenced from any instance in the class.  Instance variables are defined inside a method, normally __new__ or __init__, and they are local to that instance.

For a simple, contrived example:

class TestClass(object):
    c = 299792458  #this is a class variable

    def __init__(self):
        self.e = 2.71 #instance variable

    def phiFunc(self):
        self.phi = 1.618 #will become an instance variable after phiFunc is called
        x=224 #this is a method variable and can't be accessed outside this method

assert TestClass.c == 299792458
try:
    print TestClass.e #Not defined
except AttributeError:
    pass

testInst = TestClass()
assert testInst.c == 299792458 #instance gets c, currently a ref to the class var
assert testInst.e == 2.71 #got e because of __init__
try:
    print testInst.phi #error since not defined yet
except AttributeError:
    pass

testInst.phiFunc()
assert testInst.phi == 1.618 #now it's here
try:
    testInst.x #still undefined
except AttributeError:
    pass

Class variables can be useful for constants that will need to be used by all instances of the class, or that are meant to be accessed directly from the class.  They can also be used to set defaults for instance variables.  But it is important to remember that if the value of a class variable is changed, then all instances that refer back to the class variable will reflect that change.  So, to carry on with our contrived TestClass:

TestClass.c = 1
assert testInst.c == 1 #Referring back to the class, so it changed too.

But when an instance has an attribute of the same name as a class attribute, the instance attribute essentially hides the class attribute.  And assigning a value to an attribute of an instance affects just that instance, creating the instance attribute if it does not already exist.  So:

class testClass2 (object):
    sqrttwo = 1.41
    sqrrtthree = 1.73

    def __init__(self):
        self.sqrttwo = 1

assert testClass2.sqrttwo == 1.41 #access class variable

testInst2 = testClass2()
assert testInst2.sqrttwo == 1 #the instance variable hides the class variable

testInst2.sqrrtthree = 2 #assigning creates an instance attribute
assert testClass2.sqrrtthree == 1.73 #So the value in the class is unchanged

This can get more complicated when the class variable is a mutable type, like a list.  Assignment in Python binds a name to the existing object rather than making a copy of it.  For instance:

class classWList(object):
    defaultList = [1,]

    def __init__(self):
        self.instList = self.defaultList

instance = classWList()
assert instance.instList == [1,] #same as defaultList
instance.instList.append(2)
assert classWList.defaultList == [1, 2] #the class variable also has 2 now.
assert id(classWList.defaultList) == id(instance.instList) #they point to the same memory address

Of course, that could be handled by explicitly making a copy instead of a simple assignment, as in the sketch below.
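Here is a minimal sketch of that fix, using list() to make the copy (a slice or the copy module would work just as well; classWCopy is a new name here):

class classWCopy(object):
    defaultList = [1,]

    def __init__(self):
        self.instList = list(self.defaultList) #copies the list instead of referencing it

instanceC = classWCopy()
instanceC.instList.append(2)
assert instanceC.instList == [1, 2]
assert classWCopy.defaultList == [1,] #the class variable is unchanged this time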

In short, both class variables and instance variables are useful, but there can be some subtleties to the way they interact that need to be remembered when working with them.

Some Jargon from the SQL Server World

Not long ago I wrote about how to write well about SQL Server, and one major point was that jargon should be used sparingly and carefully.  Tied to that, I thought it might be beneficial to collect ten of the more commonly used jargon phrases in the world of SQL Server literature.  In doing so, I hoped to provide a concise list of terms that people joining this community might wish to be familiar with.  I also thought it would serve as a reminder to the seasoned database professional that these terms are jargon and might not be understood outside our community; many even have an entirely different meaning in other communities.

To expand slightly on what I said in my previous post, jargon is not always a bad thing.  Sometimes there simply is no non-technical word to express a concept, or the technical term expresses it much more succinctly.  Some in sociology even say that jargon can help with group identity and group cohesion.  Yet, it can make communicating with those outside of the group, or new to the group, difficult.  Even those well versed in the terminology can find reading something which overuses jargon tiring and difficult to follow.  Jargon does not necessarily need to be avoided, especially when the target audience should be familiar with it, but it should be used sparingly and with care.

Here, I am focusing on terms which are relatively common within the SQL community, but which would likely be unfamiliar to those outside of it.  I am intentionally avoiding terms like IAM/PFS[1] that are used relatively rarely in the community.  I am also skipping words like table, row, and query that are amongst the first things anyone working with SQL Server will learn.  Instead, I am focusing on the words that are common but not elementary within the SQL Server community, or those that are so common people readily forget they are jargon, and I am not trying to be comprehensive even there.

1.    DBA  - The job title held by many who work with SQL Server is largely unknown outside the technical community.  Even spelling it out as “Database Administrator” has earned me quizzical looks when I talk to people outside of the computer professions.  I generally describe myself as a programmer when asked unless I know the other person actually has a technical background.  The abbreviated form also overlaps with multiple other terms, such as “doing business as” in the business/legal communities and “Doctor of Business Administration”.

2.    Normalization – Normalization is a way to organize tables and the relations between them to help reduce redundancy.  Under the right circumstances, normalization can help reduce the size of the data files, guarantee certain types of internal consistency, and make the database more flexible for future changes.  However, it can, under certain circumstances, come at a performance penalty and potentially some increased complexity in programming since it frequently involves creating more tables which must be joined together.  The term normalization is shared with many other fields such as sociology and statistics, where it refers to a way to remove certain types of statistical errors.

Normalization is a deep topic and is discussed in some detail in articles such as:

The Road to Professional Database Development: Database Normalization by Peter Larsson

Stairway to Database Design Level 9: Normalization by Joe Celko

3.    3NF – The Third Normal form is the normal form most often sought after for practical database implementations.  It was loosely summed up by E.F. Codd saying “Every Non-key attribute must provide a fact about the key, the whole key, and nothing but the key.”

4.    Codd – Speaking of E.F. Codd, many serious students of relational databases are likely to be familiar with his name.  He was a researcher for IBM who is credited with inventing the relational model for databases and created much of the foundations of the theory behind relational databases.  The phrase “so help me Codd” is also occasionally, jokingly, added to the description of 3NF.  His name is also attached to the Boyce-Codd Normal Form (BCNF, sometimes called 3.5NF).

5.    RDBMS – Relational Database Management System.  MS SQL Server is a prime example of this, but so are Oracle, MySQL, SQLite and others.  Relational databases are based, although sometimes loosely and with compromises, on the relational model first described by Codd.  Relational Databases can be contrasted with OODBMS[2] like ZODB and Key-Value Stores like BigTable.

6.    ACID – Atomicity, Consistency, Isolation, Durability.  ACID describes properties of a reliable database transaction.  To perhaps oversimplify, if a database is ACID compliant then either an entire transaction completes and is committed or none of that transaction is committed.  In SQL Server, the database engine provides mechanisms to ensure that transactions are ACID compliant.[3]  ACID is often contrasted with the BASE[4] consistency model, which describes some types of NoSQL databases.

More information at:
DBAs on ACID by Tony Davis

7.    PowerShell – PowerShell is a command line shell and scripting tool.  It permits access to the .NET framework including SMO, which permits relatively easy control of SQL Server.  There are many online resources for PowerShell, including:

Introduction to Windows PowerShell for the SQL Server DBA by Edwin Sarmiento

Exceptional PowerShell DBA Pt1 – by Laerte Junior


8.    CRUD – Create, Read, Update, and Delete.  They are four of the basic functions of any persistent storage system, including databases.  This comes up frequently in the context of describing an application that does CRUD work on a database, especially if that application does nothing but provide a convenient front end or GUI for users to perform CRUD operations on the database.

9.    GUI – Graphical User Interface.  This of course is a term that is used widely in tech circles, and programming circles in particular.  Most users, especially the technically sophisticated ones will already be familiar with this term, but not all.

10.  ETL – Extract, transform, and load.  When discussing databases, this is generally the process of preparing a data source and then loading it into the database or data warehouse.  I have occasionally seen this used in reference to removing information from a database and presenting it in a more user-friendly format, but that seems to be a less common use.

Talking about SQL Server in particular, SSIS is one of the main ETL tools used.  But practically any programming language can be used for the purpose.  I tend to make extensive use of Python for ETL tasks that involve substantial preprocessing of the data.


[1] Index Allocation Map/Page Free Space.  I only really learned this one myself when I was investigating some surprising results on a performance test.

[2] Object Oriented Database Management System

[3] There are some subtleties here though.  Handling Constraint Violations and Errors in SQL Server by Phil Factor has details on some of the factors that affect rolling back transactions.

[4] Basically Available, Soft State, Eventual Consistency.