As we saw in my previous post, the built-in hash functions of SQL Server are either expensive with good distribution, or cheap with poor distribution. As a breath of fresh air, let us look at a useful magic quadrant:
We see that all of the hash functions exposed by HASHBYTES fall into the low speed quadrants; caution is advised here. We also see that while CHECKSUM and BINARY_CHECKSUM are challengers, they do not have the spread to fully compete. We are left with modulo as the best possible hash function, but we believe that this algorithm may have trouble executing. The modulo hash function has some problems (sorry, I had to say “problem” instead of “challenge”; I couldn’t bear speaking management bollocks any more):
- It only works on integer types
- If there is structure in the data (for example, if all values are even and the bucket count is even too), it will not spread the values evenly
The question we are faced with is:
“Could we create a high speed, good spread hash function that belongs in the leader quadrant and does not have the limitations of the modulo function?”
It was this question I sought to answer when I started hacking away at SQLCLR functions. As you read the following, please don’t laugh too hard at me (a small snigger will suffice) – it has been a long time since I hacked away at code.
MurmurHash2 and Davy’s Bit Magic
Austin Appleby has published a very interesting hash function, MurmurHash (hosted on Google Code), which claims to have the properties I am looking for: it is both fast and has a good spread. The hash algorithm comes in different flavours: MurmurHash2 and MurmurHash3. Each flavour also exists in architecture and word size optimized versions. I decided to implement the x86/x64, 32-bit integer version in C#.
Fortunately, someone already beat me to it. Davy Landman has an implementation of MurmurHash2 – which he has kindly shared under GPL. I took his implementation and wrapped it in a SQLCLR user defined function.
While trying to understand Davy’s code and hacking it to work with SQLCLR, I had a few “aha moments” that I would like to share.
First of all, Davy uses this cute trick to turn 4 bytes from a BYTE[] array into a 32-bit INT:
UInt32 k = (UInt32)(data[currentIndex++]
    | data[currentIndex++] << 8
    | data[currentIndex++] << 16
    | data[currentIndex++] << 24
    );
I thought that was a rather neat trick. To practice my illustration skills, here is what happens:
Please note that C# promotes the byte operands of the left shift (<<) and or (|) operators to INT32, so the whole expression evaluates to an INT32, not a UINT32. Hence the need to cast the final value to UINT32.
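To make the byte order concrete, here is a minimal sketch of the same packing trick as a standalone helper. The class and method names (`BytePacking`, `ToUInt32LittleEndian`) are made up for this illustration and do not come from Davy's code; the sketch also uses explicit offsets instead of `currentIndex++`.

```csharp
using System;

public static class BytePacking
{
    // Hypothetical helper: packs four bytes, starting at offset,
    // into a UInt32 with the first byte as the least significant one.
    public static uint ToUInt32LittleEndian(byte[] data, int offset)
    {
        // Each byte is promoted to Int32 before << and | are applied,
        // so the combined value is an Int32. The unchecked cast keeps
        // the bit pattern even when the top bit is set.
        return unchecked((uint)(data[offset]
                              | data[offset + 1] << 8
                              | data[offset + 2] << 16
                              | data[offset + 3] << 24));
    }
}
```

With the input bytes 0x78, 0x56, 0x34, 0x12 the helper returns 0x12345678: the first byte lands in the lowest 8 bits and the fourth byte in the highest. On little-endian hardware this should match what `BitConverter.ToUInt32` returns for the same bytes.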
Speaking of integers, in my modification of Davy’s source code you will find:
unchecked
{
    return (SqlInt32)(Int32)h;
}
I need to get from the UINT returned by the reference implementation of MurmurHash to the signed integer (SqlInt32) that SQL Server understands. Doing the unchecked cast here is faster than Convert.ToInt32, which would also throw an OverflowException for any hash value above Int32.MaxValue.
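The difference between the two conversions can be sketched in isolation. This is only an illustration; the names `SignedCast`, `ToSigned` and `ConvertOverflows` are hypothetical and not part of the SQLCLR source.

```csharp
using System;

public static class SignedCast
{
    // Reinterprets the 32 bits of h as a signed integer.
    // The unchecked block guarantees this never throws.
    public static int ToSigned(uint h)
    {
        unchecked
        {
            return (int)h;
        }
    }

    // Returns true when Convert.ToInt32 rejects the value
    // with an OverflowException instead of wrapping around.
    public static bool ConvertOverflows(uint h)
    {
        try
        {
            Convert.ToInt32(h);
            return false;
        }
        catch (OverflowException)
        {
            return true;
        }
    }
}
```

Roughly half of all 32-bit hash values have the top bit set, so a range-checked conversion would fail on those, while the unchecked cast simply maps them to negative integers (0xFFFFFFFF becomes -1).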
MurmurHash3
From the C++ source code of MurmurHash3, and using Davy’s bit trick, it took me only a few hours to get myself a nice implementation of MurmurHash3. I compared this with a C++ implementation done by my colleague Christian Martinez (based directly on the original source), and we agreed on outputs that exercise all the branches in the code. So I am reasonably confident that my implementation is correct.
Except for the fact that it turns out to be important to use SqlBinary instead of SqlBytes (blogged here), there is not much more to say. Feel free to use my MurmurHash3 in your own implementations (my implementation is GPL).
The source code for the MurmurHash functions is too long to paste in this blog, so I created a new page on my site which you can find by following this link: C# Source code for MurmurHash in SQLCLR.
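For readers who only want the shape of the algorithm, here is a minimal sketch of the 32-bit x86 variant of MurmurHash3. It is written from the public reference algorithm and is not the SQLCLR source linked above: the names `MurmurSketch` and `Hash32` are illustrative, the function takes a plain byte array instead of SqlBinary, and the tail handling uses if statements where the C++ reference uses a fall-through switch.

```csharp
using System;

public static class MurmurSketch
{
    private static uint RotateLeft(uint x, int r)
    {
        return (x << r) | (x >> (32 - r));
    }

    // 32-bit MurmurHash3 (x86 variant) over a byte array.
    public static uint Hash32(byte[] data, uint seed)
    {
        const uint c1 = 0xcc9e2d51;
        const uint c2 = 0x1b873593;
        unchecked
        {
            uint h = seed;
            int length = data.Length;
            int blockEnd = length & ~3; // largest multiple of 4 <= length

            // Body: mix in one little-endian 4-byte block at a time.
            for (int i = 0; i < blockEnd; i += 4)
            {
                uint k = (uint)(data[i]
                              | data[i + 1] << 8
                              | data[i + 2] << 16
                              | data[i + 3] << 24);
                k *= c1;
                k = RotateLeft(k, 15);
                k *= c2;
                h ^= k;
                h = RotateLeft(h, 13);
                h = h * 5 + 0xe6546b64;
            }

            // Tail: the remaining 1 to 3 bytes, if any.
            uint t = 0;
            int rest = length & 3;
            if (rest == 3) t ^= (uint)data[blockEnd + 2] << 16;
            if (rest >= 2) t ^= (uint)data[blockEnd + 1] << 8;
            if (rest >= 1)
            {
                t ^= data[blockEnd];
                t *= c1;
                t = RotateLeft(t, 15);
                t *= c2;
                h ^= t;
            }

            // Finalization: mix the length in and avalanche the bits.
            h ^= (uint)length;
            h ^= h >> 16;
            h *= 0x85ebca6b;
            h ^= h >> 13;
            h *= 0xc2b2ae35;
            h ^= h >> 16;
            return h;
        }
    }
}
```

As a sanity check, an empty input with seed 0 hashes to 0, and the four bytes 0x21, 0x43, 0x65, 0x87 with seed 0 hash to 0xF55B516B. Comparing a handful of such values against another implementation, with inputs of length 0 to 4 so that every tail branch is hit, is the same kind of validation described above.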
Before I show you the spread and speed results, let me just talk a little about two very old, but very commonly used, hash functions.
CRC16 and CRC32
The Cyclic Redundancy Check (CRC) family of hashes is, compared to MurmurHash, very simple to implement. The algorithm walks through each byte in the input and divides it by a specially selected polynomial. For implementation purposes, the polynomial is simply a constant in the code and the division is done with XOR and shifting (making this a very efficient operation). The remainder of each byte division is fed into the division of the next byte (giving the hash algorithm its name), and so on until the end of the input.
Wikipedia contains a great reference implementation that I used for my tests. For readability, the source code is again available on a separate page of this blog: CRC16 and CRC32 in C#.
In the naïve approach to CRC, the polynomial division is implemented using an inner loop through all the bits of each byte:
rem = rem ^ data[i];
for (int j = 0; j < 8; j++)
{
    if ((rem & 0x00000001) == 0x00000001)
    {   // if rightmost (least significant) bit is set
        rem = (rem >> 1) ^ Polynomial32; /* Polynomial */
    }
    else
    {
        rem = (rem >> 1);
    }
}
An interesting optimization, shaving off a lot of cycles, is to create a table that allows you to look up the result of this loop for all 256 combinations of 8 bits. My source code contains such tables, both for CRC32 and CRC16 implementations. Using these tables, I can replace the above code fragment with:
rem = (rem >> 8) ^ CRCTable32[(rem & 0xff) ^ data[i]];
…A major optimization
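To show how the lookup table relates to the bit loop, here is a minimal, self-contained sketch that builds the table with that same loop and computes a CRC32 both ways. It is not the code from the linked page. It assumes the common reflected CRC-32 polynomial 0xEDB88320, an initial remainder of 0xFFFFFFFF and a final bit inversion; the class and method names are illustrative.

```csharp
using System;

public static class Crc32Sketch
{
    // Assumption: the reflected CRC-32 polynomial used by zlib and PNG.
    private const uint Polynomial32 = 0xEDB88320;

    private static readonly uint[] CRCTable32 = BuildTable();

    // Runs the 8-step bit loop once for every possible byte value.
    private static uint[] BuildTable()
    {
        var table = new uint[256];
        for (uint b = 0; b < 256; b++)
        {
            uint rem = b;
            for (int j = 0; j < 8; j++)
            {
                rem = (rem & 1) == 1 ? (rem >> 1) ^ Polynomial32 : rem >> 1;
            }
            table[b] = rem;
        }
        return table;
    }

    // Naive version: eight shift/XOR steps per input byte.
    public static uint ComputeBitwise(byte[] data)
    {
        uint rem = 0xFFFFFFFF;
        for (int i = 0; i < data.Length; i++)
        {
            rem ^= data[i];
            for (int j = 0; j < 8; j++)
            {
                rem = (rem & 1) == 1 ? (rem >> 1) ^ Polynomial32 : rem >> 1;
            }
        }
        return ~rem;
    }

    // Table version: one lookup per input byte.
    public static uint ComputeTable(byte[] data)
    {
        uint rem = 0xFFFFFFFF;
        for (int i = 0; i < data.Length; i++)
        {
            rem = (rem >> 8) ^ CRCTable32[(rem & 0xff) ^ data[i]];
        }
        return ~rem;
    }
}
```

Under these assumptions both methods return 0xCBF43926 for the ASCII bytes of "123456789", which is the standard check value for this CRC-32 variant. The table costs 1 KB of memory and replaces eight conditional steps per byte with a single array lookup.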
Speed Results
At this point, you may wonder if I did all of this to give you a chance to laugh at my rusty coding skills. What exactly was the point of implementing a new hash function in the first place?
Without further ado, let us look at the results compared to native SQL Server functions:
Isn’t that beautiful? If we don’t need the cryptographic properties of the HASHBYTES functions, we can beat SQL Server’s hashes with our own SQLCLR implementation. Please note that I took out MD2 from the above to avoid skewing the results (I have previously shown that it is simply too inefficient).
Now, the bad news: the yellow line above. That represents a “hash function” that returns zero, like this:
public static SqlInt32 NoHash(SqlBinary data)
{
    return (SqlInt32)0;
}
Notice that even such a simple user-defined function, which does nothing, consumes almost the same number of CPU cycles as a full hash calculation. This is the overhead of the CLR and of throwing data back and forth between managed and unmanaged code. So much for my optimizations in CRC.
Spread Results
We now know that there is at least hope of implementing a faster hash function than HASHBYTES. But how well do these functions spread the input? Running the chi-squared test from my previous blog entry, we get these results:
Now THAT is pretty nice, isn’t it? We see that good old CRC makes for an excellent hash function for our purpose. MurmurHash does very well too, comparable with SHA, but at under half the CPU cost.
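For readers who have not seen the previous entry, the idea behind the spread measurement can be sketched as follows. This is not the exact test used for the results above; it is a minimal illustration that assumes hash values are assigned to buckets by modulo and compared against a perfectly uniform expectation. The names `SpreadSketch`, `Bucketize` and `ChiSquared` are hypothetical.

```csharp
using System;

public static class SpreadSketch
{
    // Counts how many hash values fall into each bucket (hash modulo bucket count).
    public static long[] Bucketize(uint[] hashes, int buckets)
    {
        var counts = new long[buckets];
        foreach (uint h in hashes)
        {
            counts[h % (uint)buckets]++;
        }
        return counts;
    }

    // Pearson chi-squared statistic of the observed bucket counts against
    // a uniform expectation. 0 means a perfectly even spread; larger
    // values mean more skew.
    public static double ChiSquared(long[] bucketCounts)
    {
        long total = 0;
        foreach (long c in bucketCounts)
        {
            total += c;
        }
        double expected = (double)total / bucketCounts.Length;
        double chi2 = 0.0;
        foreach (long c in bucketCounts)
        {
            double diff = c - expected;
            chi2 += diff * diff / expected;
        }
        return chi2;
    }
}
```

Four buckets holding 10 values each give a statistic of 0, while the same 40 values all landing in one bucket give 120. A hash function with good spread should stay close to the low end of that range on real data.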
Summary
In this blog, I have shown you how to implement hash functions that are both faster and have better spread than the built-in hash functions exposed by SQL Server’s HASHBYTES. We have also seen that the overhead of SQLCLR is significant for this case, and hence it would be preferable if non-cryptographic hash functions were exposed natively in SQL Server.
If you care about such new hash functions for SQL Server, I would suggest you file a Connect item to ask for them and state your purpose for wanting them. “It would be cool” or “Thomas says so” is not enough; provide a business justification. But at least, now you know what to ask for…
I owe Christian Martinez thanks for directing me to the MurmurHash functions, and for helping me validate the correctness of my C# implementation against his C++ version.
PS: As I was exploring this little implementation, I noticed that one of my heroes, Donald Knuth, has extended The Art of Computer Programming with a new volume: Volume 4, Fascicle 1: Bitwise Tricks & Techniques; Binary Decision Diagrams. That sounds like an interesting read and it is now on my birthday wish list.
The post Implementing MurmurHash and CRC for SQLCLR appeared first on Fighting Bad Data Modeling.