Tuesday, March 23, 2010

Checksum Vs. Hashbytes

What is Checksum?
Checksum is a function available in sql server 2005 and sql server 2008. It is intended to build a hash index based on an expression or column list.

Determining if two rows or expressions are equal can be a difficult and resource intensive process. This can be the case where the update was conditional based on all of the columns being equal or not for a specific row. Without checksums it is a long process of using innerjoins to identify the exact row to update in the update statements.

When is Checksum used?
When you need to compare the unique characteristics of an expression, columns or a table consider using the Checksum function.

When you have to update a row where many columns are compared to determine if the data is unique, use CHECKSUM function to build a unique value using checksum function and then compare the CHECKSUM values.


What is hashbytes function?
HashBytes function was introduced in SQL server 2005 to simplify creating hashes in the database. It can convert values to MD2, MD4, MD5, SHA, or SHA1 formats.

When can you use a hash function?
Hash functions are a useful option to improve the efficiency of particular queries on large volumes of data. For example, on long strings of text, you can build a hash-index to perform efficient lookups or to speed up aggregate operations.

Differences between checksum and Hashbytes

Faster but can produce lot of duplicate values
Slower but efficient than checksum
Returns an int value
Returns a varbinary(8000)
Microsoft does NOT recommend using CHECKSUM for change detection purposes

Use Hashbytes for change detection purposes

Step By Step Guide to resolve SQL server connection issues when connecting from R Studio

Yesterday I tried using a couple of  RevolveScaleR functions -- RxSqlServerData and RxImport.in Rstudio.  These are my learnings as part of ...