Oracle offers several hashing methods. In this blog post I review the four main types. I test the speed of inserts with hashing and which ones stand up to collisions (where two inputs hash to the same output).
The demonstration is being run agaisnt Oracle 11gR2.
This function generates hash values for an expression. A number is returned. Hash collisions are highly likely as the output is only 32 bytes.
Function gives PL/SQL developers access to a hashing algorithm. Things to note;
Use a small number for the base parameter to establish the low point of the range of values for the hash.
Use a very large hash size number. Oracle recommends a power of 2, for best results.
This package is deprecated since Oracle 11gR2, but we include it here for completeness.
The package function is overloaded for RAW, BLOB and CLOB. In addition it accepts 3 different cryptographic hash algorithms (MD4, MD5 and SHA-1). It returns RAW datatype of 128 bytes for MD4 and MD5 and 160 bytes for SHA-1.
Let’s start the demonstration. For each test I'll be inserting 10 million hashed records into a table.
SQL> set timing on
First we'll create a simple table called "A" which contains one column called "HSH" with a datatype of VARCHAR2 (45).
SQL> create table a (hsh varchar2(45))
Table created.
SQL> create index hsh_ix on a (hsh)
Index created.
SQL> analyze table a compute statistics
Table analyzed.
Let’s look at ORA_HASH first. We will insert 10 million records into the table setting HSH using this HASH function. Details on the select statement that generates the 10 million records can be found in one of my previous blogs: I Need More Data Scotty!
SQL> insert /*+ APPEND */ into a
select ora_hash(x)
from ( select dbms_random.string('X',10)||lpad(rownum,10,'0') as x
from (select rownum from dual connect by rownum <= 1000) a
, (select rownum from dual connect by rownum <= 1000) b
, (select rownum from dual connect by rownum <= 1000) c
where rownum <= 10000000
)
10000000 rows created.
Elapsed: 00:24:45.51
SQL> commit
Commit complete.
Pretty quick. Just shy of 25 minutes to generate/insert 10 million records. Let’s see if there were any collisions though.
SQL> analyze table a compute statistics
Table analyzed.
SQL> with collision as
(select hsh, count(*)
from a
group by
hsh having count(*) > 1
)
select count(hsh) collisions, round((count(hsh)/10000000)*100,2) as pct_collisions
from collision
COLLISIONS PCT_COLLISIONS
---------- --------------
11611 0.12
1 row selected.
Out of 10 million records 11611 collisions occurred, or about 0.1%.
Next we’ll run the same test but this time using DBMS_UTILITY.GET_HASH_VALUE
SQL> truncate table a
Table truncated.
SQL> analyze table a compute statistics
Table analyzed.
SQL> insert /*+ APPEND */ into a
select dbms_utility.get_hash_value(x,0,power(2,20))
from ( select dbms_random.string('X',10)||lpad(rownum,10,'0') as x
from (select rownum from dual connect by rownum <= 1000) a
, (select rownum from dual connect by rownum <= 1000) b
, (select rownum from dual connect by rownum <= 1000) c
where rownum <= 10000000
)
10000000 rows created.
Elapsed: 00:25:34.27
SQL> commit
Commit complete.
Again quick, at just over 25 minutes. And the collisions?
SQL> analyze table a compute statistics
Table analyzed.
Elapsed: 00:01:53.54
SQL> with collision as
(select hsh, count(*)
from a
group by
hsh having count(*) > 1
)
select count(hsh) collisions, round((count(hsh)/10000000)*100,2) as pct_collisions
from collision
COLLISIONS PCT_COLLISIONS
---------- --------------
1047776 10.48
1 row selected.
A considerable number of the records had collisions. about 10%.
Next up DBMS_OBFUSCATION_TOOLKIT package. This has a cryptographic hash algorithm MD5 function.
SQL> truncate table a
Table truncated.
SQL> analyze table a compute statistics
Table analyzed.
SQL> insert /*+ APPEND */ into a
select dbms_obfuscation_toolkit.md5(input=>UTL_RAW.cast_to_raw(x))
from ( select dbms_random.string('X',10)||lpad(rownum,10,'0') as x
from (select rownum from dual connect by rownum <= 1000) a
, (select rownum from dual connect by rownum <= 1000) b
, (select rownum from dual connect by rownum <= 1000) c
where rownum <= 10000000
)
10000000 rows created.
Elapsed: 00:32:36.20
SQL> commit
Commit complete.
Just over 30 minutes to run. Not bad. And how many collisions.
SQL> analyze table a compute statistics
Table analyzed.
Elapsed: 00:01:53.54
SQL> with collision as
(select hsh, count(*)
from a
group by
hsh having count(*) > 1
)
select count(hsh) collisions, round((count(hsh)/10000000)*100,2) as pct_collisions
from collision
COLLISIONS PCT_COLLISIONS
---------- --------------
0 0
1 row selected.
Zero collisions when precessing 10 million hashed records.
Finally we’ll look at the DBMS_CRYPTO package.
SQL> truncate table a
Table truncated.
SQL> analyze table a compute statistics
Table analyzed.
We’ll look at cryptographic hash algorithm MD4 first.
SQL> insert /*+ APPEND */ into a
select dbms_crypto.hash(utl_raw.cast_to_raw(x),1) -- HASH_MD4
from ( select dbms_random.string('X',10)||lpad(rownum,10,'0') as x
from (select rownum from dual connect by rownum <= 1000) a
, (select rownum from dual connect by rownum <= 1000) b
, (select rownum from dual connect by rownum <= 1000) c
where rownum <= 10000000
)
10000000 rows created.
Elapsed: 00:31:49.56
SQL> commit
Commit complete.
As you can see the crypto package runs at a similar speed to DBMS_OBFUSCATION_TOOLKIT.
SQL> analyze table a compute statistics
Table analyzed.
SQL> with collision as
(select hsh, count(*)
from a
group by
hsh having count(*) > 1
)
select count(hsh) collisions, round((count(hsh)/2000000)*100,2) as pct_collisions
from collision
COLLISIONS PCT_COLLISIONS
---------- --------------
0 0
1 row selected.
Zero collisions occurred. Next using MD5 cryptographic hash algorithm.
SQL> truncate table a
Table truncated.
SQL> analyze table a compute statistics
Table analyzed.
SQL> insert /*+ APPEND */ into a
select dbms_crypto.hash(utl_raw.cast_to_raw(x),2) -- HASH_MD5
from ( select dbms_random.string('X',10)||lpad(rownum,10,'0') as x
from (select rownum from dual connect by rownum <= 1000) a
, (select rownum from dual connect by rownum <= 1000) b
, (select rownum from dual connect by rownum <= 1000) c
where rownum <= 10000000
)
10000000 rows created.
Elapsed: 00:32:21.01
SQL> commit
Commit complete.
About the same time, at just over 30 minutes.
SQL> analyze table a compute statistics
Table analyzed.
Elapsed: 00:01:53.54
SQL> with collision as
(select hsh, count(*)
from a
group by
hsh having count(*) > 1
)
select count(hsh) collisions, round((count(hsh)/10000000)*100,2) as pct_collisions
from collision
COLLISIONS PCT_COLLISIONS
---------- --------------
0 0
1 row selected.
Again zero collisions for 10 million hashes. Finally SHA-1 cryptographic hash algorithm.
SQL> truncate table a
Table truncated.
SQL> analyze table a compute statistics
Table analyzed.
SQL> insert /*+ APPEND */ into a
select dbms_crypto.hash(utl_raw.cast_to_raw(x),3) -- HASH_SH1
from ( select dbms_random.string('X',10)||lpad(rownum,10,'0') as x
from (select rownum from dual connect by rownum <= 1000) a
, (select rownum from dual connect by rownum <= 1000) b
, (select rownum from dual connect by rownum <= 1000) c
where rownum <= 10000000
)
10000000 rows created.
Elapsed: 00:32:08.91
SQL> commit
Commit complete.
Again, around the 32 minutes mark.
SQL> analyze table a compute statistics
Table analyzed.
Elapsed: 00:01:53.54
SQL> with collision as
(select hsh, count(*)
from a
group by
hsh having count(*) > 1
)
select count(hsh) collisions, round((count(hsh)/10000000)*100,2) as pct_collisions
from collision
COLLISIONS PCT_COLLISIONS
---------- --------------
0 0
1 row selected.
And again zero collisions.
Let’s tidy up.
SQL> drop table a
Table dropped.
The Oracle 11g database I'm using is shared by other developers, so I ran these tests three times during the day to obtain an average time for each load. The averages were pretty much identical to the timings published above.
DBMS_OBFUSCATION_TOOLKIT and DBMS_CRYPTO are far less likely to produce hashing collisions. DBMS_OBFUSCATION_TOOLKIT is no longer supported in 11gR2 so should be avoided for this reason. Both packages have a cost to performance. You need to weigh up what’s more important, fast inserts with greater chance of failure or slower inserts with a more robust hashing method. Personally, I would choose the slower option and go with the SHA-1 cryptographic hash algorithm in DBMS_CRYPTO as it’s the least likely to ever experience collisions. Remember it far easier to avoid collisions than it is to fix them.
In Oracle 12c a new hashing function has been introduced called STANDARD_HASH, which is meant to be as collision resistant as DBMS_CRYPTO but with improved performance. I’ll be putting this new function to the test in a future blog.