When it comes to hashing, which is faster: RAW or VARCHAR2? I decided to run a few tests to find out. In this blog post I compare the two datatypes for insert performance and disk consumption, and I also touch on read performance when the hashed columns are used to join tables.
I had already recorded the insert times for VARCHAR2 in an earlier post entitled I’ve Avoided the Collision, but I’m Stuck in the Slow Lane, so I could use those results as my baseline. All I needed to do was re-run my scripts, this time inserting into a RAW column.
Let's prepare the environment.
SQL> set timing on
SQL> set autotrace on
Autotrace Enabled
Shows the execution plan as well as statistics of the statement.
SQL> create table a (hsh RAW(32));
Table A created.
SQL> create index hsh_ix on a (hsh);
Index HSH_IX created.
SQL> analyze table a compute statistics;
Table A analyzed.
I have created a new table called "A" with a column named "HSH" of datatype RAW(32), 32 being the length of the MD5 digest when expressed as hexadecimal characters. We will insert 10 million hash records into table "A".
SQL> insert /*+ APPEND */ into a
2 select standard_hash(x,'MD5')
3 from ( select dbms_random.string('X',10)||lpad(rownum,10,'0') as x
4 from (select rownum from dual connect by rownum <= 1000) a
5 , (select rownum from dual connect by rownum <= 1000) b
6 , (select rownum from dual connect by rownum <= 1000) c
7 where rownum <= 10000000
8 );
10,000,000 rows inserted.
Elapsed: 00:12:37.414
SQL> commit;
Commit complete.
If we compare this to the same insert into a VARCHAR2 column, it’s about 5 minutes faster.
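For reference, the VARCHAR2 baseline from the earlier post followed the same pattern. Here is a minimal sketch of it; the table name and the RAWTOHEX conversion are my assumptions, with the column sized for the 32-character hex form of an MD5 digest (the exact scripts are in that post).

-- Hypothetical sketch of the VARCHAR2 baseline run.
create table a_vc (hsh VARCHAR2(32));

insert /*+ APPEND */ into a_vc
select rawtohex(standard_hash(x,'MD5'))
from ( select dbms_random.string('X',10)||lpad(rownum,10,'0') as x
       from (select rownum from dual connect by rownum <= 1000) a
          , (select rownum from dual connect by rownum <= 1000) b
          , (select rownum from dual connect by rownum <= 1000) c
       where rownum <= 10000000
     );
commit;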
I conducted the same test for the remaining STANDARD_HASH methods, increasing the size of the HSH column to match the digest size of the algorithm under test. To get an accurate timing I ran each test 3 times, and from this data calculated an average load speed for each algorithm. The results can be seen in the table below.
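Each variant followed the same pattern as the MD5 run. As a sketch, the SHA512 version would look something like this, with the column sized to the 128-character hex digest length to match the convention above.

-- SHA512 variant: only the column size and the algorithm name change.
create table a (hsh RAW(128));
create index hsh_ix on a (hsh);

insert /*+ APPEND */ into a
select standard_hash(x,'SHA512')
from ( select dbms_random.string('X',10)||lpad(rownum,10,'0') as x
       from (select rownum from dual connect by rownum <= 1000) a
          , (select rownum from dual connect by rownum <= 1000) b
          , (select rownum from dual connect by rownum <= 1000) c
       where rownum <= 10000000
     );
commit;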
From the results it’s clear that using a RAW datatype not only improves performance but also greatly reduces disk consumption.
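One way to measure that disk consumption, assuming the objects live in your own schema, is to query USER_SEGMENTS before tidying up:

-- Space used by the table and index segments, in megabytes.
select segment_name
     , segment_type
     , round(bytes/1024/1024) as mb
from user_segments
where segment_name in ('A','HSH_IX');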
Let's tidy up.
SQL> drop table a;
Table A dropped.
As suspected, using a RAW datatype makes sense over a VARCHAR2 for both insert operations and disk consumption. What was most surprising was how well the newer hashing algorithms of SHA256, SHA384 and SHA512 performed, the latter being almost an hour quicker than its VARCHAR2 equivalent, with nearly 800 MB of disk space saved when inserting 10 million records.
I also wanted to see whether VARCHAR2 or RAW hashed columns had any impact on read performance when joining on those columns. I ran a simple test: I created three tables, as shown in the diagram below, with table A joined to table B via table C using hashed columns.
For both tests I used the SHA256 hashing algorithm, which has a digest length of 64 hexadecimal characters. In the first test each table had a VARCHAR2 hash column, and in the second a RAW hash column. I used the data dictionary to populate each table as follows (a sketch of the RAW variant appears after this list).
TABLE A - sourced from ALL_TABLES, hashing OWNER and TABLE_NAME concatenated. About 25,000 records were inserted.
TABLE B - sourced from ALL_TAB_COLS, hashing each distinct COLUMN_NAME. About 60,000 records were inserted.
TABLE C - sourced from ALL_TABLES joined to ALL_TAB_COLS on OWNER and TABLE_NAME. The column joining to table A held a hash of OWNER and TABLE_NAME; the column joining to table B held a hash of COLUMN_NAME. About 90,000 records were inserted.
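Here is a sketch of how the RAW variant could be built; the table and column names are illustrative, and the VARCHAR2 variant simply swaps RAW(64) for VARCHAR2(64) and wraps each STANDARD_HASH call in RAWTOHEX.

-- Illustrative setup for the RAW test; names are assumptions.
create table a (hsh RAW(64));
create table b (hsh RAW(64));
create table c (a_hsh RAW(64), b_hsh RAW(64));

-- Table A: one hash per table, built from OWNER and TABLE_NAME concatenated.
insert into a
select standard_hash(owner||table_name,'SHA256')
from all_tables;

-- Table B: one hash per distinct column name.
insert into b
select standard_hash(column_name,'SHA256')
from (select distinct column_name from all_tab_cols);

-- Table C: ALL_TABLES joined to ALL_TAB_COLS, carrying both hashes so it
-- can join back to tables A and B.
insert into c
select standard_hash(t.owner||t.table_name,'SHA256')
     , standard_hash(tc.column_name,'SHA256')
from all_tables t
join all_tab_cols tc
  on tc.owner = t.owner
 and tc.table_name = t.table_name;

commit;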
Comparing the two datatypes, I could see no difference in performance. I ran each test several times to obtain an average, and the averages were almost identical for both datatypes.
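For reference, the kind of join I timed looked something like the query below; the column names follow the setup sketch above and are assumptions on my part.

-- Join table A to table B via table C on the hashed columns.
select count(*)
from a
join c on c.a_hsh = a.hsh
join b on b.hsh = c.b_hsh;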