Introduction
The term "hash" is thrown around in casual IT conversation quite a bit nowadays, from "pass the hash" to "salted hashes" to "would you like hot sauce with your hash?"
Certainly there are many useful things that one can do with a hash, including:
- Perform a one-way conversion of a password into a hash string.
- Identify a malicious file (or a safe file) via its MD5, SHA1, or SHA256 hash.
- Hide functionality within malware via hashed DLL calls and strings.
But for a Computer Science geek, it is also worthwhile to consider the use of a hash to facilitate a data structure that enables fast lookups, namely: the hash table. Why fast lookups? Simply this: one critical part of cyber threat hunting is searching through thousands or millions of DNS queries for suspicious and/or malicious indicators of compromise. If one search algorithm beats another by a factor of ten or more, search times can be reduced from hours to minutes, or minutes to seconds. And time is always of the essence when threat hunting.
To Hash or not to Hash
Yes, Shakespeare had it right all those years ago. But what is the question? What purpose does hashing serve if we are interested in speed and efficiency?
The purpose being discussed here is searching. The efficiency of a search depends on how the search data is organized. This organization is based on two things: the data structure used to store the data and the way the data is placed into the data structure.
Let's examine just three data structures so we can compare their differences. In each case let's store the same group of ten integers:
2 5 17 28 41 60 75 83 127 213
We will not actually be searching for integers when we get to the meat of this blog, but using integers in our example helps convey the search concepts we are examining.
Our three data structures are as follows:
- Array: This data structure stores data items in consecutive memory locations and is typically searched by starting with the first item in the array and moving to the next item until the desired data is found. The search ends when the data is found or the last item in the array is reached.
- Binary Tree: In this data structure, data items are stored in nodes, and the data is placed into the tree in a specific way, so that at every node in the tree, nodes containing smaller data values are reachable from the left child, and nodes containing larger data values are reachable from the right child. Organizing the data this way reduces the number of comparisons that need to be made to locate the data item being searched for (if it exists in the tree).
- Hash Table: For this blog, this data structure is our shining star. The really cool (and fast) thing about looking data up in a hash table is this: the search goes directly to the location within the hash table where the data is stored (or not stored). We do not waste precious time looking through one location after another, as we do in an array, or even the reduced number of nodes we have to examine in a binary tree. We just look at one location.
Ok, I'm lying about that "one location" claim just a teensy bit, but it's a harmless white lie as we will soon see. First, take a look at the following figure, which illustrates how our example set of integers is stored in all three data structures:
Let's compare these three data structures by searching for the value 213. We just want to know if it is stored somewhere in the data structure and how much effort is required to locate it.
With the array, we start at index 0. There is a 2 stored there. That does not equal 213, so we advance to index 1. That contains a 5. Moving on. This process goes on until we get to index 9, where we finally find the value 213. This is a simple linear search whose worst case occurs when we have to search through every element of the array. It is simple to implement and relatively simple to insert or delete elements, but it has a second characteristic that makes it less attractive for many applications: you have to know how many elements need to be stored ahead of time. In this example, it is not possible to store an eleventh value in the array.
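For reference, the linear search just described can be sketched in a few lines of C (an illustrative sketch, not the actual code used later in this post):

```c
#include <assert.h>
#include <stddef.h>

/* Linear search over an array: compare each element in turn until the
   value is found or the end of the array is reached.
   Returns the index of the value, or -1 if it is not present. */
int linear_search(const int *arr, size_t len, int value)
{
    for (size_t i = 0; i < len; i++) {
        if (arr[i] == value)
            return (int)i;
    }
    return -1;
}
```

Searching our ten-integer example for 213 walks all ten elements before returning index 9.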
This storage limitation can be overcome by replacing the array with a linked list, but this complicates the search, insert, and delete operations, and does not change the worst-case search time.
Moving on to the binary tree, we begin searching for the value 213 with the root node (which contains 75). The way a binary tree is searched is quite simple: if the current node contains the data value you are searching for, you are finished. Otherwise, you continue searching on the left child node if the search value is smaller than the current node value, or through the right child node if the search value is larger than the current node value.
So, referring to our example binary tree, beginning with the root node we have a 75 stored, which is less than the value 213 being searched for, so we go to the right child node. This node contains the value 127, which is also smaller than 213, so on we go to the next right-child node. This brings us to the node storing 213, so our search is finished. If this node did not contain 213, the search would end as there are no children off this node.
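The tree search described above can be sketched like so (again an illustrative sketch; the node layout and names are mine):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A minimal binary search tree node. */
struct node {
    int value;
    struct node *left;
    struct node *right;
};

/* Search the tree: stop if the current node holds the value, otherwise
   descend left (search value smaller) or right (search value larger). */
bool tree_contains(const struct node *n, int value)
{
    while (n != NULL) {
        if (value == n->value)
            return true;
        n = (value < n->value) ? n->left : n->right;
    }
    return false;
}
```

Following the 75 → 127 → 213 path from the example takes three comparisons instead of ten.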
So, the array's linear search required 10 comparisons to find the value 213 and the binary tree required 3 comparisons. While you might be thinking "big deal James, get on with it," please think bigger for a second.
If the array had 1023 numbers stored in it, the worst-case linear search would require 1023 comparisons, while the worst-case search of a balanced binary tree would require only 10 comparisons. That's a terrific savings in search time using the binary tree.
But the hash table can do even better.
Take another look at the figure and examine the hash table. This hash table contains eight indexes and each index stores up to two data values (a – indicates nothing is stored). The index used to access the hash table is the lower three bits of the data value being searched for. This is the cool thing about a hash value… because it is ultimately numeric, we can use a portion of it as a pointer into our data structure. So, continuing our example of searching for 213, we first look at what 213 equals in binary, which is 11010101. The last three bits are 101, so this is the index we use to access the hash table. The first value stored at this index is 5, so we move on to the second value, which does contain 213, so our search is successful. This only required two comparisons.
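Extracting the index is just a bit-mask. A tiny sketch (the function name is mine):

```c
#include <assert.h>

/* The hash-table index is the lower three bits of the value, i.e. the
   value mod 8. For 213 (binary 11010101) that is 101, or index 5. */
unsigned low3_index(unsigned value)
{
    return value & 0x7;   /* keep bits 2..0 */
}
```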
In fact, any search of this hash table will require, at most, two comparisons.
See, this is where my teensy little lie creeps in, where I claimed we only look at "one location." We might have to look at two instead. Or more, depending on how the hash table is organized.
However, I can redeem myself by saying this: If we develop a better hash algorithm for converting the data values (which in turn will lead to better index distribution), and also increase the hash table size from 8 rows to 10 or more rows, we could, theoretically (and often practically) get by with just one data value stored at each index and require just a single location to be examined during a search.
I say theoretically because (1) it is not so easy to come up with a good hash algorithm, and (2) you don't really know you've got a good hash algorithm until you investigate its collision properties.
Collisions, Nano-bots, and Giant Mosquitoes
Ok, I'm really only going to talk about collisions, but my friend and co-worker Shawn Kanady mentioned the nano-bots and giant mosquitoes when we were discussing time travel and it cracked me up. FYI, check out Shawn's recent blog PoSeidon Adventures in Memory ( https://www.trustwave.com/Resources/SpiderLabs-Blog/PoSeidon-Adventures-in-Memory ) for some excellent tips on memory-based malware analysis.
So, what's a collision? Simply put, when two different data items hash to the same hash value, that is a collision. This in itself is not a bad thing, as long as there are provisions built into the hash table to handle collisions. For example, looking at our example hash table, we see there are eight data values that result in collisions. But we can handle these collisions by storing up to two data values at each hash table index.
If three different data values hashed to the same index, we would have a problem. Not an impossible problem to solve, but still a problem. Again, we need to run experiments on our hash algorithm to see how it performs by throwing lots of data values at it (or strings, or whatever we are hashing) and seeing how many collisions occur.
And this, finally, is where I'm taking this blog post. While I was minding my own business and playing with search expressions in Splunk Lite, I had an intrusive thought that would not go away, which was: "Would a hash table created from hashed malware domains speed up a search if you had thousands of domains to check?"
Ok, ok. Those who know the answer is yes take one step forward. <blaring horn> Whoops, I didn't say "Simon Says."
The answer is yes, but how much it speeds things up is unknown until the software gets written and the chase begins.
Things to do:
- Choose the programming language. I decided to give my Visual Studio 2012 installation a workout as I've ignored it for a long time and I'm scared of snakes (Python). Actually I think Python is cool, but I'm writing log analysis software using it and wanted to spread the fun around.
- Find a nice list of malware domains. Can I even provide the link here? Not really sure, but the website I used provided me with almost 1900 malware domains.
- Write the hashing code and keep track of collisions as I tweak the hash algorithm. Make a big table of results.
- Identify the best hash algorithm parameters.
- Find a very large list of domains. Sprinkle some malware domains from the first list into this list.
- Write more code to read the very large list of domains one by one and check them against the malware domain list. This is done two ways: (1) by hashing each domain and doing a hash lookup, and (2) by performing a string comparison of each domain against every malware domain.
- Draw conclusions and write blog.
And Away We Go
Here is my hash algorithm:
The basic hashing process is to rotate the 32-bit hashval integer one or more bits left (depending on the value of ROTCOUNT) and then XOR the lower byte of hashval with the ASCII code of the current maldomain string symbol. FYI, I made the initial value of hashval equal to 7 because I like the number 7.
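The original source isn't reproduced here, but based on that description the hash routine presumably looks something like this C sketch (the function name is mine; ROTCOUNT and the initial value of 7 come from the text):

```c
#include <assert.h>
#include <stdint.h>

#define ROTCOUNT 6   /* bits rotated per symbol; 6 was the eventual winner */

/* Rotate-and-XOR string hash as described: rotate the 32-bit hash value
   left by ROTCOUNT bits, then XOR its low byte with the ASCII code of
   the next symbol of the domain string. The initial value is 7. */
uint32_t hash_domain(const char *maldomain)
{
    uint32_t hashval = 7;
    for (const char *p = maldomain; *p != '\0'; p++) {
        /* 32-bit left rotation */
        hashval = (hashval << ROTCOUNT) | (hashval >> (32 - ROTCOUNT));
        hashval ^= (uint32_t)(unsigned char)*p;
    }
    return hashval;
}
```

Because each character is rotated into a different bit position before the next XOR, transposed characters produce different hash values, which a plain XOR of all characters would not.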
Here is an example of the program output at this point:
Well, the hash values look different, so that's a relief. Let's check out the end of the program output to see our stats:
Wow. Let's think about what these results mean. The hash table being used here is defined like this:
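The definition isn't shown here, but from the numbers that follow (1024 indexes, HSIZE 4-byte values per index) it is presumably something like this sketch (TSIZE is my name for the index count):

```c
#include <assert.h>
#include <stdint.h>

#define TSIZE 1024   /* number of hash-table indexes (lower 10 bits of hashval) */
#define HSIZE 2      /* data values stored per index */

/* Each entry holds a full 32-bit hash value, so the table occupies
   1024 * 2 * 4 = 8192 bytes. */
uint32_t hashtab[TSIZE][HSIZE];
```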
Right off the bat you may wonder why there are so many collisions when the number of malware domains (1891) is less than the number of storage locations in the hash table (two for every index, or 2048). That's a great thing to question. The answer is that the hash algorithm needs to be improved. This is made quite clear by the "Max table depth = 93" statistic. The HSIZE parameter is currently set at 2. If the hash table needed to have storage for 93 data values at each index, the storage space of the hash table would equal 1024 * 93 * 4 = 380,928 bytes. It may be interesting to note that the size of the malware domain text file is only 40,702 bytes, so this would not be an efficient use of memory, and a table depth of 93 would not have as fast an access time as a table with fewer data values for each index.
I am also using the lower 10 bits of hashval as the index pointer into hashtab (as a reminder: 2^10 = 1024). Perhaps the results would improve if I used the upper 10 bits instead, or even some other combination of 10 bits from hashval. I could also increase the number of indexes to 2048 and see what happens (and therefore use 11 bits of hashval as the index).
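Selecting different index bits from hashval is just shifting and masking; a quick sketch (function names are mine):

```c
#include <assert.h>
#include <stdint.h>

/* Take the lowest nbits bits of the 32-bit hash value as the index. */
uint32_t lower_bits(uint32_t hashval, int nbits)
{
    return hashval & ((1u << nbits) - 1);
}

/* Take the highest nbits bits of the 32-bit hash value as the index. */
uint32_t upper_bits(uint32_t hashval, int nbits)
{
    return hashval >> (32 - nbits);
}
```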
The number of bits that hashval is rotated before the next XOR operation will also affect the hash values that are generated, and thus also affect the distribution within the hash table and the number of collisions.
So, I spent a good deal of tedious time trying all sorts of combinations of these parameters. Now you may understand why Shawn and I were talking about space travel at nearly the speed of light, nano-bots, and giant mosquitoes. You gotta pass the time somehow.
Here are the results:
| Bits Rotated | Lower 10 bits: Collisions | Lower 10 bits: Max Depth | Upper 10 bits: Collisions | Upper 10 bits: Max Depth | Lower 11 bits: Collisions | Lower 11 bits: Max Depth | Upper 11 bits: Collisions | Upper 11 bits: Max Depth |
|---|---|---|---|---|---|---|---|---|
| 1 | 1033 | 93 | 1462 | 666 | 810 | 51 | 1377 | 539 |
| 2 | 733 | 56 | 598 | 28 | 512 | 56 | 348 | 24 |
| 3 | 545 | 33 | 576 | 19 | 286 | 23 | 303 | 18 |
| 4 | 541 | 16 | 572 | 58 | 267 | 10 | 322 | 57 |
| 5 | 549 | 38 | 472 | 6 | 286 | 37 | 185 | 5 |
| 6 | 481 | 6 | 503 | 8 | 201 | 4 | 240 | 7 |
| 7 | 514 | 8 | 542 | 14 | 238 | 6 | 258 | 9 |
| 8 | 1023 | 11 | 1383 | 19 | 578 | 6 | 1077 | 19 |
There is a clear winner when the lower 11 bits of hashval are used as the index and 6 bits of rotation are used in the hash algorithm. Note that these parameters are unique to the content of the malware domain list. Given a different set of malware domains, the algorithm and index bits would need to be re-evaluated.
When You Don't Pay Your Bill
In step 5 of my to-do list, I needed to find a very large list of domains to use as search items for my hashed (and unhashed) malware domain list. I found two websites that specialize in listing expired domains (hey, sometimes I miss my domain renewal bill too), and by combining the domains listed on both sites I ended up with 301,681 domain names.
Although I would like an even larger list (yes, I'll tell you the real reason shortly), that's good enough to do a comparison test.
I picked ten random malware domains from the malware domain list and sprinkled them into the very large list at different places. Then I wrote two new sections of code that read the very large list of domains one by one and first used a hash lookup, then a string comparison against the malware domain list. Here are the results:
Alrighty then, I have to admit I am pleased that the only hash table hits showing up are for the ten malware domains I randomly chose and inserted into the very large domain list. The time of zero seconds is also nice, although I wish I could get a more granular time indication, such as how many milliseconds the search took. Even so, less than one second seems pretty quick for searching over 300,000 domains.
Switching to a string-comparison search yields these results:
Ah, there's the payoff! Since it seems silly to say that the string search is infinitely longer than the hash table search (6/0 = infinity, although a true mathematician will say, rather dryly I might add, it is "undefined"), let's just pretend the hash table search took one second, so the string search took six times longer.
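For contrast with the hash lookup, the string-comparison search amounts to the classic linear scan; a minimal sketch (names are mine, not the original code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Worst-case O(n) per lookup: compare the queried domain against every
   malware domain string until a match is found or the list ends. */
bool string_lookup(const char *domain, const char *const *mallist, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(domain, mallist[i]) == 0)
            return true;
    }
    return false;
}
```

Every non-malicious domain (the vast majority) pays the full cost of scanning the entire malware list, which is exactly where the six-fold slowdown comes from.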
James, Why Do You Bother?
Come on! Really? Isn't it worth the time invested in this analysis to review what one knows about hashing, data structures, and programming in order to apply it to a problem that needs to be solved? (Yes.) Isn't there a practical application for this? (Yes.) Isn't this just fun? (Yes… but I am a geek and have different views on fun).
Going on an excursion into Computer Science land to test the effectiveness of hash tables sure sounds like a pleasant vacation (except for the giant mosquitoes, which are easily seen, unlike the nano-bots, which are practically invisible but most likely itch when they get on you).
But why would I really do all this? I briefly mentioned searching DNS logs in the Introduction. Right now I am immersed in Threat Hunting research, looking for indicators of compromise showing that malware has made its way into one or more computers on a network. Many types of malware communicate and perform other activities (such as data exfiltration) using DNS. Even a relatively small network of only 500 computers (yes, there's no formal definition of small in this case, but 500 computers is small compared to 1,000,000, so there you have it) will generate lots of DNS requests, so a DNS log of domain requests could easily, and quickly, grow to hundreds of thousands of entries or more.
So I wondered how much of a speed benefit could be found if a hash table was used to store hashed malware domains, rather than an array of strings. Now I know there is an obvious speed benefit to using a hash table. With everything in place, a little more experimentation shines more light on the speed up.
Sorry, but I cheated a little and just copied my very large domain list onto itself several times, ending up with 3,318,601 domains. These took 5 seconds to search with the hash table and 60 seconds to search by string comparison. That's a speed up of 12 times.
Like a ray of sunshine, I'm basking in the glory of my results, but I'm still not satisfied. Out of over 3 million domains, there are only 110 malware domains included, and these are taken from random entries in the malware domain list. The hash table lookup is not affected by the position of the malware domain within the list, but the string search is. So my reasoning is this: if I add more (um, lots more actually) malware domains to the very large domain list, and these malware domains come from domains at or near the end of the malware domain list, the string comparison search should take longer because there are more malware domains in the search that require more searching in the malware domain array of strings.
In fact, I'm adding 10,000 copies of the last malware domain in the list. Is this unreasonable? Consider an infected computer on a network that beacons back to the same malware domain once every 10 seconds and goes undetected for a little over a day: that results in 10,000 domain queries all by itself.
What's the result?
This is an increase of 3 seconds, about 5% longer than before, which is significant when you consider that adding 10,000 malware domains to the list is only a 0.3% increase in the 3+ million already there.
Conclusion and a Confession
So, at the very least, this blog is an answer to the question posed by thousands of Computer Science students from the beginning of time (Turing Time, Planck Time, Amok Time; I had to put an original Star Trek series reference in here somewhere): "Where will I ever use a hash search?"
But I must confess, I have other motives for spending my time on this activity. I previously mentioned my Threat Hunting research. What I've been looking for is a nice, juicy, large DNS log file so I can test out regular expressions and other search tricks on Splunk Lite and other tools I am evaluating, including tools I am developing myself. So, not having quick access to an organization's DNS server, I set up my own Windows 2012 DNS server, but who has all day and night to click on every link imaginable?
Once again, I've got some coding to do.
So, stay tuned, as my next blog will explore my adventures in creating humongous, fake Windows DNS logs.