Hashing: Unlocking the Secrets of Fast Data Access in Data Structures

Hashing in Data Structures

Hash Table:

Hash table
is a data structure in which keys are mapped to array positions by a hash function. Access of data becomes very fast if we know the index of the desired data. A value stored in a hash table can be searched in O(1) time using a hash function which generates an address from the key.

Thus, it becomes a data structure in which insertion and search operations are very fast irrespective of the size of the data.

 

Hash Table : Example

The following fig, shows a direct correspondence between the keys and the indices of the array.

• This concept is useful when the total universe of keys is small and when most of the keys are actually used from the whole set of keys.

• For example: storing information of 100 students with 100 keys.

However, when the set K of keys that are actually used is smaller than the universe of keys (U),a hash table consumes less storage space.

• The storage requirement for a hash table is O(k), where k is the number of keys actually used.


Hashing:

In a hash table, an element with key k is stored at index h(k) and not k. It means a hash function h is used to calculate the index at which the element with key k will be stored. This process of mapping the keys to appropriate locations (or indices) in a hash table is called hashing.

The fig. shows a hash table in which each key from the set K is mapped to locations generated by using a hash function. Note that keys k2 and k6 point to the same memory location. This is known as collision. That is, when two or more keys map to the same memory location, a collision is said to occur. Similarly, keys k5 and k7 also collide


Hash Function:

A hash function is a mathematical formula which, when applied to a key, produces an integer which can be used as an index for the key in the hash table. The main aim of a hash function is that elements should be relatively, randomly, and uniformly distributed.

It produces a unique set of integers within some suitable range in order to reduce the number of collisions. In practice, there is no hash function that eliminates collisions completely.

A good hash function can only minimize the number of collisions by spreading the elements uniformly throughout the array.

 Properties of a Good Hash Function:

a. Low cost

The cost of executing a hash function must be small, so that using the hashing technique becomes preferable over other approaches. For example, if binary search algorithm can search an element from a sorted table of n items with log2 n key comparisons, then the hash function must cost less than performing log2 n key comparisons.

b. Determinism

A hash procedure must be deterministic. This means that the same hash value must be generated for a given input value. However, this criteria excludes hash functions that depend on external variable parameters (such as the time of day) and on the memory address of the object being hashed (because address of the object may change during processing).

c. Uniformity

A good hash function must map the keys as evenly as possible over its output range. This means that the probability of generating every hash value in the output range should roughly be the same. The property of uniformity also minimizes the number of collisions.

 

Types of Hash Functions: Division Method:

It is the most simple method of hashing an integer x. This method divides x by M and then uses the remainder obtained. Here, the hash function can be given as h(x) = x mod M .

Since it requires only a single division operation, the method works very fast. However, extra care should be taken to select a suitable value for M. For example, suppose M is an even number then h(x) is even if x is even and h(x) is odd if x is odd. If all possible keys are equiprobable, then this is not a problem.

But if even keys are more likely than odd keys, then the division method will not spread the hashed values uniformly. Generally, it is best to choose M to be a prime number because making M a prime number increases the likelihood that the keys are mapped with a uniformity in the output range of values.

M should also be not too close to the exact powers of 2. If we have h(x) = x mod 2k then the function will simply extract the lowest k bits of the binary representation of x.

Types of Hash Functions: Division Method : Example

• Example:

int const M = 97; // a prime number

int h(int x)

{

return (x % M);

}

• A potential drawback of the division method is that while using this method, consecutive keys map to consecutive hash values. On one hand, this is good as it ensures that consecutive keys do not collide, but on the other, it also means that consecutive array locations will be occupied. This may lead to degradation in performance.

Question: Calculate the hash values of keys 1234 and 5462.

Solution Setting M = 97, hash values can be calculated as:

• h(1234) = 1234 % 97 = 70

• h(5642) = 5642 % 97 = 16

 

Types of Hash Functions: Multiplicative Method:

The steps involved in the multiplication method are as follows: 

Step 1: Choose a constant A such that 0 < A <1.

Step 2: Multiply the key k by A.

Step 3: Extract the fractional part of kA.

Step 4: Multiply the result of Step 3 by the size of hash table (m).

Hence, the hash function can be given as: h(k) = I m (kA mod 1) | where (kA mod 1) gives the fractional part of kA and m is the total number of indices in the hash table.

The greatest advantage of this method is that it works practically with any value of A. Although the algorithm works better with some values, the optimal choice depends on the characteristics of the data being hashed. Knuth has suggested that the best choice of A is (sqrt5 – 1) /2 = 0.6180339887

Types of Hash Functions: Multiplicative Method: Example

• Question: Given a hash table of size 1000, map the key 12345 to an appropriate location in the hash table.

- Solution We will use A = 0.618033, m = 1000, and k = 12345.

Types of Hash Functions: Mid-Square Method:

The mid-square method is a good hash function which works in two steps: 

Step 1: Square the value of the key. That is, find k2.

Step 2: Extract the middle r digits of the result obtained in Step 1.

The algorithm works well because most or all digits of the key value contribute to the result. This is because all the digits in the original key value contribute to produce the middle digits of the squared value.

Therefore, the result is not dominated by the distribution of the bottom digit or the top digit of the original key value. In the mid-square method, the same r digits must be chosen from all the keys. Therefore, the hash function can be given as: h(k) = s where s is obtained by selecting r digits from k2.

Types of Hash Functions: Folding Method:

The folding method works in the following two steps:

Step 1: Divide the key value into a number of parts. That is, divide k into parts k1, k2,.... ,kn , where each part has the same number of digits except the last part which may have lesser digits than the other parts

Step 2: Add the individual parts. That is, obtain the sum of k1 + k2 + ...+ kn. The hash value is produced by ignoring the last carry, if any. Note that the number of digits in each part of the key will vary depending upon the size of the hash table.

For example, if the hash table has a size of 1000, then there are 1000 locations in the hash table. To address these 1000 locations, we need at least three digits; therefore, each part of the key must have three digits except the last part which may have lesser digits.

Types of Hash Functions: Folding Method: Example

• Question: Given a hash table of 100 locations, calculate the hash value using folding method for keys 5678, 321, and 34567.

- Solution : Since there are 100 memory locations to address, we will break the key into parts where each part (except the last) will contain two digits. The hash values can be obtained as shown below:


COLLISIONS:

Collisions occur when the hash function maps two different keys to the same location. Two records cannot be stored in the same location. Therefore, a method used to solve the problem of collision, also called collision resolution technique, is applied. The two most popular methods of resolving collisions are:

1. Closed Hashing (Open addressing)

2. Open Hashing (Separate Chaining)

• The difference between the two hashing methods is that

a. Collisions result in storing one of the records at another slot in the table (closed hashing)

b. Collisions are stored outside the table in open hashing and collision resolution by open addressing.

 

1.Collision Resolution by Open Addressing:

Once a collision takes place, open addressing or closed hashing computes new positions using a probe sequence and the next record is stored in that position. The hash table contains two types of values: sentinel values (e.g., –1) and data values.

The presence of a sentinel value indicates that the location contains no data value at present but can be used to hold a value. When a key is mapped to a particular memory location, then the value it holds is checked. If it contains a sentinel value, then the location is free and the data value can be stored in it.

• However, if the location already has some data value stored in it, then other slots are examined systematically in the forward direction to find a free slot. If even a single free location is not found, then we have an OVERFLOW condition. The process of examining memory locations in the hash table is called probing.

• Open addressing technique can be implemented using

• Linear probing,

• Quadratic probing,

• Double hashing, and

• Rehashing.

 

2. Collision Resolution by Open Addressing : Linear Probing:

 In this technique, if a value is already stored at a location generated by h(k), then the following hash function is used to resolve the collision: h(k, i) = [h’(k) + i] mod m. Where m is the size of the hash table, h’(k) = (k mod m), and i is the probe number that varies from 0 to m–1.

Therefore, for a given key k, first the location generated by [h’(k) mod m] is probed because for the first time i=0. If the location is free, the value is stored in it, else the second probe generates the address of the location given by [h’(k) + 1]mod m.

Similarly, if the location is occupied, then subsequent probes generate the address as [h’(k) + 2]mod m, [h’(k) + 3]mod m, [h’(k) + 4]mod m, [h’(k)+ 5]mod m, and so on, until a free location is found.

 

Collision Resolution by Open Addressing : Linear Probing Example: 1

Question : Consider a hash table of size 10. Using linear probing, insert the keys 72, 27, 36, 24, 63, 81, 92, and 101 into the table.

Solution : Collision Resolution by Open Addressing : Linear Probing


Example: 1[2] Collision Resolution by Open Addressing : Linear Probing


Example: 1[3] Collision Resolution by Open Addressing : Linear Probing


Example: 1[4] Collision Resolution by Open Addressing : Linear Probing

Example: 1[5] Collision Resolution by Open Addressing : Linear Probing

Example: 1[6] Collision Resolution by Open Addressing : Linear Probing

Searching a Value using Linear Probing:

The procedure for searching a value in a hash table is same as for storing a value in a hash table. While searching for a value in a hash table, the array index is re-computed and the key of the element stored at that location is compared with the value that has to be searched.

• If a match is found, then the search operation is successful. The search time in this case is given as O(1).

• If the key does not match, then the search function begins a sequential search of the array that continues until:

- the value is found, or the search function encounters a vacant location in the array, indicating that the value is not present, or the search function terminates because it reaches the end of the table and the value is not present.

• In the worst case, the search operation may have to make n–1 comparisons, and the running time of the search algorithm may take O(n) time.

- The worst case will be encountered when after scanning all the n–1 elements, the value is either present at the last location or not present in the table.

• Thus, with the increase in the number of collisions, the distance between the array index computed by the hash function and the actual location of the element increases, thereby increasing the search time.

Linear Probing : Pros and Cons:

• Pros:

Linear probing finds an empty location by doing a linear search in the array beginning from position h(k). The algorithm provides good memory caching through good locality of reference

• Cons:

• It results in clustering, and thus there is a higher risk of more collisions where one collision has already taken place.

- The performance of linear probing is sensitive to the distribution of input values.

- As the hash table fills, clusters of consecutive cells are formed and the time required for a search increases with the size of the cluster.

• In addition to this, when a new value has to be inserted into the table at a position which is already occupied, that value is inserted at the end of the cluster, which again increases the length of the cluster.

• Generally, an insertion is made between two clusters that are separated by one vacant location.

• More the number of collisions, higher the probes that are required to find a free location and lesser is the performance. This phenomenon is called primary clustering.

• To avoid primary clustering, other techniques such as quadratic probing and double hashing are used.

Collision Resolution by Open Addressing : Quadratic Probing

• In this technique, if a value is already stored at a location generated by h(k), then the following hash function is used to resolve the collision: h(k, i) = [h’(k) + c1i + c2i2] mod m

where m is the size of the hash table, h’(k) = (k mod m), i is the probe number that varies from 0 to m–1, and c1 and c2 are constants such that c1 and c2 != 0.

 Quadratic probing eliminates the primary clustering phenomenon of linear probing because instead of doing a linear search, it does a quadratic search.

• For a given key k, first the location generated by h’(k) mod m is probed.

- If the location is free, the value is stored in it,

- else subsequent locations probed are offset by factors that depend in a quadratic manner on the probe number i.

• Although quadratic probing performs better than linear probing, in order to maximize the utilization of the hash table, the values of c1, c2, and m need to be constrained.

Collision Resolution by Open Addressing : Quadratic Probing : Example 1[1]

• Question : Consider a hash table of size 10. Using quadratic probing, insert the keys 72,27, 36, 24, 63, 81, and 101 into the table. Take c1 = 1 and c2 = 3.

Solution :

Collision Resolution by Open Addressing : Quadratic Probing : Example 1[2]

Collision Resolution by Open Addressing : Quadratic Probing : Example 1[3]

Collision Resolution by Open Addressing : Quadratic Probing : Example 1[4]

Collision Resolution by Open Addressing : Quadratic Probing : Example 1[5]


Collision Resolution by Open Addressing : Quadratic Probing : Searching a Value:

While searching a value using the quadratic probing technique, the array index is re-computed and the key of the element stored at that location is compared with the value that has to be searched.

If the desired key value matches with the key value at that location, then the element is present in the hash table and the search is said to be successful. In this case, the search time is given as O(1).

 However, if the value does not match, then the search function begins a sequential search of the array that continues until:

• the value is found, or

• the search function encounters a vacant location in the array, indicating that the value is not present, or

• the search function terminates because it reaches the end of the table and the value is not present.

In the worst case, the search operation may take n–1 comparisons, and the running time of the search algorithm may be O(n). The worst case will be encountered when after scanning all the n–1 elements, the value is either present at the last location or not present in the table.

Thus, with the increase in the number of collisions, the distance between the array index computed by the hash function and the actual location of the element increases, thereby increasing the search time.

 Collision Resolution by Open Addressing : Quadratic Probing : Pros and Cons:

Quadratic probing resolves the primary clustering problem that exists in the linear probing technique. Quadratic probing provides good memory caching because it preserves some locality of reference. But linear probing does this task better and gives a better cache performance.

One of the major drawbacks of quadratic probing is that a sequence of successive probes may only explore a fraction of the table, and this fraction may be quite small.

If this happens, then we will not be able to find an empty location in the table despite the fact that the table is by no means full.[ In previous example: try to insert the key 92 and you will encounter this problem].

Although quadratic probing is free from primary clustering, it is still liable to what is known as secondary clustering. It means that if there is a collision between two keys, then the same probe sequence will be followed for both.

With quadratic probing, the probability for multiple collisions increases as the table becomes full. This situation is usually encountered when the hash table is more than full. Quadratic probing is widely applied in the Berkeley Fast File System to allocate free blocks

 

Double Hashing:

Double hashing uses one hash value and then repeatedly steps forward an interval until an empty location is reached. The interval is decided using a second, independent hash function, hence the name double hashing. 

The hash function in the case of double hashing can be given as: h(k, i) = [h1(k) + ih2(k)] mod m where m is the size of the hash table, h1(k) and h2(k) are two hash functions given as h1(k) = k mod m, h2(k) = k mod m', i is the probe number that varies from 0 to m–1, and m' is chosen to be less than m. We can choose m' = m–1 or m–2.

 

Double Hashing : Pros and Cons

• Double hashing minimizes repeated collisions and the effects of clustering.

• That is, double hashing is free from problems associated with primary clustering as well as secondary clustering

 Double Hashing : Example 1[1]

Question : Consider a hash table of size = 10. Using double hashing, insert the keys 72, 27, 36, 24, 63, 81, 92, and 101 into the table. Take h1 = (k mod 10) and h2 = (k mod 8).

Solution:



Double Hashing : Example 1[2]


Double Hashing : Example 1[3]


Double Hashing : Example 1[4]


Double Hashing : Example 1[5]


Double Hashing : Example 1[6]

 Now T[6] is occupied, so we cannot store the key 101 in T[6]. Therefore, try again for the next location with probe i = 2. Repeat the entire process until a vacant location is found.
We have to probe many times to insert the key 101 in the hash table. Although double hashing is a very efficient algorithm, it always requires m to be a prime number.

 In our case m=10, which is not a prime number, hence, the degradation in performance. If it m been equal to 11, the algorithm would have worked very efficiently. Thus, we can say that the performance of the technique is sensitive to the value of m.

 

Rehashing:

When the hash table becomes nearly full, the number of collisions increases, thereby degrading the performance of insertion and search operations. In such cases, a better option is to create a new hash table with size double of the original hash table.

All the entries in the original hash table will then have to be moved to the new hash table. This is done by taking each entry, computing its new hash value, and then inserting it in the new hash table. Though rehashing seems to be a simple process, it is quite expensive and must therefore not be done frequently.

 Rehashing : Examples-

a. Republishing an old news article with a new headline but no new information.

b. Remaking a movie with little to no changes to the original story.

c. Repackaging a product with a new label or design but no changes to the actual product.

d. Repurposing a blog post by changing the title and images, but keeping the same content.

e. Reusing old research or data without adding any new analysis or interpretation.

Collision Resolution by Separate Chaining or Open Hashing:

 Open Hashing or Separate Chaining

In chaining, each location in a hash table stores a pointer to a linked list that contains all the key values that were hashed to that location.

• That is, location l in the hash table points to the head of the linked list of all the key values that hashed to l.

• However, if no key value hashes to l, then location l in the hash table contains NULL.

Open Hashing or Separate Chaining : Pros and Cons

• The main advantage of using a chained hash table is that it remains effective even when the number of key values to be stored is much higher than the number of locations in the hash table.

• However, with the increase in the number of keys to be stored, the performance of a chained hash table does degrade gradually (linearly).

• For example, a chained hash table with 1000 memory locations and 10,000 stored keys will give 5 to 10 times less performance as compared to a chained hash table with 10,000 locations.

• But a chained hash table is still 1000 times faster than a simple hash table.

• The other advantage of using chaining for collision resolution is that its performance, unlike quadratic probing, does not degrade when the table is more than half full.

• This technique is absolutely free from clustering problems and thus provides an efficient mechanism to handle collisions.

• However, chained hash tables inherit the disadvantages of linked lists.

• First, to store a key value, the space overhead of the next pointer in each entry can be significant.

• Second, traversing a linked list has poor cache performance, making the processor cache ineffective.

 APPLICATIONS OF HASHING:

Hash tables are widely used in situations where enormous amounts of data have to be accessed to quickly search and retrieve information.

• A few typical examples where hashing is used are given here.

• Hashing is used for database indexing. Some database management systems store a separate file known as the index file. When data has to be retrieved from a file, the key information is first searched in the appropriate index file which references the exact record location of the data in the database file. This key information in the index file is often stored as a hashed value.

- In many database systems, file and directory hashing is used in high-performance file systems. Such systems use two complementary techniques to improve the performance of file access.

- While one of these techniques is caching which saves information in the memory, the other is hashing which makes looking up the file location in the memory much quicker than most other methods.

• Hashing technique is used to implement compiler symbol tables in C++.

- The compiler uses a symbol table to keep a record of all the user-defined symbols in a C++ program.

- Hashing facilitates the compiler to quickly look up variable names and other attributes associated with symbols.

 • Hashing is also widely used for Internet search engines.


FAQ:

1. Define Hashing in Data Structures.
2. Write applications of Hashing.
3. Explain Rehashing.
4. What is Hash function?
5. What is Hash Table?
6. Write Hash Function Types With Examples.
7. Define Double Hashing.


Thank You!!!
Deep99Notes

Post a Comment

Previous Post Next Post