A Bloom filter is a probabilistic data structure used to test whether an item belongs to a set. It can determine whether a piece of data, or a group of them, may be stored in a database or distributed data set, and its main characteristic is its extreme efficiency at this task. For that reason, Bloom filters are widely used in systems that need to verify the existence of a specific item within a huge data set.
Bloom filters are among the most useful tools for probabilistic, one-way analysis of information. They let us examine large amounts of data to find out whether an element is part of a set. This is extremely useful when we must handle large volumes of data, especially when that information cannot be checked quickly by hand.
Thanks to Bloom filters, cryptocurrencies such as Bitcoin can offer SPV wallets. The same technique also appears in cryptocurrencies such as Ethereum, where it makes it possible to search for information in the blockchain efficiently.
This is possible because a Bloom filter gives only two kinds of answer: a positive, which may be a false positive, or a negative. In other words, a Bloom filter tells us quickly and efficiently whether a given element may exist in the set or definitely does not. A positive result means that the element may be part of the set, while a negative result conclusively establishes that the element is not in the evaluated set. The filter rules out false negatives entirely, which greatly simplifies data analysis.
But what led to the creation of Bloom filters, and how do they relate to the world of blockchain? We will see this below.
Origin of Bloom filters
Bloom filters were devised in 1970 by the developer Burton Howard Bloom. Bloom, who graduated in Computer Science from MIT, designed them as a space-efficient probabilistic data structure for checking whether an element is part of a set. The goal was a data classification tool that applies hash functions to each element and uses the results as an identifier. The structure can answer with certainty that an element is not part of the set, or report that it is probably in it.
The design allows large databases or bodies of information to be queried at high speed while using storage space efficiently. This is because a Bloom filter does not need to store the elements themselves; it only records enough to check whether or not they are in the set. Checking is a read-only operation, which enables high performance and great processing capacity.
How are Bloom filters configured?
A Bloom filter is built on a data structure known as a bit array. The array can be as long as necessary: when building a Bloom filter you choose its length according to your requirements, based on how many entries will be added to the underlying structure and how many hash functions will be used to map each entry to positions in it.
The design must also ensure that the output range of the hash functions starts at 0 and ends at the number of positions in the array minus 1. In other words, if a Bloom filter has 10 positions, its indices run from 0 to 9; if it has 20 positions, they run from 0 to 19. This keeps every hash result mapped to a valid position and makes full use of the array.
When every position in the array holds the value 0, no data has been added and the filter is empty. As soon as you start adding elements, each one is passed through the hash functions, which determine the positions it maps to within the filter. Those positions are set to 1, indicating that an element has been recorded there.
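The configuration and insertion steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a production design: the array length of 10, the use of two hash functions, and the helper names `positions` and `add` are assumptions made for the example, and the two hash functions are simulated by salting SHA-256 with a seed.

```python
import hashlib

SIZE = 10        # number of positions in the bit array (indices 0..9)
NUM_HASHES = 2   # number of hash functions (assumed for this example)


def positions(item, size=SIZE, num_hashes=NUM_HASHES):
    """Return one index in the range 0..size-1 per hash function."""
    result = []
    for seed in range(num_hashes):
        # Salting the input with a seed simulates independent hash functions.
        digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
        result.append(int.from_bytes(digest[:8], "big") % size)
    return result


bits = [0] * SIZE  # an empty filter: every position holds 0


def add(item):
    """Record an item by setting each of its positions to 1."""
    for pos in positions(item):
        bits[pos] = 1


add("alice")
print(bits)
```

After the call to `add`, at most two positions hold 1 and all the others are still 0; the element itself is not stored anywhere.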
The operation of Bloom filters is built on these values, as we explain in detail below.
How Bloom filters work
Once the Bloom filter has been configured, we can start to verify whether an element is part of the set. The process begins by passing the item to the Bloom filter algorithm: we take the data and process it with the filter's hash functions. In the example used here, which assumes two hash functions, this yields two positions.
The positions returned by the hashes are recorded in the filter by setting the corresponding bits to 1, so they remain associated with the data that produced them even though the data itself is not kept. The filter continues in this way, applying the hash functions to each new item and recording the results. To query the filter, the same hash functions are applied to the item and the resulting positions are inspected, which keeps the response time of systems that use this type of filter low.
First, if the item is passed through the hash functions and every returned position holds a value other than 0, then the element is probably in the set. This is known as a positive, and it indicates that the element may exist in the set; it can be a false positive, because other elements may have set those same positions. It may also happen that the hashes return positions holding different values.
Conversely, if one or both of the positions hold the value 0, then the item is definitely not in the set. This situation, also foreseen by the algorithm, is called a negative. The result is conclusive, since Bloom filters never produce false negatives. In other words, if a Bloom filter reports a negative, the item is definitely not in the analyzed data set.
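The lookup described above can be sketched as follows. The class name `BloomFilter`, its default size of 64 bits and its two seeded SHA-256 hashes are assumptions chosen for illustration, not a reference implementation.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch; names and parameters are illustrative."""

    def __init__(self, size=64, num_hashes=2):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size

    def _positions(self, item):
        # Simulate independent hash functions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # Any 0 among the positions means "definitely not in the set".
        # All 1s means "probably in the set" (it could be a false positive).
        return all(self.bits[pos] == 1 for pos in self._positions(item))


bloom = BloomFilter()
print(bloom.might_contain("alice"))  # empty filter: always a negative
bloom.add("alice")
print(bloom.might_contain("alice"))  # an added item always reports a positive
```

The method is deliberately named `might_contain` rather than `contains`: a `True` answer only means the item is probably present, while `False` is conclusive.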
When configuring a Bloom filter, it is very important to define the number of bits and the number of hash functions. For a given number of bits and elements, adding hash functions reduces the error rate, and with it the probability of false positives, up to an optimal number; beyond that point the array fills up too quickly and the error rate rises again. Likewise, once data has been entered into a standard Bloom filter it cannot be deleted: clearing a bit could also erase the trace of other elements that share it, which would cause false negatives.
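The effect of these parameters can be estimated with the standard approximation for the false positive probability, p ≈ (1 − e^(−k·n/m))^k, where m is the number of bits, n the number of inserted items and k the number of hash functions. The sketch below evaluates it for a hypothetical filter of 1000 bits holding 100 items; the function name and the sample values are assumptions for illustration.

```python
import math


def false_positive_rate(num_bits, num_items, num_hashes):
    """Approximate false positive probability: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


# Hypothetical filter: 1000 bits holding 100 items, with varying hash counts.
for k in (1, 3, 7, 20):
    print(k, round(false_positive_rate(1000, 100, k), 4))
```

With these numbers the estimate falls as the number of hash functions goes from 1 to 7 and is higher again at 20, which is why the number of hash functions is tuned together with the number of bits rather than simply maximized.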
How important are false positives and negatives within Bloom filters?
The importance of the positive (possibly false positive) and negative results of a Bloom filter lies in efficiency. As already mentioned, applications that use Bloom filters are written with both outcomes in mind, so that when either occurs they can take the appropriate action.
For example, a Bloom filter is of great help when working with a data storage system that maintains a cache. Every time the system receives an item, we must check whether it is already among the data stored in the cache. If we pass the item to the Bloom filter and it returns a negative, we can be sure that the item is not in the set of information we hold. At that point, we can store the new item in the cache so that we can later access it quickly and efficiently.
If, on the contrary, the Bloom filter returns a positive, the item is probably already cached, so we can skip storing it and work with what we have in the cache, which gives faster access to the information and saves valuable computational resources.
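The cache scenario above might look like the following sketch. The class names `BloomFilter` and `FilteredCache`, the 1024-bit size and the three seeded hashes are assumptions made for the example, and the dictionary stands in for what would normally be a slower store on disk or over the network. Unlike the simplified description, this version confirms a positive against the cache itself, so that a false positive never causes an item to be lost.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch; names and parameters are illustrative."""

    def __init__(self, size, num_hashes):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] == 1 for pos in self._positions(item))


class FilteredCache:
    """Cache guarded by a Bloom filter (hypothetical example)."""

    def __init__(self):
        self.seen = BloomFilter(size=1024, num_hashes=3)
        self.cache = {}  # stand-in for a slower backing store

    def receive(self, key, value):
        if not self.seen.might_contain(key):
            # Definite negative: the key is new, no cache lookup is needed.
            self.seen.add(key)
            self.cache[key] = value
            return "stored"
        # Positive: the key is probably cached, so confirm before skipping.
        if key in self.cache:
            return "already cached"
        # False positive: the filter matched, but the cache has no such key.
        self.seen.add(key)
        self.cache[key] = value
        return "stored"


cache = FilteredCache()
print(cache.receive("page-1", "<html>...</html>"))
print(cache.receive("page-1", "<html>...</html>"))
```

The saving comes from the negative branch: every item the filter rejects is handled without querying the expensive store at all.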
This type of operation is no stranger to the software we use every day. For example, web browsers keep a cache on our hard drives to give us quick access to certain resources instead of fetching them online. Server databases and other systems that handle vast amounts of data also use Bloom filters or similar algorithms to improve the efficiency of their responses and data processing.
Hash functions inside Bloom filters
When configuring a Bloom filter, independent and uniformly distributed hash functions must be used. These hash functions assign an identifier to any type of data, which can then be used to index or compare that data within a set.
When we talk about hash functions, we mean well-known ones such as SHA-256 and MD5, or others such as CRC32. With Bloom filters, however, care is needed: using many hash functions can make the results more reliable, but it also makes the filter more complex and slower, so the functions should be chosen so that their capabilities are fully exploited.
In addition, because hash functions are one-way, an identifier can be derived from an element, but the reverse process cannot be carried out. So if a user discovers an identifier, they will not be able to tell which data or elements it corresponds to.
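As an illustration of how the functions named above could be mapped to filter positions, the sketch below uses SHA-256, MD5 and CRC32 from the Python standard library as three separate hash functions. The helper names and the filter size of 1000 are assumptions for the example; real implementations often prefer faster non-cryptographic hashes.

```python
import hashlib
import zlib


def sha256_index(data: bytes, size: int) -> int:
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % size


def md5_index(data: bytes, size: int) -> int:
    return int.from_bytes(hashlib.md5(data).digest(), "big") % size


def crc32_index(data: bytes, size: int) -> int:
    # zlib.crc32 returns an unsigned 32-bit integer in Python 3.
    return zlib.crc32(data) % size


HASHES = (sha256_index, md5_index, crc32_index)


def positions(item: str, size: int = 1000):
    """Map an item to one position per hash function, each in 0..size-1."""
    data = item.encode("utf-8")
    return [h(data, size) for h in HASHES]


print(positions("hello"))
```

The same input always yields the same three positions, while the positions alone do not reveal which input produced them.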
Advantages and disadvantages of using Bloom filters
Advantages
- Because Bloom filters do not store the data set itself, they use storage space more efficiently. They only record whether an element is present in the filter or not.
- Likewise, this allows data or elements to be verified much more quickly and efficiently. It must be taken into account, however, that the more hash functions are used, the longer the Bloom filter takes to verify the existence of an element.
- Because Bloom filters rely on one-way hashing, a user who gains access to a filter cannot directly read any of the information recorded in it.
Disadvantages
- These tools do not return the verified data. They only allow you to check whether it possibly exists or not.
- A positive result can only be assumed to be probably correct. There is no complete certainty that an item reported as positive is part of the set, in contrast to a negative result, which gives a final, decisive answer.
- A Bloom filter must be assigned a size when it is designed, whether a few bits or millions of bits. Once set, the size neither shrinks nor grows. For the filter to be efficient, therefore, you need to know in advance roughly how much data will be added. If this is not known, the filter may be designed with too few bits and be ineffective at handling the intended information, or it may be designed far too large and take up a great deal of storage for the small amount of information to be handled, which wastes space.
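The sizing problem in the last point is usually approached with two standard formulas: for n expected items and a target false positive rate p, the number of bits is m = −n·ln(p) / (ln 2)² and the number of hash functions is k = (m/n)·ln 2. The sketch below applies them; the function name and the sample values (1000 items, a 1% target) are assumptions for illustration.

```python
import math


def optimal_parameters(expected_items, target_fp_rate):
    """Return (num_bits, num_hashes) for the expected load and error target."""
    num_bits = math.ceil(
        -expected_items * math.log(target_fp_rate) / (math.log(2) ** 2)
    )
    num_hashes = max(1, round(num_bits / expected_items * math.log(2)))
    return num_bits, num_hashes


num_bits, num_hashes = optimal_parameters(1000, 0.01)
print(num_bits, num_hashes)
```

For this example the result is a little under 10 bits per expected item. If far more items are added than planned, the real false positive rate rises well above the target, which is the inefficiency described above.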
Bloom filters use cases
Cryptocurrencies: Bitcoin and Ethereum
The Bitcoin system uses Bloom filters to speed up the synchronization of SPV wallets, which allow them to specify only the transactions for which they want to receive updates. The wallet builds a filter describing its set of transactions and transmits it to the full nodes of the network, which match transactions against it. The wallet then receives confirmation of whether or not these transactions have been added to the chain, without having to handle a full copy of the blockchain. In Bitcoin, this functionality is being replaced by the compact block filters defined in BIP-158.
For its part, the Ethereum network uses Bloom filters as a mechanism for finding logs within its blockchain. With these filters, events that occurred within the Ethereum system can be searched for easily without overloading it with excessive information handling. Applications can therefore manage this information much more efficiently and without requiring a large amount of storage space, since Bloom filters remove the need to store data that could be duplicated within the system.
In Ethereum, when a block is generated and verified, the contract address and the indexed fields of its logs are added to a Bloom filter, which is located in the block header. So if an application wants to find all the matching log entries, the node only needs to scan the headers to recognize whether the required data may be in a block or not. These elements are not added to the block as such, in order to save storage space.
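The header scan described above can be sketched as follows. The example is loosely modeled on Ethereum's 2048-bit header filter, which sets three bits per item, but it is not compatible with real Ethereum data: SHA-256 from the standard library stands in for Keccak-256, and the function names, block numbers and log contents are hypothetical.

```python
import hashlib

BLOOM_BITS = 2048  # size of the per-header filter


def bloom_bits(value: bytes) -> int:
    """Return a bitmask with three bits set for the given value."""
    # Assumption: SHA-256 is a stand-in; Ethereum itself uses Keccak-256.
    digest = hashlib.sha256(value).digest()
    mask = 0
    for i in (0, 2, 4):  # first three pairs of bytes of the digest
        index = int.from_bytes(digest[i:i + 2], "big") % BLOOM_BITS
        mask |= 1 << index
    return mask


def header_bloom(log_items):
    """Combine the masks of all addresses and indexed fields in a block."""
    bloom = 0
    for item in log_items:
        bloom |= bloom_bits(item)
    return bloom


def may_contain(bloom: int, item: bytes) -> bool:
    mask = bloom_bits(item)
    return (bloom & mask) == mask


# Hypothetical chain: block number -> addresses and indexed fields it logged.
blocks = {
    100: [b"contract-A", b"Transfer"],
    101: [],
    102: [b"contract-B"],
}
headers = {number: header_bloom(items) for number, items in blocks.items()}

# Only blocks whose header filter matches need to be opened and searched.
candidates = [n for n, bloom in headers.items() if may_contain(bloom, b"contract-A")]
print(candidates)
```

Blocks whose header filter does not match can be skipped outright; a matching block still has to be inspected, because the match may be a false positive.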
Networks and information channels
Another important application of Bloom filters lets networks or information channels recommend articles to users without repeating them. That is, the filter makes it possible to find out which articles a user has already read and recommend the ones they have not seen yet.
Likewise, large data centers and content delivery networks (CDNs) use Bloom filters to maximize the efficiency of data storage and network use, preventing repetitive or rarely used elements from entering their systems and overloading them. This includes companies such as Akamai, Namecheap CDN, Fastly and Cloudflare.