fastbloom高性能布隆过滤器
fastbloom 是使用 Rust 实现的 bloom filter(布隆过滤器) | counting bloom filter(计数布隆过滤器) Python 库及 Rust 库。比目前广泛使用的 pybloom-live 插入性能大约快10倍以上。
如果对您有帮助,麻烦给项目点个赞吧^v^
setup
Python
requirements
Python >= 3.7
install
使用如下命令安装 fastbloom 最新版本:
pip install fastbloom-rs
Rust
fastbloom-rs = "{latest}"
Examples
BloomFilter
布隆过滤器(Bloom Filter)是1970年由布隆提出的。它实际上是一个很长的二进制向量和一系列随机映射函数。布隆过滤器 可以用于检索一个元素是否在一个集合中。它的优点是空间效率和查询时间都比一般的算法要好的多,缺点是有一定的误识别率和删除困难。
参考: Bloom, B. H. (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7), 422-426. 全文
Python
基础用法
from fastbloom_rs import BloomFilter bloom = BloomFilter(100_000_000, 0.01) bloom.add_str('hello') bloom.add_bytes(b'world') bloom.add_int(9527) assert bloom.contains('hello') assert bloom.contains(b'world') assert bloom.contains(9527) assert not bloom.contains('hello world')
基于 bytes 或者 list 构造布隆过滤器
from fastbloom_rs import BloomFilter bloom = BloomFilter(100_000_000, 0.01) bloom.add_str('hello') assert bloom.contains('hello') bloom2 = BloomFilter.from_bytes(bloom.get_bytes(), bloom.hashes()) assert bloom2.contains('hello') bloom3 = BloomFilter.from_int_array(bloom.get_int_array(), bloom.hashes()) assert bloom3.contains('hello')
更多例子参考 py_tests.
Rust
use fastbloom_rs::{BloomFilter, FilterBuilder}; let mut bloom = FilterBuilder::new(100_000_000, 0.01).build_bloom_filter(); bloom.add(b"helloworld"); assert_eq!(bloom.contains(b"helloworld"), true); assert_eq!(bloom.contains(b"helloworld!"), false);
更多例子参考 docs.rs
CountingBloomFilter
计数布隆过滤器的工作方式与常规布隆过滤器类似;但是,它能够跟踪插入和删除。在计数布隆过滤器中,布隆过滤器的每个 条目都是一个与基本布隆过滤器位相关联的小计数器。
参考: F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G. Varghese, “An Improved Construction for Counting Bloom Filters,” in 14th Annual European Symposium on Algorithms, LNCS 4168, 2006
Python
from fastbloom_rs import CountingBloomFilter cbf = CountingBloomFilter(1000_000, 0.01) cbf.add('hello') cbf.add('hello') assert 'hello' in cbf cbf.remove('hello') assert 'hello' in cbf # because 'hello' added twice. # If add same element larger than 15 times, then remove 15 times the filter will not contain the element. cbf.remove('hello') assert 'hello' not in cbf
本计数布隆过滤器使用4bit计数器存储hash索引,所以当重复插入同一个元素到过滤器中,计数器很快就会位溢出, 所以可以设置 enable_repeat_insert
为 False
用于避免重复插入,如果元素已经加入过滤器中,设置 enable_repeat_insert
为 False
将使元素不会重复插入。 enable_repeat_insert
默认为 True
。
from fastbloom_rs import CountingBloomFilter cbf = CountingBloomFilter(1000_000, 0.01, False) cbf.add('hello') cbf.add('hello') # because enable_repeat_insert=False, this addition will not take effect. assert 'hello' in cbf cbf.remove('hello') assert 'hello' not in cbf
更多例子参考 py_tests.
Rust
use fastbloom_rs::{CountingBloomFilter, FilterBuilder}; let mut builder = FilterBuilder::new(100_000, 0.01); let mut cbf = builder.build_counting_bloom_filter(); cbf.add(b"helloworld"); assert_eq!(bloom.contains(b"helloworld"), true);
benchmark
computer info
CPU | Memory | OS |
---|---|---|
AMD Ryzen 7 5800U with Radeon Graphics | 16G | Windows 10 |
add one str to bloom filter
测试添加一个字符串到布隆过滤器:
bloom_add_test time: [41.168 ns 41.199 ns 41.233 ns]
change: [-0.4891% -0.0259% +0.3417%] (p = 0.91 > 0.05)
No change in performance detected.
Found 13 outliers among 100 measurements (13.00%)
1 (1.00%) high mild
12 (12.00%) high severe
add one million to bloom filter
添加一百万字符串((1..1_000_000).map(|n| { n.to_string() })
)到布隆过滤器:
bloom_add_all_test time: [236.24 ms 236.86 ms 237.55 ms]
change: [-3.4346% -2.9050% -2.3524%] (p = 0.00 < 0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
4 (4.00%) high mild
1 (1.00%) high severe
check one contains in bloom filter
测试布隆过滤器包含的元素:
bloom_contains_test time: [42.065 ns 42.102 ns 42.156 ns]
change: [-0.7830% -0.5901% -0.4029%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 15 outliers among 100 measurements (15.00%)
1 (1.00%) low mild
5 (5.00%) high mild
9 (9.00%) high severe
check one not contains in bloom filter
测试布隆过滤器不包含的元素:
bloom_not_contains_test time: [22.695 ns 22.727 ns 22.773 ns]
change: [-3.1948% -2.9695% -2.7268%] (p = 0.00 < 0.05)
Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
4 (4.00%) high mild
8 (8.00%) high severe
add one str to counting bloom filter
测试添加一个字符串到计数布隆过滤器:
counting_bloom_add_test time: [60.822 ns 60.861 ns 60.912 ns]
change: [+0.2427% +0.3772% +0.5579%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
1 (1.00%) low severe
4 (4.00%) low mild
1 (1.00%) high mild
4 (4.00%) high severe
add one million to counting bloom filter
添加一百万字符串((1..1_000_000).map(|n| { n.to_string() })
)到计数布隆过滤器:
counting_bloom_add_million_test
time: [272.48 ms 272.58 ms 272.68 ms]
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) low mild
1 (1.00%) high mild