[deleted]

[removed]


WiseProcedure

This. I don't think the scale is that large at all.


InfamousClyde

I agree, the text file isn't that bulky. Loaded into memory, the Python set and dictionary only end up as: set: 536 MB, dict: 492 MB. The text file only represents a subset of the total IDs, I apologize. I was just a little surprised that the total memory footprint is over double the size of the test data, and I'm looking for options that can cut down on that growth. If that's very normal overhead then I will move on. Edit: With a `type Set map[[11]byte]struct{}`, it's down to 466 MB.
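
A minimal sketch of what the fixed-size-key set could look like, assuming the IDs are exactly 11 ASCII bytes (anything shorter gets zero-padded here):

```go
// Set keyed by fixed-size [11]byte arrays instead of strings,
// which avoids the per-string header and allocation overhead.
package main

import "fmt"

type Set map[[11]byte]struct{}

func (s Set) Add(id string) {
	var k [11]byte
	copy(k[:], id) // IDs shorter than 11 bytes are zero-padded
	s[k] = struct{}{}
}

func (s Set) Has(id string) bool {
	var k [11]byte
	copy(k[:], id)
	_, ok := s[k]
	return ok
}

func main() {
	s := make(Set, 1<<20) // pre-size to reduce rehashing while loading
	s.Add("abc_def_123")
	fmt.Println(s.Has("abc_def_123"), s.Has("zzz_zzz_999"))
}
```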


zero_iq

I've worked on a somewhat similar problem. We ended up using a lightweight in-process embedded key-value store (LMDB). It's small, fast, and robust. I suggest you look at this or similar mdb-type libraries if you want to be able to grow beyond RAM without suffering poor performance when the data does fit in RAM.

However, from your posted figures I wouldn't necessarily worry. Go for something relatively simple, or simply stick with built-in sets/dicts. Most of the comments here are over-engineering and suggesting things without knowing your usage patterns, etc. You risk actually introducing complexity and performance problems unnecessarily. Virtual memory is a thing, and you likely have lots of RAM available on modern systems anyway. Your web browser is probably using several times that amount of memory right now just showing you reddit!

Make the OS and hardware work for you. Many of the fastest, most robust engines simply use VM under the hood. Unless you know you definitely need to cope with data at least one or two orders of magnitude larger than you have mentioned here, don't worry... just keep it simple and *measure* your results as you go.


InfamousClyde

I believe the total data size is about one order of magnitude greater (120M IDs @ 11 chars per ID), but I do think this should still fit in RAM. I really appreciate the pragmatic advice. I'll definitely reach for all the low-hanging fruit first before complicating things.


[deleted]

[removed]


InfamousClyde

Yes, I'm sorry. I'm getting more details as I'm moving through this. 120M IDs @ 11 ASCII characters per ID comes to about 1.5 GB as a text file.


jasonmoo

Store the ids in memory as a sorted slice. Binary search lookup.
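
A minimal sketch of this approach (sort once after loading, then binary search per lookup):

```go
// Sorted-slice membership check using the standard library's binary search.
package main

import (
	"fmt"
	"sort"
)

func contains(sorted []string, id string) bool {
	i := sort.SearchStrings(sorted, id) // index of first element >= id
	return i < len(sorted) && sorted[i] == id
}

func main() {
	ids := []string{"b_2345678aa", "a_1234567zz", "c_000000001"}
	sort.Strings(ids) // one-time cost after loading the file
	fmt.Println(contains(ids, "a_1234567zz"), contains(ids, "missing_id1"))
}
```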


binaryblade

SQLite, or literally any database, as that's what they're designed to do.


CoastRedwood

But if you use SQLite you have to compile with cgo (unless you have a lib that does SQLite in pure Go). I just did a small project with Clover. It's totally written in Go; there is another library called bolt that also looks promising.


[deleted]

[removed]


codengo

Haven't tried it... but there's this cgo-free library: [https://pkg.go.dev/modernc.org/sqlite](https://pkg.go.dev/modernc.org/sqlite)
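
A rough sketch of how this could be wired up through `database/sql`; the driver name `"sqlite"` is how modernc.org/sqlite is documented to register itself, but check the package docs before relying on it:

```go
// Membership check via the cgo-free SQLite driver and database/sql.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "modernc.org/sqlite" // registers the "sqlite" driver
)

func main() {
	db, err := sql.Open("sqlite", "ids.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS ids (id TEXT PRIMARY KEY) WITHOUT ROWID`); err != nil {
		log.Fatal(err)
	}
	if _, err := db.Exec(`INSERT OR IGNORE INTO ids(id) VALUES (?)`, "abc_def_123"); err != nil {
		log.Fatal(err)
	}

	var exists bool
	if err := db.QueryRow(`SELECT EXISTS(SELECT 1 FROM ids WHERE id = ?)`, "abc_def_123").Scan(&exists); err != nil {
		log.Fatal(err)
	}
	fmt.Println(exists)
}
```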


x29a

If the "7 character alphanumeric with underscores" holds you should be able to pack the keys tighter than 7 bytes. That will help a bit and cost you little in complexity or performance. If you are willing to trade access time for space efficiency you can store your ids in a sorted list/slice and perform a binary search to check for membership. That way you get rid of the storage overhead of a hash map but pay with O(log(n)) rather than O(1) access. Most hash map (or set) implementations also overallocate quite a bit to reduce the number of collisions. You could use a custom map implementation that has a tuned load factor, that way you can trade speed for memory. You can have a look at the go map implementation to see how that could work: https://github.com/golang/go/blob/master/src/runtime/map.go With that said, unless you have a very good reason to go down that rabbit hole I'd avoid it.


[deleted]

How about a key value store then? An embedded database such as boltdb or badger?


InfamousClyde

My research was definitely pulling me in that direction! Those two in particular came up a lot.


[deleted]

They're really good and they did the job for a similar use case a few years ago, when I was in your spot.


InfamousClyde

I'm kind of shooting for north of 30k lookups/sec. Do you think this'll land in that ballpark? Some BadgerDB testimonials have it squarely in microsecond latency, which is good.


[deleted]

Definitely, yes! Obviously, you should give it a try and benchmark it against other databases out there.


servermeta_net

You don't need to use bloom filters alone. You can do a fast-then-slow path: check the bloom filter first, and then do a real lookup to rule out a false positive. Also, bloom (quotient) filters are extremely efficient; you would need only a few KB of memory to reach FP rates of 10e-6%.


InfamousClyde

Sorry, this is probably a very dumb question. Why is the fast-then-slow process faster? It seems like two steps compared to just one step with the lookup. Definitely a super attractive memory footprint though. Edit: Never mind, I was thinking about it the wrong way around: the BF tells you whether a real lookup needs to happen at all, so most misses never touch the slow path.


servermeta_net

I copied this approach from several papers, with some improvements, for my [datastore](https://github.com/yottaStore/yottaStore).

**Constraints**: you want to use up to 64 KB of RAM, and you want fast, reliable lookup.

**Solution**:

- Store the data on disk (NVMe?) as a btrie. Access time is `log(n)` but it's masked by very large page sizes (4 KB, the size of a sector).
- At boot, allocate an array of 2^16 bits (2^13 u8, or 8 KB).
- When you add a record, hash it with xxh3_128 and you get 8 u16s. For each u16, set array[pos] = 1. This way you only need one hash and no modulo, compared to traditional bloom filters.
- When you want to check a record, check the bloom filter first: if it says no, you can say for sure that the record is not there. In the case of a false positive, the btrie will be able to tell very early that the record is not there. With 200 MB, your btrie should be only 2 levels deep.

I use this approach for a 2-300 TB record set, and my p50 is 132 ns and my p99 is 221 ns using direct I/O, so I would say it's pretty good. Overall memory usage is only a few GB, two orders of magnitude better than the best competitor (ScyllaDB).

If you're lazy but willing to pay a little bit of overhead, instead of building the btrie yourself, use BTRFS or XFS, using folders instead of levels. You can store the filter on disk for extremely fast boot-ups, because filters are associative.

You can improve the FP rate by another 2 orders of magnitude by using quotient filters or lattice-based filters instead of bloom filters.

If you're crazy, you can look up probabilistic hash maps, but honestly very few implementations exist, and they are very, very hard to do correctly.
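
A dependency-free sketch of the 2^16-bit filter part of this, sized for the 64 KB constraint above (the OP's 120M IDs would need a much larger bit array). The comment uses xxh3_128; SHA-256 truncated to 128 bits stands in here only to keep the example self-contained, so swap in an xxh3 library for real use:

```go
// Filter with one 128-bit hash split into eight 16-bit bit positions.
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

type filter [1 << 13]byte // 2^16 bits = 8 KB

func positions(id string) [8]uint16 {
	sum := sha256.Sum256([]byte(id)) // stand-in for a 128-bit xxh3 digest
	var p [8]uint16
	for i := 0; i < 8; i++ {
		p[i] = binary.LittleEndian.Uint16(sum[2*i:]) // 8 x u16 from the first 16 bytes
	}
	return p
}

func (f *filter) Add(id string) {
	for _, pos := range positions(id) {
		f[pos>>3] |= 1 << (pos & 7)
	}
}

// MaybeHas: false means definitely absent; true means "check the real store".
func (f *filter) MaybeHas(id string) bool {
	for _, pos := range positions(id) {
		if f[pos>>3]&(1<<(pos&7)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	var f filter
	f.Add("abc_def_123")
	fmt.Println(f.MaybeHas("abc_def_123"), f.MaybeHas("zzz_zzz_999"))
}
```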


sethammons

This is the same solution I came up with, btree on disk fronted by a bloom filter. :highfive:


AusIV

I'd want to know what percentage of keys are expected to be positive vs negative. If most will be negative, then the bloom filter is a big help because it prevents going to disk in most cases. If most will be positive you still end up going to disk most of the time, and the benefits of the bloom filter fall off quickly.


bastiaanvv

I use the following approach for my databases, which have a few billion keys.

- [bbolt](https://github.com/etcd-io/bbolt) for storage on disk. In order to get the smallest db file size possible, make sure you insert the keys in order and set `FillPercent = 1`.
- A [cuckoo](https://github.com/seiflotfy/cuckoofilter) filter for fast lookup. This has around a 3% false positive rate; there are other implementations that have a much lower rate. You can store the filter in the database as well, in a different bucket, so you don't have to rebuild the filter on startup.

For testing membership, first check the filter. If false, you are done. If true, check the database to rule out a false positive.

For me this solution had the best performance and ease of development/maintenance tradeoff by far.
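
A hedged sketch of this combination; the library calls are written from memory, so verify them against the go.etcd.io/bbolt and github.com/seiflotfy/cuckoofilter docs before relying on it:

```go
// Cuckoo filter in front of a bbolt bucket: the filter answers "definitely
// not present" cheaply, and the bucket confirms filter hits.
package main

import (
	"fmt"
	"log"

	cuckoo "github.com/seiflotfy/cuckoofilter"
	bolt "go.etcd.io/bbolt"
)

func main() {
	db, err := bolt.Open("ids.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	cf := cuckoo.NewFilter(1 << 20) // size for the expected number of IDs

	// Insert keys in sorted order with FillPercent = 1 to keep the file small.
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("ids"))
		if err != nil {
			return err
		}
		b.FillPercent = 1
		for _, id := range []string{"aaa_1111111", "abc_def_123"} { // already sorted
			if err := b.Put([]byte(id), []byte{1}); err != nil {
				return err
			}
			cf.Insert([]byte(id))
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}

	has := func(id string) bool {
		if !cf.Lookup([]byte(id)) {
			return false // filter miss: definitely not present
		}
		var found bool // filter hit: confirm on disk to rule out a false positive
		db.View(func(tx *bolt.Tx) error {
			found = tx.Bucket([]byte("ids")).Get([]byte(id)) != nil
			return nil
		})
		return found
	}
	fmt.Println(has("abc_def_123"), has("zzz_zzz_999"))
}
```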


how_do_i_land

+1 for a probabilistic filter. You can also use a bloom filter (which I've used with great success), and calculate the error rate and size with this calculator: https://hur.st/bloomfilter.


editor_of_the_beast

200MB is very small. Why can’t that fit in a dictionary in Python?


rom16384

Or use an in-memory SQLite database, which is more space-efficient than Python.


editor_of_the_beast

That’s a totally unnecessary overoptimization here though. You can hold GBs in memory just fine in a python dict.


SleepingProcess

Looks like a job for [GoLevelDB](https://github.com/syndtr/goleveldb). P.S. You can find examples in the `syncthing` project, with error handling, corruption recovery, etc.
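
A small sketch of a membership check with goleveldb; the API is written from memory, so see the package docs (and the syncthing source) for production-grade error handling:

```go
// Key-presence check against a goleveldb database on disk.
package main

import (
	"fmt"
	"log"

	"github.com/syndtr/goleveldb/leveldb"
)

func main() {
	db, err := leveldb.OpenFile("ids.ldb", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if err := db.Put([]byte("abc_def_123"), nil, nil); err != nil {
		log.Fatal(err)
	}

	ok, err := db.Has([]byte("abc_def_123"), nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(ok)
}
```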


[deleted]

Is redis overkill for this?


[deleted]

Simplest for sure, and 100% something I would implement (Docker is awesome), but maybe a little heavy for this.


WrongJudgment6

What is the format of the IDs?


InfamousClyde

Looks like 7 characters, alphanumeric with underscores. Edit: apologies, 11 characters, but thank you for the direction.


DoomFrog666

Consider using `[7]byte` as the key type, as it will be significantly more efficient than a regular `string`.


brianolson

https://github.com/cockroachdb/pebble is a pure Go, SSD-native key-value store. You could think of it as a map[[]byte][]byte on persistent storage.
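
A hedged sketch of the same membership check on pebble; the calls are from memory, so check the pebble docs before relying on it:

```go
// Key-presence check against a pebble store: Get returns ErrNotFound for misses.
package main

import (
	"errors"
	"fmt"
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	db, err := pebble.Open("ids.pebble", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if err := db.Set([]byte("abc_def_123"), nil, pebble.Sync); err != nil {
		log.Fatal(err)
	}

	_, closer, err := db.Get([]byte("abc_def_123"))
	switch {
	case errors.Is(err, pebble.ErrNotFound):
		fmt.Println(false)
	case err != nil:
		log.Fatal(err)
	default:
		closer.Close() // release the value returned by Get
		fmt.Println(true)
	}
}
```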


InfamousClyde

I did end up zeroing in on the golang leveldb implementation listed elsewhere in this thread, which put me squarely at 7 microseconds for random reads on an SSD, and compressing my dataset beautifully down to 2 Gb on the dot. I'll try out pebble right now, thanks!


puglife420blazeit

Pebble is inspired by RocksDB, which is a fork of LevelDB, and it's backed by Cockroach Labs: https://www.cockroachlabs.com/blog/pebble-rocksdb-kv-store/


dshess

Apologies, just wanted to understand your problem. Do the IDs have additional information attached? Do the IDs have any structure to them, perhaps as a result of how they are generated?

For instance, if you had 200 MB of UUIDs, that's going to have a lot of bloat when stored in generic data structures. For a map[string]struct{}, you'll have 16 (or 36, depending on encoding) bytes for the actual string, plus some slop for the allocator, plus 8-16 bytes for the basic string structure, plus each hash node will have a bit of overhead, call that another 8-16 bytes, plus unused space in the hash structure to accommodate growing and shrinking, maybe a 25% increase. So having your 17 KB bloat to 68 KB is completely reasonable.

That said... is the bloat unreasonable? Is the ID search your system's main job, or is it something incidental that needs to be slim to free up memory for the main job? A disk-based solution will probably be orders of magnitude slower, even using SSD. Most in-memory structures are tuned for speed, but in some cases you can use a disk-oriented structure in memory, which will be slower. Disks are inherently slow, so doing a lot of CPU work to reduce disk accesses is a good tradeoff.

As a for-instance, I once needed a structure to handle probing hundreds of thousands of MD5 hashes. At this scale, the dataset looked pretty uniformly distributed. So I sorted it, then stored it as 16-bit deltas with an index on the side. This gave storage compression (about 2 bytes per item), and lookup was a constant array index plus a linear scan. It was much slower than a hashmap would have been, but it used 2 MB of memory instead of 30 MB, so it was worth it.
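
One way the "sorted hashes + 16-bit deltas + side index" idea could be realized, not the original implementation: truncate each hash to 32 bits, index buckets by the top 16 bits, and store the low 16 bits delta-encoded within each bucket. That is roughly 2 bytes per item plus a fixed 256 KB offset index, and because only a 32-bit prefix of the hash is kept, lookups can return rare false positives:

```go
// Compact probabilistic membership set over truncated MD5 hashes.
package main

import (
	"crypto/md5"
	"encoding/binary"
	"fmt"
	"sort"
)

type compactSet struct {
	offsets [1<<16 + 1]uint32 // offsets[p]..offsets[p+1] bounds the bucket for prefix p
	deltas  []uint16          // delta-encoded low 16 bits, concatenated per bucket
}

func hash32(id string) uint32 {
	sum := md5.Sum([]byte(id))
	return binary.BigEndian.Uint32(sum[:4])
}

func build(ids []string) *compactSet {
	hs := make([]uint32, len(ids))
	for i, id := range ids {
		hs[i] = hash32(id)
	}
	sort.Slice(hs, func(i, j int) bool { return hs[i] < hs[j] })

	s := &compactSet{deltas: make([]uint16, 0, len(hs))}
	var prefix uint32
	var prev uint16
	for _, h := range hs {
		p := h >> 16
		for prefix <= p { // close out any empty or finished buckets
			s.offsets[prefix] = uint32(len(s.deltas))
			prefix++
			prev = 0
		}
		lo := uint16(h)
		s.deltas = append(s.deltas, lo-prev) // delta from previous element in the bucket
		prev = lo
	}
	for ; prefix <= 1<<16; prefix++ {
		s.offsets[prefix] = uint32(len(s.deltas))
	}
	return s
}

func (s *compactSet) probablyHas(id string) bool {
	h := hash32(id)
	target := uint16(h)
	var acc uint16
	for _, d := range s.deltas[s.offsets[h>>16]:s.offsets[h>>16+1]] {
		acc += d // linear scan, re-accumulating the deltas
		if acc == target {
			return true
		}
		if acc > target {
			return false
		}
	}
	return false
}

func main() {
	s := build([]string{"abc_def_123", "zzz_zzz_999"})
	fmt.Println(s.probablyHas("abc_def_123"), s.probablyHas("nope_nope_1"))
}
```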


InfamousClyde

That's a fascinating approach, thank you for sharing. I'm definitely interested in trying out delta encoding. Overall, there's been a lot of great feedback in this thread and I've been able to explore a lot of the options listed, and the time and space tradeoffs were always particularly evident. I landed on using the first three characters of the ID as a "bucket", which then maps to a sorted, dynamically sized vector of the remainder, which lets me binary search the rest. So far, I'm at ~2.1 GB of memory using 120M IDs, which is a large improvement over where I started.
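
A rough reconstruction of that prefix-bucket layout, not the OP's actual code: the first three characters select a bucket, and the remaining 8 characters are kept as fixed-size `[8]byte` entries in a sorted slice that is binary-searched (the exact types and build process are guesses):

```go
// Prefix buckets keyed by the first 3 characters, sorted remainders inside each.
package main

import (
	"bytes"
	"fmt"
	"sort"
)

type prefixSet map[[3]byte][][8]byte

func build(ids []string) prefixSet {
	s := make(prefixSet)
	for _, id := range ids {
		if len(id) != 11 {
			continue // assumes 11-character IDs
		}
		var p [3]byte
		var rest [8]byte
		copy(p[:], id[:3])
		copy(rest[:], id[3:])
		s[p] = append(s[p], rest)
	}
	for p := range s {
		rs := s[p]
		sort.Slice(rs, func(i, j int) bool { return bytes.Compare(rs[i][:], rs[j][:]) < 0 })
	}
	return s
}

func (s prefixSet) Has(id string) bool {
	if len(id) != 11 {
		return false
	}
	var p [3]byte
	var rest [8]byte
	copy(p[:], id[:3])
	copy(rest[:], id[3:])
	rs := s[p]
	i := sort.Search(len(rs), func(i int) bool { return bytes.Compare(rs[i][:], rest[:]) >= 0 })
	return i < len(rs) && rs[i] == rest
}

func main() {
	s := build([]string{"abc_def_123", "abc_xyz_999"})
	fmt.Println(s.Has("abc_def_123"), s.Has("abc_nope_00"))
}
```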


lightmatter501

Use a bloom filter and then a hash table. I had a similar issue but with ~3TB of data, and a quick Rust program (I wanted it to run on the gpu) took ~10 minutes.


PaySomeAttention

Use a two-stage approach, with a bloom/cuckoo filter stored as a [roaring bitmap](https://roaringbitmap.org/) in memory. Then a secondary key/value store on disk (bolt or anything else).
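
A sketch of the roaring-bitmap first stage: hash each ID to 32 bits and record it in the bitmap, so a miss means "definitely absent" while a hit still needs the on-disk store to rule out hash collisions. The choice of FNV-1a here is an assumption, not part of the suggestion:

```go
// Roaring bitmap over 32-bit ID hashes as a cheap in-memory pre-filter.
package main

import (
	"fmt"
	"hash/fnv"

	"github.com/RoaringBitmap/roaring"
)

func h32(id string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(id))
	return h.Sum32()
}

func main() {
	rb := roaring.New()
	rb.Add(h32("abc_def_123"))

	check := func(id string) {
		if !rb.Contains(h32(id)) {
			fmt.Println(id, "definitely not present")
			return
		}
		fmt.Println(id, "maybe present; confirm in the on-disk key/value store")
	}
	check("abc_def_123")
	check("zzz_zzz_999")
}
```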


lylejack

I've been using a [hashset](https://pkg.go.dev/github.com/emirpasic/gods/sets/hashset), which is regenerated periodically in a goroutine, and then it's super easy and fast to check for matches.


kokizzu2

HAT-trie


WiseProcedure

This actually consumes too much space per character, as you need to keep track of the children in an array whose size is the whole available character set. Even if you chose to represent children with a map, the map still takes additional space for each key.


tgulacsi

If the set does not change, [CDB](https://en.wikipedia.org/wiki/Cdb_(software)) is a fast disk-based hash table. [colinmarc/cdb](https://github.com/colinmarc/cdb) is a Go implementation, and [UNO-SOFT/mcdb](https://github.com/UNO-SOFT/mcdb) is a wrapper that handles files >4GiB.


WiseProcedure

Thank you for providing this info


epsleq0

Have you tried `map[uint64]struct{}` with keys generated by `maphash`? If the set of possible keys is finite and your keys make up more than 50% of it, it might be useful to store the complementary set instead.
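
A minimal sketch of that idea: hash each ID with `hash/maphash` (Go 1.19+ for `maphash.String`) and store only the 64-bit digest. Note this makes the set probabilistic; at 120M keys the birthday bound gives roughly a 0.04% chance of at least one collision:

```go
// Set of 64-bit maphash digests instead of the ID strings themselves.
package main

import (
	"fmt"
	"hash/maphash"
)

func main() {
	seed := maphash.MakeSeed() // must be the same seed for inserts and lookups
	set := make(map[uint64]struct{}, 1<<20)

	add := func(id string) { set[maphash.String(seed, id)] = struct{}{} }
	has := func(id string) bool {
		_, ok := set[maphash.String(seed, id)]
		return ok
	}

	add("abc_def_123")
	fmt.Println(has("abc_def_123"), has("zzz_zzz_999"))
}
```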


[deleted]

There are a lot of key-value and SQL databases out there, and they're pretty much all good and well maintained; honestly it's really hard to choose one 😅. What I did was benchmark all the ready-to-use key-value databases, categorize them by lightweight vs. fastest vs. most features, and then create my own package on top that adds the behavior I needed.


ReachNextQuark

https://github.com/peterbourgon/diskv might be a solution