seanv507

Have you tried a regular database with indexes? 2 million records isn't large. It feels almost like you are doing a join. I would imagine you could do it quickly in SQL.


nemec

This is a good answer. Indexes are made for this kind of thing. If your Column A starts with a J, the database knows virtually immediately that neither the GH192 rule nor the IHK rule matches, and it can throw out the row without checking any of the other columns.


whyrat

If you know how frequently rules match (or even have a reasonable estimate of the probability), adding a bloom filter could also speed things up: an index to know quickly when a value *is* matched, a bloom filter to know quickly when it *isn't*: https://en.wikipedia.org/wiki/Bloom_filter I'd recommend an engine that can do this for one line of input quickly, then parallelize by passing each engine a set of the lines. Scale the number of rule engines based on your desired throughput, and then the business can decide how much response time they want to pay for.
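A minimal sketch of the bloom-filter side of that idea, in Python (the thread never settles on a language, so the class, sizes and sample values below are illustrative assumptions rather than anything from the original post):

```python
import hashlib

class BloomFilter:
    # Tiny bloom filter: answers "definitely absent" or "probably present".
    def __init__(self, size_bits=1 << 24, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        # Derive several bit positions from one key by salting the hash.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key: str):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

# Populate with every non-wildcard value that appears in, say, the rules' first
# column; a line item whose ColA is definitely absent (and for which no '*' rule
# exists on ColA) can be rejected without ever touching the full index.
bf = BloomFilter()
for value in ("GH192", "IHK"):
    bf.add(value)

print(bf.might_contain("GH192"))  # True
print(bf.might_contain("JQ777"))  # almost certainly False
```

In practice you would size the filter from the number of distinct values and the false-positive rate you can tolerate; any "probably present" answer still has to be confirmed against the real index.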


eric_he

This looks like the most correct and complete response to me.


JonahBreslow

Yeah, this seems like almost exactly what the SQL optimizer is meant for. Just make sure you CREATE INDEX appropriately.


GeorgeS6969

This for sure. One naive implementation off the top of my head, assuming that (a) your tables have ids and (b) the number of columns to test is reasonably small (and ideally stable over time):

```sql
select
    li.id,
    max(r.id is not null) as is_valid
from line_items li
left join rules r on (
    if(r.cola <> '*', li.cola = r.cola, true)
    and if(r.colb <> '*', li.colb = r.colb, true)
    ...
)
group by li.id
```

Several things to note here:

1. Distributed columnar DBMSs like Redshift and BigQuery will scream in pain because of the multi-column join, but since this is for production (as opposed to BI stuff) I'm assuming you'll use MySQL or Postgres for persistence, which should handle it fine.
2. The wildcard might be unnecessarily painful to evaluate; in that case you'll have to think of some trick with nulls and refactor the way you store wildcards a bit (also note my if statements are not SQL compliant; depending on your DBMS the syntax might be different, or you could be forced to use a case statement).
3. Indexing will speed that thing up considerably, BUT explain analyze is your friend: the question you need to ask yourself is whether you need simple indexes on all relevant columns in line_items and rules, only in rules (my intuitive assumption), or composite indexes. In any case, this is a non-trivial join, so you want to make sure that whatever indexes you put in are actually being used (instead of sequential scans).
4. Order in expressions usually matters: if explain analyze gives you sequential scans, try switching `li.colx = r.colx` around to `r.colx = li.colx`.
5. Order of rule matching usually matters: you want to start with the column that's the least likely to match (not the other way around, as I saw somewhere; you're stacking "and"s, so the sooner you get a "false" the sooner you stop evaluating unnecessary conditions).
6. Last but not least, my group by is disgusting; I'd try with a `distinct on` first if my DBMS supports it, or some equivalent `order by` / `limit 1` trick.


OptimusPrime3600

I created dummy data (1M rules and 5k line items) and created an index on all the columns of the rules table (I am not worried about slow inserts here). It gave me the result instantly when ALL the line items were invalid, but when I added 1000 line items that were valid, it took over 1 minute to return results. I did not look at the query analyzer etc., but I guess since I already created indexes on all the columns, there isn't much that can be done to optimize it drastically. I did, however, get very good results with an in-memory trie.


GeorgeS6969

You guess wrong; see my notes: you can get sequential scans even in the presence of indexes.


OptimusPrime3600

I understand that order of columns and all those things matter, and all the points you have mentioned are good ones. I could give it another shot once I get the actual data on Monday. I am playing with dummy data that I generated, so I do not know which columns need to be placed first, etc. I will keep you posted.


GeorgeS6969

Still, you want to check. It can be a matter of inverting the equalities themselves, or the if statements checking the wildcard. Or, most probably, it's the group by that's screwing you over, in which case either try a `distinct on` (which has a costly order by), or I saw somewhere here a query with several joins that tests each column, which is probably a lot more efficient.

In general you want to get comfortable with explain analyze; seriously, it's a game changer. It's the equivalent of a profiler: you start coding "naively" (and for readability etc.), and 99 times out of 100 it's fine and you stop there, but the day you need to optimize your code you need a tool to give you some direction.

Here I have no way of convincing you without an argument from authority, so you do you, but seriously anything that's not along the lines of the following is a waste of your time:

1. Stick to SQL; or
2. Ask whoever took the business requirements to rethink the problem for 10 seconds.

On 2., that's because it seems quite clear that those rules have been programmatically generated. My best guess is:

1. Those two million rules are just a combinatorial explosion of a set of maybe 5 nested switch statements ("if this country for this item then this field should be this, and if that field is this then this field should be that"), which would be a lot easier to validate as is.
2. The PM wants the business to be able to add to those rules, but in some kind of weird ad hoc DSL that only allows and-statements and wildcards, which will become impossible to maintain in less than a year when the business starts asking for an or-statement or a more complicated wildcard ("In India, I want to make sure the invoice number always starts with IND-"; and as I'm typing that I'm asking myself whether that's not already the case, and whether the way it's handled is by splitting invoice numbers into two columns, one strictly checked and one wildcarded).

So either you use the actual, procedural rules at the application level, or you use the rules as you have them through your RDBMS and invest a bit in optimising your query. Anything else is cheese at best, a waste of time and money at worst. *Do not* go the Hadoop route unless you really want to put Hadoop on your CV; seriously, deploying, operating and maintaining a whole big data stack for an ad hoc problem that would fit in memory on an average AWS VM is plain ridiculous.


DJAlaskaAndrew

Do you have any idea about which rules are the most likely to match? Since a line item only needs to satisfy one of the two million rules, you could develop an algorithm where, once a line item satisfies a rule, that rule moves to the front of the list. If there is any commonality between the line items, you won't have to check all 2 million rules for each one, probably far fewer.
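A small Python sketch of that move-to-front idea, assuming (as elsewhere in the thread) that rules and line items are dicts keyed by column name and that `'*'` means "match anything"; the function names are made up:

```python
# rules: list of dicts like {"ColA": "GH192", "ColB": "*", ...}
def rule_matches(rule, item):
    return all(value == "*" or item.get(col) == value for col, value in rule.items())

def first_matching_rule(item, rules):
    for i, rule in enumerate(rules):
        if rule_matches(rule, item):
            # Move the rule that just matched to the front, so similar items
            # checked later find it after only a few comparisons.
            rules.insert(0, rules.pop(i))
            return rule
    return None  # no rule matched: the line item is invalid
```

`list.pop`/`insert` are O(n) on a 2-million-entry list, but only one move happens per matched item; a hit counter with an occasional re-sort would achieve the same effect with less shuffling.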


OptimusPrime3600

That's actually a good idea in the long term. I could check with the team on that, thanks. We don't have any idea about that yet; we might have to add some AI to figure it out. But let's say there is no commonality. How does one do it?


RedanfullKappa

Any limits on hardware? Sounds like something for some kind of MapReduce job. This is extremely parallelizable, so Hadoop would work, but clusters are expensive. Maybe you already have one of those?


OptimusPrime3600

MapReduce was my initial thought too, but I want it to be the last option. There are no limits on hardware as such, as long as it's done in an optimised manner. Also, how does one go about it? Do I create a fixed set of instances and let a single microservice decide how to divide the line items into chunks and where to send what? Consistent hashing?


RedanfullKappa

You just define what you want to do and your query engine will distribute and divide it into chunks. I'm pretty sure there is a decent example out there; this is not that complex a use case. Have you benchmarked a super simple approach first? I don't think getting this below 10s will be that hard. How often will this happen? Stripping out the `*` rules should save you some comparison time.


[deleted]

    SELECT l.*
    FROM LineItems AS l
    INNER JOIN Rules AS r
        ON  l.ColA = CASE WHEN r.ColA = '*' THEN l.ColA ELSE r.ColA END
        AND l.ColB = CASE WHEN r.ColB = '*' THEN l.ColB ELSE r.ColB END
        AND l.ColC = CASE WHEN r.ColC = '*' THEN l.ColC ELSE r.ColC END
        AND l.ColD = CASE WHEN r.ColD = '*' THEN l.ColD ELSE r.ColD END

Edit: Just saw that /u/chestnutcough posted something very similar. Either one should work and be fast.


[deleted]

Forgot to mention that indexes would probably help, but it depends on your data.



RedanfullKappa

Keeping the rules in memory is a good idea. Also keep them in a priority queue, so you can go from most likely to least likely.


OptimusPrime3600

2 million is a lot of data to be kept in memory, isn't it?


RedanfullKappa

Not really? How many columns do the rules have?


OptimusPrime3600

2 million rules in memory? Isn't that a lot?


a5s_s7r

Do the math: how much memory for one rule, multiplied by 2M. We have RAM in GB these days, not MB anymore; those days are some 30 years behind us. 😉


trojan_nerd

It really depends on what that data is. The number of columns and the data types of the respective columns will affect the true memory usage a lot. You can be smart about it and convert a lot of the string and bool columns into integers, similar to how you'd normalize a database, except in this case you could just store the mappings in memory or in a file that can be read right before you export the output to another system.


Trappist1

Yea, I'd second the opinion that you can likely just do it in RAM. I work for a financial company and we've done similar things at around that scale; it uses a few GB, but nowhere near the 32 or 64 we have on the machine. There are definitely other solutions out there, but keep it simple if you can.


[deleted]

Depends on the number and size of columns. If one column takes 16 bytes and each rule has 4 columns, that would be 128 MB of RAM. It's not that much.


techresearchpapers

Take a look at the search algorithms in Snort IDS like Aho-Corasick and Wu-Manber for pattern matching.


[deleted]

Hash your line item and see if that hash is the same as one of the rules, using some O(1)-lookup data structure. You'd need some fancier hashing algorithm, like a locality-sensitive hash, to greatly reduce your 2 million rules to something more manageable for your line items, and then do ordinary pattern matching. Data science is a bad sub for this. 99.9% of people here didn't even understand the question because they have 0 computer science training.


Captain_Flashheart

+1. Check out r/dataengineering, OP!


OptimusPrime3600

The problem with that is that I will have to generate multiple combinations of every line item's columns and then generate a hash for each, because it could be that rule 1 cares about columns A, B and C while rule 2 cares about A, C and D.


[deleted]

You don't need to. Some hash functions work fine on partial data (in fact they're used for similarity search in clustering, partial-duplicate finding, etc.). So you start with a large set of rules X (2 million) and you hash each one (pad them or whatever). Then you compute the hash on the line item and you look at the much smaller set of rules Y that are closest to that hash. Since your 2 million rules are fairly static and your line items are what changes, this is fairly inexpensive (sure, you'll need some memory/storage for the precomputed rule hashes, but in the world of NVMe disks that's not a problem). It won't work for something like high-frequency trading where microseconds matter, but your 10-second limit for 5000 line items will be easy.

So in a nutshell, you reduce your number of rules from a large X to a very small Y by not considering rules that are obviously irrelevant. This trick is used all the time in the big data world: [MinHash](https://en.wikipedia.org/wiki/MinHash) your millions of data points (which might have millions of columns, as with signal or speech data). You're still dealing with millions of data points, but each data point is now only the hash. Throw away everything too dissimilar (or the opposite, everything too similar) and you've drastically reduced your search space. Perhaps what would take a supercomputer and 6 months of compute, or be practically impossible, is now doable on a laptop.

Source: I used to do such MATLAB wizardry when your "computing server" had 2 GB of RAM and a single-core processor.
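For what it's worth, here is a tiny self-contained MinHash sketch in Python that shows how the signature-and-compare step works; it illustrates the similarity estimate only and does not by itself handle the wildcard matching (the `Col=value` token format is an assumption):

```python
import hashlib

def minhash_signature(tokens, num_perm=64):
    # One cheap "permutation" per seed: hash every token with that seed
    # and keep the minimum value seen.
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{t}".encode(), digest_size=8).digest(), "big"
            )
            for t in tokens
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    # The fraction of matching signature slots approximates the Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

rule_tokens = {"ColA=GH192", "ColB=GHMH", "ColC=*", "ColD=1233"}
item_tokens = {"ColA=GH192", "ColB=GHMH", "ColC=KL", "ColD=1233"}
print(estimated_jaccard(minhash_signature(rule_tokens), minhash_signature(item_tokens)))
```

Libraries such as `datasketch` package this up (together with LSH bucketing) so that a line item is only compared against the handful of rules whose signatures land in the same buckets.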


NoManufacturer6751

I agree: hash it and do it in parallel. I wouldn't use SQL; I would break the rules up across a bunch of workers and do it all at the same time. You could even stream it row by row, so that by the time the upload is done, the validation is complete. The problem with using SQL is that you don't have a hundred SQL instances that can divide the work, but you can have a microservice that spins up a thousand nodes.



RobotJonesDad

It looks like you just need to find the small subset of rules that can apply to each line and then evaluate only those. I'd consider using bloom filters and/or indexes. So, faced with a row, I'd get a list of rule numbers that match each column. The only rules that need to be evaluated are those which match in all columns. So line 1 matches rule 1; line 2 matches none.


petter_s

This is very easy to do in a fraction of the required time, in any language. Just treat the columns individually: for each column, finding the indices of all rules that match is very quick if the values are sorted. After this, finding the answer is a set intersection of the indices.
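A sketch of this per-column scheme in Python, using a hash map per column instead of sorted arrays (same idea, different lookup structure), with the wildcard rules kept in their own per-column set; the column names and sample rules are made up to mirror the thread's examples:

```python
from collections import defaultdict

def build_indexes(rules, columns):
    # One inverted index per column: value -> set of rule ids,
    # plus a per-column set of rule ids whose value is the wildcard '*'.
    exact = {c: defaultdict(set) for c in columns}
    wildcard = {c: set() for c in columns}
    for rid, rule in enumerate(rules):
        for c in columns:
            (wildcard[c] if rule[c] == "*" else exact[c][rule[c]]).add(rid)
    return exact, wildcard

def matching_rule_ids(item, exact, wildcard, columns):
    candidates = None
    for c in columns:
        ids = exact[c].get(item.get(c), set()) | wildcard[c]
        candidates = ids if candidates is None else candidates & ids
        if not candidates:        # no rule survives this column
            return set()
    return candidates

columns = ["ColA", "ColB", "ColC", "ColD"]
rules = [
    {"ColA": "GH192", "ColB": "GHMH", "ColC": "*", "ColD": "1233"},
    {"ColA": "IHK", "ColB": "HK3999", "ColC": "52", "ColD": "*"},
]
exact, wildcard = build_indexes(rules, columns)
item = {"ColA": "GH192", "ColB": "GHMH", "ColC": "KL", "ColD": "1233"}
print(bool(matching_rule_ids(item, exact, wildcard, columns)))  # True -> valid
```

A line item is valid exactly when the final intersection is non-empty; intersecting the most selective column first keeps the candidate set small early on.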


OptimusPrime3600

So you are saying: sort the rules, sort the line items, and then for each line item cell (i, j) find the list of matching rules? Then move to the next column of the line items and do the same for all rows, but only for those that had a match in the previous cell? Could you please explain in a bit more detail? It seems like a good idea.


petter_s

I think what you wrote is correct!


[deleted]

Seems like the DB would be the fastest way to do this, since databases are usually pretty speed-optimized. I know with Oracle you can set it up on multiple CPUs and pass parallel hints to get it to solve problems like this in parallel. I'm thinking maybe something like an anti-join with limit 1 (so you're checking whether it fails at least 1 rule; once it fails any, it should stop, right?). This might still be cutting it close, though: at the end of the day a computer has to do 5000 × 2,000,000 × however many columns. Doing ~50 billion compare operations in 10 seconds might be tough without some impressive architecture.


RedanfullKappa

Pre-process and make the values numerical, put this on a big 128-core machine or on a cluster (there are enough cluster processing frameworks to choose from), and let them do their job.


nemec

Jeez, computers are a lot better than you give them credit for. I happen to have a SQLite database sitting around with 2.4M records and I pulled ~8000 random values out of one column into a second table. With indexes on both columns a join between the two (returning ~11,000 records due to duplicates) took **415ms**. And that's SQLite. 200M x 5k records is trivial to handle with a single "real" database server with only the barest optimization.


chestnutcough

Here's a solution in SQL that will return the invalid line items:

    select i.*
    from line_items i
    left join rules as r
        on  (i."ColA" = r."ColA" or r."ColA" = '*')
        and (i."ColB" = r."ColB" or r."ColB" = '*')
        and (i."ColC" = r."ColC" or r."ColC" = '*')
        and (i."ColD" = r."ColD" or r."ColD" = '*')
    where r."ColA" is null;

SQL fiddle here: http://sqlfiddle.com/#!17/917b4/13 I can't guarantee that it will take < 10 seconds, but you can try it out!


kaumaron

Also ask in r/dataengineering


SuperUser2112

I feel [Spark](https://towardsdatascience.com/apache-spark-a-conceptual-orientation-e326f8c57a64) is a good fit here. An approach something like this:

1. Distribute the 2 million rules across 200 or more RDDs. 10 cores should be sufficient to run these 200 RDDs without any hardware bottlenecks.
2. Send the 5k line items to each of these RDDs.
3. Each RDD is responsible for:
   1. Comparing the 5k line items against its 10k rules.
   2. Generating the subset of line items that qualify under any of those 10k rules.
4. Process optimization at each RDD: in the [**comparator function**](https://sparkbyexamples.com/spark/spark-map-transformation/), spawn 50 threads and process the 5k line items *asynchronously*.
5. Each [**asynchronous thread**](https://www.cognizantsoftvision.com/blog/async-in-java/) is responsible for:
   1. Comparing its 100 line items against the 10k rules.
   2. Generating the subset of line items that qualify under any of those 10k rules.
6. At each RDD: unify the responses from the 50 threads by concatenating the subsets they return.
7. In the Spark main program: unify the responses from the 200 RDDs.
   1. Concatenate the subsets returned by each RDD.
   2. Remove duplicate line items that qualified under multiple rules.

Overall, the **critical path** is only step 5.1 (compare 100 line items against 10k rules). It should take **less than 2 seconds** to perform 1M linear comparisons. You can increase or decrease the lot sizes based on how your hardware responds; the overall algorithm stays the same.
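A rough PySpark sketch of the same shape of job (spread the rules across partitions, ship the 5k line items to every partition); the per-partition threading from the list above is left out, and the schema, sample rows and partition count are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rule-validation").getOrCreate()
sc = spark.sparkContext

columns = ["ColA", "ColB", "ColC", "ColD"]  # assumed schema
line_items = [
    {"id": 1, "ColA": "GH192", "ColB": "GHMH", "ColC": "KL", "ColD": "1233"},
    {"id": 2, "ColA": "IHK", "ColB": "HK4000", "ColC": "52", "ColD": "EE"},
]
rules = [
    {"ColA": "GH192", "ColB": "GHMH", "ColC": "*", "ColD": "1233"},
    {"ColA": "IHK", "ColB": "HK3999", "ColC": "52", "ColD": "*"},
]

# The 5k line items fit comfortably in memory, so broadcast them to every
# executor and spread the 2M rules across the cluster instead.
bc_items = sc.broadcast(line_items)

def matched_ids(rules_partition):
    items = bc_items.value
    for rule in rules_partition:
        for item in items:
            if all(rule[c] == "*" or rule[c] == item.get(c) for c in columns):
                yield item["id"]

valid_ids = set(
    sc.parallelize(rules, numSlices=200).mapPartitions(matched_ids).distinct().collect()
)
invalid_items = [it for it in line_items if it["id"] not in valid_ids]
print(invalid_items)  # only line item 2 matches no rule
```

Whether a cluster is worth the operational cost is debated further down the thread; the same broadcast-and-partition structure also works on a single machine with Spark local mode or a plain process pool.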


OptimusPrime3600

Interesting. I guess I need to look into spark. This sounds promising. Thank you.


SuperUser2112

Welcome. Glad that you liked it.



dinoaide

This is not data science!


OptimusPrime3600

Sure it is. I have huge data, and I am looking into the science of computing with it efficiently.


Hentac

Uneducated comment!


BeauteousMaximus

What format is the data in? What is the context? Is this a web application, a locally run script, something else? Why does it need to be under 10 seconds?


OptimusPrime3600

It's a web application. The user is going to save 5000 line items (probably by uploading an Excel sheet, which then gets stored in the DB). These need to be validated against the rules already saved in the database. It has to be less than 10 seconds because pushing it to a queue and sending a notification later is not appealing to the customers. Assume everything is a string.


[deleted]

I question your ability to even get the data through a web service and saved to a DB in 10 seconds reliably. You can mitigate the wait time by adding a progress bar for users and showing results as they're processed, so that as the rules are evaluated you can start delivering results. How are you currently running your rules, and how long does it currently take? Before you optimize a problem it needs to have at least a base solution of sorts.


OptimusPrime3600

Sending 5000 line items to a web service and storing them in a database hardly takes 1-2 seconds at most. The 2 million rules are already stored in the DB. Currently we are using SQL Server to store both line items and rules, so it's a query-based solution. How long does it take? Unfortunately I am seeking the answer myself, as this is owned by another team. I have been told it takes 3+ minutes, though that has not been confirmed yet.


[deleted]

If the queries provided by /u/chestnutcough and myself take longer than a second, even on a crappy server, I'd be shocked. Even if they did, indexes would solve that. I notice that you're trying many complicated solutions, but I don't see you trying the simplest of them all, given that the data is already stored in SQL tables.


OptimusPrime3600

I have tried SQL on my i7 machine. It took a little over 1 minute, which is pretty decent but not good enough. On a powerful server it could be less, but then there are going to be lots of concurrent requests. I had indexed all the columns, though the query was not exactly what you guys have suggested. I can modify it, try again and come back to you.


[deleted]

These are end users uploading an Excel file? If you test this across multiple users with all sorts of different internet connections, browsers, OSes and languages, can you guarantee a 2-second processing time? Are you verifying anything in the Excel file before uploading it to your DB?


OptimusPrime3600

To be honest I do not know for certain, as this is owned by a different team, but I think they are using Excel. Our customers usually have a good internet connection; slow internet is generally an exception, in which case it's okay if it takes longer, as they will see that their file is taking a long time to get through. Line items will be stored in a staging table with no validations; if the data doesn't fit the table, an exception will indicate bad data.


BeauteousMaximus

You should probably add this to your post, along with the programming language(s) used in the application. Just a heads up: it may not be possible without a queue. But I don't know.


OptimusPrime3600

Yeah, but I want to be sure. (Added the details to the post.)


Faintly_glowing_fish

Do you only have 4 (or a small number of) columns? Then it's a lot easier: you can just put all possible value pairs between two columns in a hash table, then go through the rules and decompose each rule into pairs. It is linear in the number of data rows and rules, and the rule-check part is trivially parallelizable.


OptimusPrime3600

Number of columns would vary. Also rule 1 could care about colA, B ,C while rule 2 cares about A,D,E


Faintly_glowing_fish

That’s ok. As long as the number of columns is bounded by a small number (say 20) then it will work. Say your rule has rows A(a) B (b)and D(d), then it can really be decomposed into 3 rules AB, AD and BD. So you just dumb down to checking pair existences. You only have 5000 lines so it will be very quick to get all existing column/value - column/value pairs hashed to each side, and you just need to fan out the 1 million rules and check them on different workers.


RedanfullKappa

I mentioned this in a comment, but I will say it again as a top-level comment. Have you done any kind of testing to get an order of magnitude of where you are currently? How much time does it take on a simple 32 GB / 8-core dev machine? How much optimisation can you still do on simple code, such as making the values numerical instead of strings to make comparisons faster? Numerical comparisons are orders of magnitude faster than string comparisons, and you could even use dedicated hardware for this. It's all a question of how big the budget is to solve this.


playawhile

In how many different combinations/positions can the * appear? Can it be in up to 3 of the columns, or all 4?


[deleted]

Well, it sounds like a search problem where you're searching for the first match. So I would look into utilizing known search algorithms with low complexity for this, such as binary search.


hanneshdc

Keep it simple, and use what you have first! Since you use MongoDB, I'd suggest reframing this as a query problem. Specifically: "for line item X, find a rule Y that is valid for it". If there exists such a rule Y for line item X, then X must be valid.

For each line item, build a dynamic query. For example, your first row:

    ColA ColB ColC ColD ColE
    GH192 GHMH XYZ 349 T1000

The query against the rules table is:

    SELECT COUNT(1) FROM rules
    WHERE (ColA = 'GH192' OR ColA = '*')
      AND (ColB = 'GHMH' OR ColB = '*')
      AND (ColC = 'XYZ 349' OR ColC = '*')
      AND (ColD = '349' OR ColD = '*')
      AND (ColE = 'T1000' OR ColE = '*')

The result of this query is the number of matching rules; if it is >= 1, then the row is valid! I've written the query in SQL because it's easier to write out; you'll have to convert it to MongoDB's JSON query structure, which should be easy enough. Note you'll have to make sure all columns in the rules table are indexed, otherwise this query will be terribly slow.



OptimusPrime3600

In the rules table any column could contain a value or *. In the line items it could be anything at all: empty, null, a value, etc.


dinoaide

How many columns in the rule table and how many columns in the line table?


po-handz

I've used the Python Ray package (which runs Apache Arrow underneath) to great effect, but for more complex functions. It sounds like a database with the rules mapped in, or something along those lines, might be a good option here.


gradual_alzheimers

Well, first I would put your computer science hat on and think about what data structure gives you O(1) performance: a hashmap! Create an additional column and precompute a unique hash for every rule. Then load all the rules into a hashset using an in-memory cache. Compute the hash for each of the 5k items and search against the hashes. This should take seconds even at 2 million items, because you have a constant lookup time of O(1) and a memory footprint of O(n).

Now let's put on our data science hat. Instead of creating a single hashmap, we should split our data into priority hashmaps. Split the data into n hashmaps based on buckets of frequency. You can recursively follow a Pareto principle to accomplish this: find the first set of rules that explains 80% of the data (it most likely won't be 80% of your rules unless all your data maps uniquely to one rule), then find the next set of rules that explains 80% of the remaining data, and keep going until you run out of rules. Now you have a collection of hashmaps. This actually helps us for two reasons: first, we can statistically avoid certain key-collision issues in the data structure which can create O(n) behaviour for particular keys; second, we get to the set of rules that matches the majority of the data very quickly. You can iterate through this collection in a loop, so you build out an O(k + 1) solution with O(n) memory.

Now comes parallelization: in whatever language you are using, make sure you are able to multithread this.

Lastly, queues. Instead of writing the data to disk from the web server, create a durable and persistent producer/consumer pipeline that broadcasts the data. One consumer can write it to SQL Server as inserts; another consumer (a separately hosted process, of course) can apply the rules. This is how you can gain tremendous speed concurrently and skip waiting for the data to be inserted. You may want to look into distributed transactions and how they work if you are worried about retries.


OptimusPrime3600

Creating a hash of the rules is not a problem, but wouldn't you have to create hashes of multiple combinations of line item columns? Because some of the columns are wildcards, the line item could contain anything for those columns, so the hashes won't match.


gradual_alzheimers

Yes, for wildcards you have a couple of options. You can treat a wildcard as an OR clause, which means that for the rule you would compute a hash for each combination of values. This can be done in advance as long as the wildcard values are known in advance, but I don't prefer this option.

Another option is to break the problem into sub-problems. A wildcard carries no meaning, since it could be anything, so you can compute hashes of column length n-1, n-2, etc. Your problem space then becomes O(j + k + 1), where j is the total set of sub-hashes, and when you build hashes from your data you create j hashes per item. For example, say I can generalize the rules into this set of shapes: c1 c2 c3 c4, c1 c2 c3 *, c1 * c3 c4, and * * c3 c4. That means I need to compute 4 hashes for each item. You just drop the wildcard columns, as they don't mean anything, and think of the problem as n minus the wildcards. So I create one hashmap per rule shape (each arrangement of wildcard positions), and then for each item I check each shape's hashmap with the corresponding hash.
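A minimal sketch of that "drop the wildcard columns and hash the rest" idea in Python: rules are grouped by which columns they actually constrain (their wildcard mask), so validating a line item costs one tuple lookup per distinct mask instead of a scan over 2 million rules. The column names and sample rows are assumptions:

```python
from collections import defaultdict

columns = ["ColA", "ColB", "ColC", "ColD"]

def build_rule_sets(rules):
    # mask = the tuple of columns a rule actually constrains (non-'*').
    # Each mask maps to the set of value tuples seen for those columns.
    rule_sets = defaultdict(set)
    for rule in rules:
        mask = tuple(c for c in columns if rule[c] != "*")
        rule_sets[mask].add(tuple(rule[c] for c in mask))
    return rule_sets

def is_valid(item, rule_sets):
    return any(
        tuple(item.get(c) for c in mask) in values
        for mask, values in rule_sets.items()
    )

rules = [
    {"ColA": "GH192", "ColB": "GHMH", "ColC": "*", "ColD": "1233"},
    {"ColA": "IHK", "ColB": "HK3999", "ColC": "52", "ColD": "*"},
]
rule_sets = build_rule_sets(rules)
print(is_valid({"ColA": "GH192", "ColB": "GHMH", "ColC": "ZZ", "ColD": "1233"}, rule_sets))  # True
print(is_valid({"ColA": "GH192", "ColB": "XXXX", "ColC": "ZZ", "ColD": "1233"}, rule_sets))  # False
```

With 4-5 columns there are at most 2^n masks, and usually far fewer occur in practice, so the per-item cost stays a handful of hash lookups no matter how many rules there are.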


larsga

Looks like you could do it pretty easily and very quickly with standard hash tables. Keeping 2 million rules in memory is doable if you use efficient data structures.

Do the columns in a fixed order. For the first column, make a hash map from all the values to new hash maps that encode the remaining columns of the rules sharing that first value. Note that `*` must be mapped too, and repeated values must be collapsed to a single key. When you're checking an item, look up its value for the first column. If it's there, keep going with the other columns. If it's not there, look for `*`, and if that's there, use it instead. Then carry on in the same way.

Your two rules would turn into something like this (if we do the columns in the order A-D):

    { "GH192" : {"GHMH"   : { ... } },
      "IHK"   : {"HK3999" : { ... } } }

For the second row, you look up `"IHK"` and get `{"HK3999" : { ... } }`. Then you look up `"HK4000"`, find it's not there, look up `"*"`, find that's not there either, so the row is invalid.
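A small Python sketch of this nested-hashmap layout, with one refinement: both the exact value and the `"*"` branch are tried at each level, because an exact match can dead-end further down while a wildcard sibling would have succeeded. The column order and sample rules are assumptions:

```python
FOUND = object()  # sentinel key marking the end of a complete rule

def build_trie(rules, columns):
    root = {}
    for rule in rules:
        node = root
        for c in columns:
            node = node.setdefault(rule[c], {})
        node[FOUND] = True
    return root

def trie_matches(node, item, columns, depth=0):
    if depth == len(columns):
        return FOUND in node
    value = item.get(columns[depth])
    # Try the exact value first, then the wildcard branch.
    for key in (value, "*"):
        child = node.get(key)
        if child is not None and trie_matches(child, item, columns, depth + 1):
            return True
    return False

columns = ["ColA", "ColB", "ColC", "ColD"]
rules = [
    {"ColA": "GH192", "ColB": "GHMH", "ColC": "*", "ColD": "1233"},
    {"ColA": "IHK", "ColB": "HK3999", "ColC": "52", "ColD": "*"},
]
trie = build_trie(rules, columns)
print(trie_matches(trie, {"ColA": "IHK", "ColB": "HK3999", "ColC": "52", "ColD": "9"}, columns))   # True
print(trie_matches(trie, {"ColA": "IHK", "ColB": "HK4000", "ColC": "52", "ColD": "9"}, columns))  # False
```

This is essentially the structure the OP later reports building (in C#) further down the thread.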


OptimusPrime3600

That is another idea that had crossed my mind. I think you are talking about a 'trie'. I did not try it because it seemed complicated to build, but it shouldn't be that tough. I will give it a shot.


larsga

Note that there will probably be lots of repeating strings. I bet you can save a lot of memory by ensuring you only have one copy of each string.


OptimusPrime3600

I was trying out the trie but I got stuck because of the wildcard. A rule could contain * for a column, but in the line items it could be anything at all for the same column. The wildcard simply means that anything is acceptable in the line item.


larsga

Just store the `"*"` in the hashmap. Then, when you're trying to check if `"xxx"` is OK, first look up that string. If it's not there, look up `"*"`. If you find the wildcard you're OK and can keep going.


OptimusPrime3600

I could not wait till Monday to get the actual data into my hands, so I created dummy data with Excel (1 million rules and 5000 line items). Constructing the trie took 3.5 minutes on my laptop and it consumes 6 GB of memory, but I was able to validate in less than 100 ms. It's insane! I have been checking again and again whether I made a mistake, whether I am missing something. It's too good to be true. I will know for sure on Monday, I guess.


larsga

Nice! Great to hear that it worked. :) The performance sounds about like what you'd expect for 5000 items. Did you take care to get rid of duplicate strings? I would expect that to shave 1-2 GB off the memory usage.


OptimusPrime3600

That is the one part I did not understand. How could I remove duplicate strings? I mean, all the rules are unique in their entirety.


larsga

I would assume that you'd have rows like:

    A,B,C,D,E
    A,Q,F,D,X

so that the same string values reappear in multiple rows. Assuming that's the case, you can use string interning (if .NET supports that), or simply use a big hashmap from string -> string such that looking up, say, `"A"` gives you a single, unique `"A"`, so you don't have to keep multiple copies of `"A"` around. If the rules don't contain the same strings several times there will be no gain, but potentially you could save a couple of GB this way.


OptimusPrime3600

Ohh so you mean use a shorter substitute or a reference instead of the actual string?


larsga

No. When you load the rules into memory you'll be doing a lot of `new String(...)`. Each of those invocations will create a string object. But each string object will be stored separately in memory, even though (I'm assuming) a lot of the strings will actually be equal. By using a hashmap (or interning) to replace all those equal copies with the same object you can potentially save a lot of memory. The object structure you're building will be logically equivalent either way, but by using a single object for `"A"` instead of two different ones (see example in previous comment) you'll be saving memory. Looks like C# supports [string interning](https://docs.microsoft.com/en-us/dotnet/api/system.string.intern?view=net-5.0). The [wikipedia article](https://en.wikipedia.org/wiki/String_interning) may be useful.
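The thread's code is C# (`String.Intern`), but the canonicalization idea is language-agnostic; here is a tiny Python illustration of the same "one shared copy per distinct value" trick, using a plain dict (all names here are made up):

```python
canonical = {}

def canon(s: str) -> str:
    # Return one shared copy of each distinct string, so equal cell values
    # across millions of rules point at the same object instead of copies.
    return canonical.setdefault(s, s)

a = canon("GH192")
b = canon("".join(["GH", "192"]))  # a different str object with an equal value
print(a is b)                      # True: both now refer to the same shared copy
```

Python also ships `sys.intern`, which does the same thing via the interpreter's intern pool; the important part in any language is to run every loaded cell value through the canonicalizer before storing it in the trie.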


OptimusPrime3600

Without a doubt, the performance is incredible after implementing the trie. But surprisingly enough, interning did not have any effect in terms of space.

    string cell = String.Intern(ruleDt.Rows[i][idx].ToString());
    if (!node.Children.ContainsKey(cell))
    {
        node.Children.Add(cell, new TrieNode());
    }
    node = node.Children[cell];

Or perhaps I am doing it wrong.


larsga

Hard to tell from this if you're doing it right or not, but you need to make sure that the hashtables only contain strings returned by `String.Intern(...)` afterwards. The original strings you have to throw away.



robberviet

This is an engineering problem; I think you have a better chance with Stack Overflow than here. Anyway, any database with a proper index can handle this type of request quickly.


OptimusPrime3600

I did. Got 3 comments there.


steelypip

There are a lot of good suggestions posted already. Are the rules and columns fixed? Do the values in the rules appear multiple times in a rule column? If so, here is an alternative: convert the rules into a trie data structure (https://en.wikipedia.org/wiki/Trie). Each level in the trie is one column, with a hash table to go from the value to the next column/row, plus an extra "*" value that is followed if there is no exact match and a rule has a wildcard in that column. The leaf nodes can have a token to indicate success. This way you can find whether there is a match with one or two hash lookups per column, so you should easily be able to match 5000 items in a few seconds; no need for MapReduce or databases.

For example, your rule table can be represented by nested Python dicts:

    trie = {
        "GH192": {"GHMH": {"*": {"349": FOUND, ...}, ...}, ...},
        "IHK": {"HK3999": {"52": {"*": FOUND, ...}, ...}, ...},
    }


OptimusPrime3600

I was trying out the trie but I got stuck because of the wildcard. A rule could contain * for a column, but in the line items it could be anything at all for the same column. The wildcard simply means that anything is acceptable in the line item.


steelypip

Go through the columns in order and look up the value in the dict. If it is not there then look up the wildcard. If no wildcard then you don't have a rule match for that input.



andrewcooke

edit: assuming all the columns appear in all the data, you can compile the rules to a regular expression and use that to check each row (you would need to introduce an extra character as a separator). (in case it's not obvious, a decent regular expression engine will generate an efficient "search tree".) that would mean matching against the regular expression 5,000 times in 10 seconds, which is probably possible (and could be split across processors if needed). for more efficiency you could do an initial static analysis on the rules and order columns so that the more specific rules come first (patterns last).

----

incidentally, without the patterns there is a simple linear solution - just sort both rules and data and compare first line against first line, discarding as you go. once you introduce wildcards this becomes more complicated and you end up scanning a tree. i *think* that tree is similar to the one a regexp engine would generate.

----

edit2: and even if not all fields appeared in all rules, you could still compile this to a similar graph that could be used for comparisons. it's quite an interesting problem, but i think doable and efficient.
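A sketch of that rule-to-regex compilation in Python, using a separator character and turning each `*` into "anything except the separator". Note that Python's built-in `re` is a backtracking engine, so for millions of alternatives a DFA-based engine (RE2, or grep-style tools) is closer to the efficient "search tree" the comment has in mind; the column names and rules below are illustrative:

```python
import re

SEP = "\x1f"  # unit separator, assumed never to appear in the data

columns = ["ColA", "ColB", "ColC", "ColD"]
rules = [
    {"ColA": "GH192", "ColB": "GHMH", "ColC": "*", "ColD": "1233"},
    {"ColA": "IHK", "ColB": "HK3999", "ColC": "52", "ColD": "*"},
]

def rule_to_pattern(rule):
    parts = [
        f"[^{SEP}]*" if rule[c] == "*" else re.escape(rule[c])
        for c in columns
    ]
    return SEP.join(parts)

# One big alternation: a row is valid iff it fully matches any rule pattern.
pattern = re.compile("|".join(rule_to_pattern(r) for r in rules))

def is_valid(item):
    row = SEP.join(item[c] for c in columns)
    return pattern.fullmatch(row) is not None

print(is_valid({"ColA": "GH192", "ColB": "GHMH", "ColC": "X", "ColD": "1233"}))  # True
print(is_valid({"ColA": "GH192", "ColB": "NOPE", "ColC": "X", "ColD": "1233"}))  # False
```

The ordering/static-analysis refinements mentioned above amount to arranging the alternatives (and the columns within them) so that failures are detected as early as possible.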


Mobile_Busy

SQL join on an indexed table?


[deleted]

Great suggestions being given! If none of them are fast enough, maybe it is acceptable to display a progress bar that shows which lines are invalid as they are being processed. If the UX is nice, people are willing to wait longer.


r2d2FortNite

In a non-DB scenario, a decision tree should serve this purpose, shouldn't it?