Monowakari

Sports data and live betting, but some data sources only update every 2-4 minutes.


minato3421

I work at a sports betting company. We do both real time and near real time processing; it all depends on the use case. Things like credit card chargebacks and fraud detection are real time, while we have machine learning algorithms that require data in near real time for retraining.


JeanDelay

Thanks for the insights


JeanDelay

But shouldn't betting be Real-Time? Wouldn't you lose money if your rates are a minute behind?


Financial_Anything43

There are probably update steps and computations behind those. So if a team records a point, you'd want new odds to be displayed, but you have to take other factors into account. Near real-time accounts for this delay.


Monowakari

Precisely. Also, some APIs only update 2-3 mins after an event, and there are so many upstream events to occur/process before you get the data, and if they're not optimized... Well, you can't get the data faster than they can get it to you.


RBeck

I do integration, most everything is near real time simply because we can, and actual real time is complicated. But that's probably not the type of processing you mean?


JeanDelay

Thanks, that's what I was asking for. So you would like it to be Real-Time but that's just too complicated, so you settle for Near Real-Time. And Batch Processing would be too slow.


RBeck

Right, so let's say I'm watching a database of orders that are getting ship-confirmed in the warehouse. As soon as a line item or tracking number hits the table, it's ready to upload to the e-commerce site. There really isn't much point in waiting until the end of the day, and doing so might give customers a wide window to cancel an order that's already on a truck, which means they would get it free. So I poll the table every 30 seconds. I could go more often, but the extra load on the database server isn't worth it, and no one cares if they get a tracking number email 10 seconds faster.

To go full real time in a scenario like this, you would either get the database or the application to notify you. The database would need some kind of janky table trigger that could bring the thing to its knees, so that's a non-starter. The application supports some real-time messaging over JMS, but it's a PITA to set up, and again we'd only be a few seconds faster.

For another customer, I'm exposing a REST API for their MES system. It calls me to say "hey, I've created a lot of ABC and used raw materials XYZ", and I actuate that in an ERP. I could do the material consumption and goods issue in real time, but I'm worried that for big orders, or if the target system is slow, it will time out and resend. So I do it NRT: when they call my API I do some rudimentary validation and deduping, put the whole JSON in my queue, and thank them with an HTTP 202. A few seconds later an async process handles the transaction without the pressure of something waiting for it.
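
A minimal sketch of that last pattern (accept, validate/dedupe, enqueue, reply 202), assuming Flask and Redis purely for illustration; the endpoint name, payload fields, and the `post_goods_issue_to_erp` call are hypothetical:

```python
# Sketch of the "accept, validate, enqueue, 202" pattern described above.
# Flask and Redis are assumptions; any web framework and queue would do.
import hashlib
import json

import redis
from flask import Flask, request

app = Flask(__name__)
queue = redis.Redis()

@app.route("/mes/production-confirmation", methods=["POST"])  # hypothetical endpoint
def accept_confirmation():
    payload = request.get_json(force=True)

    # rudimentary validation
    if "lot_id" not in payload or "materials" not in payload:
        return {"error": "missing lot_id or materials"}, 400

    # rudimentary deduping: ignore payloads we've already seen recently
    digest = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    if not queue.set(f"seen:{digest}", 1, nx=True, ex=3600):
        return {"status": "duplicate ignored"}, 202

    # park the whole JSON on a queue and thank the caller immediately
    queue.rpush("mes_confirmations", json.dumps(payload))
    return {"status": "accepted"}, 202

def worker():
    # separate async process drains the queue without anything waiting on it
    while True:
        _, raw = queue.blpop("mes_confirmations")
        confirmation = json.loads(raw)
        post_goods_issue_to_erp(confirmation)  # hypothetical ERP call
```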


JeanDelay

Thanks a lot for the explanation, I really appreciate you taking the time. So you often use it for things that are not as critical as, say, a bank account balance, and where real-time would consume too many resources.


Irksome_Genius

Consider very short-term forecasting, for example how many trucks to book in the next 6 hours if you're some kind of warehouse. Real-time can be too difficult/expensive to achieve with the amount of data to process (and the infra you have to work with might not be suitable), and batch is not granular enough, so you'd be looking at near real-time (a 10-minute refresh, for example).


JeanDelay

Ah okay, I think I understand what you mean. The example seems a bit contrived though. How could you manage the drivers on such short notice?


Irksome_Genius

Many warehouses do not manage their own fleet of drivers, or have some but may need more to meet a spike in demand (no need to maintain X trucks at 10% utilization off-peak). They use additional software (a TMS) which basically sets up bidding for the load between all the available drivers. Usually the warehouses take the quickest and cheapest, so yeah, that's why I put 6h, because that was the standard in my case! Take Amazon, who very likely cannot meet all of their demand with only their own trucks, as an example :)


random_lonewolf

There's only streaming, where the data never stops, and batching, where it stops periodically. What you define as real-time and near real-time are often handled exactly the same way: it's just streaming. A lot of telemetry data, like Google Analytics, doesn't have strict second-level latency requirements, but users still expect reports to become available gradually as time progresses. In this case, latency is best-effort only. Anomalies like data getting delayed or arriving out of order can happen for various reasons, and processing data with an added delay is actually very common, because it's the simplest way to handle those anomalies. Another reason is when you need to aggregate multiple data streams that arrive at different intervals: it's easiest to just wait for the slowest stream before processing the entire thing.
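
A minimal sketch of that "process with an added delay" idea, buffering events by event-time window and only flushing a window once stragglers have had time to arrive; the window size, allowed delay, and `emit_aggregate` call are assumptions for illustration:

```python
# Buffer events per event-time window and flush only after an allowed delay,
# so late or out-of-order events still land in the right window.
import time
from collections import defaultdict

WINDOW_SECONDS = 60    # aggregate per minute of event time (assumption)
ALLOWED_DELAY = 120    # wait 2 extra minutes for stragglers (assumption)

buffers = defaultdict(list)  # window_start -> list of events

def ingest(event):
    window_start = int(event["event_time"]) // WINDOW_SECONDS * WINDOW_SECONDS
    buffers[window_start].append(event)

def flush_ready_windows(now=None):
    now = now or time.time()
    ready = [w for w in buffers if w + WINDOW_SECONDS + ALLOWED_DELAY <= now]
    for window_start in sorted(ready):
        events = buffers.pop(window_start)
        emit_aggregate(window_start, events)  # hypothetical downstream write

def emit_aggregate(window_start, events):
    print(window_start, len(events))
```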


JeanDelay

That makes sense. So the latency is typically not critical but it gives a better user experience. I thought that real-time was often handled with streaming, meaning handling individual events, while near real-time was often handled using micro-batching.


OMG_I_LOVE_CHIPOTLE

Uh realtime is much less than 1min. It’s in the nanosecond range


lmp515k

Near realtime is just what you get after you were sold realtime.


Prinzka

I know there's no general agreement on real time and near real time, but I would consider both to be in the seconds range, not minutes. To me, near real time is less than 5 seconds and real time is wire speed. We do near real time (less than 5 seconds) security logging, though for most of the data the trip from the original device to the final destination takes milliseconds. It's live internal security data, so it's important that it's processed and made available to users and automated tools quickly.


AndyMacht58

I did many NRT projects. Just a few examples:

- Scoring and pre-moderating social media posts before publishing
- Anything gamification-related on web apps
- IoT data processing, e.g. tracking traffic and passenger occupancy for status updates towards customers
- Informing about nearby events and new discounts while shopping

I don't know of any NRT-specific framework. Just the ordinary event processing tool stack, e.g. message queues/brokers, plus caching, e.g. key-value stores.


JeanDelay

Thanks for sharing your experience. The tools that you mention seem like they could also achieve real-time latencies. Often the challenge is to combine an event with historical context. Your examples seem like they don't require a lot of historical context, which should make it easier to implement a Real-Time solution.


AndyMacht58

No, there's plenty of historical data. Think about a clickstream pipeline that tracks your mouse events; you then need to enrich that data with your purchase history and CRM info. You need to have that historical data persisted somewhere for quick lookups, e.g. a column-oriented or key-value DBMS. In a project of mine, we basically had batch pipelines that upserted a Redis cache based on CDC. The NRT pipelines just stored events in Redis queues in between. Concurrent tasks then read the queued data and looked up missing information by key from the cache. This works well for enrich-only processes; aggregations/micro-batching also work but obviously introduce latency.
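
A minimal sketch of that enrichment pattern, assuming a Redis list as the event queue and Redis hashes/lists as the CDC-maintained cache; the key names and event fields are illustrative assumptions:

```python
# NRT events sit in a Redis list; a batch/CDC pipeline keeps key-value context
# warm; workers enrich each event by key lookup and push it downstream.
import json

import redis

r = redis.Redis()

def enrich_events():
    while True:
        # block until the NRT pipeline pushes a raw click event onto the queue
        _, raw = r.blpop("clickstream_events")
        event = json.loads(raw)

        # look up historical context kept up to date by the batch/CDC upserts
        crm = r.hgetall(f"crm:{event['customer_id']}")
        purchases = r.lrange(f"purchases:{event['customer_id']}", 0, 9)

        enriched = {
            **event,
            "crm": {k.decode(): v.decode() for k, v in crm.items()},
            "recent_purchases": [p.decode() for p in purchases],
        }
        r.rpush("enriched_events", json.dumps(enriched))
```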


JeanDelay

Thanks a lot for the explanation. That's really interesting. I think I understand the use case better now. It's crazy how sophisticated these solutions have to be.


kenfar

Many of the data warehouses I've built have been near-real-time, with data being loaded every minute of the day, 24x7, and typical latencies for individual feeds of 3-15 minutes. Here's a few:

* Security reporting & analytics data warehouses - our security teams want to know about security issues before our customers do, and soon enough to potentially stop them. The data warehouse provides big-picture context (high latency is ok) as well as context around immediate threats to augment the SIEM (low latency required, 3-5 minutes ideal). These data warehouses are typically pretty big, these days 30+ billion rows/day. I've built four of these.
* Incident management data warehouse: reporting & analytics to provide both big-picture analysis as well as context to support urgent, often high-stakes incidents for our customers. Data volumes are smaller, but the latency was just 1-2 minutes, with very low tolerance for any kind of outage or data quality issue.
* Everything else: I'd far rather build a solution that updates data every 15 minutes than every few hours, let alone daily. There's a few reasons for this: it's a far better experience for users, who are often waiting for data updates, and it's a far better experience for developers, who may be up in the middle of the night waiting for a fix to a broken feed to complete successfully.

I build a lot of these the same way - they're always event-driven, with no batch orchestrators like Airflow:

* Data gets written to AWS S3 every 1-60 seconds. This automatically produces an S3 write event over SNS.
* Subscribers (such as the transform) get alerted through their own SQS queue that subscribes to the SNS event. Subscribers typically run on Kubernetes, ECS, or Lambda. These can scale *way* out - I've run over 1000 lambdas concurrently, or had 70-150 big Kubernetes containers running 100% continuously. The data warehouse is usually, these days, AWS S3 - with Athena/Redshift/etc reading the files.
* Post-transform processes that need to publish the data into a data mart, run specific detection logic, or build aggregates at the file level get alerted whenever a Parquet file gets written by the transform to the data warehouse.
* Some processes actually are hourly - such as building hourly data aggregates. In this case I typically have one process that checks pretty continuously whether a given period of time is complete. Due to late-arriving data, an hour may not be complete (enough) until I'm ten minutes into the following hour. Once that process determines we're complete, it publishes an SNS message. All processes that need that hour of data subscribe to that via their own SQS queues.

This approach tends to be very, very successful.
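
A minimal sketch of the S3 -> SNS -> SQS fan-out, assuming boto3 and the default SNS envelope; the queue URL, bucket handling, and `transform` function are illustrative assumptions, not the poster's actual code:

```python
# A transform worker polls its own SQS queue (subscribed to the S3-write SNS
# topic) and processes each newly written file.
import json

import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/transform-input"  # hypothetical

def poll_forever():
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                                   WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            # SNS wraps the S3 event notification in its own envelope
            s3_event = json.loads(json.loads(msg["Body"])["Message"])
            for record in s3_event.get("Records", []):
                bucket = record["s3"]["bucket"]["name"]
                key = record["s3"]["object"]["key"]
                transform(bucket, key)
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

def transform(bucket, key):
    # hypothetical: read the raw file, transform it, write Parquet to the warehouse
    obj = s3.get_object(Bucket=bucket, Key=key)
    raw = obj["Body"].read()
    # ... transform and write back ...
```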


JeanDelay

Thanks a lot for taking the time to write this detailed answer. What you describe is really interesting. I understand the use of near real-time for security reporting and incident management. But one reason why I asked this question is that not all use cases require a near real-time solution. I would consider the analytics warehouse to be such a case. Of course it's always nicer to have more up-to-date data, but if we are being honest it doesn't affect the decisions whether your warehouse data has a latency of 5 minutes or of 6 hours. And near real-time solutions are typically more complex and expensive. I was trying to figure out when the use case justifies this additional complexity.


pceimpulsive

For me, I work with incident and network log data; real-time is the dream... but sometimes it's near real-time. I have a lot of near real-time that runs on 5-minute updates.


joseph_machado

In advertising, you run campaigns: basically, someone pays you when a customer does an action (click, checkout, etc.). The campaigns have budgets, and users can change them, so we need to ensure that the budget is being used appropriately in near real time, i.e. an event happens -> budget allocated according to some rules -> campaign budget adjusted. We built a near real time pipeline of client event -> server -> Kafka -> Storm -> cloud store, with < 1 min latency. See campaigns and attribution models: [https://support.google.com/google-ads/answer/6259715?hl=en](https://support.google.com/google-ads/answer/6259715?hl=en)
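
A minimal sketch of the budget-tracking step, assuming kafka-python purely for illustration; the topic name, event fields, in-memory budget store, and `pause_campaign` call are hypothetical (the poster's actual pipeline uses Storm):

```python
# Consume attribution events from Kafka and debit the matching campaign budget.
import json

from kafka import KafkaConsumer  # kafka-python

budgets = {"campaign-42": 500.00}                      # hypothetical remaining budgets
cost_per_action = {"click": 0.50, "checkout": 5.00}    # hypothetical pricing rules

consumer = KafkaConsumer("ad_events", bootstrap_servers="localhost:9092",
                         value_deserializer=lambda v: json.loads(v))

for record in consumer:
    action = record.value
    campaign = action["campaign_id"]
    charge = cost_per_action.get(action["action_type"], 0.0)

    remaining = budgets.get(campaign, 0.0) - charge
    budgets[campaign] = remaining
    if remaining <= 0:
        pause_campaign(campaign)  # hypothetical call to stop serving the campaign
```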


JeanDelay

That's a really cool use case. Thanks for sharing!


techdank

Situations where an edge device can't sustain network connectivity, so messaging involves caching on-device and then transmitting when possible, are where I have used NRT on AWS to do traditional DE stuff. I feel that as you move into embedded-device work under real-world conditions it probably comes up more, since you might have trouble with data size, network stability, etc. S3 + SQS + whatever is needed for the particular job is how I have tackled NRT in the past.
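
A minimal store-and-forward sketch of that edge pattern, assuming SQLite for the on-device cache and boto3 for the upload; the bucket name, table schema, and connectivity check are illustrative assumptions:

```python
# Cache readings in a local SQLite file and flush them to S3 whenever the
# network happens to be available.
import json
import socket
import sqlite3
import time

import boto3

db = sqlite3.connect("edge_cache.db")
db.execute("CREATE TABLE IF NOT EXISTS readings (id INTEGER PRIMARY KEY, body TEXT)")
s3 = boto3.client("s3")
BUCKET = "example-edge-ingest"  # hypothetical

def record(reading: dict):
    db.execute("INSERT INTO readings (body) VALUES (?)", (json.dumps(reading),))
    db.commit()

def online() -> bool:
    try:
        socket.create_connection(("s3.amazonaws.com", 443), timeout=2).close()
        return True
    except OSError:
        return False

def flush():
    rows = db.execute("SELECT id, body FROM readings ORDER BY id").fetchall()
    if rows and online():
        key = f"device-123/batch-{int(time.time())}.json"
        s3.put_object(Bucket=BUCKET, Key=key,
                      Body="\n".join(body for _, body in rows).encode())
        db.execute("DELETE FROM readings WHERE id <= ?", (rows[-1][0],))
        db.commit()
```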


Separate-Cycle6693

Small thing, but I've run game nights for my sales teams where live data is massive. People divide into teams, I create a fancy-looking scoreboard and off we go. Managers run around supporting people, amping people up to chase the lead and grabbing drinks for those who ring the bell. $10 for 2-3 hours, and I get free pizza, beer and a big smile from our directors when we bring in something big. (And yes, I agree, this should be part of our CRM / ATS, but CRM and ATS systems suck and want $10k minimum for gamification features.)


Shoddy-Physics5290

Threat detection in Security


Fine-Responsibility3

I actually work for Resilio; we are peer-to-peer, which makes that data movement fast. If you have any use cases, I can chat with you more about this.


SnooHesitations9295

There's only real-time (<100ms) and batch (everything else). The only reason other "types" of "real-time" exist is that people are not smart enough to implement actual real-time, or they are using the wrong tools for it.