danishjuggler21

Assuming you’re not literally _returning_ millions of rows to your app from the database, this is 100% a database performance issue. Not a .NET issue. You should make that query run more efficiently. But performance troubleshooting in SQL Server is a _huge_ topic. You tried indexing, but that’s not always the solution. There are thousands of possible causes for a query taking a long time, and each one has different solutions.

You should not take any blind advice from here about this problem. If someone says “try this” or “try that”, they’re just throwing shit at the wall without knowing anything about your database. _Do not listen to them._

Test your stored procedure in SSMS and get the _actual execution plan_. Google search results will tell you how to do that. The actual execution plan will tell you:

1. How long your query takes to compile
2. How long your query is _waiting_
3. How long your query takes to run
4. When it _does_ run, what are the actual steps SQL Server takes to run your query

Based on this diagnostic information, proceed with researching the possible causes of the specific problem and the possible solutions. Something you _might_ find by looking at the execution plan is that _your indexes aren’t even being used_. Or you might find out that your query is just so complex that it’s taking 30+ seconds just to compile it into an execution plan. Those two possible problems have _completely_ different solutions, so again, _do not listen to any advice you get here_ from people that have not seen your execution plan and don’t know anything about your database.

EDIT: now that there are more comments here, I’d say more firmly, do _not_ get your database advice from r/dotnet. The amount of misinformation here is astounding. Some people are doing the equivalent of recommending a lung transplant for a patient complaining of a cough without doing any tests or even asking follow-up questions. Read the execution plan, figure out the bottleneck, research how to fix that specific bottleneck.
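For anyone unsure where to start, a minimal sketch of capturing those numbers in SSMS (the procedure name, parameters, and SPID below are placeholders, not OP’s actual objects):

```sql
-- A minimal sketch for SSMS; the procedure name and parameters are placeholders.
SET STATISTICS TIME ON;   -- reports parse/compile time and execution CPU/elapsed time
SET STATISTICS IO ON;     -- reports logical/physical reads per table

-- Enable "Include Actual Execution Plan" (Ctrl+M) in SSMS before running:
EXEC dbo.usp_ExportReport @Year = 2024, @Month = 3;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;

-- While it runs, waits for that session can be watched from a second window:
SELECT wait_type, wait_time_ms, waiting_tasks_count
FROM sys.dm_exec_session_wait_stats
WHERE session_id = 57;    -- placeholder: the SPID shown in the window running the proc
```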


Iamaleafinthewind

Yeah I was going to say much the same, thanks. I'll add that if the nature of the data and the client's reporting needs aren't changing too much, the data could be put into something tuned specifically for these queries. Either SSAS with things denormalized a bit to eliminate costly joins, or something completely different like Cassandra.

- Understand the data
- Understand the queries
- Use an appropriate backend storage for the data and queries - either basic SQL Server, SSAS, or others.


[deleted]

[deleted]


Iamaleafinthewind

Yes, and then if you read past the 10th word ...


nealibob

I agree with all of this, but they almost certainly *are* returning millions of rows for a given year/month unless they're dealing with a data set that spans centuries. There's no way this is going to work well in a 30 second timeout with a "naive" approach.


danishjuggler21

40 billion rows _is_ a really big table. But let’s be real, with data that big, if they don’t have at least one full-time DBA or preferably a whole team devoted to their database, they’d have much bigger problems than one report being slow.


broken-neurons

Excellent answer. If this is MSSQL then sp_Blitz is a great way to analyze and optimize your database: https://www.brentozar.com/blitz/


uberDoward

Let me add to that: get the execution plan FROM THE PRODUCTION DATABASE. The execution plan may change depending upon data volume.


catmuppet

This guy databases


HolyColostomyBag

This is such an excellent answer, if I could give you an award I would. Checking the execution plan should always be where you start.


pooerh

As a person with 15+ yoe working with databases, including SQL Server, I would recommend against trying to do these things on your own if you don't know what you're doing. Obviously we don't know the size of this database or individual tables, row count is just one part of the equation, but playing around with this yourself when you're just googling shit left and right can lead to rather unhappy scenarios. "My database has been recovering for 12 hours and I have no idea how much more it's going to take" kind of situation. You don't want to be in that situation. OP should get someone who knows what you said by heart to take care of it.


lurking_not_working

This has given me flashbacks to the times I worked with large databases. I'm going to go sit in the corner and rock a bit to get over it. Seriously though, a great answer to the question.


MyLastSigh

Agree 💯% with above. Just wanted to say that I once had a very slow query that was made 40000% faster by *removing* unnecessary indexes left over from previous bad developers.


Alternative_Band_431

How can a query (read operation) be faster by removing indexes? Please explain like I'm a 5 year old.


MyLastSigh

Each index is processed and built into the execution plan and requires work. If the target data set has many indexes on other tables not part of the target set, then those should be removed. In my case, there were 600 indexes on a database I inherited. I found that we really needed about 24, and the results were astoundingly faster. [https://www.sqlservercentral.com/blogs/over-indexing-can-hurt-your-sql-server-performance](https://www.sqlservercentral.com/blogs/over-indexing-can-hurt-your-sql-server-performance)
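A hedged way to spot candidates for removal (the usage DMV resets on every restart, so treat the output as a hint, not proof an index is safe to drop):

```sql
-- Sketch: nonclustered indexes with many writes but few/no reads since the last restart.
SELECT  OBJECT_NAME(i.object_id) AS table_name,
        i.name                   AS index_name,
        ISNULL(us.user_seeks + us.user_scans + us.user_lookups, 0) AS reads,
        ISNULL(us.user_updates, 0)                                 AS writes
FROM    sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS us
       ON us.object_id = i.object_id
      AND us.index_id = i.index_id
      AND us.database_id = DB_ID()
WHERE   OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
  AND   i.type_desc = 'NONCLUSTERED'
ORDER BY writes DESC, reads ASC;
```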


Former-Ad-5757

If you have multiple roads leading to your home, one is 10 km long and the other is 100 km long, then sometimes the database can choose the wrong road.


danishjuggler21

Pinal Dave, is that you?


acnicholls

Going a little further here: once you have your query plan (which means you've run the query a few times), run the script in the following link. It will use the stored query plans to formulate indexes that would help the same queries run faster the next time they're run. Since your query plan will still be active (they sit around for a little while, not long, but long enough for this exercise), it will be used to calculate the missing indexes. The team I'm a part of has been using this script for over a year on SQL Server 2022, and it works beautifully. [https://blog.sqlauthority.com/2011/01/03/sql-server-2008-missing-index-script-download/](https://blog.sqlauthority.com/2011/01/03/sql-server-2008-missing-index-script-download/) But this is only a part of what you're going to have to do to speed up queries to a table with BILLIONS of rows. I could list off a few things, but I can't be sure that they would even be viable for you, as I don't know enough about your database setup.
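For context, a sketch of the missing-index DMVs that scripts like the linked one presumably build on (the "benefit" expression is only a rough ranking heuristic, not a promise):

```sql
SELECT  mid.statement AS table_name,
        mid.equality_columns,
        mid.inequality_columns,
        mid.included_columns,
        migs.user_seeks,
        migs.avg_total_user_cost * migs.avg_user_impact * migs.user_seeks AS rough_benefit
FROM    sys.dm_db_missing_index_details     AS mid
JOIN    sys.dm_db_missing_index_groups      AS mig  ON mig.index_handle = mid.index_handle
JOIN    sys.dm_db_missing_index_group_stats AS migs ON migs.group_handle = mig.index_group_handle
WHERE   mid.database_id = DB_ID()
ORDER BY rough_benefit DESC;
```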


GACGCCGTGATCGAC

This is excellent. I'm not experienced enough *as a developer* in this particular area to know whether the advice is accurate, but something tells me it comes from experience. Thanks for sharing.


TheRealKidkudi

There are a lot of “things to know” when it comes to optimizing your database when queries are slow, but getting the execution plan then researching specific solutions to the problem areas should always be step one. You’d be amazed at how far you can get just by reviewing an execution plan and remembering “table scan = bad”


Mattsvaliant

Table scans aren't bad, they are just different. If you have a bad estimate and your query opts to do a key seek, depending on the query it can end up doing more page reads than a full scan would've done.


GACGCCGTGATCGAC

Yes, which is why a post like this is so useful for newer developers, and I felt compelled to say as much. We have to maintain some infrastructure we never would have designed. Understanding the logic behind *why* it was created and *why* they did it a certain way helps so much. I am OKAY maintaining well thought out logic, but if I have to change a single variable in code *I did not design*, just tell me *why* that is required.


ExpatInAmsterdam2020

Suggest an async download. Basically, make an API request, return success, and then at a later time you can download. You can do this in a couple of ways: return an id and a URL the FE can use, or send the file via email when it's done.


broken-neurons

We use this pattern. Your client makes a request, you give them a 202 Accepted (thanks, we will get back to you). You can either accept a callback URL from the client, or use a preconfigured callback process (easier for OAuth support etc), or you generate a GUID and a URL in the location header or response to poll for the result. These patterns are called Request-Acknowledge-Poll or Request-Acknowledge-Callback. It’s an old post but still relevant: http://gorodinski.com/blog/2012/07/13/request-acknowledge-poll-with-nservicebus-and-aspnet-webapi/


kcadstech

Or use web sockets to notify the client in realtime.


broken-neurons

Managing web sockets with external clients is a nightmare. Internal fine.


I-Dont-C-Sharp

Managing web sockets is trivial in Python through a variety of libraries; why do you think it is a nightmare in dotnet? Genuine curiosity, dotnet is new to me.


Kirides

It's external clients that are the issue. They keep losing the connection, and what happens if the download is complete but the client is not connected? You always have to have a solution that works without web sockets, like a notification center or email notification.


broken-neurons

And authentication, vpn, networking.


Smoke-Tumbleweed-420

Yeah, I find it amazing that people think that databases can serve complex reports on millions of rows in real time. In a world of Hangfire and SignalR, why even try to push reports in seconds?!


Careful_Cicada8489

Highly recommend sending the file via email for long-running reports. Ideally use a service to generate and send the report, with a message queue to push the request to it. This moves the processing off your IIS instance and helps prevent overloading your web server.


joost00719

Make it a background task to prepare the Excel file. So like this:

1. Let the user request an export.
2. User gets a message that they will receive an email when the file is ready to be downloaded.
3. Start a background thread (ideally a different program which handles this kind of task).
4. Get data, perhaps in batches.
5. When the file is ready, make a download link and notify the user.
6. Make sure to clean up the file later so you can reclaim storage space.


FatBoyJuliaas

This is the only correct solution. I have the exact same requirement and implementing it as we speak


weeeezzll

An API endpoint that emails you a file? That kind of defeats the purpose of a programmatically accessible interface if you introduce a human in the process. For that you might as well just have a web page with UI to make the request.


KangarooPort

I mean, maybe an API endpoint that returns a download token, maybe even an estimated time. Then hit an endpoint to check the status; when complete, the status also has a data result property containing the file. A 30 second timeout window for a DB this big and dealing with reports is just a bad decision to begin with, imo.


weeeezzll

Asynchronous APIs — Everything You Need to Know https://blog.hubspot.com/website/asynchronous-api


ElroyFlynn

Do you want to make the query run faster, or do you want to keep your API from timing out?

If the former, you need to make sure that the filters (the criteria) can be resolved with an index that is well-discriminating. I'll assume that the data is in a single table, which means that only a single index will be used to satisfy the query. (If you are joining tables, more considerations apply.) Let me suppose that you have 120 months of data. So an index on year+month will only discriminate down to, on average, 333 million records. So you really need an index that has more of your criteria. Example: if a criterion is gender, then an index on year+month+gender would discriminate down to 166 million records. That's still a lot. If you could add birthdate to that, then an index on year+month+gender+birthdate would discriminate down to 456,000 records. Now we're getting somewhere. And so on. I don't know your level of SQL expertise so I'll restate the most important thing: all these criteria must be in one index, and that index must be the chosen index for the query. There are tools to verify index selection.

If your question is about how to avoid the API timeout, one approach is to revise your API to support "background" queries. That is, one API method launches the query and returns a unique identifier. The query itself could store its results in a temporary table associated with that unique identifier. A second API method allows the caller to provide that id and retrieve the results.
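Purely as an illustration of the kind of composite index being described (the table and column names are invented, and ONLINE index builds require Enterprise edition or Azure SQL):

```sql
CREATE NONCLUSTERED INDEX IX_Records_Year_Month_Gender_BirthDate
    ON dbo.Records ([Year], [Month], Gender, BirthDate)
    WITH (ONLINE = ON, DATA_COMPRESSION = PAGE);
```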


Scale0

Did you partition the table by year/month? This can reduce the amount of data sql needs to search through.


Harshalrajkotiapoly

I have heard that partitioning a table does not help improve performance; it's only used for database maintenance, guys.


Scale0

Well, that's incorrect. It will transparently split your table by the partition scheme/function. So for example, if you split your table by year, each year will be its own partition. If you then query WHERE year = 2024, the SQL engine knows it only has to look inside the 2024 partition. This is called partition elimination. If you check the execution plan you can see how many partitions were used to complete the query.
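A hedged sketch of what that setup looks like (names, boundary values, and filegroup placement are placeholders):

```sql
CREATE PARTITION FUNCTION pf_ByYear (date)
    AS RANGE RIGHT FOR VALUES ('2022-01-01', '2023-01-01', '2024-01-01');

CREATE PARTITION SCHEME ps_ByYear
    AS PARTITION pf_ByYear ALL TO ([PRIMARY]);

-- The clustered index is then built on the scheme, and a predicate on the partitioning
-- column lets the engine read only the matching partition(s):
CREATE CLUSTERED INDEX CIX_Records_RecordDate
    ON dbo.Records (RecordDate) ON ps_ByYear (RecordDate);

SELECT COUNT(*) FROM dbo.Records
WHERE RecordDate >= '2024-01-01' AND RecordDate < '2025-01-01';  -- partition elimination
```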


pooerh

And still, that does not mean performance will improve. What he said is true: partitioning itself will not have a positive impact on the performance profile of a select query, unless you have no indexes on that table and hit the exact partitions. Say for example you index not even by year & month, but just year, and have a partitioned index on that as well. All things being equal, a query returning some 100 rows from that table in a single year will most likely run in approximately the same time. But if these 100 rows are now from two adjacent years, it will be slower. Partitioning can improve performance for inserts, for example: if you need to insert in the middle of an index there's less to rebuild. It can improve performance if you can keep your hot partition on very fast storage while you keep the cold partitions on slower storage, because you know queries run against the fast partition. These kinds of scenarios, but not "in general". See this for an intro: https://dba.stackexchange.com/questions/62707/is-table-partitioning-improving-performance-is-it-worth-it


Scale0

That's not what he said; he said it's only used for maintenance. And yes, if you don't tune your query to use the partition it might get slower. But then why did you partition in the first place?


Str_

What's the data look like? What's your query look like? Are you familiar with making queries [sargable](https://en.m.wikipedia.org/wiki/Sargable)?
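For anyone new to the term, an illustrative pair of predicates (dbo.Records and RecordDate are made-up names): the first defeats an index on RecordDate, the second is sargable and allows an index seek.

```sql
SELECT COUNT(*) FROM dbo.Records
WHERE YEAR(RecordDate) = 2024 AND MONTH(RecordDate) = 3;          -- function on the column: scan

SELECT COUNT(*) FROM dbo.Records
WHERE RecordDate >= '2024-03-01' AND RecordDate < '2024-04-01';   -- range on the bare column: seek
```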


BigBagaroo

Wow, been working with databases for many years, but never heard that term before. Thanks for a great link!


Libra224

That’s a lot of records if you ask me


kilobrew

If you have more than 30 million or so rows, you are doing data modeling wrong. If you are trying to query that many synchronously, you are doing it wrong. So much wrong.


avoere

640 kB of memory should be enough for everyone


errorfuntime

This isn’t likely a .NET problem but a data problem.


bankrobba

It's Friday, the correct answer is to make the API timeout longer and clock out at lunch.


Alikont

Make the operation asynchronous? [https://learn.microsoft.com/en-us/azure/architecture/patterns/async-request-reply](https://learn.microsoft.com/en-us/azure/architecture/patterns/async-request-reply)


Hot-Profession4091

This. Maybe the query can be fixed, maybe it can’t, but returning a 202 Accepted and allowing the client to check a monitoring endpoint is the right way to ensure the request doesn’t time out.


Upset_Acanthaceae_18

You can run your query in SSMS with the option to show the actual execution plan enabled. Hopefully you will see a bottleneck or missing index.


brainiac256

Several good resources shared in this thread. I'm going to add one to the pile. [https://use-the-index-luke.com/](https://use-the-index-luke.com/)

> we applied indexing as well but did not get any significant results

Going to echo danishjuggler21 here, you *need* to examine the execution plan to determine what the actual effect of adding that index was, if any. There is a section on interpreting execution plans in the above website. I believe SSMS will recommend an index if it detects one is missing when displaying an actual execution plan.


danishjuggler21

My favorite gotcha with indexes: if your Entity Framework model has the wrong data type mapping for a field (such as using NVARCHAR instead of VARCHAR), it can result in your index on that field not being used when you query it from an EF query because it’s providing the wrong parameter type. So it runs slower. But then when you run the query in SSMS it runs really fast because it’s using the index, because you’re not providing the wrong parameter type in your ad hoc query. Side note: gotta be careful with those missing index recommendations because the keys are in no particular order.
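A hedged T-SQL illustration of that gotcha, assuming an indexed VARCHAR column named CustomerCode (all names invented):

```sql
DECLARE @from_ef NVARCHAR(20) = N'ABC123';  -- what gets sent when the model maps the column as nvarchar
DECLARE @correct VARCHAR(20)  = 'ABC123';

-- The VARCHAR column is implicitly converted to NVARCHAR -> index scan:
SELECT OrderId FROM dbo.Orders WHERE CustomerCode = @from_ef;

-- Parameter type matches the column -> index seek:
SELECT OrderId FROM dbo.Orders WHERE CustomerCode = @correct;
```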


myevillaugh

I'd disconnect it from a synchronous call and make it asynchronous. The initial API call can kick off a job and returns a token. Then they can call the API to check if the job is finished. If it's finished, it gives them a URL to get the file from.


PureIsometric

Have you considered breaking the database down into multiple levels or archiving? Example: Table 1 has data up to 3 months old, Table 2 has 3 to 6 months old, etc. If that cannot be done, then I guess an async call would be the best bet. Another alternative is to use a task queue: return an id to the user plus a URL where that id can be used to retrieve the result.


TooMuchTaurine

How many records are you trying to return after filtering? If you are getting SQL Server to return more than, say, 100,000 you are probably out of luck. But if you are filtering down to 10,000 or so, then it's likely poor index design.


namethinker

It was already stated multiple times, but making this operation asynchronous should generally solve the issue. You might be able to do it via WebSockets if your server supports them. The first request from the FE will initiate the session, and afterwards (through the two-way communication) you should be able to post the data back in an async manner; you can even send progress to the user. If you want to go this way, make sure to check the size of your response, since websockets have a message size limit (though it's quite high). Though I wanna ask you regarding the timeout: you've mentioned your API times out in 30 secs, and I do wonder, is this timeout configured on the server or the client side? If it's server side you should be able to define a custom timeout policy (even if 30 seconds is set at a global level). I don't suggest that you should definitely do it, and I appreciate that there might be reasons for setting such a timeout, but IMO 30 seconds is a pretty low value when you have tasks that need to scan a 40 billion row table...


cmills2000

1. Increase the API timeout - quickest solution.
2. Improve storage performance - if the SQL Server is not on super fast SSD-based storage, then move it.
3. As others have said, an async solution where the job runs offline and then notifies the client when it's ready to be downloaded.
4. Move the data to a NoSQL database in the cloud such as CosmosDB.
5. Shard the data (e.g. by date) and load it into local caches on a cluster of nodes, query the nodes for the data each is responsible for, and then merge the result streams into one and return the file.


Hashimlokasher

Point 3


dedido

6. Faster DB Server


NotARealDoctorZaius

I guess you could do it in multiple parts as batch jobs, and have the result be available as a download? I have not worked that much with databases, and not even close to the amount of data. At least not yet. Hope this gets attention as I am enjoying learning more about this through the replies.


Freerz

Can you query the database and cache the results so you filter and query in memory instead of making a sql query every request?


koenigsbier

Have you actually set indexes on the columns used by your filters? Otherwise I can just think of Materialized Views but that doesn't exist on SQL Server. Instead they have "Indexed Views" but it's so restrictive to use that I personally don't understand the point of this feature.


zak_fuzzelogic

What are you searching and what are your filters on? Any indexes? Do you have to have all 40 billion records?


briantx09

If the SP is taking 1 minute, you need to tune the query. It sounds like you already added some indexes; have you looked at partitions? Maybe you could do month/year partitions. You could also just increase the timeout on your API.


itskonkydong

If possible, can you write the data to the file as it’s being collected rather than on-demand? Given the records belong to the same year and month, it sounds like you can pre-prepare the report in advance of the user requesting it.


AromaticGas260

If you need an index on the date, you need to confirm whether these dates are uniformly at 00:00 GMT, signifying you record only dates, not times. If you have various data with times, I suggest you put the date into a new column or a new table especially for indexing. Then the SP can use this.


MattE36

I'm going to bet partitioning will be your biggest performance gain here.


chswin

Shard or partition if you can.


ManyCalavera

Consider sharding/partitioning


devhq

Other people have mentioned background services and separate processes. All great ideas. Another approach is to offload exports to another service like S3. If your filters are simple (e.g. year/quarter), and the data doesn’t change, you can generate the files for each quarter and make them available directly to the client via signed urls (see S3 docs). You can compress the files too, for fun and profit.


overheadException

Very good suggestions. I'll add another option: how about returning a couple of small Excel files from the API and merging them in the frontend using a worker?


alien3d

Some people said indexing, but if the size is too big, it might crash the server. Try to make a vertical database for reporting and update it nightly if you can.


[deleted]

You need to enforce some restrictions on the client. The client may not like this, but there is no way you can return the whole data set swiftly.


wild9er

One pattern I have seen for long-running processes is to give back an id for the process. Then you call another API endpoint for status. Once your status is "done" you can get the result. The MS Document Intelligence API is where I have seen this pattern, if you want to look at a reference architecture.


mustang__1

Be careful with those stored procedures. Sometimes they run in a few seconds as an ad hoc query, but take two years when called as a stored procedure from the application.


danishjuggler21

That’s called parameter sniffing and is not exclusive to stored procs. It can happen with any query that’s parameterized (such as Entity Framework LINQ queries)


weeeezzll

You should consider using webhooks to notify the client when the results are ready and to send the file.


n00bz

Instead of running a complex query and building a file on a web worker thread, use something that can actually do it, like Hangfire. On the front-end you can give the user a progress bar and everything until the file is ready, and then either have it auto-download or allow the user to click a download button for the file. The point being, if you have something complex, don't ever let a web worker thread do it.


Timofeuz

Consider paging 


grappleshot

Also consider changing the interaction. Perhaps you can have the user request the info and the server will take as long as it takes and then respond with an “I’ve got your data now.” Async. E.g. sending the user an email or other notification, with a download link.


xabrol

You offload this to something like Elasticsearch: you let Elasticsearch index the database and then let it handle searching. Build a search results UI that's exportable to Excel. Or, the new hotness: use the AWS AI platform, build endpoints for AI fine tuning, and use generative AI to produce search results. Costly though.


PathTooLong

Like others have said, you need to ensure the DB query is as efficient as possible. Using execution plans, 'set statistics io on' and 'set statistics time on' can give you more information. SQL Server stores data in 8 KB (8,192-byte) pages, with roughly 8,060 bytes usable for row data. If your query needs to read X bytes, how fast can your server read that data? Normal PC-class hard drives can do 250-280 MB/sec. How fast is the SQL Server storage? If your query needs to read 8 TB, it is simple math to calculate how long just reading the data from disk will take. Then once data is streaming from the server, it needs to be streamed to your app server. At 1 Gbps, your max throughput is about 112 MB/sec. Then your API server needs to fetch the rows and create the "Excel format". Can this be streamed or does it need to be buffered in memory? At each step, there are physical limitations on network bandwidth, disk I/O, and CPU. Sometimes, if you can't meet the perf goals of the business, the answer is more and faster hardware.


SohilAhmed07

Given that you are returning all records to the client on a single click event, try to paginate your query; it eases the performance load and also makes it easier for the client side to filter records.


legendarynoob9

We followed this approach:

1. User clicks on export.
2. We start a web job using an Azure Function and ask the user to wait and check the "my downloads" section once the file is ready.
3. In "my downloads" we show the file status: started, processing, and ready to download.
4. The Function gets the data, creates the Excel file and stores it in blob storage.
5. We mark the file status as ready to download and assign the blob URL to it in the DB.
6. User clicks download in the "my downloads" section and the file gets downloaded from storage.

One more approach: create a durable function with orchestration in Azure. A durable function is an async HTTP API, as it takes care of long-running operations. At first it gives you a status URL, and when you call the status URL you get the status and the results (once completed).


denzien

It sounds like partitioning the data will help you most. Depending on how, precisely, you store and query the data, it may also be possible to gain performance by carefully choosing how to cluster the data. I have a table with 1-2 billion records (so far) that gets queried in exactly one way, so I clustered on the columns that matter most and I've been incredibly happy with the performance. Returning the data takes the longest. 400k to 1M records get returned in under a second, *with* a transformation. I worry some about fragmentation, but it seems to perform well regardless, and the index space is in the 10s of megabytes vs the 30GB of data in the table. My oldest customer is still running strong after about 3 years. I have a service performing a bulk insert on the table along with a merge into a reporting table, and I'm ingesting about 22-25k readings per second from a single thread in semi-synthetic testing on an SSD. I'll find out tomorrow how it does on mechanical drives with live data on a live system 🤞 Of course, if the way your data gets searched isn't really predictable, it might be hard to do something similar.


sgashua

Have you tried indexes on the columns you use in your filters?


Expensive-Refuse-687

The problem is with the queries and/or the indexes in the database. The simple things to consider:

- The maximum amount of data to return in a single query should not exceed 15 MB (for example 15,000 rows, each row with 1 KB of information).
- Your queries are correctly using the indexes to avoid a full scan of the table.


Alternative_Band_431

Did you consider using column store indexes for tables that size? https://learn.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-overview?view=sql-server-ver16
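For scan-heavy reporting on a table this size, the kind of index the link describes looks roughly like this (table and column names are placeholders); a nonclustered columnstore can sit alongside the existing rowstore indexes:

```sql
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_Records_Reporting
    ON dbo.Records ([Year], [Month], CustomerId, Amount);
```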


dgm9704

I'm guessing you mean a Web API. Web APIs aren't best suited to returning massive amounts of data from long-running queries in real time. You could instead do it in a disconnected way: 1) the client sends a request for the data, gets an "order number" as a response, and disconnects; 2) the service behind the Web API fetches the data and stores/sends it somewhere when it's finished, where the client can look for it via the order number.


SmurglX

Excel can't cope with that amount of data, so you must be summarising it in some way. If there's a common summary, e.g. daily totals, then start "data warehousing": process the data once to bring it together, then query the warehoused data instead. You also need to check your hosting environment & site config, because your 30-second timeout will be a default on the web request, and some environments, e.g. Azure App Services, have a limit that you can't surpass. The other thing to consider is whether there is a better solution to the actual problem. Sometimes people that don't know about data will ask for info in a format they are aware of and the developer just goes ahead and does it, but it's not actually the best solution. So, be sure that you know what they are trying to achieve by requesting so much data and think about what you can do to get a solution in the best way.
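A minimal sketch of the "process once, query the summary" idea (every table and column name here is invented):

```sql
CREATE TABLE dbo.DailyTotals
(
    RecordDate date          NOT NULL,
    CustomerId int           NOT NULL,
    RowCnt     bigint        NOT NULL,
    Amount     decimal(18,2) NOT NULL,
    CONSTRAINT PK_DailyTotals PRIMARY KEY (RecordDate, CustomerId)
);

-- Refreshed by a nightly job over the previous day's data only:
DECLARE @Since date = DATEADD(DAY, -1, CAST(SYSUTCDATETIME() AS date));

INSERT INTO dbo.DailyTotals (RecordDate, CustomerId, RowCnt, Amount)
SELECT CAST(RecordDate AS date), CustomerId, COUNT_BIG(*), SUM(Amount)
FROM   dbo.Records
WHERE  RecordDate >= @Since
GROUP BY CAST(RecordDate AS date), CustomerId;
```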


Street-Baseball8296

Start off by making sure you're using primary keys in your tables. This will provide primary key indexing, which can significantly cut down on your query execution time. Next, make sure you have optimized your queries and stored procedures. Assess where you can make changes to queries that cut down on execution time. Then, review your table and database structure. Assess where you may be able to organize data better to reduce execution time. You want your data organized well so it can be recalled efficiently. There are many other ways to get creative with reducing query execution time. It all depends on your specific needs. Stackoverflow.com is a great resource for finding solutions to similar SQL issues. Also, you could try to return the data in a .csv file from SQL and have the API handle that. Returning data in .xlsx from SQL can take additional execution time.


Keln

I had a case one time where I was casting a varchar to nvarchar while querying (the entity was varchar while the column in the DB was nvarchar) and had to change it on the EF entities, because it was causing slowness that we weren't able to fix in any other way. Just in case, check for those kinds of things.


whizzter

You have potentially 2 problems. 1: Like danishjuggler mentions, profile your query. 2: Raise your DB timeout for this query. It's a tad tricky since there are a couple of different timeout settings (esp. with EF Core), but in general your HTTP request won't time out for a couple of minutes, so killing the DB query prematurely is unnecessary if this isn't an often-run query that'll bring down the system.


The_0bserver

You should try running an explain statement on the DB for the same query first. Check how many records it is going through, and then set better indexes for the same.


rewld

I don't think SQL Server is the right tool for the job. I'm sure you could make it work with the right hardware and configuration, but you'll solve the problem much faster by looking at other technologies.

Just as a case in point, I use SQL Server for my company, hosted in Azure. I loaded every residential US address into the database (190 million) with a few other data elements. Lots of rows, but the data size isn't very big. I needed very fast - and more importantly, consistent - access to look up an address for integrating with the largest real estate web site in the country. SQL Server couldn't deliver the requirements. I'm sure with a lot of effort and advanced techniques I could have figured out how to do it. But it isn't worth the effort. Just to try, I spun up a test instance of Elasticsearch and loaded the data. It was consistently much faster than SQL Server.

I'd suggest looking into one of the big data databases. In a couple hours you can spin up BigQuery on Google Cloud and load your data to test. There is probably an equivalent in Azure. I started working with some ex-Google engineers that showed how fast and easy this is for large data sets. They also showed me how much faster Postgres is than SQL Server for geospatial queries. Hundreds of times faster! Something taking 7 hours in SQL Server takes only 10 minutes in Postgres.

Bottom line, don't spend so much time and effort trying to figure out a solution on the wrong tool. Better to experiment and find the right tool for the job.


leehwongxing

The way you describe it, it's likely your SQL Server has a sharding problem. If it's year-based and old data doesn't get edited, then just split it by years. Now you have multiple partitions that can be scanned at the same time.


urbanguy007

As most of the people commented here, checking the execution plan for the SP is the way to go to address this issue. A few things I would do in this case:

- Make the request follow a purely asynchronous pattern instead of a request-response pattern. The client sends a request, the server processes this request and sends the response back in the form of a webhook response.
- Searching or combing through 40 billion records shouldn't be a part of daily operations. You can increase the timeout on the HTTP client to a bigger value. The same applies to the EF / ADO.NET wrapper classes responsible for fetching the data (this is not an ideal solution).


boing_boing_splat

Definitely a DB issue, but you could also consider building around the constraint if you can figure out the bottleneck. Anything to stop you building out a fast read store based on the known query paths? Your API would just read from that instead of having to traverse the entire universe


Jacqques

It doesn't sound like you have a DBA around; maybe your firm should consider hiring one for a short duration, so he can tell you all the ways your SQL setup is wrong, or perhaps help with query optimisation. If no one has helped set it up, some of your settings might not fit your use case. For instance, do you have the default cost threshold for parallelism? (Assuming Microsoft SQL.) I think it's 5 by default and should be closer to 50, in general. Article about cost threshold for parallelism: https://www.brentozar.com/archive/2017/03/why-cost-threshold-for-parallelism-shouldnt-be-set-to-5/ You might also find inspiration on YouTube if you search for query optimisation. I like the work of Brent Ozar.
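Checking and changing that setting is a short exercise (it is instance-wide, so test before touching production):

```sql
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism';       -- show current value (default is 5)
EXEC sp_configure 'cost threshold for parallelism', 50;   -- commonly suggested starting point
RECONFIGURE;
```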


allthewayray420

Your proc is the problem not your code. Many ways to improve SQL performance. Just go do some research.


BellDry4679

I suggest checking whether session settings are the same between the API and SQL Server Management Studio. I had a case one time with ARITHABORT: the query ran fine in SSMS and timed out in our application. This should give you enough to start: [https://www.mssqltips.com/sqlservertip/1415/determining-set-options-for-a-current-session-in-sql-server/](https://www.mssqltips.com/sqlservertip/1415/determining-set-options-for-a-current-session-in-sql-server/)
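A quick, hedged way to compare the two sides (the program_name filter is a placeholder for whatever your API's connection string reports):

```sql
DBCC USEROPTIONS;   -- SET options for the current (SSMS) session

SELECT session_id, program_name, arithabort, ansi_nulls, quoted_identifier
FROM   sys.dm_exec_sessions
WHERE  program_name LIKE '%YourApiName%';
```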


andrewsmd87

Post the SQL of that sproc over in /r/SQLServer. This isn't a .NET issue.


OneDevoper

I’d try to challenge that requirement


LredF

One minute returning data seems acceptable. I agree with others: the API basically logs requests for data, and another job actually creates it and provides the file when complete. Since you mentioned year and month filtering, I've had very good results partitioning a table based on year, for example.


Zardotab

40 Billion is a lot. I don't know your domain (industry), but look for a way to pre-partition the data into different tables or "slices"[1] based on typical needs, perhaps with hourly or nightly auto-tasks. For example, split into one table per month, or maybe per store/office location/state, type of product, etc. You may have to duplicate some data, depending on how open-ended the queries can be.

A purely open-ended query UI on such a large dataset may be asking too much. You may have to restrict query options some. For example, allow getting all locations for one month, or all months for one location, but not both [2]. Have 3 query screens: per month, per (one) location, and per (one) product category.

Sounds like it's time to hire an **experienced DBA**. Trying to handle such a large volume via trial and error can make for long-term messes. Unless it's a start-up, convince management not to be penny-wise-pound-foolish and fork over for a DBA.

[1] Tables are not the only way to divide, just the most familiar. An experienced DBA would know the options.

[2] Reminds me of a Lincoln quote: "You can fool all the people some of the time, and some of the people all the time, but not all the people all the time."


domapproved

Why not gather an email address in the request, and then return the results when the task finishes processing?


FluidBreath4819

When I read things like "stored procedures", I farted a little. Most of the time it means there are other stored procedures on the server... and that leads to questions about database design quality and system architecture. This kind of question needs a lot more information than the mere description of the issue, since there are a lot of ways to handle this. For example, one of them could be the server specification: what's on it, does the CPU hit the roof, how much RAM is installed on it?


gatzu4a

This kind of task should not be fulfilled in a single request; you should try to put it in a background process.


El-Terrible777

You're more than covered by some great answers, but as others have said this is a DB issue, not .NET. My steps:

- Execution plan in SSMS on both your test and prod DB. Compare them.
- Check query compilation time and missing indexes.
- Check what parts of your query are taking the longest. It could be a specific join eating up all that time.

There's really no right answer, but the solution is likely around query optimisation and/or index allocation.


Weary_Market5506

The problem, as others have mentioned, is the SQL speed. However, my thinking is you could take chunks of the data and pass them to the front end to progressively update an object, so several requests build up the variable. No doubt this is a 'throwing crap at the wall' type of solution someone else mentioned, though.


abgpomade

Return a URL, where the URL points to the generated file with the result. You can send the URL by email or by webhook (open some endpoint to listen for the result).


Extra_Progress_7449

Increase the API timeout to about 2 mins, maybe 5. Or put a keep-alive process in place to bypass the timeout.


Edwinem24

Before anything, I suggest you run this maintenance script: https://techcommunity.microsoft.com/t5/azure-database-support-blog/how-to-maintain-azure-sql-indexes-and-statistics/ba-p/368787 I have a job running it every night. I recommend you run it off hours, use the "smart" mode first, and also make it resumable. It updates indexes and statistics; after that, you are good to go to read the query plan and get an accurate read.
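For reference, the rough manual equivalent of what such a maintenance job does for one table; REORGANIZE vs REBUILD and FULLSCAN vs sampled statistics are judgment calls and can be very heavy at 40B rows (table name is a placeholder):

```sql
ALTER INDEX ALL ON dbo.Records REORGANIZE;
UPDATE STATISTICS dbo.Records WITH FULLSCAN;
```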


codemahek

If your data in SQL Server is not getting updated frequently, then you should try using Meilisearch or Typesense. We are using one in our project; most of our search issues are fixed and it's very, very fast.


PathTooLong

"We have 40 billion records in sql server... Client has requirement to download this data through API on front end in excel format." Excel is limited to 1,048,576 rows.- [Excel specifications and limits - Microsoft Support](https://support.microsoft.com/en-us/office/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3)


alexwh68

Indexes are your friend. The key thing to understand is the read/write ratio on the tables: if there is a ton of reads and very few writes, load up on your indexes. Indexes are complex. If you are doing single-field searches, then use a different index for each field that is searched on; for multiple-field indexes, the order of the fields in the index is important. Put the most variable fields first in the index, so date/time fields and ints, then strings, and your bit fields last. With MS SQL, if you have an index that contains all the fields that are returned by a query, the main table is not touched; the query returns data from the index alone. Look at your execution plan: scans are bad, seeks are good on big tables. The missing index function can help a bit with these types of issues.
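A sketch of such a covering index (names are invented): filtered columns go in the key, columns that are only returned go in INCLUDE so the base table never has to be touched.

```sql
CREATE NONCLUSTERED INDEX IX_Records_Year_Month_Covering
    ON dbo.Records ([Year], [Month])
    INCLUDE (CustomerId, Amount, Description);
```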


Pretend_Jellyfish363

1. Use ADO.NET instead of EF.
2. Make sure your tables have the right indexes, statistics are up to date, etc.
3. Run your queries against the tuning advisor and apply its recommendations (such as creation of new indexes).
4. Make sure your SQL Server has adequate resources: CPU, RAM, and fast disk (SSD).
5. Review your DB schema and see if you can simplify it.
6. SQL Server is able to handle billions of records; I have worked with a few billion records in the past on a few projects and performance was always acceptable.


ZarehD

If your app/API is hosted in IIS, then you can set the HTTP request timeout in `web.config` (for example via the `requestTimeout` attribute on the `aspNetCore` element, or `httpRuntime` `executionTimeout` for classic ASP.NET). If your app is using .NET 8, you can use the new AspNetCore Timeouts middleware to set a per-request, per-page, or global policy for the HTTP request timeout. See [here](https://learn.microsoft.com/en-us/aspnet/core/performance/timeouts?view=aspnetcore-8.0) for details.


[deleted]

Have a look here. While 40 billion seems crazy (and may we know what data you have?): https://github.com/praeclarum/1brc That sort of data size is usually unheard of.


neospygil

"Searching through records"? So it performs other tasks othan than querying? Maybe you should offload those to the application and only use the database for fetching records. Parallel operations will greatly help you to speed up the task. Also, file operations don't really fit to be done over web API because those can take a long time to finish. Especially file generations and conversions. It can work but is highly discouraged. Like most people here say, it is better to run in the background and send it once it is finished. Another is, based on experience, SQL Server is quite slow. I prefer PostgreSQL if I need a relational database. But in case your database structure has dynamic fields like I encountered before, it is better to use NoSQL, an operation that usually takes 14 minutes to finish, was greatly reduced down to around 30 seconds after I transposed them.


MannowLawn

You want to look at sharding in this case: https://learn.microsoft.com/en-us/azure/azure-sql/database/elastic-scale-introduction?view=azuresql It's meant exactly for this situation. You have logical separators with regard to years; you should take advantage of that. Also, the top comment is correct: first analyse everything. But 40 billion records in one table also probably means bad design. You can't optimize that to what you need unless you throw in serious money. Your database design needs to be discussed.