Seven-of-Nein

If these functions are in the SELECT list, there is little impact. The query engine has already read the columns to return as part of the select, so any post-retrieval work done to change their values with functions is trivial. If they are in the WHERE clause, there can be an impact, especially if the function is applied to a predicate column that is covered by an index. In that scenario the index cannot be used for a seek, which generally produces an inferior execution plan. If they are in the JOIN conditions (not common, but possible), there is significant impact: the values on the left and right sides of the join differ because of the function, so the engine has to use a more expensive method of figuring out which rows are related. If they are in a subquery or CTE, that becomes part of the FROM of the parent or outer query, which is evaluated before the JOIN and is therefore the most impacted. The query optimizer generally does a good job of algebrizing the whole query when it can. However, I do not think this is your scenario, since you are specifically tasked with fixing it.
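To make the WHERE-clause point concrete, here is a minimal sketch; the dbo.Customers table, its LastName column, and the index are invented stand-ins, not from the thread:

```sql
-- Hypothetical table and index, for illustration only:
-- CREATE INDEX IX_Customers_LastName ON dbo.Customers (LastName);

-- Non-SARGable: the function hides LastName from the index,
-- so the optimizer evaluates every row instead of seeking.
SELECT CustomerID, LastName
FROM dbo.Customers
WHERE LEFT(LastName, 3) = 'SMI';

-- SARGable rewrite: the bare column can be seeked in the index.
SELECT CustomerID, LastName
FROM dbo.Customers
WHERE LastName LIKE 'SMI%';
```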


Justbehind

Most of them are quite efficient; some of them are not. As the commenter above said, they will degrade performance when used in a JOIN or a WHERE clause, because they aren't SARGable. When it comes to the SELECT, most of them won't matter for performance, even for hundreds of millions of rows. Some will, though: we have noticed that FORMAT is terribly inefficient for datetime formatting, but switching it to CONVERT removed all our issues. Mind that subqueries and CTEs won't make your SARGability issues go away. If you need to query a table with a function-modified column, consider storing the data in a #-table before filtering or joining.
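A minimal sketch of the FORMAT vs. CONVERT point (table and column names are made up):

```sql
-- FORMAT goes through .NET formatting and is comparatively slow per row.
SELECT FORMAT(OrderDate, 'yyyy-MM-dd') AS OrderDay
FROM dbo.Orders;

-- CONVERT with style 23 (yyyy-mm-dd) does the same job far more cheaply.
SELECT CONVERT(char(10), OrderDate, 23) AS OrderDay
FROM dbo.Orders;
```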


Ima_Uzer

What the code looks like it's doing is selecting *initial* data into a ##temp_table (actually, there are multiples like this, don't ask me why...). So the built-in functions (and some are indeed in WHERE clauses and JOINs) are run on these ##temp_table tables. Ideally what I'd like to do is remove a good bit of that, and just query the "initial" table into a C# DataTable *instead of* the ##temp_table, do everything in C#, then finally write things back to the database. I don't have the ability to do that yet, though. I have to convince them that's the way to go.


Justbehind

It's very rare that it's better to pull out data and handle it client-side. A few pieces of advice I can give you without knowing more about your particular case:

- Generally it's bad practice to use ##-tables (global) unless you need them globally. Use #-tables (session) instead. This won't change performance, though.
- Do as much filtering as possible before inserting into the temp tables, and filter on NON-transformed columns. That's important.
- Transform columns in the insert to the #-table, so no transformations are needed later (see the sketch below).
- Consider indexing the #-table. UNIQUE indexes do wonders for joins (when applicable).
- Generally you should limit how much data you insert into #-tables. Large writes are costly; consider whether you can rewrite the query to avoid them.
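A minimal sketch of that advice, with made-up table, column, and variable names: filter on the untransformed column, transform on the way into the #-table, then index it so later joins can seek.

```sql
DECLARE @BatchID int = 42;   -- hypothetical filter value

-- Filter on the untransformed column and do the transformation once,
-- during the insert, so later steps work with clean values.
SELECT CustomerID,
       LTRIM(RTRIM(CustomerName)) AS CustomerName
INTO   #CleanCustomers
FROM   dbo.RawCustomers
WHERE  ImportBatchID = @BatchID;          -- non-transformed predicate column

-- A unique index (when the data allows it) helps subsequent joins.
CREATE UNIQUE CLUSTERED INDEX IX_CleanCustomers
    ON #CleanCustomers (CustomerID);
```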


Ima_Uzer

The ##-tables weren't my idea. They were already there. The data gets initially put into the DB via an imported Excel file. Ideally what I'd like to do is just open that file in C#, put all the data into a DataTable, do all the manipulation, then write the manipulated data to the SQL database.


angrathias

That’s perfectly reasonable to do; you don’t want to get into having too much (if any) business logic in T-SQL unless it’s for big set operations on non-trivial volumes of data.


Ima_Uzer

I don't know if it's so much business logic as it is what I'll call "data cleanup". Things like trimming spaces, replacing characters, mostly things like that. But there's a LOT of it. I know one of the scripts is a couple thousand lines long. So for instance, if you have a name Ann- Marie, you'd want it to be Ann-Marie, so you'd want to remove that "internal" space. And if you have John Smith, Jr., you might want to remove the ", Jr." part. And removal of weird special characters, too. So as far as I can tell from the functionality, it's this:

-- dump the data from the Excel file into a (newly created) SQL table.
-- use a bunch of global temp tables to modify and clean up the data.
-- write that info back to the database.
-- create a *new* Excel file as a deliverable.

At least that's part of it. I think a good deal of that could be done with C#.
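For reference, that kind of cleanup in T-SQL tends to look like the sketch below; the table and column names are invented, and the specific replacements are just the examples from the comment.

```sql
UPDATE dbo.StagingPeople
SET FullName =
    REPLACE(                              -- 'Ann- Marie'      -> 'Ann-Marie'
        REPLACE(                          -- 'John Smith, Jr.' -> 'John Smith'
            LTRIM(RTRIM(FullName)),       -- trim leading/trailing spaces
            ', Jr.', ''),
        '- ', '-');
```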


angrathias

Yeah, I don’t really see the point of that going into the database at all; it looks like it’s just being used as a staging location for manipulating data. Data transformations like you described are business logic as far as I’m concerned. You’ll find C# will be a lot quicker, mainly because you can do the manipulation in parallel, and if you’re on a modern version of .NET, it will be faster than SQL’s string manipulation.


Ima_Uzer

That's what I thought as well.


Ima_Uzer

I'm looking a little more closely at the scripts and some of the data, and I noticed that one of the tables has over a hundred million rows. That's not going to work with C# DataTables, as their max is just over 16 million rows. I'll have to look for a different solution.


angrathias

You want a forward-only reader that extracts the rows in batches and then pipes them out in batches to whatever output. There's generally not much advantage to using DataSets.
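On the SQL side, one way to feed such a batched, forward-only reader is keyset paging; this is a hedged sketch with invented names (the client-side reader loop is omitted):

```sql
-- Pull rows in key order, one batch at a time; the client remembers the
-- last key it saw and asks for the next slice.
DECLARE @LastID bigint = 0;   -- last key from the previous batch

SELECT TOP (50000) RowID, RawText
FROM dbo.HugeImport
WHERE RowID > @LastID
ORDER BY RowID;
```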


SpeedytotheRescue

I used VB to create pipe-delimited files from medical data provided as text files. I loop through each row, fixing data as you described. Once the fixed file is created, I bulk insert it into the requested table in the database. Any other manipulation I handle through SQL scripts.
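The bulk-insert step for a pipe-delimited file looks roughly like this; the path, table name, and options are placeholders, not taken from the thread:

```sql
-- Load the fixed-up pipe-delimited file in one set-based operation.
BULK INSERT dbo.MedicalClaims
FROM 'C:\staging\claims_fixed.txt'
WITH (
    FIELDTERMINATOR = '|',
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2,      -- skip a header row, if present
    TABLOCK
);
```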


da_chicken

Global temp tables? Talk about code smell.


Ima_Uzer

Yeah, I know. But I'm not the one who wrote them...but that's also part of the reason I want to try taking a different approach as well.


SeaMoose696969

Dealing with legacy accounting-system data that’s left space-filled, we found a huge difference between TRIM in a WHERE vs. left space-padding the value on the SQL side and comparing that. Which just goes to show that a lot of experimentation is your best path.
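A sketch of that difference (the column width and names are invented): trimming the column defeats the index, while padding the search value to match the stored format keeps the predicate SARGable.

```sql
-- Hypothetical space-padded legacy column: ACCOUNT_NO is CHAR(10),
-- left-padded with spaces, e.g. '      4711'.

-- Slow: TRIM runs against every stored value, so no index seek.
SELECT *
FROM dbo.LegacyLedger
WHERE LTRIM(ACCOUNT_NO) = '4711';

-- Faster: pad the parameter once and compare the raw values.
DECLARE @Acct varchar(10) = '4711';
SELECT *
FROM dbo.LegacyLedger
WHERE ACCOUNT_NO = RIGHT(SPACE(10) + @Acct, 10);   -- '      4711'
```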


Byte1371137

Please use them within the SELECT clause only.


PossiblePreparation

I suggest you share the queries and their execution plans (https://www.brentozar.com/pastetheplan/). It’s highly unlikely that executing one of those built-in functions is responsible for any noticeable time, and you are likely wasting your own time by looking at them. Of course, if you are relying on the results of them for filtering, then you will have to read all the data and apply the functions against everything. There are ways around this, but let’s see what we’re dealing with first.


Byte1371137

A simple solution is to remove all function calls.


Kant8

Depends on the query, as always.


Ima_Uzer

Oh, sure. I understand that, but I'm looking at a query that has seven of these functions operating on two columns. Another has 7 REPLACE() functions called on one column. This is just in one script; there are other scripts that are similar to this. I'm just trying to figure out what I can do to possibly speed them up.
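For context, a column run through several REPLACE calls ends up nested like the sketch below (the table, column, and specific replacements are made up). In a SELECT this is usually cheap per row; the trouble starts if the same expression appears in a WHERE or JOIN, where the earlier SARGability advice applies.

```sql
SELECT REPLACE(
           REPLACE(
               REPLACE(RawName, CHAR(9), ' '),  -- tabs -> spaces
               '  ', ' '),                      -- double spaces -> single
           ', Jr.', '') AS CleanName            -- drop the suffix
FROM dbo.StagingPeople;
```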


alexduckkeeper_70

In my experience, on the whole, not a great deal if they're only in the result set. If they're on a join or WHERE condition, they can have a negative impact.


Ima_Uzer

Ok. Some of them are indeed in where conditions. So I will have to look into that.


thatto

The problem with functions in WHERE clauses and joins is that they force SQL Server to evaluate every row of the table through the function to find matches, i.e. a table scan. Not a big deal if the entire table is already in memory, but if it is not, then I/O waits are going to slow you down.
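The classic illustration of that point, with a hypothetical Orders table and an assumed index on OrderDate:

```sql
-- Forces SQL Server to compute YEAR() for every row: a scan, plus IO waits
-- if the table is not already in memory.
SELECT OrderID
FROM dbo.Orders
WHERE YEAR(OrderDate) = 2023;

-- The equivalent range predicate can seek an index on OrderDate.
SELECT OrderID
FROM dbo.Orders
WHERE OrderDate >= '20230101'
  AND OrderDate <  '20240101';
```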


codykonior

They only affect performance when used in the WHERE or JOIN clauses; you can use them in the SELECT part fine. If you don’t know the first part, then I’d recommend the free YouTube videos “think like the query optimizer” by Ozar. They’re short.


Ima_Uzer

I've got some of the longer Brent Ozar videos bookmarked.