Blog ala Vidar

SQL, AppFrame and other cool technologies

Tag Archives: Indexing

Unused Indexes

Two weeks ago, I held a session called “Hardcore SQL” at AppEd where I (among other things) said that you should always start by removing unused indexes before adding missing ones. You can find my blog post about missing indexes here. So, how do you find unused indexes?

SELECT O.name AS ObjectName, I.name AS IndexName,
  user_seeks + user_scans + user_lookups AS Reads,
  user_updates AS Writes, P.rows AS RowsNum
    FROM sys.dm_db_index_usage_stats AS IUS
    INNER JOIN sys.indexes AS I ON I.index_id = IUS.index_id AND IUS.object_id = I.object_id
    INNER JOIN sys.partitions AS P ON P.index_id = IUS.index_id AND IUS.object_id = P.object_id
    INNER JOIN sys.objects AS O ON IUS.object_id = O.object_id
    WHERE O.type = 'U'
      AND IUS.database_id = DB_ID()
      AND I.type_desc = 'nonclustered'
      AND I.is_primary_key = 0
      AND I.is_unique_constraint = 0
      AND P.rows > 10000
    ORDER BY Reads, RowsNum DESC
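If one of these indexes turns out to have lots of writes and hardly any reads, dropping it is a one-liner. A hedged sketch (the table and index names below are placeholders, not output from the query above); script the index out first so you can recreate it if a rarely-run job turns out to need it:

```sql
-- Hypothetical names: drop an index the query above reported as unused
DROP INDEX IX_Orders_ShippedDate ON dbo.Orders
```

Also remember that sys.dm_db_index_usage_stats is emptied when the instance restarts, so make sure the server has been up long enough to cover your real workload before you trust the numbers.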

Btw, I’ve got a Live Mesh folder with SQL scripts. If anyone wants access to it (and maybe wants to add scripts) you know how to contact me.

Missing Indexes

In SQL Server 2005, Microsoft introduced Dynamic Management Views and Functions (DMVs/DMFs). These provide real-time data about indexes, hardware, query usage and many, many other things. Using them, you can actually get index suggestions from SQL Server itself: indexes it believes would improve performance if you implemented them. So, why aren’t these generated automatically? The simple answer is disk usage. If you implemented all of them, you’d end up with A LOT of indexes, and therefore a much bigger database. So, which should you implement?

SELECT
  DENSE_RANK() OVER (ORDER BY MIGS.avg_user_impact DESC, MIG.index_handle) AS IndexRank,
  RIGHT(MID.statement, CHARINDEX('[', REVERSE(MID.statement))) AS table_name,
  MIC.column_id, MIC.column_name, MIC.column_usage,
  MIGS.unique_compiles, MIGS.user_seeks,
  MIGS.avg_total_user_cost, MIGS.avg_user_impact
    FROM sys.dm_db_missing_index_details AS MID
    CROSS APPLY sys.dm_db_missing_index_columns (MID.index_handle) AS MIC
    INNER JOIN sys.dm_db_missing_index_groups AS MIG
        ON MIG.index_handle = MID.index_handle
    INNER JOIN sys.dm_db_missing_index_group_stats AS MIGS
        ON MIG.index_group_handle = MIGS.group_handle
    WHERE MID.statement LIKE '%atbl%'
      AND MIGS.unique_compiles > 10
    ORDER BY
      MIGS.avg_user_impact DESC,
      CASE
        WHEN MIC.column_usage = 'EQUALITY' THEN 1
        WHEN MIC.column_usage = 'INEQUALITY' THEN 2
        ELSE 3
      END

Using this query, you get the suggested indexes, with their columns, ordered by the average user impact. You should consider those that have a relatively high avg_user_impact and/or avg_total_user_cost.
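If you want to turn one of these suggestions into an actual index, the usual pattern is EQUALITY columns first in the key, then INEQUALITY columns, with the INCLUDE columns as included columns. A hedged sketch with made-up table and column names:

```sql
-- Hypothetical: the query above suggested CustomerID (EQUALITY),
-- OrderDate (INEQUALITY) and Amount (INCLUDE) on atbl_Orders
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID_OrderDate
    ON atbl_Orders (CustomerID, OrderDate)
    INCLUDE (Amount)
```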

Handling strings in SQL Server

In .NET, if you have a string it’s:
VB: Dim vMyString As String
C#: string vMyString;

In SQL Server you’ve got more than one choice: varchar, nvarchar (with or without MAX), char, nchar, text and ntext. That’s why SQL Server is better than .NET. No, just kidding. When you’ve got a String in .NET, what’s the equivalent? nvarchar. If you read a varchar into .NET, you’ll end up with an nvarchar anyway, since the text is converted to Unicode.

SQL Server is built to store large amounts of data and retrieve it faster than lightning. It’s not built for string handling, but sometimes you need to do that on the server anyway. That’s when functions like SUBSTRING, LEFT, RIGHT, REVERSE and PARSENAME are useful! The last one, PARSENAME, has saved me A LOT of time. Its purpose is to retrieve table names, schema names etc. from a string like “Server.Database.Schema.Table”. If you call PARSENAME with the second parameter set to 1, you’ll always get the Table part; 2 will always give you the Schema, and so on. Now, how can we use this for something other than retrieving object names?


DECLARE @Name VARCHAR(200)

SET @Name = 'Nordnes, Vidar Stangeland'
SET @Name = REPLACE(@Name, ',', '.')

SELECT
  PARSENAME(@Name, 1) AS FirstName,
  PARSENAME(@Name, 2) AS LastName

The @Name variable contains a name in the format “LastName, FirstName (MiddleName)”. Replacing the comma with a period enables you to use PARSENAME. The cool thing about this is that it returns NULL if the part doesn’t exist. If you didn’t use PARSENAME, you’d have to do something like:


DECLARE @Name VARCHAR(200)

SET @Name = 'Nordnes, Vidar Stangeland'

SELECT
  CASE WHEN CHARINDEX(',', @Name) > 0
	THEN SUBSTRING(@Name,
			CHARINDEX(',', @Name) + 2,
			LEN(@Name))
	ELSE @Name
  END AS FirstName,
  CASE WHEN CHARINDEX(',', @Name) > 0
	THEN LEFT(@Name, CHARINDEX(',', @Name) - 1)
	ELSE @Name
  END AS LastName

WAY more time consuming, for doing something that should be pretty easy. REVERSE is also a great function to know about. As is REPLICATE, STUFF, ISNULL (or COALESCE), LTRIM and RTRIM.

While I’m at it, “asdf “ (with trailing space) isn’t always the same as “asdf” (without trailing space). Just take a look at this example:

SELECT '1' WHERE 'asdf' = ' asdf'
SELECT '2' WHERE 'asdf' = 'asdf '
SELECT '3' WHERE 'asdf ' = ' asdf'
SELECT '4' WHERE 'asdf ' = 'asdf'
SELECT '5' WHERE 'asdf' LIKE ' asdf'
SELECT '6' WHERE 'asdf' LIKE 'asdf '
SELECT '7' WHERE 'asdf ' LIKE ' asdf'
SELECT '8' WHERE 'asdf ' LIKE 'asdf'

Before you run the code, think about what you THINK will be returned. 1-4 are the same as 5-8, except = vs LIKE. My first guess was 0 records. My second guess was 2 and 6. Neither of my guesses was correct, though. (Hint: with =, trailing spaces are ignored because the comparison pads the shorter value with spaces, but with LIKE, trailing spaces in the pattern are significant.)

Remember also that an nvarchar(MAX) field can contain a huge amount of data. Returning it to the client can be very slow, especially if you include the same data in several fields (for instance one field returning HTML data, and one without the HTML formatting), or, as we often see here in Omega, a SearchColumn which is used by our framework for filtering. This field normally consists of all the (n)varchar fields in the table.

Primary key != Clustered Index

There’s been a lot of confusion about this topic. What’s the difference between a primary key, a unique key, a clustered index etc.?
For starters, notice the keywords: KEY and INDEX.
A key is used for identifying something. The primary key is the primary identifier for the ROW. A primary key MUST be unique, and it cannot contain NULL-able fields.
A unique key also ensures uniqueness, but can be defined on NULL-able fields, although only one NULL per field (unlike Oracle).
So, what are the primary (and unique) keys used for? Making relations between tables. To create a foreign key you first need to create a key on the parent table.
When you create a primary key, it defaults to a clustered index. That doesn’t mean it’s the best choice, though!
Let’s say we’ve got a Person table:
Name NVARCHAR(200)
Phone VARCHAR(50)
Address ….

and so on. PersonID will be "NordVida" (the first four characters of the last name and of the first name). It’s clear that PersonID should be the Primary Key, because it identifies the row.
Putting the Primary Key on this field ensures uniqueness. But should it be a clustered index? NO!
A clustered index should be ever-increasing. What does that mean? Each new key should sort after the previous one: it starts at 1, the next is 2, or it starts with "a" and the next is "b".
PersonID will never be ever-increasing: you add Trygve Haaland (HaalTryg), which sorts before NordVida, then you add Werner Waage (WaagWern), which sorts after Trygve, and so on.
With a clustered index on PersonID, SQL Server has to insert into the middle of the clustered index (the "table" itself) EVERY time you insert a record, causing page splits.
And since the clustered key is included in every non-clustered index, a wide or random clustered key makes all the non-clustered indexes pay for it too.

So, what should the clustered index on this table be? Many use an integer field with IDENTITY. That’s a good solution, as long as you’re not importing/exporting data.
Then the identity column is just a big PAIN IN THE ***. Personally I prefer having a uniqueidentifier field with NEWSEQUENTIALID() as the default. It’s unique, ever-increasing and VERY nice to deal with when importing/exporting data.

Let’s sum up:
My opinion is that the primary key should be placed on your "logical key" (in this case, PersonID), and the uniqueidentifier field should be the clustered index.
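To make that concrete, here’s a minimal sketch of the layout I’m describing (column list shortened, constraint names illustrative):

```sql
CREATE TABLE Persons (
	RowID UNIQUEIDENTIFIER NOT NULL
		CONSTRAINT DF_Persons_RowID DEFAULT NEWSEQUENTIALID(),
	PersonID CHAR(8) NOT NULL
		CONSTRAINT PK_Persons PRIMARY KEY NONCLUSTERED,
	Name NVARCHAR(200) NOT NULL,
	Phone VARCHAR(50) NULL
)

-- The ever-increasing surrogate gets the clustered index
CREATE UNIQUE CLUSTERED INDEX IX_Persons_RowID ON Persons (RowID)
```

The primary key still guarantees uniqueness on the logical key (PersonID), while inserts always land at the end of the clustered index.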

The how’s and why’s about indexes, part 2

SQL Server uses something called a B+tree. Some of you probably learned what this is in school, but it took me some time to really understand it. What is a B+tree, and what does it have to do with indexes? Well, when you insert a record into a table (with an index), the indexed values are inserted into a tree to make them easier to search.

Download a little application called STX B+ Tree Demo. Set data type to integer and node slots to 4 and click “Insert Random” and choose “Insert 20 Random Integer Pairs”. This will leave you with something like this:

On the top we’ve got a root node (410). This is the KEY. A bit further down you’ll find 410 once more, with 472 under it: 472 is the VALUE for KEY 410. To the left of the root node you’ll find all keys that are LESS THAN OR EQUAL to the root node; both 104 and 190 are less than 410. This level is called the intermediate nodes, and at the bottom you’ve got the leaf nodes. So, if we’re looking for, say, key 679, we start at the top. It’s greater than 410, so we move right. It’s between 575 and 788, so we follow the pointer between them. Then there are only 4 keys to search until we find what we were looking for: 679, with value 277.

Now, compare this to our world. If you create an index on three columns, all three of them make up the key of the index. If you add included columns, these will exist only in the leaf nodes. But what if you DON’T add included columns to the index, what will the value be then? The RID: the Row ID. You might have seen “RID lookup” in an execution plan. The value of each leaf node always stores a RID (plus any included columns), which points to the correct row in the table. So, one reason for adding included columns is to get rid of the RID lookups. Included columns are also handy for covering indexes, since the total size of the index key (excluding the included columns) can be at most 900 bytes.
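As an example, this is what getting rid of the lookup looks like in practice (table and column names are made up for illustration):

```sql
-- CustomerID, OrderDate and Status form the key (and count toward
-- the 900-byte limit); Amount lives only in the leaf nodes, so a
-- query filtering on the key and selecting Amount needs no lookup
CREATE NONCLUSTERED INDEX IX_Orders_Covering
    ON Orders (CustomerID, OrderDate, Status)
    INCLUDE (Amount)
```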

The how’s and why’s about indexes

Indexes are a big topic in SQL Server. I’m going to try to cover the basics, and hopefully this will help you figure out when and why you should use indexes. First, let’s start with the real basics!

Go get some candy, coke and coffee. This will probably be my longest blog post so far. BUT PLEASE KEEP READING! In SQL Server you’ve got tables that store data. In my example I have a table called Persons with 5 records. Management Studio has a VERY NICE FEATURE called “Include Actual Execution Plan”, which I’ll use to show you what happens.

USE tempdb
GO

CREATE TABLE Persons (
	PersonID CHAR(8) NOT NULL,
	FirstName VARCHAR(50) NOT NULL,
	LastName VARCHAR(50) NOT NULL,
	Email VARCHAR(100) NOT NULL,
	ReverseEMail AS REVERSE(Email)
)

INSERT INTO Persons
  (PersonID, FirstName, LastName, Email)
VALUES
  ('StolJens', 'Jens', 'Stoltenberg', ''),
  ('SolbErna', 'Erna', 'Solberg', ''),
  ('JensSiv', 'Siv', 'Jensen', ''),
  ('SponLars', 'Lars', 'Sponheim', ''),
  ('NordVida', 'Vidar', 'Nordnes', '')

If I now do a SELECT * FROM Persons, everyone knows the results, but here’s the execution plan:

As you see, it’s doing a Table Scan which takes 100% of the cost of this query. A table scan is literally a scan through the whole (or in some cases, parts) of the table. The compute scalar items are there because I’ve got a computed column (ReverseEMail) in my table. We’ll get back to this later.

A table scan is in most cases VERY inefficient, because it has to go through all the rows. If this table had 100,000 records, it would have to go through about 20 MB of data ((8+50+50+100) bytes * 100,000 rows). Of course, when doing a SELECT * from a table without a WHERE clause, there is no other way than scanning the whole table, because all the data has to be returned anyway.

There are two kinds of indexes in SQL Server: clustered and non-clustered. A non-clustered index is, in some ways, a table for the table, storing only the values of the indexed columns. The index also stores a row pointer to the correct record in the actual table. This way, when seeking the index, it can point you to the correct row in the table after it’s finished seeking. In 2005 you can have up to 249 non-clustered indexes per table, but in 2008 you can have up to 999! Before you start creating 999 indexes, though, be aware that indexes also contain data! If you index an integer column, which is 4 bytes, it takes 4 bytes in the table and 4 bytes in the index PER RECORD. So, if you added 999 indexes on it, that would be roughly 4 * 1000 = 4000 bytes PER RECORD.

The other type of index is the clustered index. This is the physical (or actually logical, but just think of it as physical) ordering of the rows. So, when I add a clustered index on the PersonID field, the table is ordered by PersonID. In a clustered index the key (PersonID) lives in the root and intermediate nodes, while the leaf nodes are the actual data pages. I’m not going into details about this, but if you’re interested, have a look at the definition of a B-tree. Anyhow: since the clustered index IS the physical ordering of the rows, a table can only have ONE clustered index. NB! If you have a clustered index on a table, you’ll never see a Table Scan. You’ll see a Clustered Index Scan, which is the same thing.

If we think of a table as a book, the clustered index is the table of contents at the front of the book: a list sorted by chapter (1, 2, 3) with pointers to the actual data (the pages of the book). At the back of the book you’ll find the non-clustered index. That’s an index on the “word” field in our table: the words are sorted ascending, each with pointers to pages, so the word “Assembly” might be found on pages 15, 19 and 333, and the word “Binary” might be found on page 16.

So, now we’re adding a clustered index to the PersonID field, and a non-clustered on the LastName column. This makes searching on LastName faster for the server, because instead of using a table scan (or clustered index scan), it can use the index to look up the values.


CREATE UNIQUE CLUSTERED INDEX IX_Persons_PersonID ON Persons (PersonID)
CREATE NONCLUSTERED INDEX IX_Persons_LastName ON Persons (LastName)

SELECT PersonID
	FROM Persons
	WHERE LastName LIKE 'S%'

Before taking a look at the execution plan for the select statement, let’s guess what it will do. First it will look at the Persons table to see if there’s any indexes on LastName. There is (IX_Persons_LastName). It will seek this for ‘S%’ and then do a lookup to the clustered index to get the PersonID.

TRICKED YOU! 😀 It’s only doing an Index Seek on IX_Persons_LastName. But where does it get the value of PersonID? From IX_Persons_LastName, of course. If a table has no clustered index, SQL Server adds a hidden RowID field to the table and uses it in all non-clustered indexes to keep them “in sync”. But if you do have a clustered index, the clustered key (PersonID) is used as that row pointer instead, so it’s already stored in the non-clustered index. That’s why the non-clustered index alone is enough.

Now, let’s do another query:

SELECT PersonID
	FROM Persons
	WHERE LastName LIKE '%berg'

We’re only getting the PersonIDs of those with a LastName ending in “berg”. If we look at the execution plan, we’ll see that it’s now using an Index Scan instead of an Index Seek.

An index scan is just as bad as a table scan: it has to search the whole index to find the records it’s looking for. Let’s take the book as an example again. Find a book, go to the index in the back and try to find all words ending with “ing”. If your book of choice is a computer book, you’ll eventually find at least “designing”, “hacking” and “implementing”. This will take some time. If you instead try to find all words starting with C, you know where to start and where to end: you start after the last word that starts with B, and you stop when you hit D. That’s the difference between a seek and a scan. But what if you often need to find things that END with something? Do you always have to do an index scan? No. As you might remember, I added a field called ReverseEMail to the table AS REVERSE(Email). This is a computed, non-persisted column, meaning it’s computed on the fly and not stored in the actual database. If we add an index to this field:

CREATE NONCLUSTERED INDEX IX_Persons_ReversedEmail
	ON Persons (ReverseEMail) INCLUDE (FirstName)

SELECT FirstName
	FROM Persons
	WHERE ReverseEMail LIKE REVERSE('')

SELECT LastName
	FROM Persons
	WHERE ReverseEMail LIKE REVERSE('')

SELECT LastName
	FROM Persons WITH (INDEX(IX_Persons_ReversedEmail))
	WHERE ReverseEMail LIKE REVERSE('')

Now let’s have a look at the execution plan again for these three select statements:

Notice that when creating the index I’m creating it on the ReverseEMail field, but I’m also including FirstName. INCLUDE was introduced in SQL Server 2005. An included column is stored in the leaf pages of the index, but not in the index key. In other words, you can’t seek (filter) on it, but you can return it directly from the index.

In the first query you see it’s only doing an index seek on IX_Persons_ReversedEmail, because everything it needs is in that index. In the second query we select LastName instead of FirstName, which is NOT included in the index. The query optimizer then figures out it’s faster to do a clustered index scan than to use the index we created, because it has to go into the clustered index anyway to get the LastName. With more records it probably wouldn’t do this clustered index scan. In the third query I’ve used a table hint to force the optimizer to use the index I created. It then does an index seek on ReverseEMail, followed by a key lookup per record into the clustered index. As you can see from the query cost (relative to the batch), this is 52%, compared to 24% for the others. So even though you might think it’s quicker to use the index we just created, it’s actually not.

Be aware: although all reasonable logic says you’re correct, SQL Server might take an alternative route. When you execute a query, it spends a couple of milliseconds calculating the best way to execute it. If your statistics are up to date (I might write a post about statistics later), it’s very likely that SQL Server is better than you at picking indexes 😉 So, keep your statistics updated and do not mess with table hints unless you really know what you’re doing! If you’re interested, I’ll write a post about statistics, and a more detailed one about execution plans. Please let me know!

So, to our conclusion: SQL Server IS smarter than you when it comes to picking the best execution plan. GET USED TO IT! 🙂 Create indexes on fields that are often used for filtering. If you use GROUP BY or ORDER BY a lot, you should consider adding a clustered index to that field (or those fields).

Merry Christmas everyone!

Query Optimization

When doing query optimization, there are a couple of things to remember:

OR in WHERE clauses KILLS performance! Instead of using OR, use UNION (ALL) or IN

The performance difference between these is big. UNION first removes duplicates and sorts; UNION ALL just concatenates the result sets
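So if you know the two result sets can’t overlap (or duplicates don’t matter), prefer UNION ALL. A quick sketch using the same table names as the examples in this post:

```sql
-- UNION would sort/hash the combined set to remove duplicates;
-- UNION ALL just concatenates the two result sets
SELECT Login FROM table1
UNION ALL
SELECT Login FROM table2
```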

In many cases, you can use:

SELECT Login
	FROM table1
	WHERE Login IN (SELECT Login
				FROM table2)

instead of using:

SELECT Login
	FROM table1
	WHERE EXISTS (SELECT *
			FROM table2
			WHERE table2.Login = table1.Login)

Subselects vs JOIN
In some cases, like in the query-optimizing webcast, subselects are MUCH quicker than using JOINs. This isn’t the case for all queries, so test both!

NOLOCK
When you use NOLOCK, you read uncommitted data from the table. This means there MIGHT be data that isn’t committed to the table yet. In most of our cases this is not a problem.

Do not use SELECT * unless you really need all the columns. When you use *, a covering index can’t be used to return just the fields you need: the execution plan is forced to read from the table (or clustered index) itself.

Indexes and Statistics
Always create good indexes! Make sure indexes and statistics are rebuilt when they need to be! I’ll post an article about indexes and statistics later on.

Tuning SQL code

Have you ever thought of tuning your car? Yeah, why not?
It makes your car run faster and use less fuel. Well, for starters it’s a bit expensive. Tuning SQL, on the other hand, is free, and it’s even more fun than tuning your car. Tuning your SQL means your queries run faster and use less fuel (CPU, memory etc.)!

I’m at ConocoPhillips Indonesia at the moment, upgrading to SQL Server 2005. I’ve spent a couple of days tuning different queries, and at best I got a query down from an average of 4 minutes to about 10 seconds, with basically the same code! And do NOT tell me that was a temporary solution, because my solution was even simpler! 🙂

When writing SQL code, keep these simple rules in mind:
Do not use sub-queries unless it’s ABSOLUTELY necessary
If you are setting several variables from the values of fields in tables/views, use one SELECT statement instead of three SET statements.

For example, do not use:

SET @Variable1 = 'Test'
SET @Variable2 = (SELECT Test FROM Test1 WHERE Test3='asdf')
SET @Variable3 = (SELECT Test2 FROM Test1 WHERE Test3='asdf')


Instead, use:

SELECT
  @Variable1 = 'Test',
  @Variable2 = Test,
  @Variable3 = Test2
    FROM Test1
    WHERE Test3 = 'asdf'

You can also use SELECT instead of SET to set static values
If possible, do not use CASE or UNION (ALL)
Use EXISTS, instead of 0 < (SELECT COUNT(*) something)
Use TOP 1 in scalar sub-queries. Always remember that a sub-query MAY return more than one record; your query will then fail if you’re using Field = (SELECT SubField FROM subtable).
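The EXISTS rule above in code form (table and variable names are illustrative):

```sql
DECLARE @Login VARCHAR(50)
SET @Login = 'NordVida'

-- Slow: counts every matching row before comparing
IF 0 < (SELECT COUNT(*) FROM table2 WHERE table2.Login = @Login)
	PRINT 'Found'

-- Faster: can stop at the first matching row
IF EXISTS (SELECT * FROM table2 WHERE table2.Login = @Login)
	PRINT 'Found'
```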

These rules are things that every developer should keep in mind when writing SQL code.

Of course, there might be places where you’ve followed every rule and it still takes minutes to run. Then you’ll want to look at the execution plan. There are a couple of major things to look for here: scans (Index Scan, Clustered Index Scan, Table Scan etc.) are NOT good; seeks are a good thing.

You might also want to see what’s really happening when you run your query. To turn on the execution plan in Management Studio, press Ctrl+M and rerun your query. It will appear as a new tab beside Results and Messages.

And, as always, Profiler is a good tool. If you let it run for a couple of hours you’ll discover queries that take longer than others. For example, you can set a filter on the Reads column, for >= 50000.

Do you have any other simple rules you’d like to share with us? Please mail them to me, and I’ll publish them here!



We just had “Pizza & Learning” here in Ølen. My opinion: very successful 🙂 First, Peter started by showing all the available functionality in the grid control. After half an hour, the pizza guy FINALLY arrived. 14(?) HUNGRY computer geeks to feed!

After some pizza and chit-chat, I started talking about the new features in SQL 2005 compared to 2000. This includes a couple of new functions and data types (ROW_NUMBER, RANK, VARCHAR(MAX) etc.).

But, wait. THERE’S MORE!

Something that can be, in my opinion, VERY useful for developers is the new Dynamic Management Views and Functions (DMVs/DMFs). These are internal views which give you A LOT of (un)useful information about queries, indexes etc. My example was the following:

SELECT * FROM sys.dm_db_index_physical_stats (5, 665678065, NULL , NULL, 'LIMITED')

5 = DB_ID(‘AppFrameR2’)
665678065 = OBJECT_ID(‘stbl_System_Users’)

This returns everything worth knowing about the indexes on stbl_System_Users. What you should notice here is the “avg_fragmentation_in_percent” column: if it’s high, you might want to rebuild your index. This is why I’ve made a new SP: sstp_System_RebuildIndexes

The first time I ran it on AppFrameR2 it took somewhere between 3 and 5 hours (goldfish memory). Now it only takes 30-45 minutes, and I’ve run it every weekend since. NB! I do NOT recommend running this during work hours! It will slow down, and sometimes LOCK, several tables! Therefore: run it at night, when everyone SHOULD be in bed.
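I can’t publish sstp_System_RebuildIndexes here, but a minimal sketch of what such a job might do (NOT the actual Omega implementation, and the 30% threshold is just a common rule of thumb) looks like this:

```sql
-- Rebuild every index that is more than 30% fragmented
DECLARE @sql NVARCHAR(MAX)
SET @sql = N''

SELECT @sql = @sql + N'ALTER INDEX ' + QUOTENAME(I.name)
            + N' ON ' + QUOTENAME(S.name) + N'.' + QUOTENAME(O.name)
            + N' REBUILD;'
    FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS IPS
    INNER JOIN sys.indexes AS I
        ON I.object_id = IPS.object_id AND I.index_id = IPS.index_id
    INNER JOIN sys.objects AS O ON O.object_id = I.object_id
    INNER JOIN sys.schemas AS S ON S.schema_id = O.schema_id
    WHERE IPS.avg_fragmentation_in_percent > 30
      AND I.name IS NOT NULL

EXEC sp_executesql @sql
```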

Torgeir’s answer to everything:
One other thing, that’s also available in 2000: PROFILER!

Each time I ask Torgeir (Omega’s tech guru number UNO?) a question related to databases, he ALWAYS (I’m not kidding!) answers: “Have you tried Profiler?”
… and each time I get that answer, I turn red like a tomato and mumble his answer to myself. 90% of the time this helps; the other times I ask again and get a “real” answer 😛

Profiler can help you with a lot of things: SQL “debugging” of queries, seeing what happens inside stored procedures, checking WHAT’S taking so long, and a lot of other cool things. If you haven’t tried it: WHY NOT? PS! If anyone’s wondering whether I got a “What would Torgeir do” t-shirt: no, actually not. The Norwegian mail service has lost it!

Arild’s cute little query:

SELECT TOP 5 total_worker_time/execution_count AS 'Avg CPU Time',
  SUBSTRING(st.text, (qs.statement_start_offset/2) + 1,
  ((CASE qs.statement_end_offset WHEN -1 THEN DATALENGTH(st.text)
    ELSE qs.statement_end_offset END - qs.statement_start_offset)/2) + 1) AS statement_text
        FROM sys.dm_exec_query_stats AS qs
        CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
        ORDER BY total_worker_time/execution_count DESC

You might want to try this cute little query, just for fun. Or you might end up tuning the results. Who knows 😉
Anyways, open your copy of Books Online and, in the “Look for:” textbox, write: sys.dm_db
You’ll get a list of dynamic management views and functions. This is excellent reading material for a Saturday night, with a glass (or maybe a bottle?) of wine 🙂