Blog ala Vidar

SQL, AppFrame and other cool technologies

Monthly Archives: December 2008

The how’s and why’s about indexes

Indexes is a big topic when it comes to SQL Server. I’m going to try to tell you the most basics about it. Hopefully this will help you figure out when and why you should use indexes. First, lets start with the real basics!

Go get some candy, coke and coffee. This will probably be the longest blog I’ve done so far. BUT PLEASE KEEP READING! In SQL Server you’ve got tables which stores data. In my example I have a table called Persons. This will have 5 records. In management studio there’s a VERY NICE FEATURE called “Include Actual Execution Plan”. I’ll use this to show you.

USE tempdb

CREATE TABLE Persons (
	PersonID VARCHAR(8) NOT NULL,
	FirstName VARCHAR(50) NOT NULL,
	LastName VARCHAR(50) NOT NULL,
	EMail VARCHAR(100) NULL,
	ReverseEMail AS REVERSE(Email)
)

INSERT INTO Persons
  (PersonID, FirstName, LastName, Email)
VALUES
  ('StolJens', 'Jens', 'Stoltenberg', 'jens@stortinget.no'),
  ('SolbErna', 'Erna', 'Solberg', 'erna@stortinget.no'),
  ('JensSiv', 'Siv', 'Jensen', 'siv@stortinget.no'),
  ('SponLars', 'Lars' ,'Sponheim', 'lars@stortinget.no'),
  ('NordVida', 'Vidar', 'Nordnes', 'vidar@omega.no')

If I now do a SELECT * FROM Persons, everyone knows the results, but here’s the execution plan:

As you see, it’s doing a Table Scan which takes 100% of the cost of this query. A table scan is literally a scan through the whole (or in some cases, parts) of the table. The compute scalar items are there because I’ve got a computed column (ReverseEMail) in my table. We’ll get back to this later.

A table scan is in most cases VERY inefficient, because it has to go through all records with data. If this table had 100.000 records, it would have to go through 20MB of data (8+50+50+100 * 100.000 bytes). Of course, when doing a SELECT * from a table without a where clause, there is no other way than scan the whole table, because you need all data to be returned anyways.

There’s two kinds of indexes in SQL Server. Clustered and non-clustered. Non-clustered is in some ways a table for the table which only stores the values for the indexes columns. In the index there’s also added a rowid field which points to the correct record in the actual table. This way, when seeking the index, it can point you to the correct row in the table after it’s finished seeking. In 2005 you can have up to 249 non-clustered indexes, but in 2008 you can have up to 999! But before you start creating 999 indexes, be aware that indexes also contains data! If you index a column that is a integer, which is 4 bytes, it will be 4 bytes in the table and 4 bytes in the index PER RECORD. So, if you add 999 indexes, this will be 4 * 1000, which is 4000 bytes PER RECORD.

The other type of index is a clustered index. This is the physical (or actually logical, but just think of it as the physical) ordering on the harddisk. So, for instance when I add a clustered index on the PersonID field, it will order the table based on the PersonID field. The clustered index will have the other fields as datapages in the index, but the indexed field (PersonID) will be on the leaf node. I’m not going in to details about this, but if you’re interested, have a look at the definition of a B-tree. Anyhow.. Since the clustered index is the physical ordering of the actual values, so therefore a table can only have ONE clustered index. NB! If you have a clustered index on a table, you’ll never get a Table Scan. You’ll get a Clustered Index Scan, which is the same.

If we think of a table as a book, we’ve got the clustered index in front (Table of contents) of the book, having a sorted list (by chapters 1, 2, 3) of the content with pointers to the actual data (the page of the book). At the end of the book, you’ll find the non-clustered index. This is a index on the “word” field in our table. The word is sorted ascending with a pointer to the page, so the word “Assembly” might be found in pages 15, 19 and 333, and the word “Binary” might be found at page 16.

So, now we’re adding a clustered index to the PersonID field, and a non-clustered on the LastName column. This makes searching on LastName faster for the server, because instead of using a table scan (or clustered index scan), it can use the index to look up the values.

ALTER TABLE Persons
	ADD CONSTRAINT PK_Persons
	PRIMARY KEY CLUSTERED (PersonID)

CREATE NONCLUSTERED INDEX IX_Persons_LastName ON Persons (LastName)

SELECT PersonID
	FROM Persons
	WHERE LastName LIKE 'S%'

Before taking a look at the execution plan for the select statement, let’s guess what it will do. First it will look at the Persons table to see if there’s any indexes on LastName. There is (IX_Persons_LastName). It will seek this for ‘S%’ and then do a lookup to the clustered index to get the PersonID.

TRICKED YOU! 😀 It’s only using a Index Seek on IX_Persons_LastName. But, where does it get the value of PersonID? From IX_Persons_LastName, of course. The reason is that if you have a table with no clustered index, it will add a RowID field to the table, and use this in all non-clustered indexes to keep them “in sync”. But, if you do have a clustered index, it will use this field (PersonID) as the RowID. That’s why it’s only using the non-clustered index.

Now, let’s do another query:

SELECT PersonID
	FROM Persons
	WHERE LastName LIKE '%berg'

We’re only getting the PersonIDs for those that have a LastName ending with “berg”. If we look at the execution plan, we’ll see that now it’s using a Index Scan, instead of an Index Seek

Index scan is just as bad as a table scan. It has to search the whole index to find the records it’s looking for. Let’s take the book as an example again. Find a book, go to the index in the back of the book and try to find all words ending with “ing”. If your book of choice is a computer book, you’ll eventually find at least “designing”, “hacking” and “implementing”. This will take some time. If you then try to find all words starting with C, you know where to start, and where to end. You know that you have to start after the last word that starts with B, and you know you have to stop when you hit D. This is the difference between a seek and a scan. But, what if you often have to find things that ENDS with something? Do you always have to do a index scan? No. As you might remember, I added a field called ReverseEMail to the table AS REVERSE(Email). This is a computed, not persisted field. This means it’s computed on the fly, and therefore not stored to the actual database. If we add an index to this field:

CREATE NONCLUSTERED INDEX IX_Persons_ReversedEmail
	ON Persons (ReverseEMail) INCLUDE (FirstName)

SELECT FirstName
	FROM Persons
	WHERE ReverseEmail LIKE REVERSE('%stortinget.no')

SELECT LastName
	FROM Persons
	WHERE ReverseEmail LIKE REVERSE('%stortinget.no')

SELECT LastName
	FROM Persons WITH (INDEX(IX_Persons_ReversedEmail))
	WHERE ReverseEmail LIKE REVERSE('%stortinget.no')

Now let’s have a look at the execution plan again for these three select statements:

Notice when I’m creating the index I’m doing it on the ReverseEMail field, but I’m also including the FirstName. INCLUDE was introduced in SQL Server 2005. This means that it includes the data for this field in the datapage, but not in the leaf node. In other words, you can’t filter on it but you can select it out of it.

In the first query you see it’s only using a index seek on IX_Persons_ReversedEmail, because it’s got everything it needs in that index. In the second query we select LastName instead of FirstName, which is NOT included in the index. Then the query optimizer figures out it’s faster to do a clustered index scan instead of using the index we created, because it needs to go in to this anyways to get the LastName. If we had more records it would probably not do this clustered index scan. In the third query I’ve used a table hint where I’m forcing query optimizer to use the index I created. Then it’s doing a Index Seek on the ReverseEmail field, and then doing a key lookup per record to the clustered index. As you see from the Query cost (relative to the batch) this is 52%, compared to 24% for the others. That means that even though you might think it’s quicker to use the index we just created, it’s actually not. Be aware, although all reasonable logics says that you’re correct, SQL Server might take an alternative route. When you execute a query, it uses a couple of milliseconds to calculates the best way to execute this query. If your statistics are up to date (I might write a post about statistics later), it’s very likely that SQL Server will be better than you to pick indexes to use 😉 So, keep your statistics updated and do not mess with table hints unless you really know what you’re doing! If you are interested, I’ll write a post about statistics, and one more detailed about execution plans. Please let me know!

So, to our conclusion. SQL Server IS smarter than you when it comes to picking the best execution plan. GET USED TO IT! 🙂 Create indexes on fields that are often used for filtering. If you use GROUP BY or ORDER BY, you should consider adding a clustered index to that (or those) field(s).

Marry Christmas everyone!

SQL Server 2005 SP3

SP3 was released to public the 15th of December. Download it here, and check out the details about fixes here. Notice 959016, 959017 and 959019

Regular expressions

I just had a list of 300 SPs that I wanted to check against sys.sysobjects to see which existed and which didn’t. To do this, I need an IN with the list of all my SPs, and I have to add ‘ at the start of the strings and ‘, (apostrophe and comma) at the end of the strings. Instead of doing this manually, which would take ages, and probably cause my arms to tear off my body, I did the following. ctrl + h (search and replace).
In the “Find what:” I added: ^sstp.*
In “Replace with:” I added: ‘’,
Then, tick of the “Use” regular expressions, and do a replace all 🙂 Saved me A LOT of work!

SQLXML

Ever tried working with XML in SQL Server? Some find it very difficult. I find it.. Interesting? I’m not sure. My opinion is that if possible, stay away from XML, but if you need to use it, here’s some quick starter tips:

USE tempdb;

IF EXISTS (SELECT *
		FROM sys.sysobjects
		WHERE name = 'xmltest1')
	DROP TABLE xmltest1

CREATE TABLE xmltest1 (
	ID INT,
	Name NVARCHAR(200)
)

INSERT INTO xmltest1
  (ID, Name)
VALUES
  (1, 'Jens Stoltenberg'),
  (2, 'Erna Solberg'),
  (3, 'Siv Jensen'),
  (4, 'Lars Sponheim')

SELECT *
	FROM xmltest1
	FOR XML AUTO

SELECT *
	FROM xmltest1
	FOR XML AUTO, ELEMENTS

DECLARE @MyXML AS XML
SET @MyXML = (SELECT *
		FROM xmltest1
		FOR XML PATH('Politician'), ROOT('Politicians'))

SELECT @MyXML
/*
<Politicians>
  <Politician>
    <ID>1</ID>
    <Name>Jens Stoltenberg</Name>
  </Politician>
  <Politician>
    <ID>2</ID>
    <Name>Erna Solberg</Name>
  </Politician>
  <Politician>
    <ID>3</ID>
    <Name>Siv Jensen</Name>
  </Politician>
  <Politician>
    <ID>4</ID>
    <Name>Lars Sponheim</Name>
  </Politician>
</Politicians>
*/

DECLARE @docHandle AS INTEGER
EXEC sp_xml_preparedocument @docHandle OUTPUT, @MyXML

SELECT *
	FROM OPENXML(@docHandle, N'/Politicians/Politician')
	WITH (ID INTEGER 'ID', Name NVARCHAR(200) 'Name')

SELECT @MyXML.query('/Politicians/Politician/Name')

/* Clean up*/
EXEC sp_xml_removedocument @docHandle
DROP TABLE xmltest1

I’m creating a test-table with two columns, and adding some records. These are well known Norwegian politicians. If you run this script, you’ll see it will return 5 record sets. The first one is from FOR XML AUTO. This means it will have one element pr record, with attributes for each field. The next one (FOR XML AUTO, ELEMENTS) will return one element pr record, with child elements for each field. The third (SELECT @MyXML) will have one root element, Politicians, with one element (Politician) pr record, and child elements for each field, as shown in the comments.

The fourth is using OPENXML instead of FOR XML. This means I have to “prepare” a document for the SQL server. I’m adding it, and getting back an ID to that document (using @docHandle). With this I’m querying the XML, to get SQL results, so the fourth is actually the same as running SELECT * FROM xmltest1. Be aware that I’m naming the fields. ID INTEGER ‘ID’. This means that it’s picking the ID element, and putting it in the ID field. For getting attributes, you’ll use @ID. For example ID INTEGER ‘@ID’. You can also query parent elements by doing ‘../ID’ or ‘../@ID’.

The last one is querying the @MyXML variable. Here I’m getting a XML based on @MyXML. I only get the Name elements.

NB!!! I’m using INSERT with multiple values. This is only supported by SQL Server 2008. The rest is supported by both 2005 and 2008.

EASY SECURITY ACTIONS

I should probably call this FREE PORN. That way everyone would have read it, but anyways.. Although this is a bit of a boring subject, this will only take two minutes, and it can save you many hours afterwards! The topic is security. I’ve gathered some security actions after watching a couple of TechEd Sessions from Steve Riley about Security. All my bullet points are COST FREE as in no money spent, and almost no time spent implementing. Then, why don’t you already do it? I don’t care, but please start doing it.

CLIENTS

  • Anti-malware
    This includes virus, spyware, adware+++. Vista comes with Windows Defender, but AVG FREE is a good alternative.
  • Be skeptical!
    Never give your password to ANYONE. This includes your administrator! If I really want/need (for testing purposes) to use your account, I can reset your password to something.
    Never click on anything that you don’t know the results of. For example, lately I’ve gotten many links from a contact on MSN. Typically “Try this cool website: http://vidar.somehost.com”. I have of course changed the host so you won’t click on it 😉 The reason I got suspect was that this person is Norwegian. Why should she start a conversation with an English sentence and a link? And why is she offline? She’s clicked on one of those links herself, added her msn address and password to the site, and there you go. This also goes to e-mails. If for some reason spam gets through our very tough spam-filter, be skeptical! For the record, I’ve blocked this person from MSN. When she’s stopped clicking on stupid links, I’ll unblock her.
  • Do not run as administrator
    Most common applications does not have any problems running as normal user
  • Ensure Microsoft update is running frequently
  • Firewall
    Both XP and Vista comes with built-in firewalls. DO NOT DISABLE THEM! If you’ve got apple, throw it in the garbage and buy a real computer.
  • Lock your computer when you leave it
    THIS COST YOU about 14 calories, but believe me, you want to loose these 14 calories. Have you ever thought about how easy it is to for example go in to the Omega HQ office at for example lunch time, go in to an empty office and play with your computer? Many of you leave your computer when you’re logged on to MarketMaker, logged in with management studio to our SQL Servers and maybe remote desktop. Now, use one minute to think about what the intruder could do, just because you want to save your 14 calories. This is the easiest security action to take, less time consuming and it’s totally cost free, and also, it’s the most efficient!
  • Uninstall unused applications
    This will not only increase security, it will also increase your computers performance!
  • Use a good password
    Your name is not a good password. Neither is the one you’ve used for 10 years. Use a passphrase and change it every now and then. For example: ILoveEasySecurityActions! It’s easy to remember, it’s secure because of the length, the use of lower and uppercase, and it’s got special signs (!).

SERVERS

  • Anti-malware
    For filestores
  • Firewall
    All windows servers comes with built in firewalls. Leave them on, and also install a hardware firewall on your network to protect you.
  • Microsoft Update
  • Reduse the amount of admins
    The more people that have access, the more likely the server is to be exploited!
  • Uninstall unused applications
    SQL Servers should NOT run ANYTHING ELSE than SQL Server. Same goes to web-servers, DNS servers etc!