Saturday, September 29, 2007

Indexes

An index provides fast access to data when the data can be searched by the value that is the index key.Indexes allow data to be organized in a way that allows optimum performance when you access or modify it.
You can use indexes to quickly find data rows that satisfy conditions in your WHERE clauses, to find matching rows in your JOIN clauses, or to efficiently maintain uniqueness of your key columns during INSERT and UPDATE operations. In some cases, you can use indexes to help SQL Server sort, aggregate, or group your data or to find the first few rows as indicated in a TOP clause.

It is the job of the query optimizer to determine which indexes, if any, are most useful in processing a specific query. The final choice of which indexes to use is one of the most important components of the query optimizer's execution plan.

clustered and nonclustered indexes in SQL Server store their information using standard B-trees.B-trees cluster records with similar keys. The B stands for balanced, and balancing the tree is a core feature of a B-tree's usefulness. The trees are managed, and branches are grafted as necessary, so that navigating down the tree to find a value and locate a specific record takes only a few page accesses. Because the trees are balanced, finding any record requires about the same amount of resources, and retrieval speed is consistent because the index has the same depth throughout.

An index consists of a tree with a root from which the navigation begins, possible intermediate index levels, and bottom-level leaf pages. You use the index to find the correct leaf page. The number of levels in an index will vary depending on the number of rows in the table and the size of the key column or columns for the index. If you create an index using a large key, fewer entries will fit on a page, so more pages (and possibly more levels) will be needed for the index. On a qualified select, update, or delete, the correct leaf page will be the lowest page of the tree in which one or more rows with the specified key or keys reside. A qualified operation is one that affects only specific rows that satisfy the conditions of a WHERE clause, as opposed to one that accesses the whole table. In any index, whether clustered or nonclustered, the leaf level contains every key value, in key sequence.


Clustered Indexes


The leaf level of a clustered index contains the data pages, not just the index keys. Another way to say this is that the data itself is part of the clustered index. A clustered index keeps the data in a table ordered around the key. The data pages in the table are kept in a doubly linked list called the page chain. The order of pages in the page chain, and the order of rows on the data pages, is the order of the index key or keys. Deciding which key to cluster on is an important performance consideration. When the index is traversed to the leaf level, the data itself has been retrieved, not simply pointed to.

Because the actual page chain for the data pages can be ordered in only oneway, a table can have only one clustered index. The query optimizer strongly favors a clustered index because such an index allows the data to be found directly at the leaf level. Because it defines the actual order of the data, a clustered index allows especially fast access for queries looking for a range of values. The query optimizer detects that only a certain range of data pages must be scanned.


Most tables should have a clustered index. If your table will have only one index, it generally should be clustered. Many documents describing SQL Server indexes will tell you that the clustered index physically stores the data in sorted order. This can be misleading if you think of physical storage as the disk itself. If a clustered index had to keep the data on the actual disk in a particular order, it could be prohibitively expensive to make changes. If a page got too full and had to be split in two, all the data on all the succeeding pages would have to be moved down. Sorted order in a clustered index simply means that the data page chain is in order. If SQL Server follows the page chain, it can access each row in clustered index order, but new pages can be added by simply adjusting the links in the page chain.

In SQL Server 2000, all clustered indexes are unique. If you build a clustered index without specifying the UNIQUE keyword, SQL Server forces uniqueness by adding a uniqueifier to the rows when necessary. This uniqueifier is a 4-byte value added as an additional sort key to only the rows that have duplicates of their primary sort key. You'll be able to see this extra value when we look at the actual structure of index rows later in this chapter.


Nonclustered Indexes


In a nonclustered index, the lowest level of the tree (the leaf level) contains a bookmark that tells SQL Server where to find the data row corresponding to the key in the index.
If the table has a clustered index, the bookmark is the clustered index key for the corresponding data row. If the table is a heap (in other words, it has no clustered index), the bookmark is a row identifier (RID), which is an actual row locator in the form File#:Page#:Slot#. (In contrast, in a clustered index, the leaf page is the data page.)
The presence or absence of a nonclustered index doesn't affect how the data pages are organized, so you're not restricted to having only one nonclustered index per table, as is the case with clustered indexes. Each table can include as many as 249 nonclustered indexes, but you'll usually want to have far fewer than that.


When you search for data using a nonclustered index, the index is traversed and then SQL Server retrieves the record or records pointed to by the leaf-level indexes. For example, if you're looking for a data page using an index with a depth of three—a root page, one intermediate page, and the leaf page—all three index pages must be traversed. If the leaf level contains a clustered index key, all the levels of the clustered index then have to be traversed to locate the specific row. The clustered index will probably also have three levels, but in this case remember that the leaf level is the data itself. There are two additional index levels separate from the data, typically one less than the number of levels needed for a nonclustered index. The data page still must be retrieved, but because it has been exactly identified, there's no need to scan the entire table. Still, it takes six logical I/O operations to get one data page. You can see that a nonclustered index is a win only if it's highly selective.


Figure illustrates this process without showing you the individual levels of the B-trees. I want to find the first name for the employee named Anson, and I have a nonclustered index on the last name and a clustered index on the employee ID. The nonclustered index uses the clustered keys as its bookmarks. Searching the index for Anson, SQL Server finds that the associated clustered index key is 7. It then traverses the clustered index looking for the row with a key of 7, and it finds Kim as the first name in the row I'm looking for.







Creating an Index

The typical syntax for creating an index is straightforward:

CREATE [UNIQUE] [CLUSTERED | NONCLUSTERED] INDEX index_name
ON table_name (column_name [ASC | DESC][,...n])




CREATE INDEX has some additional options available for specialized purposes. You can add a WITH clause to the CREATE INDEX command:

[WITH
[FILLFACTOR = fillfactor]
[[,] [PAD_INDEX]
[[,] IGNORE_DUP_KEY]
[[,] DROP_EXISTING]
[[,] STATISTICS_NORECOMPUTE]
[[,] SORT_IN_TEMPDB]
]


FILLFACTOR is probably the most commonly used of these options. FILLFACTORlets you reserve some space on each leaf page of an index. In a clustered index, since the leaf level contains the data, you can use FILLFACTOR to control how much space to leave in the table itself. By reserving free space, you can later avoid the need to split pages to make room for a new entry

If you need to, you can use the DBCC DBREINDEX command to rebuild the index and reestablish the original FILLFACTOR specified.

Tip If you plan to rebuild all of a table's indexes, simply specify the clustered index with DBCC DBREINDEX. Doing so internally rebuilds the entire table and all nonclustered indexes.


FILLFACTOR isn't usually specified on an index-by-index basis, but you can specify it this way for fine-tuning. If FILLFACTOR isn't specified, the serverwide default is used. The value is set for the server via sp_configure, fillfactor. This value is 0 by default, which means that leaf pages of indexes are made as full as possible. FILLFACTOR generally applies only to the index's leaf page (the data page for a clustered index). In specialized and high-use situations, you might want to reserve space in the intermediate index pages to avoid page splits there, too. You can do this by specifying the PAD_INDEX option, which uses the same value as FILLFACTOR.


The DROP_EXISTING option specifies that a given index should be dropped and rebuilt as a single transaction. This option is particularly useful when you rebuild clustered indexes. Normally, when a clustered index is dropped, every nonclustered index has to be rebuilt to change its bookmarks to RIDs instead of the clustering keys. Then, if a clustered index is built (or rebuilt), all nonclustered indexes need to be rebuilt again to update the bookmarks. The DROP_EXISTING option of the CREATE INDEX command allows a clustered index to be rebuilt without having to rebuild the nonclustered indexes twice. If you are creating the index on the exact same keys that it had previously, the nonclustered indexes do not need to be rebuilt at all. If you are changing the key definition, the nonclustered indexes are rebuilt only once, after the clustered index is rebuilt.

You can ensure the uniqueness of an index key by defining the index as UNIQUE or by defining a PRIMARY KEY or UNIQUE constraint. If an UPDATE or INSERT statement would affect multiple rows, and if even one row is found that would cause duplicate keys defined as unique, the entire statement is aborted and no rows are affected. Alternatively, when you create the unique index, you can use the IGNORE_DUP_KEY option so that a duplicate key error on a multiple-row INSERT won't cause the entire statement to be rolled back. The nonunique row will be discarded, and all other rows will be inserted or updated. IGNORE_DUP_KEY doesn'tallow the uniqueness of the index to be violated; instead, it makes a violation in a multiple-row data modification nonfatal to all the nonviolating rows.

The SORT_IN_TEMPDB option allows you to control where SQL Server performs the sort operation on the key values needed to build an index. The default is that SQL Server uses space from the filegroup on which the index is to be created. While the index is being built, SQL Server scans the data pages to find the key values and then builds leaf-level index rows in internal sort buffers. When these sort buffers are filled, they are written to disk. The disk heads for the database can then move back and forth between the base table pages and the work area where the sort buffers are being stored. If, instead, your CREATE INDEX command includes the option SORT_IN_TEMPDB, performance can be greatly improved, particularly if your tempdb database is on a separate physical disk from the database you're working with, with its own controller. You can optimize head movement because two separate heads read the base table pages and manage the sort buffers. You can get even more improvement in index creation speed if your tempdb database is on a faster disk than your user database and you use the SORT_IN_TEMPDB option. As an alternative to using the SORT_IN_TEMPDB option, you can create separate filegroups for a table and its indexes—that is, the table is on one filegroup and its indexes are on another. If the two filegroups are on different disks with their own controllers, you can also minimize the disk head movement.


Constraints and Indexes


when you declare a PRIMARY KEY or UNIQUE constraint, a unique index is created on one or more columns, just as if you had used the CREATE INDEX command. The names of indexes that are built to support these constraints are the same as the constraint names. In terms of internal storage and maintenance of indexes, there is no difference between unique indexes created using the CREATE INDEX command and indexes created to support constraints. The query optimizer makes decisions based on the presence of the unique index rather than on the fact that a column was declared as a primary key. How the index got there in the first place is irrelevant to the query optimizer.


When you create a table that includes PRIMARY KEY or UNIQUE constraints, you can specify whether the associated index will be clustered or nonclustered and you can also specify the fillfactor


If you check the documentation for CREATE TABLE and ALTER TABLE, you'll see that the SORT_IN_TEMPDB option is not available for either command. It really doesn't make sense to specify a sort location when you first create the table because there's nothing to sort. However, the fact that you can't specify this alternate location when you add a PRIMARY KEY or UNIQUE constraint to a table with existing data seems like an oversight. Also note that SORT_IN_TEMPDB is not an option when you use DBCC DBREINDEX. Again, there's no reason why it couldn't have been included, but it isn't available in this release


The biggest difference between indexes created using the CREATE INDEX command and indexes that support constraints is in how you can drop the index. The DROP INDEX command allows you to drop only indexes that were built with the CREATE INDEX command. To drop indexes that support constraints, you must use ALTER TABLE to drop the constraint.


the output of sp_helpindex tells us only the names of the indexes, the property (clustered or unique), and the columns the index is on. It doesn't tell us if the index supports a constraint. However, if we execute sp_help on a table, the output will tell us that.


The Structure of Index Pages

Index pages are structured much like data pages. As with all other types of pages in SQL Server, index pages have a fixed size of 8 KB, or 8192 bytes. Index pages also have a 96-byte header, but just like in data pages, there is an offset array at the end of the page with two bytes for each row to indicate the offset of that row on the page. However, if you use DBCC PAGE with style 1 to print out the individual rows on an index page, the slot array is not shown. If you use style 2, it will only print out all the bytes on a page with no attempt to separate the bytes into individual rows. You can then look at the bottom of the page and notice that each set of 2 bytes does refer to a byte offset on the page. Each index has a row in the sysindexes table, with an indid value of either 1, for a clustered index, or a number between 2 and 250, indicating a nonclustered index. An indid value of 255 indicates LOB data (text, ntext or image information). The root column value contains a file number and page number where the root of the index can be found. You can then use DBCC PAGE to examine index pages, just as you do for data pages.


There are basically three different kinds of index pages: leaf level for nonclustered indexes, node (nonleaf) level for clustered indexes, and node level for nonclustered indexes. There isn't really a separate structure for leaf level pages of a clustered index because those are the data pages, which we've already seen in detail. There is, however, one special case for leaf-level clustered index pages.


Clustered Index Rows with a Uniqueifier


If your clustered index was not created with the UNIQUE property, SQL Server adds a 4-byte field when necessary to make each key unique. Since clustered index keys are used as bookmarks to identify the base rows being referenced by nonclustered indexes, there needs to be a unique way to refer to each row in a clustered index. SQL Server adds the uniqueifier only when necessary—that is, when duplicate keys are added to the table

....
.
.
.
.
.


Index Space Requirements


Types of Fragmentation





Special Indexes


SET Options
The following seven SET options can affect the result value of an expression or predicate, so you must set them as shown to create indexed views or indexes on computed columns:

SET ARITHABORT ON
SET CONCAT_NULL_YIELDS_NULL ON
SET QUOTED_IDENTIFIER ON
SET ANSI_NULLS ON
SET ANSI_PADDING ON
SET ANSI_WARNINGS ON
SET NUMERIC_ROUNDABORT OFF




Indexed Views



Using an Index

No comments: