Saturday, August 11, 2007

Efficient Indexing for Performance


Most important, always be more circumspect about creating an index rather than not creating an index. Many databases get so convoluted and mixed up with over-indexing that, after long periods of time, no one knows who created what — and why. Never be afraid of not creating an index. It follows that you should not always assume that an existing index should exist, simply because it does exist.
Types of Indexes
Some databases allow different types of indexing. In relational databases, one index type is most commonly used. That index type is usually some form of binary tree index (BTree). Other index types are nearly always rare and only applicable in specialized cases. Be aware of the needs of, and consequences for, using special types of indexing such as ISAM, hash, or bitmap indexing.
Note
Different database engines use index structures and methods in different ways. Some examples are warranted here. Where one database engine allows creation of separate physical ISAM indexes, another database may use an ISAM algorithm to access a BTree index. Some databases allow creation of hash indexes as separate physical structures; others only allow application of hashing algorithms to all fields in an entire table. Some database engines allow creation of BTree indexes both as an index, and as a sorting sequence on an entire table. The table itself, and all its fields, become a BTree index (known as a clustered index or an index organized table).
They have very specialized applications and are not commonly used. Also, be aware that these less commonly used index types are often subject to overflow when changes are made to source tables. In reality, unusual types of indexes can often be subject to performance-crippling forms of overflow. Overflow is where an index has its performance created index structure completely diverted from and partially undone, by data changes to tables. Most of these unusual types of indexes are more often that not for read-only type environments, and should generally be implemented and applied with great care and forethought beforehand.
Database administrators should always keep a watchful eye on indexing in a database. Of course, there is never really available time, but when an application is released it is always best to re-examine all indexing. Quite often, developers will create many indexes, sometimes each creating their own sets of indexes, for the same application. The result is over-indexing. Too many indexes on a table create a performance problem. Effectively, executing a table change command (INSERT, UPDATE, or DELETE) on a table executes the same command on all of its indexes in addition to just the table. For example, inserting a single record into a table with four indexes comprises five changes to the database.
How to Apply Indexes in the Real World
There are various ways in which indexes can be applied, generally dependent on the function of the table for which an index is created:
No index — Data in a table is heap structured (in a heap or disorganized pile). Both small and large tables can benefit from having no indexing. Small tables may be best accessed as a whole, rather than with table and index, because they access a small amount of physical space. Large tables could very well only be read in their entirety based on application requirements. Why index a table when indexes are never used? It is even sometimes beneficial to drop referential integrity keys and indexes.
Static data — Small static data tables are often better off without indexing. Be aware of two potential problems: removing foreign key indexes can cause serious locking problems that can drastically hamper performance; and highly complex joins with many tables usually benefit from all tables having indexes, particularly unique primary key indexes, even on small static tables.
Dynamic dataDynamic data is data that changes all the time (transactional data). These indexes are changed frequently, are subject to overflow, and require frequent refreshing. Be acutely aware of the type of index used for dynamic data. The default index type for a particular database is usually the best option for dynamic data. This index type is usually some form of binary tree indexing structure. Other index types involving pre-calculated structures such as ISAM, hash tables, and bitmaps will overflow almost immediately when subject to change.
Note
Overflow of an index is seriously ugly for performance. Index overflow happens to certain types of indexes where any changes to data in tables cannot be slotted into the proper physical point in the original structure of the index. This is because of the way in which certain index types are constructed. The result of overflow and a lot of data changes could be query I/O quite literally bouncing all over disk storage trying to find data. This can cause serious performance problems.
Read only reporting index — Unlike dynamic data indexing, read-only data is much more flexible with respect to index types, because data is not subject to change. Read-only indexing is specially designed for read only queries, often in data warehouses. Types of indexing in this category are pre-built structures subject to overflow but highly efficient when used for read only I/O activity. Index types proficient as read-only indexes (bitmaps, clusters, hash tables) are ineffective in highly dynamic environments.
Unique and non-unique indexes —A unique index is an index allowing only a single value in a table. Be careful when creating unique indexes because every insertion or update to a uniquely indexed field (or fields) requires a scan of the entire index space (to verify a value as unique). A non-unique index allows more than one record with the same value and is typical of foreign key indexes. Unique indexing is better for performance and is typical of primary keys. A unique index is better for performance because subset index searching can be used to find single records, in theory making for less I/O, and less traversal through index structures.
Single field versus multiple field indexes — Multiple field indexes are generally known as composite field indexes. Single field indexes are more efficient than composite multiple field indexes. The simple fact is the fewer fields, the less to search through. Also, fewer fields means the index is relatively much smaller than its parent table. The bigger the relative difference in size between table and index, the more effective the index is at reducing I/O, especially for larger tables.
Datatypes to index — Integers are always best. An integer is a whole number with no digits to the right of the decimal point. Any other datatypes are nearly always flexible in terms of both length and content. Fixed-length strings are not quite as efficient as integers but they can be good options for index construction if the strings are a few characters, such as with use of codes. Quite often, codes are used to represent structures, such as code names for states in the United States. For example, NY represents the state of New York and CA represents California. Numbers are still better because there are 10 possible digits. Character strings have 26 different variations for the letters of the alphabet, plus 10 possible digits (strings are alphanumeric and can contain numerals as well), plus all sorts of punctuation and other odd characters.
Sacrificing referential integrity for performance — Sometimes this is a good idea, but most often it is not. Dropping of foreign key indexing can cause serious locking issues. Referential integrity uses primary and foreign keys to validate relationships between records in related tables. If there is a lot of validation occurring, and a table containing a foreign key has no foreign key index, the child table could be frequently fully scanned, resulting in huge competition for the entire table rather than just the index.
Optimizing with alternate indexes — Alternate indexing is often referred to as secondary indexing. Alternate indexing includes any indexes created against tables in a database model, which are not part of referential integrity constraints. Quite often, the need for alternate indexing is a mismatch between the database model and functionality required by applications. Excessive alternate indexing could indicate the database model not catering to application needs. Reporting or data warehousing application requirements will usually cause further demand for alternate indexing.
When Not to Use Indexes
Possibly the most important question is when should an index not be created? There are some circumstances where indexes can be detrimental to performance and sometimes those indexes should not exist. Sometimes (in fact, quite frequently), it is better for query performance for indexes to be ignored, and the entire table be read. The following explains when indexes should be avoided.
A table with a small number of fields may not benefit from an index if a large percentage of its records are always retrieved from it. Creating an index will not prevent full table scans. Note that removing primary key or foreign keys is not a good idea.
Small static data tables are often small enough to be read as a table scan rather than an index scan, plus a point into a table. Let's explain by example. Underlying I/O activity in the operating system (UNIX, Windows, Linux, and many others) is read in what are sometimes called blocks or pages. These pages can be many sizes, but usually at the database level the pages become 2 KB, 4 KB, 8 KB, 16 KB, and sometimes even 32 KB pages. The point to understand is that if a table has a small enough number of records, to occupy a single physical page, why read an index and point to a table? Reading the index and the table is reading two pages, when reading the table only constitutes an I/O on a single page only.
Often, tables created for reporting or during data warehouse periodical appending (batch updates) may already be in the required physical order.
Indexes should usually be created on a small percentage of the fields in a table. Large composite indexes may be relatively large compared with the table. The relative size between index and table is important. The larger the ratio of index to table physical size, the less helpful the index will be in terms of decreasing physical space to be read. Fields containing NULL values may exacerbate this effect. It may be faster to read the entire table, rather than a large composite field index, containing a lot of NULL values. Why create indexes with every field for a composite? It is, of course, acceptable to create a duplicate valued index on less than all the fields desired by the composite field structure. Sometimes a partial index is more efficient than no index, and also more efficient than the complete field set for a composite field index.
That's indexing. Indexes are important to database modeling and overall database performance as a whole, even to the point of not using indexing under certain circumstances. Overuse of indexing can hinder performance just as easily as not using indexing where indexing should be used.
What about using views?

No comments: