Tuesday, May 08, 2007

Statistics - All About

Statistics are data demographic "hints" used by the Teradata Optimizer. There are many ways to generate a query plan for a given SQL request, and collecting statistics ensures that the Optimizer has the most accurate information available to create the best access and join plans. Without collected statistics, the Optimizer assumes that any column or non-unique index is highly non-unique and creates join plans accordingly. Collect statistics on any columns or non-unique indexes that are fairly or highly unique so that the Optimizer can make proper use of them in join plans and thus enhance performance.


Statistics should be collected for the following tables / columns:

1. For small tables where the number of rows is less than 5 times the number of AMPs, collect statistics on the primary index.
2. For large tables, collect statistics on all non-unique primary indexes (NUPIs).
3. For all tables, collect statistics on columns used in JOIN conditions (i.e., columns referenced in any DML WHERE clause).
4. Once collected, statistics should be maintained so that they accurately reflect the data demographics of the underlying table. It is better to have no statistics than stale or otherwise incorrect statistics. A rule of thumb is to recollect statistics when the underlying data has changed by 10% (that is, 10% more rows inserted, 10% of the rows deleted, 10% of the rows updated, or some combination).
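As a sketch, the guidelines above might translate into statements like these (all table, index, and column names here are illustrative, not from a real schema):

```sql
/* 1. Small table: statistics on the primary index              */
COLLECT STATISTICS ON small_ref_table INDEX (ref_id);

/* 2. Large table: statistics on each NUPI                      */
COLLECT STATISTICS ON big_fact_table INDEX (acct_nbr);

/* 3. Any table: statistics on join / WHERE-clause columns      */
COLLECT STATISTICS ON big_fact_table COLUMN txn_date;
```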

STATISTICS AND PLANS:

The optimizing phase of the DBC/SQL language processor makes decisions on how to access table data. These decisions can be very important when table joins (especially those involving multiple joins) are required by a query. By default, the Optimizer uses approximations of the number of rows in each table (known as the cardinality of the table) and of the number of unique values in indexes in making its decisions.

The Optimizer gets its approximation of the cardinality of a table by picking a random Access Module Processor (AMP) and asking that AMP how many rows it has for the table. The chosen AMP does not actually count all of the rows it holds for the table; it generates an estimate based on the average row size and the number of sectors the table occupies on that AMP. The Optimizer then multiplies that estimate by the number of AMPs in the system (making an allowance for uneven hash bucket distribution) to estimate the table cardinality. The number of unique index values is estimated similarly. Given that most of the values involved in these estimates, other than the number of AMPs in the system, are approximations, it is possible (although unusual) for the estimate to be significantly off. This can lead to poor choices of join plans, and associated increases in the response times of the queries involved.

One way to help the Optimizer make better decisions is to give it more accurate information as to the content of the table. This can be done using the COLLECT STATISTICS statement (see the Teradata DBS Reference Manual for details). When the Optimizer finds that there are statistics available for a referenced table, it will use those statistics instead of using estimated table cardinality or estimated unique index value counts (that is, it will trust the statistics).



STALE STATISTICS:

Under normal circumstances, this is the right thing to do; statistics provide more detailed information, and include an exact row count as of the time that the statistics were gathered. However, if the statistics are "stale" -- that is, the table's characteristics (distribution of data values for a column or index for which statistics have been collected, number of rows in the table, etc.) have changed significantly since the statistics were last gathered, the Optimizer can be misled into making poor, or even horrible, join plans, with associated poor performance of queries which use the stale statistics.

To take an extreme case:

Table A - statistics gathered when table had 1,000 rows, but table
now has 1,000,000 rows (perhaps statistics were gathered
during the prototyping phase)

Table B - no statistics; table has 75,000 rows

If a product join between Table A and Table B is necessary for a given query, and one of the tables must be duplicated on all AMPs, then the Optimizer will choose Table A to be duplicated, since 1,000 rows (from the stale statistics) is much less than 75,000 rows. Since in reality Table A now has 1,000,000 rows, the Optimizer will be making a very bad decision (duplicating 1,000,000 rows instead of 75,000), and the query will run much longer than necessary.
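One way to catch this before it hurts is to EXPLAIN the query: the plan text states which table the Optimizer intends to duplicate on all AMPs. A hypothetical check (table and column names are illustrative, standing in for Table A and Table B above):

```sql
/* The EXPLAIN output names the table the Optimizer will duplicate    */
/* on all AMPs; with stale statistics on table_a, the plan may pick   */
/* the table that is in reality the larger one.                       */
EXPLAIN
SELECT a.key_col, b.key_col
FROM   table_a a, table_b b
WHERE  a.key_col > b.key_col;  /* inequality join forces a product join */
```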


Therefore, it is important that statistics be kept fresh. The Reference Manual recommends that statistics be recollected if there is as little as a 10% change -- rows added or deleted -- in the table. There are two general circumstances under which statistics can be considered to be stale:

1. The number of rows in the table has changed significantly.

The number of unique values for each statistic on a table, as well as the
date and time the statistics were last gathered, can be obtained by:

HELP STATISTICS tablename;

For statistics on unique indexes, the unique-value count from HELP STATISTICS
can be cross-checked against the row count returned by:

SELECT COUNT(*) FROM tablename;

For statistics on non-unique columns, the HELP STATISTICS figure can be
cross-checked against the count returned by:

SELECT COUNT(DISTINCT columnname) FROM tablename;

2. The range of values for an index or column of a table for which statistics have been collected has changed significantly. Sometimes one can infer this from the date and time the statistics were last collected, or from the very nature of the column (for instance, if the column in question holds a transaction date, and statistics on that column were last gathered a year ago, it is almost certain that the statistics for that column are stale).
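For a date-bearing column like that, a quick cross-check is to compare the live distinct-value count and the newest value against the HELP STATISTICS output (the table and column names below are hypothetical):

```sql
/* Compare the live picture of the date column against the           */
/* "Unique Values" figure and collection date from HELP STATISTICS.  */
HELP STATISTICS sales_db.txn_detail;

SELECT COUNT(DISTINCT txn_date) AS live_unique_dates
     , MAX(txn_date)            AS newest_txn_date
FROM   sales_db.txn_detail;
```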



REFRESHING STALE STATISTICS:

If the statistics for a table may be stale, they can be easily recollected:

COLLECT STATISTICS ON tablename;

will recollect statistics on all indexes and columns for which previous COLLECT STATISTICS statements were done (and for which DROP STATISTICS statements have not been done). Because collecting statistics involves a full-table scan, it may take a significant amount of time, and should therefore be done off-hours for large tables. It may be revealing to execute:

HELP STATISTICS tablename;

before and after recollecting statistics, to see what, if any, difference the recollect makes. Also, for frequently executed queries, requesting an EXPLAIN before and after recollecting statistics may show differences in join plans and/or spool row count/processing time estimates.
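An illustrative before-and-after comparison (the table names and query here are hypothetical) might look like this:

```sql
/* Capture the join plan before recollecting.                        */
EXPLAIN
SELECT o.order_id
FROM   orders o, customers c
WHERE  o.cust_id = c.cust_id;

/* Recollect every statistic previously gathered on the table.      */
COLLECT STATISTICS ON orders;

/* Re-run the identical EXPLAIN and compare join order, spool row   */
/* counts, and processing-time estimates against the first plan.    */
EXPLAIN
SELECT o.order_id
FROM   orders o, customers c
WHERE  o.cust_id = c.cust_id;
```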

In addition to keeping statistics current, the Reference Manual also recommends that statistics be collected for the primary index of small tables involved in joins.



STALE STATISTICS EXAMPLE:

Here is an example of stale statistics. The table definition is:

CREATE TABLE bpaefer.taemg22, FALLBACK,
( m22_m_fer_ff DECIMAL(10,2) NOT NULL
, m22_d_fin_validite DATE FORMAT 'YYYY-MM-DD'
, m22_m_fer_dev DECIMAL(9,2) NOT NULL
, m08_c_uic_m08 CHAR(2) NOT NULL
, m22_d_deb_v_id DATE FORMAT 'YYYY-MM-DD' NOT NULL
, p66_c_code_id CHAR(8) NOT NULL
)
UNIQUE PRIMARY INDEX ( m22_d_deb_v_id, p66_c_code_id ) ;

Statistics were collected on the unique primary index and individually on each of the two columns comprising that index:

COLLECT STATISTICS ON bpaefer.taemg22
INDEX ( m22_d_deb_v_id, p66_c_code_id ) ;
COLLECT STATISTICS ON bpaefer.taemg22
COLUMN m22_d_deb_v_id ;
COLLECT STATISTICS ON bpaefer.taemg22
COLUMN p66_c_code_id ;

A recent check of this table and its statistics revealed the following:

SELECT COUNT(*) FROM bpaefer.taemg22;

resulted in:

Count(*)
-----------
9203

and

HELP STATISTICS bpaefer.taemg22;

resulted in:

Date Time Unique Values Column Names
-------- -------- -------------------- ----------------------------
93/11/03 12:39:55 17 M22_D_DEB_V_ID
93/11/03 12:40:09 2,095 P66_C_CODE_ID
93/11/03 13:57:17 2,275 M22_D_DEB_V_ID,P66_C_CODE_ID

Note that statistics were most recently collected almost three years ago (as of the date of this TTIP). Also note that the last line from the HELP STATISTICS is for the UNIQUE primary index, and therefore reflects the number of rows in the table when the statistics were last gathered. At that time, there were 2275 rows in the table; there are currently 9203 rows in the table, a factor of more than four difference. Finally, note that the M22_D_DEB_V_ID column is a DATE, with relatively few unique values. With the four-fold increase in the number of rows, it is almost certain that additional values for this column are present.

It is clearly time to recollect this table's statistics.
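As described above, a single statement recollects every statistic previously gathered on the table:

```sql
COLLECT STATISTICS ON bpaefer.taemg22;
```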

The SQL below will generate COLLECT STATISTICS statements for all non-unique indexes that have not had statistics collected, except in the databases crashdumps and DBC. NOTE: You must log on as DBC to generate the batch job and execute it.


/*=================================================================*/
/* Create a working database. */

create database stats as perm = 10000000;

/* Grant a right needed for macro. */

grant select on DBC to stats with grant option;

/* Create the work table. */

create table stats.idx_info
(tvmid byte(6)
,indexnumber smallint
,NameText varchar(512))
unique primary index (tvmid,indexnumber);

/* Create the macro. */

replace macro stats.build_idx_info
(pass smallint)
as
(
update stats.idx_info
set NameText = trim(NameText) ||
',' ||
trim(DBC.TVFields.fieldname)
where idx_info.tvmid = DBC.Indexes.tableid
and idx_info.indexnumber = Indexes.indexnumber
and TVFields.tableid = Indexes.tableid
and TVFields.fieldid = Indexes.fieldid
and DBC.TVM.tvmid = Indexes.tableid
and DBC.Dbase.databaseid = TVM.databaseid
and Indexes.fieldposition = :pass;
);

/* Populate the work table. */

ins stats.idx_info
sel i.tableid
,i.indexnumber
,trim(f.fieldname)
from DBC.indexes i
,DBC.tvfields f
,DBC.dbase d
,DBC.tvm t
where f.tableid = i.tableid
and f.fieldid = i.fieldid
and t.tvmid = i.tableid
and d.databaseid = t.databaseid
and i.UniqueFlag <> 'Y'
and i.fieldposition = 1
and i.indexstatistics is null
and d.databasenamei NOT IN ('dbc','crashdumps');

/* Determine N. */

sel max(i.fieldposition)
from DBC.indexes i
,DBC.dbase d
,DBC.tvm t
where t.tvmid = i.tableid
and d.databaseid = t.databaseid
and d.databasenamei NOT IN ('dbc','crashdumps')
and i.UniqueFlag <> 'Y';

/* Run macro to add fields. */
/* Run multiple times. */
/* Use arguments 2 to N. */

exec stats.build_idx_info(2);

/* ADD ADDITIONAL MACRO EXECUTIONS HERE!!! */

/* Generate grant statements. */

.width 254
.export report file=grant

sel 'grant drop table on '||
trim(d.databasename) ||
' to DBC;' (title'')
from stats.idx_info idx
,DBC.tvm t
,DBC.dbase d
where idx.tvmid = t.tvmid
and t.databaseid = d.databaseid
group by 1
order by 1;

.export reset

/* Generate collect statistics statements. */

.export report file=collect

sel 'collect statistics '||
trim(d.databasename) ||
'.' ||
trim(t.tvmname) ||
' index(' ||
trim(idx.NameText) ||
');' (title'')
from stats.idx_info idx
,DBC.tvm t
,DBC.dbase d
where idx.tvmid = t.tvmid
and t.databaseid = d.databaseid
order by d.databasename
,t.tvmname
,idx.indexnumber;

.export reset

/* All that remains is to run the generated files (e.g., BTEQ .RUN). */

-------------------------------------------------------------------------------------
The following query can be used to create a report that lists database names, table names, column names, and the date statistics were last collected.
You can restrict the database names, table names, etc., in the WHERE clause if you want to limit the objects referenced.

sel
c.databasenamei as DatabaseName,
b.tvmnamei as TableName,
a.fieldname as ColumnName,
cast((case when substr(fieldstatistics,1,1) = 'D2'XB
then '2002-'
when substr(fieldstatistics,1,1) = 'D1'XB
then '2001-'
when substr(fieldstatistics,1,1) = 'D0'XB
then '2000-'
when substr(fieldstatistics,1,1) = 'CF'XB
then '1999-'
when substr(fieldstatistics,1,1) = 'CE'XB
then '1998-'
else NULL
end)||
(case when substr(fieldstatistics,3,1) = '01'XB
then '01-'
when substr(fieldstatistics,3,1) = '02'XB
then '02-'
when substr(fieldstatistics,3,1) = '03'XB
then '03-'
when substr(fieldstatistics,3,1) = '04'XB
then '04-'
when substr(fieldstatistics,3,1) = '05'XB
then '05-'
when substr(fieldstatistics,3,1) = '06'XB
then '06-'
when substr(fieldstatistics,3,1) = '07'XB
then '07-'
when substr(fieldstatistics,3,1) = '08'XB
then '08-'
when substr(fieldstatistics,3,1) = '09'XB
then '09-'
when substr(fieldstatistics,3,1) = '0A'XB
then '10-'
when substr(fieldstatistics,3,1) = '0B'XB
then '11-'
when substr(fieldstatistics,3,1) = '0C'XB
then '12-'
else 'xx-'
end)||
(case when substr(fieldstatistics,4,1) = '01'XB
then '01'
when substr(fieldstatistics,4,1) = '02'XB
then '02'
when substr(fieldstatistics,4,1) = '03'XB
then '03'
when substr(fieldstatistics,4,1) = '04'XB
then '04'
when substr(fieldstatistics,4,1) = '05'XB
then '05'
when substr(fieldstatistics,4,1) = '06'XB
then '06'
when substr(fieldstatistics,4,1) = '07'XB
then '07'
when substr(fieldstatistics,4,1) = '08'XB
then '08'
when substr(fieldstatistics,4,1) = '09'XB
then '09'
when substr(fieldstatistics,4,1) = '0A'XB
then '10'
when substr(fieldstatistics,4,1) = '0B'XB
then '11'
when substr(fieldstatistics,4,1) = '0C'XB
then '12'
when substr(fieldstatistics,4,1) = '0D'XB
then '13'
when substr(fieldstatistics,4,1) = '0E'XB
then '14'
when substr(fieldstatistics,4,1) = '0F'XB
then '15'
when substr(fieldstatistics,4,1) = '10'XB
then '16'
when substr(fieldstatistics,4,1) = '11'XB
then '17'
when substr(fieldstatistics,4,1) = '12'XB
then '18'
when substr(fieldstatistics,4,1) = '13'XB
then '19'
when substr(fieldstatistics,4,1) = '14'XB
then '20'
when substr(fieldstatistics,4,1) = '15'XB
then '21'
when substr(fieldstatistics,4,1) = '16'XB
then '22'
when substr(fieldstatistics,4,1) = '17'XB
then '23'
when substr(fieldstatistics,4,1) = '18'XB
then '24'
when substr(fieldstatistics,4,1) = '19'XB
then '25'
when substr(fieldstatistics,4,1) = '1A'XB
then '26'
when substr(fieldstatistics,4,1) = '1B'XB
then '27'
when substr(fieldstatistics,4,1) = '1C'XB
then '28'
when substr(fieldstatistics,4,1) = '1D'XB
then '29'
when substr(fieldstatistics,4,1) = '1E'XB
then '30'
when substr(fieldstatistics,4,1) = '1F'XB
then '31'
else 'xx'
end)as date) as CollectionDate,
cast(substr(cast(a.lastaltertimestamp as char(32))
,1,10) as date) as LastAlter,
date - collectiondate as FromCurrent,
lastalter - collectiondate as FromAlter

from
dbc.tvfields a,
dbc.tvm b,
dbc.dbase c
where a.tableid = b.tvmid
and b.tablekind = 'T'
and b.databaseid = c.databaseid
and upper(trim(c.databasenamei)) in
('AFFINITY','CLAIMS_ECOMM','CLICKSTREAM','PCDW_AURORA','PCDW_AUTO',
 'PCDW_AUTOCS','PCDW_AUTOQB','PCDW_CBR','PCDW_CLAIMS','PCDW_DSS',
 'PCDW_ICAR','PRODUCERS','WEB_TEST')

and a.fieldstatistics is not null
order by 1,2,3
;

Additional Information/Comments:
NOTE: This checks fieldstatistics. You can also check indexstatistics by changing the column name.


Note that the query does not return results for compound indexes, because this data is not kept in TVFIELDS. A parallel query that returns the dates for compound indexes follows:

sel unique
c.databasenamei as DatabaseName,
b.tvmnamei as TableName,
d.indexname,
cast((case when substr(IndexStatistics,1,1) = 'D2'XB
then '2002-'
when substr(IndexStatistics,1,1) = 'D1'XB
then '2001-'
when substr(IndexStatistics,1,1) = 'D0'XB
then '2000-'
when substr(IndexStatistics,1,1) = 'CF'XB
then '1999-'
when substr(IndexStatistics,1,1) = 'CE'XB
then '1998-'
else NULL
end)||
(case when substr(IndexStatistics,3,1) = '01'XB
then '01-'
when substr(IndexStatistics,3,1) = '02'XB
then '02-'
when substr(IndexStatistics,3,1) = '03'XB
then '03-'
when substr(IndexStatistics,3,1) = '04'XB
then '04-'
when substr(IndexStatistics,3,1) = '05'XB
then '05-'
when substr(IndexStatistics,3,1) = '06'XB
then '06-'
when substr(IndexStatistics,3,1) = '07'XB
then '07-'
when substr(IndexStatistics,3,1) = '08'XB
then '08-'
when substr(IndexStatistics,3,1) = '09'XB
then '09-'
when substr(IndexStatistics,3,1) = '0A'XB
then '10-'
when substr(IndexStatistics,3,1) = '0B'XB
then '11-'
when substr(IndexStatistics,3,1) = '0C'XB
then '12-'
else 'xx-'
end)||
(case when substr(IndexStatistics,4,1) = '01'XB
then '01'
when substr(IndexStatistics,4,1) = '02'XB
then '02'
when substr(IndexStatistics,4,1) = '03'XB
then '03'
when substr(IndexStatistics,4,1) = '04'XB
then '04'
when substr(IndexStatistics,4,1) = '05'XB
then '05'
when substr(IndexStatistics,4,1) = '06'XB
then '06'
when substr(IndexStatistics,4,1) = '07'XB
then '07'
when substr(IndexStatistics,4,1) = '08'XB
then '08'
when substr(IndexStatistics,4,1) = '09'XB
then '09'
when substr(IndexStatistics,4,1) = '0A'XB
then '10'
when substr(IndexStatistics,4,1) = '0B'XB
then '11'
when substr(IndexStatistics,4,1) = '0C'XB
then '12'
when substr(IndexStatistics,4,1) = '0D'XB
then '13'
when substr(IndexStatistics,4,1) = '0E'XB
then '14'
when substr(IndexStatistics,4,1) = '0F'XB
then '15'
when substr(IndexStatistics,4,1) = '10'XB
then '16'
when substr(IndexStatistics,4,1) = '11'XB
then '17'
when substr(IndexStatistics,4,1) = '12'XB
then '18'
when substr(IndexStatistics,4,1) = '13'XB
then '19'
when substr(IndexStatistics,4,1) = '14'XB
then '20'
when substr(IndexStatistics,4,1) = '15'XB
then '21'
when substr(IndexStatistics,4,1) = '16'XB
then '22'
when substr(IndexStatistics,4,1) = '17'XB
then '23'
when substr(IndexStatistics,4,1) = '18'XB
then '24'
when substr(IndexStatistics,4,1) = '19'XB
then '25'
when substr(IndexStatistics,4,1) = '1A'XB
then '26'
when substr(IndexStatistics,4,1) = '1B'XB
then '27'
when substr(IndexStatistics,4,1) = '1C'XB
then '28'
when substr(IndexStatistics,4,1) = '1D'XB
then '29'
when substr(IndexStatistics,4,1) = '1E'XB
then '30'
when substr(IndexStatistics,4,1) = '1F'XB
then '31'
else 'xx'
end)as date) as CollectionDate,
cast(substr(cast(a.lastaltertimestamp as char(32))
,1,10) as date) as LastAlter,
date - collectiondate as FromCurrent,
lastalter - collectiondate as FromAlter

from
dbc.indexes a,
dbc.tvm b,
dbc.dbase c,
dbc.indices d

where a.tableid = b.tvmid
and d.indexnumber = a.indexnumber
and b.tablekind = 'T'
and b.databaseid = c.databaseid
and upper(trim(c.databasenamei)) in
(' <<<>>>')
and upper(trim(d.databasename)) = upper(trim(c.databasenamei))
and upper(trim(d.tablename)) = upper(trim(b.tvmnamei))
and a.IndexStatistics is not null
and d.indexname is not null
order by 1,2,3
;


-----------------------------------------------------------------------------------

2 Comments:

Blogger Jay said...

Hi,
I have a general question. I have a table where a column is used frequently in join clauses but is not part of the NUPI. Is it enough to just collect statistics on this column, or should I create a NUSI on the column and then collect stats? What is the difference for the optimizer? I know that a NUSI will occupy additional space, but from an optimizer point of view, what is recommended? It would be great if you could throw some light on this.

12:31 am  
Blogger Ranjith said...

Jay,
What I would probably do is collect stats on the column and monitor dbc.qrylog to check the impact (number of steps, time taken, I/O, CPU, spool, etc.).

Then create a NUSI (with stats) and check the query log again.

Then choose the option that works well for that configuration and access pattern. I would also review it periodically to ensure optimal performance, as the optimizer's plan can change with the size of the table and other demographics.

This is a general approach; I can provide more detail given the table size, system configuration, DDL, and so on. Hope it helps.

6:30 am  
