Data distribution statistics or single-column statistics: These statistics describe the data distribution for each relation. They provide information about the most common values in each column of a relation, the average width of the column, the number of distinct values in the column, and more. They’re collected when ANALYZE runs, either manually or when triggered by autovacuum, and are stored in the pg_statistic system catalog (whose publicly readable view is pg_stats).
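For example, assuming a hypothetical table public.orders, you can inspect what ANALYZE collected for it through the pg_stats view:

SELECT attname,
       n_distinct,
       avg_width,
       most_common_vals,
       histogram_bounds
FROM   pg_stats
WHERE  schemaname = 'public'
  AND  tablename = 'orders';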
For large tables, ANALYZE takes a random sample of the table contents rather than examining every row. This allows even very large tables to be analyzed in a small amount of time.
The number of rows sampled by ANALYZE depends on the default_statistics_target parameter. Larger values increase the time needed to run ANALYZE, but might improve the quality of the planner’s estimates. The default value is 100, which is usually sufficient for an accurate plan; however, default_statistics_target is the global default. If a single column needs more detailed statistics, you can use ALTER TABLE table_name ALTER [ COLUMN ] column_name SET STATISTICS integer, keeping in mind that a higher target consumes more CPU, memory, and time during ANALYZE. If the value of this parameter is 100, then 300 * 100 = 30,000 rows are sampled from each table. This sample is used to determine up to 100 most common values to store in the most_common_vals array column, and up to 100 histogram bounds to store in the histogram_bounds array column, plus a few other scalar statistics, such as the number of distinct values.
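As a quick illustration (orders.status is a hypothetical table and column), you can check the global default and raise the target for just one skewed column:

SHOW default_statistics_target;   -- 100 unless it has been changed

-- Raise the per-column target, then re-sample so it takes effect
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
ANALYZE orders;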
There is no command or function for resetting these statistics (as in resetting to 0).
After you complete an engine major version upgrade, you should run ANALYZE to refresh the pg_statistic table (so the planner has up-to-date statistics to work with).
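For example, right after the upgrade you can run (orders is only a placeholder name):

ANALYZE;                  -- analyzes every table in the current database
ANALYZE VERBOSE orders;   -- or target specific tables and watch the progress output

The vacuumdb --analyze-in-stages client utility can also be used here; it rebuilds statistics in several passes so that minimally usable statistics are available quickly.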
For a read replica in Amazon RDS for PostgreSQL and for a reader node in Aurora PostgreSQL, these stats are the same as for the primary or writer. This is because they are stored in a relation (pg_statistic) on disk (the physical blocks are the same on the replica in Amazon RDS for PostgreSQL, and in Aurora the reader reads from the same storage). This is also why running ANALYZE on a replica or a reader node isn’t allowed (and wouldn’t make sense): both can read from the pg_statistic relation, but can’t update it.
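If you’re unsure whether the session you’re connected to is a reader, a quick check (this is a standard PostgreSQL function, not an RDS- or Aurora-specific one):

SELECT pg_is_in_recovery();   -- true on a replica/reader, false on the primary/writer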
Extended statistics: By default, the statistics gathered by ANALYZE are stored on a per-column, per-table basis, and therefore can’t capture any knowledge about cross-column correlation. It’s common to see slow queries running with bad execution plans because multiple columns used in the query clauses are correlated. However, with the CREATE STATISTICS command, you can create extended statistics for correlated columns.
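A minimal sketch, using hypothetical addresses, city, and zip names, of extended statistics on two correlated columns:

CREATE STATISTICS city_zip_stats (dependencies, ndistinct)
    ON city, zip FROM addresses;
ANALYZE addresses;   -- extended statistics are also populated by ANALYZE

For reference, this is how default_statistics_target is defined in the PostgreSQL source (the GUC table entry):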
{
    {"default_statistics_target", PGC_USERSET, QUERY_TUNING_OTHER,
        gettext_noop("Sets the default statistics target."),
        gettext_noop("This applies to table columns that have not had a "
                     "column-specific target set via ALTER TABLE SET STATISTICS.")
    },
    &default_statistics_target,
    100, 1, MAX_STATISTICS_TARGET,
    NULL, NULL, NULL
},
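And std_typanalyze() in src/backend/commands/analyze.c is where the 300 * statistics-target sample size used by ANALYZE comes from: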
/*
 * std_typanalyze -- the default type-specific typanalyze function
 */
bool
std_typanalyze(VacAttrStats *stats)
{
    Form_pg_attribute attr = stats->attr;
    Oid         ltopr;
    Oid         eqopr;
    StdAnalyzeData *mystats;

    /* If the attstattarget column is negative, use the default value */
    /* NB: it is okay to scribble on stats->attr since it's a copy */
    if (attr->attstattarget < 0)
        attr->attstattarget = default_statistics_target;

    /* Look for default "<" and "=" operators for column's type */
    get_sort_group_operators(stats->attrtypid,
                             false, false, false,
                             &ltopr, &eqopr, NULL,
                             NULL);

    /* Save the operator info for compute_stats routines */
    mystats = (StdAnalyzeData *) palloc(sizeof(StdAnalyzeData));
    mystats->eqopr = eqopr;
    mystats->eqfunc = OidIsValid(eqopr) ? get_opcode(eqopr) : InvalidOid;
    mystats->ltopr = ltopr;
    stats->extra_data = mystats;

    /*
     * Determine which standard statistics algorithm to use
     */
    if (OidIsValid(eqopr) && OidIsValid(ltopr))
    {
        /* Seems to be a scalar datatype */
        stats->compute_stats = compute_scalar_stats;

        /*--------------------
         * The following choice of minrows is based on the paper
         * "Random sampling for histogram construction: how much is enough?"
         * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
         * Proceedings of ACM SIGMOD International Conference on Management
         * of Data, 1998, Pages 436-447.  Their Corollary 1 to Theorem 5
         * says that for table size n, histogram size k, maximum relative
         * error in bin size f, and error probability gamma, the minimum
         * random sample size is
         *      r = 4 * k * ln(2*n/gamma) / f^2
         * Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain
         *      r = 305.82 * k
         * Note that because of the log function, the dependence on n is
         * quite weak; even at n = 10^12, a 300*k sample gives <= 0.66
         * bin size error with probability 0.99.  So there's no real need to
         * scale for n, which is a good thing because we don't necessarily
         * know it at this point.
         *--------------------
         */
        stats->minrows = 300 * attr->attstattarget;
    }
    else if (OidIsValid(eqopr))
    {
        /* We can still recognize distinct values */
        stats->compute_stats = compute_distinct_stats;
        /* Might as well use the same minrows as above */
        stats->minrows = 300 * attr->attstattarget;
    }
    else
    {
        /* Can't do much but the trivial stuff */
        stats->compute_stats = compute_trivial_stats;
        /* Might as well use the same minrows as above */
        stats->minrows = 300 * attr->attstattarget;
    }

    return true;
}