Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Understanding statistics in PostgreSQL #279

Open
xluffy opened this issue Oct 28, 2024 · 0 comments
Open

Understanding statistics in PostgreSQL #279

xluffy opened this issue Oct 28, 2024 · 0 comments

Comments

@xluffy
Copy link
Owner

xluffy commented Oct 28, 2024

Data distribution statistics or Single column statistics: These statistics are related to the data distribution for each relation. They provide information about the most common values in each column in a relation, average width of the column, number of distinct values in the column, and more. They’re collected when we run ANALYZE or when analyze is triggered by autovacuum, and are stored in the pg_statistic system catalog (whose public readable view is pg_stats).

  • For large tables,ANALYZE takes a random sample of the table contents, rather than examining every row. This allows even very large tables to be analyzed in a small amount of time.
  • The amount of samples considered byANALYZE depends on the default_statistics_target parameter. Larger values increase the time needed to doANALYZE, but might improve the quality of the planner’s estimates. The default value for this is 100. To get an accurate plan, the default value is sufficient; however,default_statistics_target is the global default. For the case where there is 1 column that needs more stats, you can use ALTER TABLE [ TABLE ] ALTER [ COLUMN ] column_name SET STATISTICS integer. However, it will consume more CPU, memory, and time. If the value of this parameter is 100, then 300 * 100 = 30,000 rows are sampled from each table. This sample is used to determine up to 100 most common values to store in the most_common_vals array column, and up to 100 histogram bounds to store in that array, plus a few other scalar statistics, like the number of distinct values.
  • There is no command or function for resetting these statistics (as in resetting to 0).
  • After you complete an engine major version upgrade, you should run theANALYZE operation to refresh thepg_statistic table (to have the statistics updated for the planner’s use).
  • For a read replica in Amazon RDS for PostgreSQL and for a reader node in Aurora PostgreSQL, these stats are the same as for the primary or writer. This is because they are stored in a relation (pg_statistics) on disk (physical blocks are the same on the replica in Amazon RDS for PostgreSQL and in the case of Aurora, the reader is reading from the same storage). This is also the reason why it isn’t allowed (and also not logical) to run anANALYZE on a replica or a reader node (both can read from thepg_statistics relation, but can’t update it).

Extended statistics: By default, the statistics fromANALYZE are stored on a per-column per-table basis, and therefore can’t capture any knowledge about cross-column correlation. It’s common to see slow queries running bad run plans because multiple columns used in the query clauses are correlated. However, with the CREATE STATISTICS command, you can create extended statistics for correlated columns.

Note:

If the value of this parameter is 100, then 300 * 100 = 30,000 rows are sampled from each table

{
    {"default_statistics_target", PGC_USERSET, QUERY_TUNING_OTHER,
        gettext_noop("Sets the default statistics target."),
        gettext_noop("This applies to table columns that have not had a "
          "column-specific target set via ALTER TABLE SET STATISTICS.")
    },
    &default_statistics_target,
    100, 1, MAX_STATISTICS_TARGET,
    NULL, NULL, NULL
},
/*
 * std_typanalyze -- the default type-specific typanalyze function
 */
bool
std_typanalyze(VacAttrStats *stats)
{
	Form_pg_attribute attr = stats->attr;
	Oid			ltopr;
	Oid			eqopr;
	StdAnalyzeData *mystats;

	/* If the attstattarget column is negative, use the default value */
	/* NB: it is okay to scribble on stats->attr since it's a copy */
	if (attr->attstattarget < 0)
		attr->attstattarget = default_statistics_target;

	/* Look for default "<" and "=" operators for column's type */
	get_sort_group_operators(stats->attrtypid,
							 false, false, false,
							 &ltopr, &eqopr, NULL,
							 NULL);

	/* Save the operator info for compute_stats routines */
	mystats = (StdAnalyzeData *) palloc(sizeof(StdAnalyzeData));
	mystats->eqopr = eqopr;
	mystats->eqfunc = OidIsValid(eqopr) ? get_opcode(eqopr) : InvalidOid;
	mystats->ltopr = ltopr;
	stats->extra_data = mystats;

	/*
	 * Determine which standard statistics algorithm to use
	 */
	if (OidIsValid(eqopr) && OidIsValid(ltopr))
	{
		/* Seems to be a scalar datatype */
		stats->compute_stats = compute_scalar_stats;
		/*--------------------
		 * The following choice of minrows is based on the paper
		 * "Random sampling for histogram construction: how much is enough?"
		 * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
		 * Proceedings of ACM SIGMOD International Conference on Management
		 * of Data, 1998, Pages 436-447.  Their Corollary 1 to Theorem 5
		 * says that for table size n, histogram size k, maximum relative
		 * error in bin size f, and error probability gamma, the minimum
		 * random sample size is
		 *		r = 4 * k * ln(2*n/gamma) / f^2
		 * Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain
		 *		r = 305.82 * k
		 * Note that because of the log function, the dependence on n is
		 * quite weak; even at n = 10^12, a 300*k sample gives <= 0.66
		 * bin size error with probability 0.99.  So there's no real need to
		 * scale for n, which is a good thing because we don't necessarily
		 * know it at this point.
		 *--------------------
		 */
		stats->minrows = 300 * attr->attstattarget;
	}
	else if (OidIsValid(eqopr))
	{
		/* We can still recognize distinct values */
		stats->compute_stats = compute_distinct_stats;
		/* Might as well use the same minrows as above */
		stats->minrows = 300 * attr->attstattarget;
	}
	else
	{
		/* Can't do much but the trivial stuff */
		stats->compute_stats = compute_trivial_stats;
		/* Might as well use the same minrows as above */
		stats->minrows = 300 * attr->attstattarget;
	}

	return true;
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant