PostgreSQL数据库统计信息——compute_scalar_stats计算统计数据

如果列类型指定的是std_typanalyze函数决定compute_stats的取值:

  • 如果列数据类型支持默认的等于(eqopr equals operator)和小于(ltopr less than operator),那么这个列应该是数值scalar类型,应该可以使用compute_scalar_stats进行分析。
  • 如果列数据类型仅仅支持等于运算符,可以使用compute_distinct_stats函数进行唯一值的分析。
  • 如果列数据类型不支持上述运算,那么只能使用compute_trivial_stats进行分析了。

这里我们以数值scalar类型的列为例,学习一下compute_scalar_stats函数计算统计数据。当我们可以找到数据类型的“=”和“<”运算符时,我们使用compute_scalar_stats函数计算统计数据。我们通过该函数确定非空行的比例、平均宽度、最常见值、不同值的(估计)数量、分布直方图以及物理顺序与逻辑顺序的相关性这些统计项。在将数据值按顺序排序后,可以相当容易地确定所需的统计数据。

/*  compute_scalar_stats() -- compute column statistics
* We use this when we can find "=" and "<" operators for the datatype. We determine the fraction of non-null rows, the average width, the most common values, the (estimated) number of distinct values, the distribution histogram, and the correlation of physical to logical order. The desired stats can be determined fairly easily after sorting the data values into order. */
static void compute_scalar_stats(VacAttrStatsP stats, AnalyzeAttrFetchFunc fetchfunc, int samplerows, double totalrows){
int i, null_cnt = 0, nonnull_cnt = 0, toowide_cnt = 0, ;
double total_width = 0; double corr_xysum;
bool is_varlena = (!stats->attrtype->typbyval && stats->attrtype->typlen == -1);
bool is_varwidth = (!stats->attrtype->typbyval && stats->attrtype->typlen < 0);

ScalarItem *values; ScalarMCVItem *track;
int *tupnoLink, values_cnt = 0, track_cnt = 0;
int num_mcv = stats->attr->attstattarget, num_bins = stats->attr->attstattarget;
StdAnalyzeData *mystats = (StdAnalyzeData *) stats->extra_data;
values = (ScalarItem *) palloc(samplerows * sizeof(ScalarItem));
tupnoLink = (int *) palloc(samplerows * sizeof(int));
track = (ScalarMCVItem *) palloc(num_mcv * sizeof(ScalarMCVItem));

SortSupportData ssup;
memset(&ssup, 0, sizeof(ssup));
ssup.ssup_cxt = CurrentMemoryContext; ssup.ssup_collation = stats->attrcollid; ssup.ssup_nulls_first = false;
ssup.abbreviate = false; /* For now, don't perform abbreviated key conversion, because full values are required for MCV slot generation. Supporting that optimization would necessitate teaching compare_scalars() to call a tie-breaker. 目前,不要执行缩写键转换,因为MCV时隙生成需要完整值。支持这种优化需要教导compare_scalars()来调用平局。 */
PrepareSortSupportFromOrderingOp(mystats->ltopr, &ssup);

经过上述初始化函数需要使用的临时变量后,下一步开始从采样行样本中提取对应列的数据​​value = fetchfunc(stats, i, &isnull)​​,如果该行的列数据为null,则将null_cnt(null数据数量)变量递增;否则将nonnull_cnt(非null数据数量)变量递增。对于计算列数据的平均长度,我们需要将所有采样行的列长度相加放入total_width,如果是变长字段(如text等),使用VARSIZE_ANY获取、如果值是toasted,则使用toastedwidth。最后将列数据设置到values对应samplerow作为Index的数组元素的value中。

for (i = 0; i < samplerows; i++) { /* Initial scan to find sortable values */
Datum value; bool isnull;
vacuum_delay_point();
value = fetchfunc(stats, i, &isnull);
if (isnull){ /* Check for null/nonnull */ null_cnt++; continue; }
nonnull_cnt++;
// 处理有数据的列:
// 1. 如果是可变宽度字段,使用strlen获取数据长度,列数据记录到values数组中
// 2. 如果是toast字段,使用VARSIZE_ANY获取数据长度,如果toast数据长度过长,则递增toowide_cnt,列数据不记录到values数组中;如果toast数据长度不超长,列数据记录到values数组中
// 3. 如果是固定长度字段,则可直接使用stats->attrtype->typlen,这里不作处理
if (is_varlena){ /* If it's a variable-width field, add up widths for average width calculation. Note that if the value is toasted, we use the toasted width. We don't bother with this calculation if it's a fixed-width type. */ // 如果是可变宽度字段,则将宽度相加以计算平均宽度。请注意,如果值是toasted,则使用toastedwidth。如果它是固定宽度类型,我们不需要进行这种计算。
total_width += VARSIZE_ANY(DatumGetPointer(value));
if (toast_raw_datum_size(value) > WIDTH_THRESHOLD){ /* If the value is toasted, we want to detoast it just once to avoid repeated detoastings and resultant excess memory usage during the comparisons. Also, check to see if the value is excessively wide, and if so don't detoast at all --- just ignore the value. */ // 如果该值是toasted,我们希望只对其进行一次detoast,以避免在比较过程中重复进行detoast并导致内存使用过多。此外,检查该值是否过宽,如果过宽,则根本不需要删除——只需忽略该值
toowide_cnt++; continue;
}
value = PointerGetDatum(PG_DETOAST_DATUM(value));
}else if (is_varwidth){ total_width += strlen(DatumGetCString(value)) + 1; /* must be cstring */ }

values[values_cnt].value = value; /* Add it to the list to be sorted */
values[values_cnt].tupno = values_cnt;
tupnoLink[values_cnt] = values_cnt;
values_cnt++;
}

计算统计指标
为了理解上述null数据数量、非null数据数量和平均列数据宽度的处理,我们将后面两个分支提到前面来学习。如下分支满足​​​values_cnt <= 0 && nonnull_cnt > 0​​条件,也就是说这些采样样本中的这些列有非NULL数据,但是这些列数据都太过于宽了(超过了WIDTH_THRESHOLD),这些数据列不会记录到values数组中。stats->stanullfrac为样本中数据为null比例,NULL比例概率的计算公式为​​stanullfrac = null_number / sample_row_number​​。如果列类型是可变宽度字段,则使用收集的total_width来计算平均列数据长度;否则直接是固定长度字段,则可直接使用stats->attrtype->typlen。这里整理一下列有数据且都超宽情况下的公式:

  • 列数据NULL比例stanullfrac:​​(double) null_cnt / (double) samplerows​
  • 如果是可变宽度字段,平均列数据长度stawidth:​​total_width / (double) nonnull_cnt​​​;如果是固定宽度字段,平均列数据长度stawidth:​​stats->attrtype->typlen​
  • 基数的计算stadistinct:​​-1.0 * (1.0 - stats->stanullfrac)​​。我们知道当采样中的值没有重复的时候,则认为所有的值唯一,stadistinct = -1。而通过注释可知超宽数据都认为是唯一的,所以只要减去列数据NULL比例,其他数据都是唯一的。因此上面的基数计算是正常的。
else if (nonnull_cnt > 0){ /* We found some non-null values, but they were all too wide */
Assert(nonnull_cnt == toowide_cnt);
stats->stats_valid = true;
stats->stanullfrac = (double) null_cnt / (double) samplerows; /* Do the simple null-frac and width stats */
if (is_varwidth) stats->stawidth = total_width / (double) nonnull_cnt;
else stats->stawidth = stats->attrtype->typlen;
stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac); /* Assume all too-wide values are distinct, so it's a unique column */
}

如下分支满足​​values_cnt <= 0 && nonnull_cnt <= 0​​​条件,也就是找到的都是null值。列数据NULL比例stanullfrac为1,表示全为null值。如果是可变宽度字段,平均列数据长度stawidth为零;如果是固定宽度字段,平均列数据长度stawidth:​​stats->attrtype->typlen​​;基数的计算stadistinct为0。

else if (null_cnt > 0){/* We found only nulls; assume the column is entirely null */
stats->stats_valid = true;
stats->stanullfrac = 1.0;
if (is_varwidth) stats->stawidth = 0; /* "unknown" */
else stats->stawidth = stats->attrtype->typlen;
stats->stadistinct = 0.0; /* "unknown" */
}
/* We don't need to bother cleaning up any of our temporary palloc's */
}

下面的分支是大头了,首先会依据当前字段的值,对记录进行排序。在取出样本数据的时候,按照tuple在磁盘中的位置顺序取出的(ScalarItem的tupno维护了位置顺序信息),因此对值进行排序后即可计算得出相关性。

if (values_cnt > 0) { /* We can only compute real stats if we found some sortable values. */
int ndistinct,/* # distinct values in sample */ nmultiple,/* # that appear multiple times */ num_hist, dups_cnt, slot_idx = 0;
CompareScalarsContext cxt;
cxt.ssup = &ssup; /* Sort the collected values */ cxt.tupnoLink = tupnoLink;
qsort_arg((void *) values, values_cnt, sizeof(ScalarItem), compare_scalars, (void *) &cxt); // 快排排序

compare_scalars除了对项目进行ScalarItems之外,每当发现两个ScalarItems包含相等的值时,会更新tupnoLink[]数组,规则是以tupno小的值作为索引,tupnoLink对应元素里存放的是tupno大的值,tupnoLink[tupno_min]为tupno_max。

/* qsort_arg comparator for sorting ScalarItems
* Aside from sorting the items, we update the tupnoLink[] array whenever two ScalarItems are found to contain equal datums. The array is indexed by tupno; for each ScalarItem, it contains the highest tupno that that item's datum has been found to be equal to. This allows us to avoid additional comparisons in compute_scalar_stats(). */
static int compare_scalars(const void *a, const void *b, void *arg) {
Datum da = ((const ScalarItem *) a)->value; int ta = ((const ScalarItem *) a)->tupno;
Datum db = ((const ScalarItem *) b)->value; int tb = ((const ScalarItem *) b)->tupno;
CompareScalarsContext *cxt = (CompareScalarsContext *) arg;
int compare;
compare = ApplySortComparator(da, false, db, false, cxt->ssup);
if (compare != 0) return compare; // 如果比较下来值相等,那么就需要比较tupno
/* The two datums are equal, so update cxt->tupnoLink[]. */
if (cxt->tupnoLink[ta] < tb) cxt->tupnoLink[ta] = tb;
if (cxt->tupnoLink[tb] < ta) cxt->tupnoLink[tb] = ta;
/* For equal datums, sort by tupno */ // 对于value相同的元素,根据tupno,元组在采样样本中的位置来比较
return ta - tb;
}

现在按顺序扫描这些值,找到最常见的值,并累积排序相关性ordering-correlation统计数据。为了确定哪一个是最常见的,我们首先必须计算每个值的重复数。重复项在排序列表中相邻,因此蛮力方法是比较连续的数据值,直到我们发现两个不相等。然而,这需要N-1次调用数据比较例程,这与排序过程中所做的工作完全冗余。(排序算法必须在某一点上比较排序顺序中相邻的每对项;否则,它无法知道是否正确排序了该对。)我们通过让compare_scalars记住每个ScalarItem被发现等于的最高tupno索引来利用这一点。在排序结束时,ScalarItems的tupnoLink仍将指向自身,当且仅当它是其重复项组的最后一项时(因为该组将由tupno排序)Now scan the values in order, find the most common ones, and also accumulate ordering-correlation statistics. To determine which are most common, we first have to count the number of duplicates of each value. The duplicates are adjacent in the sorted list, so a brute-force approach is to compare successive datum values until we find two that are not equal. However, that requires N-1 invocations of the datum comparison routine, which are completely redundant with work that was done during the sort. (The sort algorithm must at some point have compared each pair of items that are adjacent in the sorted order; otherwise it could not know that it’s ordered the pair correctly.) We exploit this by having compare_scalars remember the highest tupno index that each ScalarItem has been found equal to. At the end of the sort, a ScalarItem’s tupnoLink will still point to itself if and only if it is the last item of its group of duplicates (since the group will be ordered by tupno).

= 0; ndistinct = 0; nmultiple = 0; dups_cnt = 0;
for (i = 0; i < values_cnt; i++) {
int tupno = values[i].tupno;
corr_xysum += ((double) i) * ((double) tupno); // corr_xysum是所有采样行排序顺序乘物理顺序的积的和
dups_cnt++; // 重复值计数增加
if (tupnoLink[tupno] == tupno){/* Reached end of duplicates of this value */ // 非相同值开始了
ndistinct++; // 不同值计数
if (dups_cnt > 1){ // 前面有重复值,也就是有大于2个相同值
nmultiple++; // 重复值的数量
if (track_cnt < num_mcv || dups_cnt > track[track_cnt - 1].count) {/* Found a new item for the mcv list; find its position, bubbling down old items if needed. Loop invariant is that j points at an empty/ replaceable slot. */ // 找到mcv列表的新项;找到它的位置,如果需要的话,可以从旧项目中找出。循环不变量是j指向一个空的/可替换的插槽
// 如果track槽数不够且新找到的重复值的重复次数比track中重复值重复次数大的,则需要替换track中槽(track记录重复次数最高的值);如果track槽数还充足,直接使用空的槽
int j;
if (track_cnt < num_mcv) track_cnt++;
for (j = track_cnt - 1; j > 0; j--) { // 从大到小排序,找到合适的槽
if (dups_cnt <= track[j - 1].count) break;
track[j].count = track[j - 1].count;
track[j].first = track[j - 1].first;
}
track[j].count = dups_cnt;
track[j].first = i + 1 - dups_cnt;
}
}
dups_cnt = 0; // 重置重复值计数器
}
}

这里整理一下列有数据正常情况下的公式:

  • 列数据NULL比例stanullfrac:​​(double) null_cnt / (double) samplerows​
  • 如果是可变宽度字段,平均列数据长度stawidth:​​total_width / (double) nonnull_cnt​​​;如果是固定宽度字段,平均列数据长度stawidth:​​stats->attrtype->typlen​
  • 基数的计算,该部分计算稍微复杂一些,分为以下三种情况:当采样中的值没有重复的时候,则认为所有的值唯一,​​stadistinct = -1​​​;当采样中的每个值都出现重复的时候,则认为基数有限,则s​​tadistinct = distinct_value_number​​​;当采样中的值中,存在有唯一值并且存在不唯一值的时候,则依据以下的公式(by Haas and Stokes in IBM Research Report RJ 10025):​​n * d / (n - f1 + f1 * n/N)​​,其中,N是指所有的记录数,即pg_class.reltuples;n是指sample_row_number,即采样的记录数;f1则是只出现一次的值的数据;d则是采样中所有的值的数量。
->stats_valid = true;    
stats->stanullfrac = (double) null_cnt / (double) samplerows; /* Do the simple null-frac and width stats */
if (is_varwidth) stats->stawidth = total_width / (double) nonnull_cnt;
else stats->stawidth = stats->attrtype->typlen;
if (nmultiple == 0) { /* If we found no repeated non-null values, assume it's a unique column; but be sure to discount for any nulls we found. */ // 如果没有发现重复的非空值,则假设它是唯一的列;但一定要为我们发现的任何空值打折。
stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
}else if (toowide_cnt == 0 && nmultiple == ndistinct) { /* Every value in the sample appeared more than once. Assume the column has just these values. (This case is meant to address columns with small, fixed sets of possible values, such as boolean or enum columns. If there are any values that appear just once in the sample, including too-wide values, we should assume that that's not what we're dealing with.) */ // 样本中的每个值都出现了多次。假设列只有这些值。(本例旨在处理具有较小固定值集的列,例如布尔或枚举列。如果有任何值在示例中只出现一次,包括太宽的值,我们应该假设这不是我们要处理的。)
stats->stadistinct = ndistinct;
}else{
/* Estimate the number of distinct values using the estimator proposed by Haas and Stokes in IBM Research Report RJ 10025: n*d / (n - f1 + f1*n/N) where f1 is the number of distinct values that occurred exactly once in our sample of n rows (from a total of N), and d is the total number of distinct values in the sample. This is their Duj1 estimator; the other estimators they recommend are considerably more complex, and are numerically very unstable when n is much smaller than N. In this calculation, we consider only non-nulls. We used to include rows with null values in the n and N counts, but that leads to inaccurate answers in columns with many nulls, and it's intuitively bogus anyway considering the desired result is the number of distinct non-null values. Overwidth values are assumed to have been distinct. */
int f1 = ndistinct - nmultiple + toowide_cnt, d = f1 + nmultiple;
double n = samplerows - null_cnt, N = totalrows * (1.0 - stats->stanullfrac), stadistinct;
if (N > 0) stadistinct = (n * d) / ((n - f1) + f1 * n / N); /* N == 0 shouldn't happen, but just in case ... */
else stadistinct = 0;
if (stadistinct < d) stadistinct = d; /* Clamp to sane range in case of roundoff error */
if (stadistinct > N) stadistinct = N;
stats->stadistinct = floor(stadistinct + 0.5); /* And round to integer */
}
/* If we estimated the number of distinct values at more than 10% of the total row count (a very arbitrary limit), then assume that stadistinct should scale with the row count rather than be a fixed value. */ // 如果我们估计相异值的数量超过总行数的10%(非常任意的限制),则假设stadistinct应随行数而定,而不是固定值
if (stats->stadistinct > 0.1 * totalrows) stats->stadistinct = -(stats->stadistinct / totalrows);

MCV / MCF 并不是所有采样的值都会被列入MCV/MCF。首先是如果可以,则将所有采样的记录放到MCV中,如表所有的记录都已经取作采样的时候;其次,则是选取那些出现频率超过平均值的值,事实上是超过平均值的25%;那些出现频率大于直方图的个数的倒数的时候等(详情查看analyze_mcv_list函数)。

/* Decide how many values are worth storing as most-common values. If we are able to generate a complete MCV list (all the values in the sample will fit, and we think these are all the ones in the table), then do so.  Otherwise, store only those values that are significantly more common than the values not in the list. Note: the first of these cases is meant to address columns with small, fixed sets of possible values, such as boolean or enum columns.  If we can *completely* represent the column population by an MCV list that will fit into the stats target, then we should do so and thus provide the planner with complete information.  But if the MCV list is not complete, it's generally worth being more selective, and not just filling it all the way up to the stats target. */ // 确定有多少值值得存储为最常见的值。如果我们能够生成一个完整的MCV列表(样本中的所有值都适合,我们认为这些都是表中的值),那么就这样做。否则,只存储比列表中的值更常见的值。注意:这些情况中的第一种情况用于处理具有较小固定值集的列,例如布尔或枚举列。如果我们可以*完全地*表示符合统计目标的MCV列表的列人口,那么我们应该这样做,从而为planner提供完整的信息。但是,如果MCV列表不完整,则通常值得更具选择性,而不仅仅是将其一直填充到统计目标。
if (track_cnt == ndistinct && toowide_cnt == 0 && stats->stadistinct > 0 && track_cnt <= num_mcv) {
num_mcv = track_cnt; /* Track list includes all values seen, and all will fit */
}else{
int *mcv_counts;
/* Incomplete list; decide how many values are worth keeping */
if (num_mcv > track_cnt) num_mcv = track_cnt;
if (num_mcv > 0) {
mcv_counts = (int *) palloc(num_mcv * sizeof(int));
for (i = 0; i < num_mcv; i++) mcv_counts[i] = track[i].count;
num_mcv = analyze_mcv_list(mcv_counts, num_mcv, stats->stadistinct, stats->stanullfrac, samplerows, totalrows);
}
}

产生MCV / MCF列表值,MCV列表元素就是​​values[track[num_mcv].first].value​​​,MCF列表元素就是​​(double) track[i].count / (double) samplerows​​,分别设置到stats->stanumbers和stats->numvalues中。

if (num_mcv > 0){ /* Generate MCV slot entry */
Datum *mcv_values; float4 *mcv_freqs;
MemoryContext old_context = MemoryContextSwitchTo(stats->anl_context); /* Must copy the target values into anl_context */
mcv_values = (Datum *) palloc(num_mcv * sizeof(Datum)); // MCV存储最常见值的数组
mcv_freqs = (float4 *) palloc(num_mcv * sizeof(float4)); // MCF存储最常见值频率的数组
for (i = 0; i < num_mcv; i++){
mcv_values[i] = datumCopy(values[track[i].first].value, stats->attrtype->typbyval, stats->attrtype->typlen);
mcv_freqs[i] = (double) track[i].count / (double) samplerows;
}
MemoryContextSwitchTo(old_context);
stats->stakind[slot_idx] = STATISTIC_KIND_MCV; // STATISTIC_KIND_MCV为1
stats->staop[slot_idx] = mystats->eqopr;
stats->stacoll[slot_idx] = stats->attrcollid;
stats->stanumbers[slot_idx] = mcv_freqs;
stats->numnumbers[slot_idx] = num_mcv;
stats->stavalues[slot_idx] = mcv_values;
stats->numvalues[slot_idx] = num_mcv;
/* Accept the defaults for stats->statypid and others. They have
been set before we were called (see vacuum.h) */
slot_idx++;
}

计算直方图,会首先排除掉MCV中的值。意思是直方图中的数据不包含MCV/MCF的部分,两者的值是补充关系而且不会重合,但不一定互补(两种加起来未必是全部数据)。这个也与成本的计算方式有关系,请参考row-estimation-examples 。其计算公式相对比较简单,如下:values[(i * (nvals - 1)) / (num_hist - 1)],i指直方图中的第几列;nvals指当前还有多少个值;num_hist则指直方图中还有多少列。计算完成后,kind的值会被置为2。

/* Generate a histogram slot entry if there are at least two distinct values not accounted for in the MCV list.  (This ensures the histogram won't collapse to empty or a singleton.) */ // 如果MCV列表中至少有两个不同的值未考虑,则生成直方图槽条目。(这确保了直方图不会塌陷为空或单例。)
num_hist = ndistinct - num_mcv;
if (num_hist > num_bins) num_hist = num_bins + 1;
if (num_hist >= 2){
MemoryContext old_context;
Datum *hist_values;
int nvals, pos, posfrac, delta, deltafrac;
qsort((void *) track, num_mcv, sizeof(ScalarMCVItem), compare_mcvs); /* Sort the MCV items into position order to speed next loop */ // 将MCV项目排序到位,以加快下一个循环
/* Collapse out the MCV items from the values[] array. Note we destroy the values[] array here... but we don't need it for anything more. We do, however, still need values_cnt. nvals will be the number of remaining entries in values[]. */ // 从 values数组中折叠MCV项。注意,我们在这里销毁了values[]数组……但我们不再需要它了。然而,我们仍然需要value_cnt。nvals将是 values中的剩余条目数
if (num_mcv > 0){
int src,dest,j = 0; /* index of next interesting MCV item */
while (src < values_cnt){
int ncopy;
if (j < num_mcv){
int first = track[j].first;
if (src >= first){/* advance past this MCV item */
src = first + track[j].count; j++; continue;
}
ncopy = first - src;
}else ncopy = values_cnt - src;
memmove(&values[dest], &values[src], ncopy * sizeof(ScalarItem));
src += ncopy;
dest += ncopy;
}
nvals = dest;
}
else nvals = values_cnt;

old_context = MemoryContextSwitchTo(stats->anl_context); /* Must copy the target values into anl_context */
hist_values = (Datum *) palloc(num_hist * sizeof(Datum));

/* The object of this loop is to copy the first and last values[] entries along with evenly-spaced values in between. So the i'th value is values[(i * (nvals - 1)) / (num_hist - 1)]. But computing that subscript directly risks integer overflow when the stats target is more than a couple thousand. Instead we add (nvals - 1) / (num_hist - 1) to pos at each step, tracking the integral and fractional parts of the sum separately. */
delta = (nvals - 1) / (num_hist - 1);
deltafrac = (nvals - 1) % (num_hist - 1);
pos = posfrac = 0;
for (i = 0; i < num_hist; i++){
hist_values[i] = datumCopy(values[pos].value, stats->attrtype->typbyval, stats->attrtype->typlen);
pos += delta;
posfrac += deltafrac;
if (posfrac >= (num_hist - 1)){/* fractional part exceeds 1, carry to integer part */
pos++;
posfrac -= (num_hist - 1);
}
}
MemoryContextSwitchTo(old_context);
stats->stakind[slot_idx] = STATISTIC_KIND_HISTOGRAM; // histogram_bounds(直方图)
stats->staop[slot_idx] = mystats->ltopr; stats->stacoll[slot_idx] = stats->attrcollid;
stats->stavalues[slot_idx] = hist_values; stats->numvalues[slot_idx] = num_hist;
slot_idx++;
}

设置correlation(关联度),计算公式为​​(values_cnt * corr_xysum - corr_xsum * corr_xsum) / (values_cnt * corr_x2sum - corr_xsum * corr_xsum)​​​,其中corr_xysum为所有采样行排序顺序乘物理顺序的积的和,corr_xsum为​​((double) (values_cnt - 1)) * ((double) values_cnt) / 2.0​​​,corr_x2sum为​​((double) (values_cnt - 1)) * ((double) values_cnt) * (double) (2 * values_cnt - 1) / 6.0​​。

if (values_cnt > 1){ /* Generate a correlation entry if there are multiple values */
float4 *corrs = = (float4 *) palloc(sizeof(float4));
double corr_xsum,corr_x2sum;
MemoryContextSwitchTo(MemoryContextSwitchTo(stats->anl_context); /* Must copy the target values into anl_context */);

/* Since we know the x and y value sets are both
* 0, 1, ..., values_cnt-1
* we have sum(x) = sum(y) = (values_cnt-1)*values_cnt / 2
* and sum(x^2) = sum(y^2) = (values_cnt-1)*values_cnt*(2*values_cnt-1) / 6. */
corr_xsum = ((double) (values_cnt - 1)) * ((double) values_cnt) / 2.0;
corr_x2sum = ((double) (values_cnt - 1)) * ((double) values_cnt) * (double) (2 * values_cnt - 1) / 6.0;
corrs[0] = (values_cnt * corr_xysum - corr_xsum * corr_xsum) / (values_cnt * corr_x2sum - corr_xsum * corr_xsum); /* And the correlation coefficient reduces to */

stats->stakind[slot_idx] = STATISTIC_KIND_CORRELATION; // correlation(关联度)
stats->staop[slot_idx] = mystats->ltopr; stats->stacoll[slot_idx] = stats->attrcollid;
stats->stanumbers[slot_idx] = corrs;
stats->numnumbers[slot_idx] = 1;
slot_idx++;
}
}

​​https://www.interdb.jp/pg/pgsql03.html​​​ PostgreSQL 内核 ANALYZE 背后的事
PostgreSQL · 特性分析 · 统计信息计算方法