Stock returns are notoriously noisy, and as a result little can be learned from historical data. One of the major building blocks of academic finance, Markowitz portfolio optimization with its mean-variance framework, is rarely applied in practice. The reason: the optimization requires parameter estimates as inputs. Because the estimation errors of these inputs are large, the optimization tends to overfit the historical data and disappoint in subsequent time periods.

If you think about it carefully, a large part of the noise component of individual stocks can be diversified away. A well-known example comes from asset pricing, where stocks are grouped into portfolios to get better estimates. Portfolios diversify the noise component, so their parameters are easier to estimate.

This post also aims at diversifying estimation errors, but the approach differs: instead of grouping individual stocks into portfolios, we ask whether fundamentally similar stocks (company size, value, profitability, etc.) can improve idiosyncratic estimates via clustering. The intuition is simple. Economically similar stocks should have similar return characteristics and similar covariances with other groups or clusters of stocks.

## Computing Stock Neighbors using KNN

To obtain local measures, I first need to create localities, namely regions of stocks that are fundamentally similar. I use stock characteristics to measure the similarity between two assets 𝑖 and 𝑗. Factor returns and stock-level characteristic signals are taken from Dacheng Xiu’s website. To find suitable neighborhoods, I apply the k-nearest neighbors (KNN) algorithm.

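The nearest neighbors algorithm finds the closest neighbors of an observation 𝑎, given a sample of feature realizations 𝑋 in Euclidean space, by minimizing the Euclidean distance between 𝑎 and every other observation 𝑏. Writing 𝑝 for the number of features, the distance is:

$$d(a, b) = \sqrt{\sum_{m=1}^{p} (a_{m} - b_{m})^{2}}$$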

The algorithm proceeds as follows:

- Calculate the distance between observation 𝑎 and every observation 𝑏 ≠ 𝑎 in the data
- Add each distance, together with the label of 𝑏, to an ordered collection
- Sort the collection by distance, from smallest to largest
- Pick the 𝑘 closest neighbors

These four steps are repeated for all 𝑎 in the sample.
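The four steps can be sketched with a brute-force NumPy loop. This is a minimal illustration rather than the code used for the results in this post: the function name `k_nearest_neighbors` and the toy matrix `X` are hypothetical, and it assumes `k` is smaller than the number of observations.

```python
import numpy as np


def k_nearest_neighbors(X: np.ndarray, k: int) -> np.ndarray:
    """For each row of X, return the indices of its k closest other rows."""
    n = X.shape[0]
    neighbors = np.empty((n, k), dtype=int)
    for a in range(n):
        # Step 1: Euclidean distance from observation a to every observation.
        dist = np.sqrt(((X - X[a]) ** 2).sum(axis=1))
        dist[a] = np.inf  # exclude a itself, since we need b != a
        # Steps 2-3: order candidate indices by distance (stable sort,
        # so ties are broken by the lower index).
        order = np.argsort(dist, kind="stable")
        # Step 4: keep the k closest.
        neighbors[a] = order[:k]
    return neighbors


# Toy cross-section: four "stocks" described by two characteristics.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]])
nbrs = k_nearest_neighbors(X, k=2)
print(nbrs)
```

For large cross-sections a tree-based or vectorized neighbor search would be faster; the explicit loop is kept here only because it mirrors the steps one to one.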

In my case, each observation 𝑎 is one realization of a stock return 𝑟𝑖,𝑡+1 of stock 𝑖 at time 𝑡+1. The feature realizations 𝑋 are stock characteristics at time 𝑡. Steps 1 to 4 are performed separately for each cross-section. Once all neighbors for the cross-section at time 𝑡 are found, I can calculate my cluster estimates. A cluster estimate is simply the average of the individual estimates of the 𝑘 nearest neighbors. For example, with ten fundamental peers, the cluster estimate of average returns $E[\mu_{i}]$ is the mean over the set $N_{i}$ of the ten nearest neighbors of stock 𝑖:

$$E[\mu_{i}] = \frac{1}{10} \sum_{j \in N_{i}} \mu_{j}$$
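In code, averaging over neighbors reduces to one indexing operation. The sketch below assumes the neighbor indices are already available as an array with one row per stock, and that a stock is not counted among its own neighbors. The names `cluster_estimate`, `mu_hat`, and `neighbors` are hypothetical, and the numbers are made up for illustration.

```python
import numpy as np


def cluster_estimate(estimates, neighbors) -> np.ndarray:
    """Average the individual estimates over each stock's neighbors.

    estimates: shape (n,), one individual estimate per stock.
    neighbors: shape (n, k), row i holds the indices of stock i's neighbors.
    """
    estimates = np.asarray(estimates, dtype=float)
    neighbors = np.asarray(neighbors, dtype=int)
    # Fancy indexing yields an (n, k) array of neighbor estimates.
    return estimates[neighbors].mean(axis=1)


# Made-up individual mean-return estimates for four stocks.
mu_hat = np.array([0.01, 0.03, -0.02, 0.06])
# Made-up neighbor indices with k = 2 (the post's example uses ten peers).
neighbors = np.array([[1, 2], [0, 2], [0, 1], [2, 1]])
mu_cluster = cluster_estimate(mu_hat, neighbors)
print(mu_cluster)
```

The same function works for any scalar moment estimate, such as a variance, as long as it is stored as one value per stock.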

## Cluster Estimates

For the empirical results, I calculate the return moments with a five-year rolling window. After obtaining the estimates, I test their out-of-sample predictability for the subsequent year using $R^{2}_{OS}$ / relative mean squared error. The resulting cluster estimates outperform their single-security counterparts by a large margin. They also outperform other commonly applied benchmarks, such as the simple average of stock estimates or a “0 Benchmark” that predicts zero mean returns for each stock.
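The post does not spell out the evaluation formulas, so the following sketch uses a common convention as an assumption: the relative mean squared error is the forecast's MSE divided by the benchmark's MSE, and the out-of-sample R² is one minus that ratio. The function names and the sample numbers are hypothetical.

```python
import numpy as np


def relative_mse(realized, forecast, benchmark) -> float:
    """MSE of the forecast divided by the MSE of the benchmark."""
    realized = np.asarray(realized, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    benchmark = np.asarray(benchmark, dtype=float)
    mse_forecast = np.mean((realized - forecast) ** 2)
    mse_benchmark = np.mean((realized - benchmark) ** 2)
    return float(mse_forecast / mse_benchmark)


def r2_os(realized, forecast, benchmark) -> float:
    """Out-of-sample R^2 (assumed convention): 1 - relative MSE."""
    return 1.0 - relative_mse(realized, forecast, benchmark)


# Made-up realized values and forecasts for three stocks.
realized = [0.02, -0.01, 0.03]
forecast = [0.01, 0.00, 0.02]
zero_benchmark = np.zeros(3)  # the "0 Benchmark": predict zero for every stock

rel = relative_mse(realized, forecast, zero_benchmark)
r2 = r2_os(realized, forecast, zero_benchmark)
print(rel, r2)
```

Under this convention a relative MSE below one, or equivalently a positive out-of-sample R², means the forecast beats the benchmark.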