Introduction
“Parametric” and “non-parametric” are two broad categories of statistical procedures. Parametric statistical procedures assume a particular form for the distribution of the population (e.g., a normal distribution) from which the sample was taken. They rely on the parameters of that assumed distribution (such as the mean and standard deviation). If the data deviate strongly from these assumptions, parametric techniques can lead to wrong conclusions.
In contrast, non-parametric statistical procedures do not assume a specific distribution, or its parameters, for the population. So, if you are not sure about the distribution your sample comes from, it is advisable to use non-parametric methods. One major disadvantage is that the results of non-parametric procedures can be harder to interpret than those of parametric procedures.
Kernel density estimation (KDE) is a non-parametric way to estimate the probability density function (PDF) of a random variable. In this post, I give Python code that uses gaussian_kde() from the scipy package to estimate the PDF. The gaussian_kde() function works for both univariate and multivariate data. By default, it uses Scott’s rule to determine the bandwidth for the kernel density estimation. If you want to set the bandwidth manually, you can use the parameter ‘bw_method’; a scalar passed there is used directly as the bandwidth factor (kde.factor), which scales the standard deviation of the data. The default bandwidth estimation method (Scott’s rule) works best for a unimodal distribution; bimodal or multimodal distributions tend to be oversmoothed.
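To make the definition concrete, here is a minimal sketch of what a univariate Gaussian KDE computes: the average of Gaussian bumps centred on each data point. It assumes the kernel width in data units is kde.factor times the sample standard deviation (with ddof=1), which is my reading of how gaussian_kde() scales its kernel. The helper manual_kde, the sample size, and the seed are illustrative choices, not part of scipy.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1007)  # seed chosen only for reproducibility
X = rng.random(500)                # uniform sample on [0, 1)

kde = stats.gaussian_kde(X)        # default bw_method is Scott's rule

# Kernel standard deviation in data units (assumption: factor times the
# sample standard deviation computed with ddof=1).
h = kde.factor * np.std(X, ddof=1)


def manual_kde(x, data, h):
    """Hypothetical helper: average of Gaussian kernels of width h."""
    z = (x[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))


points = np.linspace(-0.5, 1.5, 200)
manual = manual_kde(points, X, h)
max_diff = np.max(np.abs(manual - kde(points)))

print("kernel width h:", h)
print("max |manual - scipy|:", max_diff)
```

If the assumption about the scaling holds, the difference should be at the level of floating-point rounding, which also shows that kde.factor on its own is not the kernel width in data units.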
Bandwidth selection strongly influences the estimate obtained from the KDE. The following code uses Scott’s rule, Silverman’s rule, and a manually chosen value for the bandwidth. In the plot, you can see that a smaller bandwidth gives a rough, noisy PDF, while a larger bandwidth gives a smoother one.
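The effect of the bandwidth can also be put into numbers. The sketch below is one possible way to do it, not a standard recipe: it measures the roughness of each estimate as the total variation of the curve (the sum of absolute jumps between neighbouring grid points) and checks that the area under each curve stays close to 1. The helper total_variation, the grid, and the three bandwidth values are my own illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1007)  # seed chosen only for reproducibility
X = rng.random(2000)               # uniform sample on [0, 1)

# Fine grid, wide enough to hold almost all of the estimated mass.
grid = np.linspace(-2, 3, 5001)
dx = grid[1] - grid[0]


def total_variation(pdf):
    """Hypothetical roughness measure: sum of absolute jumps along the grid."""
    return np.sum(np.abs(np.diff(pdf)))


results = {}
for bw in (0.01, "scott", 1.0):
    kde = stats.gaussian_kde(X, bw_method=bw)
    pdf = kde(grid)
    results[bw] = {
        "factor": kde.factor,
        "roughness": total_variation(pdf),
        "area": np.sum(pdf) * dx,  # Riemann sum, should be close to 1
    }

for bw, r in results.items():
    print(bw, r)
```

The roughness should drop as the bandwidth grows, while the area stays near 1 in every case: the bandwidth changes how the probability mass is spread, not how much of it there is.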
Python code
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots() # Create a figure containing a single axes.
# generate 5000 random data points, uniformly distributed on [0, 1)
np.random.seed(1007)
X = np.random.random(5000)
# grid of points at which the estimated PDF is evaluated
points = np.linspace(-2, 2, 100)
# kernel density estimate using Gaussian kernels and bw_method='scott'
kde1 = stats.gaussian_kde(X, bw_method='scott')
# kde.factor is the bandwidth factor that scales the data's standard deviation
print("Bandwidth using Scott rule: ", kde1.factor)
# kernel density estimate using Gaussian kernels and bw_method='silverman'
kde2 = stats.gaussian_kde(X, bw_method='silverman')
print("Bandwidth using Silverman rule: ", kde2.factor)
# kernel density estimate using Gaussian kernels and a user-provided scalar,
# which is used directly as kde.factor
kde3 = stats.gaussian_kde(X, bw_method=0.01)
# plot pdf
p1, = ax.plot(points, kde1.evaluate(points), color="blue")
p2, = ax.plot(points, kde2.evaluate(points), color="red")
p3, = ax.plot(points, kde3.evaluate(points), color="green")
ax.legend([p1, p2, p3],
          ['BW={0}'.format(round(kde1.factor, 2)),
           'BW={0}'.format(round(kde2.factor, 2)),
           'BW={0}'.format(kde3.factor)],
          loc="best")
ax.set_xlabel("points")
ax.set_ylabel("pdf")
plt.tight_layout()
plt.show()
Output
Bandwidth using Scott rule: 0.18205642030260802
Bandwidth using Silverman rule: 0.19283850080052542
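The two printed values depend only on the sample size n and the number of dimensions d, not on the data themselves. As a rough check, the sketch below evaluates the closed-form expressions given in the scipy documentation for the two rules, n**(-1/(d+4)) for Scott and (n*(d+2)/4)**(-1/(d+4)) for Silverman, and compares them with kde.factor. The variable names and the seed are my own; any sample of the same size should give the same factors.

```python
import numpy as np
from scipy import stats

n, d = 5000, 1  # sample size and number of dimensions used in this post

# Closed-form bandwidth factors (unweighted data).
scott = n ** (-1.0 / (d + 4))
silverman = (n * (d + 2) / 4.0) ** (-1.0 / (d + 4))

# Any sample of size n works here; the seed is arbitrary.
X = np.random.default_rng(0).random(n)
kde_scott = stats.gaussian_kde(X, bw_method="scott")
kde_silverman = stats.gaussian_kde(X, bw_method="silverman")

print("Scott:    ", scott, kde_scott.factor)
print("Silverman:", silverman, kde_silverman.factor)
```

Because these are only factors, the kernel width in the units of the data is obtained by multiplying them by the standard deviation of the sample.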
