Designing an optimal KNN regression model for predicting house price with Boston Housing Dataset

Mar 7, 2021

Hello dear readers. In this article, I present Python code for a regression model that uses the K-Nearest Neighbour (KNN) algorithm to predict house prices in Boston. The code also contains a function to estimate the best value of k from the elbow curve. The code is developed from scratch, using only the Euclidean distance measure, and all points in each neighbourhood are weighted equally.

(Kindly note that this article does not describe the KNN algorithm or its underlying concepts. Please refer to relevant articles from other posts on medium.com.)

The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. There are 506 observations with 13 features (independent variables) such as the number of rooms (rm), crime rate (crim), an air pollution variable (nox), the cost of public services in each community (tax), the pupil-teacher ratio (ptratio), etc. The dependent/target variable is the house price, given in thousands of dollars.

Problem Statement: Given a set of features that describe a house in Boston, design an optimal KNN regression model that can predict the house price for any given house.

The code goes here:

The required libraries are loaded and the dataset is imported in this section.

# built in datasets and other required functions are imported from 
# sklearn 
from sklearn import datasets
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import math

# Loading the Boston dataset
# (load_boston was removed in scikit-learn 1.2, so this needs an older version)
boston=datasets.load_boston()
x=boston.data[:,:]
y=boston.target
print(x.shape,y.shape)

tsize=0.30 # 30% of the data is used for testing and 70% for training

## splitting the dataset into training and testing sets,
# (parameter random state is fixed at some integer, to ensure the 
# same train and test sets across various runs)
xtrain,xtest,ytrain,ytest=train_test_split(x,y,test_size=tsize,random_state=102)

The code to explore the dataset characteristics and view a few observations (records) is presented here.

## Exploring the dataset characteristics and having a glimpse of the data
# Print the sizes of the training and testing data sets
print(xtrain.shape,ytrain.shape)
print(xtest.shape,ytest.shape)

# Print the information contained within the dataset
print("\nKeys of boston dataset: \n{}".format(boston.keys()))
print(boston['DESCR'][:500] + "\n...")

# Print the feature names
print("\nFeature names: \n{}".format(boston['feature_names']))
# Print the first few rows
print("\nFirst five rows of data:\n{}".format(boston['data'][:5]))
# Print the target values of the first few data points
print("\nTarget:\n{}".format(boston['target'][:5]))
# Print the dimensions of the data
print("\nShape of data: {}".format(boston['data'].shape))

The function for predicting the house price for the given house (feature vector tx) is presented next.

## function to find the Euclidean distance between two vectors
def edist(v1,v2):
    return np.sqrt(np.sum((v1-v2)**2))

## function to predict the value for a given test vector tx using knn
def knn_reg(tr_x, tr_y, tx , k):
   
    distances = []
    
    #Find distances between new data and all the training data
    for i in range(tr_x.shape[0]):
        distances.append(edist(tr_x[i], tx))
    
    #sort the distances in ascending order
    distances = np.array(distances)
    inds = np.argsort(distances)
    
    distances = distances[inds]
    tr_y_sorted = tr_y[inds] #sorted values of target variable
    
    #predicted value is the average of first k values of target
    #vector
    value = np.average(tr_y_sorted[:k])
    return value
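
As a side note, the same prediction can be written without the explicit loop by letting NumPy compute all distances at once. The following is only a minimal sketch, not the code used in this article; the name knn_predict_vectorized and the toy data are made up for illustration.

```python
import numpy as np

def knn_predict_vectorized(tr_x, tr_y, tx, k):
    """Average the targets of the k training rows closest to tx (Euclidean)."""
    # Broadcasting subtracts tx from every training row in one step
    distances = np.sqrt(np.sum((tr_x - tx) ** 2, axis=1))
    # Indices of the k smallest distances (stable sort keeps ties in row order)
    nearest = np.argsort(distances, kind="stable")[:k]
    return float(np.mean(tr_y[nearest]))

# Toy data (hypothetical): one feature, four training points
toy_x = np.array([[1.0], [2.0], [3.0], [10.0]])
toy_y = np.array([10.0, 20.0, 30.0, 100.0])

# The two nearest neighbours of 2.2 are x=2 and x=3, so the
# prediction is the mean of their targets, 20 and 30.
print(knn_predict_vectorized(toy_x, toy_y, np.array([2.2]), k=2))  # 25.0
```

With equal weights, the prediction is simply the mean target of the k closest rows, which is what the toy example shows.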

The function to find the Mean Squared Error (MSE) for a given value of k is given below.

##Function to find mean squared error for the entire test dataset
def knn_mse(tr_x , tr_y, test_x , test_y , k):
    preds = []
    for i in range(test_x.shape[0]):
        value = knn_reg(tr_x, tr_y, test_x[i] , k)
        preds.append(value)
    
    preds  = np.array(preds)
    err = mean_squared_error(test_y , preds)
    return err
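
One way to gain confidence in a from-scratch implementation is to compare its MSE with scikit-learn's KNeighborsRegressor using uniform weights and the Euclidean metric. The sketch below is an illustration only: it uses synthetic stand-in data (not the Boston dataset), and the names scratch_mse, x_demo, y_demo and k_demo are assumptions made for this example.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (NOT the Boston dataset): 200 rows, 3 features
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 3))
y_demo = x_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
x_tr, x_te = x_demo[:150], x_demo[150:]
y_tr, y_te = y_demo[:150], y_demo[150:]
k_demo = 5

def scratch_mse(tr_x, tr_y, te_x, te_y, k):
    """MSE of an equal-weight, Euclidean KNN regressor written by hand."""
    preds = []
    for row in te_x:
        d = np.sqrt(np.sum((tr_x - row) ** 2, axis=1))
        preds.append(np.mean(tr_y[np.argsort(d)[:k]]))
    return mean_squared_error(te_y, np.array(preds))

# Library model configured to match: equal weights, Euclidean distance
model = KNeighborsRegressor(n_neighbors=k_demo, weights="uniform",
                            metric="euclidean")
model.fit(x_tr, y_tr)

mse_scratch = scratch_mse(x_tr, y_tr, x_te, y_te, k_demo)
mse_sklearn = mean_squared_error(y_te, model.predict(x_te))
print(mse_scratch, mse_sklearn)
print("Match:", np.isclose(mse_scratch, mse_sklearn))
```

If the two values differ noticeably, tied distances or a mismatch in the weighting or metric settings are the first things to check.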

The above function is used to find the Mean Squared Errors (MSEs) for various values of k, as given below.

## Finding MSEs for different values of k
maxk=int(math.sqrt(xtrain.shape[0])) # upper bound for k (square root of the training set size)
mse_val = [] # to store MSE values for different k
for k in range(1,maxk):
    error= knn_mse(xtrain , ytrain , xtest , ytest ,k)
    mse_val.append(error) # store the MSE value
    print('MSE value for k= ' , k , 'is:', error)

The elbow curve is then plotted, and the optimal value of k is found automatically using the find_elbow() function. Note that picking the k with the minimum MSE is not enough in all cases, so the code selects an elbow point instead of the minimum. See the code below.

## plotting the elbow curve
k=np.arange(1,maxk)
xl="k"
yl="MSE"
plt.xlabel(xl)
plt.ylabel(yl)
plt.title("Elbow Curve")
plt.plot(k,mse_val)

## finding the k for the elbow point
# (find_elbow() must be defined before this point; its code is not listed here)
ke=find_elbow()
print("Best Value of k using elbow curve is ",ke)
plt.plot(ke,mse_val[ke-1],'rx')
plt.annotate("  elbow point", (ke,mse_val[ke-1]))
plt.show()
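
For readers who want to experiment, here is one possible way to locate an elbow point. It is a sketch under stated assumptions, not the find_elbow() used in this article: it picks the point lying farthest from the straight line that joins the first and last points of the curve, after rescaling both axes to the range 0 to 1. The name find_elbow_sketch, its arguments and the sample error values are all hypothetical.

```python
import numpy as np

def find_elbow_sketch(k_values, errors):
    """Return the k whose (k, error) point lies farthest from the chord
    joining the first and last points of the curve.
    Assumes at least three points with distinct k values."""
    ks = np.asarray(k_values, dtype=float)
    errs = np.asarray(errors, dtype=float)
    # Rescale both axes to [0, 1] so neither one dominates the distance
    xs = (ks - ks.min()) / (ks.max() - ks.min())
    span = errs.max() - errs.min()
    ys = (errs - errs.min()) / span if span > 0 else np.zeros_like(errs)
    # End points of the chord
    x0, y0, x1, y1 = xs[0], ys[0], xs[-1], ys[-1]
    # Perpendicular distance of every point from the chord
    dist = np.abs((y1 - y0) * xs - (x1 - x0) * ys + x1 * y0 - y1 * x0)
    dist = dist / np.hypot(x1 - x0, y1 - y0)
    return int(k_values[int(np.argmax(dist))])

# Made-up error curve for illustration only (not results from this article)
demo_k = list(range(1, 9))
demo_err = [10, 6, 3, 2, 1.8, 1.7, 1.6, 1.5]
print(find_elbow_sketch(demo_k, demo_err))  # 3
```

This heuristic suits curves that drop steeply and then flatten. On an MSE curve that falls and then rises again, it may pick a different point than a visual inspection would, so the plot is still worth checking.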

Now observe the value of k at the elbow point in the graph and check whether it matches the value found by the find_elbow() function. Here both methods give 9. I am getting the following output:

Best Value of k using elbow curve is 9

In the above plot, the elbow point is marked with a red cross (x).

Now the model is ready with the optimal k, and it's time to predict the price of a given house. The code for the same is given below.

## Now the model is ready to predict the cost of a new house, with the given features in the xnew vector and ke as k
xnew=np.array([2.7310e-02, 0.0000e+00, 7.0700e+00, 0.0000e+00,
               4.6900e-01, 6.4210e+00, 7.8900e+01, 4.9671e+00, 2.0000e+00,
               2.4200e+02, 1.7800e+01, 3.9690e+02, 9.1400e+00])
hcost=knn_reg(xtrain, ytrain, xnew , ke)
print("Predicted price of the given house is {:.2f}".format(hcost),
      "thousand dollars")

The output I got here is:

Predicted price of the given house is 24.00 thousand dollars

To conclude, coding an algorithm from scratch (instead of using library functions for machine learning algorithms) requires a good understanding of the algorithm, and the attempt itself deepens that understanding.

Hope you liked the article. This is my first post on Machine learning using Python, so please give your comments, and suggestions to improve the code and report the mistakes if any.


Written by Kishor Keshav