Multiple linear regression is a widely used machine learning approach that can be highly helpful for predicting house prices. It models the relationship between the target variable and a number of input variables, such as the number of bedrooms, the size of the property, and the location. Once trained on a dataset of known home prices and their corresponding input variables, the resulting model can be used to make predictions on new data, that is, to predict the price of a new house from its features. In this study, multiple linear regression was employed to forecast Melbourne house prices from a set of input factors. The linear regression model was first trained on the Melbourne housing dataset without any optimization approach. The model's performance was then improved by minimizing the mean squared error loss function during training using a variant of stochastic gradient descent, which enhanced the model's capacity to predict housing values more accurately. Performance could be further enhanced by investigating alternative strategies such as feature selection or testing various regression techniques. In conclusion, although linear regression can be a valuable technique for predicting real estate values, it is crucial to continuously assess and refine the models to obtain the best outcomes. The ultimate goal is a trustworthy and accurate model that can support price prediction and decision-making in the real estate market.
Owing to its practical uses in the real estate sector, predicting house prices has been a popular research topic in machine learning. Machine learning can produce precise and efficient house price predictions because its algorithms discover patterns and relationships in the data. Regression analysis is one of the most frequently used machine learning techniques for this task; models in this class seek to determine the relationship between the dependent variable and the independent variables.
The use of machine learning algorithms to forecast house values has been the subject of extensive research. Thamarai and Malarvizhi (2020) carried out a research project utilizing linear regression to estimate house prices based on lot size, number of bedrooms, and distance from the city center, among other factors.
Decision trees have also been routinely used to forecast home prices because they are simple to understand and apply (Brown, 2020). As an illustration, Mahamood (2013) employed a decision tree algorithm to forecast house values in Istanbul, Turkey, based on traits such as location, property age, and number of storeys. The study reported an accuracy of 85% and concluded that decision trees are suitable for predicting housing prices, especially in areas with distinctive characteristics. Other algorithms, such as neural networks and support vector machines, have also been applied to this problem.
Multiple linear regression is a statistical method used to forecast a numerical outcome variable from one or more predictor variables. Here, multiple linear regression was used to model Melbourne home prices as a function of a variety of property characteristics. Two models were produced and compared using an array of evaluation metrics.
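In its general form, the model can be written as

Price = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ + ε

where x₁ … xₚ are the predictor variables, β₀ … βₚ are the coefficients estimated from the data, and ε is the error term.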
The data on house prices and their attributes were obtained from the Kaggle repository (https://www.kaggle.com/datasets/anthonypino/melbourne-housing-market). The dataset contained more than 15 variables, a mix of continuous and categorical. Detailed descriptions of the data are found at https://www.featureranking.com/tutorials/statistics-tutorials/regression-case-study-melbourne-1/#Dataset-Details. The data was then loaded into Python for analysis.
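A minimal sketch of the loading step is given below. It assumes pandas and a local file named Melbourne_housing_FULL.csv; the actual name of the file downloaded from Kaggle may differ.

```python
import pandas as pd

# Load the Melbourne housing data; the file name here is an assumption.
df = pd.read_csv("Melbourne_housing_FULL.csv")

# Check the dimensions of the raw data: (rows, columns).
print(df.shape)
```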
The "Price" column, which reflects the sale price of the homes in Melbourne, serves as the dataset's dependent variable. This is the variable that was predicted using the independent variables. The independent variables are all the other columns in the dataset, which are:
Each of these factors can plausibly affect the house's sale price. The number of rooms, bathrooms, bedrooms, and parking spots is likely to be positively associated with the sale price, whereas factors like the building's age and distance from the CBD could be negatively correlated with it. The sale price may also be influenced by the type of residence, the method of sale, and the real estate agent. The property's latitude and longitude may additionally capture information about its location.
The dataset was loaded into Python in CSV format and its dimensions were checked: the raw data had 27,247 rows and 19 columns. It was, however, untidy.
It was therefore cleaned and prepared for further analysis. The number of missing values in each variable was checked; the counts per column are shown below.
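A minimal sketch of this check, assuming the data is held in a pandas DataFrame named df and that rows with missing values were simply dropped (the exact cleaning rule is an assumption):

```python
# Count the missing values in each column.
print(df.isnull().sum())

# Drop rows containing missing values to obtain the cleaned dataset.
df_clean = df.dropna()
print(df_clean.shape)
```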
The cleaned data had 8887 rows and 21 columns. This indicates a reduction in the number of observations as compared to the original dataset.
Descriptive statistics is a branch of statistics that involves the analysis and summarization of data using numerical and graphical approaches; its objective is to give a concise, understandable overview of a dataset's salient characteristics (Rendón-Macías, Villasís-Keever and Miranda-Novales, 2016). A descriptive analysis was carried out to examine the nature and patterns of the data. A box-and-whisker plot was constructed to check for the presence of outliers in the outcome variable. The results are shown below.
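A sketch of how such a plot can be produced with matplotlib, assuming the cleaned DataFrame df_clean from the previous step:

```python
import matplotlib.pyplot as plt

# Box-and-whisker plot of the sale price to reveal outliers.
plt.boxplot(df_clean["Price"])
plt.ylabel("Price (AUD)")
plt.title("Distribution of Melbourne house prices")
plt.show()
```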
The plot shows that a small number of homes are much more expensive than the majority, indicated by the outliers in the upper portion of the box plot. Outliers in a dataset can also be caused by incorrect data entry, extreme values in the population being sampled, or peculiar circumstances surrounding specific observations.
Thereafter, one-hot encoding was used to convert all the categorical variables into numerical indicator variables. One-hot encoding enables the inclusion of categorical variables in a multiple linear regression model (Yu et al., 2022); categorical variables cannot be included directly in a linear regression model without being transformed into numerical values. Nonetheless, it is crucial to be aware of the possibility of multicollinearity when using it. Multicollinearity emerges when two or more independent variables in a regression model are highly correlated, making it challenging to isolate the specific effect of each variable on the dependent variable.
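One way to perform the encoding in pandas is sketched below; the drop_first option is an assumption used here to avoid the dummy-variable trap, in which the indicator columns for one category are perfectly collinear:

```python
# One-hot encode all categorical columns; drop_first drops one indicator
# per category so the remaining columns are not perfectly collinear.
df_encoded = pd.get_dummies(df_clean, drop_first=True)
```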
Feature selection is the process of choosing a subset of relevant features to include in a machine learning model. By reducing overfitting, accelerating training, and improving generalization, feature selection primarily aims to enhance the accuracy and interpretability of the model. A correlation analysis was carried out to eliminate variables that could cause multicollinearity, using an absolute correlation coefficient of 0.7 as the threshold. However, all the variable pairs had correlations below this limit, so no features were removed.
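A minimal sketch of such a correlation screen, continuing from df_encoded above:

```python
import numpy as np

# Absolute pairwise correlations among the predictors.
corr = df_encoded.drop(columns=["Price"]).corr().abs()

# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Features involved in any pair whose correlation exceeds 0.7.
to_drop = [col for col in upper.columns if (upper[col] > 0.7).any()]
print(to_drop)  # empty here, since no pair exceeded the threshold
```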
The data was standardized. Standardization is the process of transforming variables so that they have a mean of zero and a standard deviation of one. This can help the regression model perform better by making the independent-variable coefficients comparable in magnitude and by minimizing the impact of outliers and of different measurement scales across variables (Starovoitov and Golub, 2021).
Standardizing the dependent variable is appropriate when its scale differs from that of the independent variables, and standardizing the independent variables is important when their scales differ from one another or when the regression model contains interaction or polynomial terms. The code below was used.
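A sketch of the standardization step using scikit-learn's StandardScaler, applied to the full dataset to mirror the order of steps described above (in practice the scaler is often fit on the training split only, to avoid information leakage):

```python
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Rescale every column to zero mean and unit standard deviation.
scaler = StandardScaler()
df_scaled = pd.DataFrame(
    scaler.fit_transform(df_encoded),
    columns=df_encoded.columns,
)
```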
The dataset was split into predictors and a dependent variable, and then further divided into training and testing portions: 80% of the data formed the training set, used to fit the model, and the remaining 20% formed the testing set, used to assess the model's performance on unseen data. The code used is shown below.
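A sketch of the split, assuming scikit-learn; the random_state value is an arbitrary assumption that makes the shuffle reproducible:

```python
from sklearn.model_selection import train_test_split

# Separate the predictors from the target variable.
X = df_scaled.drop(columns=["Price"])
y = df_scaled["Price"]

# Hold out 20% of the observations for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```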
Fitting a linear regression model requires finding the line of best fit that minimizes the sum of squared errors, and this is frequently done with an optimization technique such as stochastic gradient descent (SGD). SGD is a gradient-based optimization technique that minimizes the objective function by iteratively updating the linear regression model's parameters.
In contrast to batch optimization techniques, SGD adjusts the parameters based on a small subset of the dataset at each iteration, which makes it especially helpful for large datasets that cannot be loaded into memory at once. A linear regression model was fit both without and with this optimization algorithm. The process followed is shown below.
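A sketch of the two fits with scikit-learn; the SGDRegressor hyperparameters shown are assumptions, not the values used in the original study:

```python
from sklearn.linear_model import LinearRegression, SGDRegressor

# Model 1: ordinary least-squares linear regression (closed-form fit,
# no explicit iterative optimizer).
ols = LinearRegression()
ols.fit(X_train, y_train)

# Model 2: a linear model trained with stochastic gradient descent,
# minimizing the squared-error (MSE) loss.
sgd = SGDRegressor(loss="squared_error", max_iter=1000, random_state=42)
sgd.fit(X_train, y_train)
```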
The models were compared using the following evaluation metrics: mean squared error (MSE), mean absolute error (MAE), root mean squared error (RMSE), and the R2 score. They were computed as shown below.
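A sketch of how these metrics can be computed for both models on the held-out test set:

```python
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

for name, model in [("without SGD", ols), ("with SGD", sgd)]:
    preds = model.predict(X_test)
    mse = mean_squared_error(y_test, preds)
    print(f"Model {name}: "
          f"MSE={mse:.4f}, "
          f"MAE={mean_absolute_error(y_test, preds):.4f}, "
          f"RMSE={np.sqrt(mse):.4f}, "
          f"R2={r2_score(y_test, preds):.4f}")
```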
The evaluation measures for the two models show that using an optimization technique significantly improves performance. The model without optimization performs poorly, as shown by its very high MSE, MAE, and RMSE. The model trained with SGD, in comparison, exhibits significantly lower values for these metrics, indicating better performance.
For both models, the R2 score, which measures the proportion of variance in the target variable that can be accounted for by the predictor variables, is negative. This shows that the models fail to explain the variability in the target variable. It is worth noting that R2 can be negative when a model performs worse than simply predicting the mean of the target (a horizontal line). The model's poor performance without an optimization strategy may explain its especially low R2 score.
As a data scientist, I am aware that cleaning and pre-processing the data are essential processes in getting a dataset ready for analysis. Thus, I started by looking for and correcting any missing values. I also looked for outliers and made sure the dataset was accurate and representative.
I then moved on to exploratory data analysis (EDA). Using a box plot visualization, I studied the distribution of the data, discovered patterns and trends, and uncovered probable correlations between variables. EDA gave me insights into potential features to include in the regression model and helped me locate outliers in the dataset.
Feature selection came next: I employed correlation analysis to choose which features to include in the model. In summary, I followed a systematic approach to cleaning and pre-processing the dataset. These steps enabled me to build a regression model and, in turn, to draw conclusions and insights from the data.
After applying linear regression to a dataset of home sale prices, an optimization approach was employed to enhance the model's performance. According to evaluation criteria including MSE, MAE, RMSE, and R2, the model trained with the optimization algorithm outperformed the model without it. It is concerning, however, that the models fail to account for the variance in the target variable, as indicated by the negative R2 scores. There is therefore a need to investigate other strategies, such as feature selection or experimenting with different regression algorithms, to enhance performance. In general, linear regression can be a powerful tool for predicting property prices, but it is crucial to keep assessing and improving the models for the best results.
Other approaches could be investigated in order to enhance machine learning's ability to model home prices. To find the best-performing regression algorithm, one strategy is to test various ones. Even though linear regression is a widely used approach, its accuracy may not always be the best. With some datasets, other algorithms like decision trees, random forests, or support vector machines might work better. Several algorithms can be tested in order to determine which one performs the best on the given dataset.
Another approach is to find the most important predictors of property values through feature selection. This would involve examining the correlation between each predictor variable and the target variable and selecting only those with the highest correlations. By removing irrelevant or redundant features, this technique can reduce the complexity of the model and improve its forecast accuracy.
Adding extra data is another approach that might be used. It is possible that the dataset utilized for this analysis was not complete enough to include all the variables that affect home prices. It would be feasible to increase the model's accuracy by including extra data, such as information on the local economy, demographics of the area's residents, crime rates, or the standard of the area's schools.
It can also be advantageous to investigate ensemble approaches, which combine the predictions of several models to increase accuracy; bagging and boosting are examples. With these techniques, many models are trained on different subsets of the data, and their predictions are then combined to yield a more precise overall prediction. Ensemble approaches can frequently outperform individual models, particularly when the constituent models have varied strengths and limitations. A brief illustration is given below.
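As one hypothetical illustration, a random forest is a bagging ensemble of decision trees and could be tried on the same train/test split used above:

```python
from sklearn.ensemble import RandomForestRegressor

# Each tree is trained on a bootstrap sample of the training data and
# the trees' predictions are averaged (bagging).
forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))  # R2 on the held-out test data
```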
Finally, it is critical to remember that model performance can be constrained by the quality of the data. The data used to train and test the model must be accurate, representative, and free of bias; no amount of feature selection or optimization will improve the model's accuracy if the data are of poor quality.
Brown, R. (2020) Prognostication in the Medieval World. doi:10.1515/9783110499773.
Mahamood, S.M. (2013) ‘Perundangan Wakaf dan Isu-Isu Berbangkit’, JAWHAR [Preprint].
Rendón-Macías, M.E., Villasís-Keever, M.Á. and Miranda-Novales, M.G. (2016) ‘Descriptive statistics’, Revista Alergia Mexico [Preprint]. doi:10.29262/ram.v63i4.230.
Starovoitov, V. V. and Golub, Y.I. (2021) ‘Data normalization in machine learning’, Informatics [Preprint]. doi:10.37661/1816-0301-2021-18-3-83-96.
Thamarai, M. and Malarvizhi, S.P. (2020) ‘House Price Prediction Modeling Using Machine Learning’, International Journal of Information Engineering and Electronic Business [Preprint]. doi:10.5815/ijieeb.2020.02.03.
Yu, L. et al. (2022) ‘Missing Data Preprocessing in Credit Classification: One-Hot Encoding or Imputation?’, Emerging Markets Finance and Trade [Preprint]. doi:10.1080/1540496X.2020.1825935.