In a previous post, we showed the C# code to process raw data of 10K rows of gender, height and corresponding weight. Coding it in C# takes a while, but writing it from scratch is useful for understanding the model behind it. In this tutorial, we show the relevant commands and functions to fit the linear model (weight over height) using the R programming language, which is designed for exactly this kind of task (statistical analysis).
We still use the same data: the CSV file containing the 10K records.
First, we need to read the CSV into R using read.csv, like this:
data = read.csv("data.csv", header=T);
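Since the first line of the CSV contains the column labels, we pass the header=T parameter (T is shorthand for TRUE) so that this line is used for the column names instead of being read as data. The data is now a data frame (a table of rows and columns), and you can use the dim function to verify its size: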
dim(data)
[1] 10000     5
The nrow and ncol functions return the number of rows and columns respectively.
nrow(data)
[1] 10000
ncol(data)
[1] 5
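As a quick sanity check after loading, it can help to look at the object's class and column types as well as its size. The sketch below is self-contained: it writes a tiny made-up CSV to a temporary file, and the column names Gender, Height and Weight are assumptions for illustration, not necessarily the labels in the real file.

```r
# Write a tiny CSV with made-up rows so the example runs on its own.
# The column names are hypothetical; the real file may label them differently.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Gender,Height,Weight",
  "Male,175.3,78.2",
  "Female,162.1,58.9",
  "Male,181.0,85.4"
), csv_path)

sample_data <- read.csv(csv_path, header = TRUE)

class(sample_data)    # "data.frame"
dim(sample_data)      # 3 rows, 3 columns
names(sample_data)    # column labels taken from the header line
str(sample_data)      # type of each column
head(sample_data, 2)  # first two records
```

The same calls should work unchanged on the real data object once it has been read in.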
Next, we need to separate the data into two sets: the training data set and the verification data set. A common mistake among data-mining learners is to run the machine learning algorithm on a data set and then make predictions on that same data set, which normally gives misleadingly good results because the model has already seen the answers.
We can split the data by using even and odd indices.
training = data[seq(1, nrow(data), 2), ]
verification = data[seq(2, nrow(data), 2), ]
Each contains half of the records:
dim(training)
[1] 5000    5
dim(verification)
[1] 5000    5
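The even/odd split is deterministic. One common alternative, which is not what this post uses, is a random split with sample(). Here is a minimal sketch on made-up data; the toy data frame and the seed value are arbitrary choices for illustration.

```r
set.seed(42)  # arbitrary seed, only so the split is reproducible

# Made-up stand-in for the real data: 10 records, 2 columns.
toy <- data.frame(height = seq(150, 195, by = 5),
                  weight = seq(50, 95, by = 5))

# Draw half of the row indices at random for training.
train_idx <- sample(nrow(toy), size = nrow(toy) / 2)

toy_training     <- toy[train_idx, ]
toy_verification <- toy[-train_idx, ]  # negative index = all remaining rows

dim(toy_training)      # 5 rows, 2 columns
dim(toy_verification)  # 5 rows, 2 columns
```

A random split may be preferable when the rows of the file follow some regular pattern, since the even and odd rows could then differ systematically.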
Now, let’s further extract these data into variables.
training_weight = training[,5]
training_height = training[,4]
verification_weight = verification[,5]
verification_height = verification[,4]
The weight data is in the fifth column and the height data is in the fourth column (in R, indices start at one, not zero).
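Columns of a data frame can also be selected by name, which tends to be less fragile than a position if the file layout changes. The sketch below uses a made-up data frame with hypothetical column names (Gender, Height, Weight); check names(data) for the labels in the real file.

```r
# Made-up data frame; the column names are assumptions for illustration.
toy <- data.frame(Gender = c("Male", "Female"),
                  Height = c(175, 162),
                  Weight = c(78, 59))

by_index <- toy[, 3]         # third column, selected by position
by_name  <- toy$Weight       # same column, selected by name
by_label <- toy[, "Weight"]  # another way to select by name

identical(by_index, by_name)   # TRUE
identical(by_index, by_label)  # TRUE
```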
Linear Fit using lm() function in R
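Now we can use lm() to fit the linear model on the training data. Here height and weight are short aliases for the training vectors:
height = training_height; weight = training_weight
fit = lm(weight~height)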
This constructs the linear model weight = k * height + b, where k is the slope and b is the intercept. What we get is:
fit

Call:
lm(formula = weight ~ height)

Coefficients:
(Intercept)       height
   -158.101        1.372
That means the model is weight = 1.372 * height - 158.101. Next, we can plot the data together with the fitted line.
plot(height, weight, col='blue', xlab='height (cm)', ylab='weight (kg)')
abline(fit, col='red')
This plots the data points in blue and draws the fitted line in red.
How good is the model?
We can make predictions using the linear model and compare the accuracy:
pred_weight = 1.372 * verification_height - 158.101
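Typing the rounded coefficients by hand works, but the fitted object already carries them at full precision. As a sketch on made-up data (the toy_* names, the seed and the invented relationship are all assumptions for illustration), coef() extracts the coefficients and predict() applies the model to new heights:

```r
set.seed(7)  # arbitrary seed for reproducible made-up data

# Invented training data roughly following weight = 1.4 * height - 160.
toy_train <- data.frame(height = runif(50, min = 150, max = 195))
toy_train$weight <- 1.4 * toy_train$height - 160 + rnorm(50, sd = 5)

toy_fit <- lm(weight ~ height, data = toy_train)

toy_verification_height <- c(160, 170, 180)

# Option 1: pull the coefficients out of the fit and apply the formula.
b <- coef(toy_fit)[["(Intercept)"]]
k <- coef(toy_fit)[["height"]]
pred_manual <- k * toy_verification_height + b

# Option 2: let predict() do it; the column name in newdata must match
# the variable name used in the model formula.
pred_auto <- predict(toy_fit,
                     newdata = data.frame(height = toy_verification_height))

all.equal(unname(pred_auto), pred_manual)  # TRUE
```

With the fit from this post, the equivalent call would be predict(fit, newdata = data.frame(height = verification_height)).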
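The root-mean-square error (RMSE) of the predictions is:
RMSE = sqrt( sum( (P - W)^2 ) / n )
where P is the predicted weight, W is the verification weight, and n is the number of predictions.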
sqrt(sum((pred_weight-verification_weight)^2)/length(pred_weight))
[1] 5.545324
A reasonably good model (linear fit) is achieved, with an RMSE of about 5.5 kg and a mean error (the average of P - W) of -0.07 kg.
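Both error measures are one-liners in R. The helper functions below are a minimal sketch; the names mean_error and rmse and the three-element toy vectors are made up for illustration.

```r
# Mean error: average signed difference; values near zero suggest that
# the predictions are not systematically too high or too low.
mean_error <- function(pred, actual) mean(pred - actual)

# Root-mean-square error: typical size of a prediction error.
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))

# Tiny made-up example: the errors are 1, 0 and -2.
toy_pred   <- c(2, 4, 6)
toy_actual <- c(1, 4, 8)

mean_error(toy_pred, toy_actual)  # (1 + 0 - 2) / 3 = -0.3333333
rmse(toy_pred, toy_actual)        # sqrt((1 + 0 + 4) / 3) = 1.290994
```

Applied to this post's vectors, the calls would be mean_error(pred_weight, verification_weight) and rmse(pred_weight, verification_weight).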
Separate males from the females
The above model does not distinguish between males and females. You could fit each group separately by first splitting the data with the which function, which returns the indices of the elements where a condition is met.
male=data[which(data[,1]=="Male"), ]
female=data[which(data[,1]=="Female"), ]
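Once the data is split, each group can be fitted with lm() in the same way as before. The sketch below runs on made-up data: the column names, the seed and the invented height/weight relationships are assumptions for illustration, not results from the real data set.

```r
set.seed(1)  # arbitrary seed for reproducible made-up data

n <- 200
toy <- data.frame(
  Gender = rep(c("Male", "Female"), each = n / 2),
  height = c(rnorm(n / 2, mean = 175, sd = 7),
             rnorm(n / 2, mean = 162, sd = 7))
)
# Invented relationship plus noise, different for each group.
toy$weight <- ifelse(toy$Gender == "Male",
                     1.1 * toy$height - 105,
                     0.9 * toy$height - 85) + rnorm(n, sd = 4)

# Split by gender with which(), as in the post.
male_toy   <- toy[which(toy$Gender == "Male"), ]
female_toy <- toy[which(toy$Gender == "Female"), ]

# Fit one linear model per group.
fit_male   <- lm(weight ~ height, data = male_toy)
fit_female <- lm(weight ~ height, data = female_toy)

coef(fit_male)    # intercept and slope for the male group
coef(fit_female)  # intercept and slope for the female group
```

As with the combined model, each group would still need its own training and verification split before judging accuracy.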