Please go to the link, download the lenses data set, and perform classification

Please find the lenses data set from the link provided in the file and perform the classifications in R.

Also see the project reference file for an idea of how to write the code.

CS 5630 Data Mining Group Project

Data
The project is for data classification. Please find the data set on the website https://archive.ics.uci.edu/ml/datasets/Lenses and perform different classification techniques.

Project Report Format
The final report must be in PDF format using font size 11 and 1.5 line spacing. A separate text file of your R program is also required. There is no page limit for the project report; you can include any materials in your report as long as they are related to the project, but the report must include the following items (in order):
- A cover page with the project title and data set name
- A blank page
- An introduction to the project
- Result of decision tree using the holdout method (the training data set takes 80% of the data size)
- Result of decision tree using bagging
- Result of decision tree using random forest
- Result of decision tree using boosting
- Result of Naïve Bayes classifier
- Result of support vector machine using a linear kernel with different costs
- Result of support vector machine using a radial kernel with different costs and gammas
- Result of support vector machine using polynomial kernels with different costs and gammas
- Comparison of the multiple classification techniques
- Conclusion of your project
- References
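The reference file below contains no code for the Lenses data themselves, so a starting point may help. The sketch that follows is a minimal, hedged example of loading the data and fitting one holdout decision tree in the style of the Carseats examples later in the file. The raw-file URL (lenses.data), the column layout (an ID column followed by four attributes and the class), and the attribute names are assumptions taken from the UCI page; verify them against the data description there before using the code.

#Minimal sketch: load the Lenses data and fit a holdout decision tree (assumptions noted above)
library(tree)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/lenses/lenses.data"
lenses = read.table(url, header=FALSE,
                    col.names=c("id","age","prescription","astigmatic","tear_rate","class"))
lenses = lenses[,-1]                 #drop the ID column (assumed to be the first column)
lenses[] = lapply(lenses, factor)    #all attributes are categorical codes, so convert to factors
set.seed(1)
train = sample(1:nrow(lenses), round(0.8*nrow(lenses)))   #80% of the data as the training set
lenses.train = lenses[train,]
lenses.test  = lenses[-train,]
tree.lenses = tree(class~., lenses.train)
tree.pred = predict(tree.lenses, lenses.test, type="class")
table(tree.pred, lenses.test$class)  #confusion matrix on the test set
mean(tree.pred != lenses.test$class) #test error rate

The same training/test split can then be reused for bagging, random forest, boosting, Naïve Bayes, and the SVM variants required in the report.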
CHAPTER 1

1. Basic operations
2 + 3*5
log(7)            #natural logarithm
log(7, base=10)   #logarithm of 7 with base 10
4^2               #4 raised to the second power
3/2               #division
sqrt(16)          #square root
abs(3-7)          #absolute value of 3-7
pi                #the number pi
exp(2)            #e raised to the power 2
a^b               #base a with exponent b
15%/%4            #integer division (quotient); use 15%%4 for the remainder

2. R objects and names
There are two main concepts behind the R language: objects and functions. An object can be seen as a storage space with an associated name. Everything in R is stored in an object. Since there are many built-in functions in R, make sure that the object names you assign have not already been used by R. A simple way of checking is to type in the name you want to use. If R returns an error message telling you that the object is not found, it is safe to use this name. For example, the concatenating function c is a built-in function used to combine elements, so NEVER assign an object to the name c! Try typing exp to see another name that is already taken.

3. R data types
Basic data types:
Numeric: decimal value, 10.34
Integer: integer value, 3
Complex: complex value, 4-5i
Logical: TRUE or FALSE, T or F
Character: string value

Advanced data types:

Vector:
x=c(1,2,5.3,6,-2)           #numeric vector
x=c("one","two","three")    #character vector
x=c(TRUE,TRUE,TRUE,FALSE)   #logical vector

Matrix:
x=matrix(c(1,3,2,10,-6,5),nrow=2,ncol=3)
x=matrix(c(1, 3, 2, 10,-6,5, 7, 4, 8),nrow=3,ncol=3, byrow=T)
All entries in a matrix must have the same type: numeric, integer, complex, logical, or character.

Data Frame:
A data frame is the most important data type. It is a table, a spreadsheet, or a data set. It is more general than a matrix because different columns can have different data types.
#create a data table x with three variables
d=c(1,2,3,4)
e=c("red", "white", "red", NA)
f=c(TRUE,TRUE,TRUE,FALSE)
x=data.frame(d,e,f)
#The variables have no names yet; now we can give a name to each variable
names(x)=c("ID","Color","Passed")

Factor:
Factors are variables in R which take on a limited number of different values; such variables are often referred to as categorical variables. R stores a categorical variable as a factor. Internally, the factor stores the categorical variable as a vector of integers. Character strings are not supported directly by most machine learning algorithms, so strings must be converted to integers.
#create a variable gender with 20 "male" entries and 30 "female" entries
gender=c(rep("male",20), rep("female", 30))   #gender is a vector of 50 entries
class(gender)            #display the class of the variable
gender_factor=factor(gender)
class(gender_factor)
#Now R treats gender_factor as a nominal variable. Internally it stores the 30 "female" entries
#as 1s and the 20 "male" entries as 2s, because levels are assigned alphabetically (1=female, 2=male).
summary(gender_factor)   #the number of values in each level

A categorical variable can be either nominal or ordinal. The next example is for ordinal variables.
#Create an ordinal categorical vector
day_vector=c('evening', 'morning', 'afternoon', 'midday', 'midnight', 'evening', 'morning')
#Convert day_vector to a factor with ordered levels
factor_day=factor(day_vector, order = TRUE, levels =c('morning', 'midday', 'afternoon', 'evening', 'midnight'))
#Print the new variable
factor_day
summary(factor_day)

#seq() and rep() provide convenient ways to construct vectors with certain patterns.
seq(10)
seq(0,1,length=10)   #There are 10 values
seq(0,1,by=0.1)      #There are 11 values
rep(1,3)
c(rep(1,3),rep(2,2),rep(-1,4))
rep("Small",3)
c(rep("Small",3),rep("Medium",4))
rep(c("Low","High"),3)
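If the Lenses attributes are loaded as integer codes (as on the UCI page), the factor machinery above is exactly what converts them into categorical variables for classification. A small, hedged illustration follows; the level labels are taken from the UCI attribute description for the age attribute and are an assumption to verify.

#hypothetical integer-coded age values as they might appear in lenses.data
age = c(1, 2, 3, 1, 2)
#convert the codes to a labeled factor (labels assumed from the UCI description)
age_factor = factor(age, levels=c(1,2,3), labels=c("young","pre-presbyopic","presbyopic"))
age_factor
summary(age_factor)   #counts per level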
4. Vector operations
x=c(1,3,2,10,5)
y=1:5                   #create a vector of consecutive integers
y+2
2*y
y^2                     #raise each component to the second power
2^y                     #raise 2 to the first through fifth power
x+y
x*y
x/y
x^y
sum(x)
cumsum(x)               #cumulative sum vector
sort(x)                 #increasing order
sort(x, decreasing=T)   #decreasing order
length(x)               #number of elements in x
x[3]                    #the third element of x
x[3:5]                  #the third to fifth elements of x
x[-2]                   #all except the second element
x[x>3]                  #elements of x greater than 3

5. Matrix operations
x=c(1, 3, 2, 10,5)
y=c(1, 2, 3, 4, 5)
m=cbind(x,y)
m=rbind(x,y)
t(m)       #transpose of m
dim(m)     #dimension of matrix m
m=matrix(c(1,3,2, 5,-1,2, 2,3,9),ncol=3,byrow=T)
m[2,3]     #element of m at the second row, third column
m[2,]      #second row
m[,3]      #third column
m[-1,]     #submatrix of m without the first row
m[,-1]     #submatrix of m without the first column
m[-1,-1]   #submatrix of m without the first row and column
m1=matrix(1:4, ncol=2)
m2=matrix(c(10,20,30,40),ncol=2)
2*m1       #scalar multiplication
m1+m2      #matrix addition
m1*m2      #component-wise multiplication, NOT the usual matrix multiplication
m1%*%m2    #usual matrix multiplication

6. Functions for basic statistics
x = c(0,4,15, 1, 6, 3, 20, 5, 8, 1, 3)
mean(x)            #mean
median(x)          #median
var(x)             #variance
sd(x)              #standard deviation
quantile(x,0.25)   #25th percentile
quantile(x,c(0,0.25,0.5,0.75,1))
summary(x)         #five-number summary plus the mean
IQR(x)             #interquartile range
range(x)           #minimum to maximum

7. R Graphing
Bar Chart
x = c(1, 3, 6, 4, 9)
barplot(x, xlab="Title on x-axis", ylab="Title on y-axis", main="Title of the chart")

Scatter Plot (both x and y are numerical variables)
x = c(2, 3, 6, 8, 9, 12)
y = c(6, 9, 14, 2, 7, 23)
plot(x,y)

Box Plot
x = c(1, 2, 1, 2, 1, 2)
boxplot(x)

Histogram
x = c(4,4,6,6,16,2,7,9,12)
hist(x)
hist(x, breaks=c(2, 5, 8, 9, 14, 17))
hist(x, prob=T)   #if probability instead of frequency is desired

8. Distributions
The prefix "d" is for the probability density value (pdf), "p" for the cumulative probability (cdf), "q" for the quantile, and "r" for random numbers from the distribution.
Normal distributions
rnorm(10)                  #generates 10 random numbers from the standard Normal distribution
set.seed(10)
rnorm(10)
rnorm(20, mean=10, sd=3)   #generates 20 random numbers from the Normal distribution with mean 10 and standard deviation 3
x=seq(-8,8,length=500)
y=dnorm(x,mean=0,sd=1)
plot(x,y)
pnorm(90,100,10)                #return the area below x=90
pnorm(90,100,10,lower.tail=F)   #return the area above x=90
qnorm(0.95,100,10)              #return the critical value for the 95th percentile
qnorm(c(.025,.975),100,10)      #critical values for the 2.5th and 97.5th percentiles

9. Read external data sets
#Load an external text file. Pay attention to the options header and na.strings
md=read.table("G:/My Drive/Teaching/Data/CS5630/Auto.data")
fix(md)   #open a viewable table
md=read.table("G:/My Drive/Teaching/Data/CS5630/Auto.data", header=T)
md=read.table("G:/My Drive/Teaching/Data/CS5630/Auto.data", header=T, na.strings ="?")
pairs(md)   #scatterplot matrix for all variables
#scatterplot matrix for selected variables
pairs(~mpg+displacement+horsepower+weight+acceleration, md)
pairs(md[, c(1,3, 4, 5, 6)])
#If the values of the external file are separated by tab or comma, you need to provide the sep option: sep="\t" or sep=","
#load an external csv file
md=read.csv("G:/My Drive/Teaching/Data/CS5630/Auto.csv", header=T, na.strings ="?")

10. User-defined R functions
Syntax:
function_name=function(arg1, arg2, ...){
  statements
  return(object)
}
#An object in the function is local to the function. The object returned can be any data type.
#It is not necessary to include the return statement; R automatically returns whichever value
#is on the last line of the body of the function.
fahrenheit_to_celsius=function(temp_F) {
  temp_C=(temp_F-32)*5/9
  return(temp_C)
}

CHAPTER 2

Distance functions
a = c(22, 1, 42, 10)
b = c(20, 0, 36, 8)
sqrt(sum((a-b)^2))        #Euclidean distance
sum(abs(a-b))             #City block, Manhattan, taxicab distance
r=3                       #choose the order of the Minkowski distance (r=3 as an example)
(sum(abs(a-b)^r))^(1/r)   #Minkowski distance of order r
max(abs(a-b))             #Supremum distance

library(pracma)
d1 = c(3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d2 = c(1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
sum(d1*d2)/(Norm(d1)*Norm(d2))   #Cosine similarity
cor(d1, d2)                      #Correlation measure

library(philentropy)
a = c(22, 1, 42, 10)
b = c(20, 0, 36, 8)
u = c(12, 18, 34, 5)
#Create a dissimilarity matrix
x = rbind(a,b,u)
distance(x, method = "euclidean")
distance(x, method = "manhattan")
#To find out which methods are implemented in distance(), consult the getDistMethods() function.
getDistMethods()
a = c(1, 0, 0, 1, 0, 0, 0, 0, 0, 0)
b = c(0, 0, 0, 1, 0, 1, 1, 0, 0, 1)
x=rbind(a, b)
distance(x, method = "jaccard")
distance(x, method = "cosine")
#Note that the Jaccard distance and the Jaccard coefficient are different: JC = 1 - JD.

CHAPTER 3

#R programming for classifications
library(ISLR)   #Load the data set Carseats
library(tree)   #The tree library is for making classification decision trees
#To check the values of the data set Carseats
fix(Carseats)
#To check the description of all attributes
?Carseats
attach(Carseats)
#To discretize the numerical attribute Sales
High=ifelse(Sales<=8,"No","Yes")
#Use the data.frame() function to merge High with the rest of the Carseats data
Carseats=data.frame(Carseats,High)
Carseats=Carseats[,-1]
dim(Carseats)
#Convert the response variable to a factor.
Carseats$High=factor(Carseats$High, levels=c("Yes","No"))
#To make a decision tree that classifies the attribute High based on all other attributes
tree.carseats=tree(High~.,Carseats)
#The summary() function lists the attributes that are used as internal nodes in the tree, the
#number of terminal nodes, and the (training) error rate. Note that a small residual mean
#deviance indicates a tree that provides a good fit to the data.
summary(tree.carseats)
#Use the plot() function to display the tree structure, and the text() function to display the
#node labels. The argument pretty=0 instructs R to include the category names for any
#categorical attributes, rather than simply displaying a letter for each category.
plot(tree.carseats)
text(tree.carseats,pretty=0)
#Typing the name of the tree object prints output corresponding to each branch of the tree:
#the split criterion, the number of observations in that branch, the deviance, and the overall
#prediction for the branch (Yes or No).
tree.carseats
#The predict() function is used to test the model. The argument type="class" instructs R to
#return the actual class prediction.
tree.pred=predict(tree.carseats,type="class")
#Create the confusion matrix
table(tree.pred,High)
#Correct prediction rate
mean(tree.pred==High)
#Error (misclassification) rate
mean(tree.pred!=High)
#In order to properly evaluate the performance of a classification tree on these data, we must
#estimate the test error rather than simply computing the training error. We split the
#observations into a training set and a test set, build the tree using the training set, and
#evaluate its performance on the test data.
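The Carseats examples that follow use a fixed training size of 200 observations. For the project report, the holdout method must instead use 80% of the data as the training set. A minimal variant of the split is sketched below; the 0.8 proportion is the project requirement, and everything else mirrors the reference code that follows.

#Sketch: 80% holdout split in the style of the reference code below
set.seed(3)
train=sample(1:nrow(Carseats), round(0.8*nrow(Carseats)))   #80% of the rows for training
Carseats.train=Carseats[train,]
Carseats.test=Carseats[-train,]
High.test=High[-train]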
#Randomly select a sample of 200 observations of the data set as a training set and use the
#rest of the data set as a test set.
set.seed(3)
train=sample(1:nrow(Carseats), 200)
Carseats.train=Carseats[train,]
Carseats.test=Carseats[-train,]
High.test=High[-train]
tree.carseats=tree(High~.,Carseats.train)
tree.pred=predict(tree.carseats,Carseats.test,type="class")
table(tree.pred,High.test)
mean(tree.pred!=High.test)
#Now we consider whether pruning the tree might lead to improved results. The function
#cv.tree() performs cross-validation in order to determine the optimal level of tree
#complexity. The argument FUN=prune.misclass indicates that we want the error rate to guide
#the cross-validation and pruning process.
set.seed(2)
cv.carseats=cv.tree(tree.carseats,FUN=prune.misclass)
cv.carseats
#Despite the name, dev corresponds to the cross-validation error rate in classification.
#We plot the error rate as a function of tree size.
plot(cv.carseats$size,cv.carseats$dev,type="b")
#Check the tree with size 6
prune.carseats=prune.misclass(tree.carseats,best=6)
plot(prune.carseats)
text(prune.carseats,pretty=0)
tree.pred=predict(prune.carseats,Carseats.test,type="class")
table(tree.pred,High.test)
mean(tree.pred!=High.test)
#Check the tree with size 10
prune.carseats=prune.misclass(tree.carseats,best=10)
plot(prune.carseats)
text(prune.carseats,pretty=0)
tree.pred=predict(prune.carseats,Carseats.test,type="class")
table(tree.pred,High.test)
mean(tree.pred!=High.test)
#Check the tree with size 12
prune.carseats=prune.misclass(tree.carseats,best=12)
plot(prune.carseats)
text(prune.carseats,pretty=0)
tree.pred=predict(prune.carseats,Carseats.test,type="class")
table(tree.pred,High.test)
mean(tree.pred!=High.test)
#Check the tree with size 13
prune.carseats=prune.misclass(tree.carseats,best=13)
plot(prune.carseats)
text(prune.carseats,pretty=0)
tree.pred=predict(prune.carseats,Carseats.test,type="class")
table(tree.pred,High.test)
mean(tree.pred!=High.test)

#Bagging and Random Forests
The exact results obtained in this section may depend on the version of R and the version of the randomForest package installed on your computer. Recall that bagging is simply a special case of a random forest with m = p. Therefore, the randomForest() function can be used to perform both random forests and bagging.
library(randomForest)
#Some arguments of the function randomForest():
#mtry: the number of variables randomly sampled as candidates at each split.
#ntree: the number of trees to grow.
#For bagging, ntree indicates how many bootstrapped trees are generated, and mtry is set to the
#total number of predictors so that all variables are considered at each split.
#Check with 100 trees
set.seed(3)
tree.carseats=randomForest(High~.,Carseats.train, ntree=100,mtry=10)
tree.pred=predict(tree.carseats,Carseats.test,type="class")
table(tree.pred,High.test)
mean(tree.pred!=High.test)
#Check with 500 trees
set.seed(3)
tree.carseats=randomForest(High~.,Carseats.train, ntree=500, mtry=10)
tree.pred=predict(tree.carseats,Carseats.test,type="class")
table(tree.pred,High.test)
mean(tree.pred!=High.test)
#Check with 1000 trees
set.seed(3)
tree.carseats=randomForest(High~.,Carseats.train, ntree=1000, mtry=10)
tree.pred=predict(tree.carseats,Carseats.test,type="class")
table(tree.pred,High.test)
mean(tree.pred!=High.test)
#By default, randomForest() uses about sqrt(p) variables when building a random forest of
#classification trees; here sqrt(10) = 3.16.
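The runs below try mtry = 4, 3, and 5 one block at a time. The same comparison can also be collected in a short loop; the sketch here is an illustration added for convenience, not part of the original reference file, and it assumes the Carseats.train, Carseats.test, and High.test objects created above.

#Sketch: compare random forest test error over several mtry values
library(randomForest)
for (m in c(3, 4, 5)) {
  set.seed(3)
  rf = randomForest(High~., Carseats.train, ntree=500, mtry=m)
  pred = predict(rf, Carseats.test, type="class")
  cat("mtry =", m, " test error =", mean(pred != High.test), "\n")
}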
set.seed(3)
tree.carseats=randomForest(High~.,Carseats.train, ntree=500, mtry=4)
tree.pred=predict(tree.carseats,Carseats.test,type="class")
table(tree.pred,High.test)
mean(tree.pred!=High.test)
set.seed(3)
tree.carseats=randomForest(High~.,Carseats.train, ntree=500, mtry=3)
tree.pred=predict(tree.carseats,Carseats.test,type="class")
table(tree.pred,High.test)
mean(tree.pred!=High.test)
set.seed(3)
tree.carseats=randomForest(High~.,Carseats.train, ntree=500, mtry=5)
tree.pred=predict(tree.carseats,Carseats.test,type="class")
table(tree.pred,High.test)
mean(tree.pred!=High.test)
#To check the variable importance in bagging and random forests, use the functions importance()
#and varImpPlot(). The mean decrease in Gini index for each variable is displayed.
importance(tree.carseats)
varImpPlot(tree.carseats)

#Boosting requires the Generalized Boosted Regression Models (gbm) package
library(gbm)
#With the boosting technique the trees are grown sequentially: each tree is grown using
#information from previously grown trees. Boosting does not involve bootstrap sampling;
#instead, each tree is fit on a modified version of the training data set.
#For binary classification, the response variable should be 0 or 1 when using the Bernoulli distribution.
Carseats$High=ifelse(Carseats$High=="Yes",1,0)
set.seed(3)
train=sample(1:nrow(Carseats),200)
Carseats.train=Carseats[train,]
Carseats.test=Carseats[-train,]
High.test=High[-train]
#The option n.trees specifies the number of trees
set.seed(3)
tree.carseats=gbm(High~., Carseats.train, distribution="bernoulli",n.trees=100)
#Note that the predict() function returns the probability of the class coded 1 ("Yes"), so we
#need to set a threshold in order to get the correct confusion table.
tree.pred.prob=predict(tree.carseats, Carseats.test, n.trees=100, type="response")
tree.pred=ifelse(tree.pred.prob>0.5, "Yes", "No")
table(High.test, tree.pred)
mean(tree.pred!=High.test)
set.seed(3)
tree.carseats=gbm(High~., Carseats.train, distribution="bernoulli",n.trees=200)
tree.pred.prob=predict(tree.carseats, Carseats.test, n.trees=200, type="response")
tree.pred=ifelse(tree.pred.prob>0.5, "Yes", "No")
table(High.test, tree.pred)
mean(tree.pred!=High.test)
set.seed(3)
tree.carseats=gbm(High~., Carseats.train, distribution="bernoulli",n.trees=300)
tree.pred.prob=predict(tree.carseats, Carseats.test, n.trees=300, type="response")
tree.pred=ifelse(tree.pred.prob>0.5, "Yes", "No")
table(High.test, tree.pred)
mean(tree.pred!=High.test)
set.seed(3)
tree.carseats=gbm(High~., Carseats.train, distribution="bernoulli",n.trees=400)
tree.pred.prob=predict(tree.carseats, Carseats.test, n.trees=400, type="response")
tree.pred=ifelse(tree.pred.prob>0.5, "Yes", "No")
table(High.test, tree.pred)
mean(tree.pred!=High.test)
tree.carseats=gbm(High~., Carseats.train, distribution="bernoulli",n.trees=500)
tree.pred.prob=predict(tree.carseats, Carseats.test, n.trees=500, type="response")
tree.pred=ifelse(tree.pred.prob>0.5, "Yes", "No")
table(High.test, tree.pred)
mean(tree.pred!=High.test)

CHAPTER 4

#K-Nearest Neighbors (KNN) Classifier
#The knn() function requires 4 inputs:
#A matrix containing the predictors associated with the training data
#A matrix containing the predictors associated with the test data
#A vector containing the class labels for the training data
#A value for k, the number of nearest neighbors to be used by the classifier
library(class)
library(ISLR)
#The Smarket data set consists of percentage returns for the S&P 500 stock index over 1,250 days,
#from the beginning of 2001 until the end of 2005.
#For each date, we have recorded the percentage returns for each of the five previous trading
#days, Lag1 through Lag5.
?Smarket
attach(Smarket)
fix(Smarket)
train=(Year<2005)
trainSet=cbind(Lag1,Lag2)[train,]
testSet=cbind(Lag1,Lag2)[!train,]
train.Label=Direction[train]
test.Label=Direction[!train]
knn.pred=knn(trainSet,testSet,train.Label,k=1)
table(knn.pred,test.Label)
mean(knn.pred!=test.Label)
knn.pred=knn(trainSet,testSet,train.Label,k=3)
table(knn.pred,test.Label)
mean(knn.pred!=test.Label)
knn.pred=knn(trainSet,testSet,train.Label,k=5)
table(knn.pred,test.Label)
mean(knn.pred!=test.Label)

#Naive Bayes classification
#The e1071 library contains implementations of a number of classification methods, including
#the Naive Bayes classifier and the Support Vector Machine.
library(e1071)
#Note: if Carseats was modified in Chapter 3 (Sales removed and High recoded for boosting),
#reload the original data set first, e.g. data("Carseats", package="ISLR"), before running this section.
attach(Carseats)
High=ifelse(Sales<=8,"No","Yes")
Carseats=data.frame(Carseats,High)
Carseats=Carseats[,-1]
Carseats$High=factor(Carseats$High,levels=c("Yes","No"))
#Fit the Naive Bayes model
NB.model=naiveBayes(High~., Carseats)
#What does the model say? Print the model summary
NB.model
#Prediction on the data set
NB.prediction=predict(NB.model,Carseats)
#Performance on the entire data set
table(NB.prediction,High)
mean(NB.prediction!=High)
#Performance on a test data set
set.seed(3)
train=sample(1:nrow(Carseats), 200)
Carseats.train=Carseats[train,]
Carseats.test=Carseats[-train,]
High.test=High[-train]
NB.model=naiveBayes(High~.,Carseats.train)
NB.prediction=predict(NB.model,Carseats.test)
table(NB.prediction,High.test)
mean(NB.prediction!=High.test)

#Support Vector Classifier
#The cost argument controls the tradeoff between classifying training points accurately and
#having a smooth decision boundary. It allows us to specify the cost of a violation of the margin.
#When the cost argument is small, the margins are wide and many support vectors lie on the margin
#or violate it; we get lower variance and higher bias. When the cost argument is large, the
#margins are narrow and few support vectors lie on the margin or violate it; we get higher
#variance and lower bias.
#A kernel transforms the data into another feature space in which a clear dividing margin
#between the classes may exist.
svm.model=svm(High~., Carseats.train, kernel="linear", cost=1)
svm.prediction=predict(svm.model,Carseats.test)
table(svm.prediction, High.test)
mean(svm.prediction!=High.test)
#The function tune.svm() performs cross-validation. By default, tune.svm() performs 10-fold
#cross-validation.
set.seed(3)
tune.out=tune.svm(High~.,data=Carseats.train, kernel="linear", cost=c(0.0001,0.01,0.1,1,5,10,100,1000))
#We can access the cross-validation errors for each of these models using the summary() command
summary(tune.out)
#We see that cost=5 results in the lowest cross-validation error rate.
#The tune.svm() function stores the best model obtained.
bestmodel=tune.out$best.model
summary(bestmodel)
svm.prediction=predict(bestmodel,Carseats.test)
table(svm.prediction, High.test)
mean(svm.prediction!=High.test)
#In order to use a nonlinear kernel, we use a different value of the kernel parameter. To fit an
#SVM with a polynomial kernel we use kernel="polynomial", and to fit an SVM with a radial kernel
#we use kernel="radial". In the former case we also use the degree argument to specify a degree
#for the polynomial kernel, and in the latter case we also specify a value of gamma for the
#radial basis kernel.
#The parameter gamma defines how far the influence of a single training example reaches. If the
#value of gamma is high, the decision boundary depends mainly on the points close to it, since
#nearby points carry much more weight than faraway points; the decision boundary becomes more
#wiggly in this case. If the value of gamma is low, faraway points also carry substantial weight,
#and the decision boundary becomes smoother, more like a straight line.
svm.model=svm(High~.,data=Carseats.train, kernel="radial", gamma=1, cost=1)
svm.prediction=predict(svm.model,Carseats.test)
table(svm.prediction, High.test)
mean(svm.prediction!=High.test)
#If we increase the value of cost, we can reduce the number of training errors. However, this
#comes at the price of a more irregular decision boundary that seems to be at risk of overfitting
#the data.
#We can perform cross-validation using tune.svm() to select the best choice of gamma and cost
#for an SVM with a radial kernel.
set.seed(3)
tune.out=tune.svm(High~.,data=Carseats.train, kernel="radial", cost=c(0.0001,0.01,0.1,1,5,10,100,1000), gamma=c(0.05,0.5,1,2,3,4,5))
summary(tune.out)
bestmodel=tune.out$best.model
summary(bestmodel)
svm.prediction=predict(bestmodel,Carseats.test)
table(svm.prediction, High.test)
mean(svm.prediction!=High.test)
#Use a polynomial kernel
svm.model=svm(High~.,data=Carseats.train, kernel="polynomial", degree=2, gamma=1, cost=1)
svm.prediction=predict(svm.model,Carseats.test)
table(svm.prediction, High.test)
mean(svm.prediction!=High.test)
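The project report also asks for polynomial-kernel results over different costs and gammas. A minimal sketch of the corresponding cross-validation search follows, using the same tune.svm() pattern shown above for the linear and radial kernels; the particular grids of degree, cost, and gamma values are illustrative choices, not requirements.

#Sketch: cross-validate a polynomial-kernel SVM over cost, gamma, and degree
set.seed(3)
tune.out=tune.svm(High~., data=Carseats.train, kernel="polynomial",
                  degree=c(2,3), cost=c(0.01,0.1,1,5,10,100), gamma=c(0.05,0.5,1,2))
summary(tune.out)                 #cross-validation error for each parameter combination
bestmodel=tune.out$best.model
summary(bestmodel)
svm.prediction=predict(bestmodel,Carseats.test)
table(svm.prediction, High.test)  #confusion matrix on the test set
mean(svm.prediction!=High.test)   #test error rate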
