I recently read a blog post about how data scientists carry out their machine learning and artificial intelligence projects. I found that the most important tasks for a data scientist are gathering data and getting to know that data.
First things first: you should know what type of problem you want to solve, so start by defining your problem. In this tutorial, we want to predict flower classes.
Second, gather your data. If you are working on a financial problem, for example the stock exchange, datasets are available on many websites; however, there is plenty of data you cannot easily find on the web, or that you have to pay for. For this tutorial, we will get the data from http://www.ics.uci.edu/
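As a side note, if the UCI site happens to be unreachable, the same Iris dataset ships with scikit-learn, so you can load it offline. This is a minimal sketch, assuming scikit-learn and pandas are installed; the column handling here is my own illustration, not part of the original tutorial.

```python
# Offline fallback: scikit-learn bundles the Iris dataset,
# so no download is needed to follow along.
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
dataset = pd.DataFrame(iris.data, columns=iris.feature_names)
# Map the integer targets back to flower class names
dataset['class'] = [iris.target_names[i] for i in iris.target]

print(dataset.shape)               # 150 rows, 4 features + 1 class column
print(dataset['class'].nunique())  # 3 flower classes
```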
The third step is to load the data into memory using whichever programming language you like.
# Load dataset
import pandas

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)
After gathering the data and cleaning it to fit your needs, you can start getting to know your data.
# Step 3: Know about your data
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Dataset summary
print(dataset.describe())

# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(2, 2), sharex=False, sharey=False)
plt.show()

# histograms
dataset.hist()
plt.show()

# scatter plot matrix
scatter_matrix(dataset)
plt.show()
Now we have to split the data into parts, for example a training set and a validation set.
# Step 4: separate your data into parts, for example a training dataset and a validation dataset
from sklearn import model_selection

# Split-out validation dataset
array = dataset.values
X = array[:, 0:4]
Y = array[:, 4]
validation_size = 0.20

# Test options and evaluation metric
seed = 7
scoring = 'accuracy'
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X, Y, test_size=validation_size, random_state=seed)
In this step you have to evaluate models. The problem is: how do you know which type of model to use? Since you know the problem, in this case a prediction (classification) problem, you can choose among prediction models; but again, which prediction model is best for your dataset? That is what you need to find out.
# Step 5: Evaluate your algorithms and improve them
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
Now you can compare the results of the different models and select one accordingly. In my case, KNN has the highest accuracy, so I will use KNN for prediction.
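To make the comparison concrete, here is a small, self-contained sketch that cross-validates two of the models above on the bundled Iris data and prints each one's mean accuracy and standard deviation. The choice of just LR and KNN, and the `max_iter=200` setting, are my own illustrative assumptions.

```python
# Compare two candidate models by 10-fold cross-validation accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, Y = load_iris(return_X_y=True)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

for name, model in [('LR', LogisticRegression(max_iter=200)),
                    ('KNN', KNeighborsClassifier())]:
    scores = cross_val_score(model, X, Y, cv=kfold, scoring='accuracy')
    # Mean accuracy and standard deviation across the 10 folds
    print("%s: %.3f (%.3f)" % (name, scores.mean(), scores.std()))
```

The model with the higher mean accuracy (and a reasonably small standard deviation) is the one to carry forward to the prediction step.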
# Make predictions on validation dataset
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
To build a machine learning project, you have to invest most of your time in the data. If the data is not good, or there is not enough of it, your model will not work, or it will not perform well.
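Two quick checks that catch a lot of data problems early are counting missing values and checking class balance. This is a minimal sketch using pandas; the tiny DataFrame here is hypothetical, purely to illustrate the checks.

```python
# Basic data-quality checks: missing values and class balance.
import pandas as pd
import numpy as np

# Hypothetical toy data with one missing measurement
df = pd.DataFrame({
    'sepal-length': [5.1, 4.9, np.nan, 6.3],
    'class': ['setosa', 'setosa', 'virginica', 'virginica'],
})

print(df.isnull().sum())           # missing values per column
print(df['class'].value_counts())  # how many samples per class

df = df.dropna()                   # one simple option: drop incomplete rows
print(len(df))                     # rows remaining after cleaning
```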
I learned these steps from another blog; the reference to that blog is provided.