A Review of the Structure and Application of Scikit-Learn Datasets in Machine Learning Model Development
Abstract
Scikit-learn is a widely used Python library that provides a diverse range of datasets alongside robust machine learning algorithms. This paper presents a comprehensive review of the structure and practical applications of key datasets available in Scikit-learn, emphasizing their role in developing and evaluating machine learning models. The datasets are categorized into built-in, fetchable, and synthetic types, each suited for different research and educational purposes. Through Exploratory Data Analysis (EDA), visualization, and baseline modeling on representative datasets such as Iris, Breast Cancer Wisconsin, and California Housing, this study highlights how these datasets facilitate various machine learning tasks, including classification and regression. Insights into dataset characteristics like class distribution, feature separability, and target variability guide the selection and optimization of algorithms. Overall, this review underscores the value of Scikit-learn datasets as foundational resources for prototyping and education, while also advocating for integration with larger, real-world datasets to address complex industrial challenges.
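The three dataset categories and the baseline-modeling step described above can be illustrated with a minimal sketch. It is not the study's own code: the helper name `baseline_accuracy`, the choice of a standardized logistic regression as the baseline, the 75/25 split, and the `make_classification` parameters are all assumptions made for illustration. Fetchable datasets such as `fetch_california_housing` download on first use, so the sketch only mentions them in a comment to stay runnable offline.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer, load_iris, make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def baseline_accuracy(X, y, seed=0):
    """Hypothetical helper: held-out accuracy of a scaled logistic regression."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


# Built-in datasets ship with the library (load_* functions).
iris = load_iris()
cancer = load_breast_cancer()

# Synthetic datasets are generated on demand (make_* functions).
# The parameter values here are illustrative assumptions.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Fetchable datasets (fetch_* functions, e.g. fetch_california_housing)
# are downloaded and cached on first call, so they are not loaded here.

datasets = [
    ("iris", iris.data, iris.target),
    ("breast_cancer", cancer.data, cancer.target),
    ("synthetic", X_syn, y_syn),
]

for name, X, y in datasets:
    class_counts = np.bincount(y).tolist()  # simple class-distribution check
    accuracy = baseline_accuracy(X, y)
    print(f"{name}: shape={X.shape}, classes={class_counts}, accuracy={accuracy:.3f}")
```

Printing the class counts next to the baseline score gives a quick view of class distribution before any algorithm selection or tuning.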
Keywords:
Scikit-learn, Machine learning datasets, Data preprocessing