Synthetic Datasets Generator for Testing Information Visualization and Machine Learning Techniques and Tools

Authors: Yvan Pereira dos Santos Brito, Carlos Gustavo Resque dos Santos, Rodrigo Santos do Amor Divino Lima, Tiago Davi Oliveira de Araújo, Bianchi Serique Meiguins
Journal: IEEE Access
Year: 2020
Pages: 82917-82928
DOI: 10.1109/ACCESS.2020.2991949

Data generators are applications that produce synthetic datasets, which are useful for testing data analytics applications such as machine learning algorithms and information visualization techniques. Each data generator takes a different approach to generating data. Consequently, each has functionality gaps that make it unsuitable for some tasks (e.g., no way to create outliers or non-random noise). This paper presents a data generator application that aims to fill relevant gaps scattered across other applications, providing a flexible tool that helps researchers test their techniques exhaustively and in more diverse ways. The proposed system lets users define and compose known statistical distributions to produce the desired outcome, and visualizes the behavior of the data in real time so that users can check whether it has the characteristics needed for efficient testing. The paper describes the tool's functionalities and how to create datasets in detail, and includes a usage scenario that illustrates the data creation process.
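
The idea of composing known statistical distributions, with outliers and non-random noise layered on top, can be illustrated with a minimal Python sketch. This is not the tool's actual interface or implementation: the function `generate_column`, its parameters, and the chosen distributions are all hypothetical and serve only to show the general technique, using the standard library alone.

```python
import math
import random


def generate_column(n, base, noise=None, outlier=None, outlier_rate=0.0, seed=0):
    """Hypothetical helper: build one synthetic column of n values.

    base, noise, and outlier are callables taking (rng, index) and
    returning a float. Each row is either an outlier (with probability
    outlier_rate) or a base sample plus the noise term.
    """
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    values = []
    for i in range(n):
        if outlier is not None and rng.random() < outlier_rate:
            values.append(outlier(rng, i))
        else:
            value = base(rng, i)
            if noise is not None:
                value += noise(rng, i)
            values.append(value)
    return values


# Assumed example configuration:
# - base: Gaussian with mean 50 and standard deviation 5
# - noise: a deterministic sinusoidal drift over the row index (non-random)
# - outliers: uniform values far away from the bulk of the data
column = generate_column(
    n=1000,
    base=lambda rng, i: rng.gauss(50.0, 5.0),
    noise=lambda rng, i: 2.0 * math.sin(i / 25.0),
    outlier=lambda rng, i: rng.uniform(150.0, 200.0),
    outlier_rate=0.02,
    seed=42,
)
```

Passing the generators as callables keeps the composition open-ended: any distribution available in `random` (or a custom function of the row index) can be swapped in without changing `generate_column`.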