What is EDA in Machine Learning?

Exploratory Data Analysis - An Important Step in Data Science

One of the great challenges data scientists face is determining how data can be used to solve a given problem. Companies of all kinds approach them again and again, hoping to see their problems solved with the help of machine learning and AI, yet many of these tasks are not data science problems at all. So when a company asks a data scientist to examine a collection of data, the first question must be whether the problem can be solved by data analysis at all, and if so, how best to go about it.

Exploratory data analysis

Exploratory Data Analysis (EDA) is an approach that extracts information from a data collection and summarizes the main characteristics of that data. It is considered a crucial step in any data science project; in Figure 1 it appears as the second step, after the problem analysis, in the CRISP-DM process model. Many underestimate the importance of data preparation and exploration, yet EDA is essential for a clearly defined and structured data science project and should be carried out before any statistical or machine learning modeling phase. In this blog post we will focus on how an EDA works and which technologies we use for data analysis and visualization at Camelot.

Fig. 1: Process of a data science project
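Before any deeper analysis, a quick programmatic summary already reveals a lot about a dataset. A minimal pandas sketch of such a first look (the file name deliveries.csv is a made-up placeholder for any tabular data collection):

```python
# First look at a dataset with pandas; "deliveries.csv" is a
# hypothetical file standing in for any tabular data collection.
import pandas as pd

df = pd.read_csv("deliveries.csv")

print(df.shape)          # number of rows and columns
print(df.dtypes)         # data type of each column
print(df.describe())     # summary statistics for numeric columns
print(df.isna().sum())   # count of missing values per column
```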

Data preparation

Data preparation involves cleansing and organizing real-world data and is known to take up more than 80% of a data scientist's time. Raw data is typically unsorted, has missing values, contains numerous duplicates and, in some cases, incorrect entries. Most machine learning algorithms cannot handle missing values, so the data must first be converted and sanitized. Common strategies for dealing with missing values are deleting the affected rows, linear interpolation, and imputation with mean values; which one is appropriate depends on the meaning and number of the missing values. A minimal sketch of these strategies follows below. At Camelot, we mainly use Python and R, programming languages that data scientists often use in their daily work, for data preparation and preprocessing.
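A minimal pandas sketch of the three strategies mentioned above, on a tiny made-up table (column names and values are illustrative only):

```python
import pandas as pd
import numpy as np

# Toy data with gaps; the "demand" column is an invented example.
df = pd.DataFrame({
    "demand": [120.0, np.nan, 95.0, np.nan, 110.0],
    "region": ["North", "North", "South", "South", "South"],
})

# Option 1: drop rows that contain missing values
dropped = df.dropna()

# Option 2: linear interpolation between neighbouring observations
interpolated = df.assign(demand=df["demand"].interpolate(method="linear"))

# Option 3: replace missing values with the column mean
mean_filled = df.assign(demand=df["demand"].fillna(df["demand"].mean()))
```

Which option is preferable depends on the data: interpolation suits ordered time series, while mean imputation may be acceptable when values are missing at random and the gaps are few.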

Data exploration

Once we have a sanitized dataset, we need to analyze the data, summarize its properties, and visualize it. Analyzing the data is a step-by-step process that takes place between the data science team and the company's domain experts. It helps both sides identify and build the most important features, which can then feed suitable machine learning models.

An essential part of data exploration is data conversion. Consider, for example, a forecasting problem in logistics: a certain number of deliveries to different locations by different suppliers. One conversion option is filtering. We can filter for a specific delivery location or a specific group of suppliers and use the filtered data to create a forecast and obtain insights quickly. Another tactic is aggregation. If we have daily data, aggregating it to weekly or even monthly granularity gives us a new dataset that reveals seasonal changes and trends. A minimal sketch of both techniques is shown below.
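The following pandas sketch illustrates both techniques on toy delivery data; the supplier names and figures are made-up assumptions, not real project data:

```python
import pandas as pd

# Toy delivery data: one row per day per supplier (illustrative only)
df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=60, freq="D").repeat(2),
    "supplier": ["A", "B"] * 60,
    "deliveries": range(120),
})

# Filtering: keep only the rows for one supplier
supplier_a = df[df["supplier"] == "A"]

# Aggregation: roll daily figures up to weekly totals to expose trends
weekly = (
    supplier_a.set_index("date")["deliveries"]
    .resample("W")
    .sum()
)
print(weekly.head())
```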

Once the data has been enriched in this way, we can visualize it. Data visualization makes it possible to quickly gain an overview of the data. It helps data scientists and the various stakeholder groups in a company agree on processes and the required data quality. This is an important feedback loop in the CRISP-DM method that helps to better understand the problem at hand. From a data processing perspective, it also helps to quickly discover patterns and outliers and to decide how to deal with them. Python libraries like Matplotlib and Seaborn are powerful visualization tools, especially for technical discussions and internal iterations to define the project scope. Power BI (a business intelligence solution from Microsoft) can also be used to create interactive dashboards for prototype development and as a proof of concept for customers. To ensure quality and integration for companies, we use additional enterprise tools such as SAP Analytics Cloud. Frequently used plot types for EDA include (see the sketch after this list):

  • Histograms: to check the distribution of a particular variable
  • Scatter plots: to check the dependency between two variables
  • Maps: to show the distribution of a variable on a regional or world map
  • Correlation heat maps: to illustrate the dependencies between different variables
  • Time series charts: to identify trends and seasonal changes in time-dependent data
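As a rough illustration, here is a Matplotlib/Seaborn sketch of three of these plot types on synthetic data (the column names and distributions are invented for the example):

```python
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

# Synthetic data; "lead_time" and "order_size" are invented columns.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "lead_time": rng.normal(5, 1.5, 500),
    "order_size": rng.normal(100, 20, 500),
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram: distribution of a single variable
sns.histplot(df["lead_time"], ax=axes[0])

# Scatter plot: dependency between two variables
sns.scatterplot(x="order_size", y="lead_time", data=df, ax=axes[1])

# Heat map: pairwise correlations between variables
sns.heatmap(df.corr(), annot=True, ax=axes[2])

plt.tight_layout()
plt.show()
```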

An example of a Power BI dashboard is shown in Figure 2.

Fig. 2: Example of a Power BI dashboard

After the EDA process has been completed, the modeling phase can begin. This includes the creation of statistical models and the development of machine learning models. Even though the EDA already involves some basic statistical analyses, full statistical modeling does not take place until the modeling phase, which could be the subject of another post in the future.

Conclusion

In summary, exploratory data analysis is a crucial step in any data science project. The main pillars of EDA are data cleansing, data preparation, data exploration, and data visualization. Various exploratory tools (Python and R) and enterprise applications (Power BI, SAP Analytics Cloud, Tableau, etc.) are available to carry out EDA, each offering its own strengths.

We thank Dr. Ghazzal Jabbari and Frank Kienle for contributing to this article.
