Once the relevant data sources are gathered, the preparation, integration and exploration of the data can begin. This is the most time-consuming step in the framework. According to Forrester, one-third of analysts spend more than 40% of their time vetting and validating data.
When preparing your data, the first step is to assess the content of your data sources. Namely, do the files contain only quantitative data or qualitative data as well? If the data is quantitative, is the data discrete or continuous?
3 DATA SCALES
Quantitative data is scaled in one of three ways, depending on whether that data is discrete or continuous:
ONE – INTERVAL With an interval scale, the data has a standardized order: the difference between each level on the scale is the same. There is no absolute zero point (temperature in degrees Celsius is a classic example).
TWO – RATIO An absolute and meaningful zero is a unique feature of the ratio scale. Ratio scale data can be multiplied or divided.
THREE – CIRCULAR Whether it is time of day, days of the week, or months of the year, data linked to duration has a circular scale. Circular scale data is a special type of interval scale data.
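The circular scale matters in practice: on a 24-hour clock, 23:00 and 00:00 are one hour apart, yet the raw numbers 23 and 0 look far apart to any distance-based calculation. A minimal Python sketch of the common sine/cosine encoding (function names here are illustrative, not from any particular library):

```python
import math

def encode_hour(hour):
    """Map an hour of day (0-23) onto the unit circle, so that
    23:00 and 00:00 end up close together, as a circular scale implies."""
    angle = 2 * math.pi * hour / 24
    return (math.sin(angle), math.cos(angle))

def circular_distance(h1, h2):
    """Euclidean distance between two hours after circular encoding."""
    x1, y1 = encode_hour(h1)
    x2, y2 = encode_hour(h2)
    return math.hypot(x1 - x2, y1 - y2)
```

With this encoding, the distance between 23:00 and 00:00 is small, while the distance between 00:00 and 12:00 is the largest possible, matching the intuition of a clock face.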
After preparing the data, the next step is integration. A common way to integrate multiple sources is by linking them. To link the files, the sources need at least one field in common. This field is called a key. The presence of one key is often sufficient, but sometimes, when linking two files, multiple keys are required.
4 TYPES OF JOINS – AND UNIONS
In database terms, the link between two files is called a “join”. There are four types of joins: inner, left, right, and full outer.
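As an illustration (the table and column names here are invented), the four standard joins (inner, left, right, and full outer) can be sketched with pandas, where the shared key field drives the link:

```python
import pandas as pd

# Two small tables sharing the key field "customer_id" (illustrative names).
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Ann", "Ben", "Cleo"]})
orders = pd.DataFrame({"customer_id": [2, 3, 4],
                       "amount": [50, 75, 20]})

inner = customers.merge(orders, on="customer_id", how="inner")  # keys in both tables
left  = customers.merge(orders, on="customer_id", how="left")   # all customers
right = customers.merge(orders, on="customer_id", how="right")  # all orders
outer = customers.merge(orders, on="customer_id", how="outer")  # all keys from either side
```

Each variant answers a different question: the inner join keeps only customers with orders, while the full outer join keeps every key from either table, filling the missing side with nulls.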
Another way to integrate data is to use a “union”. A union is a method for combining data by appending rows of one table onto another table.
If two tables are “unioned” together, then the data from the first table sits in one set of rows and the data from the second table in another set. Fields with the same name share the same columns.
So, what’s the difference between a union and join? With a union, you are adding records. With a join, you are adding columns.
When tables have the same number of fields and the same field names, a union generates an integrated table that matches the layout of the original tables. When one or more of the tables contain extra fields or field names that are not identical, a union produces a table containing a column for each unique field name. The value of such a field is null for records belonging to an original table that does not contain that field.
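A short pandas sketch of this union behavior, with invented table names: the second table carries an extra field the first lacks, and the union fills that field with null for the first table's records:

```python
import pandas as pd

sales_2022 = pd.DataFrame({"product": ["A", "B"], "revenue": [100, 200]})
# The 2023 table carries an extra field not present in 2022.
sales_2023 = pd.DataFrame({"product": ["A", "C"],
                           "revenue": [120, 80],
                           "channel": ["web", "store"]})

# A union appends rows; "channel" is null for the 2022 records.
combined = pd.concat([sales_2022, sales_2023], ignore_index=True)
```

The result has one column per unique field name (product, revenue, channel) and one row per record from either table, which is exactly the "adding records, not columns" behavior that distinguishes a union from a join.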
In most cases, the choice between a union or join is self-evident. You can also use these three factors as guidance:
ONE A join adds new fields, whereas a union adds additional records.
TWO When the file structures are different, a join is the logical choice; a union assumes the structures of the files are similar or the same.
THREE A join can only be performed when a key is present in the tables to be integrated. With a union, no key is required.
There are cases, however, when the choice is still not immediately obvious. Here, you must consider the pros and cons of both options. A factor to take into account is the impact on your KPIs.
5 LEVELS OF DATA VALIDATION
When the preparation and integration of the data are complete, the data is explored. However, before the exploration can begin, there are different kinds of validation to perform. Some of these validations typically occur during the integration and transformation of the data, but some are done after the completion of these two steps. There are five different validations:
EXPLORATORY DATA ANALYSES (EDA)
Once the data is prepared, it is ready to be explored. The aim is to gain insights into the distribution of the available fields and to detect trends in order to discover relationships between fields. Univariate analysis provides many insights into the distribution of fields: measures such as the median, mode, and skewness, and the presence of outliers. Outliers are important and can have a major impact on the analysis, so it is vital to investigate them and, where appropriate, correct or remove them.
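As an illustrative sketch (the numbers are invented), the univariate measures named above and a common interquartile-range rule of thumb for flagging outliers can be computed with pandas:

```python
import pandas as pd

values = pd.Series([4, 5, 5, 6, 6, 6, 7, 7, 8, 95])  # 95 is a clear outlier

# Univariate summary measures.
median = values.median()
mode = values.mode()[0]
skewness = values.skew()

# A common rule of thumb: flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
```

Here the single extreme value dominates the skewness and is flagged by the IQR rule; whether to correct or remove it is a judgment call best made with a domain expert.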
Multivariate analysis provides insights into the relationships between two or more fields. A scatter plot, for example, can also help detect outliers. The correlation matrix is typically a good first step to find which fields are correlated with the target variable. However, correlation does not automatically indicate causality between two fields.
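A small example with made-up fields: a correlation matrix quickly shows which fields move with a hypothetical target variable, without implying that one causes the other:

```python
import pandas as pd

df = pd.DataFrame({
    "ad_spend":    [10, 20, 30, 40, 50],
    "visits":      [110, 205, 290, 410, 495],  # hypothetical target variable
    "temperature": [15, 8, 22, 11, 18],        # unrelated field
})

# Correlation of every other field with the target "visits".
corr_with_target = df.corr()["visits"].drop("visits")
```

The strongly correlated field stands out immediately, while the unrelated one hovers near zero; the matrix narrows down candidates for modeling, but causality still has to be argued separately.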
There are also more advanced exploratory analyses. The two most common types are principal component analysis (PCA) and cluster analysis. PCA identifies the most important fields for a model, which is especially helpful when many fields are available. Cluster analysis detects distinct groups within the data. It builds an understanding of the hidden relationships in the data and is sometimes the goal of a data science project in itself.
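As a minimal sketch of the idea behind PCA (synthetic data, plain NumPy rather than a dedicated library): when two fields are strongly correlated and a third is pure noise, the first principal component captures almost all of the variance:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: two strongly correlated fields plus one noise field.
x = rng.normal(size=200)
data = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.1, size=200),
    rng.normal(scale=0.1, size=200),
])

# PCA via eigendecomposition of the covariance matrix.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
explained = eigvals[::-1] / eigvals.sum()    # variance ratio, descending
```

The explained-variance ratios show that one component summarizes nearly all the structure, which is exactly why PCA is useful for reducing many fields to a few informative ones.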
DEEP INTO DOMAIN
Although domain experts are involved throughout a data project, their collaboration is essential at this stage. During the exploration of the data, the Data Scientist becomes familiar with the data and detects trends, abnormalities, patterns and correlations. These insights then need to be explained or put into context by domain experts.
Domain expertise is also required during other stages of the 8-step process, including step 1 (asking questions), step 2 (data landscaping), and step 6 (modeling).
Wikipedia defines domain knowledge as:
“[Domain knowledge] is knowledge of a specific, specialized
discipline or field, in contrast to general knowledge, or
domain-independent knowledge. The term is often used in
reference to a more general discipline, as, for example, in
describing a software engineer who has general knowledge of
programming, as well as domain knowledge about the
pharmaceutical industry. People who have domain knowledge
are often considered specialists or experts in the field.”
Thus, domain knowledge is knowledge about a specific, specialized discipline or field related to a certain (business) domain that somebody has acquired through years of experience.
3 TYPES OF DOMAIN KNOWLEDGE
There are three types of domain knowledge:
ONE – CONTEXT OF THE PROBLEM. Domain knowledge deepens the understanding of the problem. Its context is discussed during the first step of the 8-step model (ask the right questions), but as more insights into the problem are gained, domain experts can add to them. Previous attempts to solve the problem are another useful source of context.
TWO – SPECIALIZED INFORMATION OR EXPERTISE. After the exploration of the data, a domain expert can interpret the resulting insights or validate the (tentative) conclusions for the Data Scientist. The Data Scientist can also review the proposed features with the domain expert. Once a model is created, the domain expert can help interpret the results or fine-tune the model, and then validate the model to assess its effect on vulnerable groups.
THREE – KNOWLEDGE OF DATA COLLECTION. Domain experts are also a useful resource to identify potential data sources. As a result, you can get a better understanding of how data was collected or identify data quality issues.
The interaction between the Data Scientist and the domain expert is ideally a two-way street. The Data Scientist can tap into the knowledge of the domain expert. But the Data Scientist can also reveal their data-based insights to the domain expert.