Data Understanding

Once you have an overview of the available data sources, you must select the best-qualified ones. To do this, gather and assess the sources to understand the exact content, structure, and quality of the data. When the data sources contain personal data, that data must be anonymized.

Data quality and data management are not sexy topics. In almost all data science projects, this area is a hard nut to crack.

6 THINGS ABOUT DATA QUALITY

The importance of good quality data is often overlooked. Good quality data requires thorough assessment, validation, and correction by expert analysts and Data Scientists. This is an important step because the quality of your data has a direct impact on the accuracy of your resulting business decisions. Simply put, if your data quality is poor, your business will make the wrong decisions and, ultimately, see revenue losses.

Excessive data correction also takes time. As a result, your Data Scientists are focused on data correction, instead of value-added tasks.

Many people take a one-dimensional view of data quality. But data quality is commonly described along six dimensions: completeness, uniqueness, timeliness, validity, accuracy, and consistency.

Some data quality issues have more of an impact than others. Validity issues can be corrected by converting the data into the appropriate format. Uniqueness issues are also easy to resolve, although it is advisable to find the underlying cause for the duplication. The other issues are more complicated and time-consuming to resolve.
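
The two "easy" cases above can be illustrated with a minimal Python sketch. It assumes a hypothetical list of customer records in which the signup date arrives in more than one format (a validity issue) and one record appears twice (a uniqueness issue). The field names, the accepted date formats, and the rule for what counts as a duplicate are all assumptions made for illustration, not part of any particular tool.

```python
from datetime import datetime

# Hypothetical customer records; field names and formats are illustrative.
RECORDS = [
    {"id": "1", "signup": "2023-01-15"},
    {"id": "2", "signup": "15/02/2023"},  # valid date, wrong format
    {"id": "2", "signup": "15/02/2023"},  # exact duplicate of the previous record
    {"id": "3", "signup": "not a date"},  # cannot be repaired automatically
]

# Assumed set of formats that the sources are known to use.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def normalize_date(value):
    """Return the date in ISO format, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean(records):
    """Fix validity issues, drop duplicates, and report what was found."""
    seen = set()
    cleaned, duplicates, invalid = [], [], []
    for rec in records:
        fixed = dict(rec, signup=normalize_date(rec["signup"]))
        if fixed["signup"] is None:
            invalid.append(rec)      # needs manual follow-up
            continue
        key = (fixed["id"], fixed["signup"])
        if key in seen:
            duplicates.append(rec)   # keep it for root-cause analysis
            continue
        seen.add(key)
        cleaned.append(fixed)
    return cleaned, duplicates, invalid


cleaned, duplicates, invalid = clean(RECORDS)
print(cleaned)
print(len(duplicates), "duplicate(s),", len(invalid), "invalid record(s)")
```

The sketch returns the duplicates and the unrepairable records instead of silently discarding them, so that the underlying cause can still be investigated.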

Data quality is also context dependent. The quality of your data might be sufficient for one purpose but inadequate for another.

You can improve data quality when the source is used or refreshed. This is not a comprehensive approach – it is better to address the problems in a structural way, using a combination of data management and data governance. It’s a route not often taken, since it digresses from the objective of the exercise and doesn’t seem to deliver immediate value. However, in the long run, it pays to feed the kinds of data issues you have encountered back into the operational processes, together with the adjustments that could improve your data quality.

10 THINGS ABOUT DATA MANAGEMENT

Data management and data governance are intrinsically linked. For data management, you manage your data to achieve specific goals. These goals may include improving the usability of the data, making the data available and managing data security.

For data governance, you must ensure the data is managed appropriately across your people, processes and technologies. To achieve this, you must include your senior management, who can act as an executive sponsor and data owner. Senior management can give staff the mandate to manage data governance operationally, while the responsibility remains at the management level. The Data Management Body of Knowledge (DMBOK) from the Data Management Association (DAMA) is a popular data management and governance framework.

6 ELEMENTS OF GDPR

DMBOK covers data privacy from a data management perspective. In the EU, the legal side of data protection and privacy is regulated by the General Data Protection Regulation (GDPR). The primary aim of the GDPR is to give individuals control over their personal data and to simplify the regulatory environment for international business by unifying the regulation within the EU. According to the GDPR, there are six legal grounds on which it is lawful to process personal data: consent, performance of a contract, compliance with a legal obligation, protection of vital interests, performance of a task in the public interest, and legitimate interests.

It is important to note that there is no lawful basis if you can reasonably achieve the same purpose without the processing of personal data.

What is personal data? Article 4 of the GDPR provides the official definition. In summary, all data that identifies a natural person or makes a person identifiable is considered personal data.

So social security numbers, names, and email addresses are all personal data. Data not considered to be personal data includes company data, aggregated personal data, a generic email address, and data about deceased persons.

The GDPR takes the protection of personal data very seriously. However, you can still analyze and store data sources that originally contain personal data. A simple technique is to store the data on a higher aggregation level. You could aggregate data about consumers or households at a postcode level, for example. GDPR does not consider aggregated data as personal, as long as individuals cannot be traced.
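
A minimal Python sketch of postcode-level aggregation is shown here. The household records, the field names, and the `min_group_size` threshold are all hypothetical. The threshold is an assumption added for illustration: a postcode with only one or two households could still point to an individual, so the sketch drops such groups. The right threshold depends on your data and is not prescribed here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical household-level records; names and values are illustrative.
HOUSEHOLDS = [
    {"name": "A. Jansen", "postcode": "1011", "spend": 120.0},
    {"name": "B. de Vries", "postcode": "1011", "spend": 80.0},
    {"name": "C. Bakker", "postcode": "1011", "spend": 100.0},
    {"name": "D. Visser", "postcode": "2022", "spend": 300.0},
]


def aggregate_by_postcode(rows, min_group_size=3):
    """Aggregate spend per postcode and drop groups that are too small.

    min_group_size is an assumed safeguard: a group with very few
    households could still allow individuals to be traced.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row["postcode"]].append(row["spend"])
    return {
        postcode: {"households": len(spends), "avg_spend": mean(spends)}
        for postcode, spends in groups.items()
        if len(spends) >= min_group_size
    }


summary = aggregate_by_postcode(HOUSEHOLDS)
print(summary)
```

Only the aggregated summary would be stored. Names never reach the output, and the single-household postcode is left out altogether.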

5 THINGS ABOUT ANONYMIZATION

When aggregation is not viable, anonymization is an alternative. With anonymization, all personally identifiable information is stripped from the data source. There are several anonymization techniques:

GENERALIZATION Replace exact values with a more general value. For example, replacing age with an age category or address information with the province.

SUPPRESSION (OR MASKING) Personal data is deleted or altered in such a way that the individual can no longer be identified.

DATA SWAPPING (OR PERMUTATION) A technique used to rearrange the dataset attribute values so they do not correspond with the original records.

PERTURBATION This technique modifies the original dataset slightly by rounding numbers and adding random noise. The size of the perturbation must be proportional to the range of the values: too little noise leaves the data identifiable, while too much reduces its usefulness.

SYNTHETIC DATA Synthetic data is algorithmically manufactured data. This synthetic data is used to create artificial datasets instead of altering the original dataset. The process involves creating statistical models based on patterns found in the original dataset.
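
The first four techniques can be sketched in a few lines of Python. This is an illustrative sketch only: the records, field names, age-band width, noise range, and rounding step are all assumptions, and applying these functions does not by itself guarantee that the result is irreversibly anonymized. Synthetic data generation is left out because it requires a statistical model of the original dataset.

```python
import random

# Hypothetical records; every field name and value here is illustrative.
PEOPLE = [
    {"name": "Anna", "age": 34, "province": "Utrecht", "income": 41200},
    {"name": "Bram", "age": 58, "province": "Gelderland", "income": 52900},
    {"name": "Carla", "age": 27, "province": "Utrecht", "income": 38750},
]


def generalize_age(age, width=10):
    """Generalization: replace an exact age with an age category."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def suppress(record, fields=("name",)):
    """Suppression: drop directly identifying fields from a record."""
    return {k: v for k, v in record.items() if k not in fields}


def swap_column(records, field, rng):
    """Data swapping: shuffle one attribute across all records."""
    values = [r[field] for r in records]
    rng.shuffle(values)
    return [{**r, field: v} for r, v in zip(records, values)]


def perturb(value, rng, noise=500, round_to=1000):
    """Perturbation: add bounded random noise, then round the result."""
    noisy = value + rng.uniform(-noise, noise)
    return int(round(noisy / round_to) * round_to)


rng = random.Random(42)  # fixed seed only so the sketch is repeatable
anonymized = [suppress(p) for p in PEOPLE]
anonymized = swap_column(anonymized, "province", rng)
anonymized = [
    {**p, "age": generalize_age(p["age"]), "income": perturb(p["income"], rng)}
    for p in anonymized
]
for row in anonymized:
    print(row)
```

The noise and rounding parameters in `perturb` are chosen relative to incomes in the tens of thousands, in line with the point that the perturbation should be proportional to the range of values. On a dataset this small, swapping may leave some values in their original position, so it offers little protection here and is shown only to illustrate the mechanism.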

According to the GDPR, data is only truly anonymized when the anonymization is irreversible. Anonymization also has some disadvantages. It limits your ability to derive value and insights from your data, and you can no longer enrich the data by linking it through a personal identifier.

These disadvantages do not exist with pseudonymization. Pseudonymization is a method that replaces private identifiers with fake identifiers or pseudonyms. Pseudonymization preserves statistical accuracy and data integrity and allows for linking with other databases that have used the same algorithm for pseudonymization.
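
One common way to produce such pseudonyms is a keyed hash, sketched below with Python's standard `hmac` and `hashlib` modules. This is one possible approach under stated assumptions, not a method prescribed by the text: the secret key, the normalization rule (trim and lowercase), and the two example datasets are all hypothetical. Both datasets must use the same key and the same normalization for the pseudonyms to match.

```python
import hashlib
import hmac

# Hypothetical secret key. In practice it would be stored securely and
# separately from the pseudonymized data, never hard-coded like this.
SECRET_KEY = b"example-key-keep-this-out-of-source-control"


def pseudonymize(identifier, key=SECRET_KEY):
    """Replace an identifier with a keyed-hash pseudonym (HMAC-SHA256)."""
    normalized = identifier.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()


# Two hypothetical datasets that share an email address as identifier.
orders = [{"email": "Anna@example.org", "total": 59.95}]
support = [{"email": "anna@example.org ", "tickets": 2}]

# Replace the email with its pseudonym in both datasets.
orders_p = {pseudonymize(o["email"]): o["total"] for o in orders}
support_p = {pseudonymize(s["email"]): s["tickets"] for s in support}

# Link the datasets on the pseudonym instead of the email address.
linked = {
    pid: (orders_p[pid], support_p[pid])
    for pid in orders_p.keys() & support_p.keys()
}
print(linked)
```

Anyone who holds the key can recompute the pseudonym for a known email address and so recognize that person's records.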

Although pseudonymization protects data privacy, the GDPR still considers pseudonymized data to be personal data, since a person can be re-identified when additional information is available.