Data wrangling is the process of gathering, selecting and transforming data to answer an analytical question. Often referred to as “munging” or data cleaning, it is commonly estimated to take up about 80 percent of a data scientist’s time, with the rest devoted to modeling or exploration.
Why Is Data Wrangling Necessary?
Data sets vary widely in quality. Some are big data streams that contain unstructured data. Others are structured (e.g., the data fields are clear and consistent) but include duplicate or irrelevant data. Still others are in good condition but are so large that analytic queries are only practical against metrics that have been rolled up in a data warehouse, for example in a star or snowflake schema.
Steps to Data Wrangling
- Gather data from sources inside and outside the organization.
- Document sources and limitations.
- Clean the blanks, nulls, duplicates and other errors.
- Combine data into a single table.
- Create new data sets by calculating fields and categorizing.
- Identify and eliminate outliers and illogical results by plotting the data.
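The steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a definitive pipeline. The records, field names (`order_id`, `customer_id`, `amount`, `region`) and the 30.0 size threshold are all made up for the example. Because plotting cannot be shown in plain text, a simple interquartile-range rule stands in for the visual outlier check.

```python
from statistics import quantiles

# Hypothetical records from two sources (all names and values are made up).
orders = [
    {"order_id": "1", "customer_id": "c1", "amount": "20.0"},
    {"order_id": "2", "customer_id": "c2", "amount": "35.5"},
    {"order_id": "2", "customer_id": "c2", "amount": "35.5"},    # duplicate row
    {"order_id": "3", "customer_id": "c1", "amount": ""},         # blank value
    {"order_id": "4", "customer_id": "c3", "amount": "25.0"},
    {"order_id": "5", "customer_id": "c2", "amount": "30.0"},
    {"order_id": "6", "customer_id": "c3", "amount": "5000.0"},   # suspicious value
]
customers = {"c1": "EU", "c2": "US", "c3": "US"}


def clean(rows):
    """Drop rows with a blank amount and exact duplicate rows."""
    seen, out = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if row["amount"].strip() == "" or key in seen:
            continue
        seen.add(key)
        out.append(row)
    return out


def combine(rows, regions):
    """Join each order to its customer's region, producing a single table."""
    return [
        {**row, "amount": float(row["amount"]),
         "region": regions.get(row["customer_id"])}
        for row in rows
    ]


def categorize(rows, threshold=30.0):
    """Add a calculated field that buckets orders by size."""
    return [
        {**row, "size": "large" if row["amount"] >= threshold else "small"}
        for row in rows
    ]


def flag_outliers(rows):
    """Mark amounts that lie more than 1.5 * IQR outside the quartiles."""
    amounts = [row["amount"] for row in rows]
    q1, _, q3 = quantiles(amounts, n=4, method="inclusive")
    spread = 1.5 * (q3 - q1)
    return [
        {**row, "outlier": not (q1 - spread <= row["amount"] <= q3 + spread)}
        for row in rows
    ]


table = flag_outliers(categorize(combine(clean(orders), customers)))
for row in table:
    print(row)
```

Flagging outliers rather than deleting them outright keeps the decision reviewable, which fits the earlier step of documenting sources and limitations.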
The Challenges of Data Wrangling
Data wrangling is something of the unspoken grunt work of data science. It takes time to clean data to the point that it can be used for analytics. These are some of the challenges you will face when data wrangling:
- Obtaining access to data: A data scientist needs permission to access the data. Without it, they must request a scrubbed copy, with instructions for how it should be scrubbed, and hope the request is granted.
- Clarifying the use case: The data you need depends on the question you’re looking to answer, so the use case must be clarified before you can choose the right data sets.
- Understanding the data: You need to understand which fields are required, unnecessary or incomplete. Run some basic queries to determine whether the data makes sense and whether bad or missing data will skew your results.
- Identifying data relationships: You need to determine how entities are related to one another via keys.
- Avoiding selection bias: Selection bias occurs when the sample data is not representative of the population the analysis will be applied to. Remediating it can be difficult, so it’s important to check that the sample is representative before drawing conclusions.
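Several of these challenges can be checked with small queries before any analysis begins. The sketch below is one possible approach, using only the Python standard library. It counts missing values per field, tests whether a key is unique, finds child rows whose key has no parent, and compares group shares between a sample and the full table as a rough representativeness check. The tables, field names and helper functions are all hypothetical.

```python
from collections import Counter

# Hypothetical tables; every field name and value here is illustrative.
customers = [
    {"customer_id": "c1", "segment": "retail"},
    {"customer_id": "c2", "segment": "retail"},
    {"customer_id": "c3", "segment": "business"},
    {"customer_id": "c4", "segment": "business"},
]
orders = [
    {"order_id": "1", "customer_id": "c1", "amount": "20.0"},
    {"order_id": "2", "customer_id": "c2", "amount": ""},
    {"order_id": "3", "customer_id": "c9", "amount": "12.5"},  # no such customer
    {"order_id": "3", "customer_id": "c1", "amount": "40.0"},  # repeated order_id
]


def missing_counts(rows):
    """Count blank or None values per field."""
    counts = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None or str(value).strip() == "":
                counts[field] += 1
    return dict(counts)


def duplicate_keys(rows, key):
    """Return key values that appear more than once (the key is not unique)."""
    seen = Counter(row[key] for row in rows)
    return sorted(k for k, n in seen.items() if n > 1)


def orphan_keys(child_rows, parent_rows, key):
    """Return child key values with no matching row in the parent table."""
    parents = {row[key] for row in parent_rows}
    return sorted({row[key] for row in child_rows} - parents)


def group_shares(rows, field):
    """Share of rows in each group, for comparing a sample to its population."""
    counts = Counter(row[field] for row in rows)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


sample = customers[:3]  # a hypothetical sample drawn from the customers table

print("missing:", missing_counts(orders))
print("duplicate order ids:", duplicate_keys(orders, "order_id"))
print("orphan customer ids:", orphan_keys(orders, customers, "customer_id"))
print("population shares:", group_shares(customers, "segment"))
print("sample shares:", group_shares(sample, "segment"))
```

A large gap between the sample shares and the population shares is a hint, not proof, of selection bias. Matching shares on one field does not guarantee the sample is representative on others.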