"Data Odyssey: Navigating Messy Datasets Part One"
- Roshae Sinclair

- May 6, 2023

To get better at data analysis, I need diligent practice. At first, I worked only with relatively clean datasets. On the advice of seasoned professionals, I switched to deliberately messy ones, because they let me practice data cleaning, a critical task that takes up a substantial portion of a data analyst's or data scientist's workload.
So I went looking for untidy datasets on online platforms such as Kaggle. Eventually, I came across a data analysis blog that provided a CSV file containing the results of the "Ask a Manager" salary survey by Alison Green.
The dataset was very messy: verbose column headings, numerous missing values, qualitative data erroneously labelled as quantitative, and an overall lack of consistency. The inconsistency was most evident in the Country column, where the United States appeared as US, USA, United States, and a range of misspellings.
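A first look at a file like this could be sketched in pandas as follows. This is only an illustration under my own assumptions: the sample rows, the long column headings, and the short names `annual_salary` and `country` are made up, and the real CSV has different headings and far more columns.

```python
import pandas as pd

# Hypothetical sample rows standing in for the survey CSV.
raw = pd.DataFrame({
    "What is your annual salary? (hypothetical heading)": ["55,000", "62000", None],
    "What country do you work in? (hypothetical heading)": ["US", "United States", "USA"],
})

# Replace verbose headings with short names (the short names are my choice).
df = raw.rename(columns={
    "What is your annual salary? (hypothetical heading)": "annual_salary",
    "What country do you work in? (hypothetical heading)": "country",
})

# Look at the first ten rows and the inferred type of each column.
print(df.head(10))
print(df.dtypes)

# A numeric column that was read as text: drop thousands separators,
# then convert; values that cannot be parsed become NaN.
df["annual_salary"] = pd.to_numeric(
    df["annual_salary"].str.replace(",", "", regex=False),
    errors="coerce",
)
```

Checking `dtypes` right after loading is a quick way to spot columns whose stored type does not match what they actually represent.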

After reviewing the first ten rows, it was clear that significant cleaning and wrangling were necessary before the data could be analyzed. Consequently, I created a to-do list outlining the steps required to prepare the dataset for further analysis.
The first and most crucial step in this process was Data Cleaning and Wrangling.
Uncovering the story behind the numbers.

Before I start cleaning, I typically review the dataset to understand its content. To help with this, I was provided with the survey form used to collect the data, which can be accessed via this link.

The dataset comprises numerous checkbox questions, which makes those answers more consistent. However, some null values are present where individuals did not select a checkbox, particularly in fields such as state, years of professional work experience, gender, and ethnicity. Notably, most individuals chose multiple options when responding to the race question, which makes it hard to assign a single race to each respondent.
The question on highest level of education used a drop-down menu, which minimized the potential for errors. However, certain questions required individuals to type in their responses, and these are the primary issue with the data. For instance, the United States of America (USA) was written in over 100 variations, which calls for extensive cleaning. Additionally, several annual salaries were significantly above or below the typical annual salary for the respondent's country. Finally, the columns that let users enter additional information contained a large number of missing values and were deemed irrelevant.
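The missing values and multi-select answers described above could be inspected with a sketch like this one. The sample rows are invented, and I am assuming that multiple selections are stored in one cell separated by a comma and a space; if an option label itself contains a comma, this simple split would need adjusting.

```python
import pandas as pd

# Hypothetical sample rows; the column names are my own short versions.
survey = pd.DataFrame({
    "state": ["California", None, "Texas"],
    "gender": ["Woman", "Man", None],
    "race": ["White", "Asian or Asian American, White", None],
})

# Count missing values per column.
missing = survey.isna().sum()
print(missing)

# Assumption: multiple selections are joined with ", " in a single cell.
# One indicator column per option; rows with no answer are all zeros.
race_flags = survey["race"].str.get_dummies(sep=", ")

# Number of options each respondent selected.
survey["race_count"] = race_flags.sum(axis=1)
print(survey[["race", "race_count"]])
```

Keeping one indicator column per option avoids forcing a respondent who picked several options into a single category.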
This dataset requires extensive cleaning to ensure its usability.
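To give an idea of what that cleaning might involve, here is a minimal sketch for two of the issues: the country spellings and the unusual salaries. Everything in it is an assumption for illustration: the sample rows, the column names, the short list in `US_VARIANTS` (the real data has over 100 variations), and the choice of the 1.5 × IQR rule per country as the definition of an unusual salary.

```python
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({
    "country": ["US", "usa ", "United States", "U.S.A.", "Unted States", "Canada"],
    "annual_salary": [60000, 65000, 58000, 62000, 5000000, 70000],
})

# Hypothetical and incomplete set of spellings, compared after normalizing.
US_VARIANTS = {"us", "usa", "united states", "united states of america", "unted states"}

def normalize_country(value):
    """Map known US spellings to one label; leave other values trimmed."""
    if pd.isna(value):
        return value
    # Lowercase, drop periods, and collapse repeated whitespace.
    key = " ".join(value.lower().replace(".", "").split())
    return "United States" if key in US_VARIANTS else value.strip()

df["country_clean"] = df["country"].map(normalize_country)

# Flag salaries outside 1.5 * IQR of their own country's distribution.
grouped = df.groupby("country_clean")["annual_salary"]
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1
df["salary_outlier"] = (df["annual_salary"] < q1 - 1.5 * iqr) | (
    df["annual_salary"] > q3 + 1.5 * iqr
)
print(df)
```

Flagging rather than deleting the unusual salaries keeps the decision about what to do with them for a later step.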

