FIFA '21 Data Cleaning Challenge

Introduction

A data cleaning challenge was organized in the data tech space by Promise Chinonso. The FIFA data was a messy one. As a participant in this challenge, I used Microsoft Excel for the cleaning. A data dictionary for the project was given which aided in navigating through the cleaning process.

About the Data

The messy data was gotten from Kaggle.com. It had 188979 rows and 77 columns. The columns contained in this dataset were Name, Long Name, photo URL, player URL, Age, OVA, POT, Team, Contract, ID, Height, Weight, foot, BOV, Growth, Joined, Loan Date End, Value, Wage, Release Clause, Attacking, Crossing, Finishing, Heading Accuracy, Short Passing, Volleys, Skill, Dribbling, Curve, IFK Accuracy, Long Passing, Ball Control, Movement, Acceleration, Speed, Agility, Reactions, Balance, Power, Shot Power, Jumping, Stamina, Strength, Long Shots, Mentality, Aggression, Interceptions, Positioning, Vision, Penalties.

Steps took in cleaning the data

The data was cleaned using MS Excel, so the first step in cleaning this data was duplicating it. Duplicating the sheet helps to trace back where you're coming from. It also helps in case a mistake is made, you always have the original data to work with.

Going through the dataset, the things noticed were

Wrong Data Types
Inconsistency with Units (some columns had different measurements embedded in them)
Null Values
Unwanted columns

This data was cleaned by a lot of people and one thing I noticed was some had special characters in their columns. However, in mine, I loaded the data on WPS before excel, so automatically all the special characters were removed before it was moved to Excel.

Columns cleaned

The data had a "Name" column and a "Long Name" column. Most of the names were Italian and as such, they had "A Circumflex (◌̂)". This was going to bring some wrong analysis when insights will be thrown upon. Assuming all names were to be sorted in ascending order (A-Z), names with circumflexes queued behind the letter Z, even the letter A because of the circumflex.

The player Url had the names of the players without circumflexes, the column was split into several others by using "/" as a delimiter to extract the names into First Name and Last Name.

The Player URL

The Playerurl column

The extracted names from the Playerurl.

OVA and POT (Overall Analysis and Players Potential)

The overall analysis and the player's potential were to be in percentage as stated in the data dictionary, but these columns were in the wrong type format. To fix this I had to standardize the column by changing the data type to Percentage

Age Column

The age column was also in the wrong data type. And since this data is 2 years old, I decided to add "2 years" to make the age up to date.

Age Before

Age After ( Data Type = Number, Age up to date)

Club

The club column consisted of the clubs each player was under, however in entering this column errors were made. The error noticed was there were too many spaces before the club name and a few of the clubs had the number "1" before them. To clean up the spaces I used the Trim and clean function alongside the query to remove the number one.

Club Before

Club After

Height and Weight

These columns were tedious. The Height column had both "cm" and "feets" measurements embedded in the column. The weight column had both "pounds" and "kilogram" measurements. The "feet and inches" were converted to "cm", while the "pounds" were converted to "Kg"

Weight column before

Height column before

Contract

The contract column was a show for when the contract of the player for the club starts and ends. This column had "~" in between two years. This column was split into two "Contract start" and "Contract end" Splitting the column would be a lot easier for analysis. However, some players were "on loan" and some were "free" after the column was split I changed all the "On loan" and "Free" to null

Contract Before

Contract After

Values, Wages, Release Clause

The values, wages and release clause columns had different denominations in them constituting "Millions (M)", "Thousands(K)" and "Tens". To convert all these to their accurate type, I removed the Pounds sign and then multiplied each value by their respective denomination. An IF statement was run to multiply all values ending with "M" in millions and values ending with "K" in thousands.

Query Ran for Values (=IF(Right(cell number,1)="M",left(cell number,len(cell number)-1)*1000000,IF(Right(cell number,1)="K",Left(cell number,len(cell number)-1)*1000""))

This query was adjusted to also suit the wages and Release Clause

Value column before

Release clause before

Value, Wage and Release clause after. After being multiplied their currency was then added

Nationality

The nationality column was in its right state. It needed no cleaning so I proceeded to the next column

Joined

The joined column was the one out of many which were in its correct form and this column states the date each player joined their respective clubs. This column needed no change.

Joined date column

Weak Foot Rating, International reputation, Skill Moves

These columns contained ratings of 1-5. The values in this column had a "star" sign at the back of each. To clean these, the "star" sign which signified the rating was removed

Before

After

Hits

The hits column, quite similar to the wages column had numbers in a different denomination. (In thousands and tens)