Introduction
A data cleaning challenge was organized in the data tech space by Promise Chinonso. The FIFA data was a messy one. As a participant in this challenge, I used Microsoft Excel for the cleaning. A data dictionary for the project was given which aided in navigating through the cleaning process.
About the Data
The messy data was gotten from Kaggle.com. It had 188979 rows and 77 columns. The columns contained in this dataset were Name, Long Name, photo URL, player URL, Age, OVA, POT, Team, Contract, ID, Height, Weight, foot, BOV, Growth, Joined, Loan Date End, Value, Wage, Release Clause, Attacking, Crossing, Finishing, Heading Accuracy, Short Passing, Volleys, Skill, Dribbling, Curve, IFK Accuracy, Long Passing, Ball Control, Movement, Acceleration, Speed, Agility, Reactions, Balance, Power, Shot Power, Jumping, Stamina, Strength, Long Shots, Mentality, Aggression, Interceptions, Positioning, Vision, Penalties.
Steps took in cleaning the data
The data was cleaned using MS Excel, so the first step in cleaning this data was duplicating it. Duplicating the sheet helps to trace back where you're coming from. It also helps in case a mistake is made, you always have the original data to work with.
Going through the dataset, the things noticed were
Wrong Data Types
Inconsistency with Units (some columns had different measurements embedded in them)
Null Values
Unwanted columns
This data was cleaned by a lot of people and one thing I noticed was some had special characters in their columns. However, in mine, I loaded the data on WPS before excel, so automatically all the special characters were removed before it was moved to Excel.
Columns cleaned
The data had a "Name" column and a "Long Name" column. Most of the names were Italian and as such, they had "A Circumflex (◌̂)". This was going to bring some wrong analysis when insights will be thrown upon. Assuming all names were to be sorted in ascending order (A-Z), names with circumflexes queued behind the letter Z, even the letter A because of the circumflex.
The player Url had the names of the players without circumflexes, the column was split into several others by using "/" as a delimiter to extract the names into First Name and Last Name.
The Playerurl column
The extracted names from the Playerurl.
OVA and POT (Overall Analysis and Players Potential)
The overall analysis and the player's potential were to be in percentage as stated in the data dictionary, but these columns were in the wrong type format. To fix this I had to standardize the column by changing the data type to Percentage
Age Column
The age column was also in the wrong data type. And since this data is 2 years old, I decided to add "2 years" to make the age up to date.
Age Before
Age After ( Data Type = Number, Age up to date)
Club
The club column consisted of the clubs each player was under, however in entering this column errors were made. The error noticed was there were too many spaces before the club name and a few of the clubs had the number "1" before them. To clean up the spaces I used the Trim and clean function alongside the query to remove the number one.
Club Before
Club After
Height and Weight
These columns were tedious. The Height column had both "cm" and "feets" measurements embedded in the column. The weight column had both "pounds" and "kilogram" measurements. The "feet and inches" were converted to "cm", while the "pounds" were converted to "Kg"
Weight column before
Height column before
Contract
The contract column was a show for when the contract of the player for the club starts and ends. This column had "~" in between two years. This column was split into two "Contract start" and "Contract end" Splitting the column would be a lot easier for analysis. However, some players were "on loan" and some were "free" after the column was split I changed all the "On loan" and "Free" to null
Contract Before
Contract After
Values, Wages, Release Clause
The values, wages and release clause columns had different denominations in them constituting "Millions (M)", "Thousands(K)" and "Tens". To convert all these to their accurate type, I removed the Pounds sign and then multiplied each value by their respective denomination. An IF statement was run to multiply all values ending with "M" in millions and values ending with "K" in thousands.
Query Ran for Values (=IF(Right(cell number,1)="M",left(cell number,len(cell number)-1)*1000000,IF(Right(cell number,1)="K",Left(cell number,len(cell number)-1)*1000""))
This query was adjusted to also suit the wages and Release Clause
Value column before
Release clause before
Value, Wage and Release clause after. After being multiplied their currency was then added
Nationality
The nationality column was in its right state. It needed no cleaning so I proceeded to the next column
Joined
The joined column was the one out of many which were in its correct form and this column states the date each player joined their respective clubs. This column needed no change.
Joined date column
Weak Foot Rating, International reputation, Skill Moves
These columns contained ratings of 1-5. The values in this column had a "star" sign at the back of each. To clean these, the "star" sign which signified the rating was removed
Before
After
Hits
The hits column, quite similar to the wages column had numbers in a different denomination. (In thousands and tens)
Before
"1.6k" was fixed to 1600
Conclusion
The FIFA dataset was quite challenging and as such unlocked new levels of cleaning, I never thought I had. The messy dataset is now ready for analysis.