Data Clean Up and Analysis
Tools used: OpenRefine, Python, Python Libraries (Pandas and Matplotlib)
I decided to refer back to my data science classes from freshman year and high school and brush up on using Python scripts and Python libraries to clean, reorganize, and render the data.
Initial Goal: To find a trend in how frequently the heat was turned on.
Steps taken:
- Uploaded the raw CSV file into OpenRefine to delete the columns that restart the process and to properly split and align the remaining columns
- Used a Python script to read the CSV file
- Deleted the rows containing new-process data
- Converted the heat on/off binary to float so that the column's mean gives the percentage of time the heat is on
- Found that the heater was on roughly 40% of the time, i.e. a minority of the time
- Exported the cleaned CSV data
- Tried to plot the trends of heat against other variables (this is still a work in progress)
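The Python side of the steps above can be sketched with Pandas. This is a minimal stand-in, not my actual script: the column names ("timestamp", "process", "heat") and the tiny inline sample are assumptions, since the real dataset came out of OpenRefine as a large CSV file.

```python
import io
import pandas as pd

# A tiny stand-in for the exported OpenRefine CSV; the real file had
# many more rows. The column names here are assumptions.
raw = io.StringIO(
    "timestamp,process,heat\n"
    "2023-01-01 00:00,run,1\n"
    "2023-01-01 00:15,run,0\n"
    "2023-01-01 00:30,new,0\n"
    "2023-01-01 00:45,run,1\n"
    "2023-01-01 01:00,run,0\n"
    "2023-01-01 01:15,run,0\n"
)
df = pd.read_csv(raw)

# Drop the rows that mark the start of a new process run.
df = df[df["process"] != "new"].copy()

# Convert the on/off binary to float; the mean of a 0/1 column
# is the fraction of time the heat was on.
df["heat"] = df["heat"].astype(float)
pct_on = df["heat"].mean() * 100
print(f"Heat on {pct_on:.0f}% of the time")  # 2 of 5 rows -> 40%

# Export the cleaned data for later plotting.
df.to_csv("heat_cleaned.csv", index=False)
```

Plotting would then be a matter of `df.plot(x="timestamp", y="heat")` with Matplotlib, which is the part still in progress.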
I feel good about the output. I think with more studying and time, I could focus on a wider array of insights and provide a more accurate analysis. Unfortunately, when dealing with data this large, data processing tools are fundamental, and I’m afraid that being rusty might have affected the quality of the analysis. I was reminded of how interesting and frustrating data analysis is, but it was a great opportunity to sit down and write code for data after such a long time.
The biggest struggle I had was the amount of data, and the cleaning process took a long time. I couldn’t do it through Python alone, so I had to get my hands dirty and edit using OpenRefine. One piece of feedback I have is that if we had totals and averages of the variables, it would have been easier to catch trends. If the data collector had an automated process that added up sums and averages of the collected data, it would have helped immensely.
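For the feedback about totals and averages: once the data is in Pandas, those summaries are one line. This is only a sketch with hypothetical column names ("heat", "temp"), to show the kind of summary table that would have helped if the collector produced it automatically.

```python
import io
import pandas as pd

# Hypothetical slice of the collected data; "heat" and "temp" are
# assumed variable names, not the logger's real headers.
raw = io.StringIO(
    "heat,temp\n"
    "1,20.5\n"
    "0,21.0\n"
    "1,19.8\n"
    "0,21.4\n"
)
df = pd.read_csv(raw)

# One call produces the totals and averages of every variable,
# which makes rough trends visible at a glance.
summary = df.agg(["sum", "mean"])
print(summary)
```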
(To be edited: I will embed the source code of my process once I figure out the problems with my GitHub account)