My Reddit Data
Complete | Data Science, Python, Reddit, PRAW, Altair, Pandas, Jupyter Notebook
Working on a Data Science Project
I took an Introduction to Data Science course at my university, and one of the assignments was a project analyzing a personal dataset. I decided to analyze my Reddit usage data.
Analyzing My Reddit Usage Data
For the details of the project and the source code, you can check out the GitHub repository.
For the findings and the final report, check out its website.
Why Reddit?
The project required us to analyze a personal dataset. I chose my Reddit usage data because it is essentially the only personal dataset I have that is neither too personal nor too complicated, unlike my YouTube or browser history.
Where Does the Data Come From?
Reddit allows you to download your data. You can request your data from here. On top of the data that you can request, I also used Reddit's API to collect some additional data.
In addition to the data from Reddit, I used human annotations to add information about subreddits, and compared those annotations against my own classification scheme.
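As a rough illustration of the API step, here is a minimal sketch of collecting public subreddit metadata through PRAW. Which fields were actually collected for the project is not stated here, so the chosen attributes (`subscribers`, `over18`, `created_utc`) and the helper name `fetch_subreddit_info` are assumptions made for the example.

```python
def fetch_subreddit_info(reddit, names):
    """Return one dict of public metadata per subreddit name.

    `reddit` is assumed to be an authenticated `praw.Reddit` instance, e.g.
        import praw
        reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="...")
    The selected fields are illustrative, not the project's actual choice.
    """
    rows = []
    for name in names:
        sub = reddit.subreddit(name)  # lazy object; data is fetched on attribute access
        rows.append({
            "subreddit": name,
            "subscribers": sub.subscribers,
            "over18": sub.over18,
            "created_utc": sub.created_utc,
        })
    return rows
```

The resulting list of dicts can be passed straight to `pandas.DataFrame(rows)` and joined with the exported data on the subreddit name.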
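One simple way to combine several annotators' tags and compare them with a personal scheme is a majority vote followed by an agreement rate. The sketch below is only an assumption about how this could be done, not the method used in the project; the subreddit names, labels, and function names are made up for illustration.

```python
from collections import Counter

# Made-up annotations: subreddit -> one label per annotator.
annotations = {
    "python": ["programming", "programming", "education"],
    "aww": ["animals", "animals", "animals"],
    "worldnews": ["news", "politics", "news"],
}

# Made-up personal classification scheme to compare against.
my_labels = {"python": "programming", "aww": "animals", "worldnews": "politics"}


def majority_label(labels):
    """Return the most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def agreement_rate(annotations, reference):
    """Share of subreddits whose majority label matches the reference label."""
    shared = [name for name in annotations if name in reference]
    if not shared:
        return 0.0
    matches = sum(majority_label(annotations[n]) == reference[n] for n in shared)
    return matches / len(shared)


consensus = {name: majority_label(labels) for name, labels in annotations.items()}
print(consensus)
print(agreement_rate(annotations, my_labels))
```

A low agreement rate would suggest that either the annotators or the personal scheme are drawing category boundaries differently, which is worth checking before the tags are used in any analysis.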
What Did I Do?
Most of the project was cleaning and filtering the data, then visualizing it to spot patterns and form hypotheses. I used Pandas and Altair for this. After forming some hypotheses, I used statistical methods to test them. For the final report, see the website.
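The cleaning and filtering step can be sketched with Pandas roughly as follows. The column names (`id`, `date`, `subreddit`) and the sample rows are assumptions standing in for an exported file, not the actual schema of the Reddit export.

```python
import pandas as pd

# Made-up rows standing in for an exported file; column names are assumptions.
raw = pd.DataFrame({
    "id": ["a1", "a2", "a2", "a3", "a4", "a5"],
    "date": [
        "2023-01-05 09:15:00", "2023-01-05 21:40:00", "2023-01-05 21:40:00",
        "not a date", "2023-01-07 21:05:00", "2023-01-08 10:00:00",
    ],
    "subreddit": ["python", "aww", "aww", "python", "python", None],
})


def clean(df):
    """Drop duplicate ids, parse timestamps, and remove incomplete rows."""
    out = df.drop_duplicates(subset="id").copy()
    # Unparseable timestamps become NaT instead of raising an error.
    out["date"] = pd.to_datetime(out["date"], errors="coerce", utc=True)
    out = out.dropna(subset=["date", "subreddit"])
    return out.reset_index(drop=True)


def activity_by_hour(df):
    """Count rows per hour of day (UTC)."""
    counts = df.groupby(df["date"].dt.hour).size()
    return counts.rename("count").rename_axis("hour")


cleaned = clean(raw)
by_hour = activity_by_hour(cleaned)
print(by_hour)
```

A table like `by_hour.reset_index()` is already in the long format Altair expects, so it could be passed to something like `alt.Chart(...).mark_bar().encode(x="hour:O", y="count:Q")` to look for patterns before forming a hypothesis.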
Limitations
The limitations fall into two general categories:
Data Sourced Limitations
Data Completeness: Because my time on Reddit and the ways I interact with it are somewhat limited, the data falls short for some of the possible analyses.
Subjectivity: Reddit does not have a native system for classifying subreddits. Even though I tried to solve this using human annotators, their subjective biases may affect the accuracy of the subreddit tags.
Personal Limitations
Privacy: Since the project is based on personal data, I had to be careful about privacy. There were some details that I did not want to share, so I left them out of the analysis.
Knowledge: Especially in the early stages of the project, I did not know enough about data science and data analysis. Even though I tried to improve the project and the analysis as I learned more, some parts are still inherited from those early stages, and there are others that I could not improve. With more knowledge and experience, including some knowledge of machine learning, I believe the project can be improved.
Future Work
Longitudinal Analysis: Since the data covers my usage of a social media platform, the analysis can be updated with newer data over a longer period to observe changes in behavior and interests over time.
Advanced Analysis Techniques: If I continue to take courses on data science and machine learning, I might return with more knowledge and experience and improve the parts that I am not able to handle now.
Final Report Website: I have created a website for the final report. However, due to my limited knowledge, especially of Altair, I could not make the website as detailed as I wanted or combine it with interactive visualizations. I believe there is room for improvement there.