Eye2Data: Exploratory Data Analysis

What is Exploratory Data Analysis? It typically means scanning the data to discover obvious patterns or to identify the main characteristics of a dataset with the help of visualizations. It is one of the critical steps in the data science process, as insights from this analysis are vital for scoping the modeling and prediction efforts.

So, let's take a quick tour of a few exploratory analyses that can be done on large datasets. Here, two major social networks, Twitter and Stack Overflow, are used as datasets.

Datasets and Basic Statistics:

Stack Overflow (community of active users working on data science projects)

1 year of data about data science discussions posted by ~41K users
~90K posts with Time, Location and Tags associated with these posts

Twitter (community of ordinary users interested in data science concepts)

1 month of tweets streamed for ~43K data science users
~1.5 Million tweets with Time, Location, Hashtags and Re-tweet information

So, basic statistics for data exploration could be:

Total rows (posts or tweets)
Total unique users
Total hashtags
Time period or duration for data
Histograms and rates (avg, min, max) of data generation (posts/week, tweets/day)
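
As a rough illustration, the statistics listed above can be computed with nothing more than the Python standard library. This is only a sketch: the `records` sample, its (user, day, hashtags) layout and the `basic_stats` helper are made up for the example and are not the actual Twitter or Stack Overflow data. It also counts unique hashtags, which is one possible reading of "total hashtags".

```python
from collections import Counter
from datetime import date

# Hypothetical sample: (user, day, hashtags) records standing in for tweets.
records = [
    ("alice", date(2015, 9, 1), ["bigdata", "iot"]),
    ("bob", date(2015, 9, 1), ["bigdata"]),
    ("alice", date(2015, 9, 2), ["datascience"]),
    ("carol", date(2015, 9, 4), []),
]

def basic_stats(rows):
    """Return row, user and hashtag counts plus simple daily rates."""
    days = [day for _, day, _ in rows]
    per_day = Counter(days)  # rows per day, only for days that have data
    span_days = (max(days) - min(days)).days + 1  # inclusive duration
    return {
        "total_rows": len(rows),
        "unique_users": len({user for user, _, _ in rows}),
        "unique_hashtags": len({tag for _, _, tags in rows for tag in tags}),
        "first_day": min(days),
        "last_day": max(days),
        "avg_per_day": len(rows) / span_days,
        "min_per_active_day": min(per_day.values()),
        "max_per_active_day": max(per_day.values()),
    }

stats = basic_stats(records)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Note that the min and max rates only look at days that have at least one row, so a completely silent day would not show up as a minimum of zero.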

Temporal Analysis

This is used to quickly find temporal trends or patterns in the dataset, and to identify outliers caused by special events (festivals, the Super Bowl, etc.). For example, the time series chart below shows the frequency of keywords commonly used in tweets related to data science. It clearly indicates that a lot of people were tweeting about "Big Data" during the selected time period. There is an interesting pattern here: Twitter activity decreases on the 4th-5th and 12th-13th of September 2015. Thus, one can hypothesize that "data science users are less active on Twitter on weekends than on weekdays." This may be related to the fact that most users are exposed to data science concepts and technology at work (i.e., on business days).
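
One simple way to check a weekday vs. weekend hypothesis like this is to count items per day and compare the averages. The sketch below assumes timestamps arrive as "YYYY-MM-DD HH:MM:SS" strings; the sample values and both helper functions are hypothetical, not taken from the real tweet stream.

```python
from collections import Counter
from datetime import datetime

# Hypothetical tweet timestamps (the format is an assumption).
timestamps = [
    "2015-09-10 10:15:00",  # Thursday
    "2015-09-10 14:30:00",  # Thursday
    "2015-09-11 09:00:00",  # Friday
    "2015-09-12 11:45:00",  # Saturday
    "2015-09-14 08:20:00",  # Monday
    "2015-09-14 16:05:00",  # Monday
]

def daily_counts(stamps):
    """Count how many items fall on each calendar day."""
    days = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S").date() for s in stamps]
    return Counter(days)

def weekday_weekend_average(counts):
    """Average daily count for Mon-Fri and for Sat-Sun (active days only)."""
    weekday = [n for day, n in counts.items() if day.weekday() < 5]
    weekend = [n for day, n in counts.items() if day.weekday() >= 5]

    def average(values):
        return sum(values) / len(values) if values else 0.0

    return average(weekday), average(weekend)

counts = daily_counts(timestamps)
weekday_avg, weekend_avg = weekday_weekend_average(counts)
print(f"weekday avg: {weekday_avg:.2f}, weekend avg: {weekend_avg:.2f}")
```

Days with no activity at all never appear in `counts`, so on real data it may be worth filling in the missing days with zeros before averaging.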

Spatial Analysis

This is used to find spatial patterns in the dataset, for example, identifying key areas with a dense population of data points. The analysis below was done for the Stack Overflow dataset and shows the distribution of posts related to some data science languages (R, Python and Java) across the globe. Active Stack Overflow users of the "R" language (displayed in blue) are concentrated in the eastern USA, Central Europe and the Indian subcontinent.
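
Before plotting anything on a map, dense areas can be spotted by snapping each point to a coarse latitude/longitude grid and counting points per cell. This is a minimal sketch under assumptions: the `posts` sample, the 10-degree cell size and the helper names are illustrative only, and the original analysis may well have been done in a mapping tool instead.

```python
import math
from collections import Counter

# Hypothetical posts: (latitude, longitude, language tag).
posts = [
    (40.7, -74.0, "r"),       # near New York
    (42.4, -71.1, "r"),       # near Boston
    (48.9, 2.4, "python"),    # near Paris
    (12.97, 77.6, "r"),       # near Bangalore
    (40.1, -75.3, "python"),  # near Philadelphia
]

def grid_cell(lat, lon, size=10.0):
    """Return the south-west corner of the grid cell containing the point."""
    return (math.floor(lat / size) * size, math.floor(lon / size) * size)

def densest_cells(rows, tag, size=10.0, top=3):
    """Most populated grid cells for posts carrying the given tag."""
    cells = Counter(
        grid_cell(lat, lon, size) for lat, lon, t in rows if t == tag
    )
    return cells.most_common(top)

r_hotspots = densest_cells(posts, "r")
print(r_hotspots)
```

A fixed-degree grid is crude (cells shrink towards the poles), but it is usually enough to see where the points pile up.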

Word Cloud Analysis

This is used to visually represent hashtags from textual data, to depict keyword metadata (a.k.a. tags) on websites, or to visualize free-form text. Word clouds can be created using a visualization tool like Tableau, an R package for word clouds, or an online word cloud generator such as Wordle. The word clouds below were created for both the Twitter and Stack Overflow datasets. This leads to an interesting observation: users of Stack Overflow post about specific data science technology issues, software or packages (like nodejs, apachespark, mongodb, etc.). Twitter users, however, tweet about general data science concepts (like big data, machine learning, data mining, etc.).
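
Whatever tool draws the cloud, its input is essentially a table of term frequencies. The sketch below shows one way to build that table for hashtags; the sample `tweets`, the regular expression and the lowercasing step are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical tweet texts.
tweets = [
    "Getting started with #BigData and #MachineLearning",
    "#bigdata pipelines are everywhere",
    "Notes on #DataMining and #BigData",
]

HASHTAG = re.compile(r"#(\w+)")

def hashtag_frequencies(texts):
    """Count hashtags across all texts, ignoring case."""
    counts = Counter()
    for text in texts:
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts

freqs = hashtag_frequencies(tweets)
for tag, n in freqs.most_common():
    print(tag, n)
```

The resulting counts can then be handed to a word cloud tool, which scales each term by its frequency.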

Distribution Analysis

This is used to identify the distribution of data points across specific areas. For example, let's consider three data science languages: Python, Java and R. The chart below shows the usage distribution of these languages across specific Twitter hashtags such as #BigData, #IOT (Internet of Things), #Cloud, #DataScience and Jobs. We can clearly see that tweets about Big Data, IoT and Cloud are dominated by Java, whereas Python and R are more popular in the Data Science tweets. Also, the job market on Twitter seems to be skewed towards Java and Python.
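
A chart like this boils down to a cross-tabulation: for each topic hashtag, count how often each language appears alongside it, then turn the counts into shares. Here is a minimal sketch, assuming each tweet has already been reduced to a set of lowercase tags; the sample data and the `language_share_by_topic` helper are hypothetical.

```python
from collections import Counter, defaultdict

TOPICS = {"bigdata", "iot", "cloud", "datascience", "jobs"}
LANGUAGES = {"python", "java", "r"}

# Hypothetical tweets, each reduced to its set of lowercase tags.
tweets = [
    {"bigdata", "java"},
    {"bigdata", "java", "python"},
    {"datascience", "python"},
    {"datascience", "r"},
    {"iot", "java"},
]

def language_share_by_topic(rows):
    """For each topic, the fraction of language mentions per language."""
    counts = defaultdict(Counter)
    for tags in rows:
        for topic in tags & TOPICS:
            for lang in tags & LANGUAGES:
                counts[topic][lang] += 1
    shares = {}
    for topic, langs in counts.items():
        total = sum(langs.values())
        shares[topic] = {lang: n / total for lang, n in langs.items()}
    return shares

shares = language_share_by_topic(tweets)
for topic in sorted(shares):
    print(topic, shares[topic])
```

A tweet that mentions two languages counts once for each, so the shares describe language mentions rather than tweets.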

Demographic Analysis

This is typically used for customer segmentation, which develops an understanding of the age, sex and racial composition of a population. The pie chart below shows the age groups of Stack Overflow users, based on their data science posts. One can see that the age distribution is skewed towards the 20-30 age group, which accounts for more than 68% of these Stack Overflow users.
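
The numbers behind such a pie chart are just the share of users in each age bucket. The sketch below uses made-up ages and ten-year buckets (20-29, 30-39, ...), which is an assumption and may not match the exact grouping used in the chart.

```python
from collections import Counter

# Hypothetical user ages.
ages = [22, 27, 29, 34, 41, 25, 19, 31]

def age_group(age, width=10):
    """Label the bucket an age falls into, e.g. 27 -> '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def age_group_shares(values):
    """Fraction of users in each age bucket."""
    counts = Counter(age_group(a) for a in values)
    total = len(values)
    return {group: n / total for group, n in counts.items()}

group_shares = age_group_shares(ages)
for group in sorted(group_shares):
    print(f"{group}: {group_shares[group]:.0%}")
```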

So, these were some of the basic analyses one can do to get a good understanding of the given datasets. Observations from these analyses can be used to form or revise hypotheses, which can in turn alter the predictive modeling efforts.

Complex social networks like Twitter and Stack Overflow require advanced visualization techniques to get deeper insights from social network analysis. These networks can be visualized using:

Templates for MS Excel like NodeXL
Software like Gephi or Graphviz
JavaScript libraries like D3.js, or R and Python packages like igraph
Sophisticated network databases like Neo4j

Stay tuned for the next post, which will discuss the visualizations created for social networks like Stack Overflow and Twitter.

Saturday, December 26, 2015
