Saturday, December 26, 2015

Exploratory Data Analysis

What is Exploratory Data Analysis? It typically means scanning the data to discover obvious patterns and identify the main characteristics of a dataset with the help of visualizations. It is one of the critical steps in the Data Science process, as insights from this analysis are vital for scoping the Modeling and Prediction efforts.

So, let's take a quick tour of a few exploratory analyses that can be done for large datasets. Here, two major social networks, Twitter and Stack Overflow, are considered as datasets.

Datasets and Basic Statistics:

  • Stack Overflow (community of active users working on data science projects)
    • 1 year of data about data science discussions posted by ~41K users
    • ~90K posts, with Time, Location and Tags associated with each post
  • Twitter (community of ordinary users interested in data science concepts)
    • 1 month of tweets streamed for ~43K data science users
    • ~1.5 Million tweets with Time, Location, Hashtags and Re-tweet information
  • So, basic statistics for data exploration could be (a minimal pandas sketch follows this list):
    • Total rows (posts or tweets)
    • Total unique users
    • Total hashtags
    • Time period or duration of the data
    • Histograms and rate (avg, min, max) of data generation (posts/week, tweets/day)
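
As a quick illustration, here is a minimal pandas sketch for computing such statistics. The file name and column names (`created_at`, `user`) are hypothetical, not the actual schema of our datasets:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the raw tweets; file name and columns are hypothetical.
df = pd.read_csv("tweets.csv", parse_dates=["created_at"])

print("Total rows:", len(df))
print("Unique users:", df["user"].nunique())
print("Time period:", df["created_at"].min(), "to", df["created_at"].max())

# Rate of data generation: tweets per day, then its average/min/max.
per_day = df.set_index("created_at").resample("D").size()
print("Tweets/day - avg %.1f, min %d, max %d"
      % (per_day.mean(), per_day.min(), per_day.max()))

# Histogram of daily volume.
per_day.plot(kind="hist", bins=20, title="Tweets per day")
plt.show()
```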

Temporal Analysis

This is used to quickly find temporal trends or patterns in the dataset, along with identifying outliers caused by special events (like festivals, the Super Bowl etc.). For example, the time series chart below shows the frequency of keywords commonly used in tweets related to data science. It clearly indicates that a lot of people were tweeting about "Big Data" during the selected time period. There is also an interesting pattern here: Twitter activity decreases on the 4th-5th and 12th-13th of September 2015. Thus, one can hypothesize that "Data science users are less active on Twitter on weekends as compared to weekdays." This can be related to the fact that most users are exposed to data science concepts and technology at work (i.e. during business days).
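
A chart like this can be reproduced with a few lines of pandas; the keyword list, file name and columns below are illustrative assumptions:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("tweets.csv", parse_dates=["created_at"])  # hypothetical file
keywords = ["big data", "machine learning", "data mining"]

# Daily frequency of each keyword in the tweet text.
daily = pd.DataFrame({
    kw: df[df["text"].str.contains(kw, case=False, na=False)]
          .set_index("created_at").resample("D").size()
    for kw in keywords
}).fillna(0)

daily.plot(title="Keyword frequency in data science tweets")
plt.ylabel("tweets per day")
plt.show()
```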



Spatial Analysis

This is used to find spatial patterns in the dataset, for example, identifying key areas with a dense population of data points. The analysis below was done for the Stack Overflow dataset, and it shows the distribution of posts related to some data science languages (R, Python and Java) across the globe. Active Stack Overflow users for the "R" language (displayed in blue) are concentrated in the eastern USA, Central Europe and the Indian subcontinent.
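
If the posts have already been geocoded, a rough version of such a map can be sketched as a simple scatter plot; the file, lat/lon columns and tag values here are assumptions:

```python
import pandas as pd
import matplotlib.pyplot as plt

posts = pd.read_csv("so_posts_geo.csv")  # hypothetical, pre-geocoded posts

colors = {"r": "blue", "python": "green", "java": "red"}
for lang, color in colors.items():
    subset = posts[posts["tag"] == lang]
    plt.scatter(subset["lon"], subset["lat"], s=4, c=color, alpha=0.4, label=lang)

plt.title("Stack Overflow posts by language tag")
plt.xlabel("longitude")
plt.ylabel("latitude")
plt.legend()
plt.show()
```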


Word Cloud Analysis

This is used to visually represent hashtags from textual data, to depict keyword metadata (a.k.a. Tags) on websites, or to visualize free-form text. Word clouds can be created using visualization tools like Tableau, R packages like wordcloud, or online generators like Wordle. The word cloud below was created for both the Twitter and Stack Overflow datasets. It leads to an interesting observation: users of Stack Overflow post about specific data science technology issues, software or packages (like nodejs, apachespark, mongodb etc.), whereas Twitter users tweet about general data science concepts (like big data, machine learning, data mining etc.).
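
In Python, the wordcloud package offers yet another option. A minimal sketch, assuming a column of hashtag strings:

```python
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud

df = pd.read_csv("tweets.csv")             # hypothetical file
text = " ".join(df["hashtags"].dropna())   # assumed column of hashtag strings

wc = WordCloud(width=800, height=400, background_color="white").generate(text)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```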


Distribution Analysis

This is used to identify the distribution of data points across specific areas. For example, let's consider 3 data science languages: Python, Java and R. The chart below shows the usage distribution of these languages across specific Twitter hashtags like #BigData, #IOT (Internet of Things), #Cloud, #DataScience and #Jobs. We can clearly see that tweets for Big Data, IOT and Cloud are dominated by the Java language. However, Python and R are more popular in the Data Science tweets. Also, the job market on Twitter seems to be skewed towards Java and Python.
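
A grouped bar chart like this boils down to a cross-tabulation of hashtags against language mentions. A hedged sketch (the word-boundary regex keeps a short name like "r" from matching every word containing the letter):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("tweets.csv")  # hypothetical file with 'text' and 'hashtags' columns

hashtags = ["bigdata", "iot", "cloud", "datascience", "jobs"]
languages = ["python", "java", "r"]

# For each hashtag, count the tweets that also mention each language.
counts = pd.DataFrame(
    {lang: [((df["hashtags"].str.contains(tag, case=False, na=False)) &
             (df["text"].str.contains(r"\b%s\b" % lang, case=False, na=False))).sum()
            for tag in hashtags]
     for lang in languages},
    index=hashtags)

counts.plot(kind="bar", title="Language mentions per hashtag")
plt.ylabel("tweets")
plt.show()
```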


Demographic Analysis

This is typically used for customer segmentation, which develops an understanding of the age, sex and racial composition of a population. The pie chart below describes the age groups of Stack Overflow users derived from their data science posts. One can see that the age distribution is skewed towards the 20-30 age group, which accounts for more than 68% of these Stack Overflow users.
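
Binning ages into groups and plotting their shares takes only a few lines of pandas; the file and the `age` column are assumptions:

```python
import pandas as pd
import matplotlib.pyplot as plt

users = pd.read_csv("so_users.csv")  # hypothetical file with an 'age' column

bins = [0, 20, 30, 40, 50, 120]
labels = ["<20", "20-30", "30-40", "40-50", "50+"]
groups = pd.cut(users["age"], bins=bins, labels=labels).value_counts()

groups.plot(kind="pie", autopct="%1.0f%%", title="Stack Overflow users by age group")
plt.ylabel("")
plt.show()
```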


So, these were some of the basic analyses one can do to get a good understanding of the given datasets. Observations from these analyses can be used to form or change hypotheses, which can alter the Prediction Modeling efforts.

Complex social networks like Twitter and Stack Overflow require advanced visualization techniques for getting deeper insights from Social Network Analysis. These networks can be visualized using (a short igraph sketch follows the list):
  • Templates for MS Excel like NodeXL
  • Software like Gephi or GraphViz
  • JavaScript libraries like D3.js, or R/Python packages like igraph
  • Sophisticated network databases like Neo4j
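
As a taste of the package route, here is a minimal python-igraph sketch that builds a toy directed network (say, who answered whom) and computes a couple of standard metrics; the edge list is made up for illustration:

```python
import igraph as ig

# Hypothetical (asker, answerer) pairs extracted from posts.
edges = [("alice", "bob"), ("alice", "carol"), ("dave", "bob"), ("carol", "bob")]
g = ig.Graph.TupleList(edges, directed=True)

# Basic metrics used in Social Network Analysis.
for name, deg, pr in zip(g.vs["name"], g.degree(), g.pagerank()):
    print("%-6s degree=%d pagerank=%.3f" % (name, deg, pr))

# ig.plot(g)  # drawing additionally requires the cairo or matplotlib backend
```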
Stay tuned for the next post, which will discuss the visualizations created for social networks like Stack Overflow and Twitter.

Saturday, December 5, 2015

Big Learnings from Big Data!


As D.J. Patil, the Chief Data Scientist in the US Office of Science and Technology Policy, put it, “Data Science is a team sport, which requires a lot of collaboration and team-work.” As the Eye2Data team started working on the big data project, we realized that data collection, data cleansing and prediction modeling are some of the key components of a typical data science project, and the teams behind them have to work in tandem with a lot of collaboration.

It all started with “Network Thinking”



This fascinating course started with a great refactoring of the mind: network thinking. We were encouraged to go beyond the 5 V’s (Volume, Veracity, Velocity, Variety and Value). The methodology of Big Data analysis can be summarized in the following aspects:
  • Datafication 
  • Analyzing the population in its entirety instead of sampling data
  • Interconnecting data sources to get richer and more realistic insights
  • Capitalizing on the spatial-temporal characteristics of a Big Data phenomenon, and 
  • Tapping correlation instead of chasing causality 

The idea of thinking in network terms presented by Dr. Barabasi tremendously expanded our view of the world. Just as he said, “we always lived in a connected world, except we were not so much aware of it.” We can never be independent of the environment and the people around us, nor can anything else in the world. The interactions and connections between entities play an important role in shaping what we observe. As data science practitioners, our goal is to find these relationships and associations, and to describe, measure and quantify the underlying processes.

Think about a simple online-marketing application where you wish to target visitors based on their tendency to purchase products on your web page. Traditionally, older recommender systems look at user features and their click and navigational behavior, and recommend accordingly based on their current and past purchase history. This can be done using predictive modeling techniques: data mining (classification/regression modeling), natural language processing (keyword search behavior) and/or generative modeling approaches (collaborative filtering and the Markov family of models). So far, we have been looking at the user as a singular entity whose online behavior is determined by his/her current and previous actions. The learning is based on modeling user behavior and customizing recommendations for variation in user characteristics across the population.
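
As a concrete (if toy-sized) illustration of the collaborative filtering idea, here is a sketch using cosine similarity between users over a made-up purchase matrix:

```python
import numpy as np

# Toy user-item matrix (rows: users, cols: products); 1 = purchased.
R = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)

# Cosine similarity between users.
unit = R / np.linalg.norm(R, axis=1, keepdims=True)
sim = unit @ unit.T

# Score unseen items for user 0 as similarity-weighted votes of all users.
user = 0
scores = sim[user] @ R
scores[R[user] > 0] = -np.inf   # mask items the user already bought
print("Recommend item:", int(np.argmax(scores)))  # item 2, bought by similar user 1
```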


An exciting new way of looking at recommender systems during the last few years, which is not only simple but also quite effective, is to look at users, products, websites and companies as entities. Users ‘visit’ websites to look up or buy products. Products are placed in different positions on a web page, which has several links to other pages. Companies would like to place their products in positions that maximize their profits through increased sales. Here, we see the interconnectedness simply by looking at the what, why, who, how and when of this generalized e-commerce example. Such a rich multi-mode network can be simplified into a variety of single-mode networks to answer different business questions. For instance, a user-user network could be created by linking users who purchased ‘k’ or more common products during a time interval, thereby creating communities of similar users (see the sketch below). Such implicit networks capture the inherent connectedness of everything in life. One can see that collaborative filtering and other generative processes are just other ways of thinking along similar lines, but more mathematically involved and prone to errors.
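
A minimal sketch of that single-mode projection, with a made-up purchase log and threshold k:

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical log: user -> products bought during the time interval.
purchases = {
    "u1": {"laptop", "mouse", "monitor"},
    "u2": {"laptop", "mouse", "keyboard"},
    "u3": {"book", "lamp"},
}

k = 2  # minimum common products needed to link two users
edges = defaultdict(int)
for u, v in combinations(purchases, 2):
    common = len(purchases[u] & purchases[v])
    if common >= k:
        edges[(u, v)] = common

print(dict(edges))  # {('u1', 'u2'): 2} -> u1 and u2 land in the same community
```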

Our project was inspired by the idea of network thinking. To begin with, we were interested in describing the current status of data science communities on Stack Overflow. The most basic idea was to summarize the behavior of people in this field. However, we incorporated the first level of network thinking by taking the connections between users into consideration. A network graph was built to capture the information exchanged between users, and the network metrics derived from it helped us generate many more insights. The second level of network thinking was to connect user groups from Twitter to users from Stack Overflow. Twitter is a more real-time platform compared to Stack Overflow, so we expected that the performance of tweets could be a good indicator for predicting the performance of posts on Stack Overflow.

Social Media and Big Data


One of the key learnings came right at the start of the project. As we were deliberating over different project ideas, we had two distinct routes: either identify a good-quality dataset and brainstorm a business question around it, or identify a business problem and try to find datasets which can help address it. While for academic projects identifying the data first is a wee bit easier and safer, the second approach actually provides a much wider scope for imagination, which will be very handy in the long run, especially as we look to work with innovative companies.


Social media is definitely one of the largest data sources for big data. It’s also the low-hanging fruit for network science. While we probably have social media to thank for opening the doors of network science analysis, the scope and the domains where network science can be used are immensely vast. It’s a step away from traditional thinking, in which the inter-relations between actors were difficult to infer. While creating network visualizations helped us gain insights into the relationships within the data, the different network metrics helped us understand the influence and importance of key actors within the network, which would otherwise be difficult to infer, especially in large datasets.

The 80/20 Data Principle



As Andreas Weigend says, Big Data is like crude oil, in the sense that big data itself is practically useless and needs to be refined before its true value can be extracted. The importance of data collection and cleaning cannot be emphasized enough. Public data sources are data rich but information poor! The project made us think about the practical aspects of data collection and how a major component of a Big Data project is focused on data preprocessing. The process of purifying data of noise, and its implications for accuracy, made us strive hard to improve the quality of our datasets. The quality of the results greatly depends on the quality of the data sources and the data cleaning. These are also the most time-consuming and challenging phases of building any analytic application. Data cleaning is all the more important in the case of unstructured and semi-structured data, which normally requires a whole new set of cleaning procedures to convert it into a structured format.
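
To make this concrete, here is a minimal sketch of what cleaning a raw tweet into something structured might look like; the regexes are illustrative, not our actual pipeline:

```python
import re

def clean_tweet(text):
    """Minimal tweet cleaning: a sketch, not a production pipeline."""
    text = re.sub(r"http\S+", "", text)       # drop URLs
    text = re.sub(r"@\w+", "", text)          # drop user mentions
    hashtags = re.findall(r"#(\w+)", text)    # pull hashtags out as structure
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # keep letters only
    text = re.sub(r"\s+", " ", text).strip().lower()
    return text, hashtags

print(clean_tweet("Loving #BigData with @user! http://t.co/xyz"))
# -> ('loving bigdata with', ['BigData'])
```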

Tools and Technology Selection


Big data is like the Wild West: there are lots of opportunities and lots of hidden value, but just like digging for gold, you should know where to find it. With the right analytics platform, you can find the gold faster; the platform does the digging for you. The technology stack for big data projects is different from traditional applications. Unlike traditional software applications, where the focus is more on the logical design and structure, the focus in big data projects is on the data. Due to the sheer size of the data, a lot of consideration has to go into the data storage and data processing techniques. The volume of data is too large for comprehensive analysis, and the range of potential correlations and relationships between disparate data sources is too great for any analyst to test all hypotheses and derive all the value buried in the data. For our project, MongoDB served us well for storing the unstructured tweets, while MySQL served our purpose for the more structured StackOverflow data. For data processing, while it’s possible to use traditional technologies like Java and .NET, they require a lot more effort, custom code development and time. On the other hand, programming languages like R and Python, aided by their powerful packages for data processing, make great candidates.
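
A short pymongo sketch shows why MongoDB is a comfortable fit for tweets: the nested JSON can be stored as-is and queried with dot notation (the database name and document here are made up):

```python
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["project"]  # hypothetical database name

# Tweets arrive as nested JSON, so they can be stored as-is; no schema needed.
db.tweets.insert_one({
    "user": "some_handle",
    "text": "Learning #MachineLearning with #Python",
    "entities": {"hashtags": ["MachineLearning", "Python"]},
})

# Dot notation reaches into nested fields and arrays.
for t in db.tweets.find({"entities.hashtags": "Python"}):
    print(t["user"], "-", t["text"])
```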

Hypothesize. Analyze. Evolve. Repeat.


Projects evolve with time. Unlike traditional software development, where the end product and result are much easier to visualize, data science projects are by their nature exploratory for the most part. Methodologies and approaches which appeared feasible at the start may run into roadblocks, while mid-way through the project one may identify a new data source or methodology which can improve the project drastically. We started with a dream of leveraging big data to predict trends like salaries and job postings using social media and network science, but ended up with a more practical and focused objective of predicting user interaction for data science languages. Data science project ideas thus evolve and mature with time.

The field of Big Data is still in its early stages of theory, but it is fast evolving to be highly beneficial across industry domains. It goes beyond simple applications of data mining, statistical model fitting, event monitoring or learning through expert systems. Big Data demands a sound understanding of the complexity of the underlying phenomenon and innovative methods to utilize huge amounts of heterogeneous information. The hands-on experience through the homework assignments and project tasks validated our belief that most of the data will not be put to direct use in the modeling, but will rather serve only as a means. We also learnt that there is no single way to analyze, as every problem is different and heavily driven by data quality and availability.

We also gained hands-on experience using a variety of tools such as MongoDB, Hadoop, Neo4j, R, Python and Gephi for various portions of our analysis. We learnt that using a single tool is not the most efficient way to execute such a project.

To conclude, the following quote by Stuart Ellman perfectly describes our experience of the role of network analysis in this big data project: