As D.J. Patil, Chief Data Scientist in the US Office of Science and Technology Policy, put it, “Data Science is a team sport, which requires a lot of collaboration and team-work.” As the Eye2Data team started working on the big data project, we realized that data collection, data cleansing and prediction modelling are some of the key components of a typical data science project, and that the teams behind them have to work in tandem with a great deal of collaboration.
It all started with “Network Thinking”
This fascinating course started with a great refactoring of the mind: network thinking. We were encouraged to go beyond the 5 V’s (Volume, Veracity, Velocity, Variety and Value). The methodology of Big Data analysis can be summarized in the following aspects:
- Datafication
- Analyzing the population in its entirety instead of sampling data
- Interconnecting data sources to get richer and more realistic insights
- Capitalizing on the spatial-temporal characteristics of a Big Data phenomenon, and
- Tapping correlation instead of chasing causality
The idea of thinking in network terms presented by Dr. Barabasi tremendously expanded our view of the world. Just as he said, “we always lived in a connected world, except we were not so much aware of it.” We are never independent of the environment and the people around us, and neither is anything else in the world. The interactions and connections between entities play an important role in shaping what we observe. As data science practitioners, our goal is to find these relationships and associations, and to describe, measure and quantify the underlying processes.
Think about a simple online-marketing application where you wish to target visitors based on their tendency to purchase products on your web page. Traditionally, recommender systems look at user features and click and navigational behavior, and recommend accordingly based on current and past purchase history. This can be done using predictive modeling techniques: data mining (classification/regression modeling), natural language processing (keyword search behavior) and/or generative modeling approaches (collaborative filtering and the Markov family of models). So far, we have been looking at the user as a singular entity whose online behavior is determined by his or her current and previous actions. The learning is based on modeling user behavior and customizing recommendations as user characteristics vary across the population.
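To make this concrete, here is a minimal sketch of the collaborative-filtering idea on a toy purchase matrix; the matrix, the user index and the scoring rule are illustrative assumptions rather than the method of any particular system.

```python
# Minimal sketch of user-based collaborative filtering on a toy purchase matrix.
# The data and the scoring rule are illustrative assumptions.
import numpy as np

# Rows = users, columns = products; 1 = purchased, 0 = not purchased.
purchases = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
], dtype=float)

# Cosine similarity between users.
norms = np.linalg.norm(purchases, axis=1, keepdims=True)
sim = (purchases @ purchases.T) / (norms @ norms.T)

# Score unseen products for user 0 by similarity-weighted votes of all users.
user = 0
scores = sim[user] @ purchases
scores[purchases[user] > 0] = -np.inf      # do not re-recommend owned products
print("recommended product index:", int(np.argmax(scores)))
```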
An exciting new way of looking at recommender systems that has emerged over the last few years, and which is not only simple but also quite effective, is to treat users, products, websites and companies as entities. Users ‘visit’ websites to look up or buy products. Products are placed in different positions on a web page, which has several links to other pages. Companies would like to place their products in positions that maximize their profits through increased sales. Here we see the interconnectedness simply by looking at the what, why, who, how and when of this generalized e-commerce example. Such a rich multi-mode network can be simplified into a variety of single-mode networks to answer different business questions. For instance, a user-user network could be created by linking users who purchased ‘k’ or more common products during a time interval, thereby revealing communities of similar users. Such implicit networks capture the inherent connectedness of everything in life. One can see that collaborative filtering and other generative approaches are just other ways of thinking along similar lines, only more mathematically involved and more error-prone.
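As a rough sketch (assuming the networkx library and an illustrative edge list), the user-user projection just described can be built from a bipartite user-product graph and thresholded at k shared products:

```python
# Sketch: project a user-product purchase network onto a user-user network,
# keeping an edge only when two users share at least k common products.
# The purchase list below is made up for illustration.
import networkx as nx
from networkx.algorithms import bipartite

purchases = [("u1", "p1"), ("u1", "p2"), ("u2", "p1"),
             ("u2", "p2"), ("u3", "p2"), ("u3", "p3")]
users = {u for u, _ in purchases}

B = nx.Graph()
B.add_nodes_from(users, bipartite=0)
B.add_nodes_from({p for _, p in purchases}, bipartite=1)
B.add_edges_from(purchases)

# Weighted projection: edge weight = number of products two users have in common.
P = bipartite.weighted_projected_graph(B, users)

k = 2
similar_users = [(u, v) for u, v, w in P.edges(data="weight") if w >= k]
print(similar_users)   # e.g. [('u1', 'u2')]
```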
Our project was inspired by the idea of network thinking. To begin with, we were interested in describing the current state of the data science community on Stack Overflow. The most basic idea was to summarize the behavior of people in this field. We then incorporated the first level of network thinking by taking the connections between users into consideration: a network graph was built to capture the information exchanged between users, and the network metrics derived from it helped us generate far richer insights. The second level of network thinking was to connect user groups on Twitter to users on Stack Overflow. Twitter is a more real-time platform compared to Stack Overflow, and we expected that the performance of tweets could serve as a good indicator for predicting the performance of posts on Stack Overflow.
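A rough sketch of that first level of network thinking, again assuming networkx and using made-up user names in place of the real Stack Overflow data:

```python
# Sketch: build a directed graph of question/answer interactions and compute a
# few network metrics. An edge points from the asker to the answerer, so that
# incoming links accumulate on users who answer many questions.
import networkx as nx

interactions = [("bob", "alice"), ("carol", "alice"),
                ("carol", "bob"), ("alice", "dave")]

G = nx.DiGraph()
G.add_edges_from(interactions)

# In-degree: how many distinct askers a user has helped.
print("askers helped:", dict(G.in_degree()))

# PageRank as a simple proxy for influence in the information-exchange network.
print("influence:", nx.pagerank(G))
```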
Social Media and Big Data
One of the key learnings came right at the start of the project. As we deliberated over different project ideas, we had two distinct routes we could take: either identify a good-quality dataset and brainstorm a business question around it, or identify a business problem and try to find datasets that can help address it. While identifying the data first is a wee bit easier and safer for academic projects, the second approach provides a much wider scope for imagination, which will be very handy in the long run, especially as we look to work with innovative companies.
Social media is definitely one of the largest data sources for big data. It’s also the low-hanging fruit for network science. While we probably have social media to thank for opening the doors to network science analysis, the scope and the domains where network science can be used are immensely vast. It’s a step away from traditional thinking, in which the inter-relations between actors were difficult to infer. While creating network visualizations helped us gain insight into the relationships within the data, the different network metrics gave valuable insight into the influence and importance of key actors within the network, which would otherwise be difficult to discern, especially in large datasets.
The 80/20 Data Principle
As Andreas Weigend says, big data is like crude oil, in the sense that big data itself is practically useless and needs to be refined before its true value can be extracted. The importance of data collection and cleaning cannot be emphasized enough. Public data sources are data rich but information poor! The project made us think about the practical aspects of data collection and how a major component of a Big Data project revolves around data preprocessing. Purifying the data of noise, and seeing the effect on accuracy, made us strive hard to improve the quality of our datasets. The quality of the results depends greatly on the quality of the data sources and the data cleaning. These are also the most time-consuming and challenging phases of building any analytic application. Data cleaning is all the more important in the case of unstructured and semi-structured data, which normally requires a whole new set of procedures to clean it and convert it into a structured format.
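For instance, here is a minimal sketch of the kind of cleaning raw tweets need before analysis; the exact rules we used in the project were more involved, and the sample tweet is made up.

```python
# Sketch of basic tweet cleaning: lowercase, strip URLs, mentions, hashtags and
# punctuation, then collapse whitespace. Real pipelines typically do more.
import re

def clean_tweet(text: str) -> str:
    text = text.lower()
    text = re.sub(r"http\S+", " ", text)        # strip URLs
    text = re.sub(r"[@#]\w+", " ", text)        # strip mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)       # strip punctuation, digits, emoji
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace

print(clean_tweet("Loving #pandas for data cleaning! @data_sci http://t.co/xyz"))
# -> "loving for data cleaning"
```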
Tools and Technology Selection
Big data is like the wild west: there are lots of opportunities and plenty of hidden value, but just as with digging for gold, you have to know where to look. With the right analytics platform, you can find the gold faster, and the platform does the digging for you. The technology stack for big data projects is different from that of traditional applications. Unlike traditional software applications, where the focus is more on logical design and structure, the focus in big data projects is on the data. Due to the sheer size of the data, a lot of consideration has to go into the data storage and data processing techniques. The volume of data is too large for comprehensive manual analysis, and the range of potential correlations and relationships between disparate data sources is too great for any analyst to test all hypotheses and derive all the value buried in the data. For our project, MongoDB served us well for storing unstructured tweets, while MySQL served our purpose for the more structured Stack Overflow data. For data processing, while it is possible to use traditional technologies like Java and .Net, doing so requires a lot more effort, custom code development and time. On the other hand, languages like R and Python, aided by their powerful data processing packages, make great candidates for this work.
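A rough sketch of that storage split, assuming pymongo and mysql.connector; the database, collection and table names (tweets_db, raw_tweets, so_db, posts) and credentials are invented for illustration:

```python
# Sketch: raw tweets live in MongoDB as flexible documents, while structured
# Stack Overflow records live in MySQL. Names and credentials are placeholders.
from pymongo import MongoClient
import mysql.connector

# Unstructured tweets: schemaless documents, queried by attribute.
mongo = MongoClient("mongodb://localhost:27017")
tweets = mongo["tweets_db"]["raw_tweets"]
recent = list(tweets.find({"lang": "en"}).limit(100))

# Structured Stack Overflow data: fixed columns, easy to join and aggregate.
conn = mysql.connector.connect(host="localhost", user="analyst",
                               password="secret", database="so_db")
cur = conn.cursor()
cur.execute("SELECT tag, COUNT(*) FROM posts GROUP BY tag ORDER BY COUNT(*) DESC")
top_tags = cur.fetchall()
```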
Hypothesize. Analyze. Evolve. Repeat.
Projects evolve with time. Unlike traditional software development, where the end product and result are much easier to visualize, data science projects are by nature exploratory for the most part. In some cases, methodologies and approaches that appeared feasible at the start run into roadblocks; on the other hand, mid-way through the project one may identify a new data source or methodology that improves the project drastically. We started with a dream of leveraging big data to predict trends such as salaries and job postings using social media and network science, and we ended up with the more practical and focused objective of predicting user interaction for data science languages. Data science project ideas thus evolve and mature with time.
The field of Big Data is still in its early stages of theory, but it is fast evolving to be highly beneficial across industry domains. It goes beyond simple applications of data mining, statistical model fitting, event monitoring or learning through expert systems. Big Data demands a sound understanding of the complexity of the underlying phenomenon and innovative methods for utilizing huge amounts of heterogeneous information. The hands-on experience gained through the homework assignments and project tasks validated our belief that most of the data will not be used directly in the modeling, but rather serves as a means to an end. We also learnt that there is no single way to analyze, as every problem is different and heavily driven by data quality and availability.
We also gained hands-on experience using a variety of tools such as MongoDB, Hadoop, Neo4j, R, Python and Gephi for various portions of our analysis. We learnt that relying on a single tool is not the most efficient way to execute such a project.
To conclude, the following quote by Stuart Ellman perfectly describes our experience of the role of network analysis in this big data project: