Saturday, December 26, 2015

Exploratory Data Analysis

What is Exploratory Data Analysis? This typically means scanning the data to discover any obvious patterns or to identify the main characteristics of a dataset with the help of visualizations. It is one of the critical steps in the Data Science Process, as insights from this analysis are vital for scoping the modeling and prediction efforts.

So, let's take a quick tour of a few exploratory data analyses that can be done for large datasets. Here, two major social networks, Twitter and Stack Overflow, are considered as datasets.

Datasets and Basic Statistics:

  • Stack Overflow (community of active users working on data science projects)
    • 1 year of data about data science discussions posted by ~41K users
    • ~90K posts with Time, Location and Tags associated with these posts
  • Twitter (community of ordinary users interested in data science concepts)
    • 1 month of tweets streamed for ~43K data science users
    • ~1.5 Million tweets with Time, Location, Hashtags and Re-tweet information
  • So, basic statistics for data exploration could be (a minimal Python sketch for computing these follows the list):
    • Total rows (posts or tweets)
    • Total unique users
    • Total hashtags
    • Time period or duration for data
    • Histograms, Rate (Avg, Min, Max) of data generation (posts/week, tweets/day)
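
As a rough illustration, here is a minimal pandas sketch for computing these basic statistics on a tweets file; the file name and column names (user, text, hashtags, created_at) are assumptions for illustration, not the actual schema used in this project.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load tweets; the file and column names below are assumed for illustration.
tweets = pd.read_csv("tweets.csv", parse_dates=["created_at"])

print("Total tweets:       ", len(tweets))
print("Total unique users: ", tweets["user"].nunique())
print("Total hashtags:     ", tweets["hashtags"].str.split().explode().nunique())
print("Time period:        ", tweets["created_at"].min(), "to", tweets["created_at"].max())

# Rate of data generation: tweets per day (average, min, max) plus a histogram.
per_day = tweets.set_index("created_at").resample("D").size()
print("Tweets/day avg: %.1f  min: %d  max: %d" % (per_day.mean(), per_day.min(), per_day.max()))
per_day.plot(kind="hist", bins=20, title="Distribution of tweets per day")
plt.show()
```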

Temporal Analysis

This is used to quickly find any temporal trends or patterns in the dataset, along with identifying outliers occurring due to special events (like festivals, the Super Bowl, etc.). For example, the time series chart below shows the frequency of keywords commonly used in tweets related to data science. It clearly indicates that a lot of people were tweeting about "Big Data" during the selected time period. There is an interesting pattern here: Twitter activity decreases on the 4th-5th and 12th-13th of September 2015. Thus, one can hypothesize that "data science users are less active on Twitter on weekends as compared to weekdays." This can be related to the fact that most users are exposed to data science concepts and technology at work (i.e. during business days).
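As a rough illustration of how such a keyword time series could be produced (not the exact chart from this post), here is a minimal pandas/matplotlib sketch; the file name, column names and keyword list are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

tweets = pd.read_csv("tweets.csv", parse_dates=["created_at"])
keywords = ["big data", "machine learning", "data mining"]  # assumed keywords

# Count, for each day, how many tweets mention each keyword (case-insensitive match).
daily = {}
for kw in keywords:
    mask = tweets["text"].str.contains(kw, case=False, na=False)
    daily[kw] = tweets[mask].set_index("created_at").resample("D").size()

pd.DataFrame(daily).fillna(0).plot(title="Daily tweet frequency per keyword")
plt.ylabel("tweets per day")
plt.show()
```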



Spatial Analysis

This is used to find any spatial patterns in the dataset, for example, identifying key areas with a dense population of data points. The analysis below is done for the Stack Overflow dataset, and it shows the distribution of posts related to some data science languages (R, Python and Java) across the globe. Active Stack Overflow users for the "R" language (displayed in blue) are concentrated in the eastern USA, Central Europe and the Indian subcontinent.
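A minimal sketch of such a spatial view, assuming the posts have already been geocoded into hypothetical latitude/longitude columns and carry a language tag column (this is not the actual pipeline behind the map above):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed CSV with hypothetical columns: latitude, longitude, tag.
posts = pd.read_csv("stackoverflow_posts.csv")
colors = {"r": "blue", "python": "green", "java": "red"}

for tag, color in colors.items():
    subset = posts[posts["tag"] == tag]
    plt.scatter(subset["longitude"], subset["latitude"],
                s=5, alpha=0.4, c=color, label=tag)

plt.xlabel("longitude")
plt.ylabel("latitude")
plt.legend()
plt.title("Geographic distribution of posts per language tag")
plt.show()
```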


Word Cloud Analysis

This is used to visually represent hashtags from textual data, to depict keyword metadata (a.k.a. tags) on websites, or to visualize free-form text. Word clouds can be created using visualization tools like Tableau, R packages like wordcloud, or online word cloud generators like Wordle. The word cloud below is created for both the Twitter and Stack Overflow datasets. This leads to an interesting observation: users of Stack Overflow post about specific data science technology issues, software or packages (like nodejs, apachespark, mongodb etc.), whereas Twitter users tweet about general data science concepts (like big data, machine learning, data mining etc.).
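For instance, a word cloud similar to the ones described above could be generated with the Python wordcloud package; the input file of hashtags is an assumption for illustration.

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# 'hashtags.txt' is an assumed file with one hashtag per line.
with open("hashtags.txt") as f:
    text = " ".join(line.strip().lstrip("#") for line in f)

# Build and display the word cloud; word size reflects frequency in the text.
cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```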


Distribution Analysis

This is used to identify the distribution of data points across specific areas. For example, let's consider 3 data science languages: Python, Java and R. The chart below shows the usage distribution of these languages across specific Twitter hashtags like #BigData, #IOT (Internet of Things), #Cloud, #DataScience and Jobs. We can clearly see that tweets for Big Data, IOT and Cloud are dominated by the Java language, whereas Python and R are more popular in the Data Science tweets. Also, the job market on Twitter seems to be skewed towards Java and Python.
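A minimal sketch of how such a usage distribution could be computed with pandas, using crude substring matching; the column names, language patterns and hashtag list are assumptions, not the exact method behind the chart above.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed DataFrame with a 'text' column containing the raw tweet text.
tweets = pd.read_csv("tweets.csv")
languages = ["python", "java", " r "]          # crude, assumed match patterns
hashtags = ["#bigdata", "#iot", "#cloud", "#datascience", "#jobs"]

counts = pd.DataFrame(0, index=hashtags, columns=languages)
text = tweets["text"].str.lower().fillna("")
for h in hashtags:
    for lang in languages:
        # Count tweets containing both the hashtag and the language mention.
        counts.loc[h, lang] = ((text.str.contains(h, regex=False)) &
                               (text.str.contains(lang, regex=False))).sum()

counts.plot(kind="bar", stacked=True, title="Language mentions per hashtag")
plt.ylabel("number of tweets")
plt.show()
```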


Demographic Analysis

This is typically used for customer segmentation, which develops an understanding of the age, sex and racial composition of a population. The pie chart below describes the age groups of Stack Overflow users derived from their data science posts. One can see that the age distribution is skewed towards the 20-30 age group, which accounts for more than 68% of Stack Overflow users.
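A minimal matplotlib sketch of such a pie chart; the age-group counts below are made up for illustration, not the actual numbers behind the chart.

```python
import matplotlib.pyplot as plt

# Made-up counts of users per age group.
age_groups = {"20-30": 680, "30-40": 220, "40-50": 70, "50+": 30}

plt.pie(list(age_groups.values()), labels=list(age_groups.keys()),
        autopct="%1.1f%%", startangle=90)
plt.title("Age groups of Stack Overflow users")
plt.axis("equal")   # draw the pie as a circle
plt.show()
```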


So, these were some of the basic analyses one can do to get a good understanding of the given datasets. Observations from these analyses can be used to form or change hypotheses, which can in turn alter the prediction and modeling efforts.

Complex social networks like Twitter and Stack Overflow require advanced visualization techniques for getting deeper insights from Social Network Analysis. These networks can be visualized using the tools below (a minimal igraph sketch follows the list):
  • Templates for MS Excel like NodeXL
  • Software like Gephi or GraphViz
  • JavaScript libraries like D3.js, or R/Python packages like igraph
  • Sophisticated network databases like Neo4j
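
As a small taste of the scripting route, here is a minimal sketch using the python-igraph package on a made-up user-interaction edge list (the edge list, attributes and output file are assumptions for illustration):

```python
import igraph as ig

# Made-up undirected user-interaction edges.
edges = [("alice", "bob"), ("bob", "carol"), ("carol", "alice"),
         ("dave", "alice"), ("eve", "dave")]

g = ig.Graph.TupleList(edges, directed=False)
g.vs["label"] = g.vs["name"]
g.vs["size"] = [10 + 5 * d for d in g.degree()]   # size nodes by degree

print("Nodes:", g.vcount(), "Edges:", g.ecount())
print("Degrees:", dict(zip(g.vs["name"], g.degree())))

ig.plot(g, "network.png")   # writing the plot requires the cairo backend
```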
Stay tuned for the next post, which will discuss the visualizations created for social networks like Stack Overflow and Twitter.

Saturday, December 5, 2015

Big Learnings from Big Data.!!


As D.J. Patil, the Chief Data Scientist at the U.S. Office of Science and Technology Policy, put it, "Data Science is a team sport, which requires a lot of collaboration and team-work." As the Eye2Data team started working on the big data project, we realized that data collection, data cleansing and prediction modeling are some of the key components of a typical data science project, and the teams responsible for them have to work in tandem with a lot of collaboration.

It all started with “Network Thinking”



This fascinating course started from a great refactoring of the mind: network thinking. We were encouraged to go beyond the 5 V's (Volume, Veracity, Velocity, Variety and Value). The methodology of Big Data analysis can be enumerated into the following aspects:
  • Datafication 
  • Analyzing the population in its entirety instead of sampling data
  • Interconnecting data sources to get richer and more realistic insights
  • Capitalizing on the spatial-temporal characteristics of a Big Data phenomenon, and 
  • Tapping correlation instead of chasing causality 

The idea of thinking in network terms presented by Dr. Barabasi tremendously expanded our view of the world. Just as he said, "we always lived in a connected world, except we were not so much aware of it." We are never independent of the environment and the people around us, nor is anything else in the world. The interactions and connections between entities play an important role in shaping what we observe. As data science practitioners, our goal is to find these relationships and associations, and to describe, measure and quantify the underlying processes.

Think about a simple online-marketing application where you wish to target visitors based on their tendency to purchase products on your web page. Traditionally, older recommender systems look at user features and their click and navigational behavior, and recommend accordingly based on their current and past purchase history. This could be done using predictive modeling techniques: data mining (classification/regression modeling), natural language processing (keyword search behavior) and/or generative modeling approaches (collaborative filtering and the Markov family of models). So far, we have been looking at the user as a singular entity whose online behavior is determined by his/her current and previous actions. The learning is based on modeling user behavior and customizing recommendations for variation in user characteristics across the population.


An exciting new way of looking at recommender systems over the last few years, which is not only simple but also quite effective, is to look at users, products, websites and companies as entities. Users 'visit' websites to look up or buy products. Products are placed in different positions on a web page which has several links to other pages. Companies would like to place their products in positions that can maximize their profits through increased sales. Here, we see the interconnectedness by simply looking at the what, why, who, how and when of a generalized e-commerce example. Such a rich multi-mode network can be simplified into a variety of single-mode networks to answer different business questions. For instance, a user-user network could be created by linking users who purchased 'k' or more common products during a time interval, hence creating communities of similar users (see the sketch below). Such implicit networks capture the inherent connectedness of everything in life. One can very well see that collaborative filtering and other generative processes are just other ways of thinking along similar lines, only more mathematically involved and prone to errors.
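A minimal sketch of this kind of implicit user-user projection, using the networkx bipartite module on a made-up purchase list and an assumed threshold k:

```python
import networkx as nx
from networkx.algorithms import bipartite

# Implicit user-user network: link users who bought at least k common products.
# The purchase list and threshold k are made up for illustration.
purchases = [("u1", "p1"), ("u1", "p2"), ("u2", "p1"), ("u2", "p2"),
             ("u3", "p2"), ("u3", "p3"), ("u4", "p4")]
k = 2

B = nx.Graph()
users = {u for u, _ in purchases}
B.add_nodes_from(users, bipartite=0)
B.add_nodes_from({p for _, p in purchases}, bipartite=1)
B.add_edges_from(purchases)

# Project onto users: edge weight = number of common purchased products.
projected = bipartite.weighted_projected_graph(B, users)

user_network = nx.Graph()
user_network.add_weighted_edges_from(
    (u, v, d["weight"]) for u, v, d in projected.edges(data=True) if d["weight"] >= k
)
print(list(user_network.edges(data=True)))   # e.g. [('u1', 'u2', {'weight': 2})]
```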

Our project was inspired by the idea of network thinking. To begin with, we were interested in describing the current status of data science communities on Stack Overflow. The most basic idea was to summarize the behavior of people in this field. However, we incorporated the first level of network thinking by taking the connections between users into consideration. A network graph was built to capture the information exchanged between users, and the network metrics derived from this graph helped us generate many more insights. The second level of network thinking was to connect user groups from Twitter to users from Stack Overflow. Twitter is a more real-time platform compared to Stack Overflow, and we expected that the performance of tweets could provide a good indicator to help us predict the performance of posts on Stack Overflow.

Social Media and Big Data


One of the key learnings came right at the start of the project. As we were deliberating over different project ideas, we had two distinct routes by which we could do the project: either identify a good-quality dataset and brainstorm a business question around it, or identify a business problem and try to find datasets which can help address it. While for academic projects identifying the data first is a wee bit easier and safer, the second approach actually provides a much wider scope for imagination, which will be very handy in the long run, especially as we look to work with innovative companies.


Social media is definitely one of the largest data sources for big data. It's also the low-hanging fruit for network science. While we probably have social media to thank for opening the doors of network science analysis, the scope and the domains where network science can be used are immensely vast. It's a step away from traditional thinking, in which inter-relations between the actors were difficult to infer. While creating network visualizations helped us gain insights into the relationships within the data, the different network metrics helped us gain valuable insights related to the influence and importance of key actors within the network, which would otherwise be difficult to infer, especially in large datasets.

The 80/20 Data Principle



As Andreas Weigend says, Big Data is like crude oil, in the sense that big data itself is practically useless and needs to be refined before its true value can be extracted. The importance of data collection and cleaning cannot be emphasized enough. Public data sources are data rich but information poor! The project made us think about the practical aspects of data collection and how a major component of a Big Data project is focused on data preprocessing. The process of purifying data of noise, and its implications for improved accuracy, made us strive hard to improve the quality of our datasets. The quality of the results greatly depends on the quality of the data sources and the data cleaning. These are also the most time-consuming and challenging phases of building any analytic application. Data cleaning is all the more important in the case of unstructured and semi-structured data; unstructured data normally requires a whole new set of cleaning procedures to convert it into a structured format.

Tools and Technology Selection


Big data is like the Wild West: there are lots of opportunities and plenty of hidden value, but just like digging for gold, you should know where to find it. With the right analytics platform, you can find the gold faster, and the platform does the digging for you. The technology stack for big data projects is different from that of traditional applications. Unlike traditional software applications where the focus is more on the logical design and structure, the focus in big data projects is on the data. Due to the sheer size of the data, a lot of consideration has to go into the data storage and data processing techniques. The volume of data is too large for comprehensive analysis, and the range of potential correlations and relationships between disparate data sources is too great for any analyst to test all hypotheses and derive all the value buried in the data. For our project, MongoDB served us well for storing unstructured tweets, while MySQL served our purpose for the more structured Stack Overflow data. For data processing, while it's possible to use traditional technologies like Java and .Net, they require a lot more effort, custom code development and time. On the other hand, languages like R and Python, aided by their powerful packages for data processing, make great candidates for this work.

Hypothesize. Analyze. Evolve. Repeat.


Projects evolve with time. Unlike traditional software development, where the end product and result are much easier to visualize, data science projects are, by their nature, exploratory for the most part. In some cases, methodologies and approaches which appeared feasible at the start may run into roadblocks; in others, mid-way through the project one may identify a new data source or methodology which can improve the project drastically. We started with a dream of leveraging big data to predict trends like salaries and job postings using social media and network science, and ended up with a more practical and focused objective of predicting user interaction for data science languages. Data science project ideas thus evolve and mature with time.

The field of Big Data is still in its early stages of theory, but is fast evolving to be highly beneficial across industry domains. It goes beyond simple applications of data mining, statistical model fitting, event monitoring or learning through expert systems. Big Data demands a sound understanding of the complexity of the underlying phenomenon and innovative methods to utilize huge amounts of heterogeneous information. The hands-on experience through the homework assignments and project tasks validated our belief that most of the data will not be put to use in the modeling, but rather only serve as a means. We also learnt that there is no single way to analyze, as every problem is different and heavily driven by data quality and availability.

We also gained hands-on experience using a variety of tools such as MongoDB, Hadoop, Neo4j, R, Python and Gephi for various portions of our analysis. We learnt that relying on a single tool is not the most efficient way to execute such a project.

To conclude, the following quote by Stuart Ellman perfectly describes our experience of the role of network analysis in this big data project:


Thursday, November 12, 2015

Heterogeneous network analysis - what comes after visualization

In the previous blog, we discussed visualization techniques for meaningful interpretation of heterogeneous networks, after understanding what they are and how they differ from simple social networks.

Heterogeneous information network formulation has been an active topic of interest since 2009, with the explosion of graph analysis. Networked thinking and the advent of graph databases [1] have resulted in rapid methodological advancement in network analytics. We will discuss a few of the recently proposed techniques and compare them with techniques used in homogeneous or single-mode networks.

Network metrics:


Network metrics such as centrality (e.g., degree, closeness, eigenvector), the clustering coefficient and distance measures (e.g., diameter/path length) are widely used in single-mode networks. With regard to heterogeneous networks, as we discussed earlier, the common method is to convert them to single-mode networks as per requirements and then analyze each resultant network separately; these metrics are not directly applicable to the native heterogeneous network, as the node types and connection types differ. For instance, consider a heterogeneous friends:soccer-teams:study-groups:locality network. Here friends, soccer teams, study groups and areas of residence/localities are different node types, and each relationship may be directed or undirected based on the pair of nodes connected. The network is suggestive of a typical college student's off-school lifestyle, where he/she interacts with other entities such as extra-curricular groups, study groups and neighbors. It may be interesting to understand whether there is any pattern formation in the network, such as students from a particular locality always forming diverse study groups, or students from a soccer team coming from different localities but having many common friends in general. How do we capture such real-life phenomena using network metrics? This is still a very open-ended question, with many methods and metrics proposed to arrive at convincing answers. Certainly, the world we live in is a connected world with intricate heterogeneity as well as repetitive patterns. As one could put it, it is like finding diamonds in a coal mine; only here, the diamonds could be of different colors and shapes.
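For reference, the standard single-mode metrics mentioned above are readily available in libraries such as networkx; the toy friendship graph below is made up, and this of course only covers the flattened, homogeneous case:

```python
import networkx as nx

# Toy single-mode friendship network (edges are made up).
G = nx.Graph([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "e")])

print("Degree centrality:     ", nx.degree_centrality(G))
print("Closeness centrality:  ", nx.closeness_centrality(G))
print("Eigenvector centrality:", nx.eigenvector_centrality(G))
print("Clustering coefficient:", nx.clustering(G))
print("Diameter:              ", nx.diameter(G))
print("Avg. path length:      ", nx.average_shortest_path_length(G))
```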

Currently, graph database querying is a method to identify patterns, but graph structure ontology can certainly make things easier than randomized querying in search of localized insights in Big Data networks.

We will begin with a simple metric proposed in the working paper [2].

1) Diversity degree:


The intuition behind this metric is to capture how connected a node is to nodes of other types. The diversity is measured as the sum of the normalized node-type and edge-type diversity for a given node. Simply put, for a node that is connected to nodes of j different node types through edges of k different types, in a graph with m node types and n edge types in total, the diversity degree of that node is j/m + k/n.
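A minimal sketch of this diversity degree, as described above, on a made-up heterogeneous graph with node-type and edge-type attributes (the attribute names ntype/etype and the example graph are assumptions):

```python
import networkx as nx

# Made-up heterogeneous graph: student / soccer team / study group / locality.
G = nx.Graph()
G.add_node("alice", ntype="student")
G.add_node("tigers", ntype="soccer_team")
G.add_node("calc101", ntype="study_group")
G.add_node("downtown", ntype="locality")
G.add_edge("alice", "tigers", etype="plays_for")
G.add_edge("alice", "calc101", etype="member_of")
G.add_edge("alice", "downtown", etype="lives_in")
G.add_edge("tigers", "downtown", etype="based_in")

def diversity_degree(G, node):
    m = len({d["ntype"] for _, d in G.nodes(data=True)})     # node types in graph
    n = len({d["etype"] for _, _, d in G.edges(data=True)})  # edge types in graph
    j = len({G.nodes[nbr]["ntype"] for nbr in G.neighbors(node)})     # neighbor types
    k = len({G.edges[node, nbr]["etype"] for nbr in G.neighbors(node)})  # incident edge types
    return j / m + k / n

print(diversity_degree(G, "alice"))   # 3/4 + 3/4 = 1.5
```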

The utility of this diversity metric is demonstrated by computing its values for two heterogeneous datasets and showing that it identifies an anomaly in the graph with respect to a node of a specific node type, which wouldn't have been identified otherwise in a flattened graph.

Though the idea of this metric seems interesting, its utility, external validity and optimal semantic representation still seem questionable. What kind of diversity does this metric address? Why is it of importance in actual use-cases? More robust demonstrations of the approach towards arriving at the metric, as well as of the different scenarios in which the metric may be useful, are needed if it is to be adopted in the field.


2) Meta-paths:


The idea of a meta-path was suggested by Jiawei Han and his team (famous for his data mining textbook as well as the graph encoding technique that revolutionized frequent subgraph mining).

The idea is to define a network schema over the basic heterogeneous network such that the implicit aggregation/abstraction done while converting a heterogeneous network to a homogeneous one is formalized as a metric. For instance, in a co-author publication network where several authors collaborate on papers, the heterogeneous network may have node types venue, author, paper and publication company; meta-paths are defined for each pair of nodes of the same type. One meta-path is author-paper-author, and another may be author-paper-venue-paper-author. The authors suggest that edge weights for the meta-paths could be simple path counts, random-walk based measures, or PathSim [4]. Round-trip meta-paths are possible, where a node is connected to itself (a self-loop) as the path traverses different node types. Hence, a good measure for finding a realistic meta-path based similarity between two nodes of the same type, known as PathSim, is proposed: the number of path instances (multi-hop paths along the meta-path) between the two nodes, normalized by the number of self-loop path instances of each individual node.
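A minimal sketch of PathSim for the author-paper-author meta-path, using a made-up author-to-papers mapping; this follows the normalized-path-count definition above rather than the authors' full implementation:

```python
# Made-up author -> set of papers mapping.
author_papers = {
    "ana":  {"p1", "p2", "p3"},
    "bo":   {"p2", "p3"},
    "chen": {"p3", "p4", "p5", "p6"},
}

def pathsim_apa(x, y):
    # Path instances x-paper-y are co-authored papers;
    # x-paper-x self-loop instances are simply x's own papers.
    common = len(author_papers[x] & author_papers[y])
    return 2.0 * common / (len(author_papers[x]) + len(author_papers[y]))

# Rank co-authors of "ana" by PathSim (top-n link mining as described below).
scores = {a: pathsim_apa("ana", a) for a in author_papers if a != "ana"}
for author, s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(author, round(s, 3))
```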

Using the meta-path measure, relationship/link mining is done to find out the top-n co-authors for a given author.

Other applications such as clustering and classification are also discussed using the meta-path formalization, thus demonstrating its robust utility.

We thus see that there is a lot of potential for industry as well as academia to study heterogeneous networks and use them to derive useful insights. It is also very interesting to see the rapid growth of research contributions in this area and how it is eventually transforming the way we look at the networked world.


References


[1] db-engines.com - http://db-engines.com/en/ranking/graph+dbms 

[2] Powers, S., & Sukumar, S. R. Defining Normal Metrics for mining Heterogeneous graphs at large scales

[3] Sun, Y., & Han, J. (2013). Meta-path-based search and mining in heterogeneous information networks. Tsinghua Science and Technology, 18(4).

[4] Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, PathSim: Meta path-based top-k similarity search in heterogeneous information networks, in Proc. 2011 Int. Conf. Very Large Data Bases (VLDB’11), Seattle, WA, USA, Aug. 2011.

Friday, October 9, 2015

An overview of multi-modal networks

Network analysis or social network analysis (SNA) is a data science methodology that captures the social structure of a system through the use of network and graph theory. It characterizes networked structures in terms of nodes and the ties or edges (relationships or interactions) that connect them. Examples of social structures commonly visualized through social network analysis include social media networks, friendship and acquaintance networks, kinship, disease transmission, organizational structure, international trade and sexual relationships. In principle, networks are either observable as a natural phenomenon or created by identification and operationalization of implicit relationships. These networks normally have the same type of nodes with similar attribute profile and data characteristics.

However, in many real-world settings, networks consist of not one but several different types, or modes, of nodes. Multi-modal (social) networks are those whose nodes belong to different types or entity classes. They are heterogeneous networks that are not constrained to homogeneous network formations. Examples include co-authorship networks that contain not just authors, but also the venues they attend and the journals they publish in; organizational charts that contain employees as well as the departments they belong to; information retrieval processes that involve both databases and the people who access them; and medical databases where patients may interact with physicians and may be associated with standard sets of disease codes and treatment procedure codes. Bipartite networks are a special case of multi-modal networks with two types of nodes, and they are the most popular after the unimodal networks that are the de facto standard in (social) network analysis.

Uni-modal or homogeneous network realizations have been done in the case of online social networks (interpersonal), biological networks (gene-gene interactions) and economic networks (international strategic relationships). Community detection, network characterization using metrics, and integration with other data science methodologies are approaches commonly adopted for simple unimodal networks, but their formalization for multi-modal networks is still in its early stages of development.

A standard approach to analyzing multimodal networks has been to transform them into unimodal social networks, either through projection or through separation. For example, in a study analyzing user subscriptions to online brand pages on Facebook, the ties between brand pages and their subscribers are transformed (projected) into ties among brands, with edge weights given by the number of common users active in a pair of brand pages.
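A minimal sketch of this projection step on made-up brand-subscriber data: the brand-brand edge weight is simply the number of common subscribers.

```python
from itertools import combinations

# Made-up brand page -> set of active subscribers.
brand_subscribers = {
    "brandA": {"u1", "u2", "u3"},
    "brandB": {"u2", "u3", "u4"},
    "brandC": {"u5"},
}

# Project the bipartite structure onto brands: weight = common subscribers.
brand_edges = {}
for b1, b2 in combinations(brand_subscribers, 2):
    weight = len(brand_subscribers[b1] & brand_subscribers[b2])
    if weight > 0:
        brand_edges[(b1, b2)] = weight

print(brand_edges)   # e.g. {('brandA', 'brandB'): 2}
```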

I) Multimodal network visualization:


Ghani et al. (2013) have summarized the various strategies for multi-modal network visualization as follows:
1) Compound network visualization: treat the multimodal graph as a unimodal graph and color the node types differentially (see the sketch after this list).
2) Eliminating modes through projection: several graph visualization packages offer features to collapse nodes of one node type into edges between nodes of another node type, particularly in the case of bipartite networks.
3) Linked network visualization: use multiple views, each of which renders a different mode of the graph separately. Between-mode ties are visualized using visual links or brushing (when nodes are selected in one view, corresponding nodes in another view are highlighted).
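
A minimal sketch of the first strategy (compound network visualization) using networkx: the multimodal graph is drawn as a single graph, with node colors encoding node types. The graph and color palette are made up for illustration.

```python
import networkx as nx
import matplotlib.pyplot as plt

# Made-up two-mode graph: researchers connected to the projects they work on.
G = nx.Graph()
G.add_nodes_from(["ana", "bo"], ntype="researcher")
G.add_nodes_from(["projX", "projY"], ntype="project")
G.add_edges_from([("ana", "projX"), ("bo", "projX"), ("bo", "projY")])

# Color nodes differentially by their type.
palette = {"researcher": "steelblue", "project": "orange"}
colors = [palette[G.nodes[n]["ntype"]] for n in G.nodes]

nx.draw(G, with_labels=True, node_color=colors, node_size=900)
plt.show()
```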

One of the approaches in the review is an extension of parallel node-link bands (PNLB) for bipartite networks, which was customized and implemented in a tool named MMGraph.

Figure 1: Parallel node-link bands visualization of a researchers-projects dataset done by Ghani et al. (2013) 

The authors also defined metrics such as multimodal degree centrality, betweenness and closeness centrality for nodes in one mode with reference to their interactions with nodes of other modes.

II) Multi-modal network analysis:


Bauer et al. (2013) analyze diseases, drugs, medical devices and procedures, and demonstrate the advantages of network-based approaches over traditional approaches such as propensity score matching based on expert inputs.

They constructed a homogeneous network for group formation and then represented it as a multimodal network, as shown in Figure 2.

Figure 2: Drugs, disease, device, procedure are different node types in the multimodal network developed by Bauer et al. (2013)

By effective use of node width and length, edge thickness and node-type coloring, the researchers were able to identify the node entities clustering around the cilostazol treatment concept quite distinctly from the control group of patients.

Multimodal networks have tremendous potential for theoretical developments as well as practical applications. The world of networks is not always a simplistic representation of similar entities. Theoretical development focused on uni-modal, undirected, unweighted and attribute-less networks will have to transform into a complete, comprehensive representation of the multi-faceted eco-system that we are currently experiencing. Visualization alone can aid decision-making, as indicated by the two use-cases above, but problem-solvers should not restrict themselves to it.

Instead, we should look at multimodal network analysis as a Pandora's box that has only been opened in the last few years, with tremendous potential to change the way we perceive social networks.

References:


1) Ghani, S., Kwon, B. C., Lee, S., Yi, J. S., & Elmqvist, N. (2013). Visual analytics for multimodal social network analysis: A design study with social scientists. Visualization and Computer Graphics, IEEE Transactions on, 19(12), 2032-2041.

2) Borgatti, S. P., & Everett, M. G. (1997). Network analysis of 2-mode data. Social networks, 19(3), 243-269.

3) Bauer-Mehren, A., LePendu, P., Iyer, S. V., Harpaz, R., Leeper, N. J., & Shah, N. H. (2013). Network analysis of unstructured EHR data for clinical research. AMIA Summits on Translational Science Proceedings, 2013, 14.



Saturday, September 12, 2015

Applications of Big Data in different domains




The term "big data" has been around for decades. A Quora posting provides an example of its usage dating back to 1987. In the 1990s, technologists referred to the big data as the growth in data volume, pointing to a relatively new data source known as the Internet, and discussed its impact on storage systems. Thanks to Moore's Law, computational power and storage became cheaper and more accessible, enabling the Big Data sources to keep rather than discard data.

In the 2000s, the emphasis was on meaningful integration of data from different sources. By this time, many processes across functions such as supply chain, market research and strategic planning were getting directly or indirectly connected to tangible, quantifiable information. For instance, strategic plans were starting to be backed up by historical evidence through data rather than qualitative judgements based on the prior experience of a few managers. The number of sources of Big Data has multiplied in the last few years and their capability to generate continuous data has increased exponentially.







Below, we present some use-cases and perspectives for Big Data application across different areas.

Environment

The environment, in general, can be classified into (a) the micro-environment, or indoor spaces where humans spend most of their time, be it offices, homes or community centers, and (b) the macro-environment, or the ecological system. Sensors have been used for both types of environments in the past, for applications such as satellites monitoring global weather changes or household thermostats. There are exciting applications of Big Data analytics in this domain, such as the IBM-University of Alberta real-time analysis of the environment (video) or Microsoft China's air quality monitoring in smart buildings (article).

Sensors are able to minutely record human movements, air quality, light and several other factors. We all know that indoor environment conditions are closely related to the human eco-system: our mental state and wellbeing are closely related to the characteristics of the environment we are exposed to. Workers in factory settings often complain of various health problems due to extreme environments, while workspaces with more natural light and better ventilation have been hypothesized to improve employee satisfaction as well as efficiency. Now imagine collecting data about a person's state and relating it to everything that she is exposed to during a typical day in her life. This can be conceptualized as adding the human component to the IoT paradigm, where currently several devices interact with each other. In such a human-environment interactive system, devices not only interact with each other but also accept and deliver signals to humans in real time. This implies the possibility of futuristic applications that assist human well-being by using Big Data. Imagine a thermostat sensing unusual variability in the ambient temperature of a person in addition to controlling room temperature. Such a device could be designed to predict whether a person is going to fall ill, based on inputs from medical records and the temporal pattern of body temperature (assuming near-body temperature variability is proportional to inner body temperature). These data-analytical applications complement applications supported by improving technology, such as alert systems, and such a proposed application capitalizes on only a subset of Big Data applications for environment data. Anomaly detection, unusual pattern detection, prediction of health conditions and natural disaster prediction are top-of-the-mind applications of Big Data analysis for the environment, driven by the rapid progress of environment sensing technologies as well as data science algorithms. Big Data for the environment and assisted human living is a projected future and an integral part of the datafication revolution that we are currently experiencing.

Epidemiology

Epidemiology is a branch of medical science that is closely related to public health. 'Epi' means upon or befall, 'demo' means the people, and 'ology' means the study of; so literally, epidemiology is the study of what falls upon the people. The classic definition of epidemiology is the study of the distribution and determinants of disease frequency in human populations. It precedes the larger spectrum of environment studies on human well-being, and its underlying theories are therefore more mature. Epidemiology is a comparative discipline: by making comparisons among different groups, an epidemiologist is able to identify causal factors of diseases.

Big Data techniques can be used in creative empirical studies in the area of epidemiology. Epidemiological data has high variability, volume and veracity. For example, spatial and temporal information is available for a large range of epidemiological factors. The populations at risk for a health outcome can be broken down into several groups using timestamps and location coordinates in the datasets. As part of developing innovative Big Data solutions, the medical symptoms or concerns among those groups could be collected by data crawlers over the web, and data analysis models comparing the spatio-temporally identified groups could provide evidence for developing new theories. Big Data methodology such as network thinking could be used to integrate different sources of information to explore causes, detect outbreaks, and provide surveillance of epidemic disease. For instance, Google Flu Trends uses the frequency of certain search keywords rather than survey data provided by the CDC, providing detection and surveillance of influenza outbreaks several weeks faster. Another project, Toronto-based BioDiaspora, models the spread of infection in a different way, using global airline data to predict and track the spread of diseases based on the origins, travel routes and destinations of commercial flights. These are a few applications demonstrating how Big Data could remodel the way we have traditionally approached epidemiological research problems with experiments and observational studies.

However, there are some concerns about the use of Big Data among epidemiologists. In Dr. Antoine Flahault's speech, "Big Data in Public Health Research: Will it Kill Epidemiology?", he presented several challenges and threats that epidemiology would face with the introduction of big data in research [2]. The ethical issues around confidentiality and privacy, and data quality issues arising from the fact that the data was not originally collected for epidemiology research, were the most concerning. Nevertheless, Big Data does help in improving the quality and efficiency of the health care system, creating new job demand in computer science, mathematics and public health, and inventing new paradigms in public health research. Using Big Data in epidemiology research and applications is an emerging trend. Let's wait and see what happens!

Government / Economy

In today's world it's very hard to find good sources of data. However, the government sector has been a vast yet overlooked and underutilized source of granular data, and economists have been sophisticated data users for a long time. Big Data can be effectively used to analyze the large administrative datasets collected from various sources like healthcare, finance (tax), insurance and the census. The patterns and findings discovered from this analysis can be used to form or alter economic policies and improve government operations.

John Wennberg and colleagues at Dartmouth analyzed large samples of Medicare claims to discover that Medicare spending per enrollee does not depend on health status or prices and is not correlated with measured health outcomes. This research was pivotal in shaping the Affordable Care Act and has become leading evidence of inefficiency in the US healthcare system. In a similar way, big data and predictive modeling can be used to improve the targeting of government services. Imagine a Medicare system where every individual has a health-care score based on their likely response to a treatment, and a mediclaim policy covering the treatment only if this score exceeds a particular threshold.

These large-scale administrative data sets have the power to allow better measurement of economic effects and outcomes. Chetty, Friedman and Rockoff did an interesting case study in 2011 on the long-term effects of better teachers. They analyzed the records of 2.5 million New York City schoolchildren and their earnings 20 years later. The aim of the study was to check whether a teacher's "value added" had a lifelong impact on the earnings of their students, where "value added" was measured by the amount of improvement in test scores. The study gave a very striking result: replacing a teacher in the bottom 5% with an average teacher raises the lifetime earnings of their students by a quarter of a million dollars ($250,000) in present-value terms. This is just one of the many real-world case studies mentioned in "The Data Revolution and Economic Analysis" by Liran Einav and Jonathan Levin. Imagine the endless possibilities and changes in economic policy if more government data were made available to researchers. Considering these examples, one can imagine a not-so-distant future in which governments are run by analysts exploring open data.

Higher Education Analytics

There is growing interest in Big Data in the field of higher education. On the administrative side, colleges and universities are harnessing the power of Big Data and predictive analytics to improve student performance, increase institutional effectiveness, and launch the online college experience. We have also recently seen a trend of using Big Data throughout numerous student engagement points of the college experience, from recruitment all the way to understanding alumni giving; see Schmarzo's blog post on this topic.
There has also been a recent surge in the use of data science methodologies to launch, expand, and improve online education. Online education presents a change in postsecondary education's core function of teaching by restructuring the access to and delivery of courses. It is a radical innovation that disruptively departs from the existing practices and processes of traditional face-to-face higher education, and to this day there are large debates about whether online education is as effective as traditional on-campus schooling. While this debate is likely to persist for many years to come, the vast amount of user-interaction data that online education offers has given universities and colleges the ability to improve the online learning experience in real time, from measuring how long students take to complete an exam to identifying topics that students are struggling with. Online education is producing a vast amount of data that was previously unavailable to those studying education: data that, if harnessed, has the potential to better deliver curriculums, allow institutions to personalize a student's educational experience, and improve student learning. See Big Data and Online Education for more on the possibility of applying big data techniques to online educational models as a way to refine the learning experience of online students.

Manufacturing

The manufacturing industry was oblivious to data for some time, as it was a different field altogether where more focus was given to process optimization in operations. But now there is a paradigm shift happening in the manufacturing industry too: data is now considered an asset. One such example is the chip manufacturing giant Intel, which uses Big Data in chip validation. This involves a lot of testing of the chip design, wherein hundreds of sensors collect data over time. This huge amount of structured and unstructured data is used by Intel to optimize the design process and the time-to-production/time-to-market.




To delve a little into detail, there are no ideal rules defining when a chip should be launched in the market. If a chip is not tested enough it will have bugs, and if it is tested excessively it will be late to market and the company might lose its edge. By using sensor data on the physical and logical state of a processor, it can be understood how well the testing tools are doing. Big Data analysis can help in the debug process by clustering defects and performing root-cause analysis on the massive amount of historical sensor data. These insights give a better idea of how to improve the design and testing process of a chip.

Aviation

Aviation is one of those few domains which has actually been dealing with big data since before the term 'Big Data' gained traction, thanks to the tons of data collected from sensors fitted on aircraft and the sheer volume of flights and passengers. However, what's changed in the last decade or so is how the data is being used. From improving operational efficiency to attracting more customers, big data is used in a big way by aerospace companies and airlines. For example, Boeing uses the ecoDemonstrator Program to harness big data by testing ways to use data to save fuel and flying time. Southwest Airlines analyzes passenger traffic to determine which services to offer on specific routes to attract more customers.



According to an article based on analysis by the International Air Transport Association (IATA), the leading causes of flight delays are airline-controlled processes like maintenance. For every hour that an aircraft is grounded, the airline stands to lose an average of $10,000. With such large monetary repercussions, it is all the more important for airlines to prevent the need for unscheduled maintenance. The amount of data generated by the sensors on an aircraft is staggering: the super-sized Airbus A380-1000, for example, is fitted with 10,000 sensors on each of its wings, and a Boeing 737 generates 20 terabytes of engine information every hour. By collecting in-flight aircraft information and relaying it to maintenance personnel on the ground, maintenance crews can be ready with the parts and information to quickly make any necessary repairs when the plane arrives at the gate. The data from the sensors also helps identify recurring faults and trends and proactively plan for future maintenance. Pilots, too, can use insights provided by satellites, weather sensors and ground data to make real-time decisions to save fuel and improve safety.

Aviation companies have always wanted to make better real-time decisions based on insights from the information they collect. The recent progress in big data technologies and tools has empowered them to better process that information and make smarter, fact-driven decisions.

References

  1. Einav, L., & Levin, J. The Data Revolution and Economic Analysis. Stanford University and NBER.

Friday, August 28, 2015

About us

Hi,

We are a bunch of nerds who love data science and research. We are diverse in our work experiences, academic background, career interests and approaches, but the love for Big Data binds us.

Let us start out by introducing ourselves before posting on Big Data and other fun stuff...

Karthik Srinivasan  aka 'The earnest learner'




I have found my recent love in data science and take an interest in problems in domains such as healthcare, environment, data privacy and manufacturing. After trying my hand at different things in life, I finally found my true calling: to do research in data science. It just keeps getting more interesting, day by day!

Regarding e-networking, I spend more time tweeting @karthikarizona and on Stack Exchange @Earnest_learner than writing blogs as of now (but the pattern is transient).

Mangesh Jadhav



I am a second-year graduate student with a keen interest in building information systems. Having worked in the Business Intelligence domain for 4 years, I am always looking for ways to get insights out of data. When I am not working on assignments, I am usually seen on the soccer field or badminton court at the UofA. A few days back, I tried jumping off an airplane (a.k.a. skydiving) and now I am really looking forward to taking a deep dive into the world of Big Data!


Ajit Umale


Hello readers! I am a second-year Master's student in the MIS department at the University of Arizona. I have a Computer Science background and have worked on software development and web application development projects. I decided to pursue MIS because it is a great way to bridge the gap between technology and real-world problems. Over the past year, I have gained some experience working on such projects, including data analysis and Business Intelligence projects, and that has finally brought me to the world of Big Data. I am pretty excited about the endless possibilities that Big Data has; it can solve some really interesting and difficult problems, and that's why I am excited about it.

Apart from academics, I like reading: a lot of reading on various topics! I like watching good movies. One thing I am passionate about is poetry, be it reading or writing it. That's it about me! Stay tuned to this blog as we explore the exciting world of Big Data over the next few months!


Yongcheng Zhan


Hello! I am a second-year PhD student in the MIS department. MIS is quite an amazing domain, combining technology and business, and data is exactly the bridge connecting these two worlds. The more I study in this field, the more I like to play with data. My current research focuses on electronic cigarettes, with the help of user-generated content from social media. It is related to data of course, but not BIG enough. So this semester I am eager to try and learn more about Big Data, not only for the analytical methods, but also for deeper insights into data science.

Are you ready for the journey into the data world from now on? I am quite excited, and waiting with eager anticipation.



Joe Koolippurackal



Howdy!

I'm a second-year MS-MIS graduate student at the University of Arizona with keen interests in Data Science. Prior to joining the Masters program, I worked as a software developer for 4 years on Microsoft programming and BI technologies.

Ever since childhood, I have always wondered if there's a way to eliminate uncertainty. Be it when our Maruti 800 broke down without any warning and left us stranded on the road, or when a sudden downpour spoilt our game of cricket, or while patiently waiting for delayed trains without an ETA, I always wondered if there was a way to know it before it happened. I look to Data Science as the power to tackle this childhood nemesis of mine: 'Uncertainty'!

I'm excited and fascinated about the different applications of, and the types of problems Data Science can help us solve. The way I look at it, learning Data Science is a really long road. I want to take that road, and leave a trail along the way.

And by the way, I love my hot mocha. :)

joek@email.arizona.edu | www.linkedin.com/in/joekoolippurackal | www.facebook.com/joe.koolippurackal | @joek_04