Thursday, November 12, 2015

Heterogeneous network analysis - What after visualization

In the previous blog, we discussed about visualization techniques for meaningful interpretation of Heterogeneous networks after understanding what they are how they are different from simple social networks.

Heterogeneous information network formulation has been an active topic of interest since 2009 with the explosion of graph analysis. Networked thinking and the advent of graph databases [1] has resulted in rapid methodological advancement in network analytics. We will discuss few of the recent techniques proposed and discuss them with respect to techniques in homogeneous or single mode networks.

Network metrics:


Network metrics such as centrality (e.g., degree, closeness, eigen vector), clustering coefficient and distance measures (e.g., diameter/path length) are widely used in single mode networks. With regards to heterogeneous networks, as we discussed earlier, the common method is to convert them to single mode network as per requirements and then separately analyze each resultant network. All these metrics are not directly applicable to the native heterogeneous network, as all node types and connection types differ. For instance, consider a heterogeneous network of friends:soccer-teams:study-groups:locality network. Here friends, soccer teams, study groups and area of residence/localities may be difference node types. Each relationship may be directed/undirected based on the pair of nodes connected. The network is suggestive of a typical college student's off-school lifestyle where he/she interacts with other entities such as extra-curricular groups, study-groups, neighbors. It may be interesting to understand if there is any pattern formation in the network such as students from a particular locality always forming diverse study groups or students from a soccer-team coming from different localities but having many common friends in general. How to capture such real-life phenomena using network metrics. It is still a very open-ended question with lots of methods and metrics proposed to arrive at convincing answers. Certainly, the world that we live is connected world with intricate heterogeneity as well as repetitive patterns. As one could put it, it is like finding diamonds in a coal mine. Only here, the diamonds could be of different colors and shapes.  

Currently, graph database querying is a method to identify patterns, but graph structure ontology can certainly make things easier than randomized querying in search of localized insights in Big Data networks.

We will begin with a simple metric proposed in the working paper [2]

1) Diversity degree:


The intuition behind this metric is to capture how connected a node is with other nodes having different types. The diversity is measured as sum of normalized value of sum of diversity in nodes and edges for a given node. Simply put, for a node that is connected to nodes of j different node types through edges of k different types, where the total graph has m total node types and n edge types, the diversity degree for this particular node is j/m + k/n.

The utility of this diversity is demonstrated by computing the metric values for two heterogeneous datasets and stating that it identifies an anomaly in the graph with respect to one of the nodes of a specific node type, which wouldn't have been identified otherwise in a flattened graph.

Though the idea of this metric seems interesting, the utility, external validity and optimal semantic representation still seems questionable. What kind of diversity does this metric address? Why is it of importance in actual use-cases. More robust demonstrations of the approach towards arriving at the metric as well as different scenarios at which the metric may be useful is to be expected, if the metric is expected to be adopted in the field.


2) Meta-paths:


This idea of a meta-path is suggested by Jiawei Han and his team (famous for his Data mining book as well as graph coding technique that revolutionized frequent subgraph mining).  

The idea is to define a network schema over the basic heterogeneous networks such that the implicit aggregation/abstraction done while converting heterogeneous networks to any homogeneous network is formalized as a metric. For instance, in an co-author publication network where several authors collaborate for papers and it may be heterogeneous network of node types: venue, author, paper and publication company; meta-paths are defined for each pair of nodes of same type. For instance, one meta path is author-paper-author and other may be author-paper-venue-paper-author. The authors suggest that the edge weights for the meta-paths could be simple path counts, random-walk based measures or using Path-sim [4]. Round trip-meta paths are possible where a node is connected to itself (a self-loop) as the path traverses different node types. Hence, a good measure for finding a realistic meta path between two nodes of same type known as path sim is proposed. The path sim is the path instance (presence of a multi-hop path) between two nodes of the same type normalized by each individual self-loop path instance.

Using the meta-path measure, relationship/link mining is done to find out the top-n co-authors for a given author.

Other applications such as clustering and classification are also discussed using the meta-path metric formalization thus demonstrating its robust utility.

We thus see that there is a lot of potential for the industry as well as academia to study heterogeneous networks and use it for deriving useful insights. It is also very interesting to see the rapid growth rate of research contributions in this area and how it is eventually transforming the way we look at the networked world.


References


[1] db-engines.com - http://db-engines.com/en/ranking/graph+dbms 

[2] Powers, S., & Sukumar, S. R. Defining Normal Metrics for mining Heterogeneous graphs at large scales

[3] Sun, Y., & Han, J. (2013). Meta-path-based search and mining in heterogeneous information networks. Tsinghua Science and Technology, 18(4).

[4] Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, PathSim: Meta path-based top-k similarity search in heterogeneous information networks, in Proc. 2011 Int. Conf. Very Large Data Bases (VLDB’11), Seattle, WA, USA, Aug. 2011.