Ting Guo, Jia Wu, Xingquan Zhu, Chengqi Zhang
ACM Transactions on Knowledge Discovery from Data
Graphs are popularly used to represent objects with shared dependency relationships. To date, all existing graph clustering algorithms consider each node as a single attribute or a set of independent attributes, without realizing that content inside each node may also have complex structures. In this article, we formulate a new networked graph clustering task where a network contains a set of inter-connected (or networked) super-nodes, each of which is a single-attribute graph. The new super-node representation is applicable to many real-world applications, such as a citation network where each node denotes a paper whose content can be described as a graph, and citation relationships between papers form a networked graph (i.e., a super-graph). Networked graph clustering aims to find similar node groups, each of which contains nodes with similar content and structure information. The main challenge is to properly calculate the similarity between super-nodes for clustering. To solve the problem, we propose to characterize node similarity by integrating structure and content information of each super-node. To measure node content similarity, we use cosine distance by considering overlapped attributes between two super-nodes. To measure structure similarity, we propose an Attributed Random Walk Kernel (ARWK) to calculate the similarity between super-nodes. Detailed node content analysis is also included to build relationships between super-nodes with shared internal structure information, so the structure similarity can be calculated in a precise way. By integrating the structure similarity and content similarity as one matrix, the spectral clustering is used to achieve networked graph clustering. Our method enjoys sound theoretical properties, including bounded similarities and better structure similarity assessment than traditional graph clustering methods. Experiments on real-world applications demonstrate that our method significantly outperforms baseline approaches.