Topic detection with recursive consensus clustering and semantic enrichment.
Extracting meaningful information from short texts like tweets has proved to be a challenging task. Literature on topic detection focuses mostly on methods that try to guess the plausible words that describe topics whose number has been decided in advance. Topics change according to the initial setup of the algorithms and show a consistent instability with words moving from one topic to another one. In this paper we propose an iterative procedure for topic detection that searches for the most stable solutions in terms of words describing a topic. We use an iterative procedure based on clustering on the consensus matrix, and traditional topic detection, to find both a stable set of words and an optimal number of topics. We observe however that in several cases the procedure does not converge to a unique value but oscillates. We further enhance the methodology using semantic enrichment via Word Embedding with the aim of reducing noise and improving topic separation. We foresee the application of this set of techniques in an automatic topic discovery in noisy channels such as Twitter or social media.
Download Files