Recently, there has been considerable interest in studying language from a network, or graph-based, perspective. Networks are a natural representation for many linguistic structures, and almost all levels of language have been examined using graph-based methods. Network representations have been used for tasks such as document summarization, word sense disambiguation, and information retrieval.
Using graph-based methods, we look at latent semantic structure in lexical networks. Lexical networks are generated from collections of documents: each node is a document, and each edge corresponds to the similarity between two documents, measured with the standard cosine similarity. A collection of networks can be generated by varying the cosine threshold, and this collection of networks is called a latent network or semantic similarity network.
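The construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes raw term counts over whitespace-separated tokens as the document vectors, and it assumes an edge is kept when the cosine is greater than or equal to the threshold. The function names `cosine`, `similarity_network`, and `threshold_sweep` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_network(docs, threshold):
    """Return the edge set {(i, j)} of documents whose cosine >= threshold.

    Assumption: documents are represented by raw term counts; the source
    does not say which term weighting is used.
    """
    vectors = [Counter(doc.lower().split()) for doc in docs]
    edges = set()
    for i, j in combinations(range(len(vectors)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            edges.add((i, j))
    return edges


def threshold_sweep(docs, thresholds):
    """Build one network per threshold, giving a family of networks."""
    return {t: similarity_network(docs, t) for t in thresholds}


if __name__ == "__main__":
    docs = ["the cat sat", "the cat ran", "dogs bark loudly"]
    for t, edges in threshold_sweep(docs, [0.0, 0.5, 0.7]).items():
        print(t, sorted(edges))
```

Under these assumptions, a threshold of 0 yields a complete graph, and raising the threshold removes edges in order of increasing similarity, so the sweep exposes how the network thins out as the threshold grows.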
We look at cosine distributions and network structure across different collections of documents. In particular, the network structure of semantically cohesive collections is compared with that of semantically diverse collections. We also examine the cosine distribution predicted by a Zipfian language model for documents of varying lengths and vocabulary sizes.
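One way to get a feel for how document length and vocabulary size affect the cosine distribution is a small Monte Carlo simulation. The sketch below is an assumption-laden illustration rather than the model used in this work: it assumes words are drawn independently from a Zipfian distribution with exponent 1, and the helper names (`zipf_weights`, `sample_document`, `simulated_cosines`) are hypothetical.

```python
import math
import random
from collections import Counter


def zipf_weights(vocab_size, exponent=1.0):
    """Unnormalized Zipfian weights: the word of rank k gets 1 / k**exponent."""
    return [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]


def sample_document(rng, weights, length):
    """Draw `length` word ids independently and return their counts."""
    words = rng.choices(range(len(weights)), weights=weights, k=length)
    return Counter(words)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def simulated_cosines(vocab_size, length, pairs=100, seed=0):
    """Cosines between `pairs` independent pairs of random Zipfian documents."""
    rng = random.Random(seed)
    weights = zipf_weights(vocab_size)
    return [
        cosine(
            sample_document(rng, weights, length),
            sample_document(rng, weights, length),
        )
        for _ in range(pairs)
    ]


if __name__ == "__main__":
    for length in (20, 200, 2000):
        values = simulated_cosines(vocab_size=500, length=length)
        print(length, round(sum(values) / len(values), 3))
```

Under this simple model, longer documents approximate the shared underlying word distribution more closely, so their pairwise cosines tend to be higher even though the documents are semantically unrelated.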
For comparison, we also examine the growth of several non-lexical networks.
Small lexical network with a cosine threshold of 0%
Small lexical network with a cosine threshold of 20%
Small lexical network with a cosine threshold of 30%
Small lexical network with a cosine threshold of 39%
Degree distribution as a function of cosine threshold for a larger cosine network.