Images, snippets, snapshots, math

View Gabriele Lami's profile on LinkedIn

Friday, 26 March 2010

Similarity Matrix in Text Mining

(In this post I am also testing LaTeX math formulas.)
What is a similarity matrix (in text mining), and why is it important?


We start from a corpus composed of k documents:
$$ \left\{ D_i \right\}_{i=1}^k $$
(A corpus is simply a collection of documents.)


One way to find semantic structures in the corpus is to study the occurrence of every word
and the co-occurrence of every pair of words in the corpus.
A similarity matrix is a good tool for this kind of analysis.


To define a similarity matrix we must first define the similarity between two objects (here, words):
$$ s(w_i, w_j) $$
The similarity matrix is then simply the matrix whose entry at position (i, j) holds the
similarity between the objects with indices i and j.

A good similarity matrix can be built from this definition:
$$ s(w_i,w_j) = \dfrac{c(w_i,w_j)}{f(w_i) \cdot f(w_j)} $$
where
$$ c(w_i,w_j) $$
is the co-occurrence of the two words (the number of documents containing both words) and
$$ f(w_i) $$
is the occurrence of the word.
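As a minimal sketch of this definition, the snippet below builds the matrix for a small made-up corpus. The corpus, the function name `similarity_matrix`, and two conventions are assumptions of mine and not part of the definition above: f(w) is taken to be the number of documents containing w, and the diagonal is set to 0 because a word's similarity with itself is not of interest here.

```python
from itertools import combinations

# Hypothetical toy corpus: each document is reduced to its set of distinct words.
docs = [
    {"cat", "dog", "pet"},
    {"cat", "pet"},
    {"dog", "bone"},
    {"stock", "market"},
]

def similarity_matrix(docs):
    """Return (vocab, S) with S[i][j] = c(w_i, w_j) / (f(w_i) * f(w_j))."""
    vocab = sorted(set().union(*docs))
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    f = [0] * n                      # f(w): documents containing w (assumed meaning)
    c = [[0] * n for _ in range(n)]  # c(w_i, w_j): documents containing both words
    for doc in docs:
        for w in doc:
            f[index[w]] += 1
        for a, b in combinations(sorted(doc), 2):
            i, j = index[a], index[b]
            c[i][j] += 1
            c[j][i] += 1
    # Diagonal left at 0.0 by convention (an assumption, see the text above).
    S = [[c[i][j] / (f[i] * f[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    return vocab, S

vocab, S = similarity_matrix(docs)
```

Here "cat" and "pet" share two documents and each occurs in two, so their similarity is 2 / (2 · 2) = 0.5, while "cat" and "stock" never co-occur and get 0.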



The resulting matrix is symmetric and can be visualized as an undirected weighted graph.
The nodes represent the words, and the weight of the edge between two nodes is the
similarity between the corresponding words.
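A possible way to turn a symmetric similarity matrix into such a graph is an edge list. This is only a sketch: the function name `to_edge_list`, the `min_weight` threshold, and the three-word example matrix are hypothetical.

```python
def to_edge_list(vocab, S, min_weight=0.0):
    """List (word_a, word_b, weight) for every pair whose weight exceeds min_weight."""
    edges = []
    for i in range(len(vocab)):
        # Upper triangle only: the graph is undirected, so each pair appears once.
        for j in range(i + 1, len(vocab)):
            if S[i][j] > min_weight:
                edges.append((vocab[i], vocab[j], S[i][j]))
    return edges

# Hypothetical 3-word similarity matrix.
vocab = ["cat", "dog", "pet"]
S = [[0.0, 0.25, 0.5],
     [0.25, 0.0, 0.25],
     [0.5, 0.25, 0.0]]

edges = to_edge_list(vocab, S, min_weight=0.3)
```

Raising `min_weight` is one simple way to thin out the graph before drawing it.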

In practice this visualization is nearly useless: the graph easily has more than 10,000 nodes.


One way to handle this information is to place k points (one per word) in an n-dimensional
space so that the distance between each pair of points reflects the weight between
the corresponding pair of words:
$$ w_i \mapsto p_i \quad \text{such that} \quad s(w_i,w_j) = \dfrac{1}{\| p_i - p_j \|} \quad \forall\, i \neq j, \; i,j \leq k $$
(the higher the weight, the smaller the distance)
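The mapping asks for target distances equal to 1 / s. The sketch below computes them and lists the triples that break the triangle inequality, which any true distance must satisfy. The function names, the choice of an infinite distance for zero similarity, and the example matrix are assumptions made for illustration.

```python
import math

def similarity_to_distance(S):
    """Target distances d_ij = 1 / s_ij; zero similarity maps to an infinite distance."""
    n = len(S)
    return [[0.0 if i == j else (1.0 / S[i][j] if S[i][j] > 0 else math.inf)
             for j in range(n)]
            for i in range(n)]

def triangle_violations(D, tol=1e-12):
    """Return every (i, j, l) with d_ij > d_il + d_lj, which no metric allows."""
    n = len(D)
    return [(i, j, l)
            for i in range(n)
            for j in range(i + 1, n)
            for l in range(n)
            if l != i and l != j and D[i][j] > D[i][l] + D[l][j] + tol]

# Hypothetical similarities: words 0 and 2 are both close to word 1 but far from each other.
S = [[0.0, 0.5, 0.1],
     [0.5, 0.0, 0.5],
     [0.1, 0.5, 0.0]]

D = similarity_to_distance(S)
bad = triangle_violations(D)
```

In this example the target distance between words 0 and 2 is 10, yet going through word 1 costs only 2 + 2 = 4, so no arrangement of points can realize all three distances exactly.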

One problem with this approach is that it is not always feasible (the matrix may not be
compatible with the constraints of a metric, such as the triangle inequality), and the required
dimension of the target space grows with k: the k points may need up to k - 1 dimensions to satisfy all
$$O(k^2)$$
pairwise constraints exactly. So we need a technique that reduces the dimension while preserving the significant
information (reducing the dimension always brings a certain loss of information).
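The post does not name a specific reduction technique. As one possible illustration, the sketch below uses classical multidimensional scaling (MDS), a standard method I am swapping in here, to place points in a low-dimensional space from a matrix of pairwise distances. It assumes NumPy is available and that all distances are finite; the function name `classical_mds` is hypothetical.

```python
import numpy as np

def classical_mds(D, dim=2):
    """Embed points in `dim` dimensions from a matrix of pairwise distances D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:dim]    # indices of the `dim` largest
    # Negative eigenvalues mean the input is not Euclidean; they are clipped to 0.
    lam = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(lam)

# Distances of a 3-4-5 right triangle, which fits exactly in the plane.
D = np.array([[0.0, 3.0, 4.0],
              [3.0, 0.0, 5.0],
              [4.0, 5.0, 0.0]])

P = classical_mds(D, dim=2)
recovered = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
```

When the distances really come from points in a plane, as here, the two-dimensional embedding reproduces them; with real similarity data some distortion is to be expected, which is the loss of information mentioned above.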


Representing the words in an n-dimensional space makes it possible to analyze clusters of points.
A cluster can be defined as a subset of points whose mutual distances are much smaller
than the average distance over the complete set.
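This definition can be checked directly on a set of points. In the sketch below, "much smaller" is made concrete with a `ratio` threshold of 0.25; that value, the function names, and the sample points are arbitrary assumptions, not something the definition prescribes.

```python
import math
from itertools import combinations

def mean_pairwise_distance(points):
    """Average Euclidean distance over all pairs of points."""
    pairs = list(combinations(points, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def is_cluster(points, subset_idx, ratio=0.25):
    """True if every mutual distance inside the subset is below ratio * overall mean."""
    subset = [points[i] for i in subset_idx]
    largest_inside = max(math.dist(p, q) for p, q in combinations(subset, 2))
    return largest_inside < ratio * mean_pairwise_distance(points)

# Hypothetical 2-D points: three packed near the origin, two far away.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (6.0, 5.0)]

tight = is_cluster(points, [0, 1, 2])   # the three points near the origin
loose = is_cluster(points, [0, 3])      # two points from opposite corners
```

The check needs at least two points in the subset, and `math.dist` requires Python 3.8 or later.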

A cluster reflects some kind of statistical structure in the corpus.

The structures that give rise to a cluster can be either:
  1. language-related rules (e.g. syntactic structures), or
  2. semantic meanings (e.g. topics)