Machine Learning & Information Retrieval
One of the most fundamental, and most fun, properties of Machine Learning is its close connection to data compression - if we can identify significant concepts (clusters of users, for example), then we can represent a large dataset with fewer bits. However, this logic also works in reverse! If we can represent our data with fewer bits (compress our data), then we have identified ‘significant’ concepts! I bet you see where we’re headed - the Singular Value Decomposition (SVD) allows us to compress a large matrix by approximating it in a lower-dimensional space.
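To make the compression idea concrete, here is a minimal sketch using NumPy. The matrix is made-up data (a rank-5 signal plus a little noise), and the helper name low_rank_approx is my own invention for illustration, not anything standard. It keeps only the top k singular values and counts how many numbers we need to store before and after.

```python
import numpy as np

def low_rank_approx(A, k):
    """Return the three factors of a rank-k truncated SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep only the k largest singular values and their vectors.
    return U[:, :k], s[:k], Vt[:k, :]

# Hypothetical data: a 200 x 100 matrix that is really rank 5 plus noise.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 100))
A += 0.01 * rng.normal(size=A.shape)

k = 5
U_k, s_k, Vt_k = low_rank_approx(A, k)
A_k = (U_k * s_k) @ Vt_k  # rebuild the approximation from the factors

original_numbers = A.size                              # 200 * 100
compressed_numbers = U_k.size + s_k.size + Vt_k.size   # 200*5 + 5 + 5*100
relative_error = np.linalg.norm(A - A_k) / np.linalg.norm(A)

print(original_numbers, compressed_numbers, relative_error)
```

On data like this, the three small factors take far fewer numbers than the full matrix, and what gets thrown away is mostly the noise. Real datasets are rarely this clean, so the right k is something you would have to find by experiment.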
SVD has found wide application in the field of Information Retrieval (IR), where this process is often referred to as Latent Semantic Indexing (LSI). In these applications the columns of the matrix are the documents, and the rows are the individual words. Running SVD and keeping only the largest singular values allows us to collapse this matrix into a lower-dimensional space where highly correlated items (for example, words that often occur together) are captured as a single feature. Essentially, we are discarding the noise and keeping the signal. In practice, the IR guys usually collapse their ginormous matrices to 100, 200, or 300 dimensions (from an original 10,000+) and then perform similarity calculations in that reduced space. In case you’re curious, this same method has also found many uses in image compression and computer vision applications.
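Here is a rough sketch of the LSI idea on a toy example, again with NumPy. The seven words and five documents are invented for illustration, and k = 2 stands in for the 100 to 300 dimensions mentioned above. Rows are words and columns are documents, as described. Documents d0 and d2 share no words at all, but both overlap with d1.

```python
import numpy as np

# Hypothetical toy corpus: rows are words, columns are documents (word counts).
words = ["car", "engine", "wheel", "tire", "apple", "banana", "fruit"]
A = np.array([
    # d0 d1 d2 d3 d4
    [1, 0, 0, 0, 0],  # car
    [1, 1, 0, 0, 0],  # engine
    [0, 1, 1, 0, 0],  # wheel
    [0, 0, 1, 0, 0],  # tire
    [0, 0, 0, 1, 0],  # apple
    [0, 0, 0, 1, 1],  # banana
    [0, 0, 0, 0, 1],  # fruit
], dtype=float)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

k = 2  # number of latent dimensions to keep
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Each row of docs_k is one document expressed in the k-dimensional latent space.
docs_k = (s[:k, None] * Vt[:k, :]).T

raw_sim = cosine(A[:, 0], A[:, 2])            # d0 vs d2 on raw word counts
latent_sim = cosine(docs_k[0], docs_k[2])     # d0 vs d2 in latent space
cross_sim = cosine(docs_k[0], docs_k[3])      # d0 (cars) vs d3 (fruit)

print(raw_sim, latent_sim, cross_sim)
```

On raw word counts d0 and d2 look completely unrelated, since they have no words in common. In the two-dimensional latent space they land on the same "vehicle" feature and come out as near-identical, while staying essentially orthogonal to the fruit documents. A real system would typically weight the counts (tf-idf, for example) before the SVD, which this sketch skips.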
About this entry
You’re currently reading “Machine Learning & Information Retrieval,” an entry on Vakul.NET
- Published: 01.18.07 / 1am
- Category: General