StatisticallyImprobablePhrase

MeatballWiki | RecentChanges | Random Page | Indices | Categories

Amazon's Search Inside! feature included the results of an algorithm it called StatisticallyImprobablePhrases. Amazon described them as follows:

Statistically Improbable Phrases, or "SIPs", are the most distinctive phrases in the text of books in the Search Inside! program. To identify SIPs, our computers scan the text of all books in Search Inside. If they find a phrase that occurs a large number of times in a particular book relative to all Search Inside books, that phrase is a SIP in that book.

Amazon likely used a TF-IDF algorithm over n-grams (probably 2-grams and 3-grams). Its advantage is a huge corpus that evens out the document frequencies across all genres, so the algorithm can pull out the phrases that are truly characteristic of an individual text.
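As a rough illustration of that guess, here is a minimal Python sketch of TF-IDF scoring over 2-grams and 3-grams. It is not Amazon's actual method, which was never published in detail. The names `ngrams` and `tfidf_phrases`, the whitespace tokenization, the plain `tf * log(N / df)` weighting, and the tiny three-book corpus are all assumptions made for the example.

```python
import math
from collections import Counter


def ngrams(tokens, sizes=(2, 3)):
    """Return every n-gram of the given sizes as a space-joined string."""
    return [
        " ".join(tokens[i:i + n])
        for n in sizes
        for i in range(len(tokens) - n + 1)
    ]


def tfidf_phrases(corpus, title, top=5):
    """Rank the n-grams of corpus[title] by TF-IDF against the whole corpus.

    corpus maps a book title to its list of tokens (hypothetical layout).
    """
    # Document frequency: in how many books does each phrase appear?
    doc_freq = Counter()
    for tokens in corpus.values():
        doc_freq.update(set(ngrams(tokens)))

    # Term frequency within the chosen book, normalized by its phrase count.
    counts = Counter(ngrams(corpus[title]))
    total = sum(counts.values())
    scores = {
        phrase: (count / total) * math.log(len(corpus) / doc_freq[phrase])
        for phrase, count in counts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top]


# Toy corpus, invented for this example.
corpus = {
    "wiki": "the wiki way is the wiki way of the web".split(),
    "cats": "the cat sat on the mat of the house".split(),
    "dogs": "the dog sat on the log of the house".split(),
}
for phrase, score in tfidf_phrases(corpus, "wiki", top=3):
    print(f"{score:.3f}  {phrase}")
```

In this toy corpus, a phrase that appears in every book (such as "of the") scores zero, while a phrase repeated only in one book rises to the top. With only three books the document frequencies are very coarse, which is why a large corpus matters so much here.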


Discussion
