Here is proof that Microsoft hates blogs. Read the following carefully:
[0014]The classifier 40 analyzes the features and generates an indication or prediction as to whether the web page that provided the features is a blog page or not. The indication or prediction may be a “yes” or “no”, for example, indicating that the web page is a blog page or not. Alternately, the prediction may be a number or percentage, such as “95%”, indicating the likelihood or probability that the web page is a blog page, for example.
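To make the two output forms concrete, here is a rough sketch of my own, not anything taken from the application. It turns a handful of page features into a probability with a logistic function and then thresholds that into a yes/no answer. The feature names, the weights, and the 0.5 threshold are all invented for illustration; a trained classifier would learn its weights from labeled pages.

```python
import math

# Placeholder weights, invented for illustration only.
# A real classifier would learn these from labeled blog/non-blog pages.
WEIGHTS = {
    "on_blog_host": 3.0,   # 1 if hosted on a known blog domain, else 0
    "term_count": 0.4,     # how many blog-like terms appear in the text
    "tool_links": 1.0,     # how many links point at blog software sites
    "blog_in_url": 1.5,    # 1 if "blog" occurs in the URL, else 0
    "has_feed": 1.0,       # 1 if the page advertises an RSS/ATOM feed
}
BIAS = -3.0


def blog_probability(features: dict) -> float:
    """Return a value in (0, 1): the 'number or percentage' style of output."""
    z = BIAS + sum(w * float(features.get(name, 0)) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def is_blog(features: dict, threshold: float = 0.5) -> bool:
    """Return the 'yes' or 'no' style of output by thresholding the probability."""
    return blog_probability(features) >= threshold
```

A page with no blog-like features scores low, while one on a blog host with a feed scores above the threshold; how far above depends entirely on the made-up weights.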
[0015]Categories of features have been identified that are useful for determining the classification of a web page. In addition to those described herein, it is contemplated that additional features and categories of features may be used in the classification of a web page.
[0016]One category of features is where the page is hosted, e.g., if a page is hosted in a known blog hosting DNS domain, such as MSN Spaces (e.g., spaces.msn.com), Blogspot (e.g., blogspot.com), Yahoo 360, LiveJournal, Typepad, Xanga, MySpace, Multiply, or Wunderblogs, for example. If the web page is hosted on one of these blog hosting sites, for example, it is likely a blog page. The blog hosting sites listed here are examples, and the classifier can base its prediction on these and/or other sites, alone or in combination with other features.
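A hosting check like this is easy to picture. The sketch below is my own guess at how it might look, not the application's method: it pulls the hostname out of a URL and tests it against a small set of blog hosting domains. I have listed only the two domains the paragraph spells out; the function name and the matching rule (exact host or any subdomain) are assumptions.

```python
from urllib.parse import urlsplit

# Only the two domains written out in the paragraph above; a real list
# would be longer. The set itself is an illustrative assumption.
KNOWN_BLOG_HOSTS = {"spaces.msn.com", "blogspot.com"}


def is_on_blog_host(url: str) -> bool:
    """True if the URL's host is a known blog domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in KNOWN_BLOG_HOSTS)
```

Matching on a leading dot keeps a lookalike host such as notblogspot.com from being counted.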
[0017]Another category of features is the non-HTML markup words and phrases contained in the web page. If the web page contains the word “blogroll” or “metaphilter”, for example, it is likely a blog page. Moreover, the number of occurrences of certain terms or words in a web page may indicate that it is a blog page. Terms or words that may be counted include “blog”, “powered by”, “permalink”, “trackback”, “comment”, “comments”, “blogad”, and “posted at”, for example. Desirably, the classifier and its prediction are language independent. Accordingly, the non-English equivalents of these words may also be counted. Desirably, the feature extractor does the counting (e.g., as it parses a web page). The number of occurrences of these words in a web page may be used by the classifier in generating its prediction.
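The counting step might look something like the sketch below. It is my own illustration: it assumes the markup has already been stripped so the input is plain visible text, and it counts whole-word, case-insensitive matches. Those choices, and the function name, are assumptions; the application does not say how the feature extractor matches terms.

```python
import re
from collections import Counter

# The English terms listed in the paragraph above.
BLOG_TERMS = [
    "blog", "powered by", "permalink", "trackback",
    "comment", "comments", "blogad", "posted at",
]


def count_blog_terms(visible_text: str) -> Counter:
    """Count whole-word occurrences of each term in already-extracted page text."""
    text = visible_text.lower()
    counts = Counter()
    for term in BLOG_TERMS:
        # \b on both sides so "comment" does not also match inside "comments".
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[term] = len(re.findall(pattern, text))
    return counts
```

The resulting counts are the kind of numbers the classifier could take as input; non-English equivalents would just be more entries in the term list.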
[0018]The targets of outgoing links in the web page may also be considered as a category of features. Links in a web page that likely indicate a blog page include links to http://www.movabletype.com/, http://wordpress.org/, and http://www.blogger.com/, for example.
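Here is one way the outgoing-link feature could be computed, as a sketch under my own assumptions. It collects the href of every anchor tag with Python's standard HTML parser and counts those whose host matches one of the three sites named above. Comparing on the exact hostname is my choice, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Hosts of the three example link targets in the paragraph above.
BLOG_TOOL_HOSTS = {"www.movabletype.com", "wordpress.org", "www.blogger.com"}


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def count_blog_tool_links(html: str) -> int:
    """Count outgoing links whose host is one of the known blog software sites."""
    collector = LinkCollector()
    collector.feed(html)
    hosts = [(urlsplit(h).hostname or "").lower() for h in collector.hrefs]
    return sum(1 for host in hosts if host in BLOG_TOOL_HOSTS)
```

Relative links have no host, so they simply fall through and are not counted.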
[0019]Furthermore, the particular strings and/or substrings in a URL for a web page may be considered as a category of features. For example, if the string “blog” occurs in the URL for the web page, that web page may likely be considered to be a blog page.
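The URL check is about as simple as a feature gets. A minimal sketch, assuming a case-insensitive substring test and a hypothetical function name of my own:

```python
def url_substring_features(url: str, substrings=("blog",)) -> dict:
    """Map each watched substring to whether it occurs anywhere in the URL."""
    lowered = url.lower()
    return {s: s in lowered for s in substrings}
```

Only "blog" comes from the paragraph above; any other substrings passed in would be the caller's own guesses.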
[0020]Moreover, if the web page contains an ATOM feed or an RSS feed, it is likely a blog page. RSS is a commonly used protocol to share the contents of blogs, and RSS feeds are sources of RSS information about websites. RSS is being supplemented by a newer, more complex protocol called ATOM.
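The application does not say how a feed is detected. One plausible reading, and it is only my assumption, is the common feed autodiscovery convention: a link tag in the page with rel="alternate" and an RSS or ATOM media type. The sketch below looks for exactly that; the class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Media types conventionally used for RSS and ATOM feed autodiscovery.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Sets .found when it sees a <link rel="alternate"> with a feed media type."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = {name: (value or "").strip().lower() for name, value in attrs}
        if "alternate" in a.get("rel", "").split() and a.get("type") in FEED_TYPES:
            self.found = True


def has_feed_link(html: str) -> bool:
    """True if the page advertises an RSS or ATOM feed via a <link> tag."""
    finder = FeedLinkFinder()
    finder.feed(html)
    return finder.found
```

A page that links to its feed only through an ordinary anchor would be missed by this check, so treat it as a starting point.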
AND FINALLY THIS
[0002]Search engines are increasingly implementing features that restrict the results for queries to be from blog pages. The website www.blogcensus.net gives information on an effort to index blogs, though this was apparently discontinued in late 2003. At that time, the site stated that it had indexed 2.8 million blogs. Currently, Technorati claims to be tracking 43.2 million blog sites. It is currently difficult for search engines to identify blog pages, regardless of the source of the content in a blog page.
[0003]A machine learning classifier is trained with features that are used to classify web pages as either blog or non-blog. Categories of features include (1) where the page is hosted, e.g., a page is hosted in a known blog hosting domain, (2) the non-HTML markup words and phrases contained in the web page; (3) the targets of outgoing links in the web page; (4) the particular strings and/or substrings in a uniform resource locator (URL) for a web page; and (5) if the web page contains an ATOM feed or an RSS feed. Some or all of the features in some or all of the categories may be used by the classifier, either in an initial classification, or in a subsequent classification in order to refine the initial classification.
The above is from US Patent Application 20070294252, filed by Microsoft, according to Bill Slawski.


