Autonomous Topic-Based Website Categorization
Master's thesis
The Internet has influenced many aspects of our social, economic, educational, and professional lives. Because of the unique means of communication it offers, the Internet has grown dramatically since its advent. At the same time, the ever-growing volume of data on the Internet has created a demand for structure to be imposed on this data. The ranking and indexing of web pages by search engines, the creation of hierarchical taxonomies of web resources, and research on autonomous web page and website classification are examples of attempts to construct such structure. This project is a study of autonomous website classification. This process has been researched for various purposes and at different levels, especially to improve search engines and directory services. However, the idea for this project comes from a different active area of the Internet, namely online advertising. One of the most common forms of online advertising is the banner ad. Banner ads are essentially published at random; however, ad servers try to use algorithms to improve their effectiveness by publishing them intelligently. One way to do this is to correlate the topic of an ad with the topic of the website on which it is placed. The current project is an attempt at classifying websites based on their main topic. This work contains a brief survey of the web page and website categorization methods developed to date, as well as the implementation of a classification algorithm and an analysis of its effectiveness. The implementation consists of creating a graph model of websites and leveraging their link structure to prune noisy web pages. In addition, a brief description of text classification methods and their relation to the purpose of this project is presented. In this study, the textual content and the hyperlink information contained in a website are used to construct a vector space model, which is then used for classification with a support vector machine (SVM) learning model.
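The pipeline outlined in the abstract (a graph model of a website, link-based pruning of noisy pages, and a vector space representation) can be illustrated with a minimal sketch. The abstract does not specify the pruning criterion or the term weighting, so the choices below are assumptions made purely for illustration: pages are kept if they lie within a fixed number of link hops of the home page, and the site is represented by a length-normalised term-frequency vector. The function names, the toy site, and the vocabulary are all hypothetical. The SVM training step is not shown; the resulting vector is the kind of input an SVM implementation would take.

```python
from collections import Counter, deque
import math
import re


def prune_by_depth(links, home, max_depth):
    """Keep pages reachable from `home` within `max_depth` link hops (BFS).

    Assumption: link distance from the home page stands in for the
    thesis's pruning rule, which the abstract does not describe.
    """
    depth = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        if depth[page] == max_depth:
            continue  # do not expand pages already at the depth limit
        for target in links.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return set(depth)


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def site_vector(pages, kept, vocabulary):
    """Length-normalised term-frequency vector over the kept pages."""
    counts = Counter()
    for url in kept:
        counts.update(tokenize(pages.get(url, "")))
    vec = [float(counts[term]) for term in vocabulary]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


# Hypothetical toy website: adjacency list of links and page texts.
links = {
    "/": ["/news", "/about"],
    "/news": ["/news/match"],
    "/news/match": ["/legal"],
}
pages = {
    "/": "football news and scores",
    "/news": "football transfer news",
    "/about": "about this football site",
    "/news/match": "match report football",
    "/legal": "cookie policy privacy terms",
}
vocabulary = ["football", "news", "privacy"]

kept = prune_by_depth(links, "/", max_depth=2)
vector = site_vector(pages, kept, vocabulary)
```

In this toy example the off-topic `/legal` page is three hops from the home page and is therefore dropped before the vector is built, so its terms do not dilute the site's topical signal.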
Interaction Technologies