WebDoc

Start date: 2014
End date: -
Funder: Internally funded
Primary investigator: Serge Sharoff

Description

The project aims to classify web pages automatically. There are many different kinds of documents on the web, from games to shopping pages to journalism to blogs. Different sorts of page have quite different uses and characteristics. A query for 'Venice' returns pages of various types: recent news, historical information, guidebooks, hotel lists, opinions about hotels and restaurants, etc. For many applications (language teaching, machine translation, information retrieval and extraction) it is also important to be able to select a subcorpus according to specific parameters, such as encyclopedic knowledge vs. instructions, texts written for professionals vs. for the general public, or opinions vs. factual text.

In this project we will work with languages from different language families, so that the method can be shown to be portable to further languages. We will test the approach on web pages in English, Chinese, German and Russian.

Hand in hand with classifying pages, we need to identify the categories we shall classify them into. The web is new, and this is not an area that has been widely researched to date. We shall adopt an iterative approach: classifying samples of web pages to see which pages fit the existing classification scheme, then amending the scheme to allow for those that do not.
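
The iterative refinement described above might be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual procedure: the function name refine_scheme, the judge callback (standing in for a human annotator who names the category a page best belongs to), and the min_support threshold are all hypothetical.

```python
from collections import Counter


def refine_scheme(samples, scheme, judge, min_support=2):
    """Extend a set of category labels one sample of pages at a time.

    samples     -- iterable of samples, each a list of pages (hypothetical input)
    scheme      -- initial set of category labels
    judge       -- callable returning the label an annotator would give a page
                   (an assumption standing in for manual annotation)
    min_support -- how many pages in one sample must share an unseen label
                   before it is added to the scheme (an assumed threshold)
    """
    scheme = set(scheme)
    history = []  # labels added after each sample, for inspection
    for sample in samples:
        unfit = Counter()
        for page in sample:
            label = judge(page)
            if label not in scheme:
                # The page does not fit the existing scheme.
                unfit[label] += 1
        # Amend the scheme only for labels with enough supporting pages.
        added = {label for label, n in unfit.items() if n >= min_support}
        scheme |= added
        history.append(sorted(added))
    return scheme, history


if __name__ == "__main__":
    # Toy data: each "page" is represented directly by its label.
    toy_samples = [
        ["news", "news", "blog", "blog", "shop"],
        ["shop", "shop", "news"],
    ]
    final_scheme, added_per_round = refine_scheme(
        toy_samples, {"news"}, judge=lambda page: page
    )
    print(sorted(final_scheme), added_per_round)
```

The threshold is one possible way to avoid amending the scheme for a single unusual page; in practice the decision to add or merge categories would presumably rest with the annotators.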