Tutorial
Configuration
The LCS categorization service is currently configured as follows:
- Training set of 550K English EP patent applications between 1985 and 2006
- Document representation comprises the English title, abstract and inventor(s)
- Patent pre-processing applies decapitalisation (lowercasing) and removes some special characters such as '{' and '}'
- A variant of the common tf-idf term weight heuristic
- Global and local term selection
- Balanced Winnow algorithm
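The pre-processing and term-weighting steps listed above can be illustrated with a minimal Python sketch. It assumes the textbook tf-idf formula (term frequency times log(N / document frequency)), not the unspecified LCS variant, and it removes only the two special characters named above. The function names `preprocess` and `tf_idf` are hypothetical and are not part of LCS.

```python
import math
from collections import Counter


def preprocess(text):
    """Lowercase the text and drop curly braces (only the characters named above)."""
    return text.lower().replace("{", "").replace("}", "")


def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df).

    This is the textbook formula, used here as an assumption; the LCS
    variant is not specified in this tutorial.
    """
    tokenized = [preprocess(doc).split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({term: tf[term] * math.log(n_docs / df[term]) for term in tf})
    return weights


docs = ["Optical {fiber} amplifier", "Fiber reinforced polymer"]
weights = tf_idf(docs)
```

In this toy example "fiber" occurs in both documents and therefore receives a weight of zero, while terms unique to one document receive a positive weight.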
The LCS prototype performs pre-categorization in the International Patent Classification (IPC) at the hierarchy levels section, class and subclass. Given a plain-text document or snippet, the categorization service returns a set of IPC subclass predictions, each with a confidence level (the score).
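Because a subclass symbol contains its section and class, the two higher hierarchy levels can be read off a predicted subclass. The following Python sketch assumes the standard IPC symbol layout (one section letter, a two-digit class number, one subclass letter); the helper name `ipc_levels` is hypothetical and not part of LCS.

```python
def ipc_levels(subclass):
    """Split an IPC subclass symbol such as 'G06F' into its three hierarchy levels."""
    symbol = subclass.strip().upper()
    well_formed = (
        len(symbol) == 4
        and symbol[0].isalpha()
        and symbol[1:3].isdigit()
        and symbol[3].isalpha()
    )
    if not well_formed:
        raise ValueError(f"not an IPC subclass symbol: {subclass!r}")
    # The section is the first letter, the class adds the two-digit number.
    return {"section": symbol[0], "class": symbol[:3], "subclass": symbol}


print(ipc_levels("G06F"))
# prints {'section': 'G', 'class': 'G06', 'subclass': 'G06F'}
```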
Input
Any plain English text, be it an entire document or a text snippet, can serve as input for the categorization service. At the moment, LCS supports only ASCII encoding. Characters outside the ASCII range are ignored during categorization.
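To preview which parts of a text would survive the ASCII restriction, the input can be filtered locally before it is submitted. This is a minimal Python sketch that assumes "ignored" means the characters are simply dropped; the helper name `to_ascii` is hypothetical and not part of LCS.

```python
def to_ascii(text):
    """Drop every character outside the ASCII range."""
    return text.encode("ascii", errors="ignore").decode("ascii")


print(to_ascii("naïve café"))
# prints nave caf
```

As the example shows, dropping a character changes the affected word, so it may be worth replacing accented characters with ASCII equivalents before submitting a text.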
Output
The LCS categorization service returns a set of predicted IPC subclasses, where each prediction is coupled with a score. The score measures how reliable the assignment of the subclass to the given document is. Its meaning derives from the underlying categorization algorithm, Balanced Winnow. Balanced Winnow estimates a linear separator for every class, which divides the term space into a positive and a negative subspace. If a document is located in the positive subspace, it is assigned to the respective class.
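In the usual textbook formulation of Balanced Winnow, each class keeps a positive and a negative weight per term, and the score of a document is the sum of the weight differences multiplied by the document's term weights. The Python sketch below illustrates this under that assumption; the weights, the document vector and the function name `winnow_score` are made up for illustration, and the actual LCS implementation may differ.

```python
def winnow_score(pos_weights, neg_weights, features):
    """Sum (w_plus - w_minus) * x over the terms that occur in the document."""
    return sum(
        (pos_weights.get(term, 0.0) - neg_weights.get(term, 0.0)) * value
        for term, value in features.items()
    )


# Hypothetical weights of one class and a hypothetical weighted document.
pos = {"fiber": 2.0, "optical": 1.5, "polymer": 0.5}
neg = {"fiber": 0.5, "optical": 0.5, "polymer": 1.5}
doc = {"fiber": 0.5, "optical": 0.5}

score = winnow_score(pos, neg, doc)
# 1.0 is the acceptance threshold stated in the score definition of this section.
in_positive_subspace = score >= 1.0
print(score, in_positive_subspace)
# prints 1.25 True
```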
Based on Balanced Winnow, the score is interpreted as follows:
- The categorizer dismisses a prediction if the score is less than 1.0
- The categorizer accepts a prediction if the score is equal to or greater than 1.0
- The higher the score, the more reliable the class assignment. The score reflects the distance between the unseen document and the linear class separator in the feature space.
Note that the interpretation of a score can differ if categorization algorithms other than Balanced Winnow are used.
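On the client side, the acceptance rule can be applied to a returned prediction set as in the following Python sketch. The response shape (a list of subclass and score pairs) and the scores are assumptions made for illustration, since the exact output format of the service is not specified here; `accepted_predictions` is a hypothetical helper.

```python
ACCEPT_THRESHOLD = 1.0  # Balanced Winnow acceptance threshold described above


def accepted_predictions(predictions, threshold=ACCEPT_THRESHOLD):
    """Keep predictions with score >= threshold, most reliable first."""
    kept = [(label, score) for label, score in predictions if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Illustrative scores only; they are not real service output.
predictions = [("G06F", 1.8), ("H04L", 0.7), ("G06N", 1.0), ("A61K", 0.2)]
print(accepted_predictions(predictions))
# prints [('G06F', 1.8), ('G06N', 1.0)]
```

The threshold is kept as a parameter because a different categorization algorithm may call for a different cut-off.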