Locating Product Information from the Web using Simhash Fingerprints
Abstract
We consider the problem of creating efficient search schemes specialized for product information, an important issue given the explosive growth of commercial websites and Internet-based services. We share the observation made in PEWeb [24] that products are almost always displayed as a series of similar-looking information blocks showing features and prices for customers to choose from; consequently, the webpage DOM tree has similar subtrees in the parts corresponding to the product display areas.
We propose to use a special hash function, namely Simhash [18], to identify the product regions. Our basic idea is that subtrees of the webpage DOM tree with similar structures have similar Simhash fingerprints, differing in only a few bits. To eliminate possible misidentifications from the first, Simhash-based phase, we combine it with a decision tree approach, which gives us more flexibility, especially with product websites developed by Vietnamese companies, which prefer certain display formats that are not common worldwide. Compared to PEWeb, our scheme can be more refined and flexible, since it offers more options for adjustment. The resulting improvement in precision is strongly supported by experimental results.
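To illustrate the fingerprinting idea, the following is a minimal Python sketch, not the implementation evaluated in this paper. It assumes that a DOM subtree is given as a nested `(tag, children)` tuple, that its features are root-to-node tag paths, that each feature is hashed with MD5 truncated to 64 bits, and that every feature is weighted by its count; all of these choices, as well as the helper names `tag_paths`, `simhash`, and `hamming`, are hypothetical.

```python
import hashlib
from collections import Counter

BITS = 64  # assumed fingerprint width


def token_hash(token: str, bits: int = BITS) -> int:
    """Hash one feature to a `bits`-bit integer (MD5 is an assumed choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def tag_paths(node, prefix=""):
    """Yield the root-to-node tag path of every node in a (tag, children) tree."""
    tag, children = node
    path = f"{prefix}/{tag}"
    yield path
    for child in children:
        yield from tag_paths(child, path)


def simhash(features, bits: int = BITS) -> int:
    """Standard Simhash: sum signed feature weights per bit, keep the sign."""
    weights = [0] * bits
    for token, count in Counter(features).items():
        h = token_hash(token, bits)
        for i in range(bits):
            weights[i] += count if (h >> i) & 1 else -count
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Two hypothetical product cards with the same structure, and one unrelated block.
card_a = ("div", [("img", []), ("h3", []), ("span", [])])
card_b = ("div", [("img", []), ("h3", []), ("span", [])])
footer = ("ul", [("li", [("a", [])]), ("li", [("a", [])])])

fp_a = simhash(tag_paths(card_a))
fp_b = simhash(tag_paths(card_b))
fp_footer = simhash(tag_paths(footer))

print(hamming(fp_a, fp_b))       # identical structures give distance 0
print(hamming(fp_a, fp_footer))  # distance to an unrelated structure
```

In a sketch like this, sibling subtrees whose pairwise Hamming distance falls below a small threshold would be treated as candidate product regions; the threshold value is a tuning parameter and is not fixed here.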