Author profiling of Vietnamese forum posts based on syllables and rhymes
Author profiling is the task of identifying characteristics of an author based solely on a text document. Previous work has exploited a number of linguistic features, including character-based, word-based, and grammar-based features (often grouped as style-based features) and content-based features (content words). Previous results showed that content-based features often outperformed style-based features. However, using content-based features is considered a domain-specific approach, because the chosen content words often have meanings related to the studied domain. In this work, we investigate the use of syllables and rhymes as features for author profiling of Vietnamese text. They are parts of words but carry much less meaning than words, especially rhymes. Therefore, these features can be considered much less domain-dependent than content words. We experimented on forum post datasets using a machine learning approach. With an improvement of up to 8% over baseline results obtained with style-based features, our method offers a promising new approach to author profiling.
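
To make the idea of syllable and rhyme features concrete, the following is a minimal sketch and not the feature extractor used in this work. It assumes that syllables can be taken as the whitespace-delimited letter runs of Vietnamese orthography, and that a rhyme can be approximated by stripping the longest matching initial consonant from a syllable. The onset list `ONSETS` and the function names `syllables`, `rhyme`, and `features` are hypothetical and introduced only for illustration.

```python
import re
import unicodedata
from collections import Counter

# Assumed list of orthographic initial consonants, longest first so that
# "ngh" is tried before "ng" and "ng" before "n". This is a simplification.
ONSETS = ["ngh", "ng", "nh", "ch", "gh", "gi", "kh", "ph", "th", "tr",
          "b", "c", "d", "đ", "g", "h", "k", "l", "m", "n", "p", "q",
          "r", "s", "t", "v", "x"]


def syllables(text):
    """Lowercase the text, normalize to NFC, and return its letter runs."""
    text = unicodedata.normalize("NFC", text.lower())
    # [^\W\d_]+ matches runs of Unicode letters (no digits or underscores).
    return re.findall(r"[^\W\d_]+", text)


def rhyme(syllable):
    """Approximate the rhyme by stripping the longest matching onset.

    The tone mark stays attached to the vowel, so it remains in the rhyme.
    Syllables with no onset (e.g. vowel-initial ones) are returned unchanged.
    """
    for onset in ONSETS:
        if syllable.startswith(onset) and len(syllable) > len(onset):
            return syllable[len(onset):]
    return syllable


def features(text):
    """Count syllable and rhyme occurrences as a sparse bag of features."""
    counts = Counter()
    for s in syllables(text):
        counts["syl=" + s] += 1
        counts["rhy=" + rhyme(s)] += 1
    return counts


if __name__ == "__main__":
    print(features("Tôi thích học tiếng Việt"))
```

Counts of this kind could be passed to any standard classifier. Because a rhyme drops the initial consonant, several different syllables map to the same rhyme feature, which is one way to see why rhymes carry less domain-specific meaning than content words.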
