Description
In computational linguistics, statistical approaches to natural language processing now play a more significant role than ever before. One of the foremost challenges in natural language processing is anaphora resolution, especially with regard to machine translation. In languages such as Japanese, anaphoric expressions are often omitted in discourse, giving rise to what are called zero pronouns. Japanese zero pronoun resolution has been discussed extensively in syntactic, discourse, and computationally oriented studies. There are two kinds of computational approaches to zero pronoun resolution: rule-based approaches and statistical approaches. Rule-based approaches seem to lack the ability to capture semantic aspects of anaphoric expressions; on the other hand, incorporating semantic information into statistical approaches is costly and limited. The goal of this thesis is to compare the accuracy of a rule-based model with that of a statistical model in the detection of zero subject pronouns using only a small amount of available data. The approach pursued here is statistical: a Maximum Entropy model is used to resolve zero subject pronouns. The study comprises three procedures: (1) zero subject pronoun detection, (2) potential antecedent detection, and (3) Maximum Entropy classification. The results highlight the existence of a class of remote antecedents that the model cannot reliably detect, suggesting that a much larger amount of data, linguistic knowledge (rules), and semantic information are required to achieve a higher level of accuracy. However, the model presented in this thesis seems to capture the heuristic rules outlined in previous studies without including them as constraints. This suggests that certain rules are not necessary in anaphora resolution.
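To make the third procedure concrete, the following is a minimal sketch of Maximum Entropy classification over antecedent candidates. It is an illustration under stated assumptions, not the thesis's implementation: the predicate names (`topic_marked`, `remote`, and so on), the toy training pairs, and the plain gradient-ascent training loop are all hypothetical, since the abstract does not specify the feature set or the estimation method. The sketch assumes that zero subject detection and candidate detection have already produced, for each candidate, a list of contextual predicates.

```python
import math

LABELS = ("antecedent", "not_antecedent")

def featurize(predicates):
    """Pair each contextual predicate with each label, giving features f_i(x, y)."""
    return {label: [f"{pred}&{label}" for pred in predicates] for label in LABELS}

def maxent_probs(weights, feats_by_label):
    """p(y | x) = exp(sum_i w_i * f_i(x, y)) / Z(x) for binary features."""
    scores = {y: sum(weights.get(f, 0.0) for f in feats)
              for y, feats in feats_by_label.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def train(data, epochs=200, lr=0.5):
    """Fit weights by batch gradient ascent on the conditional log-likelihood."""
    weights = {}
    for _ in range(epochs):
        grad = {}
        for predicates, gold in data:
            feats = featurize(predicates)
            probs = maxent_probs(weights, feats)
            # Gradient = observed feature counts minus expected feature counts.
            for f in feats[gold]:
                grad[f] = grad.get(f, 0.0) + 1.0
            for y, fs in feats.items():
                for f in fs:
                    grad[f] = grad.get(f, 0.0) - probs[y]
        for f, g in grad.items():
            weights[f] = weights.get(f, 0.0) + lr * g / len(data)
    return weights

def best_antecedent(weights, candidates):
    """Return the candidate name with the highest p(antecedent | x)."""
    return max(
        candidates,
        key=lambda name: maxent_probs(weights, featurize(candidates[name]))["antecedent"],
    )

# Hypothetical toy examples, invented for illustration only (not thesis data).
TOY_DATA = [
    (["topic_marked", "same_sentence"], "antecedent"),
    (["topic_marked", "prev_sentence"], "antecedent"),
    (["subject_marked", "prev_sentence"], "antecedent"),
    (["object_marked", "same_sentence"], "not_antecedent"),
    (["object_marked", "remote"], "not_antecedent"),
    (["topic_marked", "remote"], "not_antecedent"),
]

if __name__ == "__main__":
    w = train(TOY_DATA)
    candidates = {
        "NP1": ["topic_marked", "prev_sentence"],
        "NP2": ["object_marked", "same_sentence"],
    }
    print(best_antecedent(w, candidates))
```

Because the weights are learned from labeled examples, a preference such as "topic-marked noun phrases in nearby sentences make good antecedents" can emerge from the data without being written in as a constraint, which is the kind of behavior the abstract reports. A candidate that appears only with predicates unseen in training receives a uniform distribution in this sketch.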