1 Department of Artificial Intelligence, Artificial Intelligence Technology Institute, Kim Il Sung University, Pyongyang, Democratic People’s Republic of Korea.
2 Department of Foreign Language, Kim Il Sung University, Pyongyang, Democratic People’s Republic of Korea.
World Journal of Advanced Research and Reviews, 2026, 29(03), 1008-1015
Article DOI: 10.30574/wjarr.2026.29.3.0477
Received on 17 January 2026; revised on 25 February 2026; accepted on 27 February 2026
Subword segmentation is an effective solution to the vocabulary problem in neural machine translation (NMT), and Byte Pair Encoding (BPE) is widely recognized as an effective segmentation approach for machine translation across multiple languages. However, in morphologically rich languages such as Korean, BPE can lead to excessive segmentation, which obscures word meaning and creates semantic confusion during training. This confusion ultimately degrades overall translation quality. This paper proposes a method to optimize the Korean subword vocabulary for NMT, based on the observation that a Korean subword vocabulary created with the BPE training algorithm contains many compositional subwords. In experiments, the optimized Korean subword vocabulary yields stable translation performance by removing unnecessary compositional subwords while maintaining a balanced distribution.
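To illustrate why a BPE-trained vocabulary tends to contain compositional subwords, the following is a minimal sketch of standard BPE merge learning on a toy Korean word list. It is not the optimization method proposed in the paper, whose removal criterion is not stated in this abstract. The function name `learn_bpe`, the toy words, and their frequencies are assumptions made for illustration only.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} mapping (toy sketch)."""
    # Each word starts as a tuple of single characters (Hangul syllables here).
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; ties are broken by the pair itself.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges, corpus


# Hypothetical toy data: three inflected forms sharing one stem.
word_freqs = {"학교에": 4, "학교에서": 3, "학교는": 2}
merges, segmented = learn_bpe(word_freqs, num_merges=3)

# Vocabulary = initial characters plus one new symbol per merge.
vocab = {ch for word in word_freqs for ch in word}
for left, right in merges:
    vocab.add(left + right)
```

Every symbol that BPE adds is, by construction, the concatenation of two symbols already in the vocabulary, so entries such as `학교에` and `학교에서` are built from shorter entries that remain in the vocabulary alongside them. Which of these entries count as unnecessary, and how they are removed, is what the paper's method addresses.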
Korean Translation; NMT; Subword Vocabulary; BPE Learning Algorithm; Vocabulary Optimization
Kim Ryonghyok, Kim Kwanghyok, An Songil, Ryang Cholho and Choe Jinhyok. Korean Subword vocabulary optimization by removing compositional words in neural machine translation. World Journal of Advanced Research and Reviews, 2026, 29(03), 1008-1015. Article DOI: https://doi.org/10.30574/wjarr.2026.29.3.0477.