The ability to perceive acoustic signals accurately and acquire the rules of language-specific sound patterning is a challenging yet crucial component of both first and second language acquisition. L1 learners build up phonological knowledge that supports their phonemic perception, and L2 learners’ perception of unfamiliar speech sounds is shaped by their L1 phonology. Syllable structure is part of this phonological knowledge, and languages can differ in their syllable structure preferences. We therefore expect language-specific syllable structure to influence speech perception. Some evidence for the influence of syllable structure differences on non-native speech perception comes from measuring how often the McGurk effect occurs. Previous work argued that the cross-linguistic difference in the McGurk fusion rate observed in different parts of the syllable reflected the listeners’ language-specific syllable structure preferences, but did not explain how syllable structure induced these differences. The current study extends this result to a new language and examines the mechanisms by which syllable structure can influence learners’ speech perception. In addition to audio-visual incongruent stimuli (which induce a McGurk effect), audio-visual congruent stimuli and audio-only stimuli were tested in order to compare phonetic and phonological explanations for the differences. Eighteen native English speakers and eighteen native Japanese speakers were presented with a series of nonsense words with different syllable structures and asked to write what they heard. Both language groups showed a similar pattern of McGurk fusion rates in onset and coda. However, in the audio-only condition, the two language groups differed significantly in perception accuracy.
I argue that this indicates the McGurk differences are not exclusively the result of either acoustic informativity (Phonetic-Superiority Hypothesis) or interference from a listener’s phonological knowledge (Phonological-Superiority Hypothesis). I sketch a cognitively inspired cue integration framework, which rationally integrates different information modalities, as a third explanation for the current results.