This thesis presents a learning-based audio-visual approach to detecting active speakers in videos of group meetings. The input videos often contain an unknown number of people with low-resolution facial regions, which poses significant challenges to traditional vision-only methods. The key idea of this work is to combine visual input with the synchronized audio signal to improve detection accuracy. The contributions are threefold. First, a learning-based object detection method is developed to localize human faces in video frames and classify each as speaker or non-speaker based on its facial features. Second, the synchronized audio signal is processed to estimate speech activity over time. Third, a fusion strategy is introduced that combines the visual and audio analysis results into a final speaker detection. The proposed methods were evaluated on a newly collected dataset of several hundred YouTube videos. Experimental results showed that the audio-visual method achieved substantial improvements over vision-only methods.