Extracting foreground regions provides contextual information for a variety of computer vision tasks in surveillance systems, including object detection, visual tracking, and semantic segmentation. Traditional methods in the literature suffer from multiple challenges such as background clutter, objects overlapping in the visual field, shadows, lighting changes, fast-moving objects, and objects being introduced to or removed from the scene. To address these issues, this work presents a learning-based method for subtracting background regions in individual video frames. The proposed method builds on recently developed fully convolutional networks (FCNs), which take input of arbitrary size and produce correspondingly sized output. Trained end-to-end, pixel-to-pixel, the network predicts foreground pixels with less noise and better generalization, outperforming traditional methods. Integrating transfer learning and image-pyramid techniques further enhances the stability of the models. The performance of the models is compared across different scenarios.
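
The abstract does not specify an architecture, so the following is only a minimal sketch, assuming PyTorch, of the kind of FCN it describes: a pretrained convolutional encoder (the transfer-learning component) followed by upsampling, so that a frame of arbitrary size yields a correspondingly sized per-pixel foreground probability map. The class name, layer sizes, and choice of VGG16 as the backbone are illustrative assumptions, not the authors' design.

```python
# Minimal sketch (not the authors' implementation): an FCN for per-pixel
# foreground/background prediction. The pretrained VGG16 encoder stands in
# for the transfer-learning component; all layer choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class ForegroundFCN(nn.Module):
    def __init__(self):
        super().__init__()
        # Pretrained encoder: convolutional part of VGG16 (transfer learning).
        vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1")
        self.encoder = vgg.features            # downsamples by 32x overall
        # 1x1 conv maps the 512 encoder channels to one foreground score.
        self.score = nn.Conv2d(512, 1, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[2], x.shape[3]          # input of arbitrary size
        feats = self.encoder(x)
        score = self.score(feats)
        # Upsample back to the input resolution: correspondingly sized output.
        score = F.interpolate(score, size=(h, w),
                              mode="bilinear", align_corners=False)
        return torch.sigmoid(score)            # per-pixel foreground probability

# Usage: a frame batch of any spatial size yields a same-size foreground map,
# which can be trained end-to-end, pixel-to-pixel, against binary masks.
model = ForegroundFCN().eval()
frame = torch.rand(1, 3, 240, 320)             # e.g. one 320x240 RGB frame
with torch.no_grad():
    mask = model(frame)                        # shape: (1, 1, 240, 320)
```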