Efficient implementations of elementary operations, such as reciprocal, square root, and reciprocal square root, can result in significantly more compact and higher-throughput designs. Designers must carefully choose, from among numerous available techniques, the algorithms best suited to custom implementations of these elementary functions that meet their design requirements. This thesis first explores the design trade-offs for alternative high-performance implementations of these three operations in single-precision representation. Various trade-offs at the algorithmic and architectural levels are considered, such as the convergence rate of the algorithm, numerical stability and accuracy, inherent parallelism, resource requirements, and latency, as well as the ability to share a hardware datapath among the implementations of the required functions. The comparative study and quantitative analysis of high-performance architectures should help designers select an efficient implementation of the most suitable algorithm for their custom designs. Another goal of the thesis is to implement the reciprocal, square root, and reciprocal square root operations in fixed-point and floating-point formats with arbitrary precision. These elementary functions, developed in MATLAB, are then modeled in the Verilog hardware description language and implemented on a field-programmable gate array (FPGA). The implementation results should provide valuable insight into choosing suitable architectures for efficient realization of these elementary functions. Finally, the optimized designs are compared with other state-of-the-art implementations of these modules.
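To illustrate the kind of iterative algorithm whose convergence rate the abstract refers to, the sketch below shows Newton-Raphson iterations for reciprocal and reciprocal square root, one common family among the "numerous available techniques" mentioned above. This is a hedged illustration only: the seed values, iteration counts, and the derivation of square root as a·(1/√a) are assumptions for this sketch, not the specific algorithms or architectures chosen in the thesis.

```python
def reciprocal(a, x=1.0, iters=6):
    """Approximate 1/a by Newton-Raphson: x <- x * (2 - a*x).

    The crude seed x=1.0 converges only when 0 < a*x < 2; a real
    hardware design would instead use a small lookup table or a
    low-degree polynomial to produce a good initial estimate.
    The relative error roughly squares each iteration (quadratic
    convergence), so the number of correct bits doubles per step.
    """
    for _ in range(iters):
        x = x * (2.0 - a * x)
    return x


def rsqrt(a, x=1.0, iters=6):
    """Approximate 1/sqrt(a) by Newton-Raphson:
    x <- x * (3 - a*x*x) / 2, again with a crude seed."""
    for _ in range(iters):
        x = x * (1.5 - 0.5 * a * x * x)
    return x


def sqrt_via_rsqrt(a):
    """sqrt(a) = a * (1/sqrt(a)): one way the three operations can
    share a datapath, as the trade-off discussion above suggests."""
    return a * rsqrt(a)
```

For inputs pre-scaled near 1 (as a floating-point mantissa would be), a handful of iterations reaches double-precision accuracy, e.g. `reciprocal(1.5)` agrees with `1/1.5` to machine precision; the hardware question the thesis studies is how to trade iteration count against seed-table size, latency, and shared resources.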