Emerging image and signal processing applications involve several matrix-based algorithms that are extremely sensitive to round-off error. Implementing these applications on fixed-point (FxP) processors can significantly increase design time and may also reduce the signal-to-noise ratio (SNR). However, because floating-point (FP) hardware carries a high area and power overhead, low-power DSPs typically do not provide hardware support for FP arithmetic. Moreover, the long latency of FP operations can degrade the performance of signal processing applications. In this paper, we propose a block-floating-point-based fused multiply-add (BFP-FMA) unit with reduced area and power overhead that is tailored to the needs of signal processing applications. Since dot-product instructions are commonly employed in matrix-based kernels, we use the proposed BFP-FMA unit to reduce the latency of dot-product operations by a factor of two. The proposed BFP-FMA unit improves the performance of key DSP kernels by as much as 40% while reducing energy consumption by 28%. Exploiting BFP arithmetic also allows us to reduce the area and power of the FMA units by 33% and 41%, respectively.
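To make the block-floating-point idea concrete, the following is a minimal software sketch of a BFP dot product, not the hardware design proposed in this paper. It assumes a 16-bit signed mantissa and one shared exponent per vector; the function names `to_bfp` and `bfp_dot` and the `mantissa_bits` parameter are hypothetical and chosen only for illustration.

```python
import math


def to_bfp(values, mantissa_bits=16):
    """Quantize a block of floats to integer mantissas with one shared exponent.

    Illustrative assumption: the shared exponent is set by the largest
    magnitude in the block, so that value sets the scale for every element.
    """
    max_mag = max(abs(v) for v in values)
    if max_mag == 0.0:
        return [0] * len(values), 0
    # frexp returns (m, e) with max_mag == m * 2**e and 0.5 <= m < 1.
    _, e = math.frexp(max_mag)
    shared_exp = e - (mantissa_bits - 1)
    # Clamp to the symmetric signed range in case rounding reaches 2**(bits-1).
    limit = (1 << (mantissa_bits - 1)) - 1
    mantissas = [
        max(-limit, min(limit, round(math.ldexp(v, -shared_exp))))
        for v in values
    ]
    return mantissas, shared_exp


def bfp_dot(a, b, mantissa_bits=16):
    """Dot product of two blocks using integer multiply-accumulate only."""
    ma, ea = to_bfp(a, mantissa_bits)
    mb, eb = to_bfp(b, mantissa_bits)
    # No per-element exponent alignment is needed inside the loop:
    # the two shared exponents are applied once, after accumulation.
    acc = sum(x * y for x, y in zip(ma, mb))
    return math.ldexp(acc, ea + eb)
```

In this sketch the inner loop is purely integer arithmetic, and the exponents are handled once per block rather than once per element. That is the general property of BFP arithmetic that makes it cheaper than full FP, at the cost of precision for elements much smaller than the block maximum.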