Albeit low-power, mixed-signal circuitry suffers from the significant overhead of Analog-to-Digital (A/D) conversion, a limited range for information encoding, and susceptibility to noise. This paper aims to address these challenges by offering and leveraging the following mathematical insight regarding the vector dot-product, the basic operator in Deep Neural Networks (DNNs): this operator can be reformulated as a wide regrouping of spatially parallel low-bitwidth calculations that are interleaved across the bit partitions of multiple elements of the vectors. As such, the computational building block of our accelerator becomes a wide bit-interleaved analog vector unit comprising a collection of low-bitwidth multiply-accumulate modules that operate in the analog domain and share a single A/D converter (ADC). This bit-partitioning results in a lower-resolution ADC, while the wide regrouping alleviates the need for an A/D conversion per operation, amortizing its cost across multiple bit partitions of the vector elements. Moreover, the low-bitwidth modules require a smaller encoding range and also provide larger margins for noise mitigation. We also utilize a switched-capacitor design for our bit-level reformulation of DNN operations. The proposed switched-capacitor circuitry performs the regrouped multiplications in the charge domain and accumulates the results of the group in its capacitors over multiple cycles. The capacitive accumulation, combined with the wide bit-partitioned regrouping, reduces the rate of A/D conversions, further improving the overall efficiency of the design. With this mathematical reformulation and its switched-capacitor implementation, we define one possible 3D-stacked microarchitecture, dubbed BiHiwe, that leverages clustering and a hierarchical design to best utilize the power efficiency of the mixed-signal domain and 3D stacking. We also build models for noise, computational non-idealities, and variations.
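To make the reformulation concrete, the sketch below (ours, not from the paper; function names are illustrative) shows an unsigned dot product decomposed into low-bitwidth partial dot products over 2-bit partitions of 8-bit operands. Each inner `np.dot` corresponds to one wide analog MAC group whose accumulated result would be digitized by a single shared ADC; the shift-and-add recombination recovers the exact full-precision result.

```python
import numpy as np

def bit_partition(v, total_bits=8, part_bits=2):
    """Split each element of an unsigned integer vector into
    total_bits // part_bits low-bitwidth partitions, LSB first."""
    mask = (1 << part_bits) - 1
    return [(v >> (part_bits * k)) & mask
            for k in range(total_bits // part_bits)]

def bit_interleaved_dot(x, w, total_bits=8, part_bits=2):
    """Dot product reformulated as a regrouping of spatially
    parallel low-bitwidth multiply-accumulates, interleaved
    across the bit partitions of the vector elements."""
    xp = bit_partition(x, total_bits, part_bits)
    wp = bit_partition(w, total_bits, part_bits)
    acc = 0
    for i, xi in enumerate(xp):
        for j, wj in enumerate(wp):
            # One wide low-bitwidth MAC group: in hardware, a single
            # A/D conversion is amortized over this whole group.
            group = int(np.dot(xi.astype(np.int64), wj.astype(np.int64)))
            # Shift-and-add recombination restores full precision.
            acc += group << (part_bits * (i + j))
    return acc
```

The regrouping is exact because x[n]·w[n] = Σ_{i,j} 2^{p(i+j)} x_i[n]·w_j[n] for p-bit partitions, so summing the low-bitwidth groups before digitization changes only where the accumulation happens, not the result.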
For ten DNN benchmarks, BiHiwe delivers a 5.5× speedup over Tetris, a leading purely digital 3D-stacked accelerator, with less than 0.5% accuracy loss, achieved by careful treatment of noise, computation error, and various forms of variation. Compared to the RTX 2080 Ti GPU with tensor cores and the Titan Xp GPU, both with 8-bit execution, BiHiwe offers 35.4× and 70.1× higher Performance-per-Watt, respectively. Relative to the mixed-signal RedEye, ISAAC, and PipeLayer, BiHiwe offers 5.5×, 3.6×, and 9.6× improvements in Performance-per-Watt, respectively. The results suggest that BiHiwe is an effective initial step on a road that combines mathematics, circuits, and architecture.