High-performance embedded systems often include one or more embedded processors tightly coupled with more specialized accelerators. These accelerators improve both performance and energy efficiency because they are specialized for specific (or specific classes of) computations. Data communication between the accelerator and memory, however, is a potential bottleneck for both performance and energy-efficiency. In this paper, we compare and evaluate, for the first time, the impact of L1 data cache design on performance and energy consumption of embedded processor-accelerator systems with shared memory. For this evaluation, we consider data cache design parameters such as size, associativity, and port count, as well as L1 cache sharing between the processor and accelerator. We demonstrate the potential of configurable caches to exploit diversity in cache requirements across hybrid software/hardware applications to significantly improve energy-efficiency while maintaining high performance. Guided by these studies, we propose two techniques for improving energy-efficiency of the cache hierarchy in processor-accelerator systems. The first technique adds configurability to the accelerator-cache interface to allow the accelerator to either share the processor's L1 data cache or use its own private L1 cache. The second technique modifies the L1 cache structure to provide a configurable tradeoff between bandwidth (number of ports) and capacity. Our simulation results show that the first and second techniques improve cache hierarchy energy-efficiency by up to 64% and 33%, respectively, over that of non-configurable caches.