One of the main challenges for embedded systems is the transfer of data between memory and processor. In this context, Hybrid Memory Cubes (HMCs) can provide substantial energy and bandwidth improvements over traditional memory organizations, while also supporting the execution of simple atomic instructions inside the memory. However, the complex memory hierarchy remains a bottleneck, especially for applications with low data reuse, limiting the usable parallelism of the HMC vaults and banks. In this paper, we introduce the HIVE architecture, which allows common vector operations to be performed directly inside the HMC, avoiding contention on the interconnections as well as cache pollution. Our mechanism achieves substantial speedups of up to 17.3× (9.4× on average) compared to a baseline system that performs vector operations on an 8-core processor. We also show that the simple atomic instructions provided by the HMC specification actually hurt performance for streaming applications.