Scalable SIMD-parallel memory allocation for many-core machines

Xiaohuang Huang, Christopher I. Rodrigues, Stephen Jones, Ian Buck, Wen Mei Hwu

Research output: Contribution to journal › Article › peer-review


Dynamic memory allocation is an important feature of modern programming systems. However, the cost of memory allocation in massively parallel execution environments such as CUDA has been too high for many types of kernels. This paper presents XMalloc, a high-throughput memory allocation mechanism that dramatically magnifies the allocation throughput of an underlying memory allocator. XMalloc embodies two key techniques: allocation coalescing and buffering using efficient queues. This paper describes these two techniques and presents our implementation of XMalloc as a memory allocator library. The library is designed to be called from kernels executed by massive numbers of threads. Our experimental results based on the NVIDIA G480 GPU show that XMalloc magnifies the allocation throughput of the underlying memory allocator by a factor of 48.
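The abstract names allocation coalescing without describing its mechanics. As a rough illustration only, the sketch below assumes coalescing means merging the simultaneous requests of one SIMD group into a single call to the underlying allocator and then handing each lane an offset into that block. It is a sequential Python simulation, not GPU code and not XMalloc's implementation; `CountingAllocator`, `coalesced_malloc`, and the 8-byte alignment are hypothetical choices made for this example.

```python
class CountingAllocator:
    """Hypothetical stand-in for the underlying allocator.

    A bump-pointer allocator that counts how often it is called,
    so the effect of coalescing is visible.
    """

    def __init__(self):
        self.next_addr = 0
        self.calls = 0

    def malloc(self, size):
        self.calls += 1
        addr = self.next_addr
        self.next_addr += size
        return addr


def align_up(size, alignment):
    """Round size up to the next multiple of alignment."""
    return (size + alignment - 1) // alignment * alignment


def coalesced_malloc(allocator, sizes, alignment=8):
    """Serve one request per lane with a single underlying allocation.

    sizes[i] is the byte count requested by lane i. The return value
    holds one address per lane, all carved from one shared block.
    """
    if not sizes:
        return []

    # Exclusive prefix sum over the padded sizes gives each lane's offset.
    offsets = []
    total = 0
    for size in sizes:
        offsets.append(total)
        total += align_up(size, alignment)

    # One call to the underlying allocator covers the whole group.
    base = allocator.malloc(total)
    return [base + offset for offset in offsets]


if __name__ == "__main__":
    allocator = CountingAllocator()
    addresses = coalesced_malloc(allocator, [4, 16, 1, 8])
    print(addresses, allocator.calls)
```

In this simulation, four lane requests cost one underlying allocation instead of four, which is the kind of saving the abstract attributes to coalescing. The sketch leaves out deallocation: a shared block can only be returned once every lane has released its piece, so a real design needs some per-block bookkeeping that is not shown here.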

Original language: English (US)
Pages (from-to): 1008-1020
Number of pages: 13
Journal: Journal of Supercomputing
Issue number: 3
State: Published - Jun 1 2013


Keywords

  • CUDA
  • Malloc

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Information Systems
  • Hardware and Architecture
