Scalable SIMD-parallel memory allocation for many-core machines

Xiaohuang Huang, Christopher I. Rodrigues, Stephen Jones, Ian Buck, Wen-Mei W Hwu

Research output: Contribution to journal (Article)

Abstract

Dynamic memory allocation is an important feature of modern programming systems. However, the cost of memory allocation in massively parallel execution environments such as CUDA has been too high for many types of kernels. This paper presents XMalloc, a high-throughput memory allocation mechanism that dramatically magnifies the allocation throughput of an underlying memory allocator. XMalloc embodies two key techniques: allocation coalescing and buffering using efficient queues. This paper describes these two techniques and presents our implementation of XMalloc as a memory allocator library. The library is designed to be called from kernels executed by massive numbers of threads. Our experimental results based on the NVIDIA G480 GPU show that XMalloc magnifies the allocation throughput of the underlying memory allocator by a factor of 48.
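The abstract names allocation coalescing without giving details, so the following is only a minimal host-side C++ sketch of the general idea, not XMalloc's implementation. It assumes that coalescing means serving a group of requests (for example, one per SIMD lane) with a single call to the underlying allocator and carving per-request pointers out of that block. The names `CoalescedBlock`, `round_up`, and `coalesced_alloc` and the 16-byte default alignment are hypothetical, and `std::malloc` stands in for the underlying allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical illustration, not XMalloc's code: one underlying allocation
// shared by a group of requests.
struct CoalescedBlock {
    void* base;               // the single underlying allocation
    std::vector<void*> ptrs;  // per-request pointers carved out of base
};

// Round n up to a multiple of align (align must be a power of two).
inline std::size_t round_up(std::size_t n, std::size_t align) {
    return (n + align - 1) & ~(align - 1);
}

inline CoalescedBlock coalesced_alloc(const std::vector<std::size_t>& sizes,
                                      std::size_t align = 16) {
    assert(align != 0 && (align & (align - 1)) == 0);

    // Exclusive prefix sum of the aligned sizes gives each request's offset.
    std::vector<std::size_t> offsets(sizes.size());
    std::size_t total = 0;
    for (std::size_t i = 0; i < sizes.size(); ++i) {
        offsets[i] = total;
        total += round_up(sizes[i], align);
    }

    CoalescedBlock out{nullptr, std::vector<void*>(sizes.size(), nullptr)};
    if (total == 0) return out;

    // One allocator call for the whole group instead of one per request.
    out.base = std::malloc(total);
    if (out.base == nullptr) return out;

    char* p = static_cast<char*>(out.base);
    for (std::size_t i = 0; i < sizes.size(); ++i) {
        out.ptrs[i] = p + offsets[i];
    }
    return out;
}
```

In this sketch the caller releases the whole group with one `std::free(block.base)`. How individual requests would be freed independently is deliberately left out, since the abstract does not describe it.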

Original language: English (US)
Pages (from-to): 1008-1020
Number of pages: 13
Journal: Journal of Supercomputing
Volume: 64
Issue number: 3
DOI: 10.1007/s11227-011-0680-7
State: Published - Jun 1 2013

Fingerprint

Storage allocation (computer)
Many-core
Throughput
Data storage equipment
Computer systems programming
kernel
Thread
High Throughput
Queue
Costs
Programming
Experimental Results

Keywords

  • CUDA
  • GPGPU
  • Malloc

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Information Systems
  • Hardware and Architecture

Cite this

Scalable SIMD-parallel memory allocation for many-core machines. / Huang, Xiaohuang; Rodrigues, Christopher I.; Jones, Stephen; Buck, Ian; Hwu, Wen-Mei W.

In: Journal of Supercomputing, Vol. 64, No. 3, 01.06.2013, p. 1008-1020.

@article{377a0066936147959f5d04afa7e0e456,
title = "Scalable SIMD-parallel memory allocation for many-core machines",
keywords = "CUDA, GPGPU, Malloc",
author = "Xiaohuang Huang and Rodrigues, {Christopher I.} and Stephen Jones and Ian Buck and Hwu, {Wen-Mei W}",
year = "2013",
month = "6",
day = "1",
doi = "10.1007/s11227-011-0680-7",
language = "English (US)",
volume = "64",
pages = "1008--1020",
journal = "Journal of Supercomputing",
issn = "0920-8542",
publisher = "Springer Netherlands",
number = "3",

}

TY - JOUR

T1 - Scalable SIMD-parallel memory allocation for many-core machines

AU - Huang, Xiaohuang

AU - Rodrigues, Christopher I.

AU - Jones, Stephen

AU - Buck, Ian

AU - Hwu, Wen-Mei W

PY - 2013/6/1

Y1 - 2013/6/1

KW - CUDA

KW - GPGPU

KW - Malloc

UR - http://www.scopus.com/inward/record.url?scp=84886717782&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84886717782&partnerID=8YFLogxK

U2 - 10.1007/s11227-011-0680-7

DO - 10.1007/s11227-011-0680-7

M3 - Article

AN - SCOPUS:84886717782

VL - 64

SP - 1008

EP - 1020

JO - Journal of Supercomputing

JF - Journal of Supercomputing

SN - 0920-8542

IS - 3

ER -