Semi-Coherent DMA

An Alternative I/O Coherency Management for Embedded Systems

Seung Won Min, Mohammad Alian, Wen-Mei W Hwu, Nam Sung Kim

Research output: Contribution to journal, Article

Abstract

Many modern embedded CPUs prefer a Non-Coherent DMA (NC-DMA) over a Coherent DMA (C-DMA) because of its simplicity. The NC-DMA, however, requires a CPU to manually invalidate or flush a wide range of cache space. In particular, when an I/O device writes data to a main memory region, the CPU needs to invalidate the cache space corresponding to the same memory region twice: (1) to prevent dirty cache lines from overwriting the DMA data and (2) to remove any cache lines prefetched before the DMA is done. In this work, we first show that such an NC-DMA consumes 31% of CPU cycles, limiting the bandwidth of a high-speed network interface card (NIC) when receiving network packets. Second, improving the efficiency of NC-DMA with a slight modification, we propose a Semi-Coherent DMA (SC-DMA) architecture. Specifically, our SC-DMA records the DMA region and prohibits any data prefetched from the memory region from staying in the cache, reducing the unnecessary invalidations by nearly 50%. Lastly, we identify that some software optimization can substantially reduce excessive cache invalidations prevalent in NIC drivers. Our evaluation with NVIDIA Jetson TX2 shows that the SC-DMA with the NIC driver optimization can improve the NIC bandwidth by up to 53.3%.
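The double invalidation described in the abstract can be illustrated with a toy model. The sketch below is not the authors' hardware design or driver code: the `ToyCache` class, the `guarded` region record, and the `receive` function are hypothetical names introduced only for illustration, and the model ignores dirty-line write-back, line sizes, and timing. It assumes a single prefetch lands while the device is still writing, and it counts CPU-issued invalidations with and without a recorded DMA region.

```python
class ToyCache:
    """Toy line-granular cache in front of a dict-based memory (illustrative only)."""

    def __init__(self, memory):
        self.memory = memory        # line address -> value in main memory
        self.lines = {}             # line address -> cached value
        self.guarded = set()        # hypothetical record of the active DMA region
        self.invalidations = 0      # number of CPU-issued line invalidations

    def prefetch(self, addr):
        # Speculative fill. Under the semi-coherent assumption, a fill that
        # targets the recorded DMA region is not allowed to stay in the cache.
        if addr in self.guarded:
            return
        self.lines[addr] = self.memory[addr]

    def invalidate(self, region):
        # The CPU walks the region line by line, as a driver would.
        for addr in region:
            self.lines.pop(addr, None)
            self.invalidations += 1

    def read(self, addr):
        # Demand read: fill from memory on a miss, then return the cached value.
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]


def receive(cache, region, payload, semi_coherent):
    """Model one device-to-memory transfer and return what the CPU then reads."""
    # (1) Invalidate before the transfer. In real hardware this keeps dirty
    #     lines from being written back over the DMA data; write-back is
    #     not modeled here.
    cache.invalidate(region)
    if semi_coherent:
        cache.guarded = set(region)     # record the DMA region
    # A prefetch arrives before the device has written: the data is stale.
    cache.prefetch(region[0])
    # The device writes main memory directly, bypassing the cache.
    for addr, value in zip(region, payload):
        cache.memory[addr] = value
    if semi_coherent:
        cache.guarded = set()           # transfer done, release the region
    else:
        # (2) Invalidate again to drop lines prefetched during the transfer.
        cache.invalidate(region)
    return [cache.read(addr) for addr in region]


region = [0, 1, 2, 3]
payload = ["a", "b", "c", "d"]
nc = ToyCache({addr: "old" for addr in region})
sc = ToyCache({addr: "old" for addr in region})
assert receive(nc, region, payload, semi_coherent=False) == payload
assert receive(sc, region, payload, semi_coherent=True) == payload
print(nc.invalidations, sc.invalidations)   # prints "8 4"
```

In this toy setting both paths return the fresh DMA data, but the non-coherent path invalidates every line twice while the semi-coherent path invalidates each line once. The exact halving is a property of the model, not a measurement from the paper.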

Original language: English (US)
Journal: IEEE Computer Architecture Letters
DOI: 10.1109/LCA.2018.2866568
State: Accepted/In press - Aug 22 2018

Keywords

  • Bandwidth
  • Cache
  • Computer architecture
  • Data transfer
  • Device drivers
  • Embedded processor
  • Embedded systems
  • Ethernet
  • Hardware
  • Internet of Things
  • Prefetching

ASJC Scopus subject areas

  • Hardware and Architecture

Cite this

Semi-Coherent DMA: An Alternative I/O Coherency Management for Embedded Systems. / Min, Seung Won; Alian, Mohammad; Hwu, Wen-Mei W; Kim, Nam Sung.

In: IEEE Computer Architecture Letters, 22.08.2018.


@article{024bc14958fc447bb987093e678dd57e,
title = "Semi-Coherent DMA: An Alternative I/O Coherency Management for Embedded Systems",
abstract = "Many modern embedded CPUs prefer a Non-Coherent DMA (NC-DMA) over a Coherent DMA (C-DMA) because of its simplicity. The NC-DMA, however, requires a CPU to manually invalidate or flush a wide range of cache space. In particular, when an I/O device writes data to a main memory region, the CPU needs to invalidate the cache space corresponding to the same memory region twice: (1) to prevent dirty cache lines from overwriting the DMA data and (2) to remove any cache lines prefetched before the DMA is done. In this work, we first show that such an NC-DMA consumes 31{\%} of CPU cycles, limiting the bandwidth of a high-speed network interface card (NIC) when receiving network packets. Second, improving the efficiency of NC-DMA with a slight modification, we propose a Semi-Coherent DMA (SC-DMA) architecture. Specifically, our SC-DMA records the DMA region and prohibits any data prefetched from the memory region from staying in the cache, reducing the unnecessary invalidations by nearly 50{\%}. Lastly, we identify that some software optimization can substantially reduce excessive cache invalidations prevalent in NIC drivers. Our evaluation with NVIDIA Jetson TX2 shows that the SC-DMA with the NIC driver optimization can improve the NIC bandwidth by up to 53.3{\%}.",
keywords = "Bandwidth, Cache, Computer architecture, Data transfer, Device drivers, Embedded processor, Embedded systems, Ethernet, Hardware, Internet of Things, Prefetching",
author = "Min, {Seung Won} and Mohammad Alian and Hwu, {Wen-Mei W} and Kim, {Nam Sung}",
year = "2018",
month = "8",
day = "22",
doi = "10.1109/LCA.2018.2866568",
language = "English (US)",
journal = "IEEE Computer Architecture Letters",
issn = "1556-6056",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Semi-Coherent DMA

T2 - An Alternative I/O Coherency Management for Embedded Systems

AU - Min, Seung Won

AU - Alian, Mohammad

AU - Hwu, Wen-Mei W

AU - Kim, Nam Sung

PY - 2018/8/22

Y1 - 2018/8/22

N2 - Many modern embedded CPUs prefer a Non-Coherent DMA (NC-DMA) over a Coherent DMA (C-DMA) because of its simplicity. The NC-DMA, however, requires a CPU to manually invalidate or flush a wide range of cache space. In particular, when an I/O device writes data to a main memory region, the CPU needs to invalidate the cache space corresponding to the same memory region twice: (1) to prevent dirty cache lines from overwriting the DMA data and (2) to remove any cache lines prefetched before the DMA is done. In this work, we first show that such an NC-DMA consumes 31% of CPU cycles, limiting the bandwidth of a high-speed network interface card (NIC) when receiving network packets. Second, improving the efficiency of NC-DMA with a slight modification, we propose a Semi-Coherent DMA (SC-DMA) architecture. Specifically, our SC-DMA records the DMA region and prohibits any data prefetched from the memory region from staying in the cache, reducing the unnecessary invalidations by nearly 50%. Lastly, we identify that some software optimization can substantially reduce excessive cache invalidations prevalent in NIC drivers. Our evaluation with NVIDIA Jetson TX2 shows that the SC-DMA with the NIC driver optimization can improve the NIC bandwidth by up to 53.3%.

KW - Bandwidth

KW - Cache

KW - Computer architecture

KW - Data transfer

KW - Device drivers

KW - Embedded processor

KW - Embedded systems

KW - Ethernet

KW - Hardware

KW - Internet of Things

KW - Prefetching

UR - http://www.scopus.com/inward/record.url?scp=85052680840&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85052680840&partnerID=8YFLogxK

U2 - 10.1109/LCA.2018.2866568

DO - 10.1109/LCA.2018.2866568

M3 - Article

JO - IEEE Computer Architecture Letters

JF - IEEE Computer Architecture Letters

SN - 1556-6056

ER -