Parallel Thread Execution

From Wikipedia, the free encyclopedia

Parallel Thread Execution (PTX or NVPTX[1]) is a low-level parallel thread execution virtual machine and instruction set architecture used in Nvidia's CUDA programming environment. The NVCC compiler translates code written in CUDA, a C++-like language, into PTX instructions (an assembly language represented as ASCII text), and the graphics driver contains a compiler which translates the PTX instructions into the executable binary code[2] which can be run on the processing cores of Nvidia GPUs. The GNU Compiler Collection also has basic ability for PTX generation in the context of OpenMP offloading.[3] Inline PTX assembly can be used in CUDA.[4]
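As the inline-assembly documentation cited above describes, single PTX instructions can be embedded directly in CUDA C++ device code. A minimal sketch (the kernel and variable names are illustrative, not from any particular codebase):

```cuda
// Embed one PTX instruction in a CUDA kernel via the asm() statement.
// "r" constrains an operand to a 32-bit register; %0/%1 are the operands.
__global__ void add_one(int *out, const int *in)
{
    int x = in[threadIdx.x];
    int y;
    asm("add.s32 %0, %1, 1;" : "=r"(y) : "r"(x));  // y = x + 1 in PTX
    out[threadIdx.x] = y;
}
```

The compiler passes the string through to the generated PTX, substituting registers for the constrained operands.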


Registers

PTX uses an arbitrarily large register set; the output from the compiler is almost pure single-assignment form, with consecutive lines generally referring to consecutive registers. Programs start with declarations of the form

.reg .u32 %r<335>;            // declare 335 registers %r0, %r1, ..., %r334 of type unsigned 32-bit integer

PTX is a three-operand assembly language, and almost all instructions explicitly state the data type (in terms of sign and width) on which they operate. Register names are prefixed with a % character and constants are written literally, e.g.:

shr.u64 %rd14, %rd12, 32;     // shift right an unsigned 64-bit integer from %rd12 by 32 positions, result in %rd14
cvt.u64.u32 %rd142, %r112;    // convert an unsigned 32-bit integer to 64-bit

There are predicate registers, but compiled code in shader model 1.0 uses these only in conjunction with branch commands; the conditional branch is

@%p14 bra $label;             // branch to $label

The setp.cc.type instruction sets a predicate register to the result of comparing two registers of the given type. There is also a set instruction: set.le.u32.u64 %r101, %rd12, %rd28 sets the 32-bit register %r101 to 0xffffffff if the 64-bit register %rd12 is less than or equal to the 64-bit register %rd28, and to 0x00000000 otherwise.
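Putting the predicate and branch forms together, a typical compare-and-branch sequence looks like the following (the register numbers and label name are illustrative):

```ptx
setp.lt.u32 %p1, %r4, %r5;    // %p1 = (%r4 < %r5), unsigned 32-bit comparison
@%p1 bra $Loop;               // branch to $Loop if predicate %p1 is true
```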

There are a few predefined identifiers that denote pseudoregisters. Among others, %tid, %ntid, %ctaid, and %nctaid contain, respectively, thread indices, block dimensions, block indices, and grid dimensions.[5]
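For example, the conventional computation of a thread's one-dimensional global index reads these pseudoregisters (the general-purpose register numbers are illustrative):

```ptx
mov.u32 %r1, %ctaid.x;         // block index in x
mov.u32 %r2, %ntid.x;          // block dimension (threads per block) in x
mov.u32 %r3, %tid.x;           // thread index within the block in x
mad.lo.u32 %r4, %r1, %r2, %r3; // global index = ctaid.x * ntid.x + tid.x
```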

State spaces

Load (ld) and store (st) commands refer to one of several distinct state spaces (memory banks), e.g. ld.param. There are eight state spaces:[5]

  .reg     registers
  .sreg    special, read-only, platform-specific registers
  .const   shared, read-only memory
  .global  global memory, shared by all threads
  .local   local memory, private to each thread
  .param   parameters passed to the kernel
  .shared  memory shared between threads in a block
  .tex     global texture memory (deprecated)

Shared memory is declared in the PTX file via lines at the start of the form:

.shared .align 8 .b8 pbatch_cache[15744]; // define 15,744 bytes, aligned to an 8-byte boundary

Writing kernels in PTX requires explicitly registering PTX modules via the CUDA Driver API, which is typically more cumbersome than using the CUDA Runtime API and Nvidia's CUDA compiler, nvcc. The GPU Ocelot project provided an API to register PTX modules alongside CUDA Runtime API kernel invocations, but GPU Ocelot is no longer actively maintained.[6]
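A hand-written PTX module is loaded through the Driver API roughly as follows. This is a minimal sketch: error handling is omitted, and the kernel name "my_kernel" and the launch configuration are assumptions for illustration.

```c
#include <cuda.h>   /* CUDA Driver API */

/* ptx_source holds the text of a PTX module that defines a kernel
   named "my_kernel" (an assumed name); kernel_args points at its
   argument values, as cuLaunchKernel expects. */
void launch_ptx(const char *ptx_source, void *kernel_args[])
{
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuModuleLoadData(&mod, ptx_source);         /* JIT-compile and register the PTX text */
    cuModuleGetFunction(&fn, mod, "my_kernel");
    cuLaunchKernel(fn, 1, 1, 1,  32, 1, 1,      /* 1 block of 32 threads */
                   0, NULL, kernel_args, NULL);
    cuCtxSynchronize();
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
}
```

The Runtime API performs the equivalent module management implicitly, which is why hand-written PTX usually means dropping down to the Driver API.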

References

  1. ^ "User Guide for NVPTX Back-end — LLVM 7 documentation". llvm.org.
  2. ^ "CUDA Binary Utilities". docs.nvidia.com. Retrieved 2019-10-19.
  3. ^ "nvptx". GCC Wiki.
  4. ^ "Inline PTX Assembly in CUDA". docs.nvidia.com. Retrieved 2019-11-03.
  5. ^ "PTX ISA Version 2.3" (PDF).
  6. ^ "GPUOCelot: A dynamic compilation framework for PTX". github.com. 7 November 2022.

This page was last edited on 16 January 2024, at 20:25
The basis of this page is Wikipedia. Text is available under the CC BY-SA 3.0 Unported License; non-text media are available under their specified licenses. Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc.