<!DOCTYPE html><html><head>
<title>cuda</title>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/[email protected]/dist/katex.min.css">
<style>
/**
* prism.js Github theme based on GitHub's theme.
* @author Sam Clarke
*/
code[class*="language-"],
pre[class*="language-"] {
color: #333;
background: none;
font-family: Consolas, "Liberation Mono", Menlo, Courier, monospace;
text-align: left;
white-space: pre;
word-spacing: normal;
word-break: normal;
word-wrap: normal;
line-height: 1.4;
-moz-tab-size: 8;
-o-tab-size: 8;
tab-size: 8;
-webkit-hyphens: none;
-moz-hyphens: none;
-ms-hyphens: none;
hyphens: none;
}
/* Code blocks */
pre[class*="language-"] {
padding: .8em;
overflow: auto;
/* border: 1px solid #ddd; */
border-radius: 3px;
/* background: #fff; */
background: #f5f5f5;
}
/* Inline code */
:not(pre) > code[class*="language-"] {
padding: .1em;
border-radius: .3em;
white-space: normal;
background: #f5f5f5;
}
.token.comment,
.token.blockquote {
color: #969896;
}
.token.cdata {
color: #183691;
}
.token.doctype,
.token.punctuation,
.token.variable,
.token.macro.property {
color: #333;
}
.token.operator,
.token.important,
.token.keyword,
.token.rule,
.token.builtin {
color: #a71d5d;
}
.token.string,
.token.url,
.token.regex,
.token.attr-value {
color: #183691;
}
.token.property,
.token.number,
.token.boolean,
.token.entity,
.token.atrule,
.token.constant,
.token.symbol,
.token.command,
.token.code {
color: #0086b3;
}
.token.tag,
.token.selector,
.token.prolog {
color: #63a35c;
}
.token.function,
.token.namespace,
.token.pseudo-element,
.token.class,
.token.class-name,
.token.pseudo-class,
.token.id,
.token.url-reference .token.variable,
.token.attr-name {
color: #795da3;
}
.token.entity {
cursor: help;
}
.token.title,
.token.title .token.punctuation {
font-weight: bold;
color: #1d3e81;
}
.token.list {
color: #ed6a43;
}
.token.inserted {
background-color: #eaffea;
color: #55a532;
}
.token.deleted {
background-color: #ffecec;
color: #bd2c00;
}
.token.bold {
font-weight: bold;
}
.token.italic {
font-style: italic;
}
/* JSON */
.language-json .token.property {
color: #183691;
}
.language-markup .token.tag .token.punctuation {
color: #333;
}
/* CSS */
code.language-css,
.language-css .token.function {
color: #0086b3;
}
/* YAML */
.language-yaml .token.atrule {
color: #63a35c;
}
code.language-yaml {
color: #183691;
}
/* Ruby */
.language-ruby .token.function {
color: #333;
}
/* Markdown */
.language-markdown .token.url {
color: #795da3;
}
/* Makefile */
.language-makefile .token.symbol {
color: #795da3;
}
.language-makefile .token.variable {
color: #183691;
}
.language-makefile .token.builtin {
color: #0086b3;
}
/* Bash */
.language-bash .token.keyword {
color: #0086b3;
}
/* highlight */
pre[data-line] {
position: relative;
padding: 1em 0 1em 3em;
}
pre[data-line] .line-highlight-wrapper {
position: absolute;
top: 0;
left: 0;
background-color: transparent;
display: block;
width: 100%;
}
pre[data-line] .line-highlight {
position: absolute;
left: 0;
right: 0;
padding: inherit 0;
margin-top: 1em;
background: hsla(24, 20%, 50%,.08);
background: linear-gradient(to right, hsla(24, 20%, 50%,.1) 70%, hsla(24, 20%, 50%,0));
pointer-events: none;
line-height: inherit;
white-space: pre;
}
pre[data-line] .line-highlight:before,
pre[data-line] .line-highlight[data-end]:after {
content: attr(data-start);
position: absolute;
top: .4em;
left: .6em;
min-width: 1em;
padding: 0 .5em;
background-color: hsla(24, 20%, 50%,.4);
color: hsl(24, 20%, 95%);
font: bold 65%/1.5 sans-serif;
text-align: center;
vertical-align: .3em;
border-radius: 999px;
text-shadow: none;
box-shadow: 0 1px white;
}
pre[data-line] .line-highlight[data-end]:after {
content: attr(data-end);
top: auto;
bottom: .4em;
}html body{font-family:"Helvetica Neue",Helvetica,"Segoe UI",Arial,freesans,sans-serif;font-size:16px;line-height:1.6;color:#333;background-color:#fff;overflow:initial;box-sizing:border-box;word-wrap:break-word}html body>:first-child{margin-top:0}html body h1,html body h2,html body h3,html body h4,html body h5,html body h6{line-height:1.2;margin-top:1em;margin-bottom:16px;color:#000}html body h1{font-size:2.25em;font-weight:300;padding-bottom:.3em}html body h2{font-size:1.75em;font-weight:400;padding-bottom:.3em}html body h3{font-size:1.5em;font-weight:500}html body h4{font-size:1.25em;font-weight:600}html body h5{font-size:1.1em;font-weight:600}html body h6{font-size:1em;font-weight:600}html body h1,html body h2,html body h3,html body h4,html body h5{font-weight:600}html body h5{font-size:1em}html body h6{color:#5c5c5c}html body strong{color:#000}html body del{color:#5c5c5c}html body a:not([href]){color:inherit;text-decoration:none}html body a{color:#08c;text-decoration:none}html body a:hover{color:#00a3f5;text-decoration:none}html body img{max-width:100%}html body>p{margin-top:0;margin-bottom:16px;word-wrap:break-word}html body>ul,html body>ol{margin-bottom:16px}html body ul,html body ol{padding-left:2em}html body ul.no-list,html body ol.no-list{padding:0;list-style-type:none}html body ul ul,html body ul ol,html body ol ol,html body ol ul{margin-top:0;margin-bottom:0}html body li{margin-bottom:0}html body li.task-list-item{list-style:none}html body li>p{margin-top:0;margin-bottom:0}html body .task-list-item-checkbox{margin:0 .2em .25em -1.8em;vertical-align:middle}html body .task-list-item-checkbox:hover{cursor:pointer}html body blockquote{margin:16px 0;font-size:inherit;padding:0 15px;color:#5c5c5c;background-color:#f0f0f0;border-left:4px solid #d6d6d6}html body blockquote>:first-child{margin-top:0}html body blockquote>:last-child{margin-bottom:0}html body hr{height:4px;margin:32px 0;background-color:#d6d6d6;border:0 none}html body table{margin:10px 0 15px 
0;border-collapse:collapse;border-spacing:0;display:block;width:100%;overflow:auto;word-break:normal;word-break:keep-all}html body table th{font-weight:bold;color:#000}html body table td,html body table th{border:1px solid #d6d6d6;padding:6px 13px}html body dl{padding:0}html body dl dt{padding:0;margin-top:16px;font-size:1em;font-style:italic;font-weight:bold}html body dl dd{padding:0 16px;margin-bottom:16px}html body code{font-family:Menlo,Monaco,Consolas,'Courier New',monospace;font-size:.85em !important;color:#000;background-color:#f0f0f0;border-radius:3px;padding:.2em 0}html body code::before,html body code::after{letter-spacing:-0.2em;content:"\00a0"}html body pre>code{padding:0;margin:0;font-size:.85em !important;word-break:normal;white-space:pre;background:transparent;border:0}html body .highlight{margin-bottom:16px}html body .highlight pre,html body pre{padding:1em;overflow:auto;font-size:.85em !important;line-height:1.45;border:#d6d6d6;border-radius:3px}html body .highlight pre{margin-bottom:0;word-break:normal}html body pre code,html body pre tt{display:inline;max-width:initial;padding:0;margin:0;overflow:initial;line-height:inherit;word-wrap:normal;background-color:transparent;border:0}html body pre code:before,html body pre tt:before,html body pre code:after,html body pre tt:after{content:normal}html body p,html body blockquote,html body ul,html body ol,html body dl,html body pre{margin-top:0;margin-bottom:16px}html body kbd{color:#000;border:1px solid #d6d6d6;border-bottom:2px solid #c7c7c7;padding:2px 4px;background-color:#f0f0f0;border-radius:3px}@media print{html body{background-color:#fff}html body h1,html body h2,html body h3,html body h4,html body h5,html body h6{color:#000;page-break-after:avoid}html body blockquote{color:#5c5c5c}html body pre{page-break-inside:avoid}html body table{display:table}html body img{display:block;max-width:100%;max-height:100%}html body pre,html body 
code{word-wrap:break-word;white-space:pre}}.markdown-preview{width:100%;height:100%;box-sizing:border-box}.markdown-preview .pagebreak,.markdown-preview .newpage{page-break-before:always}.markdown-preview pre.line-numbers{position:relative;padding-left:3.8em;counter-reset:linenumber}.markdown-preview pre.line-numbers>code{position:relative}.markdown-preview pre.line-numbers .line-numbers-rows{position:absolute;pointer-events:none;top:1em;font-size:100%;left:0;width:3em;letter-spacing:-1px;border-right:1px solid #999;-webkit-user-select:none;-moz-user-select:none;-ms-user-select:none;user-select:none}.markdown-preview pre.line-numbers .line-numbers-rows>span{pointer-events:none;display:block;counter-increment:linenumber}.markdown-preview pre.line-numbers .line-numbers-rows>span:before{content:counter(linenumber);color:#999;display:block;padding-right:.8em;text-align:right}.markdown-preview .mathjax-exps .MathJax_Display{text-align:center !important}.markdown-preview:not([for="preview"]) .code-chunk .btn-group{display:none}.markdown-preview:not([for="preview"]) .code-chunk .status{display:none}.markdown-preview:not([for="preview"]) .code-chunk .output-div{margin-bottom:16px}.scrollbar-style::-webkit-scrollbar{width:8px}.scrollbar-style::-webkit-scrollbar-track{border-radius:10px;background-color:transparent}.scrollbar-style::-webkit-scrollbar-thumb{border-radius:5px;background-color:rgba(150,150,150,0.66);border:4px solid rgba(150,150,150,0.66);background-clip:content-box}html body[for="html-export"]:not([data-presentation-mode]){position:relative;width:100%;height:100%;top:0;left:0;margin:0;padding:0;overflow:auto}html body[for="html-export"]:not([data-presentation-mode]) .markdown-preview{position:relative;top:0}@media screen and (min-width:914px){html body[for="html-export"]:not([data-presentation-mode]) .markdown-preview{padding:2em calc(50% - 457px + 2em)}}@media screen and (max-width:914px){html body[for="html-export"]:not([data-presentation-mode]) 
.markdown-preview{padding:2em}}@media screen and (max-width:450px){html body[for="html-export"]:not([data-presentation-mode]) .markdown-preview{font-size:14px !important;padding:1em}}@media print{html body[for="html-export"]:not([data-presentation-mode]) #sidebar-toc-btn{display:none}}html body[for="html-export"]:not([data-presentation-mode]) #sidebar-toc-btn{position:fixed;bottom:8px;left:8px;font-size:28px;cursor:pointer;color:inherit;z-index:99;width:32px;text-align:center;opacity:.4}html body[for="html-export"]:not([data-presentation-mode])[html-show-sidebar-toc] #sidebar-toc-btn{opacity:1}html body[for="html-export"]:not([data-presentation-mode])[html-show-sidebar-toc] .md-sidebar-toc{position:fixed;top:0;left:0;width:300px;height:100%;padding:32px 0 48px 0;font-size:14px;box-shadow:0 0 4px rgba(150,150,150,0.33);box-sizing:border-box;overflow:auto;background-color:inherit}html body[for="html-export"]:not([data-presentation-mode])[html-show-sidebar-toc] .md-sidebar-toc::-webkit-scrollbar{width:8px}html body[for="html-export"]:not([data-presentation-mode])[html-show-sidebar-toc] .md-sidebar-toc::-webkit-scrollbar-track{border-radius:10px;background-color:transparent}html body[for="html-export"]:not([data-presentation-mode])[html-show-sidebar-toc] .md-sidebar-toc::-webkit-scrollbar-thumb{border-radius:5px;background-color:rgba(150,150,150,0.66);border:4px solid rgba(150,150,150,0.66);background-clip:content-box}html body[for="html-export"]:not([data-presentation-mode])[html-show-sidebar-toc] .md-sidebar-toc a{text-decoration:none}html body[for="html-export"]:not([data-presentation-mode])[html-show-sidebar-toc] .md-sidebar-toc ul{padding:0 1.6em;margin-top:.8em}html body[for="html-export"]:not([data-presentation-mode])[html-show-sidebar-toc] .md-sidebar-toc li{margin-bottom:.8em}html body[for="html-export"]:not([data-presentation-mode])[html-show-sidebar-toc] .md-sidebar-toc ul{list-style-type:none}html 
body[for="html-export"]:not([data-presentation-mode])[html-show-sidebar-toc] .markdown-preview{left:300px;width:calc(100% - 300px);padding:2em calc(50% - 457px - 150px);margin:0;box-sizing:border-box}@media screen and (max-width:1274px){html body[for="html-export"]:not([data-presentation-mode])[html-show-sidebar-toc] .markdown-preview{padding:2em}}@media screen and (max-width:450px){html body[for="html-export"]:not([data-presentation-mode])[html-show-sidebar-toc] .markdown-preview{width:100%}}html body[for="html-export"]:not([data-presentation-mode]):not([html-show-sidebar-toc]) .markdown-preview{left:50%;transform:translateX(-50%)}html body[for="html-export"]:not([data-presentation-mode]):not([html-show-sidebar-toc]) .md-sidebar-toc{display:none}
/* Please visit the URL below for more information: */
/* https://shd101wyy.github.io/markdown-preview-enhanced/#/customize-css */
</style>
</head>
<body for="html-export">
<div class="mume markdown-preview ">
<h1 class="mume-header" id="cuda">CUDA</h1>
<blockquote>
<p>CUDA (Compute Unified Device Architecture) is a computing platform introduced by the graphics-card vendor NVIDIA. CUDA™ is a general-purpose parallel computing architecture that enables the GPU to solve complex computational problems; it comprises the CUDA instruction set architecture (ISA) and the parallel compute engine inside the GPU. Developers can write programs for the CUDA™ architecture in C, the most widely used high-level programming language, and the resulting programs run with very high performance on CUDA-capable processors.</p>
</blockquote>
<ul>
<li><a href="#cuda">CUDA</a>
<ul>
<li><a href="#cuda%E6%89%A7%E8%A1%8C%E7%9A%84%E5%85%B8%E5%9E%8B%E6%B5%81%E7%A8%8B">The typical CUDA workflow</a></li>
<li><a href="#cuda%E7%9A%84%E6%9E%B6%E6%9E%84">The CUDA architecture</a>
<ul>
<li><a href="#%E7%AE%80%E8%AF%B4gpu%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97%E4%B8%8Ecpu%E4%B8%B2%E8%A1%8C%E8%AE%A1%E7%AE%97">GPU parallel vs. CPU serial computation in brief</a></li>
<li><a href="#cuda%E7%9A%84%E7%BA%BF%E7%A8%8B%E6%9E%B6%E6%9E%84">The CUDA thread hierarchy</a></li>
<li><a href="#cuda%E7%9A%84%E5%86%85%E5%AD%98">CUDA memory</a></li>
<li><a href="#cuda%E7%9A%84%E7%BC%96%E7%A8%8B%E6%A8%A1%E5%9E%8B">The CUDA programming model</a></li>
</ul>
</ul>
</li>
<li><a href="#pycuda-example">PyCUDA example</a></li>
</ul>
</li>
</ul>
<h2 class="mume-header" id="cuda%E6%89%A7%E8%A1%8C%E7%9A%84%E5%85%B8%E5%9E%8B%E6%B5%81%E7%A8%8B">The typical CUDA workflow</h2>
<p>Before the discussion, let us define some terms. In CUDA, the CPU is the host and the GPU is the device; a kernel is a function that is called by the CPU but executed on the GPU. As the figure shows:</p>
<p><img src="cuda/cuda-definition.png" alt></p>
<p>What CUDA does is have the CPU hand data to the GPU for parallel computation. Without loss of generality, the CUDA workflow can be described as:</p>
<ol>
<li>Load the data from the host to the device:<pre data-role="codeBlock" data-info="c" class="language-c">// allocate storage on the GPU for the variable
cudaMalloc();
// copy the data from the CPU to the corresponding GPU address
cudaMemcpy(,,,cudaMemcpyHostToDevice);
</pre></li>
<li>Call the kernel function; the computation runs on the device.</li>
<li>Retrieve the data from the device back to the host:<pre data-role="codeBlock" data-info="c" class="language-c">// read the data back from the corresponding device address
cudaMemcpy(,,,cudaMemcpyDeviceToHost);
// free the space
cudaFree();
</pre></li>
</ol>
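<p>The three steps above can be sketched as one complete host-side sequence. This is a minimal sketch, not a full program: <code>some_kernel</code> and the size <code>N</code> are hypothetical placeholders, and error checking is omitted.</p>
<pre data-role="codeBlock" data-info="c" class="language-c">float h_x[N];                  // input/output buffer on the host
float *d_x;                    // its counterpart on the device
// 1. load the data from host to device
cudaMalloc((void **)&amp;d_x, N * sizeof(float));
cudaMemcpy(d_x, h_x, N * sizeof(float), cudaMemcpyHostToDevice);
// 2. call the kernel; the computation runs on the device
some_kernel&lt;&lt;&lt;1, N&gt;&gt;&gt;(d_x);
// 3. copy the result back and release the device memory
cudaMemcpy(h_x, d_x, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_x);
</pre>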
<h2 class="mume-header" id="cuda%E7%9A%84%E6%9E%B6%E6%9E%84">The CUDA architecture</h2>
<h3 class="mume-header" id="%E7%AE%80%E8%AF%B4gpu%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97%E4%B8%8Ecpu%E4%B8%B2%E8%A1%8C%E8%AE%A1%E7%AE%97">GPU parallel vs. CPU serial computation in brief</h3>
<p>The most basic unit of kernel execution in CUDA is the thread. When the CPU launches a kernel, every GPU thread executes <strong>the same</strong> kernel (exactly the same kernel) <strong>simultaneously, in parallel</strong>. Inside the function, different data is therefore usually reached through the differing thread IDs, rather than by passing different data to each thread; what is passed at launch is a pointer to the array.</p>
<p>CUDA's built-in way to obtain the thread ID is <code>threadIdx</code>. For simplicity, consider only the one-dimensional case of doubling a vector; its kernel is:</p>
<pre data-role="codeBlock" data-info="c" class="language-c">__global__ <span class="token keyword">void</span> <span class="token function">double_vector</span><span class="token punctuation">(</span><span class="token keyword">float</span> <span class="token operator">*</span>x<span class="token punctuation">)</span><span class="token punctuation">{</span>
<span class="token keyword">const</span> <span class="token keyword">int</span> i <span class="token operator">=</span> threadIdx<span class="token punctuation">.</span>x<span class="token punctuation">;</span>
x<span class="token punctuation">[</span>i<span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token number">2</span><span class="token operator">*</span>x<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span>
</pre><p>When this function runs, each thread fetches the element at its own position and doubles it, in parallel. If the vector has length <code>MAX_VECTOR_SIZE</code>, then <code>MAX_VECTOR_SIZE</code> threads are needed to execute it. It is equivalent to this CPU function:</p>
<pre data-role="codeBlock" data-info="c" class="language-c"><span class="token keyword">void</span> <span class="token function">double_vector</span><span class="token punctuation">(</span><span class="token keyword">float</span> <span class="token operator">*</span>x<span class="token punctuation">)</span><span class="token punctuation">{</span>
<span class="token keyword">for</span><span class="token punctuation">(</span><span class="token keyword">int</span> i <span class="token operator">=</span> <span class="token number">0</span><span class="token punctuation">;</span> i <span class="token operator"><</span> MAX_VECTOR_SIZE<span class="token punctuation">;</span> i<span class="token operator">++</span><span class="token punctuation">)</span><span class="token punctuation">{</span>
x<span class="token punctuation">[</span>i<span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token number">2</span><span class="token operator">*</span>x<span class="token punctuation">[</span>i<span class="token punctuation">]</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span>
<span class="token punctuation">}</span>
</pre><p>This simple example shows how CPU serial computation and GPU parallel computation differ, both in theory and in implementation: for the statement <code>x[i] = 2*x[i];</code>, the CPU uses one thread that executes it <code>MAX_VECTOR_SIZE</code> times in a loop, while the GPU uses <code>MAX_VECTOR_SIZE</code> threads, each executing it once.</p>
<p>This demonstrates the advantage of CUDA parallelism for large batches of uncoupled data. That said, when the workload is not large enough, a CUDA-optimized parallel program often turns out to be no faster than the CPU version, about the same, or even slower, because moving data between the CPU and the GPU takes considerable time, i.e. <code>cudaMemcpy()</code> is slow. In CUDA programming, therefore, minimize data traffic between host and device: upload one batch, compute on it, and copy it back in one batch.</p>
<h3 class="mume-header" id="cuda%E7%9A%84%E7%BA%BF%E7%A8%8B%E6%9E%B6%E6%9E%84">The CUDA thread hierarchy</h3>
<p><img src="cuda/thread-architechture.png" alt><br>
In CUDA, the basic (logical) unit of kernel execution is the <strong>thread</strong>. A collection of threads is a <strong>block</strong>. A block can be one-, two- or three-dimensional (if the extra dimensions are not needed, simply set them to 1), so a thread within a block has an ID <code>(x,y,z)</code>, obtained with:</p>
<pre data-role="codeBlock" data-info="c" class="language-c"><span class="token keyword">int</span> x<span class="token punctuation">,</span> y<span class="token punctuation">,</span> z<span class="token punctuation">;</span>
x <span class="token operator">=</span> threadIdx<span class="token punctuation">.</span>x<span class="token punctuation">;</span>
y <span class="token operator">=</span> threadIdx<span class="token punctuation">.</span>y<span class="token punctuation">;</span>
z <span class="token operator">=</span> threadIdx<span class="token punctuation">.</span>z<span class="token punctuation">;</span>
</pre><p>Clearly, a meaningful kernel should execute on at least <strong>one</strong> block containing <strong>some number of</strong> threads.</p>
<p>Note that a single block can contain at most 1024 threads, i.e. <code>x*y*z &lt;= 1024</code>; if the computation requires more threads, it must be spread over multiple blocks.</p>
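<p>The arithmetic behind "spreading over multiple blocks" can be checked on the CPU. The following plain C program is a host-side simulation, not real CUDA (<code>num_blocks_for</code> is a hypothetical helper): it applies the standard ceil-division idiom for choosing the number of blocks and verifies that a grid of 256-thread blocks touches 1000 elements exactly once.</p>
<pre data-role="codeBlock" data-info="c" class="language-c">#include &lt;assert.h&gt;
#include &lt;stdio.h&gt;

/* Ceil division: how many blocks of tpb threads are needed to cover
   n elements. */
int num_blocks_for(int n, int tpb) {
    return (n + tpb - 1) / tpb;
}

int main(void) {
    enum { N = 1000, TPB = 256 };
    int covered[N] = {0};
    /* Simulate the grid: every (block, thread) pair computes the global
       index block * TPB + thread; threads past the end do nothing. */
    for (int block = 0; block &lt; num_blocks_for(N, TPB); block++)
        for (int thread = 0; thread &lt; TPB; thread++) {
            int i = block * TPB + thread;   /* global thread index */
            if (i &lt; N)
                covered[i]++;
        }
    for (int i = 0; i &lt; N; i++)
        assert(covered[i] == 1);            /* each element exactly once */
    printf("%d blocks of %d threads cover %d elements\n",
           num_blocks_for(N, TPB), TPB, N);
    return 0;
}
</pre>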
<p>A collection of blocks is a grid. A grid can only be one- or two-dimensional (newer hardware also allows three-dimensional grids); the ID of a block within the grid is obtained with:</p>
<pre data-role="codeBlock" data-info="c" class="language-c"><span class="token keyword">int</span> x<span class="token punctuation">,</span> y<span class="token punctuation">;</span>
x <span class="token operator">=</span> blockIdx<span class="token punctuation">.</span>x<span class="token punctuation">;</span>
y <span class="token operator">=</span> blockIdx<span class="token punctuation">.</span>y<span class="token punctuation">;</span>
</pre><p>When calling a kernel, you must declare how many blocks and threads it is distributed over via <code>&lt;&lt;&lt;grid, block&gt;&gt;&gt;</code>.</p>
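<p>As written, <code>double_vector</code> uses only <code>threadIdx.x</code>, so it is limited to a single block. A common sketch for spanning multiple blocks (the kernel name here is hypothetical, and <code>n</code> is assumed to be the vector length) combines the block and thread IDs into one global index:</p>
<pre data-role="codeBlock" data-info="c" class="language-c">__global__ void double_vector_grid(float *x, int n){
    // global thread index across all blocks
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i &lt; n)          // threads past the end do nothing
        x[i] = 2 * x[i];
}
// launch: enough 256-thread blocks to cover n elements
// double_vector_grid&lt;&lt;&lt;(n + 255) / 256, 256&gt;&gt;&gt;(d_x, n);
</pre>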
<p>Threads in the same block have a common shared memory, which is fast to access. Conversely, the drawback for threads executing in different blocks is that they cannot share one shared memory; this is a limitation of the hardware architecture.</p>
<h3 class="mume-header" id="cuda%E7%9A%84%E5%86%85%E5%AD%98">CUDA memory</h3>
<p><img src="cuda/memory-architechture.png" alt></p>
<p>When choosing a memory type, the main differences to consider are:</p>
<ul>
<li>each grid (a number of blocks) shares its own global memory</li>
<li>each block (a number of threads) shares its own shared memory</li>
</ul>
<p>Shared memory is faster to access than global memory, but its drawback is that only threads within the same block can access it. Declaring or using a shared-memory variable inside a kernel requires the <code>__shared__</code> qualifier, e.g.:</p>
<pre data-role="codeBlock" data-info="c" class="language-c">__shared__ <span class="token keyword">int</span> result<span class="token punctuation">[</span><span class="token number">256</span><span class="token punctuation">]</span><span class="token punctuation">;</span>
</pre><p>Executing <code>cudaMalloc()</code> and <code>cudaMemcpy()</code> on the host loads the data into global memory.</p>
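<p>To illustrate why shared memory matters, here is a sketch (the name <code>block_sum</code> is hypothetical; it assumes 256-thread blocks and an input length that is a multiple of 256) in which each block cooperatively sums its slice of the input through shared memory:</p>
<pre data-role="codeBlock" data-info="c" class="language-c">__global__ void block_sum(const float *x, float *partial){
    __shared__ float cache[256];          // one copy per block, fast access
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    cache[threadIdx.x] = x[i];            // each thread loads one element
    __syncthreads();                      // wait until the whole block has loaded
    // tree reduction within the block
    for (int stride = blockDim.x / 2; stride &gt; 0; stride /= 2) {
        if (threadIdx.x &lt; stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = cache[0];   // one partial sum per block
}
</pre>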
<h3 class="mume-header" id="cuda%E7%9A%84%E7%BC%96%E7%A8%8B%E6%A8%A1%E5%9E%8B">The CUDA programming model</h3>
<table>
<thead>
<tr>
<th style="text-align:center">Keyword</th>
<th style="text-align:center">Executes on (host/device)</th>
<th style="text-align:center">Callable only from (host/device)</th>
<th style="text-align:center">Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center"><code>__device__ float DeviceFunc()</code></td>
<td style="text-align:center">device</td>
<td style="text-align:center">device</td>
<td style="text-align:center"></td>
</tr>
<tr>
<td style="text-align:center"><code>__global__ void KernalFunc()</code></td>
<td style="text-align:center">device</td>
<td style="text-align:center">host</td>
<td style="text-align:center">return type must be void</td>
</tr>
<tr>
<td style="text-align:center"><code>__host__ float HostFunc()</code></td>
<td style="text-align:center">host</td>
<td style="text-align:center">host</td>
<td style="text-align:center"></td>
</tr>
</tbody>
</table>
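<p>The three qualifiers in the table can be combined in a single sketch (the function names here are hypothetical):</p>
<pre data-role="codeBlock" data-info="c" class="language-c">__device__ float square(float v){ return v * v; }   // runs on device, called from device

__global__ void square_all(float *x, int n){        // runs on device, called from host
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i &lt; n) x[i] = square(x[i]);
}

__host__ void launch_square_all(float *d_x, int n){ // ordinary CPU function
    square_all&lt;&lt;&lt;(n + 255) / 256, 256&gt;&gt;&gt;(d_x, n);
}
</pre>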
<h2 class="mume-header" id="pycuda-example">PyCUDA example</h2>
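<p>A minimal sketch of the same vector-doubling example in PyCUDA, adapted from the canonical example in PyCUDA's documentation; it requires an NVIDIA GPU with <code>pycuda</code> and <code>numpy</code> installed:</p>
<pre data-role="codeBlock" data-info="python" class="language-python">import numpy as np
import pycuda.autoinit          # creates a CUDA context on import
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# compile the kernel with nvcc at run time
mod = SourceModule("""
__global__ void double_vector(float *x)
{
    const int i = threadIdx.x;
    x[i] = 2 * x[i];
}
""")
double_vector = mod.get_function("double_vector")

x = np.random.randn(400).astype(np.float32)
x_gpu = cuda.mem_alloc(x.nbytes)     # cudaMalloc
cuda.memcpy_htod(x_gpu, x)           # cudaMemcpy, host to device
double_vector(x_gpu, block=(400, 1, 1), grid=(1, 1))
result = np.empty_like(x)
cuda.memcpy_dtoh(result, x_gpu)      # cudaMemcpy, device to host
print(np.allclose(result, 2 * x))    # True if the GPU doubled every element
</pre>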
</div>
</body></html>