A few small issues (#2)

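I compared this against the current MCFCRT. MCFCRT does not plan to support AVX, so I only tested the SSE path. (Honestly I was just too lazy to change it. It would be fairly simple: every copy operation is currently packed as two consecutive `movups`, and changing that one spot would be enough to support AVX.)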

![4311](https://user-images.githubusercontent.com/5071344/35473846-b0aeb952-03c0-11e8-8ffa-5ed06979666d.png)

<https://github.com/lhmouse/MCF/blob/master/MCFCRT/src/stdc/string/_memcpy_impl.h#L292>

```plaintext
gcc (gcc-7-branch HEAD with MCF thread model, built by LH_Mouse.) 7.3.1 20180125
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
```

The processor is an Intel Xeon E3-1230 v3 (Haswell architecture).

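I looked through your test and found a few small issues:
1. As mentioned in #1: the overhead of the extra (nested) function call has a fairly large effect on the 32-byte copy test, roughly ten-odd milliseconds.
2. Only sizes such as 32, 64 and 512 are copied; no size that leaves a remainder is tested.
3. `timeGetTime()` is too imprecise (the error is often ten-odd milliseconds). I suggest `QueryPerformanceCounter()`, which also does not require linking `winmm`.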
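
Points 2 and 3 could be addressed along these lines. This is only a rough sketch, not the benchmark from either project: `now_seconds()`, `time_copy()`, `copy_fn` and the `bench_sizes` list are hypothetical names, and the non-Windows branch is my own assumption so that the sketch builds elsewhere too. Note that calling through a function pointer adds a constant per-iteration cost, so it only makes sense for comparing implementations against each other under the same harness.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime() on non-Windows builds */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#ifdef _WIN32
#  include <windows.h>
/* High-resolution timestamp in seconds via QueryPerformanceCounter(). */
double now_seconds(void)
{
    LARGE_INTEGER freq, count;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&count);
    return (double)count.QuadPart / (double)freq.QuadPart;
}
#else
#  include <time.h>
/* Assumption: monotonic-clock fallback so the sketch also builds off Windows. */
double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}
#endif

/* Any function with the memcpy() signature can be timed. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/* Hypothetical size list: the original sizes plus ones that leave a remainder. */
const size_t bench_sizes[] = { 32, 33, 47, 64, 100, 512, 515 };

/* Run `rounds` copies of `size` bytes and return the elapsed time in seconds. */
double time_copy(copy_fn fn, void *dst, const void *src, size_t size,
                 unsigned long rounds)
{
    assert(rounds > 0);
    double start = now_seconds();
    for (unsigned long i = 0; i < rounds; ++i) {
        fn(dst, src, size);
    }
    return now_seconds() - start;
}
```

A caller would loop over `bench_sizes` and pass each implementation under test (for example `memcpy` itself as a baseline) to `time_copy()`.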

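There are also some issues with the implementation:
1. Declaring every function `static` looks suspiciously like cheating by exploiting the compiler's aggressive optimization of functions with internal linkage. (runs away)
2. The final `memcpy_tiny()` call in `memcpy_fast()` is not actually needed: since the `if (size <= 128) {` condition above did not hold, the last xmmword can simply be handled with `_mm_storeu_si128()`.
3. Prefetching made little difference in my measurements.
4. I don't recommend the integer SSE instructions. They require SSE2, and their encodings are longer, so cache locality is worse. By comparison, `_mm_{loadu,load,storeu,store,stream}_ps()` require only SSE and have shorter encodings. (Because of a latency problem on Core2, GCC turns `movups` into a combination of `movlps` and `movhps`. That is a side note, though; Haswell no longer has this latency problem.)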
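
To illustrate points 2 and 4 together, here is a minimal sketch of finishing a copy with one overlapping unaligned xmmword instead of a separate small-copy routine. `copy_with_overlapping_tail()` is a hypothetical helper, not code from `memcpy_fast()` or MCFCRT. It assumes non-overlapping buffers and `size >= 16`, and it uses the `_ps` intrinsics rather than `_mm_storeu_si128()` so that only SSE is required.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE only: _mm_loadu_ps() / _mm_storeu_ps() */

/* Hypothetical helper: copies `size` bytes (size >= 16, buffers must not
 * overlap) in 16-byte blocks, then finishes with one unaligned 16-byte copy
 * that ends exactly at the last byte. That final copy may rewrite bytes the
 * loop already wrote, which is harmless because the data is identical. */
void *copy_with_overlapping_tail(void *dst, const void *src, size_t size)
{
    assert(size >= 16);
    char *d = dst;
    const char *s = src;
    size_t offset = 0;

    /* Main loop: whole xmmwords only. */
    for (; offset + 16 <= size; offset += 16) {
        _mm_storeu_ps((float *)(d + offset),
                      _mm_loadu_ps((const float *)(s + offset)));
    }
    /* Tail: re-copy the last 16 bytes instead of branching on the remainder. */
    if (offset != size) {
        _mm_storeu_ps((float *)(d + size - 16),
                      _mm_loadu_ps((const float *)(s + size - 16)));
    }
    return dst;
}
```

The trade-off is one possibly redundant store in exchange for not having to dispatch on the remainder, which is why the size precondition matters: below 16 bytes the tail copy would reach before the start of the buffers.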

