Yingchi Long
longyingchi24s@ict.ac.cn
Institute of Computing Technology, CAS
| Category | Benchmark | SIMD off | SIMD on | Improvement |
|---|---|---|---|---|
| Integer | 600.perlbench_s | 3.4 | 3.39 | -0.29% |
| | 602.gcc_s | 5.79 | 5.79 | 0.00% |
| | 605.mcf_s | 4.32 | 4.1 | -5.09% |
| | 620.omnetpp_s | 3.01 | 2.99 | -0.66% |
| | 623.xalancbmk_s | 3.45 | 3.72 | 7.83% |
| | 625.x264_s | 4.1 | 7.18 | 75.12% |
| | 631.deepsjeng_s | 3.71 | 3.65 | -1.62% |
| | 641.leela_s | 3.13 | 3.16 | 0.96% |
| | 648.exchange2_s | 5.82 | 5.91 | 1.55% |
| | 657.xz_s | 2.17 | 2.23 | 2.76% |
| | Geometric mean | 3.74 | 3.97 | 6.29% |
| Category | Benchmark | SIMD off | SIMD on | Improvement |
|---|---|---|---|---|
| Floating point | 603.bwaves_s | 10.8 | 10.8 | 0.00% |
| | 607.cactuBSSN_s | 2.01 | 2.02 | 0.50% |
| | 619.lbm_s | 3.3 | 3.38 | 2.42% |
| | 621.wrf_s | 2.09 | 3.29 | 57.42% |
| | 627.cam4_s | 1.03 | 1.03 | 0.00% |
| | 628.pop2_s | 1.71 | 1.97 | 15.20% |
| | 638.imagick_s | 1.6 | 1.6 | 0.00% |
| | 644.nab_s | 3.91 | 3.92 | 0.26% |
| | 649.fotonik3d_s | 3.59 | 5.14 | 43.18% |
| | 654.roms_s | 1.85 | 2.79 | 50.81% |
| | Geometric mean | 2.53 | 2.91 | 14.99% |
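The geometric-mean rows in the tables above can be recomputed from the per-benchmark scores. The following is a minimal sketch using the integer scores as printed; the helper name `geomean` is an assumption made for illustration.

```python
import math

def geomean(scores):
    """Geometric mean: exp of the arithmetic mean of the logarithms."""
    return math.exp(sum(math.log(s) for s in scores) / len(scores))

# Integer benchmark scores copied from the table (SIMD off / SIMD on).
int_off = [3.4, 5.79, 4.32, 3.01, 3.45, 4.1, 3.71, 3.13, 5.82, 2.17]
int_on = [3.39, 5.79, 4.1, 2.99, 3.72, 7.18, 3.65, 3.16, 5.91, 2.23]

print(round(geomean(int_off), 2), round(geomean(int_on), 2))
```

The table entries are already rounded, so an improvement recomputed from them need not match the reported percentage to the last digit.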
Things to do:
Basic data types: int, float, ...
How do we represent vector data types?
TargetTransformInfo: what might the middle end want to know?
How does the backend implement it?
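| Cost kind provided by the backend | Semantics |
|---|---|
| TCK_RecipThroughput | Reciprocal throughput |
| TCK_Latency | Instruction latency |
| TCK_CodeSize | Binary size produced by the instruction |
| TCK_SizeAndLatency | Weighted sum of binary size and latency |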
Funny thing:
Backend: every cost kind returns the same number
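To make the observation concrete, here is a toy model of a backend cost query. The four enumerator names come from the table above; the function `toy_backend_cost`, the opcode names, and the cost numbers are all made up for illustration and are not taken from any real target.

```python
from enum import Enum

class TargetCostKind(Enum):
    TCK_RecipThroughput = 0
    TCK_Latency = 1
    TCK_CodeSize = 2
    TCK_SizeAndLatency = 3

# Hypothetical per-opcode costs; a real backend would derive these
# from its scheduling model or instruction tables.
TOY_COSTS = {"add": 1, "mul": 3}

def toy_backend_cost(opcode, kind):
    """A backend that ignores `kind` and answers with one number."""
    return TOY_COSTS[opcode]

for kind in TargetCostKind:
    print(kind.name, toy_backend_cost("mul", kind))
```

With such a backend the middle end cannot tell a throughput-oriented query from a code-size one, because every query gets the same answer.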
digraph {
1 -> "+"
2 -> "+"
"+" -> "*"
2 -> "*"
}
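Read with edges pointing from operand to user, the graph above is the expression (1 + 2) * 2, with the constant 2 shared by both operators. A minimal sketch of evaluating such a DAG follows; the dictionary layout and the function name `evaluate` are assumptions for illustration.

```python
import operator

# Each node is either a constant or (operation, [operand node names]).
dag = {
    "1": 1,
    "2": 2,
    "+": (operator.add, ["1", "2"]),
    "*": (operator.mul, ["+", "2"]),
}

def evaluate(name, cache=None):
    """Evaluate a node, computing each shared node only once."""
    if cache is None:
        cache = {}
    if name not in cache:
        node = dag[name]
        if isinstance(node, tuple):
            op, operands = node
            cache[name] = op(*(evaluate(o, cache) for o in operands))
        else:
            cache[name] = node
    return cache[name]

print(evaluate("*"))
```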
It is not that simple
define i32 @add(i32 %a, i32 %b) {
%ret = add i32 %a, %b
ret i32 %ret
}
digraph "dag-combine1 input for add:" {
rankdir="BT";
Node0x431106d0 [shape=record,shape=Mrecord,label="{EntryToken|t0|{ch|glue}}"];
Node0x43168070 [shape=record,shape=Mrecord,label="{Register %0|t1|{i32}}"];
Node0x431680e0 [shape=record,shape=Mrecord,label="{{0|1}|CopyFromReg|t2|{i32|ch}}"];
Node0x431680e0:s0 -> Node0x431106d0:d0[color=blue,style=dashed];
Node0x431680e0:s1 -> Node0x43168070:d0;
Node0x43168150 [shape=record,shape=Mrecord,label="{Register %1|t3|{i32}}"];
Node0x431681c0 [shape=record,shape=Mrecord,label="{{0|1}|CopyFromReg|t4|{i32|ch}}"];
Node0x431681c0:s0 -> Node0x431106d0:d0[color=blue,style=dashed];
Node0x431681c0:s1 -> Node0x43168150:d0;
Node0x43168230 [shape=record,shape=Mrecord,label="{{0|1}|add|t5|{i32}}"];
Node0x43168230:s0 -> Node0x431680e0:d0;
Node0x43168230:s1 -> Node0x431681c0:d0;
}
define void @store(ptr %f) {
store i32 1, ptr %f
store i32 2, ptr %f
ret void
}
digraph "dag-combine1 input for store:" {
rankdir="BT";
Node0x3776a290 [shape=record,shape=Mrecord,label="{EntryToken|t0|{ch|glue}}"];
Node0x377c1af0 [shape=record,shape=Mrecord,label="{Register %0|t1|{i64}}"];
Node0x377c1b60 [shape=record,shape=Mrecord,label="{{0|1}|CopyFromReg|t2|{i64|ch}}"];
Node0x377c1b60:s0 -> Node0x3776a290:d0[color=blue,style=dashed];
Node0x377c1b60:s1 -> Node0x377c1af0:d0;
Node0x377c1bd0 [shape=record,shape=Mrecord,label="{Constant\<1\>|t3|{i32}}"];
Node0x377c1cb0 [shape=record,shape=Mrecord,label="{undef|t5|{i64}}"];
Node0x377c1d20 [shape=record,shape=Mrecord,label="{{0|1|2|3}|store\<(store (s32) into %ir.f)\>|t6|{ch}}"];
Node0x377c1d20:s0 -> Node0x3776a290:d0[color=blue,style=dashed];
Node0x377c1d20:s1 -> Node0x377c1bd0:d0;
Node0x377c1d20:s2 -> Node0x377c1b60:d0;
Node0x377c1d20:s3 -> Node0x377c1cb0:d0;
Node0x377c1d90 [shape=record,shape=Mrecord,label="{Constant\<2\>|t7|{i32}}"];
Node0x377c1e00 [shape=record,shape=Mrecord,label="{{0|1|2|3}|store\<(store (s32) into %ir.f)\>|t8|{ch}}"];
Node0x377c1e00:s0 -> Node0x377c1d20:d0[color=blue,style=dashed];
Node0x377c1e00:s1 -> Node0x377c1d90:d0;
Node0x377c1e00:s2 -> Node0x377c1b60:d0;
Node0x377c1e00:s3 -> Node0x377c1cb0:d0;
}
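In the dump above, the second store (t8) takes the first store (t6) as its chain operand (operand 0, the dashed blue edge), while the first store is chained to the EntryToken. That chain edge is what keeps the two stores to the same address in program order. The sketch below transcribes the operand lists from the dump and checks the dependency; the dictionary layout and the helper name `depends_on` are assumptions for illustration.

```python
# Operand lists transcribed from the "store" dump; operand 0 of each
# store is its chain input.
operands = {
    "t0": [],                        # EntryToken
    "t1": [],                        # Register %0
    "t2": ["t0", "t1"],              # CopyFromReg
    "t3": [],                        # Constant<1>
    "t5": [],                        # undef
    "t6": ["t0", "t3", "t2", "t5"],  # store of 1
    "t7": [],                        # Constant<2>
    "t8": ["t6", "t7", "t2", "t5"],  # store of 2
}

def depends_on(node, target):
    """True if `target` is reachable from `node` via operand edges."""
    stack, seen = [node], set()
    while stack:
        cur = stack.pop()
        if cur == target:
            return True
        if cur in seen:
            continue
        seen.add(cur)
        stack.extend(operands[cur])
    return False

print(depends_on("t8", "t6"), depends_on("t6", "t8"))
```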
Some Arch supports vector instructions:
How to legalize each Op:
ADD/SUB, SRA, MUL, LOAD/STORE, BSWAP, ABS
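Load/Store?
Split into two + Combine
Load p $\Rightarrow$
Load p_low + Load p_hi
If the access is still misaligned, a memcpy is needed
Address alignment is determined by the front end and carried through the middle end all the way to the back end
The analysis must be correct, and so must the propagation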
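The split can be sketched by modelling memory as a byte string: a load that is too wide for the target becomes two half-width loads whose results are concatenated. The widths (32 bytes split into two 16-byte halves) and the function names `load` and `load_split` are assumptions chosen for illustration.

```python
def load(memory, addr, width):
    """Model of a single load of `width` bytes at `addr`."""
    return memory[addr:addr + width]

def load_split(memory, addr, width):
    """Legalize a wide load as low half + high half, then recombine."""
    half = width // 2
    low = load(memory, addr, half)           # Load p_low
    high = load(memory, addr + half, half)   # Load p_hi
    return low + high

memory = bytes(range(64))
print(load_split(memory, 8, 32) == load(memory, 8, 32))
```

The sketch ignores alignment entirely; in a real backend the alignment known for p determines what can be assumed about each half.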
Byte Swap (from Chromium)?
Whole-value shifts + or
$(A_0, A_1, A_2, A_3)$
$(A_3, O, O, O)$
$(O, A_2, O, O)$
$(O, O, A_1, O)$
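The shift-and-or scheme can be sketched on a single 32-bit element: each byte is shifted to its mirrored position, masked so that the other positions are zero, and the partial results are or-ed together. This is a scalar model of one vector lane, and the function name `bswap32` is an assumption for illustration.

```python
def bswap32(x):
    """Reverse the four bytes of a 32-bit value using shifts, masks, and or."""
    return (
        ((x >> 24) & 0x000000FF)    # top byte moves to the bottom
        | ((x >> 8) & 0x0000FF00)   # second byte moves down one position
        | ((x << 8) & 0x00FF0000)   # third byte moves up one position
        | ((x << 24) & 0xFF000000)  # bottom byte moves to the top
    )

print(hex(bswap32(0x11223344)))
```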
ABS?
(SRA x, type_size - 1) = $t$
(ABS x) = (XOR (ADD x, $t$), $t$)
Example: the 8-bit integer -5 = $11111011_2$
$t$ = -1 = $11111111_2$
(ADD x, $t$) = $11111010_2$
$11111010_2$ xor $11111111_2$
= $00000101_2$ = 5
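The expansion can be checked with a small model of 8-bit two's-complement arithmetic. Python integers are unbounded, so the sketch masks every result to 8 bits; the helper names `sra` and `abs_via_sra` are assumptions for illustration.

```python
BITS = 8
MASK = (1 << BITS) - 1

def sra(u, n):
    """Arithmetic shift right of a BITS-wide bit pattern."""
    signed = u - (1 << BITS) if u & (1 << (BITS - 1)) else u
    return (signed >> n) & MASK

def abs_via_sra(u):
    """(ABS x) = (XOR (ADD x, t), t) with t = (SRA x, BITS - 1)."""
    t = sra(u, BITS - 1)             # all ones if x < 0, else all zeros
    return ((u + t) & MASK) ^ t

print(bin(abs_via_sra(0b11111011)))  # bit pattern of -5 as input
```

As with any fixed-width absolute value, the most negative input (-128 for 8 bits) has no positive counterpart and maps back to itself.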
What is the problem?
The Neon instruction set is only 128 bits wide, yet its speedup far outstrips that of x86
Source code of x264_pixel_satd_8x4