Yingchi Long
longyingchi24s@ict.ac.cn
Institute of Computing Technology, CAS
| Category | Benchmark | SIMD off | SIMD on | Improvement |
|---|---|---|---|---|
| Integer | 600.perlbench_s | 3.4 | 3.39 | -0.29% |
| | 602.gcc_s | 5.79 | 5.79 | 0.00% |
| | 605.mcf_s | 4.32 | 4.1 | -5.09% |
| | 620.omnetpp_s | 3.01 | 2.99 | -0.66% |
| | 623.xalancbmk_s | 3.45 | 3.72 | 7.83% |
| | 625.x264_s | 4.1 | 7.18 | 75.12% |
| | 631.deepsjeng_s | 3.71 | 3.65 | -1.62% |
| | 641.leela_s | 3.13 | 3.16 | 0.96% |
| | 648.exchange2_s | 5.82 | 5.91 | 1.55% |
| | 657.xz_s | 2.17 | 2.23 | 2.76% |
| | Geometric mean | 3.74 | 3.97 | 6.29% |
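The summary row can be reproduced from the per-benchmark scores. The sketch below is illustrative only: the helper names `geometric_mean` and `improvement_percent` and the two constant arrays are introduced here and are not from the slides. It assumes the improvement column is `on / off - 1` and that the last row is the geometric mean of the ten scores. Under that assumption the reported 6.29% matches the ratio of the unrounded means; the rounded values 3.74 and 3.97 would give about 6.15%.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Geometric mean of a set of positive benchmark scores,
// computed in log space to avoid overflow of the running product.
double geometric_mean(const std::vector<double>& scores) {
    assert(!scores.empty());
    double log_sum = 0.0;
    for (double s : scores) {
        assert(s > 0.0);
        log_sum += std::log(s);
    }
    return std::exp(log_sum / static_cast<double>(scores.size()));
}

// Relative improvement in percent when a score goes from `off` to `on`.
double improvement_percent(double off, double on) {
    return (on / off - 1.0) * 100.0;
}

// Integer benchmark scores copied from the table above (same row order).
const std::vector<double> kIntSimdOff = {3.4, 5.79, 4.32, 3.01, 3.45,
                                         4.1, 3.71, 3.13, 5.82, 2.17};
const std::vector<double> kIntSimdOn = {3.39, 5.79, 4.1, 2.99, 3.72,
                                        7.18, 3.65, 3.16, 5.91, 2.23};
```

Calling `improvement_percent(geometric_mean(kIntSimdOff), geometric_mean(kIntSimdOn))` gives the overall integer improvement without intermediate rounding.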
| Category | Benchmark | SIMD off | SIMD on | Improvement |
|---|---|---|---|---|
| Floating point | 603.bwaves_s | 10.8 | 10.8 | 0.00% |
| | 607.cactuBSSN_s | 2.01 | 2.02 | 0.50% |
| | 619.lbm_s | 3.3 | 3.38 | 2.42% |
| | 621.wrf_s | 2.09 | 3.29 | 57.42% |
| | 627.cam4_s | 1.03 | 1.03 | 0.00% |
| | 628.pop2_s | 1.71 | 1.97 | 15.20% |
| | 638.imagick_s | 1.6 | 1.6 | 0.00% |
| | 644.nab_s | 3.91 | 3.92 | 0.26% |
| | 649.fotonik3d_s | 3.59 | 5.14 | 43.18% |
| | 654.roms_s | 1.85 | 2.79 | 50.81% |
| | Geometric mean | 2.53 | 2.91 | 14.99% |
Things to do:
Basic data types: int, float, ...
How do we represent vector data types?
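As one concrete illustration: LLVM IR spells a fixed-width vector type as `<4 x i32>`, and GCC and Clang expose a comparable type in C and C++ through the `vector_size` attribute. The sketch below assumes a GCC- or Clang-compatible compiler; the names `v4i32`, `add_lanes`, and `lane` are chosen here for illustration and are not from the slides.

```cpp
#include <cassert>
#include <cstdint>

// GCC/Clang vector extension: four 32-bit lanes packed into 16 bytes.
// This is a compiler extension, not standard C++.
typedef std::int32_t v4i32 __attribute__((vector_size(16)));

// Element-wise addition; the compiler is free to use one SIMD add here.
v4i32 add_lanes(v4i32 a, v4i32 b) {
    return a + b;
}

// Read a single lane by index.
std::int32_t lane(v4i32 v, int i) {
    assert(i >= 0 && i < 4);
    return v[i];
}
```

With `v4i32 a = {1, 2, 3, 4}` and `v4i32 b = {10, 20, 30, 40}`, `add_lanes(a, b)` adds lane by lane, so lane 0 holds 11 and lane 3 holds 44.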
TargetTransformInfo
What questions might the middle end be interested in?
How does the backend implement it?
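| Cost kind provided by the backend | Semantics |
|---|---|
| TCK_RecipThroughput | Reciprocal throughput |
| TCK_Latency | Instruction latency |
| TCK_CodeSize | Size of the binary code the instruction produces |
| TCK_SizeAndLatency | Weighted sum of binary size and latency |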
Funny thing:
Backend: returns the same number for every cost kind
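A toy model of that situation is sketched below. It is not LLVM's API: the enums `CostKind` and `Opcode`, the function `toy_instruction_cost`, and the cost numbers are all made up for illustration. Only the four cost-kind names mirror the table above.

```cpp
#include <cassert>

// Mirrors the four cost-kind names from the table; a toy enum, not LLVM's.
enum class CostKind { RecipThroughput, Latency, CodeSize, SizeAndLatency };

// A handful of hypothetical opcodes for the toy backend.
enum class Opcode { Add, Mul, Div };

// Hypothetical backend hook: the cost kind is accepted but never consulted,
// so every kind yields the same (made-up) number for a given opcode.
int toy_instruction_cost(Opcode op, CostKind /*kind*/) {
    switch (op) {
    case Opcode::Add: return 1;
    case Opcode::Mul: return 3;
    case Opcode::Div: return 20;
    }
    assert(false && "unknown opcode");
    return 0;
}
```

A caller asking for `CostKind::Latency` and one asking for `CostKind::CodeSize` receive identical answers, so a middle-end pass cannot tell the two questions apart.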
digraph { 1 -> "+" 2 -> "+" "+" -> "*" 2 -> "*" }
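The graph above encodes (1 + 2) * 2, with the constant 2 feeding both the addition and the multiplication. A minimal sketch of such an expression DAG follows; the `Node` struct and the helpers `constant`, `binary`, `evaluate`, and `build_example` are hypothetical names used only here, and this is far simpler than a real SelectionDAG.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// One node of a tiny expression DAG.
struct Node {
    char op;     // '+', '*', or 'c' for a constant
    int value;   // meaningful only when op == 'c'
    std::vector<std::shared_ptr<Node>> operands;
};

std::shared_ptr<Node> constant(int v) {
    return std::make_shared<Node>(Node{'c', v, {}});
}

std::shared_ptr<Node> binary(char op, std::shared_ptr<Node> lhs,
                             std::shared_ptr<Node> rhs) {
    return std::make_shared<Node>(Node{op, 0, {std::move(lhs), std::move(rhs)}});
}

// Recursively evaluate a node; shared operands are simply visited twice.
int evaluate(const Node& n) {
    if (n.op == 'c') return n.value;
    assert(n.operands.size() == 2);
    const int lhs = evaluate(*n.operands[0]);
    const int rhs = evaluate(*n.operands[1]);
    return n.op == '+' ? lhs + rhs : lhs * rhs;
}

// Builds (1 + 2) * 2, reusing the same node for both uses of 2.
std::shared_ptr<Node> build_example() {
    auto one = constant(1);
    auto two = constant(2);
    auto sum = binary('+', one, two);
    return binary('*', sum, two);
}
```

Because `two` is shared rather than duplicated, the structure is a DAG and not a tree, which is the property the drawing illustrates.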
It is not that simple
define i32 @add(i32 %a, i32 %b) {
%ret = add i32 %a, %b
ret i32 %ret
}
digraph "dag-combine1 input for add:" { rankdir="BT"; Node0x431106d0 [shape=record,shape=Mrecord,label="{EntryToken|t0|{ch| glue}}"]; Node0x43168070 [shape=record,shape=Mrecord,label="{Register %0|t1|{ i32}}"]; Node0x431680e0 [shape=record,shape=Mrecord,label="{{ 0| 1}|CopyFromReg|t2|{ i32| ch}}"]; Node0x431680e0:s0 -> Node0x431106d0:d0[color=blue,style=dashed]; Node0x431680e0:s1 -> Node0x43168070:d0; Node0x43168150 [shape=record,shape=Mrecord,label="{Register %1|t3|{ i32}}"]; Node0x431681c0 [shape=record,shape=Mrecord,label="{{ 0| 1}|CopyFromReg|t4|{ i32| ch}}"]; Node0x431681c0:s0 -> Node0x431106d0:d0[color=blue,style=dashed]; Node0x431681c0:s1 -> Node0x43168150:d0; Node0x43168230 [shape=record,shape=Mrecord,label="{{ 0| 1}|add|t5|{ i32}}"]; Node0x43168230:s0 -> Node0x431680e0:d0; Node0x43168230:s1 -> Node0x431681c0:d0; }
define void @store(ptr %f) {
store i32 1, ptr %f
store i32 2, ptr %f
ret void
}
digraph "dag-combine1 input for store:" { rankdir="BT"; Node0x3776a290 [shape=record,shape=Mrecord,label="{EntryToken|t0|{ch| glue}}"]; Node0x377c1af0 [shape=record,shape=Mrecord,label="{Register %0|t1|{ i64}}"]; Node0x377c1b60 [shape=record,shape=Mrecord,label="{{ 0| 1}|CopyFromReg|t2|{ i64| ch}}"]; Node0x377c1b60:s0 -> Node0x3776a290:d0[color=blue,style=dashed]; Node0x377c1b60:s1 -> Node0x377c1af0:d0; Node0x377c1bd0 [shape=record,shape=Mrecord,label="{Constant\<1\>|t3|{ i32}}"]; Node0x377c1cb0 [shape=record,shape=Mrecord,label="{undef|t5|{ i64}}"]; Node0x377c1d20 [shape=record,shape=Mrecord,label="{{ 0| 1| 2| 3}|store\<(store (s32) into %ir.f)\>|t6|{ ch}}"]; Node0x377c1d20:s0 -> Node0x3776a290:d0[color=blue,style=dashed]; Node0x377c1d20:s1 -> Node0x377c1bd0:d0; Node0x377c1d20:s2 -> Node0x377c1b60:d0; Node0x377c1d20:s3 -> Node0x377c1cb0:d0; Node0x377c1d90 [shape=record,shape=Mrecord,label="{Constant\<2\>|t7|{ i32}}"]; Node0x377c1e00 [shape=record,shape=Mrecord,label="{{ 0| 1| 2| 3}|store\<(store (s32) into %ir.f)\>|t8|{ ch}}"]; Node0x377c1e00:s0 -> Node0x377c1d20:d0[color=blue,style=dashed]; Node0x377c1e00:s1 -> Node0x377c1d90:d0; Node0x377c1e00:s2 -> Node0x377c1b60:d0; Node0x377c1e00:s3 -> Node0x377c1cb0:d0; }
Some architecture supports vector instructions:
How to legalize each Op:
ADD/SUB
SRA
MUL
LOAD/STORE
BSWAP
ABS
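Some architecture supports vector instructions:
Load Store?
Split into two + Combine
Load p $\Rightarrow$
Load p_low + Load p_hi
If the address is still not aligned, a memcpy is needed
Address alignment is derived by the frontend and carried through the middle end all the way to the backend
It has to be analyzed correctly and propagated correctly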
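The split can be sketched in plain C++. The widths are an assumption made for illustration (a 256-bit load split into two 128-bit halves), since the slide does not state them; `Vec128`, `Vec256`, `load128`, and `load256_split` are hypothetical names. Using `std::memcpy` for each half means the code makes no alignment assumption about `p`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Assumed widths: the "legal" vector is 128 bits, the requested one 256 bits.
struct Vec128 { std::uint32_t lane[4]; };
struct Vec256 { Vec128 low; Vec128 high; };

// One 128-bit load. memcpy is valid for any address, aligned or not.
Vec128 load128(const unsigned char* p) {
    Vec128 v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// Load p => Load p_low + Load p_hi, then combine the halves.
Vec256 load256_split(const unsigned char* p) {
    assert(p != nullptr);
    Vec256 v;
    v.low = load128(p);
    v.high = load128(p + sizeof(Vec128));
    return v;
}
```

If the alignment of `p` were known to be sufficient, each half could be a single aligned vector load. If that knowledge is wrong, the aligned load is incorrect, which is why the alignment information has to survive every stage intact.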
Some architecture supports vector instructions:
Byte Swap (from Chromium)?
Whole-register shifts + OR
$(A_0, A_1, A_2, A_3)$
$(A_3, O, O, O)$
$(O, A_2, O, O)$
$(O, O, A_1, O)$
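The same idea on a scalar 32-bit value: each shift moves one byte to its mirrored position, a mask zeroes the other bytes, and OR merges the pieces. This is a generic sketch, not the Chromium code the slide refers to; the name `bswap32` is chosen here. It includes a fourth term for the remaining byte, which the lines above do not list.

```cpp
#include <cstdint>

// Reverse the byte order of a 32-bit value using shifts, masks, and OR.
constexpr std::uint32_t bswap32(std::uint32_t x) {
    return (x << 24)                    // lowest byte moves to the top
         | ((x << 8) & 0x00FF0000u)     // second-lowest byte moves up one
         | ((x >> 8) & 0x0000FF00u)     // second-highest byte moves down one
         | (x >> 24);                   // highest byte moves to the bottom
}

// Compile-time check that the bytes really are reversed.
static_assert(bswap32(0x11223344u) == 0x44332211u, "bytes are reversed");
```

Applying `bswap32` twice returns the original value, which makes a convenient sanity check for any legalized expansion.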
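Some architecture supports vector instructions:
ABS?
(SRA x, type_size - 1) = $t$
(ABS x) = (XOR (ADD x, $t$), $t$)
Example: the 8-bit integer -5 = $11111011_2$
$t$ = -1 = $11111111_2$
(ADD x, $t$) = $11111010_2$
$11111010_2$ xor $11111111_2$
= $00000101_2$ = 5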
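The expansion can be checked with a scalar sketch for 8-bit integers. The function name `abs_via_sra` is made up here. The code relies on `>>` being an arithmetic shift for negative signed values, which C++20 guarantees; earlier standards leave it implementation-defined, although mainstream compilers shift arithmetically. The addition and XOR are done on unsigned values to avoid signed overflow.

```cpp
#include <cstdint>

// Branch-free absolute value: t = SRA(x, 7); abs = (x + t) XOR t.
std::int8_t abs_via_sra(std::int8_t x) {
    // x is promoted to int; shifting by 7 yields 0 for x >= 0 and -1 for x < 0.
    const int t = x >> 7;
    // Wrap-around unsigned arithmetic stands in for the fixed-width ADD.
    const unsigned sum = static_cast<unsigned>(x) + static_cast<unsigned>(t);
    const unsigned result = sum ^ static_cast<unsigned>(t);
    // Keep only the low 8 bits, as the 8-bit lane would.
    return static_cast<std::int8_t>(static_cast<std::uint8_t>(result));
}
```

For `x = -5` this computes `t = -1`, then `-6 xor -1 = 5`. For non-negative `x`, `t` is 0 and both operations leave `x` unchanged. As with any two's-complement absolute value, -128 maps to itself.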
What is the problem?
The Neon instruction set is only 128 bits wide, yet its speedup far outstrips that of x86
Source code of x264_pixel_satd_8x4