Optimize the kernel implementation of layernorm with openmp #20895
Conversation
rest_mask & 0x80 ? 0xffffffff : 0, rest_mask & 0x40 ? 0xffffffff : 0,
rest_mask & 0x20 ? 0xffffffff : 0, rest_mask & 0x10 ? 0xffffffff : 0,
rest_mask & 0x8 ? 0xffffffff : 0, rest_mask & 0x4 ? 0xffffffff : 0,
rest_mask & 0x2 ? 0xffffffff : 0, rest_mask & 0x1 ? 0xffffffff : 0);
Could this be split into separate `#pragma omp parallel` and `#pragma omp for` parts, so that `__m256i mask_vec` only needs to be set once?
This has been fixed.
test=develop
ERNIE performance data is at PaddlePaddle/benchmark#180 (comment). @Xreki please take another look.
LGTM!
- Please update the data in the description (confirm it reflects the latest state of this PR), and add the test machine model.
- Please include units for the timings; for multi-thread, state "20 threads" explicitly.
…22198)

* Optimize the kernel implementation of layernorm with openmp (#20895)
* Add ernie c++ inference test (#21015)
  * Add ernie unit test test=develop
  * remove ngraph
  * optimize gpu test test=develop
  * optimize codes test=develop
* fix cmake fails on inference_download_and_uncompress (#21185)
  * solve cmake fails on inference_download_and_uncompress test=develop
* Add fc padding to improve mkl GEMM's performance when N and K are multiple of 128. (#20972)
  * Add fc padding to solve mkl performance test=develop
  * fix gpu pass and error information test=develop
  * fix fc_fuse_pass_test test=develop
  * fix error information test=develop
  * fix name and add fc op padding test test=develop
  * fix attributes test=develop
  * optimize fc padding test=develop
  * fix test test=develop
* Polish the codes of fc when needs padding (#21378) test=develop
* Add ernie large c++ inference test (#21365)
  * add ernie-large test test=develop
  * add ernie large c++ inference test test=develop
* Modify padding strategy: remove weight copy in fc padding (#21650) test=develop
* optimize fc jit (#21878) test=develop

Co-authored-by: Yihua Xu <[email protected]>
Based on an investigation of ERNIE, we found that layernorm lacks a multi-threaded JIT implementation. This PR adds a multi-threaded optimization for layernorm using OpenMP.
However, in an initial ERNIE benchmark, single-thread layernorm performance appears to be worse than before.
CPU Model: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz