
Optimize the kernel implementation of layernorm with openmp #20895

Merged

Conversation


@yihuaxu yihuaxu commented Oct 30, 2019

Based on an investigation of ERNIE, we found that layernorm lacks a multi-threaded JIT implementation. This PR adds multi-threaded optimization of layernorm using OpenMP.
However, based on an initial benchmark with ERNIE, single-thread layernorm performance appears to be worse than before.

CPU Model: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz

| Thread Number | Baseline | Optimization |
|---------------|------------|--------------|
| 1 thread      | 2341.13 ms | 2502.5 ms    |
| 20 threads    | 4009.16 ms | 847.298 ms   |
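
For context, layer normalization computes a mean and variance for each row and then normalizes that row, so the rows are independent and can be distributed across OpenMP threads. Below is a minimal sketch of that idea only; the actual Paddle kernel is a JIT-generated AVX implementation, and the function and parameter names here are illustrative assumptions.

```cpp
#include <cmath>
#include <cstddef>

// Sketch of row-parallel layer normalization over a row-major [rows, cols]
// input; scale/bias have length cols. Not the actual Paddle JIT kernel.
void LayerNormSketch(const float* x, const float* scale, const float* bias,
                     float* out, int rows, int cols, float epsilon) {
#pragma omp parallel for
  for (int i = 0; i < rows; ++i) {
    const float* row = x + static_cast<std::size_t>(i) * cols;
    float* out_row = out + static_cast<std::size_t>(i) * cols;

    // Per-row mean.
    float mean = 0.f;
    for (int j = 0; j < cols; ++j) mean += row[j];
    mean /= cols;

    // Per-row variance.
    float var = 0.f;
    for (int j = 0; j < cols; ++j) {
      float d = row[j] - mean;
      var += d * d;
    }
    var /= cols;

    // Normalize, then apply the elementwise scale and bias.
    float inv_std = 1.f / std::sqrt(var + epsilon);
    for (int j = 0; j < cols; ++j) {
      out_row[j] = (row[j] - mean) * inv_std * scale[j] + bias[j];
    }
  }
}
```

The single-thread regression in the table is consistent with the fixed overhead of entering an OpenMP parallel region; one common mitigation (not necessarily what this PR does) is to guard the pragma, e.g. `#pragma omp parallel for if (omp_get_max_threads() > 1)`.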

```cpp
rest_mask & 0x80 ? 0xffffffff : 0, rest_mask & 0x40 ? 0xffffffff : 0,
rest_mask & 0x20 ? 0xffffffff : 0, rest_mask & 0x10 ? 0xffffffff : 0,
rest_mask & 0x8 ? 0xffffffff : 0, rest_mask & 0x4 ? 0xffffffff : 0,
rest_mask & 0x2 ? 0xffffffff : 0, rest_mask & 0x1 ? 0xffffffff : 0);
```
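
The quoted arguments build an 8-lane 32-bit mask from the low bits of `rest_mask`, the usual way to handle the trailing elements of a row whose width is not a multiple of 8 floats. A hedged reconstruction of the full construction and one typical use follows; the helper names are assumptions, and only `rest_mask`/`mask_vec` come from the review thread.

```cpp
#include <immintrin.h>

// Illustrative reconstruction: turn the low 8 bits of rest_mask into a
// per-lane all-ones/all-zeros mask (lane 7 first, matching _mm256_set_epi32).
static __m256i BuildTailMask(int rest_mask) {
  return _mm256_set_epi32(
      rest_mask & 0x80 ? 0xffffffff : 0, rest_mask & 0x40 ? 0xffffffff : 0,
      rest_mask & 0x20 ? 0xffffffff : 0, rest_mask & 0x10 ? 0xffffffff : 0,
      rest_mask & 0x8 ? 0xffffffff : 0, rest_mask & 0x4 ? 0xffffffff : 0,
      rest_mask & 0x2 ? 0xffffffff : 0, rest_mask & 0x1 ? 0xffffffff : 0);
}

// Typical use: load only the valid tail floats of a row.
static __m256 LoadTail(const float* tail_ptr, __m256i mask_vec) {
  return _mm256_maskload_ps(tail_ptr, mask_vec);
}
```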
Contributor commented:
Could this be split into two parts, `#pragma omp parallel` and `#pragma omp for`, so that `__m256i mask_vec` only needs to be set once?
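
For illustration of the suggestion: with a combined `#pragma omp parallel for`, any per-thread setup written next to the loop ends up either inside the loop body or duplicated, whereas opening the parallel region first lets the mask be built once per thread before `#pragma omp for` distributes the rows. A hedged sketch under those assumptions, reusing the `BuildTailMask` helper sketched above:

```cpp
// Illustrative only; rows, cols and rest_mask stand in for the kernel's
// actual parameters, and the loop body is elided.
void RowLoopSketch(const float* x, float* out, int rows, int cols,
                   int rest_mask) {
#pragma omp parallel
  {
    // Built once per thread instead of once per row iteration.
    __m256i mask_vec = BuildTailMask(rest_mask);

#pragma omp for
    for (int i = 0; i < rows; ++i) {
      // ... per-row mean/variance/normalize on x + i * cols, using mask_vec
      //     for the trailing (cols % 8) elements, writing to out + i * cols ...
    }
  }
}
```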

@yihuaxu (Contributor Author) replied:

This has been fixed.

@luotao1 luotao1 added the Intel label Oct 30, 2019
@luotao1 luotao1 commented Oct 31, 2019

The ERNIE performance data is in PaddlePaddle/benchmark#180 (comment). @Xreki please take another look.


@luotao1 luotao1 left a comment


LGTM!

  • Please update the numbers in the description (are they from the latest version of this PR?) and add the model of the test machine.
  • Please include units for the times, and write "multi-thread" as "20 threads".

@luotao1 luotao1 merged commit b6260f3 into PaddlePaddle:develop Oct 31, 2019
seiriosPlus pushed a commit to seiriosPlus/Paddle that referenced this pull request Dec 9, 2019
Xreki pushed a commit that referenced this pull request Jan 10, 2020
…22198)

* Optimize the kernel implementation of layernorm with openmp (#20895)

* Add ernie c++ inference test (#21015)

* Add ernie unit test
test=develop

* Add ernie unit test
test=develop

* Add ernie unit test
test=develop

* remove ngraph

* optimize gpu test
test=develop

* optimize codes
test=develop

* fix cmake fails on inference_download_and_uncompress (#21185)

* solve cmake fails on inference_download_and_uncompress
test=develop

* solve cmake fails on inference_download_and_uncompress
test=develop

* Add fc padding to improve mkl GEMM's performance when N and K are multiple of 128. (#20972)

* Add fc padding to solve mkl performance
test=develop

* fix gpu pass and error information
test=develop

* fix fc_fuse_pass_test
test=develop

* fix error information
test=develop

* fix error information
test=develop

* fix name and add fc op padding test
test=develop

* fix attributes
test=develop

* optimize fc padding
test=develop

* fix test
test=develop

* Polish the codes of fc when needs padding (#21378)

test=develop

* Add ernie large c++ inference test (#21365)

* add ernie-large test
test=develop

* add ernie large c++ inference test
test=develop

* Modify padding strategy: remove weight copy in fc padding (#21650)

test=develop

* optimize fc jit (#21878)

test=develop

Co-authored-by: Yihua Xu <[email protected]>