c++ - Implementing a ticket lock with atomics generates extra mov -


i wrote naive implementation of simple ticket lock. locking part looks like:

struct ticket {     uint16_t next_ticket;     uint16_t now_serving; };  void lock(ticket* tkt) {     const uint16_t my_ticket =         __sync_fetch_and_add(&tkt->next_ticket, 1);      while (tkt->now_serving != my_ticket) {         _mm_pause();         __asm__ __volatile__("":::"memory");     }    } 

then realized rather using gcc intrinsic, can write std::atomics:

struct atom_ticket {     std::atomic<uint16_t> next_ticket;     std::atomic<uint16_t> now_serving; };  void lock(atom_ticket* tkt) {     const uint16_t my_ticket =         tkt->next_ticket.fetch_add(1, std::memory_order_relaxed);     while (tkt->now_serving.load(std::memory_order_relaxed) != my_ticket) {         _mm_pause();     }    } 

these generate almost identical assembly, latter generates additional movzwl instruction. why there mov? there better, correct way write lock()?

assembly output -march=native -o3:

 0000000000000000 <lock(ticket*)>:     0:   b8 01 00 00 00          mov    $0x1,%eax     5:   66 f0 0f c1 07          lock xadd %ax,(%rdi)     a:   66 39 47 02             cmp    %ax,0x2(%rdi)     e:   74 08                   je     18 <lock(ticket*)+0x18>    10:   f3 90                   pause      12:   66 39 47 02             cmp    %ax,0x2(%rdi)    16:   75 f8                   jne    10 <lock(ticket*)+0x10>    18:   f3 c3                   repz retq     1a:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1) 

 0000000000000020 <lock(atom_ticket*)>:    20:   ba 01 00 00 00          mov    $0x1,%edx    25:   66 f0 0f c1 17          lock xadd %dx,(%rdi)    2a:   48 83 c7 02             add    $0x2,%rdi    2e:   eb 02                   jmp    32 <lock(atom_ticket*)+0x12>    30:   f3 90                   pause   => 32:   0f b7 07                movzwl (%rdi),%eax <== ???    35:   66 39 c2                cmp    %ax,%dx    38:   75 f6                   jne    30 <lock(atom_ticket*)+0x10>    3a:   f3 c3                   repz retq  

why not cmp (%rdi),%dx directly?

first of all, think need use std::memory_order_acquire, since you're acquiring lock. if use mo_relaxed, potentially see stale data before of stores previous lock-holder did. jeff preshing's blog excellent, , has post on release/acquire semantics.

on x86, can happen if compiler re-orders loads , stores, mo_relaxed tells it's allowed to. acquire load compiles same relaxed load on x86, without reordering. every x86 asm load acquire. on weakly-ordered architectures need it, you'll whatever instructions needed load-acquire. (and on x86, you'll stop compiler reordering).


i put version of code on godbolt @ asm various compilers.


well spotted, gcc optimization failure, still present @ least in 6.0 (checked wandbox, using main return execlp("objdump", "objdump", "-mintel", "-d", argv[0], null); dump disassembler output of itself, including functions we're interested in.

it looks clang 3.7 worse this. 16bit load, zero-extends, compares.

gcc treats atomic loads specially, , apparently fails notice can fold compare. optimization pass have done happened while atomic load still represented differently regular loads, or something. i'm not gcc hacker, guesswork.

i suspect either have old gcc (4.9.2 or older), or you're building on / amd, because compiler used rep ret -march=native. should if care generating optimal code. i've noticed gcc5 making better code gcc 4.9 sometimes. (not helps in case, though :/)


i tried using uint32_t, no luck.

the performance impact of doing load , compare separately irrelevant since function busy-wait loop.

the fast path (unlocked case, loop condition false on first iteration) still 1 taken branch, , ret. however, in std:atomic version, fast path goes through loop branch. instead of 2 separate branch-predictor entries (one fast path , 1 spin-loop), spinning cause branch mispredict in next unlocked case. not issue, , new code take 1 fewer branch-predictor entries.

idk if jumping middle of loop had ill effects on uop cache of intel snb-family microarchitectures. of trace cache. agner fog's testing found same piece of code can have multiple entries in uop cache if has multiple jump entry points. function uop-cache unfriendly, since starts mov r, imm / lock xadd. lock xadd has go in uop cache line itself, because it's microcoded (more 4 uops. 9 in fact). unconditional jump ends uop cache line. i'm not sure taken conditional branch, i'd guess taken jcc ends cache line if predicted taken when decoded. (e.g. branch predictor entry still good, old uop cache entry evicted).

so first version potentially 3 uops cache lines fast-path: 1 mov (and if inlined, full previous instructions), 1 lock xadd alone, 1 macro-fused cmp/je following code (if inlined. if not, jump targets ret, may end 4th cache line 32byte code block, isn't allowed. non-inlined version of may have re-decoded every time?)

the std::atomic version again 1 uop-cache line initial mov imm (and preceding instructions), lock xadd, add / jmp, ... uh oh, 4th cache-line needed movzx / compare-and-branch uops. version more have decode bottleneck when inlined.

fortunately, frontend can still gain ground , instructions queued ooo core while running code, because lock xadd 9 uops. that's enough cover cycle or 2 of fewer uops frontend, , switch between decoding , uop-cache fetching.

the main problem here code-size, since prob. want inlined. speed-wise, fast-path worse, , non-fast path spin-loop anyway doesn't matter.

the fast-path 11 fused-domain uops old version (1 mov imm, 9 lock xadd, 1 cmp/je macro fused). cmp/je includes micro-fused memory operand.

the fast-path 41 fused-domain uops new version (1 mov imm, 9 lock xadd, 1 add, 1 jmp, 1 movzx, 1 cmp/je macro fused).

using add instead of using 8bit offset in addressing mode of movzx shooting in foot, imo. idk if gcc thinks ahead far enough make choices have loop branch target come out @ 16b boundary, or if dumb luck.


compiler-identification experiment on godbolt using op's code:

  • gcc 4.8 , earlier: use rep ret when it's branch target, -march=native -mtune=core2 (on haswell), or -march=core2.
  • gcc 4.9: uses rep ret -march=native on haswell, because haswell new it. -march=native -mtune=haswell uses ret, know name haswell.
  • gcc 5.1 , later: uses ret -march=native (on haswell). still uses rep ret when -march isn't specified.

Comments

Popular posts from this blog

How to show in django cms breadcrumbs full path? -

php - Invalid Cofiguration - yii\base\InvalidConfigException - Yii2 -

ruby on rails - npm error: tunneling socket could not be established, cause=connect ETIMEDOUT -