c++ - Implementing a ticket lock with atomics generates extra mov -
i wrote naive implementation of simple ticket lock. locking part looks like:
struct ticket { uint16_t next_ticket; uint16_t now_serving; }; void lock(ticket* tkt) { const uint16_t my_ticket = __sync_fetch_and_add(&tkt->next_ticket, 1); while (tkt->now_serving != my_ticket) { _mm_pause(); __asm__ __volatile__("":::"memory"); } }
then realized rather using gcc intrinsic, can write std::atomic
s:
struct atom_ticket { std::atomic<uint16_t> next_ticket; std::atomic<uint16_t> now_serving; }; void lock(atom_ticket* tkt) { const uint16_t my_ticket = tkt->next_ticket.fetch_add(1, std::memory_order_relaxed); while (tkt->now_serving.load(std::memory_order_relaxed) != my_ticket) { _mm_pause(); } }
these generate almost identical assembly, latter generates additional movzwl
instruction. why there mov
? there better, correct way write lock()
?
assembly output -march=native -o3
:
0000000000000000 <lock(ticket*)>: 0: b8 01 00 00 00 mov $0x1,%eax 5: 66 f0 0f c1 07 lock xadd %ax,(%rdi) a: 66 39 47 02 cmp %ax,0x2(%rdi) e: 74 08 je 18 <lock(ticket*)+0x18> 10: f3 90 pause 12: 66 39 47 02 cmp %ax,0x2(%rdi) 16: 75 f8 jne 10 <lock(ticket*)+0x10> 18: f3 c3 repz retq 1a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
0000000000000020 <lock(atom_ticket*)>: 20: ba 01 00 00 00 mov $0x1,%edx 25: 66 f0 0f c1 17 lock xadd %dx,(%rdi) 2a: 48 83 c7 02 add $0x2,%rdi 2e: eb 02 jmp 32 <lock(atom_ticket*)+0x12> 30: f3 90 pause => 32: 0f b7 07 movzwl (%rdi),%eax <== ??? 35: 66 39 c2 cmp %ax,%dx 38: 75 f6 jne 30 <lock(atom_ticket*)+0x10> 3a: f3 c3 repz retq
why not cmp (%rdi),%dx
directly?
first of all, think need use std::memory_order_acquire
, since you're acquiring lock. if use mo_relaxed
, potentially see stale data before of stores previous lock-holder did. jeff preshing's blog excellent, , has post on release/acquire semantics.
on x86, can happen if compiler re-orders loads , stores, mo_relaxed
tells it's allowed to. acquire load compiles same relaxed load on x86, without reordering. every x86 asm load acquire. on weakly-ordered architectures need it, you'll whatever instructions needed load-acquire. (and on x86, you'll stop compiler reordering).
i put version of code on godbolt @ asm various compilers.
well spotted, gcc optimization failure, still present @ least in 6.0 (checked wandbox, using main
return execlp("objdump", "objdump", "-mintel", "-d", argv[0], null);
dump disassembler output of itself, including functions we're interested in.
it looks clang 3.7 worse this. 16bit load, zero-extends, compares.
gcc treats atomic loads specially, , apparently fails notice can fold compare. optimization pass have done happened while atomic load still represented differently regular loads, or something. i'm not gcc hacker, guesswork.
i suspect either have old gcc (4.9.2 or older), or you're building on / amd, because compiler used rep ret
-march=native
. should if care generating optimal code. i've noticed gcc5 making better code gcc 4.9 sometimes. (not helps in case, though :/)
i tried using uint32_t, no luck.
the performance impact of doing load , compare separately irrelevant since function busy-wait loop.
the fast path (unlocked case, loop condition false on first iteration) still 1 taken branch, , ret. however, in std:atomic version, fast path goes through loop branch. instead of 2 separate branch-predictor entries (one fast path , 1 spin-loop), spinning cause branch mispredict in next unlocked case. not issue, , new code take 1 fewer branch-predictor entries.
idk if jumping middle of loop had ill effects on uop cache of intel snb-family microarchitectures. of trace cache. agner fog's testing found same piece of code can have multiple entries in uop cache if has multiple jump entry points. function uop-cache unfriendly, since starts mov r, imm / lock xadd
. lock xadd has go in uop cache line itself, because it's microcoded (more 4 uops. 9 in fact). unconditional jump ends uop cache line. i'm not sure taken conditional branch, i'd guess taken jcc ends cache line if predicted taken when decoded. (e.g. branch predictor entry still good, old uop cache entry evicted).
so first version potentially 3 uops cache lines fast-path: 1 mov
(and if inlined, full previous instructions), 1 lock xadd
alone, 1 macro-fused cmp/je
following code (if inlined. if not, jump targets ret
, may end 4th cache line 32byte code block, isn't allowed. non-inlined version of may have re-decoded every time?)
the std::atomic version again 1 uop-cache line initial mov imm
(and preceding instructions), lock xadd
, add / jmp
, ... uh oh, 4th cache-line needed movzx / compare-and-branch
uops. version more have decode bottleneck when inlined.
fortunately, frontend can still gain ground , instructions queued ooo core while running code, because lock xadd
9 uops. that's enough cover cycle or 2 of fewer uops frontend, , switch between decoding , uop-cache fetching.
the main problem here code-size, since prob. want inlined. speed-wise, fast-path worse, , non-fast path spin-loop anyway doesn't matter.
the fast-path 11 fused-domain uops old version (1 mov imm
, 9 lock xadd
, 1 cmp/je
macro fused). cmp/je
includes micro-fused memory operand.
the fast-path 41 fused-domain uops new version (1 mov imm
, 9 lock xadd
, 1 add
, 1 jmp
, 1 movzx
, 1 cmp/je
macro fused).
using add
instead of using 8bit offset in addressing mode of movzx
shooting in foot, imo. idk if gcc thinks ahead far enough make choices have loop branch target come out @ 16b boundary, or if dumb luck.
compiler-identification experiment on godbolt using op's code:
- gcc 4.8 , earlier: use
rep ret
when it's branch target,-march=native -mtune=core2
(on haswell), or-march=core2
. - gcc 4.9: uses
rep ret
-march=native
on haswell, because haswell new it.-march=native -mtune=haswell
usesret
, know namehaswell
. - gcc 5.1 , later: uses
ret
-march=native
(on haswell). still usesrep ret
when-march
isn't specified.
Comments
Post a Comment