c plus plus: mysterious rtm abort using haswell tsx

I'm experimenting with the tsx extensions in haswell, by adapting an existing medium-sized (1000's of lines) codebase to using GCC transactional memory extensions (which indirectly are using haswell tsx in this machine) instead of coarse grained locks.

I'm having issues getting it to work fast enough because I get high rates of hardware transaction abort for mysterious reasons. As shown below, these aborts are not due to conflicts nor due to capacity limitations.

Here is the perf command I used to quantify the failure rate and underlying causes:

perf stat \
 -e cpu/event=0x54,umask=0x2,name=tx_mem_abort_capacity_write/ \
 -e cpu/event=0x54,umask=0x1,name=tx_mem_abort_conflict/ \
 -e cpu/event=0x5d,umask=0x1,name=tx_exec_misc1/ \
 -e cpu/event=0x5d,umask=0x2,name=tx_exec_misc2/ \
 -e cpu/event=0x5d,umask=0x4,name=tx_exec_misc3/ \
 -e cpu/event=0x5d,umask=0x8,name=tx_exec_misc4/ \
 -e cpu/event=0x5d,umask=0x10,name=tx_exec_misc5/ \
 -e cpu/event=0xc9,umask=0x1,name=rtm_retired_start/ \
 -e cpu/event=0xc9,umask=0x2,name=rtm_retired_commit/ \
 -e cpu/event=0xc9,umask=0x4,name=rtm_retired_aborted/pp \
 -e cpu/event=0xc9,umask=0x8,name=rtm_retired_aborted_misc1/ \
 -e cpu/event=0xc9,umask=0x10,name=rtm_retired_aborted_misc2/ \
 -e cpu/event=0xc9,umask=0x20,name=rtm_retired_aborted_misc3/ \
 -e cpu/event=0xc9,umask=0x40,name=rtm_retired_aborted_misc4/ \
 -e cpu/event=0xc9,umask=0x80,name=rtm_retired_aborted_misc5/ \ 
./myprogram -th 1 -reps 3000000

So, the program runs some code with transactions in it 30 million times. Each request involves one transaction gcc __transaction_atomic block. There is only one thread in this run.

This particular perf command captures most of the relevant tsx performance events described in the Intel software developers manual vol 3.

The output from perf stat is the following:

             0 tx_mem_abort_capacity_write                                  [26.66%]
             0 tx_mem_abort_conflict                                        [26.65%]
    29,937,894 tx_exec_misc1                                                [26.71%]
             0 tx_exec_misc2                                                [26.74%]
             0 tx_exec_misc3                                                [26.80%]
             0 tx_exec_misc4                                                [26.92%]
             0 tx_exec_misc5                                                [26.83%]
    29,906,632 rtm_retired_start                                            [26.79%]
             0 rtm_retired_commit                                           [26.70%]
    29,985,423 rtm_retired_aborted                                          [26.66%]
             0 rtm_retired_aborted_misc1                                    [26.75%]
             0 rtm_retired_aborted_misc2                                    [26.73%]
    29,927,923 rtm_retired_aborted_misc3                                    [26.71%]
             0 rtm_retired_aborted_misc4                                    [26.69%]
           176 rtm_retired_aborted_misc5                                    [26.67%]

  10.583607595 seconds time elapsed

As you can see from the output:

The rtm_retired_start count is 30 million (matches input to program)
The rtm_retired_abort count is about the same (no commits at all)
The abort_conflict and abort_capacity counts are 0, so these are not the reasons. Also, recall it is only one thread running, conflicts should be rare.
The only actual leads here are the high values of tx_exec_misc1 and rtm_retired_aborted_misc3, which are somewhat similar in description.

The Intel manual (vol 3) defines rtm_retired_aborted_misc3 counters:

code: C9H 20H

mnemonic: RTM_RETIRED.ABORTED_MISC3

description: Number of times an RTM execution aborted due to HLE unfriendly instructions.

The definition for tx_exec_misc1 has some similar words:

code: 5DH 01H

mnemonic: TX_EXEC.MISC1

description: Counts the number of times a class of instructions that may cause a transactional abort was executed. Since this is the count of execution, it may not always cause a transactional abort.

I checked the assembly location for the aborts using perf record/ perf report using high precision (PEBS) support for rtm_retired_aborted. The location has a mov instruction from register to register. No weird instruction names seen nearby.

Could someone clarify / help me understanding of the causes for misc3 aborts (and misc1 tx_exec) in general? Perhaps someone else is running into this problem as well?

some clarifying notes:

I am using GCC's transactional_memory extensions, not writing my own _xbegin / _xend directly. I am using the ITM_DEFAULT_METHOD=htm
Sadly, this behavior is somewhat erratic. At some point, in one version of the code, it would happen when compiled with -O1 but not with -O2 nor with -O0. Later on, it also started happening with O2 when I made some various changes to the code for unrelated reasons. In all these cases, the 'culprit' instructions blamed by perf with PEBS enabled were not obviously wrong (mov register to register, push to the stack). And there were no obvious bad instructions nearby either.
I'm going to try reproing this problem in a smaller codebase, but even now, sometimes the hardware success rate does go up to 99.9% due to seemingly minor changes. This means it is hard to keep shaving off code and keep the thing happening, but it is also hard to understand the difference between the changes.

c plus plus

mercredi 6 mai 2015

mysterious rtm abort using haswell tsx

Aucun commentaire:

Enregistrer un commentaire