atomic: Use __atomic_thread_fence() when available

This avoids requiring inline assembly for each architecture.
It also fixes some weakly ordered architectures that lacked such
inline assembly (RISC-V, MIPS, LoongArch, etc.) and were thus
disastrously broken.
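
For illustration only, a minimal sketch of the general approach,
assuming GCC or Clang. The names HAVE_ATOMIC_THREAD_FENCE,
memory_barrier(), publish() and consume() are hypothetical and are
not taken from this patch; the actual detection logic and macro names
in the tree may differ.

```c
/* Hypothetical feature test: prefer the compiler builtin if present. */
#if defined(__has_builtin)
#  if __has_builtin(__atomic_thread_fence)
#    define HAVE_ATOMIC_THREAD_FENCE 1
#  endif
#endif
/* GCC has provided the __atomic builtins since 4.7. */
#if !defined(HAVE_ATOMIC_THREAD_FENCE) && defined(__GNUC__) && \
    (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 7))
#  define HAVE_ATOMIC_THREAD_FENCE 1
#endif

#ifdef HAVE_ATOMIC_THREAD_FENCE
/* One definition covers every architecture the compiler targets. */
#  define memory_barrier() __atomic_thread_fence(__ATOMIC_SEQ_CST)
#else
/* A per-architecture inline assembly barrier would be needed here. */
#  error "no __atomic_thread_fence(): add an architecture-specific barrier"
#endif

static int payload; /* plain data, published via the flag below */
static int ready;   /* flag, accessed only with relaxed atomics */

/* Writer: store the data, fence, then raise the flag. */
static void publish(int value)
{
    payload = value;
    memory_barrier();
    __atomic_store_n(&ready, 1, __ATOMIC_RELAXED);
}

/* Reader: returns -1 if nothing has been published yet. */
static int consume(void)
{
    if (!__atomic_load_n(&ready, __ATOMIC_RELAXED))
        return -1;
    memory_barrier();
    return payload;
}
```

On a weakly ordered CPU, a memory_barrier() that expands to nothing
would let the flag store become visible before the payload store,
which is the kind of breakage described above.
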
(cherry picked from commit 2858a32723924f2015ce44452b45505b02994103)