arm: use optimized find_first_set_bit() on Cortex-M
Use the optimized version based on __builtin_ctz(), which GCC
compiles to two instructions (rbit, clz) on Cortex-M4/M7. This is
both faster and smaller than the hand-coded assembly version.
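
For illustration only, a minimal sketch of what a __builtin_ctz()-based
helper could look like. The signature, the return type and the handling
of a zero argument are assumptions, not taken from the tree:

```c
#include <stdint.h>

/*
 * Hypothetical sketch: return the index of the least significant set bit.
 * __builtin_ctz() is undefined for 0, so zero is handled explicitly here;
 * returning -1 in that case is an assumption made for this example.
 */
static inline int find_first_set_bit(uint32_t value)
{
	if (value == 0)
		return -1;

	/* On Cortex-M4/M7, GCC typically emits rbit followed by clz. */
	return __builtin_ctz(value);
}
```

With this sketch, find_first_set_bit(0x8) yields 3 and
find_first_set_bit(0x80000000) yields 31.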
Change-Id: I33f69ff829b048f1e53fc7ead1bd6ac3c5bd7a4c