Best practices for using models that require flash_attn on Apple silicon Macs (or other non-CUDA devices)?

Many models on Hugging Face seem to require flash_attn, even though my understanding is that most of them can actually work fine without it. A few examples:

What is the best practice for getting them working on Apple M2/M3 laptops (ideally with Metal support)? Obviously flash_attn won’t be available, but there is still plenty of value in working with models locally on a laptop before they need the higher efficiency of flash_attn and CUDA.

I’ve found a few directional hints, but none of them have worked:

In theory, you should be able to monkey patch out the exception triggered in transformers.dynamic_module_utils, but I cannot get that to work.

In theory, you should be able to run FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE pip install flash-attn==2.5.8, but that fails to build (due to some strange issue with os.rename not working on macOS).

Has anybody gotten these models working? Is there a general solution that Hugging Face can implement to allow these models to run / train (even if it isn’t very efficient) on non-CUDA devices?

Just as I posted it, I found at least one solution (the monkey patch approach) that works!
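For anyone landing here later, this is a rough sketch of what a monkey patch along these lines can look like. It is not necessarily identical to the solution I found; it assumes the model's remote code only lists flash_attn as an import and can fall back to another attention path at runtime. The helper names (without_flash_attn, load_without_flash_attn) are made up for illustration, and the model ID and the "mps" device are placeholders you would swap for your own.

```python
from unittest.mock import patch


def without_flash_attn(original_get_imports):
    """Wrap an import-scanning function so 'flash_attn' is dropped from its result."""
    def patched(filename):
        # Keep every detected import except flash_attn
        return [name for name in original_get_imports(filename) if name != "flash_attn"]
    return patched


def load_without_flash_attn(model_id, device="mps"):
    """Hypothetical helper: load a trust_remote_code model while hiding flash_attn.

    Assumes the model can run without flash_attn once the import check passes.
    """
    # Imported lazily so without_flash_attn() stays usable on its own
    from transformers import AutoModelForCausalLM, dynamic_module_utils

    patched = without_flash_attn(dynamic_module_utils.get_imports)
    # Patch only for the duration of the load, then restore the original
    with patch.object(dynamic_module_utils, "get_imports", patched):
        model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    return model.to(device)
```

Calling load_without_flash_attn("some-org/some-model") would then be the whole workaround. If the model's code calls flash_attn unconditionally at runtime, this will still fail later, just at a different point.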

Can something like this be built into transformers so we don’t have to do it every time?

Thanks for sharing, it works! I was also facing the same problem.