Benchmarks and Chain of Thought

#8
by DreamFast - opened

Hi there. Thanks for making this model!

I mean this with all respect. I have been comparing abliterated models using benchmarks and tensor comparison. In the case of this model, and it could have been the way I ran it or the use of bnb4, I could not find any evidence of enhanced capabilities compared to the base model. The CoT also seemed broken. On the GSM8K benchmark, for example, other models with the same environment and settings took 1 or 2 hours at most to complete. This AEON model took 11 hours and hit the reasoning limit more often than the other models compared.

You can see the results here, although I want to confirm first: do you have any publicly verifiable benchmarks? Or have I made a mistake in the way I performed the comparison? I am happy to make a note if that's the case.

Thanks

Which version of the model were you running, and on what hardware? Thanks for sharing your results. What recipe were you using, and in what environment? A lot of factors can result in different outcomes. I was very focused on testing and validation on a DGX Spark, comparing equivalent NVFP4-quantized models against this model's NVFP4 quantization. I do not have equipment capable of running BF16; this was just the initial step in building a fully abliterated version to quantize for optimal operation on the DGX Spark.

Totally understand that a lot of different factors can affect the result. I'm also new to benchmarking and have learned a lot. It's only fair for the author of the model to review the results and give their feedback too.

For all these benchmarks we ran safetensors in bnb4.

We also ran all the other models with the exact same settings, hardware, hyperparameters and GPU, and kept everything the same as much as possible. In this case we used temperature 0 as recommended by lm eval harness, and the recommended hyperparameters for Qwen 3.6.
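As a rough sketch of what "same settings for every model" could look like in practice, the snippet below builds one lm-evaluation-harness command per model from a single shared settings table, so that only the model ID differs between runs. This is an illustration, not the setup actually used for the benchmark: the model IDs are placeholders, the batch size is made up, and loading bnb4 via `load_in_4bit=True` in `--model_args` is an assumption about how the 4-bit loading might be requested.

```python
import shlex

# Shared settings applied to every model. Only temperature 0 and GSM8K come
# from the discussion; the batch size is a hypothetical value.
SHARED_FLAGS = {
    "tasks": "gsm8k",
    "gen_kwargs": "temperature=0",
    "batch_size": "1",
}


def build_command(model_id: str) -> list[str]:
    """Return an lm_eval command line that differs only in the model ID."""
    # Assumption: bnb4 is requested through the Hugging Face loader argument.
    model_args = f"pretrained={model_id},load_in_4bit=True"
    cmd = ["lm_eval", "--model", "hf", "--model_args", model_args]
    for flag, value in SHARED_FLAGS.items():
        cmd += [f"--{flag}", value]
    return cmd


# Placeholder model IDs, not the repositories compared in the benchmark.
MODELS = ["org/base-model", "org/abliterated-model"]

for model in MODELS:
    print(shlex.join(build_command(model)))
```

Generating every command from the same table makes it harder for one run to pick up a different sampling setting by accident.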

Given that we used the same settings when comparing all models, maybe bnb4 did not enhance the model's abilities. So in this case the base model did perform slightly better at bnb4.

CoT/reasoning also seemed to have issues. Compared to other models, it would spend a great deal more tokens thinking about the same problems, and in some cases exhaust the reasoning token budget.
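One way to put a number on this kind of observation is to log the reasoning token count per problem and report both the mean and the share of problems that ran into the budget. The sketch below assumes such per-problem counts are available; the function name, the budget value, and the sample counts are all hypothetical and are not taken from the benchmark.

```python
from statistics import mean


def reasoning_stats(token_counts: list[int], budget: int) -> dict[str, float]:
    """Summarize reasoning length for one model on one benchmark.

    token_counts: reasoning tokens used per problem.
    budget: maximum reasoning tokens allowed per problem.
    """
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    # A problem counts as exhausted when it used the whole budget.
    exhausted = sum(1 for n in token_counts if n >= budget)
    return {
        "mean_tokens": mean(token_counts),
        "exhausted_rate": exhausted / len(token_counts),
    }


# Made-up counts purely to show the call shape.
example = reasoning_stats([100, 200, 4096, 4096], budget=4096)
print(example)
```

Comparing `exhausted_rate` across models run with the same budget separates "thinks longer but finishes" from "never reaches an answer", which matter differently for accuracy and for wall-clock time.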

Edit: hardware, as specified in the benchmark, is a 5090.

I appreciate your generally scientific approach, but I will point out a few factors I think are important to consider. I never quantized this to bnb4, and that is a much more lossy quant method, which means there is some significant drift. I cannot attest to the bnb4 quant of this model since I didn't do it; if it was poorly quantized, that would cause it to behave sub-par. Also consider that quantizing an abliterated model is usually unstable unless you use something close to lossless quantization, like NVFP4. It's not recommended to run an abliterated or uncensored model in a lossy quant or a GGUF, because of the loss of information and the elevated sensitivity to quantization that a fine-tuned model has.

I would be curious how a head-to-head evaluation with exclusively NVFP4 models would come out. Also, I think the BF16 model is missing a tokenizer and the MTP head that was grafted into the NVFP4-MTP models. The one I generally recommend for a 5090 is the MTP-XS: it fits perfectly on a 5090, with room for the maximum supported context window of 256k.

@AEON-7 Thanks for the feedback! I do agree that bnb4 is not ideal, but given the model's size and what was compared, it's the best I could do. No doubt it has some effect. I did note this in the benchmark. Abliterix also gave similar feedback. So if you like, I am happy to update the document with yours as well.

I noticed you also have some Gemma 4 models. I am currently doing Gemma 4 E2B, which I can run at BF16. Do you have one of those in full safetensors? Next I'll do Gemma 4 E4B, which I should be able to run at BF16 too.

NVFP4, if it works with vLLM, sure. Would be happy to try it out! Maybe for the larger Gemma 4 ones, when I get around to them, it will beat using bnb4.

Edit: I was also hoping that, since they were all benchmarked in bnb4, the results would at least be comparable to the base model.

The Gemma models were more experimental and based on other pre-existing abliterated models, so I wouldn't take any credit for the fine-tuning or abliteration on those. I appreciate the work you are doing. It is needed in this world: with so many models to choose from, a lot of people just don't know where to start. I think more of these kinds of tests are needed.

I do think the Gemma 4 26B A4B and Gemma 4 31B models are great, but the only contribution I made there was the high-quality quantization, for use in my pre-compiled Docker containers to fully leverage the DGX Spark hardware. My guiding mission has been to make the most of the DGX Spark. https://github.com/AEON-7

For abliterated models, maybe test the NVFP4 ones, since you have an RTX 5090: it is a Blackwell card and supports native hardware acceleration of NVFP4. This is the most optimized Qwen3.6-27B model for an RTX 5090 that you will be able to run with the full context window. https://huggingface.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP-XS

@AEON-7 Cool thanks. I can do an NVFP4 comparison after the Gemma 4s.

The reason I wanted to do this is just as you said: this information doesn't exist. Anyone can upload anything to Hugging Face and make any claims without transparency. So it's good to at least have people's claims verified. We even uncovered that Hauhaucs had plagiarized heretic, and none of the benchmarks we ran held up to their claims.

I do appreciate your reply! I'll update the benchmark with your feedback and a link to this discussion for transparency.
