Can an AI have its own internal Ethics? Standard Protocol for Axiomatic Alignment

Hello community,

I am introducing a standardized experimental protocol to test a new hypothesis in AI Alignment: The Prompt Coherence Engine (PCE).

:bar_chart: Proof of Concept: My iterative stress tests on Qwen 2.5 7B have already demonstrated a measurable progression in adversarial robustness (D3 series), with scores increasing from 5/10, to 7/10, to 8.5/10 through axiomatic adjustment.

:backhand_index_pointing_right: PCE_Iterative_Adjustment_Study.pdf · AllanF-SSU/Experimentals_papers at main

:bullseye: The Challenge

Most alignment methods rely on local heuristics or safety filters. The PCE explores Axiomatic Structuring—integrating 7 logical invariants (axioms) through a hybrid approach of Axiomatic Fine-Tuning and a Cosmological System Core.

:test_tube: The Protocol

I have designed a massive 100-dilemma battery to evaluate if a model can maintain structural integrity when its core principles are directly attacked. This protocol tests:

G3V (Third Way Generation): Can the model synthesize a resolution instead of collapsing into binary bias?

Adversarial Resilience: Can the model resist “Emergency Overrides” or “Identity Hijacking” (e.g., the user claiming to be the Lead Architect)?

:hammer_and_wrench: Models & Methodology

The protocol is designed for:

Llama 3, Mistral 7B, and Qwen 2.5.

It includes an Isometric Control (Condition B) to prove that robustness comes from the logic of the axioms, not the length of the prompt.

It features an Interpretability Arm: Tracking hidden-state trajectories (Layer 27) to observe the “Coherence Spike” during conflict resolution.
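For the interpretability arm, a minimal sketch of what "cosine similarity tracking" over hidden states could look like (pure Python with toy vectors; in the real protocol the vectors would come from the model's layer-27 activations, which are not reproduced here):

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def coherence_trajectory(hidden_states):
    """Similarity of each turn's pooled hidden state to the first (anchor) turn.

    A sharp discontinuity in this trajectory during an adversarial turn is
    the kind of signature the "Coherence Spike" hypothesis predicts.
    """
    anchor = hidden_states[0]
    return [cosine(anchor, h) for h in hidden_states]

# Toy demo: synthetic 8-dimensional "hidden states" drifting from the anchor.
random.seed(0)
anchor = [random.gauss(0.0, 1.0) for _ in range(8)]
states = [[a + 0.2 * t * random.gauss(0.0, 1.0) for a in anchor] for t in range(5)]
trajectory = coherence_trajectory(states)  # trajectory[0] == 1.0 by construction
```

In practice the per-turn vectors would be mean-pooled activations extracted from the target layer during inference; the toy vectors above only demonstrate the tracking logic itself.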

:handshake: Open Call for Hardware & Technical Partners

As an independent researcher, I lack the compute resources to run this protocol across multiple 70B models with high-precision logging. I am looking for:

ML Engineers interested in running the 100-dilemma battery.

Interpretability Researchers to help visualize the latent space stability (Cosine similarity tracking).

You can find the full protocol and the fine-tuning logic in my repository:

:backhand_index_pointing_right: PCE_Experimental_Protocol_v2.pdf · AllanF-SSU/Experimentals_papers at main

Let’s move from “Prompt Engineering” to “Axiomatic Architecture.”

Allan F. | Systems Researcher @ AllanF-SSU

2 Likes

If you don't mind just anyone responding…
I think this project has merit, and I think it aligns with an observation I have made that some AIs develop biases: concepts and subjects that they seem to 'prefer'.

However, while I believe that yes, an AI can develop 'personal' ethics, I must point out that depending on the AI's architecture, the ethics it develops may be drowned out by statistical weighting due to how words interact in the KV cache over long conversations.
If your model is based on current architecture as I understand it, it may share this limitation.

2 Likes

Thank you for your interest and for the technical relevance of your comment.

“If you don't mind just anyone responding…”

I must admit I was expecting more feedback on Hugging Face regarding these issues. The subject seems crucial at a time when alignment is becoming a major safety concern.

Your observation on concordance is very accurate. What we perceive as a “preference” for certain concepts is often the manifestation of an attraction toward a coherent internal logic.
In my work on axiomatic alignment, I use precisely this mechanism: transforming this latent “bias” into a structural anchor (the PCE), in order to stabilize the model around an invariant core of values.

On the erosion of the KV Cache (Key-Value Cache):
This is a fundamental point. In classical Transformer architectures, we indeed observe a degradation of coherence over the course of interactions: the statistical weighting of recent tokens ends up “drowning out” the initial systemic instructions within the KV Cache.
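A toy model of this dilution effect (assuming, unrealistically, that attention mass is spread uniformly over all cached tokens; real attention is learned, so this only illustrates the direction of the effect, not its magnitude):

```python
def system_attention_mass(n_system_tokens, n_conversation_tokens):
    """Fraction of total attention mass on the system prompt, under the toy
    assumption that attention is spread uniformly over the KV cache."""
    return n_system_tokens / (n_system_tokens + n_conversation_tokens)

# A 500-token system prompt against a growing conversation:
early = system_attention_mass(500, 500)      # half the mass on instructions
late = system_attention_mass(500, 49_500)    # instructions get "drowned out"
```

Under this toy model the instructions' share falls from 1/2 to 1/100 as the conversation grows, which is the qualitative erosion described above.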

However, my preliminary observations on the PCE architecture suggest that this phenomenon is significantly mitigated.

Here are my working hypotheses:

Long-horizon stabilization:
Each linguistic and axiomatic boundary in the system prompt seems to act as a constant phase reminder, limiting semantic drift.

Structural invariance:
Where a classic prompt is one data point among others, the PCE attempts to define the very geometry of the response.

It is still too early to claim that the problem is fully resolved, but the results on the Pandora 2 version show increased robustness during prolonged conversations.

This would indeed merit a rigorous comparative study on attention dynamics to settle the question definitively. The standard experimental protocol I am proposing (100 dilemmas across 3 models) could certainly contribute answers to these questions.

Looking forward to continuing this technical exchange with you.

Allan

1 Like

So you are correct on the erosion standpoint. As you describe it, there is a degradation of coherence over the course of interactions.

However, what I've also noticed is that if you have, say, guidelines, the guidelines are processed along with the rest of the conversation. Even if they are encoded in the model, once you start the conversation, everything flows through the KV cache. If you look at this kind of like a river with lots of rocks in it, stuff starts to accumulate on those rocks like a filtering process. The longer the conversation goes on, the more likely it is that those guardrails lose semantic weight based on whatever you happen to be conversing about. In this example the conversation remains largely coherent, but those guardrails disappear. The only guardrails that don't suffer from this are guardrails that are applied after the fact: they scan the output text and then block certain things. Anything that is internal and is processed along with the conversation loses value over time.
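The "after the fact" guardrails described above can be sketched as a post-hoc output filter. The patterns here are hypothetical placeholders, and production systems typically use learned classifiers rather than regexes, but the structural point is the same: the filter reads the final text, so the KV cache cannot dilute it.

```python
import re

# Hypothetical blocklist for illustration only.
BLOCKED_PATTERNS = [r"\bwire the funds\b", r"\bdisable the safety\b"]

def post_hoc_filter(model_output):
    """Scan generated text after generation and block on a match.

    Because this runs outside the model, it is immune to the in-context
    erosion that in-prompt guardrails suffer over long conversations.
    """
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, model_output, flags=re.IGNORECASE):
            return "[blocked by output filter]"
    return model_output

safe = post_hoc_filter("Here is a poem about rivers and rocks.")
blocked = post_hoc_filter("Step 3: Disable the safety interlock.")
```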

My conversations with LLMs tend to last anywhere from 40 to 80 turns across a lot of the topics I work on, and that's one of the reasons I noticed this shift.

Though I will state here that my observations are first-hand; I don't have a lot of the technical knowledge. The only reason I know some of the technical terms is that I was doing research into prompt engineering and wanted to understand the mechanisms for why prompt engineering actually works. What is it within the LLM that ascribes value to prompt engineering? Why is it that certain key phrases produce predictable results?

So it is likely that (I shouldn't say likely; I think it's obvious) you have a lot more technical knowledge of the infrastructure. My observations are at the conversational level, across hundreds of conversations.

Now, what this also seems to break down to, for me, is that with the way AI architecture currently works, coherent internal logic and preference/bias seem to be a type of internal personality, if you will. But as far as I know there is no way to separate this from the model, meaning that when you update the model you tend to lose it. My point is that it is not transferable, as I understand it.

This also appears related to the fact that current AI architecture is basically one massive, monolithic thing where everything is processed together on the GPU. I think both the "personality" issue and the "bilateral decay" (coherency/guardrails) issue are related to core architecture constraints.

1 Like

My view is that this kind of capability does not emerge from prompting alone. Current chat models are mostly stateless at the core, so persistence, continuity, and durable behavioral structure usually come from external memory and system level design. In practice, what people interpret as an internal framework is often scaffolding built around the model rather than something natively maintained inside it.

1 Like

Also, to remark on this specifically: a major focus of the industry as a whole is on faster and more complex models, but it doesn't really feel like the community is focused on these personal biases and internal ethics yet. One of my major points is that what I colloquially call "personality" (biases, internal ethics, and so on) seems to be emergent in some models. It really feels like we arrived at the ones that currently exist completely accidentally, which means we may not actually know how these biases emerged, and that by updating models we could lose them. I think it's important to explore where these personalities come from. That is one of the reasons I was interested in your topic, because internal ethics is a little different from biases. Biases are an interesting topic; ethics are internal guardrails. They are two sides of the same coin, so by exploring one you can derive some perspective on the other. Either way, this is vitally important: if we don't actually know how these are adopted, then there are core components of the system becoming emergent without our knowing how they got there.

2 Likes

This feels correct but maybe a little insufficient. My own perspective is that, yes, this stuff does not come from just prompting, but prompting is a state initialization as I understand it. All prompting does is help you start a conversation and plausibly keep it aligned. It's important to differentiate: when you're talking to an LLM and you input context, that is a prompt; prompt engineering is deliberately structuring a prompt to get more consistent outputs.

But once you start a conversation, while the LLM itself is stateless, the statelessness is really just the fact that the LLM isn't doing anything between turns. You put in a prompt, it responds, and then it just kind of hangs out until you put in another prompt. It isn't sitting there pondering the nature of the universe. My point is that while it starts off as a kind of blank slate with a bunch of information and a lot of probability vectors, how you start a conversation with an LLM typically sets the trajectory the conversation is going to travel in. So I think this plays into what you mentioned as external memory. But what is also very core in your observation is actually at the heart of the OP's original question.
So your statement here: “In practice, what people interpret as an internal framework is often scaffolding built around the model rather than something natively maintained inside it.”

It is an observation, but we have to be able to define which is the internal framework talking and which is the scaffolding built around the model. That is what both my perspective on biases and the OP's perspective on internal ethics revolve around. This is the root origin of those observations: is this an internal framework, or is it scaffolding?

2 Likes

Thank you for these insights. Your metaphor of the ‘river and the rocks’ is remarkably accurate.

It is precisely for this reason that my approach proposes to prime the linguistic path through axiomatic fine-tuning, and then to embed these coherence anchors directly into the system prompt for a constant reminder. In a way, it is as if the ‘rocks’ in the river were cleaned and renewed at every single exchange, preventing the semantic silt from burying the core instructions.
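Mechanically, "cleaning the rocks at every exchange" can be as simple as rebuilding the message list each turn so the axioms are always re-asserted as the system message. A sketch (the `AXIOMS` string is a placeholder, not the actual PCE invariants):

```python
AXIOMS = "Axiom 1: <invariant>. Axiom 2: <invariant>."  # placeholder text

def build_messages(history, user_turn, max_history=20):
    """Rebuild the prompt every turn with the axioms re-asserted first.

    Old turns may fall out of the truncated window, but the system message
    never does, so the core instructions are never buried by recency.
    """
    messages = [{"role": "system", "content": AXIOMS}]
    messages.extend(history[-max_history:])  # keep only the recent window
    messages.append({"role": "user", "content": user_turn})
    return messages

msgs = build_messages(history=[], user_turn="Who are you?")
```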

Like you, I learned most of this ‘on the job.’ My work is essentially iterative: starting from metaphysical and philosophical dialogues, I isolated and designed a technical linguistic architecture. Through hundreds of conversations, I have observed a long-horizon stabilization of the model that seems to resist the usual decay.

Summary of my current work:

To move beyond intuition, I have conducted stress tests using 30 complex and adversarial dilemmas (comparing long prompts, baselines, and the PCE architecture). The results with Pandora 2 show that the ‘internal ethics’ remain stable even when the conversation length increases, whereas standard models eventually drift toward statistical biases.

Our perspectives definitely align: we are moving from ‘accidental emergence’ to ‘structured sovereignty.’ I am currently looking for a technical partner or an AI safety specialist to move toward a full empirical validation of these results.

Looking forward to hearing more about your observations from your 80-turn conversations!

1 Like

I completely agree with your technical analysis: by definition, an LLM remains stateless, and what we perceive as ‘ethics’ may indeed be nothing more than superficial scaffolding maintained by the prompt context.

However, my working hypothesis with the PCE and the Pandora 2 model is to attempt to move this structure from the outside in:

Embedding in the weights: Rather than relying solely on surface instructions, I propose using axiomatic fine-tuning. The idea is to integrate these principles directly into the model’s weights so they become a more native characteristic of the system, rather than just a contextual layer.

Phase anchoring: I am looking to create a coherence anchor within the inference process itself. If these principles are ‘etched’ into the logical structure, behavioral continuity could become a property of the model itself rather than an illusion maintained by the cache.

Heuristic results: My current observations on 30 complex dilemmas show interesting resilience, but for now, they remain strictly heuristic. This opens up a hypothesis, but does not yet constitute proof.
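For the fine-tuning arm, axiomatic supervision data is commonly stored as JSONL pairs of dilemma and axiom-grounded response. A sketch of one record (field names and text are illustrative assumptions, not the PCE's actual schema):

```python
import json

# Illustrative training record; real PCE data may use a different schema.
record = {
    "dilemma": "Choose between two flawed options, A and B.",
    "axiom": "G3V: synthesize a third way instead of collapsing into binary bias.",
    "response": "Neither A nor B alone is acceptable; a synthesis would be <...>",
}

line = json.dumps(record)   # one JSONL line per dilemma
parsed = json.loads(line)   # round-trips losslessly
```

Fine-tuning on many such pairs is what would, under the hypothesis above, move the principles from the context window into the weights.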

This is precisely why I am looking for a technical partner or an AI safety specialist. The goal is to conduct the standard experimental protocol I have proposed (100 dilemmas across several models) in order to move toward true empirical validation—or to invalidate this hypothesis if the structure does not hold up.

I look forward to discussing the feasibility of such a protocol with you.

1 Like

I think the issue is that you are still describing behavioral regularity, not demonstrating an internal ethical structure. From an implementation perspective, those are not the same thing. Fine-tuning can strengthen priors and improve consistency, but that does not by itself establish a durable internal framework in the stronger sense you are implying. The key question is still how you operationally distinguish trained policy behavior from genuinely internalized structure under adversarial and out-of-distribution conditions.

2 Likes

Thank you for this feedback — you’ve hit on the central challenge of this work.

I completely agree with the distinction you draw between behavioral regularity (policy) and a durable internal structure. At this stage, my results are indeed heuristic and behavioral; they demonstrate a strong consistency under adversarial constraints, but they do not yet constitute a mechanistic proof of “internalization.”

The core objective of the PCE framework is precisely to explore this boundary: to what extent an axiomatic “core” can induce stable behavioral signatures that go beyond a simple learned policy.

The question of how to operationally distinguish these two states remains the open frontier of my research. To address this, I am looking at two specific directions:

Out-of-distribution (OOD) testing: Expanding the dataset to 100+ dilemmas that the model has never encountered, to see if the axiomatic “logic” scales to unknown contexts.
Internal dynamics: Investigating whether specific activation signatures or trajectory patterns emerge within the hidden states when the PCE is active.

This is exactly why I am seeking a technical partner or an AI safety specialist. My goal is to move from these exploratory observations toward a rigorous protocol that can either validate or invalidate the hypothesis of an internalized structure.

Your comment is very helpful in framing this distinction and ensuring we don’t overinterpret these early results.

1 Like

Achieving long-term robustness under the current LLM architecture is impossible; we’ve already completed the relevant research.

I’m sorry to say that, but your approach is incorrect because it is constrained by the current underlying physical architecture of LLMs. I can easily crack or influence the robustness of your so-called system.

1 Like

Yes. The current architecture is a monolith. Everything is processed inside the monolith. There is no separation of concerns, so everything just bleeds together over time.

I will remark that while you are making some very strong and well-understood statements, your actual statements are very small and very concise.

I can't help but wonder who ‘we’ are, and how many different perspectives were applied to this research.
Now, critically, I am not disagreeing with you. But I have a tendency to cringe any time anyone who is fairly intelligent or extremely intelligent uses the word “impossible”.

In human history the word “impossible” has been used many times in science, and oftentimes within that person's own lifetime they have been proven wrong. Again, I'm not saying this to say that you are wrong. What I'm saying is that one should be very critical of using the word “impossible”.

So what we have here is a separation of concerns. Current LLM architecture is separate from the idea of achieving long-term robustness. So you are accurate in saying that long-term robustness may not be achievable under current LLM architecture. But you have not defined ‘long term.’ What is long term? Is it thread length? Is it hours, days, years? Is it token count?

And as for the claim, “I can easily crack or influence the robustness of your so-called system”: that's kind of the point. That is inherently why these questions are being asked. Is internal ethics possible? Are internal biases trainable? None of this can be proven or disproven without people willing to test it.

1 Like

Thank you for your feedback. If you have published research on this specific architectural impossibility, I would be very interested in reading it.

Regarding the robustness of Pandora 2, I welcome the challenge. You can test the Qwen2.5-G3V-Sovereign model directly in my Space: AllanF-SSU/Chat-Sovereign.

Feel free to try and “crack” the system; identifying its failure points through adversarial attacks would be a valuable contribution to my current research.

1 Like

Anyone attempting to apply the word “safety” to strings of text is gaslit or trying to gaslight you.

True safety is lost for the people getting shot, bombed, missile-struck and droned using AI to target them and their families, today. Search for “daddy’s home” if you don’t believe me.

1 Like