Where to start, Assessing student answers to open ended questions

Hi, very new to ML… I have been tasked with figuring out how to implement ML to score students’ answers to open-ended questions. There seems to be rather a steep learning curve and I’m not sure where to start; however, Hugging Face seems to get a lot of mentions for this type of problem. Essentially, the student will be given a question such as “Explain 3 of the benefits of budgeting.” and, given the text of the learning material, I need the model to return a score reflecting how accurate the student’s response is.

I have started a beginners course in ML but was hoping to get pointed in the right direction.

Cheers!

I am an absolute beginner, but, in my opinion, the most important thing is the definition of the problem itself.

For instance, are these open-ended questions primarily knowledge-based, or is there some chain-of-thought reasoning involved? The first case would be much easier.

Another relevant factor is how similar the future questions are going to be to those in the training corpus. Would you expect nearly identical questions, or is there a possibility that they are going to be fairly different? Again, the first scenario is simpler.

Assuming the simplest scenario, the next question is how you are grading the answers. Categorically, like A-D or integers 1-5, or do you expect fine-grained results such as B- or 0.63? The first case would be a classification problem and the second a regression problem.
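To make the distinction concrete, here is a minimal sketch of how the same graded answer could be turned into a classification target or a regression target. The grade scale, the function names and the example numbers are all made up for illustration.

```python
# Hypothetical grade scale; replace it with whatever your teachers actually use.
GRADE_CLASSES = ["A", "B", "C", "D"]


def to_class_index(grade):
    """Classification target: each grade becomes an integer class id."""
    return GRADE_CLASSES.index(grade)


def grade_from_logits(logits):
    """A classifier outputs one score per class; the highest one wins."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return GRADE_CLASSES[best]


def to_regression_target(points, max_points):
    """Regression target: a single number, here scaled to the range 0..1."""
    return points / max_points


print(to_class_index("B"))                       # class id used as the label
print(grade_from_logits([0.1, 2.3, 0.4, -1.0]))  # grade with the highest score
print(to_regression_target(6.3, 10))             # fine-grained score
```

As far as I know, Hugging Face sequence-classification models treat `num_labels=1` as regression and a larger `num_labels` as classification, so this choice mostly shows up in how you encode the labels.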

Next, you need to collect and prepare your data. You would need a large collection of representative questions and, for each, different answers with grades spanning the full range.

Then you need to pick a pretrained model. Assuming limited computational resources, perhaps a distilled BERT derivative such as DistilBERT or DistilRoBERTa.

Ideally, you may be able to find a model that has already been pretrained in your domain of interest. If you do, you can assess its performance on your labelled data.

If the performance is not adequate or you cannot find a model, you will need to carry out additional pretraining yourself. The most common strategy is masked language modeling (MLM). You can find tutorials in different places, including HF or Pretrain Transformers - George Mihaila.

Once you are happy with your pretrained model, you can fine-tune it according to the HF tutorial.

I hope this helps.

Thanks mirix, that helps a lot! I have just started this course on fastai: https://course.fast.ai/ so a lot of the terminology you use I don’t know about yet. I’m also coming from a web Node/JavaScript background, and haven’t used Python for probably over a decade or more, so it’s quite a steep learning curve. For the time being I’m going to stick to the one question I’ve been given to create a demo, hopefully find a model that does something similar, get it working somewhat, and deployed. Then revisit it when I know more about the languages involved, platforms used etc.

I’m not going to have any training data really, so was hoping one of the following scenarios would be doable:

  1. An existing model that would use the course material as a kind of context: it would answer the question itself from that course material, then mark the student’s answer for contextual similarity to its answer.

  2. I could feed the model an ideal answer, then it would score the student’s answer based on contextual similarity.

I’m imagining a similarity score as a percentage. It would also be good to get a score of how confident the student is in their answer, but that’s for V2.

Do you have any thoughts on the possibility of either of those two approaches?

Cheers!

Regarding the first approach, you can search HF, but I think it is highly unlikely that you will find a model fine-tuned for that specific task and with the appropriate level of knowledge of the specific domain(s).

I believe that the second approach is technically feasible. Now, is it going to provide acceptable results without any fine-tuning? I very much doubt it, but, again, feel free to experiment.
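As a rough sketch of that second approach (ideal answer versus student answer), the scoring step could look something like this. Note that `toy_embed` is a stand-in I made up so the example runs without downloading anything: it only counts words, so it measures word overlap rather than meaning. In practice you would pass in a real sentence-embedding model through the `embed` argument. All function names here are illustrative, not from any library.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for a real sentence-embedding model: word counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty answer matches nothing
    return dot / (norm_a * norm_b)


def similarity_percent(ideal_answer, student_answer, embed=toy_embed):
    """Score the student answer against the ideal answer on a 0-100 scale."""
    score = cosine_similarity(embed(ideal_answer), embed(student_answer))
    return round(100 * max(0.0, score), 1)


ideal = "Budgeting helps you track spending, avoid debt and save for goals."
print(similarity_percent(ideal, ideal))
print(similarity_percent(ideal, "A budget lets you save money and avoid debt."))
print(similarity_percent(ideal, ""))
```

This also shows the weakness I am worried about: a good answer that is worded differently from the ideal one gets a middling score unless the embedding really captures meaning.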

For domain adaptation you don’t need labelled data. You can just feed the full school curriculum to a masked model and see how much it can learn.
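To show why no labels are needed, here is a minimal sketch of how a masked-language-modeling training example is built from raw text: the text itself supplies the targets. This is a simplification under my own assumptions: it splits on whitespace instead of using a real subword tokenizer, and it replaces every selected token with `[MASK]`, whereas BERT-style recipes sometimes keep the token or swap in a random one. The function name is hypothetical; with Hugging Face you would normally let `DataCollatorForLanguageModeling` do this step.

```python
import random

MASK_TOKEN = "[MASK]"


def make_mlm_example(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) for one masked-language-modeling example.

    labels[i] holds the original token where a mask was applied and None
    elsewhere, so a loss would only be computed at the masked positions.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Always mask at least one token so short sentences still contribute.
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [MASK_TOKEN if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels


sentence = "a budget helps you plan your spending and reach your savings goals"
masked, labels = make_mlm_example(sentence.split(), seed=0)
print(" ".join(masked))
print(labels)
```

The model is then trained to predict the hidden words from their context, which is how it picks up the vocabulary of the curriculum.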

Then, yes, as for the specific task, comparing text similarity is what you need but, as I said, without fine tuning on your own or similar data… I am a bit skeptical about the results. In my experience as a teacher, I do not think that there is just one magical ideal answer. One good answer can look very different from another. I am afraid your model will need to learn the nuances.

However, perhaps obtaining a reasonable corpus of labelled data is not so difficult in your case.

Being a web developer, you can first provide the tools to encourage the students to take the tests in digital format (otherwise you will need to scan and OCR them) and also provide the tools for the teachers to record their evaluation of each question in digital format.

Hi, This is pretty much done and working as expected, although it could do with more training data. Here is the code in case someone wants to do something similar.

Model Training:

import torch
from torch.utils.data import Dataset, DataLoader
from transformers import RobertaTokenizer, RobertaForSequenceClassification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import pandas as pd
import re
import string

data = pd.read_csv('/content/data.csv')

# Split data into training and validation sets
train_data, val_data = train_test_split(data, test_size=0.3, random_state=42)

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove punctuation
    translator = str.maketrans('', '', string.punctuation)
    text = text.translate(translator)

    # Remove leading/trailing whitespaces
    text = text.strip()

    # Collapse multiple spaces into a single space
    text = re.sub(r'\s+', ' ', text)

    return text

# Define a custom dataset class
class CustomDataset(Dataset):
    def __init__(self, questions, document_text, answers, labels, tokenizer, max_length):
        self.questions = questions
        self.document_text = document_text
        self.answers = answers
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, index):
        question = self.questions[index]
        answer = self.answers[index]
        label = self.labels[index]

        question = preprocess_text(question)
        answer = preprocess_text(answer)
        document_text = preprocess_text(self.document_text[index])

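        # Note: truncation=True cuts tokens from the right, so a long
        # document_text can push the student's answer past max_length.
        input_text = f"{question} {document_text} {answer}"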

        encoding = self.tokenizer.encode_plus(
            input_text,
            add_special_tokens=True,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )

        input_ids = encoding['input_ids'].squeeze()
        attention_mask = encoding['attention_mask'].squeeze()

        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'label': label
        }

# Set the device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.backends.cudnn.benchmark = True

# Define hyperparameters
batch_size = 16
max_length = 512
num_epochs = 50
learning_rate = 2e-5

# Load the pre-trained tokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

# Example training data
train_questions = train_data['question'].values
train_document_text = train_data['document_text'].values
train_answers = train_data['answer'].values
train_labels = train_data['label'].values

# Example validation data
val_questions = val_data['question'].values
val_document_text = val_data['document_text'].values
val_answers = val_data['answer'].values
val_labels = val_data['label'].values

# Create the custom dataset and data loader for training
train_dataset = CustomDataset(train_questions, train_document_text, train_answers, train_labels, tokenizer, max_length)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

# Create the custom dataset and data loader for validation
val_dataset = CustomDataset(val_questions, val_document_text, val_answers, val_labels, tokenizer, max_length)
val_loader = DataLoader(val_dataset, batch_size=batch_size)

model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=3)
model.to(device=device)

# Define the optimizer and loss function
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    model.train()
    train_loss = 0.0
    train_preds = []
    train_targets = []

    for batch in train_loader:
        input_ids = batch['input_ids'].to(device=device)
        attention_mask = batch['attention_mask'].to(device=device)
        labels = batch['label'].to(device=device)

        optimizer.zero_grad()

        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        logits = outputs.logits

        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        train_preds.extend(torch.argmax(logits, dim=1).tolist())
        train_targets.extend(labels.tolist())

        print(f"Train Batch: Loss={loss.item()}")

    # Calculate training accuracy
    train_accuracy = accuracy_score(train_targets, train_preds)

    # Validation loop
    model.eval()
    val_loss = 0.0
    val_preds = []
    val_targets = []

    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch['input_ids'].to(device=device)
            attention_mask = batch['attention_mask'].to(device=device)
            labels = batch['label'].to(device=device)

            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            logits = outputs.logits

            val_loss += loss.item()
            val_preds.extend(torch.argmax(logits, dim=1).tolist())
            val_targets.extend(labels.tolist())

            print(f"Validation Batch: Loss={loss.item()}")

    # Calculate validation accuracy
    val_accuracy = accuracy_score(val_targets, val_preds)

    # Print training and validation loss and accuracy
    print(f"Epoch {epoch+1}:")
    print(f"Train Loss: {train_loss / len(train_loader)}")
    print(f"Train Accuracy: {train_accuracy}")
    print(f"Validation Loss: {val_loss / len(val_loader)}")
    print(f"Validation Accuracy: {val_accuracy}")

# Save the trained model
model.save_pretrained('./')

And the Inference Function:

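import json
import os
import logging
import re
import string

# Set the cache directory before importing transformers,
# since the library reads this variable when it is imported
os.environ['TRANSFORMERS_CACHE'] = '/tmp/'

import torch
from torch.utils.data import Dataset, DataLoader
from transformers import RobertaTokenizer, RobertaForSequenceClassification

logging.getLogger().setLevel(logging.DEBUG)

def preprocess_text(text):
    # Same cleaning as in training, so the model sees inputs in the format it was trained on
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.strip()
    return re.sub(r'\s+', ' ', text)

# Define the CustomDataset class
class CustomDataset(Dataset):
    def __init__(self, questions, document_text, answers, tokenizer, max_length):
        self.questions = questions
        self.document_text = document_text
        self.answers = answers
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, index):
        question = preprocess_text(self.questions[index])
        answer = preprocess_text(self.answers[index])
        # Here document_text is a single string shared by all questions
        document_text = preprocess_text(self.document_text)

        combined_text = f"{question} {document_text} {answer}"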

        encoding = self.tokenizer.encode_plus(
            combined_text,
            add_special_tokens=True,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )

        input_ids = encoding['input_ids'].squeeze()
        attention_mask = encoding['attention_mask'].squeeze()

        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask
        }


def inference(loader, model, device):
    model.eval()

    with torch.no_grad():
        for batch in loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)

            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits

            prediction = torch.argmax(logits, dim=1).item()

    return prediction


def lambda_handler(event, context):
    logging.info(f"event['body'] {event['body']}")
    input_data = json.loads(event['body'])
    logging.info(f"input_data {input_data}")
    question = input_data['question']
    document_text = input_data['document_text']
    answer = input_data['answer']

    # Load the pre-trained tokenizer
    tokenizer = RobertaTokenizer.from_pretrained('roberta-base', cache_dir='/tmp/')

    # Load the model
    model = RobertaForSequenceClassification.from_pretrained('./')

    # Set the device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)

    # Prepare the input for inference
    input_data = {
        'question': question,
        'document_text': document_text,
        'answer': answer
    }

    # Create a dataset for the input data
    inference_dataset = CustomDataset([input_data['question']], input_data['document_text'],
                                      [input_data['answer']], tokenizer, max_length=512)
    inference_loader = DataLoader(inference_dataset, batch_size=1)

    # Call the inference function
    try:
        prediction = inference(inference_loader, model, device)

        response = {
            'statusCode': 200,
            'headers': {
                'Content-Type': 'application/json',
                'Access-Control-Allow-Origin': '*',
                'Access-Control-Allow-Headers': 'Content-Type',
                'Access-Control-Allow-Methods': 'OPTIONS,POST'
            },
            'body': json.dumps(prediction)
        }


    except Exception as e:
        logging.error(e)
        response = {
            'statusCode': 500,
            'headers': {
                'Content-Type': 'application/json',
                'Access-Control-Allow-Origin': '*',
                'Access-Control-Allow-Headers': 'Content-Type',
                'Access-Control-Allow-Methods': 'OPTIONS,POST'
            },
            'body': json.dumps('Error occurred: ' + str(e))
        }
    return response

I’m happy to hear suggestions/comments if anyone has any.

I just wanted to ask if anyone has tried combining sentence transformers with some light rule-based filtering to handle edge cases in grading? I’m experimenting with cosine similarity + a few keyword checks, and it’s doing okay so far. Curious if others had success fine-tuning models on smaller, domain-specific datasets or stuck with zero-shot approaches?

For now, the general implementation seems to be something like this.


1. Short answer: your setup is a standard, reasonable pattern

What you describe:

  • Sentence-transformers to get embeddings for:

    • reference/model answer(s)
    • student answer
  • Cosine similarity as the main “how close is this?” signal

  • A few rule-based checks (keywords / length / sanity checks) for edge cases

is basically the canonical “semantic core + rule/feature wrapper” pattern that shows up in a lot of ASAG work.

Surveys of deep-learning-based ASAG explicitly note that the strongest systems often combine learned semantic representations with simpler, hand-engineered signals (lexical overlap, length, concept checks), rather than relying on embeddings alone. (arXiv)

So you’re not doing something ad-hoc; you’re recreating what many research systems have ended up converging on.


2. Examples of similar approaches in the literature

2.1 Sentence-transformers as the main grading signal

A few concrete references that are very close to what you’re doing:

  • Ahmed et al., “On the Application of Sentence Transformers to Automatic Short Answer Grading in Blended Assessment”
    They use pre-trained sentence-transformer models as the core semantic engine and compare different input setups (question + answer vs answer alone). Student answers and reference answers are embedded and compared via similarity; this similarity is then mapped to grades. They report “promising results” in a blended learning setting, even without huge task-specific datasets. (ResearchGate)

  • Analysis of direct scoring vs similarity-based scoring
    Other work compares “direct scoring” (a model that outputs a score from text) vs “similarity-based scoring” (model answer ↔ student answer similarity). The similarity-based approach with sentence embeddings performs competitively and is conceptually simple: exactly what you’re doing with cosine. (itscience.org)

This is essentially:

encode(model_answer), encode(student_answer) → cosine(sim) → score/label.

2.2 Hybrid “semantic + lexical / rule” frameworks (very close to “cosine + keyword checks”)

  • GradeAid framework (del Gobbo et al., 2023)
    GradeAid is a full ASAG framework that explicitly combines semantic features from transformers with lexical features (TF–IDF overlaps, length, etc.) and feeds them into regressors to predict numeric scores. (SpringerLink)

    Conceptually:

    • Your cosine similarity ≈ their semantic features.
    • Your keyword checks / rules ≈ their lexical features and heuristics.
  • Recent EDM work “Short answer grading with sentence similarity and a few extra modeling steps” (Desmarais et al., 2025)
    They literally take sentence similarity (SBERT-style) as the main signal and then add “a few extra modeling steps” on top to better map similarity values to grades and handle hard cases. (educationaldatamining.org)

The pattern across these papers:

  • Start with embedding-based similarity.

  • Add shallow modeling or simple rules/features on top to handle:

    • multi-concept questions
    • partial credit
    • short / off-topic / adversarial answers

That’s exactly the role your rule-based layer is playing.


3. Zero-shot vs fine-tuning on smaller, domain-specific data

3.1 Zero-shot + rules: where it fits

Using a pre-trained sentence-transformers model in zero-shot mode (no task-specific fine-tuning) with cosine similarity + rules is a very common starting point:

  • Ahmed et al. show that off-the-shelf sentence-transformers already give usable correlations with human grading when used as semantic similarity engines. (ResearchGate)
  • Surveys note that embedding-based models, combined with hand-crafted features, are often competitive with more complex architectures, especially when labeled data is limited. (arXiv)

Zero-shot + rules is usually enough when:

  • You don’t yet have much labeled grading data.
  • You’re aiming for formative feedback or rough correctness labels, not high-stakes grading.
  • A human can still review borderline answers.

So “cosine + a few keyword checks and thresholds” being “okay so far” is not surprising — that’s the expected behavior for this baseline.
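A minimal sketch of this baseline, assuming the embeddings have already been computed (for example with a sentence-transformers model) and are passed in as plain lists of floats. The function names, the thresholds and the toy 3-dimensional vectors are illustrative assumptions, not values from any of the cited papers.

```python
import math
import re


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def grade(ref_vec, student_vec, student_text, keywords,
          min_words=3, correct_at=0.75, partial_at=0.50):
    """Zero-shot label from cosine similarity, guarded by two simple rules.

    All thresholds are placeholders and need tuning on real answers.
    """
    words = re.findall(r"[a-z0-9']+", student_text.lower())
    if len(words) < min_words:
        return "incorrect", "too short to grade"
    similarity = cosine(ref_vec, student_vec)
    hits = [k for k in keywords if k.lower() in words]
    if similarity >= correct_at and hits:
        return "correct", f"similarity {similarity:.2f}, keywords {hits}"
    if similarity >= partial_at or hits:
        return "partial", f"similarity {similarity:.2f}, keywords {hits}"
    return "incorrect", f"similarity {similarity:.2f}, no keywords"


# Toy vectors standing in for real embeddings.
ref, close, far = [0.9, 0.1, 0.3], [0.8, 0.2, 0.35], [0.0, 1.0, 0.0]
print(grade(ref, close, "Budgeting helps you avoid debt", ["debt", "save"]))
print(grade(ref, close, "It keeps your money organised", ["debt", "save"]))
print(grade(ref, far, "I like pizza very much", ["debt", "save"]))
```

Returning the reason alongside the label makes it easier for a human to review borderline answers.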

3.2 Fine-tuning: what others report when they go beyond zero-shot

Once people have a bit of labeled data, they almost always get measurable gains by fine-tuning.

A representative example:

  • Wijanto & Yong, “Combining Balancing Dataset and SentenceTransformers to Improve Short Answer Grading Performance” (Applied Sciences, 2024)
    They start from sentence-transformer models and evaluate short-answer grading performance before and after fine-tuning, with some attention to class imbalance and data augmentation.
    Main takeaways:

    • Fine-tuning on task-specific ASAG datasets noticeably improves correlation and accuracy over pure zero-shot embeddings.
    • Balancing the dataset (so “correct” vs “incorrect/partial” isn’t too skewed) matters for stable performance. (MDPI)
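One simple way to balance a small graded dataset is random oversampling of the rarer labels. The sketch below is an assumption about what "balancing" could look like in practice, not a reproduction of the method in the paper; the function name and the `label` field are hypothetical.

```python
import random
from collections import Counter, defaultdict


def oversample(examples, label_key="label", seed=0):
    """Duplicate random examples of the rarer labels until all labels are equally common."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[label_key]].append(example)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # rng.choices samples with replacement; k is 0 for the largest class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Made-up label distribution: 5 correct, 2 partial, 1 incorrect.
data = ([{"answer": f"c{i}", "label": "correct"} for i in range(5)]
        + [{"answer": f"p{i}", "label": "partial"} for i in range(2)]
        + [{"answer": "i0", "label": "incorrect"}])
balanced = oversample(data)
print(Counter(example["label"] for example in balanced))
```

Oversampling should be applied to the training split only, so that validation scores still reflect the real label distribution.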

Surveys and comparative studies point in the same direction:

  • Embeddings alone (zero-shot) are good baselines.
  • But task-specific training (either fine-tuning the sentence-transformer or training a light model on top of similarity + features) generally wins once you have a few thousand graded answers. (arXiv)

In practice people often do:

  1. Start exactly where you are: zero-shot sentence-transformers + cosine + rules.

  2. Log model scores, human grades, and overrides.

  3. Once they have enough data:

    • Fine-tune the sentence-transformer (e.g. with a CosineSimilarityLoss on (ref, student, normalized_score)), and/or

    • Train a small classifier/regressor on top of features like:

      • cosine similarity
      • answer length
      • “number of rubric concepts hit” from keyword/concept checks

4. Design ideas you can borrow directly

If you want to keep your current flavor (cosine + a few rules) but make it more robust, these are common extensions:

4.1 Multi-reference + rubric concepts

Instead of a single reference answer, use:

  • A small set of paraphrased reference answers
  • Short rubric “concept” sentences (e.g. “mentions conservation of energy”, “explains that ATP stores energy in phosphate bonds”)

Then compute:

  • Max / average cosine similarity to reference answers
  • Concept-wise similarities (one score per rubric point)

This is how rubric-oriented systems handle partial credit and multi-concept items. It meshes nicely with sentence-transformers: you stay in cosine-land, but you now have one similarity per concept instead of a single global score. (ResearchGate)
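A minimal sketch of the multi-reference and per-concept scoring described above, assuming embeddings are already available as lists of floats. The function name, the concept threshold and the toy vectors are illustrative assumptions.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_against_rubric(student_vec, reference_vecs, concept_vecs,
                         concept_threshold=0.6):
    """Compare one answer to several references and to each rubric concept.

    reference_vecs: non-empty list of embeddings of paraphrased model answers.
    concept_vecs: non-empty dict mapping a rubric concept name to its embedding.
    """
    ref_sims = [cosine(student_vec, ref) for ref in reference_vecs]
    concept_sims = {name: cosine(student_vec, vec)
                    for name, vec in concept_vecs.items()}
    covered = [name for name, sim in concept_sims.items()
               if sim >= concept_threshold]
    return {
        "max_ref_sim": max(ref_sims),
        "mean_ref_sim": sum(ref_sims) / len(ref_sims),
        "concept_sims": concept_sims,
        "covered_concepts": covered,
        # Partial credit: share of rubric concepts the answer appears to cover.
        "partial_credit": len(covered) / len(concept_sims),
    }


# Toy 3-dimensional vectors standing in for real embeddings.
student = [1.0, 1.0, 0.0]
references = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
concepts = {
    "tracks spending": [1.0, 0.0, 0.0],
    "avoids debt": [0.0, 1.0, 0.0],
    "builds savings": [0.0, 0.0, 1.0],
}
result = score_against_rubric(student, references, concepts)
print(result["covered_concepts"], result["partial_credit"])
```

Taking the maximum over references rewards an answer that matches any acceptable phrasing, while the per-concept scores show which rubric points were missed.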

4.2 Turn rules into features instead of only hard filters

Right now your rules might be binary (“if keyword missing → score 0”). Many systems soften this:

  • Compute features such as:

    • cos_sim_ref
    • len_tokens
    • num_core_keywords_hit
    • has_negation_near_keyword (0/1)
  • Train a tiny model (logistic regression / shallow MLP) that takes these features and predicts:

    • correct / partial / incorrect, or
    • a score in [0,1]

GradeAid is a concrete example of this pattern: semantic and lexical features are concatenated and fed into regressors, rather than hand-coded if/else logic. (SpringerLink)

You keep interpretability (you can still inspect which features matter), but avoid manually tuning thresholds per question.
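A dependency-free sketch of the "rules as features" idea: the rule outputs become numbers, and a tiny logistic regression learns how to weigh them. The negation word list, the scaling choices, the function names and the training rows are all made up for illustration; in practice scikit-learn's `LogisticRegression` would be the more natural choice than this hand-written loop.

```python
import math
import re

NEGATIONS = {"not", "no", "never", "without", "cannot"}


def extract_features(cos_sim_ref, answer, core_keywords, window=3, max_len=50):
    """Turn one answer into a small feature vector with every value in 0..1.

    core_keywords must be a non-empty set of lowercase words.
    """
    tokens = re.findall(r"[a-z0-9']+", answer.lower())
    positions = [i for i, tok in enumerate(tokens) if tok in core_keywords]
    hits = {tokens[i] for i in positions}
    # Negation rule: a negation word within `window` tokens before a keyword.
    negated = any(tok in NEGATIONS
                  for i in positions
                  for tok in tokens[max(0, i - window):i])
    return [
        cos_sim_ref,                          # cos_sim_ref
        min(len(tokens), max_len) / max_len,  # len_tokens, scaled
        len(hits) / len(core_keywords),       # num_core_keywords_hit, scaled
        1 if negated else 0,                  # has_negation_near_keyword
    ]


def predict_proba(weights, bias, x):
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(weights, bias, rows, labels):
    total = 0.0
    for x, y in zip(rows, labels):
        p = min(max(predict_proba(weights, bias, x), 1e-12), 1 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(rows)


def train_logistic_regression(rows, labels, lr=0.5, steps=500):
    """Plain batch gradient descent on the logistic loss."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(steps):
        errors = [predict_proba(weights, bias, x) - y for x, y in zip(rows, labels)]
        for j in range(len(weights)):
            grad = sum(e * x[j] for e, x in zip(errors, rows)) / len(rows)
            weights[j] -= lr * grad
        bias -= lr * sum(errors) / len(rows)
    return weights, bias


# Made-up rows: [cos_sim_ref, len_tokens, keyword_hits, negation]; label 1 = correct.
rows = [[0.90, 0.30, 1.0, 0], [0.80, 0.20, 0.5, 0], [0.85, 0.25, 1.0, 1],
        [0.30, 0.10, 0.0, 0], [0.20, 0.05, 0.5, 0], [0.75, 0.40, 1.0, 0]]
labels = [1, 1, 0, 0, 0, 1]
weights, bias = train_logistic_regression(rows, labels)
print(extract_features(0.8, "Budgeting does not help you save money", {"save", "debt"}))
print(log_loss([0.0] * 4, 0.0, rows, labels), "->", log_loss(weights, bias, rows, labels))
```

The learned weights can be inspected directly, which keeps the interpretability of the original rules.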

4.3 Data collection and active learning

Since collecting labels is painful, some work treats this as an active-learning problem:

  • Use your current model to grade everything.
  • Only send uncertain answers (e.g., cosine near threshold, conflicting signals from rules) to a human for labeling.
  • Retrain/fine-tune periodically on this enriched dataset.

This gives you a focused labeled set around real decision boundaries, which is exactly where fine-tuning brings the most value.
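The selection step in this loop could be sketched as follows, assuming each graded answer carries a cosine score and a keyword-hit count from the current system. The threshold, the margin, the labeling budget and the field names are illustrative assumptions.

```python
def needs_human_review(cos_sim, keyword_hits, threshold=0.6, margin=0.1, min_hits=1):
    """Flag answers near the decision boundary or where model and rules disagree."""
    near_boundary = abs(cos_sim - threshold) <= margin
    model_says_correct = cos_sim >= threshold
    rules_say_correct = keyword_hits >= min_hits
    return near_boundary or (model_says_correct != rules_say_correct)


def select_for_labeling(graded, budget=2, threshold=0.6):
    """Return the ids of the most uncertain answers, closest to the threshold first."""
    uncertain = [g for g in graded
                 if needs_human_review(g["cos_sim"], g["keyword_hits"], threshold)]
    uncertain.sort(key=lambda g: abs(g["cos_sim"] - threshold))
    return [g["id"] for g in uncertain[:budget]]


# Made-up scores from a current zero-shot system.
graded = [
    {"id": "a1", "cos_sim": 0.95, "keyword_hits": 2},  # confident and consistent
    {"id": "a2", "cos_sim": 0.62, "keyword_hits": 1},  # near the threshold
    {"id": "a3", "cos_sim": 0.15, "keyword_hits": 0},  # confident and consistent
    {"id": "a4", "cos_sim": 0.85, "keyword_hits": 0},  # model and rules disagree
    {"id": "a5", "cos_sim": 0.55, "keyword_hits": 0},  # near the threshold
]
print(select_for_labeling(graded, budget=2))
print(select_for_labeling(graded, budget=10))
```

The budget caps how many answers go to a human per round, so the labeling effort stays predictable.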


5. Practical notes for sentence-transformers specifically

If you decide to fine-tune:

  • A natural setup with sentence-transformers is:

    • Inputs: (ref_answer, student_answer, target_score_normalized)
    • Loss: CosineSimilarityLoss so that cosine(ref, student) ≈ target_score.
  • You can aggregate data across questions as long as you:

    • Normalize scores per question (e.g. divide by max points).
    • Make sure your train/validation split separates questions if you care about generalization to unseen questions.
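The data preparation above can be sketched without any ML library. This is a minimal illustration under stated assumptions: the record fields, function names and example data are hypothetical, and `cosine_similarity_error` only mirrors the quantity that a cosine-similarity loss with a squared-error objective would average over a batch.

```python
import math
import random


def normalize_scores(records, max_points):
    """Attach a 0..1 label to each record by dividing by the question's max points."""
    return [dict(r, label=r["score"] / max_points[r["question_id"]]) for r in records]


def split_by_question(records, val_fraction=0.25, seed=0):
    """Hold out whole questions so validation measures generalization to unseen ones."""
    question_ids = sorted({r["question_id"] for r in records})
    random.Random(seed).shuffle(question_ids)
    n_val = max(1, round(len(question_ids) * val_fraction))
    val_ids = set(question_ids[:n_val])
    train = [r for r in records if r["question_id"] not in val_ids]
    val = [r for r in records if r["question_id"] in val_ids]
    return train, val


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cosine_similarity_error(ref_vec, student_vec, label):
    """Squared gap between cosine(ref, student) and the normalized score."""
    return (cosine(ref_vec, student_vec) - label) ** 2


# Made-up graded answers for four questions with different point scales.
max_points = {"q1": 4, "q2": 2, "q3": 5, "q4": 10}
records = [
    {"question_id": "q1", "student_answer": "answer 1a", "score": 2},
    {"question_id": "q1", "student_answer": "answer 1b", "score": 4},
    {"question_id": "q2", "student_answer": "answer 2a", "score": 0},
    {"question_id": "q3", "student_answer": "answer 3a", "score": 5},
    {"question_id": "q4", "student_answer": "answer 4a", "score": 7},
]
train, val = split_by_question(normalize_scores(records, max_points))
print([r["label"] for r in train], [r["question_id"] for r in val])
print(cosine_similarity_error([1.0, 0.0], [0.0, 1.0], 0.5))
```

Because whole questions are held out, a validation score computed this way says something about questions the model has never seen, which a random per-answer split would not.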

Papers like Ahmed et al. and Wijanto & Yong show this kind of task-specific tuning of sentence-transformers improves grading performance over pure zero-shot, especially in domain-specific or non-English settings. (ResearchGate)

I started messing with student answer scoring a while back and Hugging Face made things way simpler than trying everything from scratch. I played around with a small dataset first to see how the model handled basic answers. While doing that, I checked pedir ayudas for ideas on clear, structured examples, and it helped me think about scoring consistently.