GraphRAG for Mainframe Abend Troubleshooting with AgentScope

Posted on Fri 12 June 2026 in GenAI

Most mainframe troubleshooting RAGs fail at the same place: retrieval. An abend code like S0C7 (data exception) calls for an exact lookup, not a fuzzy semantic match, yet vector search happily returns the S0C4 (protection exception) chunk because the two embeddings sit close together. Job dependencies are also graph-shaped: an abend in step 3 cascades to downstream jobs, and plain RAG flattens that structure into disconnected paragraphs. The fix is a knowledge graph the agent can traverse along real relationships instead of guessing with cosine similarity.

The Core Idea

Structure over similarity — Mainframe docs encode jobs, steps, abend codes, and fixes with strict relationships. Model those as typed nodes and edges so the agent traverses known paths instead of hoping a vector match lands on the right text.

Schema is the boundary — Every md file gets parsed into one fixed node/edge structure. The LLM never reasons over raw document shape; it queries known types. This is where accuracy comes from.

Step 1 — Define the Schema

Five node types and a handful of edge types cover most batch environments:

Node types:    JOB, STEP, ABEND, DATASET, REMEDIATION

Edge types:    JOB   -HAS_STEP->     STEP
               STEP  -RAISES->       ABEND
               ABEND -RESOLVED_BY->  REMEDIATION
               JOB   -TRIGGERS->     JOB        (downstream dependency)
               STEP  -READS/WRITES-> DATASET
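
One way to make the schema enforceable rather than just descriptive is to write the allowed (source type, relation, target type) triples down as data and check every edge against them before the graph is saved. The sketch below is a hypothetical helper, not part of the parser in this post: ALLOWED_EDGES, check_edge, and schema_violations are names introduced here, and READS/WRITES is split into two relations as an assumption.

```python
# schema.py (hypothetical helper; all names are illustrative)
NODE_TYPES = {"JOB", "STEP", "ABEND", "DATASET", "REMEDIATION"}

# (source type, relation, target type) triples taken from the schema above.
# Assumption: READS/WRITES is modelled as two separate relations.
ALLOWED_EDGES = {
    ("JOB", "HAS_STEP", "STEP"),
    ("STEP", "RAISES", "ABEND"),
    ("ABEND", "RESOLVED_BY", "REMEDIATION"),
    ("JOB", "TRIGGERS", "JOB"),
    ("STEP", "READS", "DATASET"),
    ("STEP", "WRITES", "DATASET"),
}

def check_edge(src_type: str, rel: str, dst_type: str) -> bool:
    """Return True if the schema allows this edge."""
    if src_type not in NODE_TYPES or dst_type not in NODE_TYPES:
        return False
    return (src_type, rel, dst_type) in ALLOWED_EDGES

def schema_violations(edges):
    """Collect every (src_type, rel, dst_type) triple the schema rejects."""
    return [e for e in edges if not check_edge(*e)]
```

Running schema_violations over the parsed edges before saving catches a parser bug (say, a JOB wired straight to an ABEND) at build time instead of at query time.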

Step 2 — Parse the MD Files into the Graph

Codes follow strict formats, so regex handles the deterministic bits. Prose remediation text may need an LLM extraction pass for messier docs.

# build_graph.py
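import hashlib
import re
import networkx as nx

# System abends: S + 3 hex digits (S0C7, SB37); user abends: U + 4 decimal digits (U4038).
# Uppercase words such as SAFE also fit the system pattern, so tighten it if your docs need that.
ABEND_RE = re.compile(r'\b(S[0-9A-F]{3}|U\d{4})\b')
JOB_RE   = re.compile(r'\bJOB[_A-Z0-9]+\b')

def parse_md_to_graph(md_text: str) -> nx.DiGraph:
    g = nx.DiGraph()
    current_job = current_step = None

    for line in md_text.splitlines():
        if line.startswith("## "):                       # ## JOB_PAYROLL
            jobs = JOB_RE.findall(line)
            if jobs:
                current_job = jobs[0]
                current_step = None                      # don't carry a step across jobs
                g.add_node(current_job, type="JOB")

        elif line.startswith("### ") and current_job:     # ### STEP04
            # Prefix with the job so STEP04 in two different jobs stays two nodes.
            step_name = line.replace("###", "").strip()
            current_step = f"{current_job}::{step_name}"
            g.add_node(current_step, type="STEP")
            g.add_edge(current_job, current_step, rel="HAS_STEP")

        elif current_step:
            for ab in ABEND_RE.findall(line):
                g.add_node(ab, type="ABEND")
                g.add_edge(current_step, ab, rel="RAISES")
                rem = line.split(ab, 1)[-1].strip(" :-")
                if rem:
                    # hashlib gives a stable ID; built-in hash() of a str changes per process.
                    digest = hashlib.sha1(rem.encode("utf-8")).hexdigest()[:8]
                    rem_id = f"FIX::{ab}::{digest}"
                    g.add_node(rem_id, type="REMEDIATION", text=rem)
                    g.add_edge(ab, rem_id, rel="RESOLVED_BY")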

        if current_job and "trigger" in line.lower():
            for dep in JOB_RE.findall(line):
                if dep != current_job:
                    g.add_edge(current_job, dep, rel="TRIGGERS")

    return g

Tune the section markers and regex to your actual md conventions — the principle holds regardless.
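
For reference, the layout the parser assumes looks like the sample below: a "## " heading per job, a "### " heading per step, and one abend per line followed by its fix. The sample text and job names are made up for illustration. The sketch also runs an abend pattern over it, on the assumption that system abends are S plus three hex digits and user abends are U plus four decimal digits; extract_codes is a hypothetical helper.

```python
import re

# Made-up sample in the layout the parser expects (not from a real runbook).
SAMPLE_MD = """\
## JOB_PAYROLL
Triggers JOB_LEDGER on success.
### STEP04
S0C7: check packed-decimal fields in the input file
U4038 - review the CEEDUMP output
### STEP05
S0C4: verify table index bounds
"""

# Assumption: system abends are S + 3 hex digits, user abends U + 4 decimal digits.
ABEND_RE = re.compile(r"\b(S[0-9A-F]{3}|U\d{4})\b")
JOB_RE = re.compile(r"\bJOB[_A-Z0-9]+\b")

def extract_codes(md_text: str) -> list[str]:
    """Return abend codes in document order, without duplicates."""
    seen: list[str] = []
    for code in ABEND_RE.findall(md_text):
        if code not in seen:
            seen.append(code)
    return seen

abends = extract_codes(SAMPLE_MD)
jobs = sorted(set(JOB_RE.findall(SAMPLE_MD)))
```

A quick check like this against a handful of real files is worth doing before building the full graph: a digits-only pattern silently skips hex codes such as S0C7, and headings like STEP04 should never come back as abends.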

Step 3 — Query Functions (the Agent's Tools)

Deterministic traversals. No embeddings, no LLM, no similarity error on the code lookup.

# graph_tools.py
import json
import networkx as nx

g = nx.read_gml("abend_graph.gml")

def query_abend(abend_code: str) -> str:
    """Remediations + which steps/jobs raise a given abend code."""
    if abend_code not in g:
        return json.dumps({"found": False, "abend": abend_code})

    remediations = [g.nodes[n].get("text", n)
                    for n in g.successors(abend_code)
                    if g.edges[abend_code, n].get("rel") == "RESOLVED_BY"]

    raised_by = [s for s in g.predecessors(abend_code)
                 if g.edges[s, abend_code].get("rel") == "RAISES"]

    jobs = []
    for step in raised_by:
        jobs += [j for j in g.predecessors(step)
                 if g.edges[j, step].get("rel") == "HAS_STEP"]

    return json.dumps({
        "found": True, "abend": abend_code,
        "remediations": remediations,
        "raised_by_steps": raised_by,
        "jobs": sorted(set(jobs)),   # sorted so the output is stable across runs
    })

def downstream_impact(job: str) -> str:
    """Jobs that fail or delay if this job abends (the cascade)."""
    if job not in g:
        return json.dumps({"found": False, "job": job})
    affected = []
    for u, v in nx.edge_dfs(g, job):
        # Keep only job-to-job TRIGGERS edges; skip repeats and cycles back to the start.
        if g.edges[u, v].get("rel") == "TRIGGERS" and v != job and v not in affected:
            affected.append(v)
    return json.dumps({"found": True, "job": job, "downstream_jobs": affected})
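
The tools above load abend_graph.gml, so the graph built in Step 2 has to be written to that file first. Here is a minimal sketch of the round trip on a hand-built toy graph, using nx.write_gml and nx.read_gml. The job names are invented, cascade is a hypothetical helper, and restricting the traversal to an edge subgraph of TRIGGERS edges is one possible way to keep step and abend edges out of the cascade.

```python
import os
import tempfile
import networkx as nx

# Toy graph with invented job names; in practice this comes from the parser.
g = nx.DiGraph()
for job in ("JOB_PAYROLL", "JOB_LEDGER", "JOB_REPORT"):
    g.add_node(job, type="JOB")
g.add_node("JOB_PAYROLL::STEP04", type="STEP")
g.add_node("S0C7", type="ABEND")
g.add_edge("JOB_PAYROLL", "JOB_PAYROLL::STEP04", rel="HAS_STEP")
g.add_edge("JOB_PAYROLL::STEP04", "S0C7", rel="RAISES")
g.add_edge("JOB_PAYROLL", "JOB_LEDGER", rel="TRIGGERS")
g.add_edge("JOB_LEDGER", "JOB_REPORT", rel="TRIGGERS")

# Persist and reload; a temp directory keeps the sketch free of side effects.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "abend_graph.gml")
    nx.write_gml(g, path)
    loaded = nx.read_gml(path)

def cascade(graph: nx.DiGraph, job: str) -> list[str]:
    """Jobs reachable from `job` through TRIGGERS edges only."""
    trigger_edges = [(u, v) for u, v, d in graph.edges(data=True)
                     if d.get("rel") == "TRIGGERS"]
    triggers = graph.edge_subgraph(trigger_edges)
    if job not in triggers:
        return []
    return sorted(nx.descendants(triggers, job))

downstream = cascade(loaded, "JOB_PAYROLL")
```

Node and edge attributes (type, rel, text) survive the round trip as long as their keys are plain alphanumeric names, which is what the GML format expects.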

Step 4 — Wire into AgentScope

The graph handles structure; vector RAG stays as the fallback for open-ended prose questions. The agent picks the right tool.

# agent.py
import asyncio
from agentscope.agent import ReActAgent
from agentscope.tool import Toolkit
from agentscope.message import Msg
from graph_tools import query_abend, downstream_impact
from rag_tools import retrieve_docs   # your existing vector RAG

toolkit = Toolkit()
toolkit.register_tool_function(query_abend)        # exact structured lookup
toolkit.register_tool_function(downstream_impact)  # dependency cascade
toolkit.register_tool_function(retrieve_docs)      # semantic fallback

agent = ReActAgent(
    name="mainframe_ops",
    sys_prompt=(
        "You troubleshoot mainframe batch jobs. For a specific abend code, "
        "ALWAYS call query_abend first. To assess the blast radius of a failed "
        "job, call downstream_impact. Use retrieve_docs only for open-ended "
        "questions the graph cannot answer. Never invent abend semantics."
    ),
    model=...,           # your model wrapper
    toolkit=toolkit,
)

async def main():
    q = Msg("user",
            "JOB_PAYROLL step 4 hit S0C7. What do I do, and what breaks downstream?",
            role="user")
    print((await agent(q)).content)

asyncio.run(main())

Why This Lifts Accuracy

No similarity error on codes: query_abend("S0C7") walks ABEND -> RESOLVED_BY directly. S0C7 and S0C4 can no longer be confused because there is no embedding step in the lookup path.

Real cascade reasoning: downstream_impact traverses TRIGGERS edges to return the actual blast radius, something flat vector RAG structurally cannot do.

Graceful fallback: Vector retrieval stays for "how do I..." prose questions, so you lose nothing and gain structured precision.

The Honest Order of Operations

Build hybrid retrieval (exact code match + vector) first, then this graph layer. Most mainframe abend RAGs become good enough right here at Step 4 — the probabilistic layer (PKG + MCMC) is worth adding only if you have genuine multi-cause, cross-job causal uncertainty left to resolve after the graph is in place.
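
For the hybrid step, one minimal shape is: pull any abend code out of the question, keep only chunks that mention that exact code, and rank by vector similarity inside that filtered set (or over everything when the question has no code). The sketch below assumes chunks are dicts with a text and a precomputed vec, uses toy two-dimensional vectors in place of real embeddings, and every function name in it is hypothetical.

```python
import math
import re

# Assumption: system abends are S + 3 hex digits, user abends U + 4 decimal digits.
ABEND_RE = re.compile(r"\b(S[0-9A-F]{3}|U\d{4})\b")

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hybrid_retrieve(question, question_vec, chunks, k=3):
    """Exact code filter first, vector ranking second."""
    codes = set(ABEND_RE.findall(question))
    if codes:
        # Keep only chunks that mention one of the codes verbatim.
        pool = [c for c in chunks if codes & set(ABEND_RE.findall(c["text"]))]
    else:
        pool = list(chunks)
    ranked = sorted(pool, key=lambda c: cosine(question_vec, c["vec"]), reverse=True)
    return ranked[:k]

# Toy chunks: the vectors are made up so that S0C4 is the nearest neighbour.
chunks = [
    {"text": "S0C7 data exception: check packed-decimal input", "vec": [0.90, 0.10]},
    {"text": "S0C4 protection exception: check subscripts", "vec": [0.95, 0.05]},
]
hits = hybrid_retrieve("step 4 hit S0C7", [1.0, 0.0], chunks)
```

With these toy vectors a pure similarity ranking would put the S0C4 chunk first; the exact-code filter removes it from the pool before similarity is ever computed.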

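One thing to confirm: verify the AgentScope import paths against your installed version, since the agentscope.agent and agentscope.tool namespaces have moved across releases. Check the constructor and tool contract while you are there: the 1.x examples pass a formatter to ReActAgent alongside the model and have tool functions return a ToolResponse rather than a bare string, so query_abend and downstream_impact may need a thin wrapper.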