AgentKai Labs
Practical research notebooks documenting real experiments, model vulnerability scans, prompt injection testing, and tool integrations. We build, learn, document, and share.
Deterministic vs Stochastic Systems
Detailing why traditional software testing methodologies fall short when confronted with probabilistic AI behaviors and how behavioral evaluation bridges the gap.
Getting Started with Promptfoo
Setting up automated evaluation pipelines for LLM application outputs, writing custom assertions, and analyzing local evaluation metrics.
Prompt Injection Basics
Analyzing direct and indirect injection vectors, staging proof-of-concept injection exploits, and building robust prompt guardrails.
Testing a Willowfish AI Guide
Evaluating conversational instruction-following, customer journey boundaries, and escalation triggers in a live chatbot context.
PyRIT First Experiments
Testing Microsoft's Python Risk Identification Tool (PyRIT) to automate adversarial red-teaming pipelines against target model endpoints.
Garak Vulnerability Scanning
Running the Garak vulnerability scanner to audit LLM API integrations for data leaks, toxicity, prompt injections, and structural hallucinations.
RAG Evaluation Fundamentals
Measuring grounding, context retrieval precision, and answer faithfulness using Ragas (Retrieval-Augmented Generation Assessment).
Our Content Philosophy
"Build. Learn. Document. Share."
AgentKai does not publish generic marketing content. Every lab entry, article, and case study originates from real projects, active system evaluations, and vulnerabilities we actually discovered. We document our learning journey in the open to advance LLM safety and engineering robustness.
