Building User Feedback into the Improvement Loop

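01 TL;DR

A two-layer design, rating buttons plus free text, is a practical starting point for collecting feedback.
Normalize collected data around feedbackId / sessionId / rating / comment / context so that the structure holds up under later analysis.
Use an LLM to classify free-text comments automatically, extracting category and urgency as structured fields.
Prioritize by sorting mechanically on a three-axis score: recurrence frequency × impact × fixability.
Use high-priority feedback as the evidence for prompt improvements, and feed it into the evaluation dataset so that the effect of each change can be quantified.
The sample code is written in TypeScript (Node.js v20.x / Next.js v14.x).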

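02 Introduction

Who this article is for

This article is for readers who run an LLM-based product in production and recognize problems like these:

Negative user feedback is accumulating, but it is unclear how to organize it and use it for improvement
A feedback collection mechanism exists, but the data sits unused
Prompts are being improved, but the effect cannot be explained quantitatively

It assumes basic knowledge of TypeScript and Node.js, plus some experience calling an LLM API (this article uses Anthropic Claude as the example).

Prerequisites

The article assumes that the agent you want feedback on already exists and receives a reasonable amount of user traffic. For the basic concepts of evaluation, reading the related article first is recommended (related article: Evaluation Design Basics: How to Measure Agent Quality).

What you will get from this article

Design guidelines and a sample implementation for the feedback collection UI
Schema design for the collected data
Automatic classification of free-text comments (category extraction with an LLM)
A prioritization framework
Steps for reflecting feedback in prompt improvements and the evaluation dataset
How to verify the effect of the improvement cycle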

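03 Why the Feedback Loop Matters

There are several ways to measure agent quality: offline evaluation datasets, metrics monitoring on logs, and periodic human review. Each has its value.

On their own, though, they tend to miss what users are actually struggling with. Evaluation datasets skew toward scenarios the designers anticipated, and log metrics capture errors and latency but not the experience of a response that missed the point.

User feedback covers this blind spot. Because users report what actually went wrong in their own use case, it can surface problems the development side never anticipated.

Using feedback directly for improvement, however, runs into obstacles: there is too much to read, the content is too vague to tell what to fix, and there is no criterion for deciding what to fix first.

This article shows how to build, step by step, a mechanism that keeps feedback from ending at collection.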

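04 Designing the Feedback Collection UI

Why a two-layer design is practical

A feedback UI is easier to work with when it is designed in two layers: rating buttons (quantitative) and free text (qualitative).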

┌────────────────────────────────────────────────────────┐
│  Agent response                                        │
│  ──────────────────────────────────────────────────    │
│  Was this response helpful?                            │
│                                                        │
│  [ 👍 Helpful ]   [ 👎 Not helpful ]                   │
│                                                        │
│  (expanded when the negative rating is selected)       │
│  ┌────────────────────────────────────────────────┐    │
│  │ What went wrong? (optional)                    │    │
│  │ - The information was wrong                    │    │
│  │ - It misunderstood my question                 │    │
│  │ - The answer was incomplete                    │    │
│  │ - Other: [free-text field]                     │    │
│  └────────────────────────────────────────────────┘    │
└────────────────────────────────────────────────────────┘

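Rating buttons alone do not tell you what the problem was. Free text alone raises the psychological barrier to responding, and the response rate drops.

Expanding the free-text field only for users who chose the negative rating lets you capture what went wrong while making the comment feel less mandatory. If positive ratings can also carry an optional comment, you collect positive examples as well: which responses were especially good. (The sample component in this section keeps things simple and submits positive ratings immediately, without a comment field.)

Rating granularity

Whether to use a binary 👍/👎 rating or something like a 5-point scale depends on the use case.

The advantage of a binary rating is a low decision cost for the user, which tends to raise the response rate. A good/bad split also keeps later processing (classification and prioritization) simple.

A 5-point scale captures nuance, but users tend to hesitate between, say, 3 and 4, and the scores become harder to interpret.

In our experience, starting with a binary rating, collecting data, and then adding granularity where needed is the practical order.

Implementing the UI component (Next.js App Router)

Below is a sample implementation in Next.js v14.x + TypeScript.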

// components/FeedbackWidget.tsx

"use client";

import { useState } from "react";

export interface FeedbackPayload {
  sessionId: string;
  messageId: string;
  agentId: string;
  promptVersion: string;
  rating: "positive" | "negative";
  category?: string;
  comment?: string;
  userInput: string;
  agentOutput: string;
}

interface FeedbackWidgetProps {
  sessionId: string;
  messageId: string;
  agentId: string;
  promptVersion: string;
  userInput: string;
  agentOutput: string;
  onSubmit: (payload: FeedbackPayload) => Promise<void>;
}

const NEGATIVE_CATEGORIES = [
  { value: "wrong_info", label: "The information was wrong" },
  { value: "misunderstood", label: "It misunderstood my question" },
  { value: "incomplete", label: "The answer was incomplete" },
  { value: "other", label: "Other" },
] as const;

type NegativeCategory = (typeof NEGATIVE_CATEGORIES)[number]["value"];

export function FeedbackWidget({
  sessionId,
  messageId,
  agentId,
  promptVersion,
  userInput,
  agentOutput,
  onSubmit,
}: FeedbackWidgetProps) {
  const [rating, setRating] = useState<"positive" | "negative" | null>(null);
  const [category, setCategory] = useState<NegativeCategory | null>(null);
  const [comment, setComment] = useState("");
  const [submitted, setSubmitted] = useState(false);
  const [isSubmitting, setIsSubmitting] = useState(false);

  async function handleSubmit() {
    if (!rating) return;

    setIsSubmitting(true);
    try {
      await onSubmit({
        sessionId,
        messageId,
        agentId,
        promptVersion,
        rating,
        category: category ?? undefined,
        comment: comment.trim() || undefined,
        userInput,
        agentOutput,
      });
      setSubmitted(true);
    } finally {
      setIsSubmitting(false);
    }
  }

  if (submitted) {
    return (
      <p className="feedback-thanks">Thank you for your feedback.</p>
    );
  }

  return (
    <div className="feedback-widget">
      <p className="feedback-question">Was this response helpful?</p>

      <div className="feedback-buttons">
        <button
          type="button"
          aria-pressed={rating === "positive"}
          onClick={() => {
            setRating("positive");
            // Positive ratings are submitted immediately.
            // agentId and promptVersion are required by FeedbackPayload.
            onSubmit({
              sessionId,
              messageId,
              agentId,
              promptVersion,
              rating: "positive",
              userInput,
              agentOutput,
            }).then(() => setSubmitted(true));
          }}
        >
          Helpful
        </button>

        <button
          type="button"
          aria-pressed={rating === "negative"}
          onClick={() => setRating("negative")}
        >
          Not helpful
        </button>
      </div>

      {rating === "negative" && (
        <div className="feedback-detail">
          <fieldset>
            <legend>What went wrong? (optional)</legend>
            {NEGATIVE_CATEGORIES.map(({ value, label }) => (
              <label key={value}>
                <input
                  type="radio"
                  name="feedback-category"
                  value={value}
                  checked={category === value}
                  onChange={() => setCategory(value)}
                />
                {label}
              </label>
            ))}
          </fieldset>

          {category === "other" && (
            <textarea
              value={comment}
              onChange={(e) => setComment(e.target.value)}
              placeholder="What specifically was the problem?"
              rows={3}
              maxLength={500}
            />
          )}

          <button
            type="button"
            onClick={handleSubmit}
            disabled={isSubmitting}
          >
            {isSubmitting ? "Submitting..." : "Submit"}
          </button>
        </div>
      )}
    </div>
  );
}
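
The widget leaves the actual network call to the `onSubmit` prop. As a minimal sketch of one way to wire it to the `/api/feedback` endpoint implemented later in this article, the helper below builds an `onSubmit` function around an injected fetch implementation. The names `createFeedbackSubmitter` and `FetchLike` are hypothetical and not part of the original code; injecting the fetch function is only a choice made here to keep the sketch testable.

```typescript
// Minimal structural type for the parts of fetch this sketch uses (assumption).
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string }
) => Promise<{ ok: boolean; status: number }>;

// Returns a function suitable for the widget's onSubmit prop.
// `endpoint` defaults to the route used in this article.
function createFeedbackSubmitter<T>(
  fetchImpl: FetchLike,
  endpoint = "/api/feedback"
): (payload: T) => Promise<void> {
  return async (payload) => {
    const res = await fetchImpl(endpoint, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(payload),
    });
    // Surface failures so the caller can decide whether to retry or show an error.
    if (!res.ok) {
      throw new Error(`Feedback submission failed with status ${res.status}`);
    }
  };
}
```

In a browser component, passing the global `fetch` as `fetchImpl` should satisfy this type, and the returned function can be handed to `FeedbackWidget` as `onSubmit`.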

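05 Schema Design for Collected Data

Why schema design matters

When you analyze collected data later, an inadequate structure means you cannot aggregate along the dimensions you want. If you want to ask which agent receives the most negative ratings, or which category dominates among negative ratings, the fields that support those questions have to exist from the start.

Changing a schema after the fact is costly, so design it before collection begins.

Type definitions for feedback records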

// types/feedback.ts

export type FeedbackRating = "positive" | "negative";

export type FeedbackCategory =
  | "wrong_info"       // The information is wrong
  | "misunderstood"    // Did not understand the intent of the question
  | "incomplete"       // The answer is incomplete
  | "hallucination"    // Generated information that does not exist
  | "harmful"          // Harmful or inappropriate content
  | "too_long"         // The answer is too long
  | "too_short"        // The answer is too short
  | "formatting"       // Formatting problem
  | "other";           // Anything else

// Feedback as entered by the user (raw data)
export interface FeedbackRecord {
  feedbackId: string;          // UUID
  sessionId: string;           // Conversation session ID
  messageId: string;           // Target message ID
  agentId: string;             // Which agent the feedback is about
  promptVersion: string;       // Prompt version in use at the time
  rating: FeedbackRating;
  userCategory?: FeedbackCategory;  // Category chosen by the user (optional)
  comment?: string;                 // Free text (optional)
  userInput: string;           // User input (kept for context)
  agentOutput: string;         // Agent output (kept for context)
  createdAt: string;           // ISO 8601
  metadata?: Record<string, unknown>;  // For extensions
}

// Feedback after automatic classification (processed)
export interface ClassifiedFeedback extends FeedbackRecord {
  autoCategory: FeedbackCategory;    // Result of automatic classification
  autoCategoryConfidence: number;    // Classification confidence, 0 to 1
  urgencyLevel: "critical" | "high" | "medium" | "low";
  autoTags: string[];                // Extra tags (when several problems are mixed)
  classifiedAt: string;
}

// Feedback after prioritization (actionable)
export interface PrioritizedFeedback extends ClassifiedFeedback {
  priorityScore: number;             // Priority score (higher is handled first)
  frequencyScore: number;            // Recurrence frequency score
  impactScore: number;               // Impact-on-user score
  fixabilityScore: number;           // Fixability score
  linkedCaseIds: string[];           // Case IDs linked as the same kind of problem
  status: "pending" | "in_review" | "fixed" | "wont_fix";
  scoredAt: string;
}

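Why retaining context matters

A note on why userInput and agentOutput are included in the feedback record.

To check later which response to which input received a bad rating, you need the exchange as it happened. Storing only the session ID and joining against logs afterwards is possible, but lookups become difficult when the logs live in separate storage or have a different retention period.

Including the context in the feedback record makes analysis more self-contained. When storing input that may contain personal information, follow the applicable rules for handling personal data.

Implementing the API endpoint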

// app/api/feedback/route.ts（Next.js App Router）

import { NextRequest, NextResponse } from "next/server";
import { randomUUID } from "crypto";
import type { FeedbackRecord, FeedbackRating } from "../../../types/feedback.js";

// In a real implementation, replace this with a write to your database
async function saveFeedback(record: FeedbackRecord): Promise<void> {
  // e.g. PostgreSQL / Firestore / DynamoDB
  console.log("[feedback] saved:", record.feedbackId);
}

export async function POST(request: NextRequest): Promise<NextResponse> {
  let body: unknown;

  try {
    body = await request.json();
  } catch {
    return NextResponse.json({ error: "Invalid JSON" }, { status: 400 });
  }

  if (!isValidFeedbackPayload(body)) {
    return NextResponse.json({ error: "Invalid payload" }, { status: 400 });
  }

  const record: FeedbackRecord = {
    feedbackId: randomUUID(),
    sessionId: body.sessionId,
    messageId: body.messageId,
    agentId: body.agentId,
    promptVersion: body.promptVersion,
    rating: body.rating,
    userCategory: body.category,
    comment: body.comment,
    userInput: body.userInput,
    agentOutput: body.agentOutput,
    createdAt: new Date().toISOString(),
  };

  await saveFeedback(record);

  return NextResponse.json({ feedbackId: record.feedbackId }, { status: 201 });
}

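// Raw request body. `category` reuses the FeedbackRecord field type so that
// assigning it to `userCategory` in POST type-checks.
interface FeedbackPayloadRaw {
  sessionId: string;
  messageId: string;
  agentId: string;
  promptVersion: string;
  rating: FeedbackRating;
  category?: FeedbackRecord["userCategory"];
  comment?: string;
  userInput: string;
  agentOutput: string;
}

const VALID_CATEGORIES: readonly string[] = [
  "wrong_info", "misunderstood", "incomplete", "hallucination",
  "harmful", "too_long", "too_short", "formatting", "other",
];

function isValidFeedbackPayload(value: unknown): value is FeedbackPayloadRaw {
  if (typeof value !== "object" || value === null) return false;

  const v = value as Record<string, unknown>;

  // Optional fields must be either absent or well-typed.
  const categoryOk =
    v.category === undefined ||
    (typeof v.category === "string" && VALID_CATEGORIES.includes(v.category));
  const commentOk = v.comment === undefined || typeof v.comment === "string";

  return (
    typeof v.sessionId === "string" &&
    typeof v.messageId === "string" &&
    typeof v.agentId === "string" &&
    typeof v.promptVersion === "string" &&
    (v.rating === "positive" || v.rating === "negative") &&
    typeof v.userInput === "string" &&
    typeof v.agentOutput === "string" &&
    categoryOk &&
    commentOk
  );
}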

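06 Automatic Classification of Free-Text Comments

The limits of manual classification

While feedback arrives at a few dozen items a day, classifying it by eye is feasible. Once it passes a few hundred, manual classification quickly becomes the bottleneck.

Free-text comments vary widely, from something as vague as "this seems off" to something as specific as "I asked under assumption X, but the answer assumed Y instead".

Delegating to an LLM the work of mapping these comments to fixed categories and judging their urgency lowers the cost of classification substantially.

Designing the classification prompt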

// feedback/classifier.ts

import Anthropic from "@anthropic-ai/sdk"; // @anthropic-ai/sdk@0.27.0
import type {
  FeedbackRecord,
  FeedbackCategory,
  ClassifiedFeedback,
} from "../types/feedback.js";

const client = new Anthropic();

interface ClassificationResult {
  category: FeedbackCategory;
  confidence: number;
  urgencyLevel: "critical" | "high" | "medium" | "low";
  tags: string[];
  reasoning: string;
}

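function buildClassificationPrompt(record: FeedbackRecord): string {
  return `
You are an assistant that classifies user feedback.
Analyze the feedback below and determine its category, urgency, and tags.

## Category definitions

- wrong_info: contains factually incorrect information
- misunderstood: did not correctly understand the intent of the user's question
- incomplete: required information is missing, or the answer stops partway
- hallucination: contains nonexistent information or unsupported assertions
- harmful: contains harmful, inappropriate, or ethically problematic content
- too_long: the answer is so long that the user struggles to find what they need
- too_short: the answer is too short and lacks required information
- formatting: problems with format, structure, or readability
- other: problems that fit none of the above

## Urgency definitions

- critical: needs immediate action, e.g. harmful content, incorrect medical or legal information, security risks
- high: misinformation or malfunction severe enough to disrupt the user's work
- medium: degrades the user experience, but work can continue
- low: minor quality improvement request

## Input

User input:
${record.userInput}

Agent output:
${record.agentOutput}

Category chosen by the user (optional):
${record.userCategory ?? "not selected"}

User comment (optional):
${record.comment ?? "none"}

## Instructions

Classify the feedback as appropriately as possible based on the information above.
Output only JSON in the following format. Do not include any other text.

{
  "category": "<category value>",
  "confidence": <decimal between 0 and 1>,
  "urgencyLevel": "<urgency>",
  "tags": ["<tag 1>", "<tag 2>"],
  "reasoning": "<1-2 sentences explaining the decision>"
}
`.trim();
}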

export async function classifyFeedback(
  record: FeedbackRecord
): Promise<ClassifiedFeedback> {
  // Positive ratings skip classification (saves cost)
  if (record.rating === "positive") {
    return {
      ...record,
      autoCategory: "other",
      autoCategoryConfidence: 1.0,
      urgencyLevel: "low",
      autoTags: [],
      classifiedAt: new Date().toISOString(),
    };
  }

  const prompt = buildClassificationPrompt(record);

  const response = await client.messages.create({
    model: "claude-haiku-4-5",
    max_tokens: 512,
    messages: [{ role: "user", content: prompt }],
  });

  const content = response.content[0];
  if (content.type !== "text") {
    throw new Error("Unexpected response type from classification");
  }

  const result = parseClassificationResult(content.text);

  return {
    ...record,
    autoCategory: result.category,
    autoCategoryConfidence: result.confidence,
    urgencyLevel: result.urgencyLevel,
    autoTags: result.tags,
    classifiedAt: new Date().toISOString(),
  };
}

function parseClassificationResult(rawText: string): ClassificationResult {
  const jsonMatch = rawText.match(/\{[\s\S]*\}/);
  if (!jsonMatch) {
    throw new Error(`Failed to extract JSON from classification response`);
  }

  const parsed = JSON.parse(jsonMatch[0]) as Partial<ClassificationResult>;

  const validCategories: FeedbackCategory[] = [
    "wrong_info", "misunderstood", "incomplete", "hallucination",
    "harmful", "too_long", "too_short", "formatting", "other",
  ];

  const validUrgency = ["critical", "high", "medium", "low"] as const;

  return {
    category: validCategories.includes(parsed.category as FeedbackCategory)
      ? (parsed.category as FeedbackCategory)
      : "other",
    confidence:
      typeof parsed.confidence === "number"
        ? Math.max(0, Math.min(1, parsed.confidence))
        : 0.5,
    urgencyLevel: validUrgency.includes(parsed.urgencyLevel as typeof validUrgency[number])
      ? (parsed.urgencyLevel as typeof validUrgency[number])
      : "medium",
    tags: Array.isArray(parsed.tags)
      ? parsed.tags.filter((t): t is string => typeof t === "string")
      : [],
    reasoning: typeof parsed.reasoning === "string" ? parsed.reasoning : "",
  };
}

// Usage example

import { classifyFeedback } from "./feedback/classifier.js";
import type { FeedbackRecord } from "./types/feedback.js";

const record: FeedbackRecord = {
  feedbackId: "fb-001",
  sessionId: "sess-abc",
  messageId: "msg-xyz",
  agentId: "bizplan-advisor",
  promptVersion: "2.1.0",
  rating: "negative",
  userCategory: "other",
  comment: "I asked for an analysis of a competitor, but it talked about our own strengths",
  userInput: "Please analyze the strengths and weaknesses of competitor Company A",
  agentOutput: "Your company's strengths include the following points...",
  createdAt: "2026-06-11T10:00:00Z",
};

const classified = await classifyFeedback(record);
// The values below are illustrative; model output varies between runs.
console.log(classified.autoCategory);    // e.g. "misunderstood"
console.log(classified.urgencyLevel);    // e.g. "high"
console.log(classified.autoTags);        // e.g. ["context_confusion", "wrong_subject"]

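Managing cost with batch processing

You can classify items one at a time in real time, or process them together in batches.

Real-time classification has the advantage that classification finishes as soon as the feedback is collected, but cost and processing time rise during periods when feedback is concentrated.

In our experience, a practical design processes in real time only the cases likely to be judged "critical" and sends everything else to an hourly batch.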

// feedback/batch-classifier.ts

import { classifyFeedback } from "./classifier.js";
import type { FeedbackRecord, ClassifiedFeedback } from "../types/feedback.js";

// Limit concurrency to stay under rate limits
async function classifyBatch(
  records: FeedbackRecord[],
  concurrency = 5
): Promise<ClassifiedFeedback[]> {
  const results: ClassifiedFeedback[] = [];

  for (let i = 0; i < records.length; i += concurrency) {
    const chunk = records.slice(i, i + concurrency);
    const chunkResults = await Promise.all(
      chunk.map((r) => classifyFeedback(r))
    );
    results.push(...chunkResults);

    // Wait briefly between chunks (leaves headroom for rate limits)
    if (i + concurrency < records.length) {
      await new Promise((resolve) => setTimeout(resolve, 500));
    }
  }

  return results;
}

export { classifyBatch };

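Example batch run (output log; this assumes logging added around classifyBatch, which the code above does not emit itself):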

[batch-classifier] Processing 48 records (concurrency=5)
[batch-classifier] Chunk 1/10 done (5 records, 2340ms)
[batch-classifier] Chunk 2/10 done (5 records, 1890ms)
...
[batch-classifier] Batch complete: 48 classified, 0 errors
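
One thing to be aware of in `classifyBatch` is that `Promise.all` rejects as soon as any single classification fails, which discards the results of the whole batch. The following is a minimal sketch of an alternative that uses `Promise.allSettled` to keep successes and collect failures for a later retry. The names `runInChunks` and `ChunkedResult` are hypothetical, and `worker` stands in for `classifyFeedback`; this is one possible shape, not the authors' implementation.

```typescript
// Result of a chunked run: successes and failures are kept separately.
interface ChunkedResult<T, R> {
  succeeded: R[];
  failed: Array<{ item: T; reason: unknown }>;
}

// Runs `worker` over `items` in chunks of `concurrency`.
// A failing item does not abort the rest of the batch.
async function runInChunks<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  concurrency = 5
): Promise<ChunkedResult<T, R>> {
  const succeeded: R[] = [];
  const failed: Array<{ item: T; reason: unknown }> = [];

  for (let i = 0; i < items.length; i += concurrency) {
    const chunk = items.slice(i, i + concurrency);
    const settled = await Promise.allSettled(chunk.map((item) => worker(item)));

    settled.forEach((outcome, index) => {
      if (outcome.status === "fulfilled") {
        succeeded.push(outcome.value);
      } else {
        // Keep the original item so it can be retried or inspected later.
        failed.push({ item: chunk[index], reason: outcome.reason });
      }
    });
  }

  return { succeeded, failed };
}
```

With this shape, the `failed` list gives a batch job an error count to log, and the failed items can be fed into the next run.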

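07 A Prioritization Framework

Deciding mechanically what to fix first

As the volume of feedback grows, deciding where to start becomes a bottleneck in itself. Debating it as a team every time takes time and makes consistent decisions hard to maintain.

We compute priority mechanically from a three-axis score: recurrence frequency × impact × fixability.

Definition of the three-axis score

Priority score = frequency score × impact score × fixability score

Each score is a normalized value from 0 to 1

Axis	What it measures	Criterion for a high score
Recurrence frequency	Many reports share the same category or tag	10 or more negative ratings of the same kind within 7 days
Impact	Severity for the user	urgencyLevel is critical / high
Fixability	Whether a prompt change can resolve it	autoCategory is misunderstood / incomplete

Impact and fixability are separate axes in order to distinguish problems that are severe but hard to fix from problems that are minor but quick to fix. Prioritizing a problem with low fixability (for example, one caused by the capability limits of the model itself) does not produce an improvement no matter how hard you push.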
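
To make the effect of the multiplication concrete, here is a small worked example under assumed axis values. The helper names `clamp01` and `priorityScore` and the input numbers are illustrative only.

```typescript
// Each axis is expected to be a normalized value in [0, 1].
type Axes = { frequency: number; impact: number; fixability: number };

function clamp01(x: number): number {
  return Math.max(0, Math.min(1, x));
}

// Product of the three axes, as in the formula above.
function priorityScore({ frequency, impact, fixability }: Axes): number {
  return clamp01(frequency) * clamp01(impact) * clamp01(fixability);
}

// A severe issue that a prompt change is unlikely to fix...
const severeButHardToFix = priorityScore({ frequency: 0.5, impact: 1.0, fixability: 0.3 });
// ...versus a moderate issue that is easy to fix.
const moderateButEasyToFix = priorityScore({ frequency: 0.5, impact: 0.4, fixability: 1.0 });

console.log(severeButHardToFix.toFixed(2), moderateButEasyToFix.toFixed(2));
// prints "0.15 0.20"
```

Because the axes are multiplied, a value near zero on any one axis pulls the whole score toward zero. A rare but critical report can therefore rank low, so it may be worth routing critical items through a separate path rather than relying on the score alone.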

Implementing the priority score calculation

// feedback/prioritizer.ts

import type { ClassifiedFeedback, PrioritizedFeedback } from "../types/feedback.js";

interface FrequencyMap {
  [categoryAndTag: string]: number;
}

const URGENCY_IMPACT_SCORE: Record<string, number> = {
  critical: 1.0,
  high: 0.75,
  medium: 0.4,
  low: 0.1,
};

// Categories that are easy to address with a prompt change
const HIGH_FIXABILITY_CATEGORIES = new Set([
  "misunderstood",
  "incomplete",
  "too_long",
  "too_short",
  "formatting",
]);

// Categories that are hard to address with a prompt change (depend on model capability)
const LOW_FIXABILITY_CATEGORIES = new Set([
  "hallucination",
  "wrong_info",
]);

export function scoreFixability(category: string): number {
  if (HIGH_FIXABILITY_CATEGORIES.has(category)) return 1.0;
  if (LOW_FIXABILITY_CATEGORIES.has(category)) return 0.3;
  return 0.5; // everything else
}

export function buildFrequencyMap(
  feedbacks: ClassifiedFeedback[],
  windowDays = 7
): FrequencyMap {
  const cutoff = new Date();
  cutoff.setDate(cutoff.getDate() - windowDays);
  // Compare as epoch milliseconds: ISO strings with and without fractional
  // seconds do not always order correctly when compared as plain strings.
  const cutoffMs = cutoff.getTime();

  const map: FrequencyMap = {};

  for (const fb of feedbacks) {
    if (fb.rating !== "negative") continue;
    if (Date.parse(fb.classifiedAt) < cutoffMs) continue;

    // Category alone
    const catKey = fb.autoCategory;
    map[catKey] = (map[catKey] ?? 0) + 1;

    // Category × tag combinations
    for (const tag of fb.autoTags) {
      const tagKey = `${fb.autoCategory}::${tag}`;
      map[tagKey] = (map[tagKey] ?? 0) + 1;
    }
  }

  return map;
}

export function scoreFrequency(
  feedback: ClassifiedFeedback,
  frequencyMap: FrequencyMap,
  maxCount = 30
): number {
  const catCount = frequencyMap[feedback.autoCategory] ?? 0;
  const tagCounts = feedback.autoTags.map(
    (tag) => frequencyMap[`${feedback.autoCategory}::${tag}`] ?? 0
  );
  const maxTagCount = tagCounts.length > 0 ? Math.max(...tagCounts) : 0;
  const count = Math.max(catCount, maxTagCount);

  // Normalize with maxCount as the upper bound
  return Math.min(count / maxCount, 1.0);
}

export function prioritize(
  feedbacks: ClassifiedFeedback[]
): PrioritizedFeedback[] {
  const freqMap = buildFrequencyMap(feedbacks);
  const now = new Date().toISOString();

  return feedbacks
    .map((fb): PrioritizedFeedback => {
      const frequencyScore = scoreFrequency(fb, freqMap);
      const impactScore = URGENCY_IMPACT_SCORE[fb.urgencyLevel] ?? 0.4;
      const fixabilityScore = scoreFixability(fb.autoCategory);
      const priorityScore = frequencyScore * impactScore * fixabilityScore;

      return {
        ...fb,
        priorityScore,
        frequencyScore,
        impactScore,
        fixabilityScore,
        linkedCaseIds: [],
        status: "pending",
        scoredAt: now,
      };
    })
    .sort((a, b) => b.priorityScore - a.priorityScore);
}

// Usage example

import { prioritize } from "./feedback/prioritizer.js";
import type { ClassifiedFeedback } from "./types/feedback.js";

const feedbacks: ClassifiedFeedback[] = [
  {
    feedbackId: "fb-001",
    // ... omitted ...
    rating: "negative",
    autoCategory: "misunderstood",
    autoCategoryConfidence: 0.9,
    urgencyLevel: "high",
    autoTags: ["context_confusion"],
    classifiedAt: "2026-06-11T09:00:00Z",
    // ... other fields omitted ...
  } as ClassifiedFeedback,
  {
    feedbackId: "fb-002",
    // ... omitted ...
    rating: "negative",
    autoCategory: "too_long",
    autoCategoryConfidence: 0.8,
    urgencyLevel: "low",
    autoTags: [],
    classifiedAt: "2026-06-11T09:30:00Z",
  } as ClassifiedFeedback,
];

const prioritized = prioritize(feedbacks);

prioritized.forEach((fb) => {
  console.log(
    `[${fb.feedbackId}] score=${fb.priorityScore.toFixed(3)} category=${fb.autoCategory}`
  );
});

// Example output. Scores depend on how many records fall inside the 7-day window.
// The two lines below show how fb-001 and fb-002 would score if the window held
// about 10 "misunderstood" and 4 "too_long" reports. With only the two records
// above, both inside the window, the scores would be 0.025 and 0.003.
// [fb-001] score=0.250 category=misunderstood
// [fb-002] score=0.013 category=too_long

Grouping to find clusters of the same problem

Beyond sorting by priority score, grouping problems of the same kind shows at a glance how many users are running into the same issue.

// feedback/grouper.ts

import type { PrioritizedFeedback } from "../types/feedback.js";

export interface FeedbackGroup {
  groupKey: string;          // e.g. "misunderstood::context_confusion"
  category: string;
  tags: string[];
  count: number;
  maxPriorityScore: number;
  samples: PrioritizedFeedback[];  // Representative samples (up to 5)
  promptVersions: string[];        // Prompt versions involved in this group
}

export function groupFeedbacks(
  feedbacks: PrioritizedFeedback[]
): FeedbackGroup[] {
  const groups = new Map<string, PrioritizedFeedback[]>();

  for (const fb of feedbacks) {
    if (fb.rating !== "negative") continue;

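    // Sort a copy: Array.prototype.sort mutates in place, and the input
    // records should not be reordered as a side effect of grouping.
    const key =
      fb.autoTags.length > 0
        ? `${fb.autoCategory}::${[...fb.autoTags].sort().join(":")}`
        : fb.autoCategory;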

    const existing = groups.get(key) ?? [];
    groups.set(key, [...existing, fb]);
  }

  return Array.from(groups.entries())
    .map(([groupKey, items]): FeedbackGroup => {
      const sorted = [...items].sort(
        (a, b) => b.priorityScore - a.priorityScore
      );
      return {
        groupKey,
        category: items[0].autoCategory,
        tags: items[0].autoTags,
        count: items.length,
        maxPriorityScore: sorted[0].priorityScore,
        samples: sorted.slice(0, 5),
        promptVersions: [...new Set(items.map((i) => i.promptVersion))],
      };
    })
    .sort((a, b) => b.maxPriorityScore - a.maxPriorityScore);
}

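08 Reflecting Feedback in Prompt Improvements

Using feedback as the evidence for prompt improvements

Feedback data is the material that turns prompt changes from "improving by feel" into "improving with evidence".

For the top issues that grouping reveals, improve the prompt in the following steps.

1. Review 5 to 10 samples (userInput / agentOutput pairs) from the group
2. Put into words what is wrong and how
3. Identify which instruction in the prompt is missing or misleading
4. Form an improvement hypothesis and revise the prompt
5. Verify the revised prompt against the evaluation dataset

Reflecting feedback in the evaluation dataset

The bad-response samples collected from feedback can be used directly as negative examples in the evaluation dataset.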

// feedback/to-eval-case.ts

import type { PrioritizedFeedback } from "../types/feedback.js";

export interface EvalCase {
  id: string;
  input: string;
  expectedBehavior: string;   // Direction of the expected response (need not be a full correct answer)
  actualOutput: string;       // The output in which the problem occurred
  sourceType: "feedback" | "manual" | "synthetic";
  sourceFeedbackId?: string;
  tags: string[];
  createdAt: string;
}

export function feedbackToEvalCase(
  feedback: PrioritizedFeedback,
  expectedBehavior: string
): EvalCase {
  return {
    id: `eval-from-fb-${feedback.feedbackId}`,
    input: feedback.userInput,
    expectedBehavior,
    actualOutput: feedback.agentOutput,
    sourceType: "feedback",
    sourceFeedbackId: feedback.feedbackId,
    tags: [feedback.autoCategory, ...feedback.autoTags],
    createdAt: new Date().toISOString(),
  };
}

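expectedBehavior does not need to be a complete correct answer. A statement of direction is enough, such as "when asked to analyze a competitor, answer about the specified company rather than the user's own company". This direction also serves as a hint for rubric-based evaluation.

(Related article: Building an Evaluation Dataset: How to Collect Golden Cases)

Comparison testing before and after a prompt change

Once feedback-derived cases have been added to the evaluation dataset, compare the prompt before and after the change.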

// feedback/regression-check.ts

import Anthropic from "@anthropic-ai/sdk";
import type { EvalCase } from "./to-eval-case.js";

const client = new Anthropic();

interface RegressionResult {
  caseId: string;
  promptVersionBefore: string;
  promptVersionAfter: string;
  outputBefore: string;
  outputAfter: string;
  isImproved: boolean | null;  // null = undecided (needs human review)
}

export async function runRegressionCheck(
  cases: EvalCase[],
  systemPromptBefore: string,
  systemPromptAfter: string,
  promptVersionBefore: string,
  promptVersionAfter: string
): Promise<RegressionResult[]> {
  const results: RegressionResult[] = [];

  for (const evalCase of cases) {
    const [outputBefore, outputAfter] = await Promise.all([
      callAgent(evalCase.input, systemPromptBefore),
      callAgent(evalCase.input, systemPromptAfter),
    ]);

    results.push({
      caseId: evalCase.id,
      promptVersionBefore,
      promptVersionAfter,
      outputBefore,
      outputAfter,
      // Automatic judgment is hard, so start with null and wait for review
      isImproved: null,
    });
  }

  return results;
}

async function callAgent(
  userInput: string,
  systemPrompt: string
): Promise<string> {
  const response = await client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    system: systemPrompt,
    messages: [{ role: "user", content: userInput }],
  });

  const content = response.content[0];
  if (content.type !== "text") return "";

  return content.text;
}

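Example regression check output (log format):

[regression] case eval-from-fb-001
  before (v2.0.0): Your company's strengths include...
  after  (v2.1.0): Company A's competitive advantages include its cost structure and...
  isImproved: null → awaiting human review

[regression] case eval-from-fb-007
  before (v2.0.0): Yes, that approach is fine (no supporting evidence)
  after  (v2.1.0): This is not stated explicitly in the official documentation, but...
  isImproved: null → awaiting human review

Because isImproved is hard to judge automatically, it is checked by hand at this stage. If the criteria for improvement can be stated clearly, automating the judgment with an LLM judge is worth considering.

(Related article: LLM-as-a-Judge: Using an LLM as an Evaluator)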
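
As a rough sketch of what such automation could look like, the two helpers below build a comparison prompt and map the judge's reply onto the `boolean | null` shape used by `isImproved`. The function names, the prompt wording, and the `verdict` values are assumptions made for illustration; the actual call to the model is omitted, and anything the parser cannot read falls back to `null` so that it still goes to human review.

```typescript
// null means "leave for human review", matching RegressionResult.isImproved.
type Verdict = boolean | null;

interface JudgeInput {
  input: string;
  expectedBehavior: string;
  outputBefore: string;
  outputAfter: string;
}

// Builds the text sent to the judge model (hypothetical wording).
function buildJudgePrompt(c: JudgeInput): string {
  return [
    "You are comparing two agent responses to the same user input.",
    `User input:\n${c.input}`,
    `Expected behavior:\n${c.expectedBehavior}`,
    `Response A (before):\n${c.outputBefore}`,
    `Response B (after):\n${c.outputAfter}`,
    'Answer with JSON only: {"verdict": "improved" | "not_improved" | "unsure"}',
  ].join("\n\n");
}

// Parses the judge's reply. Unreadable or ambiguous replies become null.
function parseJudgeVerdict(rawText: string): Verdict {
  const match = rawText.match(/\{[\s\S]*\}/);
  if (!match) return null;
  try {
    const parsed = JSON.parse(match[0]) as { verdict?: unknown };
    if (parsed.verdict === "improved") return true;
    if (parsed.verdict === "not_improved") return false;
    return null;
  } catch {
    return null;
  }
}
```

Feeding `expectedBehavior` from the evaluation case into the prompt gives the judge the same criterion a human reviewer would use. Spot-checking a sample of the judge's verdicts by hand remains advisable before trusting them.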

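09 The Effect Verification Cycle

What counts as "improved"

If you do not decide in advance how to measure the effect of a prompt update, you will be unsure how to judge it afterwards.

Here are the axes we use.

Axis	How to measure
Negative feedback rate (primary metric)	Negative feedback count / total response count, compared per prompt version. The summary code in this section divides by the number of feedback records instead, so supply response counts separately if you need the per-response rate
Change in category distribution	Whether the count for the targeted category went down, and whether another category went up (side effects)
Change in evaluation score	Trend of rubric scores in regression tests, with particular attention to feedback-derived cases
Session continuation rate (indirect metric)	Whether the user continued the conversation after a negative rating. This rate is expected to rise after an improvement

Aggregating feedback by prompt version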

// feedback/analytics.ts

import type { FeedbackRecord } from "../types/feedback.js";

export interface VersionFeedbackSummary {
  promptVersion: string;
  totalFeedbacks: number;
  negativeCount: number;
  positiveCount: number;
  negativeRate: number;
  categoryCounts: Record<string, number>;
}

export function summarizeByPromptVersion(
  feedbacks: FeedbackRecord[]
): VersionFeedbackSummary[] {
  const grouped = new Map<string, FeedbackRecord[]>();

  for (const fb of feedbacks) {
    const existing = grouped.get(fb.promptVersion) ?? [];
    grouped.set(fb.promptVersion, [...existing, fb]);
  }

  return Array.from(grouped.entries())
    .map(([version, records]): VersionFeedbackSummary => {
      const negatives = records.filter((r) => r.rating === "negative");
      const positives = records.filter((r) => r.rating === "positive");

      const categoryCounts: Record<string, number> = {};
      for (const fb of negatives) {
        if (fb.userCategory) {
          categoryCounts[fb.userCategory] =
            (categoryCounts[fb.userCategory] ?? 0) + 1;
        }
      }

      return {
        promptVersion: version,
        totalFeedbacks: records.length,
        negativeCount: negatives.length,
        positiveCount: positives.length,
        negativeRate:
          records.length > 0 ? negatives.length / records.length : 0,
        categoryCounts,
      };
    })
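    // numeric: true compares digit runs as numbers, so "2.10.0" sorts after "2.2.0"
    .sort((a, b) =>
      a.promptVersion.localeCompare(b.promptVersion, undefined, { numeric: true })
    );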
}

// Usage example

import { summarizeByPromptVersion } from "./feedback/analytics.js";
import type { FeedbackRecord } from "./types/feedback.js";

// Pass in real data
const feedbacks: FeedbackRecord[] = [
  // ... omitted ...
];

const summaries = summarizeByPromptVersion(feedbacks);

summaries.forEach((s) => {
  console.log(
    `v${s.promptVersion}: negative rate=${(s.negativeRate * 100).toFixed(1)}% (${s.negativeCount}/${s.totalFeedbacks})`
  );
});

// Example output (the numbers depend on the actual data):
// v2.0.0: negative rate=18.2% (40/220)
// v2.1.0: negative rate=11.4% (25/219)
// v2.2.0: negative rate=9.8% (22/224)
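
The per-version summaries can also be used to check for side effects, that is, whether a category other than the targeted one went up after a change. The sketch below compares two versions by each category's share of all feedback for that version. The names `VersionCounts`, `CategoryShift`, and `diffCategoryRates` are hypothetical; the input fields mirror `totalFeedbacks` and `categoryCounts` from `VersionFeedbackSummary`.

```typescript
// Subset of VersionFeedbackSummary needed for the comparison.
interface VersionCounts {
  totalFeedbacks: number;
  categoryCounts: Record<string, number>;
}

interface CategoryShift {
  category: string;
  rateBefore: number; // share of all feedback for the older version
  rateAfter: number;  // share of all feedback for the newer version
  delta: number;      // rateAfter - rateBefore (positive = got worse)
}

// Returns one entry per category seen in either version,
// sorted so that the categories that grew the most come first.
function diffCategoryRates(before: VersionCounts, after: VersionCounts): CategoryShift[] {
  const rate = (v: VersionCounts, category: string): number =>
    v.totalFeedbacks > 0 ? (v.categoryCounts[category] ?? 0) / v.totalFeedbacks : 0;

  const categories = Array.from(
    new Set([...Object.keys(before.categoryCounts), ...Object.keys(after.categoryCounts)])
  );

  return categories
    .map((category): CategoryShift => {
      const rateBefore = rate(before, category);
      const rateAfter = rate(after, category);
      return { category, rateBefore, rateAfter, delta: rateAfter - rateBefore };
    })
    .sort((a, b) => b.delta - a.delta);
}
```

Dividing by total feedback rather than by the number of negatives is a deliberate choice in this sketch: shares within the negatives always sum to one, so a drop in one category would make every other category look as if it had grown.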

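The improvement cycle as a whole

The material so far comes together as the following improvement cycle.

flowchart TD
    A["Collect\n・Rating buttons\n・Free text\n・Context retention"]
    B["Classify and prioritize\n・LLM auto-classification\n・Three-axis score\n・Grouping"]
    C["Prompt improvement hypothesis\n・Group analysis\n・Articulating the evidence"]
    D["Release\n・Canary\n・Regression test\n・Smoke test"]
    E["Verify effect\n・Change in negative rate\n・Change in category distribution"]

    A --> B --> C --> D --> E --> A

How often to run this cycle depends on the scale of the product and how the team is organized. While feedback volume is low (20 to 30 items a week or fewer), running the cycle about once every two weeks is a manageable start. As volume grows, the natural path is to shorten the cycle while automating more of the classification and aggregation.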

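10 Operational Notes

Distinguishing "it improved" from "the data changed"

When the negative feedback rate drops, you need to tell whether that is the effect of the prompt change or the result of a shift in the user base or usage scenarios.

Collecting data for the old and new prompts over the same period with a canary release avoids this confounding. Because the comparison covers the same period and the same user population, the effect of the prompt change itself is easier to isolate.

(Related article: Prompt Version Management and Release Flow)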
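
One common way to split traffic for such a canary is to hash a stable identifier and map it to a bucket, so that a given session always sees the same prompt version. The following is a minimal sketch under that assumption. The function name `assignPromptVersion` and the choice of `sessionId` as the hashing key are illustrative, not taken from the related article.

```typescript
import { createHash } from "node:crypto";

// Deterministically assigns a session to the stable or canary prompt version.
// `canaryPercent` is the share of sessions (0-100) that should get the canary.
function assignPromptVersion(
  sessionId: string,
  stableVersion: string,
  canaryVersion: string,
  canaryPercent: number
): string {
  // Hash the session ID and reduce the first 4 bytes to a bucket in 0..99.
  const digest = createHash("sha256").update(sessionId).digest();
  const bucket = digest.readUInt32BE(0) % 100;
  return bucket < canaryPercent ? canaryVersion : stableVersion;
}

// The same session always lands in the same bucket.
const assigned = assignPromptVersion("sess-abc", "2.0.0", "2.1.0", 10);
console.log(assigned === assignPromptVersion("sess-abc", "2.0.0", "2.1.0", 10)); // prints "true"
```

The returned version is what would be stored in `promptVersion` on each feedback record, which is what makes the per-version comparison possible.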

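Handling feedback while volume is low

In the early stage of a product, feedback may not accumulate in numbers large enough to be statistically meaningful. Deciding that a problem is serious because there "happened to be 3" negative ratings risks steering improvement in the wrong direction.

While counts are low, reading the content directly is more reliable than mechanical prioritization by score. In practice, automatic classification and scoring start to pay off once feedback reaches about 50 items a week or more.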
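
To put a number on how uncertain a rate from a small sample is, a confidence interval helps. The sketch below computes the Wilson score interval for a negative feedback rate. The choice of the Wilson interval and the function name `wilsonInterval` are additions made here for illustration, not part of the original workflow.

```typescript
// Wilson score interval for a proportion.
// z = 1.96 corresponds to an approximately 95% confidence level.
function wilsonInterval(
  negatives: number,
  total: number,
  z = 1.96
): { lower: number; upper: number } {
  if (total === 0) return { lower: 0, upper: 1 };

  const p = negatives / total;
  const z2 = z * z;
  const denom = 1 + z2 / total;
  const center = (p + z2 / (2 * total)) / denom;
  const margin =
    (z * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total))) / denom;

  return {
    lower: Math.max(0, center - margin),
    upper: Math.min(1, center + margin),
  };
}

// 3 negatives out of 10 feedbacks versus 40 out of 220.
const small = wilsonInterval(3, 10);
const large = wilsonInterval(40, 220);
console.log(small, large);
```

For 3 negatives out of 10, the interval spans roughly 11% to 60%, which is too wide to support a conclusion about severity. For 40 out of 220 it narrows to roughly 14% to 24%.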

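Continuously verifying classification accuracy

Automatic classification by an LLM is not perfect. We recommend checking part of the classification results by hand on a regular basis (for example, 30 to 50 items every two weeks) to keep track of accuracy.

When a category with low accuracy turns up, revisit that category's definition in the classification prompt. Problems such as "hallucination and wrong_info are hard to tell apart" can be improved by writing the definitions together with concrete examples.

// Simple check of classification accuracy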

interface AccuracyCheckResult {
  total: number;
  matches: number;
  accuracy: number;
  confusionPairs: Array<{ auto: string; human: string; count: number }>;
}

function checkClassificationAccuracy(
  samples: Array<{ autoCategory: string; humanCategory: string }>
): AccuracyCheckResult {
  const matches = samples.filter(
    (s) => s.autoCategory === s.humanCategory
  ).length;

  const confusionMap = new Map<string, number>();

  for (const { autoCategory, humanCategory } of samples) {
    if (autoCategory !== humanCategory) {
      const key = `${autoCategory}→${humanCategory}`;
      confusionMap.set(key, (confusionMap.get(key) ?? 0) + 1);
    }
  }

  const confusionPairs = Array.from(confusionMap.entries())
    .map(([key, count]) => {
      const [auto, human] = key.split("→");
      return { auto, human, count };
    })
    .sort((a, b) => b.count - a.count);

  return {
    total: samples.length,
    matches,
    accuracy: samples.length > 0 ? matches / samples.length : 0,
    confusionPairs,
  };
}

Example output of the accuracy check:

total: 50
matches: 43
accuracy: 86.0%

Most frequently confused pairs:
  hallucination → wrong_info: 4
  misunderstood → incomplete: 2
  other → too_long: 1

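Privacy considerations

Feedback records include userInput and agentOutput. Because user input may contain personal or confidential information, the collection must be described in the privacy policy and consent obtained before you start.

Likewise, when sending feedback data to an LLM for classification, check the data handling terms of the provider. If the data is highly confidential, consider a design that masks personal information before classification.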
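
As a very rough illustration of a masking step, the sketch below replaces email addresses and hyphenated phone numbers with placeholders before the text is sent for classification. The patterns and the name `maskPii` are assumptions made for this example. Regular expressions like these catch only simple, well-formed cases, so they should be treated as a starting point rather than as sufficient protection.

```typescript
// Simple patterns for illustration only; they do not cover names, addresses,
// or phone numbers written without hyphens.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const PHONE_PATTERN = /\b0\d{1,4}-\d{1,4}-\d{3,4}\b/g; // e.g. 03-1234-5678

// Replaces matches with fixed placeholders so the surrounding text stays readable.
function maskPii(text: string): string {
  return text.replace(EMAIL_PATTERN, "[EMAIL]").replace(PHONE_PATTERN, "[PHONE]");
}

console.log(maskPii("Contact taro@example.com or 03-1234-5678."));
// prints "Contact [EMAIL] or [PHONE]."
```

Applying the mask to `userInput`, `agentOutput`, and `comment` before building the classification prompt keeps the stored record and the classifier input consistent.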

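11 How We Think About Applying This in BizPlan (a Business Planning Agent)

In BizPlan, which we develop, the nature of business planning means that a certain share of feedback says the agent did not accurately understand the user's intent.

The context of a business plan is complex. Even a single request such as "please do a competitive analysis" can carry several intents: understanding the company's own positioning, digging into a specific competitor, or mapping the market as a whole.

When feedback caused by this complexity of intent is classified as misunderstood and the actual exchanges are reviewed, an improvement hypothesis emerges: add to the prompt a clarifying question that pins down the intent.

This is still experimental and not something we can present as a confirmed effect, but we share it as one example of the process by which feedback leads to a concrete hypothesis for prompt improvement. We plan to report in detail in a separate article once the practical findings are consolidated.

12 Summary

To close, here is a summary of the design introduced in this article for building user feedback into the improvement loop.

Collection UI: A two-layer design, rating buttons plus free text on negative ratings, makes it easier to balance response rate against the amount of information gathered.
Schema: Building around feedbackId / sessionId / agentId / promptVersion / rating / userInput / agentOutput and keeping the context self-contained makes later analysis easier.
Automatic classification: Extracting category and urgency with an LLM removes the manual classification bottleneck. Check classification accuracy regularly and keep refining the category definitions in the prompt.
Prioritization: Sorting mechanically on the three-axis score of recurrence frequency × impact × fixability lowers the cost of debating what to fix first.
Prompt improvement: Adding feedback-derived negative examples to the evaluation dataset and comparing before and after a change lets you show both the evidence for and the effect of an improvement quantitatively.
Effect verification: The main metrics are the negative feedback rate per prompt version and the change in category distribution. Combining them with a canary release makes the effect of the change itself easier to measure.

There is no need to build every piece at once. A realistic way forward is to start by placing rating buttons and collecting data, then add classification, prioritization, and automation step by step as feedback volume grows.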


Author

管理者０２