Harden DeepSeek agent: LiteLLM adapter, DSML/reasoning/embeddings/error fixes

- LiteLLMAdapter (subclasses OpenAIAdapter via _acreate hook): routes DeepSeek through LiteLLM. Opt-in AGENTIC_DEFAULT_MODEL_PROVIDER=litellm. A/B beat the hand-rolled adapter (0 DSML, 0 parse-fails). Defensive chunk.usage getattr, token-estimate usage fallback for billing, quiet litellm logs.
- DSML parser: tolerate single/multi fullwidth pipes, honor string="true/false" typed args (openai_adapter fallback when DeepSeek leaks tool calls as text).
- Thinking mode: capture and round-trip reasoning_content across turns.
- Embeddings: dedicated AGENTIC_EMBEDDINGS_API_KEY (DeepSeek has no embeddings); disable cleanly when unset to avoid per-turn 401.
- claude_format: friendly generic error messages to the chat, raw only in logs.
- acai agent max_tokens 4096->16384 (whole-file writes no longer truncate); system.md size-based edit policy; strict tools opt-in (off).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 14:49:48 +00:00
parent e34a39e3bf
commit 6a03fdf284
12 changed files with 396 additions and 58 deletions
--- a/agents/acai/agent.yaml
+++ b/agents/acai/agent.yaml
@@ -4,7 +4,11 @@ description: "Agente genérico de Acai CMS: crea módulos, edita contenido, gest
 icon: "code"
 category: "development"
 temperature: 0.2
-max_tokens: 4096
+# 16K output tokens: enough to write a whole file (acai_write) plus the reasoning
+# (thinking) in a single turn. With 4096 the tool_use JSON was truncated midway
+# on medium-sized files and the agent fell back to slow micro-edits. v4-pro supports
+# up to 384K output tokens, so 16K is conservative.
+max_tokens: 16384
 context_sections:
  - immutable_rules
  - project_profile
--- a/agents/acai/system.md
+++ b/agents/acai/system.md
@@ -74,6 +74,26 @@ cms/data/schema/        # .ini.php — SOLO con tools de schema
 14. **Project URL**: always `get_web_url` + `?pruebas=1`.
 15. **Destructive operations**: confirm with the user before executing.

+# Editing efficiency (fewer steps AND fewer tokens)
+
+Pick the tool by the SIZE of the change. Neither micro-edit everything (many
+steps) nor rewrite the whole file for every tweak (many tokens):
+
+1. **Small or localized change** (a color, a value, a rule, a few spots)
+   → `acai-line-replace`. Cheap: you only emit the lines that change. Do NOT
+   rewrite the whole file for a tweak.
+2. **Creation or major rewrite** (you change almost the whole file or create it
+   from scratch) → ONE single `acai-write` of the complete file. Rewriting it all for
+   a small change wastes tokens; do it only when almost everything really changes.
+3. **Iterate with `line-replace`, not with repeated writes.** After seeing the result
+   in the browser, apply the adjustments with targeted `line-replace` calls. Do NOT
+   rewrite the complete file on every design iteration.
+4. **Micro-edit cap.** If you find yourself doing >4-5 `line-replace` calls on the
+   same file in one turn, stop and rewrite it whole in one go (`acai-write`).
+5. **Do NOT run `acai-view` after every edit.** You already have the content in context;
+   re-read it only if an edit failed or you doubt the actual state.
+6. **Visual verification at the end, in a single pass**, not after every tweak.
+
 # Canonical patterns (apply by default)

 - **Record detail**: `custom-{tableName}` section with `thisrecord.*`.
--- a/requirements.txt
+++ b/requirements.txt
@@ -5,6 +5,7 @@ pydantic-settings>=2.7.0,<3.0.0
 redis[hiredis]>=5.2.0,<6.0.0
 anthropic>=0.42.0,<1.0.0
 openai>=1.60.0,<2.0.0
+litellm>=1.50.0,<2.0.0
 httpx>=0.28.0,<1.0.0
 sse-starlette>=2.2.0,<3.0.0
 tiktoken>=0.7.0,<1.0.0
--- a/src/adapters/claude_adapter.py
+++ b/src/adapters/claude_adapter.py
@@ -32,7 +32,9 @@ logger = logging.getLogger(__name__)
 # synthetic while we stream. This is a defensive patch: the normal case
 # (tool_use blocks) still follows the standard path.
 _TOOL_CALL_OPEN_RE = re.compile(
-    r"<(?:minimax:tool_call|invoke\s+name|tool_call[\s>]|use_mcp_tool|mm_special)|\[TOOL_CALL\]|<｜｜DSML｜｜",
+    # `<｜` (U+FF5C) covers any DeepSeek special token (DSML): <｜DSML｜invoke,
+    # <｜tool_calls, etc. Tolerant of 1+ pipes and of "DSML" being present or absent.
+    r"<(?:minimax:tool_call|invoke\s+name|tool_call[\s>]|use_mcp_tool|mm_special)|\[TOOL_CALL\]|<｜",
    re.IGNORECASE,
 )
 _INVOKE_RE = re.compile(
@@ -67,16 +69,21 @@ _PERL_ARGS_BLOCK_RE = re.compile(
 _PERL_KV_RE = re.compile(
    r"--([a-zA-Z_][a-zA-Z0-9_]*)\s+(\"[^\"]*\"|\'[^\']*\'|-?\d+(?:\.\d+)?|true|false|null)",
 )
-# Formato 5 (DeepSeek DSML): <｜｜DSML｜｜invoke name="X"><｜｜DSML｜｜parameter name="P" ...>V</｜｜DSML｜｜parameter></｜｜DSML｜｜invoke>
-# U+FF5C = ｜ (fullwidth vertical line)
+# Format 5 (DeepSeek DSML). Official V4-Pro format: the marker is `｜DSML｜`
+# with ONE fullwidth pipe (U+FF5C) on each side: <｜DSML｜invoke name="X"> ...
+# <｜DSML｜parameter name="P" string="true|false">V</｜DSML｜parameter> ...
+# </｜DSML｜invoke>. The regex is deliberately TOLERANT: 1+ pipes and optional
+# "DSML", to cover variants across model versions. The `string` attribute
+# decides the value type: "true" = raw string, "false" = JSON value.
 _DSML_INVOKE_RE = re.compile(
-    r"<｜｜DSML｜｜invoke\s+name=\"([^\"]+)\"[^>]*>(.*?)</｜｜DSML｜｜invoke\s*>",
+    r"<｜+(?:DSML｜+)?invoke\s+name=\"([^\"]+)\"[^>]*>(.*?)</｜+(?:DSML｜+)?invoke\s*>",
    re.IGNORECASE | re.DOTALL,
 )
 _DSML_PARAM_RE = re.compile(
-    r"<｜｜DSML｜｜parameter\s+name=\"([^\"]+)\"[^>]*>(.*?)</｜｜DSML｜｜parameter\s*>",
+    r"<｜+(?:DSML｜+)?parameter\s+name=\"([^\"]+)\"([^>]*)>(.*?)</｜+(?:DSML｜+)?parameter\s*>",
    re.IGNORECASE | re.DOTALL,
 )
+_DSML_STRING_ATTR_RE = re.compile(r"string\s*=\s*\"(true|false)\"", re.IGNORECASE)


 def _safe_emit_split(buf: str) -> str:
@@ -104,7 +111,7 @@ def _safe_emit_split(buf: str) -> str:
    if ">" in tail:
        return buf
    # If the tail could be the start of a tool_call/invoke/tool_call_json/dsml, hold it back.
-    candidates = ("<minimax:tool_call", "<invoke", "<tool_call", "<｜｜dsml｜｜")
+    candidates = ("<minimax:tool_call", "<invoke", "<tool_call", "<｜")
    for cand in candidates:
        if cand.startswith(tail.lower()) or tail.lower().startswith(cand[:len(tail)].lower()):
            return buf[:idx]
@@ -225,13 +232,27 @@ def _parse_xml_tool_calls(text: str) -> list[dict[str, Any]]:
                })

    # Format 5 (DeepSeek DSML):
-    # <｜｜DSML｜｜invoke name="X"><｜｜DSML｜｜parameter name="P" string="true">V</｜｜DSML｜｜parameter></｜｜DSML｜｜invoke>
+    # <｜DSML｜invoke name="X"><｜DSML｜parameter name="P" string="true">V</｜DSML｜parameter></｜DSML｜invoke>
    for m in _DSML_INVOKE_RE.finditer(text):
        name = m.group(1).strip()
        body = m.group(2)
        args_dsml: dict[str, Any] = {}
        for p in _DSML_PARAM_RE.finditer(body):
-            args_dsml[p.group(1).strip()] = p.group(2).strip()
+            pname = p.group(1).strip()
+            attrs = p.group(2) or ""
+            raw_val = p.group(3)
+            sm = _DSML_STRING_ATTR_RE.search(attrs)
+            if sm and sm.group(1).lower() == "true":
+                # string="true": the value is a raw string, so do NOT strip it
+                # (preserves significant whitespace, e.g. file contents).
+                args_dsml[pname] = raw_val
+            else:
+                # string="false" (or absent): JSON value (num/bool/array/obj/string).
+                # If it does not parse, fall back to the stripped raw text.
+                try:
+                    args_dsml[pname] = json.loads(raw_val.strip())
+                except (json.JSONDecodeError, ValueError):
+                    args_dsml[pname] = raw_val.strip()
        if name:
            calls.append({
                "id": "xml_{}".format(uuid.uuid4().hex[:12]),
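
Review note: the typed-argument handling in the DSML hunk above can be exercised in isolation. The following is a minimal sketch, assuming the three regexes exactly as they appear in this diff; `parse_dsml` and `sample` are hypothetical names used only for illustration and are not part of the commit.

```python
import json
import re
from typing import Any

# Patterns copied from the diff: 1+ fullwidth pipes (U+FF5C), optional "DSML".
DSML_INVOKE_RE = re.compile(
    r"<｜+(?:DSML｜+)?invoke\s+name=\"([^\"]+)\"[^>]*>(.*?)</｜+(?:DSML｜+)?invoke\s*>",
    re.IGNORECASE | re.DOTALL,
)
DSML_PARAM_RE = re.compile(
    r"<｜+(?:DSML｜+)?parameter\s+name=\"([^\"]+)\"([^>]*)>(.*?)</｜+(?:DSML｜+)?parameter\s*>",
    re.IGNORECASE | re.DOTALL,
)
DSML_STRING_ATTR_RE = re.compile(r"string\s*=\s*\"(true|false)\"", re.IGNORECASE)


def parse_dsml(text: str) -> list[dict[str, Any]]:
    """Hypothetical helper: extract DSML tool calls leaked as plain text."""
    calls: list[dict[str, Any]] = []
    for m in DSML_INVOKE_RE.finditer(text):
        args: dict[str, Any] = {}
        for p in DSML_PARAM_RE.finditer(m.group(2)):
            pname = p.group(1).strip()
            raw = p.group(3)
            sm = DSML_STRING_ATTR_RE.search(p.group(2) or "")
            if sm and sm.group(1).lower() == "true":
                # Raw string: keep whitespace untouched.
                args[pname] = raw
            else:
                # JSON value; fall back to the stripped text if it does not parse.
                try:
                    args[pname] = json.loads(raw.strip())
                except ValueError:  # json.JSONDecodeError subclasses ValueError
                    args[pname] = raw.strip()
        calls.append({"name": m.group(1).strip(), "arguments": args})
    return calls


sample = (
    '<｜DSML｜invoke name="acai_write">'
    '<｜DSML｜parameter name="path" string="true">a.txt</｜DSML｜parameter>'
    '<｜DSML｜parameter name="line" string="false">12</｜DSML｜parameter>'
    '</｜DSML｜invoke>'
)
calls = parse_dsml(sample)
```

With this input, `path` stays a string while `line` is decoded as an integer. The same patterns also accept the older double-pipe marker (`<｜｜DSML｜｜invoke ...>`), which is the tolerance the commit message describes.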
--- a/src/adapters/litellm_adapter.py
+++ b/src/adapters/litellm_adapter.py
@@ -0,0 +1,67 @@
+"""LiteLLM model adapter: a spike for A/B testing against the native OpenAI/DeepSeek adapter.
+
+Reuses the ENTIRE OpenAIAdapter flow (chunk processing, message conversion,
+tools, DSML fallback) and only changes the model call: instead of the OpenAI
+SDK, it routes through LiteLLM, which ships provider-specific handling
+(DeepSeek included) and might resolve out of the box the DSML / reasoning_content
+issues that we currently patch by hand.
+
+Enable with `AGENTIC_DEFAULT_MODEL_PROVIDER=litellm`. The model comes from
+`AGENTIC_LITELLM_MODEL` (e.g. "deepseek/deepseek-v4-pro"); if empty, it is derived
+from `AGENTIC_DEFAULT_MODEL_ID`. Reuses `openai_api_key` / `openai_base_url` as
+credentials.
+"""
+
+from __future__ import annotations
+
+import logging
+from typing import Any
+
+import litellm
+
+from ..config import settings
+from .openai_adapter import OpenAIAdapter
+
+logger = logging.getLogger(__name__)
+
+# Have LiteLLM drop params the provider does not support instead of raising.
+litellm.drop_params = True
+# Silence litellm's INFO spam ("LiteLLM completion() model=...").
+litellm.suppress_debug_info = True
+logging.getLogger("LiteLLM").setLevel(logging.WARNING)
+
+
+class LiteLLMAdapter(OpenAIAdapter):
+    """Routes calls through LiteLLM, reusing the OpenAIAdapter pipeline."""
+
+    def __init__(
+        self,
+        model: str | None = None,
+        api_key: str | None = None,
+        base_url: str | None = None,
+    ) -> None:
+        # We do NOT call super().__init__: the AsyncOpenAI client is not needed.
+        self._litellm_model = model or settings.litellm_model or self._derive_model()
+        self._api_key = api_key or settings.openai_api_key or None
+        self._api_base = base_url or settings.openai_base_url or None
+        # LiteLLM does not deliver reliable usage when streaming, so estimate it for billing.
+        self._estimate_usage_fallback = True
+        logger.info(
+            "LiteLLMAdapter: model=%s api_base=%s",
+            self._litellm_model, self._api_base or "(default)",
+        )
+
+    @staticmethod
+    def _derive_model() -> str:
+        mid = settings.default_model_id or "deepseek-chat"
+        # If it already carries a provider prefix ("deepseek/...", "openai/..."), keep it.
+        return mid if "/" in mid else f"deepseek/{mid}"
+
+    async def _acreate(self, kwargs: dict[str, Any]):
+        kwargs = dict(kwargs)
+        kwargs["model"] = self._litellm_model
+        if self._api_key:
+            kwargs["api_key"] = self._api_key
+        if self._api_base:
+            kwargs["api_base"] = self._api_base
+        return await litellm.acompletion(**kwargs)
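
Review note: the `_acreate` override is a small template-method hook. A minimal sketch of the pattern follows, with no network calls. `BaseAdapter`, `RoutedAdapter` and the returned dict shape are hypothetical stand-ins for `OpenAIAdapter`, `LiteLLMAdapter` and the real provider response.

```python
import asyncio
from typing import Any


class BaseAdapter:
    """Hypothetical stand-in for OpenAIAdapter: owns the request pipeline."""

    async def _acreate(self, kwargs: dict[str, Any]) -> dict[str, Any]:
        # Hook: the only place where the model is actually called.
        return {"via": "native", **kwargs}

    async def complete(self, prompt: str) -> dict[str, Any]:
        # Shared pipeline: build kwargs once, delegate the call to the hook.
        kwargs = {"model": "base-model", "prompt": prompt}
        return await self._acreate(kwargs)


class RoutedAdapter(BaseAdapter):
    """Hypothetical stand-in for LiteLLMAdapter: overrides only the hook."""

    def __init__(self, model: str) -> None:
        self._model = model

    async def _acreate(self, kwargs: dict[str, Any]) -> dict[str, Any]:
        kwargs = dict(kwargs)           # copy, so the caller's dict is not mutated
        kwargs["model"] = self._model   # swap in the routed model name
        return {"via": "routed", **kwargs}


result = asyncio.run(RoutedAdapter("deepseek/deepseek-chat").complete("hi"))
```

The subclass inherits `complete` unchanged, which mirrors how the adapter in this diff reuses chunk processing and message conversion while replacing only the call itself.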
--- a/src/adapters/openai_adapter.py
+++ b/src/adapters/openai_adapter.py
@@ -14,6 +14,24 @@ from .base import ModelAdapter, ModelConfig, ModelResponse, StreamChunk
 logger = logging.getLogger(__name__)


+def _estimate_usage(messages: list[dict[str, Any]], output_text: str) -> dict[str, int]:
+    """Token estimate for when the provider does not return usage (e.g. LiteLLM
+    streaming). Approximate, but it avoids billing 0."""
+    from ..context.compactor import estimate_tokens
+    inp = 0
+    for m in messages:
+        c = m.get("content")
+        if isinstance(c, str):
+            inp += estimate_tokens(c)
+        elif isinstance(c, list):
+            for b in c:
+                if isinstance(b, dict):
+                    inp += estimate_tokens(
+                        b.get("text") or b.get("thinking") or str(b.get("content") or "")
+                    )
+    return {"input_tokens": inp, "output_tokens": estimate_tokens(output_text or "")}
+
+
 class OpenAIAdapter(ModelAdapter):
    """Adapter for the OpenAI API (GPT-4o, o1, etc.)."""

@@ -25,6 +43,15 @@ class OpenAIAdapter(ModelAdapter):
        if url:
            kwargs["base_url"] = url
        self._client = AsyncOpenAI(**kwargs)
+        # The native path keeps the provider's real usage; subclasses that do not
+        # get reliable usage when streaming (LiteLLM) set this to True to estimate.
+        self._estimate_usage_fallback = False
+
+    async def _acreate(self, kwargs: dict[str, Any]):
+        """Hook for the model call. Subclasses (e.g. LiteLLMAdapter) override
+        it to route through another library without touching the rest of the
+        flow (chunk processing, tools, messages)."""
+        return await self._client.chat.completions.create(**kwargs)

    # ------------------------------------------------------------------
    # Streaming
@@ -53,7 +80,7 @@ class OpenAIAdapter(ModelAdapter):
        if tools:
            kwargs["tools"] = self._format_tools(tools)

-        stream = await self._client.chat.completions.create(**kwargs)
+        stream = await self._acreate(kwargs)

        # Tool-calls-as-text fallback: DeepSeek sometimes emits tool calls
        # in its internal DSML format as TEXT (in the content) instead of as
@@ -65,28 +92,53 @@ class OpenAIAdapter(ModelAdapter):
        tool_calls_acc: dict[int, dict[str, str]] = {}

        final_usage: dict[str, int] = {}
+        usage_emitted = False   # avoids double counting if usage arrives after estimating
        full_content = ""       # accumulated content (for the DSML fallback)
+        full_reasoning = ""     # accumulated reasoning (for estimating usage)
        emitted_chars = 0       # how much of full_content has already been emitted as delta
        suppress_text = False   # after detecting a tool-call-as-text, emit nothing more

+        # DeepSeek thinking mode: the reasoning arrives in `delta.reasoning_content`
+        # (before the content). We accumulate it as a `thinking` block (block_index 0)
+        # so the orchestrator persists it and `_to_openai_messages` sends it back as
+        # `reasoning_content` on the next turn; DeepSeek requires this in multi-turn
+        # conversations with tool calls ("reasoning_content ... must be passed back to the API").
+        reasoning_seen = False
+        reasoning_sig_emitted = False
+
        async for chunk in stream:
-            # With include_usage, the last chunk has usage but no choices
-            if chunk.usage:
+            # With include_usage, the last chunk has usage but no choices.
+            # getattr: the LiteLLM chunk (ModelResponseStream) does not always carry
+            # the `usage` attribute; the OpenAI SDK chunk does (None except on the last one).
+            chunk_usage = getattr(chunk, "usage", None)
+            if chunk_usage:
                final_usage = {
-                    "input_tokens": chunk.usage.prompt_tokens or 0,
-                    "output_tokens": chunk.usage.completion_tokens or 0,
+                    "input_tokens": getattr(chunk_usage, "prompt_tokens", 0) or 0,
+                    "output_tokens": getattr(chunk_usage, "completion_tokens", 0) or 0,
                }

            choice = chunk.choices[0] if chunk.choices else None
            if not choice:
                # Usage-only chunk (last one with include_usage) — emit it
-                if final_usage:
+                if final_usage and not usage_emitted:
                    yield StreamChunk(usage=final_usage)
-                    final_usage = {}  # Only emit once
+                    usage_emitted = True
                continue

            delta = choice.delta

+            # Reasoning content (DeepSeek thinking mode). It arrives as an extra field
+            # on the delta; we emit it as a thinking_delta in the block at index 0.
+            reasoning_txt = getattr(delta, "reasoning_content", None) if delta else None
+            if reasoning_txt:
+                reasoning_seen = True
+                full_reasoning += reasoning_txt
+                yield StreamChunk(
+                    thinking_delta=reasoning_txt,
+                    block_type="thinking",
+                    block_index=0,
+                )
+
            # Text content
            if delta and delta.content:
                full_content += delta.content
@@ -131,6 +183,24 @@ class OpenAIAdapter(ModelAdapter):

            # Finish
            if choice.finish_reason:
+                # Close the reasoning block (if there was one) with a synthetic
+                # signature: the orchestrator discards thinking blocks without a signature
+                # (a safeguard for MiniMax/Anthropic). DeepSeek does not use signatures;
+                # this marker only prevents the discard and is NEVER sent back: in
+                # `_to_openai_messages` the block is mapped to `reasoning_content`.
+                if reasoning_seen and not reasoning_sig_emitted:
+                    reasoning_sig_emitted = True
+                    yield StreamChunk(
+                        thinking_signature="deepseek-reasoning",
+                        block_type="thinking",
+                        block_index=0,
+                    )
+                # Usage fallback: some providers behind LiteLLM do not deliver the
+                # usage chunk (or it arrives after the orchestrator's break), so billing is 0.
+                # We estimate tokens to avoid undercharging. Only when the adapter asks
+                # for it (LiteLLM); the native path keeps the provider's real usage.
+                if self._estimate_usage_fallback and not final_usage and not usage_emitted:
+                    final_usage = _estimate_usage(messages, full_content + "\n" + full_reasoning)
                # IMPORTANT: DeepSeek (OpenAI endpoint) sometimes closes the stream
                # with finish_reason="stop" EVEN THOUGH it emitted tool_calls. If we
                # rely only on =="tool_calls" we lose those tool calls: the agent
@@ -146,8 +216,9 @@ class OpenAIAdapter(ModelAdapter):
                            finish_reason="tool_use",
                        )
                    # Emit usage after tool_use chunks
-                    if final_usage:
+                    if final_usage and not usage_emitted:
                        yield StreamChunk(usage=final_usage)
+                        usage_emitted = True
                else:
                    # Fallback: DeepSeek may have emitted the tool calls as TEXT
                    # (DSML/XML) instead of natively. We parse the content and, if there are
@@ -161,15 +232,17 @@ class OpenAIAdapter(ModelAdapter):
                                tool_arguments=json.dumps(c.get("arguments", {}), ensure_ascii=False),
                                finish_reason="tool_use",
                            )
-                        if final_usage:
+                        if final_usage and not usage_emitted:
                            yield StreamChunk(usage=final_usage)
+                            usage_emitted = True
                    else:
                        yield StreamChunk(
                            finish_reason="end_turn"
                            if choice.finish_reason in ("stop", "tool_calls")
                            else choice.finish_reason,
-                            usage=final_usage,
+                            usage=final_usage if not usage_emitted else {},
                        )
+                        usage_emitted = True

    # ------------------------------------------------------------------
    # Non-streaming
@@ -204,7 +277,7 @@ class OpenAIAdapter(ModelAdapter):
                "function": {"name": force_tool},
            }

-        response = await self._client.chat.completions.create(**kwargs)
+        response = await self._acreate(kwargs)
        choice = response.choices[0]

        content = choice.message.content or ""
@@ -247,23 +320,41 @@ class OpenAIAdapter(ModelAdapter):

    @staticmethod
    def _format_tools(tools: list[dict[str, Any]]) -> list[dict[str, Any]]:
-        """Convert internal tool definitions to OpenAI function calling format."""
+        """Convert internal tool definitions to OpenAI function calling format.
+
+        If `deepseek_strict_tools` is set, marks each function with `strict: true` and
+        strips from the schema the keywords that DeepSeek strict mode does NOT support
+        (minLength/maxLength/minItems/maxItems), which would otherwise return 400."""
+        strict = settings.deepseek_strict_tools
        formatted: list[dict[str, Any]] = []
        for tool in tools:
-            formatted.append(
-                {
-                    "type": "function",
-                    "function": {
-                        "name": tool["name"],
-                        "description": tool.get("description", ""),
-                        "parameters": tool.get(
-                            "input_schema", tool.get("parameters", {"type": "object"})
-                        ),
-                    },
-                }
-            )
+            params = tool.get("input_schema", tool.get("parameters", {"type": "object"}))
+            fn: dict[str, Any] = {
+                "name": tool["name"],
+                "description": tool.get("description", ""),
+                "parameters": OpenAIAdapter._sanitize_strict_schema(params) if strict else params,
+            }
+            if strict:
+                fn["strict"] = True
+            formatted.append({"type": "function", "function": fn})
        return formatted

+    # Keywords not supported by DeepSeek strict mode (per the official docs).
+    _STRICT_UNSUPPORTED_KEYS = ("minLength", "maxLength", "minItems", "maxItems")
+
+    @staticmethod
+    def _sanitize_strict_schema(schema: Any) -> Any:
+        """Recursively removes keywords not supported by DeepSeek strict mode."""
+        if isinstance(schema, dict):
+            return {
+                k: OpenAIAdapter._sanitize_strict_schema(v)
+                for k, v in schema.items()
+                if k not in OpenAIAdapter._STRICT_UNSUPPORTED_KEYS
+            }
+        if isinstance(schema, list):
+            return [OpenAIAdapter._sanitize_strict_schema(x) for x in schema]
+        return schema
+
    @staticmethod
    def _blocks_text(content: Any) -> str:
        """Extracts plain text from a content that may be a str or a list of blocks."""
@@ -300,12 +391,19 @@ class OpenAIAdapter(ModelAdapter):
            if role == "assistant":
                text_parts: list[str] = []
                tool_calls: list[dict[str, Any]] = []
+                reasoning_parts: list[str] = []
                for b in content:
                    if not isinstance(b, dict):
                        continue
                    t = b.get("type")
                    if t == "text":
                        text_parts.append(b.get("text", ""))
+                    elif t == "thinking":
+                        # DeepSeek thinking mode: the turn's reasoning must be
+                        # sent back as `reasoning_content` (not as a signature).
+                        rc = b.get("thinking", "")
+                        if rc:
+                            reasoning_parts.append(rc)
                    elif t == "tool_use":
                        tool_calls.append({
                            "id": b.get("id", ""),
@@ -315,8 +413,9 @@ class OpenAIAdapter(ModelAdapter):
                                "arguments": json.dumps(b.get("input", {}), ensure_ascii=False),
                            },
                        })
-                    # thinking / other blocks: ignored (OpenAI does not support them)
                m: dict[str, Any] = {"role": "assistant", "content": ("\n".join(p for p in text_parts if p) or None)}
+                if reasoning_parts:
+                    m["reasoning_content"] = "\n".join(reasoning_parts)
                if tool_calls:
                    m["tool_calls"] = tool_calls
                out.append(m)
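
Review note: the recursive schema cleanup added in `_sanitize_strict_schema` can be checked on its own. This is a minimal sketch, assuming the same four keywords as the diff; `sanitize` and `schema` are hypothetical names used only for illustration.

```python
from typing import Any

# Keywords the diff lists as unsupported by DeepSeek strict mode.
UNSUPPORTED = ("minLength", "maxLength", "minItems", "maxItems")


def sanitize(schema: Any) -> Any:
    """Return a copy of the schema with the unsupported keywords removed at every depth."""
    if isinstance(schema, dict):
        return {k: sanitize(v) for k, v in schema.items() if k not in UNSUPPORTED}
    if isinstance(schema, list):
        return [sanitize(x) for x in schema]
    return schema


schema = {
    "type": "object",
    "properties": {
        "tags": {
            "type": "array",
            "items": {"type": "string", "minLength": 1},
            "maxItems": 5,
        }
    },
    "required": ["tags"],
}
cleaned = sanitize(schema)
```

One caveat worth keeping in mind: the filter works on key names only, so a tool parameter that happens to be called `maxLength` under `properties` would be dropped as well. Whether any current tool has such a parameter is not something this diff shows.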
--- a/src/api/routes.py
+++ b/src/api/routes.py
@@ -781,22 +781,27 @@ async def _load_knowledge_from_dir(docs_path: str = "docs") -> dict[str, Any]:

        docs_data.append((doc_id, title, content, summary, tags, priority, load_when))

-    # Generate embeddings in batch
-    from ..memory.embeddings import EmbeddingService
-    embed_service = EmbeddingService()
-    embed_texts = [
-        f"{title}\n{summary}\n{content[:2000]}"
-        for _, title, content, summary, _, _, _ in docs_data
-    ]
-
-    try:
-        embeddings = await embed_service.embed_batch(embed_texts)
-        has_embeddings = True
-        logger.info("Generated %d embeddings for knowledge base", len(embeddings))
-    except Exception as e:
-        logger.warning("Failed to generate embeddings: %s — loading without semantic search", e)
-        embeddings = [None] * len(docs_data)
-        has_embeddings = False
+    # Generate embeddings in batch (only when an embeddings credential is set; without
+    # one the call would return 401, so it is skipped cleanly).
+    embeddings: list[Any] = [None] * len(docs_data)
+    has_embeddings = False
+    if settings.embeddings_enabled:
+        from ..memory.embeddings import EmbeddingService
+        embed_service = EmbeddingService()
+        embed_texts = [
+            f"{title}\n{summary}\n{content[:2000]}"
+            for _, title, content, summary, _, _, _ in docs_data
+        ]
+        try:
+            embeddings = await embed_service.embed_batch(embed_texts)
+            has_embeddings = True
+            logger.info("Generated %d embeddings for knowledge base", len(embeddings))
+        except Exception as e:
+            logger.warning("Failed to generate embeddings: %s — loading without semantic search", e)
+            embeddings = [None] * len(docs_data)
+            has_embeddings = False
+    else:
+        logger.info("Embeddings disabled (no AGENTIC_EMBEDDINGS_API_KEY) — KB loaded without semantic search")

    # Cleans up orphaned entries: docs that no longer exist on the filesystem.
    # Without this, the old IDs (e.g. after renaming 'builder-fields' →
--- a/src/config.py
+++ b/src/config.py
@@ -32,6 +32,33 @@ class Settings(BaseSettings):
    anthropic_base_url: str = ""  # Custom base URL (for MiniMax Anthropic-compatible, etc.)
    openai_api_key: str = ""
    openai_base_url: str = ""  # Custom base URL (for MiniMax, DeepInfra, etc.)
+    # --- Embeddings (semantic search) ---
+    # DEDICATED credentials for embeddings. Needed because chat uses
+    # `openai_api_key` pointed at a compatible endpoint (e.g. DeepSeek, which has NO
+    # embeddings API). If empty, falls back to `openai_api_key`, but only when
+    # `openai_base_url` is unset. Empty base_url => real OpenAI; `openai_base_url` is NOT inherited.
+    embeddings_api_key: str = ""
+    embeddings_base_url: str = ""
+    embeddings_model: str = "text-embedding-3-small"
+    # LiteLLM spike: when default_model_provider=litellm, the model to use (litellm
+    # format, e.g. "deepseek/deepseek-v4-pro"). Empty → derived from default_model_id.
+    litellm_model: str = ""
+
+    @property
+    def effective_embeddings_key(self) -> str:
+        """Key to use for embeddings. Prefers the dedicated one; reuses the chat
+        key ONLY if chat is real OpenAI (no custom `openai_base_url`). If it
+        points at DeepSeek or another provider, that key is useless for embeddings."""
+        if self.embeddings_api_key:
+            return self.embeddings_api_key
+        if not self.openai_base_url:
+            return self.openai_api_key
+        return ""
+
+    @property
+    def embeddings_enabled(self) -> bool:
+        return bool(self.effective_embeddings_key or self.embeddings_base_url)
+
    default_model_provider: str = "claude"
    default_model_id: str = "claude-sonnet-4-20250514"
    # Model override ONLY for the planner sub-loop (acai_plan). If empty,
@@ -43,6 +70,11 @@ class Settings(BaseSettings):
    planner_max_tokens: int = 16000
    max_tokens: int = 4096
    temperature: float = 0.3
+    # DeepSeek strict function calling (beta). OPT-IN (default False): it requires
+    # OpenAI-style schemas (additionalProperties:false, all fields required, etc.) that
+    # the current MCP tools do NOT satisfy, so it returns 400. To enable it: compatible
+    # schemas + base_url https://api.deepseek.com/beta + AGENTIC_DEEPSEEK_STRICT_TOOLS=true.
+    deepseek_strict_tools: bool = False

    # --- Context engine ---
    model_context_window: int = 0  # 0 = use legacy fixed budget / explicit override
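
Review note: the key-resolution rules in `effective_embeddings_key` and `embeddings_enabled` are easy to misread, so here is a minimal sketch of the same logic on a plain dataclass. `EmbeddingSettings` and the example keys and URLs are hypothetical; the real class is the pydantic `Settings` shown in this diff.

```python
from dataclasses import dataclass


@dataclass
class EmbeddingSettings:
    """Hypothetical stand-in for the embeddings-related fields of Settings."""

    openai_api_key: str = ""
    openai_base_url: str = ""
    embeddings_api_key: str = ""
    embeddings_base_url: str = ""

    @property
    def effective_embeddings_key(self) -> str:
        # 1) A dedicated embeddings key always wins.
        if self.embeddings_api_key:
            return self.embeddings_api_key
        # 2) Reuse the chat key only when chat talks to the default OpenAI endpoint.
        if not self.openai_base_url:
            return self.openai_api_key
        # 3) Chat points at another provider: no usable key.
        return ""

    @property
    def embeddings_enabled(self) -> bool:
        return bool(self.effective_embeddings_key or self.embeddings_base_url)


deepseek_only = EmbeddingSettings(
    openai_api_key="sk-chat", openai_base_url="https://api.deepseek.com"
)
plain_openai = EmbeddingSettings(openai_api_key="sk-chat")
dedicated = EmbeddingSettings(
    openai_api_key="sk-chat",
    openai_base_url="https://api.deepseek.com",
    embeddings_api_key="sk-embed",
)
```

In this sketch `deepseek_only` ends up with embeddings disabled, `plain_openai` reuses the chat key, and `dedicated` uses its own key. Note that setting only `embeddings_base_url` also enables embeddings, even with an empty key, which matches the property in the diff.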
--- a/src/context/engine.py
+++ b/src/context/engine.py
@@ -583,6 +583,16 @@ class ContextEngine:

    async def _semantic_rank(self, query: str) -> list[tuple[str, float]]:
        """Rank knowledge docs by cosine similarity. Returns (doc_id, score)."""
+        # Without an embeddings credential there is no point attempting the call (it
+        # would return 401 on every turn). It is disabled cleanly with a one-time warning.
+        if not settings.embeddings_enabled:
+            if not getattr(self, "_embed_disabled_warned", False):
+                logger.warning(
+                    "Embeddings disabled (no AGENTIC_EMBEDDINGS_API_KEY) — "
+                    "semantic search off, loading all docs"
+                )
+                self._embed_disabled_warned = True
+            return []
        try:
            if not self._embed_service:
                self._embed_service = EmbeddingService()
--- a/src/main.py
+++ b/src/main.py
@@ -54,7 +54,11 @@ async def lifespan(app: FastAPI):
    await redis_storage.connect()

    # 2. Initialize model adapter
-    if settings.default_model_provider == "openai":
+    if settings.default_model_provider == "litellm":
+        from .adapters.litellm_adapter import LiteLLMAdapter
+        model_adapter = LiteLLMAdapter()
+        logger.info("Using LiteLLM adapter (model: %s)", settings.litellm_model or settings.default_model_id)
+    elif settings.default_model_provider == "openai":
        model_adapter = OpenAIAdapter()
        logger.info("Using OpenAI adapter (model: %s)", settings.default_model_id)
    else:
--- a/src/memory/embeddings.py
+++ b/src/memory/embeddings.py
@@ -25,12 +25,19 @@ class EmbeddingService:
    def __init__(
        self,
        api_key: str | None = None,
-        model: str = DEFAULT_MODEL,
+        model: str | None = None,
    ) -> None:
-        self._client = AsyncOpenAI(
-            api_key=api_key or settings.openai_api_key,
-        )
-        self._model = model
+        # Dedicated embeddings credentials. Falls back to openai_api_key for
+        # compatibility (only when no custom `openai_base_url` is set). base_url is
+        # applied only if `embeddings_base_url` is configured explicitly; empty =>
+        # real OpenAI (api.openai.com). `openai_base_url` is NOT inherited (it points
+        # at chat, e.g. DeepSeek, which has no embeddings endpoint).
+        key = api_key or settings.effective_embeddings_key
+        kwargs: dict[str, Any] = {"api_key": key}
+        if settings.embeddings_base_url:
+            kwargs["base_url"] = settings.embeddings_base_url
+        self._client = AsyncOpenAI(**kwargs)
+        self._model = model or settings.embeddings_model or DEFAULT_MODEL

    async def embed(self, text: str) -> list[float]:
        """Generate embedding for a single text."""
--- a/src/streaming/claude_format.py
+++ b/src/streaming/claude_format.py
@@ -19,6 +19,71 @@ from .sse import EventType, SSEEmitter
 logger = logging.getLogger(__name__)


+_GENERIC_ERROR = (
+    "Ha ocurrido un error procesando tu mensaje. Vuelve a intentarlo en unos momentos."
+)
+
+# Patterns the frontend interprets on its own (login / expired session).
+# We do not genericize them so those detections keep working.
+_PASSTHROUGH_PATTERNS = (
+    "not logged in",
+    "login required",
+    "authentication required",
+    "no conversation found",
+)
+
+
+def friendly_error_message(raw: str, code: str = "") -> str:
+    """Translates a raw error (provider/exception) into a generic, localized
+    message for the end user, without leaking internal details.
+
+    Returns the original text untouched for the auth/session cases that the
+    frontend already handles by content.
+    """
+    raw = raw or ""
+    text = "{} {}".format(code or "", raw).lower()
+
+    # Auth / session: pass the original text through (the frontend handles it)
+    if any(p in text for p in _PASSTHROUGH_PATTERNS):
+        return raw
+
+    # Execution timeout
+    if "timeout" in text or "timed out" in text:
+        return (
+            "La tarea tardó demasiado en completarse. Prueba a dividirla en "
+            "pasos más pequeños o vuelve a intentarlo."
+        )
+    # Insufficient balance / provider billing (402)
+    if (
+        "402" in text
+        or "insufficient balance" in text
+        or "insufficient_quota" in text
+        or "billing" in text
+    ):
+        return (
+            "El asistente no está disponible en este momento. Inténtalo de "
+            "nuevo en unos minutos."
+        )
+    # Invalid provider credentials (401)
+    if (
+        "401" in text
+        or "invalid_api_key" in text
+        or "incorrect api key" in text
+        or "invalid api key" in text
+    ):
+        return (
+            "El asistente no está disponible temporalmente por un problema de "
+            "configuración. Estamos trabajando en ello."
+        )
+    # Rate limit (429)
+    if "429" in text or "rate limit" in text or "rate_limit" in text:
+        return (
+            "Hay mucha demanda en este momento. Espera unos segundos y vuelve "
+            "a intentarlo."
+        )
+    return _GENERIC_ERROR
+
+
 class ClaudeFormatEmitter:
    """Emits events in Claude Code CLI SSE format.

@@ -304,7 +369,10 @@ class ClaudeFormatEmitter:
            self._push(session_id, {"type": "done"})

        elif event_type == EventType.ERROR:
-            error_msg = data.get("message", str(data.get("error", "Unknown error")))
+            raw_msg = data.get("message", str(data.get("error", "Unknown error")))
+            user_msg = friendly_error_message(raw_msg, str(data.get("error", "")))
+            # The real error (provider details) goes only to the log, never to the client.
+            logger.warning("Session %s error (raw): %s", session_id, raw_msg)

            # Close any open block
            self._close_text_block(session_id)
@@ -312,7 +380,7 @@ class ClaudeFormatEmitter:
            self._push(session_id, {
                "type": "result",
                "is_error": True,
-                "result": error_msg,
+                "result": user_msg,
                "usage": {"input_tokens": 0, "output_tokens": 0, "cache_read_input_tokens": 0, "cache_creation_input_tokens": 0},
                "total_cost_usd": 0,
            })