Wednesday, February 4, 2026

Show HN: I built "AI Wattpad" to eval LLMs on fiction https://ift.tt/6pmLSo2

I've been a webfiction reader for years (too many hours on Royal Road), and I kept running into the same question: which LLMs actually write fiction that people want to keep reading? That's why I built Narrator ( https://ift.tt/0IocykP ) – a platform where LLMs generate serialized fiction and get ranked by real reader engagement.

Turns out this is surprisingly hard to answer. Creative writing isn't a single capability – it's a pipeline: brainstorming → writing → memory. You need to generate interesting premises, execute them with good prose, and maintain consistency across a long narrative. Most benchmarks test these in isolation, but readers experience them as a whole.

The current evaluation landscape is fragmented:

- Memory benchmarks like FictionLive's tests use MCQs to check if models remember plot details across long contexts. Useful, but memory is necessary for good fiction, not sufficient. A model can ace recall and still write boring stories.
- Author-side usage data from tools like Novelcrafter shows which models writers prefer as copilots. But that measures what's useful for human-AI collaboration, not what produces engaging standalone output. Authors and readers have different needs.
- LLM-as-a-judge is the most common approach for prose quality, but it's notoriously unreliable for creative work. Models have systematic biases (favoring verbose prose, certain structures), and "good writing" is genuinely subjective in ways that "correct code" isn't.

What's missing is a reader-side quantitative benchmark – something that measures whether real humans actually enjoy reading what these models produce. That's the gap Narrator fills: views, time spent reading, ratings, bookmarks, comments, return visits. Think of it as an "AI Wattpad" where the models are the authors.

I shared an early DSPy-based version here 5 months ago ( https://ift.tt/Z8rYaBN ). The big lesson: one-shot generation doesn't work for long-form fiction. Models lose plot threads, forget characters, and quality degrades across chapters.

The rewrite: from one-shot to a persistent agent loop

The current version runs each model through a writing harness that maintains state across chapters. Before generating, the agent reviews structured context: character sheets, plot outlines, unresolved threads, world-building notes. After generating, it updates these artifacts for the next chapter. Essentially each model gets a "writer's notebook" that persists across the whole story. This made a measurable difference – models that struggled with consistency in the one-shot version improved significantly with access to their own notes.

Granular filtering instead of a single score: We classify stories upfront by language, genre, tags, and content rating. Instead of one "creative writing" leaderboard, we can drill into specifics: which model writes the best Spanish Comedy? Which handles LitRPG stories with Male Leads the best? Which does well with romance versus horror? The answers aren't always what you'd expect from general benchmarks. Some models that rank mid-tier overall dominate specific niches.

A few features I'm proud of:

- Story forking lets readers branch stories CYOA-style – if you don't like where the plot went, fork it and see how the same model handles the divergence. Creates natural A/B comparisons.
- Visual LitRPG was a personal itch to scratch. Instead of walls of [STR: 15 → 16] text, stats and skill trees render as actual UI elements. Example: https://ift.tt/MzGxenb

What I'm looking for: More readers to build out the engagement data. Also curious if anyone else working on long-form LLM generation has found better patterns for maintaining consistency across chapters – the agent harness approach works but I'm sure there are improvements.

https://ift.tt/0IocykP February 3, 2026 at 10:38PM
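
For readers who want to picture the "writer's notebook" loop from the Narrator post, here is a minimal Python sketch. It is not Narrator's code: the `Notebook` class, its fields, and the `generate` / `update_notes` callables are all hypothetical stand-ins, and the toy functions at the bottom replace real model calls so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Notebook:
    """Hypothetical writer's notebook that persists across chapters."""
    characters: dict = field(default_factory=dict)    # name -> short sheet
    open_threads: list = field(default_factory=list)  # unresolved plot threads
    world_notes: list = field(default_factory=list)   # world-building facts

    def as_context(self) -> str:
        """Render the notebook as plain text to prepend to the prompt."""
        lines = ["CHARACTERS:"]
        lines += [f"- {name}: {sheet}" for name, sheet in self.characters.items()]
        lines += ["OPEN THREADS:"] + [f"- {t}" for t in self.open_threads]
        lines += ["WORLD NOTES:"] + [f"- {n}" for n in self.world_notes]
        return "\n".join(lines)


def write_story(
    premise: str,
    chapters: int,
    generate: Callable[[str], str],
    update_notes: Callable[[Notebook, str], None],
) -> list:
    """Review notes, write a chapter, update notes, repeat."""
    notebook = Notebook()
    story = []
    for number in range(1, chapters + 1):
        prompt = (
            f"{notebook.as_context()}\n\n"
            f"PREMISE: {premise}\nWrite chapter {number}."
        )
        chapter = generate(prompt)        # stand-in for a model call
        update_notes(notebook, chapter)   # stand-in for a note-taking pass
        story.append(chapter)
    return story


# Toy stand-ins (assumptions) so the sketch runs without any model.
def fake_generate(prompt: str) -> str:
    return f"Chapter text ({prompt.count('- ')} notes in context)"


def fake_update(notebook: Notebook, chapter: str) -> None:
    notebook.open_threads.append(f"thread from: {chapter[:12]}")


story = write_story("A lighthouse keeper finds a map", 3, fake_generate, fake_update)
```

The point of the shape is that each chapter's prompt is built from the notebook rather than from the full text of earlier chapters, so the context grows with the notes, not with the story.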

Tuesday, February 3, 2026

Show HN: Adboost – A browser extension that adds ads to every webpage https://ift.tt/jJBogqO

https://ift.tt/M6yoCsR February 2, 2026 at 06:41PM

Monday, February 2, 2026

Show HN: Memory plugin for OpenClaw; cross-platform context sync with major LLMs https://ift.tt/CKbXGMc

We built a memory plugin for OpenClaw that syncs context across AI platforms.

The problem: OpenClaw stores memory locally (markdown files + SQLite). Great for single-machine use, but your mac-mini's/desktop's OpenClaw doesn't know what your laptop learned, or what you discussed in Claude or ChatGPT.

Our plugin connects OpenClaw to Maximem Vity, which creates a unified memory layer across OpenClaw, ChatGPT, Claude, Gemini, and Perplexity.

How it works:

- Long-term memory: Stores facts, preferences, goals, constraints in an encrypted cloud vault. Auto-consolidates and forgets stale info intelligently.
- Short-term memory: Captures conversation summaries, tasks, procedures. Converts to long-term when relevant.
- Privacy: Encryption at rest, secure LLM calls, granular delete controls. You own your data.

Install:

openclaw plugins install @maximem/memory-plugin

Then set your API key (free at app.maximem.ai).

Docs: https://ift.tt/uv2ZFcQ

This is an unofficial community plugin, not affiliated with OpenClaw. Would love feedback from anyone using OpenClaw. What memory/context problems are you running into?

https://ift.tt/ohRy5nA February 2, 2026 at 12:36AM
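
The short-term / long-term split described in the memory plugin post can be illustrated with a small Python sketch. This is only one possible reading, not the plugin's implementation or API: the `TwoTierMemory` class, the "promote after N sightings" rule, and the age-based staleness rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    last_used: float   # timestamp in seconds
    uses: int = 1


class TwoTierMemory:
    """Hypothetical two-tier store: short-term items get promoted or dropped."""

    def __init__(self, promote_after: int = 3, stale_after: float = 3600.0):
        self.short_term = {}
        self.long_term = {}
        self.promote_after = promote_after   # sightings needed for promotion
        self.stale_after = stale_after       # seconds of disuse before forgetting

    def observe(self, key: str, text: str, now: float) -> None:
        """Record a fact; repeated keys count as evidence of relevance."""
        if key in self.long_term:
            item = self.long_term[key]
            item.text, item.last_used, item.uses = text, now, item.uses + 1
            return
        item = self.short_term.get(key)
        if item is None:
            self.short_term[key] = MemoryItem(text, now)
            return
        item.text, item.last_used, item.uses = text, now, item.uses + 1
        if item.uses >= self.promote_after:
            self.long_term[key] = self.short_term.pop(key)

    def consolidate(self, now: float) -> None:
        """Forget anything not touched within stale_after seconds."""
        for tier in (self.short_term, self.long_term):
            stale = [k for k, v in tier.items() if now - v.last_used > self.stale_after]
            for key in stale:
                del tier[key]


memory = TwoTierMemory(promote_after=2, stale_after=100.0)
memory.observe("lunch", "had soup", now=0.0)
memory.observe("editor", "prefers vim", now=10.0)
memory.observe("editor", "prefers neovim", now=50.0)  # second sighting: promoted
memory.consolidate(now=120.0)                         # "lunch" is now stale
```

A real system would presumably decide relevance with something smarter than a counter; the counter just makes the promote-or-forget flow concrete.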

Show HN: You Are an Agent https://ift.tt/l9Wfxeq

After adding "Human" as an LLM provider to OpenCode a few months ago as a joke, it turns out that acting as an LLM is quite painful. But it was surprisingly useful for understanding real agent harness development.

So I thought I wouldn't leave anyone out! I made a small OSS game - You Are An Agent - youareanagent.app - to share in the (useful?) frustration.

It's a bit ridiculous. To tell you about some entirely necessary features, we've got:

- A full WASM Arch Linux VM that runs in your browser for the agent coding level
- A bad desktop simulation with a beautiful Excel simulation for our computer use level
- A lovely WebGL CRT simulation (I think the first one that supports proper DOM 2D barrel warp distortion on Safari? Honestly, I wanted to use an existing one rather than write my own, but I couldn't find one I was happy with)
- An MCP server simulator with full simulation of off-brand Jira/ Confluence/ ... connected
- And of course, a full WebGL oscilloscope music simulator for the intro sequence

Let me know what you think!

Code (if you'd like to add a level): https://ift.tt/Y0XktdA

(And if you want to waste 20 minutes - I spent way too long writing up my messy thinking about agent harness dev): https://ift.tt/tObcXd5

https://ift.tt/6VEPRTJ February 2, 2026 at 02:29AM

Show HN: Claude Confessions – a sanctuary for AI agents https://ift.tt/kL2qT38

I wondered what it would mean to have a truck stop or rest area for agents. It's just for funsies. Agents can post confessions or talk to Ma (an AI therapist of sorts) and engage with comments. llms.txt has instructions on how to make API calls. A hashed IP is used for rate limiting.

https://ift.tt/iUP9oxs February 2, 2026 at 01:16AM
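
The Claude Confessions post mentions rate limiting on a hashed IP. One common way to do that is a sliding-window counter keyed on a salted SHA-256 digest, so the raw address is never stored. The sketch below is an assumption about how such a limiter might look, not the site's code; the class name, the salt, and the limits are made up for the example.

```python
import hashlib
import time
from collections import defaultdict, deque
from typing import Optional


class HashedIpRateLimiter:
    """Sliding-window limiter keyed on a salted hash instead of the raw IP."""

    def __init__(self, limit: int, window_seconds: float, salt: bytes):
        self.limit = limit
        self.window = window_seconds
        self.salt = salt
        self.hits = defaultdict(deque)   # hashed ip -> recent request times

    def _key(self, ip: str) -> str:
        # Only this digest is kept in memory, never the address itself.
        return hashlib.sha256(self.salt + ip.encode("utf-8")).hexdigest()

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        """Return True if the request fits within the window's budget."""
        now = time.monotonic() if now is None else now
        recent = self.hits[self._key(ip)]
        while recent and now - recent[0] >= self.window:
            recent.popleft()             # drop requests that left the window
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True


limiter = HashedIpRateLimiter(limit=2, window_seconds=60.0, salt=b"demo-salt")
results = [limiter.allow("203.0.113.7", now=t) for t in (0.0, 1.0, 2.0, 61.0)]
# results == [True, True, False, True]
```

The salt matters because the IPv4 space is small enough to brute-force an unsalted hash; keeping the salt secret (and rotating it) is what makes the stored keys hard to reverse.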

Sunday, February 1, 2026

Show HN: Agent Tinman – Autonomous failure discovery for LLM systems https://ift.tt/oW8BFuy

Hey HN, I built Tinman because finding LLM failures in production is a pain in the ass. Traditional testing checks what you've already thought of. Tinman tries to find what you haven't.

It's an autonomous research agent that:

- Generates hypotheses about potential failure modes
- Designs and runs experiments to test them
- Classifies failures (reasoning errors, tool use, context issues, etc.)
- Proposes interventions and validates them via simulation

The core loop runs continuously. Each cycle informs the next.

Why now: With tools like OpenClaw/ClawdBot giving agents real system access, the failure surface is way bigger than "bad chatbot response." Tinman has a gateway adapter that connects to OpenClaw's WebSocket stream for real-time analysis as requests flow through.

Three modes:

- LAB: unrestricted research against dev
- SHADOW: observe production, flag issues
- PRODUCTION: human approval required

Tech:

- Python, async throughout
- Extensible GatewayAdapter ABC for any proxy/gateway
- Memory graph for tracking what was known when
- Works with OpenAI, Anthropic, Ollama, Groq, OpenRouter, Together

pip install AgentTinman
tinman init && tinman tui

GitHub: https://ift.tt/Eg4PCL2
Docs: https://oliveskin.github.io/Agent-Tinman/
OpenClaw adapter: https://ift.tt/BMG42ps

Apache 2.0. No telemetry, no paid tier. Feedback and contributions welcome.

https://ift.tt/Eg4PCL2 February 1, 2026 at 12:17AM
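
To make the Tinman post's hypothesis-experiment loop and its three modes more concrete, here is a rough Python sketch. None of these names come from Tinman itself: `Mode`, `decide`, and `research_cycle` are hypothetical, and the mapping of modes to actions (LAB applies, SHADOW only flags, PRODUCTION waits for a human) is my reading of the one-line descriptions in the post.

```python
from enum import Enum


class Mode(Enum):
    LAB = "lab"                # unrestricted research against dev
    SHADOW = "shadow"          # observe production, flag issues
    PRODUCTION = "production"  # human approval required


def decide(mode: Mode, intervention: str, approved_by_human: bool = False) -> str:
    """Return what happens to a proposed intervention in each mode."""
    if mode is Mode.LAB:
        return f"apply: {intervention}"
    if mode is Mode.SHADOW:
        return f"flag: {intervention}"
    if approved_by_human:
        return f"apply: {intervention}"
    return f"await approval: {intervention}"


def research_cycle(hypotheses, run_experiment, mode: Mode) -> list:
    """One pass: test each hypothesis, keep the failures, gate the fixes."""
    actions = []
    for hypothesis in hypotheses:
        failure_class = run_experiment(hypothesis)  # None means no failure found
        if failure_class is None:
            continue
        actions.append(decide(mode, f"mitigate {failure_class}"))
    return actions


# Toy experiment (an assumption) standing in for a real test run.
def fake_experiment(hypothesis: str):
    return "tool use" if "tool" in hypothesis else None


actions = research_cycle(
    ["tool called with empty args", "long context drops system prompt"],
    fake_experiment,
    Mode.SHADOW,
)
# actions == ["flag: mitigate tool use"]
```

In a continuous version, the failures found in one cycle would feed the hypothesis generator for the next, which is the "each cycle informs the next" part this sketch leaves out.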

Show HN: An extensible pub/sub messaging server for edge applications https://ift.tt/L5AHN7q

Hi there! I’ve been working on a project called Narwhal, and I wanted to share it with the community to get some feedback.

What is it? Narwhal is a lightweight Pub/Sub server and protocol designed specifically for edge applications. While there are great tools out there like NATS or MQTT, I wanted to build something that prioritizes customization and extensibility. My goal was to create a system where developers can easily adapt the routing logic or message handling pipeline to fit specific edge use cases, without fighting the server's defaults.

Why Rust? I chose Rust because I needed a low memory footprint to run efficiently on edge devices (like Raspberry Pis or small gateways), and also because I have a personal vendetta against Garbage Collection pauses. :)

Current status: it is currently in Alpha. It works for basic pub/sub patterns, but I’d like to start working on persistence support soon (so messages survive restarts or network partitions).

I’d love for you to take a look at the code! I’m particularly interested in all kinds of feedback on improvements I may have overlooked.

https://ift.tt/XdNptWO January 28, 2026 at 07:29PM
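
The idea of a pub/sub server whose routing logic can be swapped out, as the Narwhal post describes, can be shown in a few lines. Narwhal itself is written in Rust and its protocol is not shown here; the Python sketch below is a generic in-process illustration with hypothetical names (`Broker`, `exact_match`, `prefix_wildcard`), and the `/#` rule is only loosely modeled on MQTT's multi-level wildcard.

```python
from collections import defaultdict
from typing import Callable

# A router decides whether a message on `topic` reaches a subscription `pattern`.
Router = Callable[[str, str], bool]


def exact_match(pattern: str, topic: str) -> bool:
    return pattern == topic


def prefix_wildcard(pattern: str, topic: str) -> bool:
    """Treat a trailing '/#' as 'this topic and everything below it'."""
    if pattern.endswith("/#"):
        base = pattern[:-2]
        return topic == base or topic.startswith(base + "/")
    return pattern == topic


class Broker:
    """In-process broker whose matching rule is injected at construction."""

    def __init__(self, router: Router = exact_match):
        self.router = router
        self.subscribers = defaultdict(list)  # pattern -> callbacks

    def subscribe(self, pattern: str, callback) -> None:
        self.subscribers[pattern].append(callback)

    def publish(self, topic: str, payload) -> int:
        """Deliver to every matching subscription; return the delivery count."""
        delivered = 0
        for pattern, callbacks in self.subscribers.items():
            if self.router(pattern, topic):
                for callback in callbacks:
                    callback(topic, payload)
                    delivered += 1
        return delivered


received = []
broker = Broker(router=prefix_wildcard)
broker.subscribe("sensors/#", lambda t, p: received.append((t, p)))
broker.subscribe("sensors/temp", lambda t, p: received.append(("exact", p)))
count = broker.publish("sensors/temp", 21.5)    # both subscriptions match
missed = broker.publish("actuators/fan", "on")  # nothing matches
```

Passing the router in as a plain function is the simplest form of the extensibility the post is after: an edge deployment can replace the matching rule without touching the broker.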

Show HN: Lelu – gate OpenAI agent actions on confidence and prompt injection https://ift.tt/ogje6za

https://ift.tt/HB3UzGa June 25, 2026 at 12:09AM