Monday, June 29, 2026

Show HN: Caliper – pass@k reliability testing for Claude Code and Codex skills https://ift.tt/zc2jOVN

Show HN: Caliper – pass@k reliability testing for Claude Code and Codex skills

Skills for Claude Code and Codex are hard to test, in the sense that there's no standard way to do it. You evaluate the skill once on something and it looks like it works, so you publish it. Then a new model is released (GLM 5.2, anyone?), the skill quietly breaks in some cases, and you don't find out until your users complain.

I faced the same problem, so I built something lightweight to stop doing that: Caliper. It's a local, lightweight harness that runs a skill k times in isolated environments and gives you a pass@k score (how many of those k runs succeeded). Because the technology is non-deterministic, you can't just say "it worked once"; you need to say how many times it passed out of k.

You define success in a YAML spec. I picked YAML to keep a schema while staying readable for a human. You can use an LLM judge, a Python assertion, or both.

Here's a simple evaluation example with a JSON extraction task. You write this in a YAML file:

  tasks:
    - name: Extracts action items as clean JSON
      prompt: "Read /tmp/transcript.txt and write the action items to /tmp/actions.json."
      expect: "A valid JSON array where every item has owner, task, due. No markdown fences."
      assert: |
        import json
        items = json.load(open("/tmp/actions.json"))
        assert isinstance(items, list)
        assert all({"owner","task","due"} <= i.keys() for i in items)

Then you run it with the CLI:

  caliper run extract-actions.eval.yaml --k 5 --baseline

What's cool about the --baseline flag is that it re-runs everything without the skill, so you can see whether the skill is doing the work or the base agent was going to pass anyway:

  ID      Task                            k(5)  pass@k
  task-1  Extracts action items as JSON   5/5   100% PASS

  With skill  100%
  No skill     60%
  Delta       +40%

Most models get the JSON right most of the time (JSON extraction was already solved two years ago). But that's exactly it: "most of the time" is the bug.
That delta shows how much the skill actually helped. (It's sometimes 0%, sometimes -100%!)

I also created two skills so you can get started right away with your favorite harness, e.g. Claude Code, Codex, or Pi:

- evaluate-skill: run and manage evals without leaving your workflow
- grill-skill: reads your SKILL.md, interviews you about what "good" looks like, writes a 3-task spec (happy path, edge case, adversarial), and runs it

You can install the skills with this command:

  npx skills@latest add edonadei/caliper

For now I support claude-code, codex, pi, claude-api, and openai-api. The agent and the judge can run as separate backends, so you can run a skill on one and judge with another.

GitHub: https://github.com/edonadei/caliper
PyPI: https://pypi.org/project/caliper-eval/

Of course, it's a first step. I think the autorater layer can be vastly improved. Other ideas: more handholding to create and iterate on evaluation specs, support for more harnesses, and why not include this layer in a bigger self-improvement system?

If you're also building agentic evaluations, I'm genuinely interested to hear how you are handling that.

https://github.com/edonadei/caliper June 28, 2026 at 11:12PM
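As a rough illustration of the scoring described in the Caliper post, here is a minimal Python sketch that turns k pass/fail outcomes into a pass rate and a skill-versus-baseline delta. It is not Caliper's code: `RunSummary`, `summarize`, and `delta` are hypothetical names, and the outcome lists are made up to match the 5/5 and 60% figures in the sample report.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Pass/fail tally for one task over k runs (hypothetical helper)."""
    passes: int
    k: int

    @property
    def rate(self) -> float:
        # Fraction of the k runs that succeeded.
        return self.passes / self.k


def summarize(outcomes: list[bool]) -> RunSummary:
    """Collapse a list of per-run results into a tally."""
    if not outcomes:
        raise ValueError("need at least one run")
    return RunSummary(passes=sum(outcomes), k=len(outcomes))


def delta(with_skill: RunSummary, baseline: RunSummary) -> float:
    """Difference in pass rate attributable to the skill."""
    return with_skill.rate - baseline.rate


# Illustrative outcomes only, chosen to mirror the report in the post.
with_skill = summarize([True, True, True, True, True])
no_skill = summarize([True, False, True, True, False])

print(f"With skill {with_skill.rate:.0%}")
print(f"No skill   {no_skill.rate:.0%}")
print(f"Delta      {delta(with_skill, no_skill):+.0%}")
```

A negative delta in this sketch would mean the skill made the agent pass less often than the bare baseline, which is the case the post warns about.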

Sunday, June 28, 2026

Show HN: E3d-pod2vid – AI pipeline that turns podcasts into YouTube-ready videos https://ift.tt/pQgLkKO

Show HN: E3d-pod2vid – AI pipeline that turns podcasts into YouTube-ready videos Turn your .mpa files into animated videos. https://ift.tt/VhIXu35 June 28, 2026 at 03:39AM

Show HN: Wind particles on Mapbox from a single EXIF JPEG https://ift.tt/bvliysU

Show HN: Wind particles on Mapbox from a single EXIF JPEG https://ift.tt/i2U0atE June 27, 2026 at 11:46PM

Show HN: A Living Neural Web in HTML5 Canvas https://ift.tt/VXorU0M

Show HN: A Living Neural Web in HTML5 Canvas https://techoreon.github.io/verpad/canvas-playground.html June 27, 2026 at 10:05PM

Saturday, June 27, 2026

Show HN: Puzzle with Strangers. A free multiplayer jigsaw https://ift.tt/HDnN2bZ

Show HN: Puzzle with Strangers. A free multiplayer jigsaw I built this over the last few days. A handful of friends and I are thoroughly hooked. I recently went to a, for lack of a better word, social/collaborative performance at an art gallery in Berlin, where a group of artists filled a huge industrial hall with wooden 10x10cm cubes for people to build structures with. It was beautiful how universal the concept of playing with wooden blocks is and how ephemeral the structures were; people of all ages were put back into childlike play. The thought about what kind of games need zero explanation stuck with me, and I built an anonymous multiplayer jigsaw. We've already spent hours in there and you're invited now as well. Hope you enjoy. https://ift.tt/vj8UY2b June 26, 2026 at 10:17PM

Friday, June 26, 2026

Show HN: Every Team Is Building the Same Cache https://ift.tt/4pYJ9NC

Show HN: Every Team Is Building the Same Cache https://ift.tt/FcwphYl June 26, 2026 at 03:10AM

Show HN: No chair fixed my back, so we built one that won't let you sit still https://ift.tt/p4I8kTe

Show HN: No chair fixed my back, so we built one that won't let you sit still https://ift.tt/pQAH0fr June 26, 2026 at 12:36AM

Show HN: OpenKnowledge – open source AI-first alternative to Obsidian/Notion https://ift.tt/GZhLYtq

Show HN: OpenKnowledge – open source AI-first alternative to Obsidian/Notion

Hi HN, Nick here. We’re launching OpenKnowledge ( https://ift.tt/km0HdjE ), a “what you see is what you get” markdown editor with direct integrations with Claude, Codex, and Cursor. It's available as a macOS app or CLI, and it's fully free, local, and OSS ( https://ift.tt/lmswRG4 ).

We built this because we wanted a “Google Docs”-like experience for writing and sharing markdown files across our team. Obsidian is the best alternative we tried, but we found it doesn’t have a true “what you see is what you get” UI, and it didn’t integrate well with Claude/Codex outside of community plugins. So we built OpenKnowledge.

It takes shape as:

1. A macOS app with a file navigator, the WYSIWYG editor, and a link explorer.
2. Integrations with the Claude, Codex, and Cursor desktop apps. The agents can open an OpenKnowledge editor within their embedded web browsers for a side-by-side experience.
3. Built-in MCPs, skills, and RAG for LLM-wiki and “AI Second Brain” scenarios, plus spec writing.
4. An embedded terminal and CLI for TUI-first users.

The OSS stack includes Tiptap/ProseMirror, CodeMirror, Yjs (CRDT), Electron (macOS app), Orama, remark/rehype/micromark/mdast, and @pierre/trees.

On the architecture side, the interesting engineering challenges included:

1. A pipeline to convert between ProseMirror and markdown in a bidirectional, lossless way. ProseMirror uses ASTs, which are not designed to have byte fidelity.
2. A dual-observer CRDT to keep the ProseMirror and markdown state in sync. The CRDT + git also power a collaborative experience that shows what agents are doing in the markdown and provides undo/redo and version history.

The “Share” and cloud-sync functionality are geared for team collaboration. They feel “no-code” but leverage git/GitHub under the hood, which also means data stays fully private.

In that spirit, we made OpenKnowledge open source for anybody who’s curious or who’d like to contribute.
We’re actively thinking about plugins/extensibility and what’s next. If you have suggestions or feedback, we'd love to hear them. https://ift.tt/lmswRG4 June 25, 2026 at 09:34PM

Thursday, June 25, 2026

Show HN: LookAway, a Mac break reminder that knows when not to interrupt https://ift.tt/wrZbBMy

Show HN: LookAway, a Mac break reminder that knows when not to interrupt

Hello, I'm Kushagra, the indie developer behind LookAway. (I've posted about it before, but it has received quite a lot of updates since then, so I am posting it again.)

LookAway is a native break reminder for macOS that doesn't interrupt. I built it because I work from home and spend a lot of time in front of my screens. It's very easy for me to get lost in the flow, and I can end up sitting for hours. Because of this, I started having issues like eye strain and back pain by the end of the day. The solution was simply taking enough breaks throughout the day. But remembering to take breaks was difficult, especially when I was in the flow. I tried some reminder apps, but they always interrupted me at the worst moments, so I ended up not using them.

LookAway is designed not to interrupt. It gives you enough of a heads-up before a break that you're not caught off guard. It's also context-aware: it automatically pauses when you go into a meeting, start watching a video, record your screen, and much more. It even waits for you to finish typing or dictating when a break is due.

One thing worth mentioning is the free iOS counterpart, LookAway Mirror. When your Mac goes on a break, your iOS devices can mirror the same break so you don't end up scrolling on your phone during the Mac break.

I've spent a lot of time making LookAway the least annoying break reminder app, and I would love to know your thoughts. It's a native Swift app, so it doesn't take many resources (150MB RAM and <1% CPU when idle). It's available to download from the website (lookaway.com), Setapp, and the App Store. Thank you!

https://lookaway.com June 24, 2026 at 06:59PM

Show HN: Lelu – gate OpenAI agent actions on confidence and prompt injection https://ift.tt/ogje6za

Show HN: Lelu – gate OpenAI agent actions on confidence and prompt injection https://ift.tt/HB3UzGa June 25, 2026 at 12:09AM

Show HN: Follow the Thread – a calmer, typographic way to read Wikipedia https://ift.tt/ZQ2Dx9y

Show HN: Follow the Thread – a calmer, typographic way to read Wikipedia https://ift.tt/PigdOXU June 24, 2026 at 10:46PM

Wednesday, June 24, 2026

Show HN: The Cascade Graph – An interactive map of AI and energy constraints https://ift.tt/O47EcUS

Show HN: The Cascade Graph – An interactive map of AI and energy constraints Hello, I wanted to share with you all an interactive map of the economic and physical constraints of the AI buildout. It covers macro drivers, industrial chokepoints, and where those show up in markets. I've added 393 nodes and 562 edges to capture other supply and physical constraints as well. There's no sign-up and no paywall; it's all free. Please let me know what you think! https://ift.tt/Wqsz6Dp June 23, 2026 at 08:52PM

Show HN: I created agent skill based on Peter Lynch's books https://ift.tt/p0wT5nA

Show HN: I created agent skill based on Peter Lynch's books For the last few months I have been analysing Peter Lynch’s books on stock picking and doing prompt engineering to check whether AI could create useful stock analyses. To my surprise, it started producing reports with well-cited sources that let me understand companies much faster. I hope you find it interesting and useful :) Peter Lynch’s books I analysed: Learn to Earn, One Up on Wall Street, Beating the Street https://ift.tt/OuT2bgJ June 24, 2026 at 12:32AM

Tuesday, June 23, 2026

Show HN: Loft gives thumb-keys and split-layout on a standard laptop or keyboard https://ift.tt/pc2OxM1

Show HN: Loft gives thumb-keys and split-layout on a standard laptop or keyboard I've put up a homepage for my keyboard layout, LOFT, and thought I'd share in case anyone finds it interesting... LOFT is free on macOS and remaps your laptop or standard keyboard into the thumb-keyed, split-layout, ergonomic dream you've been seeking! It positions your hands up and out in a creative way to get you all the goodies you thought you'd need some garage-built geek contraption for. All on a standard ANSI keyboard. Possibly also of interest: the site's keyboard graphics are all generated HTML+CSS via Hugo partials. https://ift.tt/bZ5PcIu June 23, 2026 at 08:20AM

Show HN: Durable Agent Sessions API (Preview) https://ift.tt/r7FUYks

Show HN: Durable Agent Sessions API (Preview) https://ift.tt/51rLRt3 June 23, 2026 at 07:07AM

Show HN: Kitcat 2.0 – A Matplotlib back end for terminal plotting https://ift.tt/2Fo34f5

Show HN: Kitcat 2.0 – A Matplotlib back end for terminal plotting https://ift.tt/rMFy7Xa June 22, 2026 at 11:00PM

Monday, June 22, 2026

Show HN: Pure Effect – Reproduce production bugs on your laptop without a DB https://ift.tt/0JPUXjn

Show HN: Pure Effect – Reproduce production bugs on your laptop without a DB

Hi HN, I think it's safe to say that the majority of developers don't give a second thought to writing code with I/O tangled in business logic. It's all too common to see code like:

  const user = await findUser(email);
  if (!user) await saveUser({ email });

Now, you may ask: what's the big deal? When we write code like this, two things happen:

1. It gets harder to debug production bugs. Unless you have the exact same database and remote API services to connect to, you may fail to reproduce the bug.
2. You have to use mocks and fakes in your tests, or use test containers, which only help somewhat, and they are slow!

To solve these issues, I built Pure Effect, a tiny TypeScript/JavaScript effect library. The core idea is simple: if a function performs I/O, it isn't pure. But if it returns a description of the I/O it wants to perform, it is. So instead of await findUser(email), you return a Command object that says, "I would like to call this function, and when it finishes, here's what to do next." Your business logic becomes a pure function. Same input, same output, every time. The database never gets touched until the interpreter (runEffect) runs.

When I first started the library, I didn't expect just how far that one idea would stretch. Once your pipelines are just data, a lot of wonderful things become possible:

- No need for mocking libraries. You walk the tree in tests and assert on its structure: assert.equal(flow.cmd.name, 'cmdFindUser'). Nothing is executed.
- Wrap any effect with Retry(effect, { attempts: 3, delay: 200, backoff: 2 }). The configuration is plain data, so you can assert on it in tests.
- Every command's input and output flows through the interpreter, so you get a full execution trace for free. You can write a simple timeTravel() function that replays it locally without touching any I/O. Perfect for debugging complex production bugs.
- An onBeforeCommand hook sits between your business logic and the interpreter. Since it sees every intended side effect before it fires, it can be used to enforce runtime guardrails. For example, you can quarantine destructive calls before they happen.
- You can review AI-generated code before it runs. Since Pure Effect pipelines are plain data, you can inspect what the generated code intends to do before it touches anything.

There are just six primitives: Success, Failure, Command, Ask, Retry, and Parallel, plus effectPipe and runEffect. Zero dependencies. Under 1 KB minified and gzipped.

How it compares to Effect-TS

Effect-TS is the full-featured option in this space and has a large ecosystem. Pure Effect offers a different tradeoff. It covers the 80% case: testable pipelines, dependency injection, retry, and OpenTelemetry hooks, all in under 1 KB with zero dependencies and no new vocabulary to learn. Effect-TS is a framework you build around. Pure Effect, on the other hand, is a pattern you drop into existing code.

I've been using Pure Effect in production since December. It's at v0.8.0, not 1.0 yet, but stable enough that I wanted to put it out there and hear what people think.

GitHub: https://ift.tt/QSyexEs

I wrote five posts that document how Pure Effect evolved. They are tagged at https://ift.tt/TfcXLNw if you want the longer story.

https://pure-effect.org June 21, 2026 at 11:06PM
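To make the "effects as data" idea from the Pure Effect post concrete, here is a small Python analogue. It is only a sketch of the pattern, not Pure Effect's API: the `Success` and `Command` classes, the `register_user` pipeline, the `run_effect` interpreter, and the handler names are all hypothetical stand-ins written for this illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Union


@dataclass
class Success:
    """Terminal effect carrying the final value."""
    value: Any


@dataclass
class Command:
    """Description of an I/O call plus what to do with its result."""
    name: str
    args: tuple
    then: Callable[[Any], "Effect"]


Effect = Union[Success, Command]


def register_user(email: str) -> Effect:
    """Pure business logic: returns a description and performs no I/O."""
    def after_find(user: Any) -> Effect:
        if user is not None:
            return Success(user)
        # Ask the interpreter to save a new user, then finish with the result.
        return Command("save_user", ({"email": email},), Success)

    return Command("find_user", (email,), after_find)


def run_effect(effect: Effect, handlers: dict, trace: list) -> Any:
    """Interpreter: the only place where side effects actually run."""
    while isinstance(effect, Command):
        result = handlers[effect.name](*effect.args)
        trace.append((effect.name, effect.args, result))  # execution trace
        effect = effect.then(result)
    return effect.value


# Inspect the pipeline without executing anything.
flow = register_user("a@example.com")
print(flow.name)

# Run it against an in-memory stand-in for a database.
db: dict = {}
handlers = {
    "find_user": lambda email: db.get(email),
    "save_user": lambda user: db.setdefault(user["email"], user),
}
trace: list = []
result = run_effect(flow, handlers, trace)
print(result, len(trace))
```

Because `register_user` only builds data, a test can check `flow.name` directly, and the recorded `trace` is the kind of input a replay function could consume in place of real handlers.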

Show HN: DebugBrief – turn debugging sessions into reports, no AI https://ift.tt/F3yUf2V

Show HN: DebugBrief – turn debugging sessions into reports, no AI https://ift.tt/ltmZE45 June 22, 2026 at 01:27AM

Show HN: CleverCrow: give tokens to your favorite projects https://ift.tt/F9h2m7V

Show HN: CleverCrow: give tokens to your favorite projects Howdy all. I'm Zack. I've been thinking about the problem of misguided AI pull requests and figured I'd throw a possible solution out there for feedback. Basically, CleverCrow lets supporters give tokens to a GitHub repo (or a set of issues in that repo) for the maintainers to use to build/fix stuff. The fun implementation challenges have been the pooling dynamics and keeping the maintainers in charge while the backers stay motivated to support their work. https://clevercrow.io June 22, 2026 at 12:36AM

Sunday, June 21, 2026

Show HN: An n8n alternative where coding agents build the workflows, not humans https://ift.tt/vgkmR29

Show HN: An n8n alternative where coding agents build the workflows, not humans n8n is built for humans dragging nodes on a canvas. That breaks down at B2B scale (embedding in a product, multi-tenant scalability, etc.). n8n does have an MCP server so agents can create workflows too, but it outputs raw JSON. That's fine for n8n's engine, but painful for a coding agent (or a human reviewing its output) to read, write, diff, or debug. I'm building an alternative where workflows are authored by a coding agent in [a more dev-legible format] instead of JSON blobs, and executed at scale. https://velane.sh/ June 21, 2026 at 12:14AM
