DEV Community

elvisyao007

AI implementation engineer. I make LLM systems reliable, measurable, and production-ready — eval-driven, secure on-prem RAG/agents. Building in public from one RTX 5090.

A Chinese 8B model beat the Western 8B models at Japanese RAG. I still wouldn't put it in the default deployment — and that distinction is the point.

4 min read

Which Chinese open-source parser is better for Japanese RAG? It's a crossover — BM25 says DeepDoc, dense says MinerU

4 min read
Structured parsing helps dense retrieval more than it helps BM25 — measured on Japanese docs, and the gap doubled

4 min read
Half of agent evaluation needs no LLM judge — and it's the half that catches the failures that actually hurt

5 min read
Does a Chinese document parser actually work on Japanese PDFs? I measured it — and the answer is 'it depends on the font path'

5 min read
My local-LLM benchmark gave every model a perfect score. That was the most useful failure of the project.

5 min read
I built a self-hosted LLM stack that grades itself — audit trail, per-user auth, and a built-in acceptance test

6 min read
Your RAG dashboard can hide a failing retriever: detecting silent regression

3 min read
I built a tiny tool to catch the metric trap from my last post

1 min read
The 33 'grounded-but-wrong' answers were a metric artifact: how ID-based context recall lies on multi-answer datasets

4 min read
faithfulness spread = 0.000: what self-grading RAG eval actually looks like

1 comment
4 min read
My RAG's faithfulness was 0.67. 1 in 3 answers were still wrong.

1 comment
6 min read