Zac Zuo

SmolDocling - 256M VLM for end-to-end document AI


SmolDocling, from Hugging Face and IBM Research, is an ultra-compact (256M-parameter) open VLM for end-to-end document conversion. It extracts text, layout, tables, code, and more from images.

Zac Zuo
Hunter

Hi everyone!

Check out SmolDocling, a new open-source vision-language model from Hugging Face and IBM Research! True to its name, it's incredibly small (only 256M parameters!), yet it's designed for full, end-to-end document conversion.

You feed it an image of a document page (a scanned PDF, a photo, etc.), and it outputs a structured representation (called "DocTags") that includes everything:

📝 Text (OCR): It extracts the text, of course.
📑 Layout: It understands the page layout (paragraphs, headings, lists, etc.).
📊 Tables: It extracts table structure and content.
💻 Code: It recognizes and formats code blocks (with indentation!).
➕ Equations: It handles mathematical formulas.
🖼️ Figures: It identifies figures and links captions.

The key is that it does all of this in a single model, end-to-end, unlike traditional approaches that use separate OCR, layout analysis, and table extraction tools. And it does it with a model that's tiny compared to most VLMs.
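To make the idea of a single structured output more concrete, here is a minimal sketch of how downstream code might consume it. Note that the markup below is a simplified, hypothetical stand-in: the tag names, the `<loc_N>` coordinate tokens, and the `parse_elements` helper are assumptions for illustration, not the actual DocTags specification or any SmolDocling API.

```python
import re
from dataclasses import dataclass

# Hypothetical, simplified DocTags-like markup. The tag names and the
# <loc_N> coordinate tokens are illustrative assumptions, not the real spec.
SAMPLE = (
    "<section_header><loc_10><loc_12><loc_200><loc_30>Introduction</section_header>"
    "<text><loc_10><loc_40><loc_480><loc_90>SmolDocling converts pages.</text>"
    "<code><loc_10><loc_100><loc_480><loc_160>print('hi')</code>"
)


@dataclass
class Element:
    kind: str      # element type, e.g. "text" or "code"
    bbox: tuple    # four integers taken from the <loc_N> tokens
    content: str   # text between the location tokens and the closing tag


# One element: <kind>, exactly four <loc_N> tokens, content, </kind>.
ELEMENT_RE = re.compile(
    r"<(?P<kind>\w+)>(?P<locs>(?:<loc_\d+>){4})(?P<content>.*?)</(?P=kind)>",
    re.DOTALL,
)
LOC_RE = re.compile(r"<loc_(\d+)>")


def parse_elements(markup: str) -> list:
    """Split the assumed markup into typed elements with bounding boxes."""
    elements = []
    for match in ELEMENT_RE.finditer(markup):
        bbox = tuple(int(v) for v in LOC_RE.findall(match.group("locs")))
        elements.append(Element(match.group("kind"), bbox, match.group("content")))
    return elements


elements = parse_elements(SAMPLE)
for el in elements:
    print(el.kind, el.bbox, el.content)
```

The point of the sketch is that text, layout type, and position all arrive in one stream from one model, so a consumer needs a single parser rather than glue code that merges the outputs of separate OCR, layout, and table tools.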

It's built on SmolVLM (also open-source) and achieves results competitive with models many times its size.

You can try SmolDocling yourself here.

@zaczuo any built-in support for multiple languages or specialized vocabularies? I would love to try it on academic journals that mix English text with foreign-language citations.

Zac Zuo
Hunter

@hamza_afzal_butt Good question! It's primarily English-focused, but the OCR should handle other languages. Best to test it with your specific documents, though, as mixed-language performance isn't specifically benchmarked.

Jun Shen

Automated document parsing is a great solution! 👀

Denis Sigal

I used Docling a couple of months ago and it was already cool. Now this mini version sounds even cooler!