SmolDocling, from Hugging Face and IBM Research, is the ultra-compact (256M) open VLM for end-to-end document conversion. Extracts text, layout, tables, code, and more from images.
Replies
Hi everyone!
Check out SmolDocling, a new open-source vision-language model from Hugging Face and IBM Research! True to its name, it's incredibly small (only 256M parameters!), yet it's designed for full, end-to-end document conversion.
You feed it an image of a document page (a scanned PDF, a photo, etc.), and it outputs a structured representation (called "DocTags") that includes everything:
- Text (OCR): It extracts the text, of course.
- Layout: It understands the page layout (paragraphs, headings, lists, etc.).
- Tables: It extracts table structure and content.
- Code: It recognizes and formats code blocks (with indentation!).
- Equations: It handles mathematical formulas.
- Figures: It identifies figures and links captions.
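To give a feel for what DocTags output is like: it's a compact, XML-style markup with element tags and location tokens. Below is a small Python sketch that extracts plain text from a hypothetical DocTags string. The tag names and sample output here are illustrative assumptions, not the model's exact vocabulary, so check the model card for the real format.

```python
import re

# Hypothetical DocTags-style output for one page (tag names are illustrative).
doctags = (
    "<doctag>"
    "<section_header><loc_10><loc_8><loc_90><loc_14>Results</section_header>"
    "<text><loc_10><loc_16><loc_90><loc_40>We evaluate on three benchmarks.</text>"
    "<code><loc_10><loc_42><loc_90><loc_60>print('hello')</code>"
    "</doctag>"
)

def extract_text(tags: str) -> list[tuple[str, str]]:
    """Return (element_type, content) pairs, dropping <loc_*> location tokens."""
    pairs = re.findall(r"<(section_header|text|code)>(.*?)</\1>", tags, re.S)
    return [(kind, re.sub(r"<loc_\d+>", "", body)) for kind, body in pairs]

for kind, content in extract_text(doctags):
    print(f"{kind}: {content}")
```

The nice part of a format like this is that layout type, reading order, and position all survive in one string, so downstream conversion (to Markdown, HTML, etc.) is a deterministic parsing step rather than another ML model.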
The key is that it does all of this in a single model, end-to-end, unlike traditional approaches that use separate OCR, layout analysis, and table extraction tools. And it does it with a model that's tiny compared to most VLMs.
It's built on SmolVLM (also open-source) and achieves competitive results with models many times its size.
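Since it's a standard Hugging Face checkpoint, running it locally is a short script. Here's a minimal sketch assuming the `ds4sd/SmolDocling-256M-preview` checkpoint, the `transformers` library, and a local page image; the instruction text and generation settings follow my reading of the model card and may need adjusting.

```python
MODEL_ID = "ds4sd/SmolDocling-256M-preview"

def build_messages(instruction: str = "Convert this page to docling."):
    """Chat-style message: one image slot plus the conversion instruction."""
    return [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": instruction}]}]

def convert_page(image_path: str) -> str:
    """Run the model on one page image and return the raw DocTags string."""
    # Heavy imports kept local so the sketch can be read without the libraries.
    from PIL import Image
    from transformers import AutoModelForVision2Seq, AutoProcessor

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForVision2Seq.from_pretrained(MODEL_ID)

    prompt = processor.apply_chat_template(build_messages(),
                                           add_generation_prompt=True)
    inputs = processor(text=prompt, images=[Image.open(image_path)],
                       return_tensors="pt")
    generated = model.generate(**inputs, max_new_tokens=4096)
    # Drop the prompt tokens; keep special tokens, since DocTags use them.
    new_tokens = generated[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=False)[0]

if __name__ == "__main__":
    print(convert_page("page.png"))
```

At 256M parameters the whole thing fits comfortably on a laptop GPU or even CPU, which is the practical payoff of the "smol" design.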
You can try SmolDocling yourself here.
@zaczuo any built-in support for multiple languages or specialized vocabularies? I would love to try it on academic journals that mix English text with foreign-language citations.
@hamza_afzal_butt Good question! It's primarily English-focused, but the OCR should handle other languages. Best to test it with your specific documents, though, as mixed-language performance isn't specifically benchmarked.
Automated document parsing is a great solution!
I used Docling a couple of months ago and it was already cool; this mini version sounds even cooler!