
Regoxa Document Classification and Splitting
Automatically sort, divide, and organize incoming documents
Purpose-built AI that classifies and splits your documents
Document classification reads each file and groups it by type, using both the content and the surrounding context to decide where it belongs. Once a document is classified, it goes straight to the right extraction step, so the data comes out clean and ready to use.
Handle high document volumes with ease
01
Process any document, in any format, faster
Hand the manual sorting to AI that reads any format, from image-based to text-based, from structured and semi-structured to fully unstructured, in many languages. You move faster, cut down on errors, and give your team back time for work that matters.
02
Break large files into separate documents
When a single file packs several documents together, like invoices, purchase orders, and agreements, Regoxa pulls them apart so each one can be extracted cleanly. Split on blank pages or barcodes for simple cases, or let the AI recognize where one document ends and the next begins.
03
Identify document types automatically
With Mercury, Regoxa's LLM-based extraction product, classification happens as part of reading the document. The model identifies the document's domain (bank statement, invoice, contract, and more) in any language, then routes it correctly for extraction.
04
Built for your own documents
Regoxa adapts to the document types your business actually handles, so even unusual or specialized files get recognized and sorted accurately.
05
Gets sharper over time
A human-in-the-loop process keeps results sharp. People review the model's calls and correct any misses, and that feedback improves accuracy as your documents change.

Connect Regoxa Document AI and OCR to your existing applications
Regoxa connects cleanly with a wide range of business automation platforms, putting intelligent document processing (IDP) and advanced optical character recognition (OCR) right where you work. Link it to your automation systems, including robotic process automation (RPA), business process management (BPM), and enterprise content management (ECM), and to everything from chatbots to mobile devices to email. Pre-built connectors make integration with key applications quick and painless, and for anything more bespoke, developers can use our REST API to build a connector to almost any system or device.
You can also import documents from multiple devices and locations. Once the data is extracted, both the processed information and the document's status come straight back to your application or shared folder, or you can export the data to another system via API to keep everything in sync.
How document classification works
Regoxa's purpose-built AI reads all your documents, whether structured forms like IDs, semi-structured files like utility bills, or unstructured contracts, and makes sense of the data inside them.
01
Prepare
Choose the categories you want documents sorted into, such as invoices, contracts, and resumes, so each can follow its own workflow. When a file holds several documents, Regoxa splits it into individual documents first.
02
Classify
Each new document is read to work out its type and given a confidence score along the way. Once identified, it's routed to extraction, where the data you need, such as ID numbers, shipping dates, or beneficiary names, gets pulled.
03
Improve
The system keeps learning through human-in-the-loop review, so automation gets more precise and you step in less over time.
What is document classification, and how does it differ from manual sorting?
Document classification is the automated process of identifying and categorizing incoming business documents by type, based on their content, layout, and contextual signals. Where manual sorting depends on human judgment and is inherently prone to inconsistency and delay, automated classification applies machine learning and natural language processing to make the same determination instantly and at scale, with a measurable confidence score attached to every decision.
What types of documents can Regoxa classify?
Regoxa classifies structured documents such as tax forms and identification cards, semi-structured documents such as invoices and utility bills, and fully unstructured documents such as contracts and correspondence. Classification works across image-based and text-based formats and covers more than 200 languages, so the system accommodates virtually any document your organization receives, regardless of origin or format.
How does document splitting work, and when is it necessary?
Document splitting becomes necessary when a single file contains multiple distinct documents bundled together, a common occurrence when scanning physical mail or receiving consolidated PDFs. Regoxa detects the boundaries between individual documents using configurable methods, including blank-page separation, barcode detection, or a trained neural network model that recognizes structural markers specific to your document types. Each document is then separated into its own file before extraction begins, ensuring that data is pulled from the correct document every time.
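To make the simplest of these methods concrete, here is a minimal sketch of blank-page separation. It is an illustration, not Regoxa's implementation: the function name `split_on_blank_pages` is hypothetical, and it assumes each page has already been reduced to its recognized text, with a whitespace-only page acting as the separator.

```python
from typing import List


def split_on_blank_pages(pages: List[str]) -> List[List[str]]:
    """Group pages into documents, treating whitespace-only pages as separators.

    Hypothetical helper for illustration; assumes one text string per page.
    """
    documents: List[List[str]] = []
    current: List[str] = []
    for text in pages:
        if text.strip():
            # The page has content, so it belongs to the current document.
            current.append(text)
        elif current:
            # Blank page: close the current document and drop the separator.
            documents.append(current)
            current = []
    if current:
        # Flush the final document, which has no trailing separator.
        documents.append(current)
    return documents


bundle = [
    "Invoice 1001",
    "Invoice 1001, page 2",
    "",
    "Purchase order 77",
    "   ",
    "Agreement",
]
parts = split_on_blank_pages(bundle)
```

In this sketch the six-page bundle yields three documents, and the separator pages are discarded instead of being attached to either neighbor.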
How accurate is Regoxa's document classification?
Accuracy depends on the quality of the training data and the complexity of the document types involved, but Regoxa's classification models are designed to achieve high straight-through processing rates from the outset. Every classified document receives a confidence score, and any result that falls below your defined threshold is flagged for human review rather than passed downstream unchecked. Over time, the reviewers' corrections are fed back into the model, steadily raising accuracy.
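The threshold check described above can be sketched in a few lines. Everything here is an assumption made for illustration: the `Classification` record, the `needs_review` helper, the 0.0 to 1.0 score range, and the 0.85 default threshold are hypothetical and are not taken from Regoxa's product.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Classification:
    doc_id: str
    doc_type: str
    confidence: float  # assumed to lie between 0.0 and 1.0


def needs_review(result: Classification, threshold: float = 0.85) -> bool:
    """Return True when the score falls below the threshold (hypothetical default)."""
    return result.confidence < threshold


results: List[Classification] = [
    Classification("doc-1", "invoice", 0.97),
    Classification("doc-2", "contract", 0.62),
    Classification("doc-3", "utility_bill", 0.85),
]

# Low-confidence results wait for a person; the rest continue downstream.
review_queue = [r.doc_id for r in results if needs_review(r)]
straight_through = [r.doc_id for r in results if not needs_review(r)]
```

Note that in this sketch a score exactly equal to the threshold passes straight through, since only results strictly below it are flagged.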
What happens after a document is classified?
Once classified, the document is automatically routed to the extraction model built for that document type. That model knows which fields to locate and how to interpret the content within them, whether that is an invoice total, a policy number, or a shipment date. This tight coupling between classification and extraction is what makes the overall intelligent document processing pipeline accurate and efficient.
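One way to picture this routing step is as a lookup from document type to the fields its extraction model targets. The sketch below is only illustrative: the type names, field names, and the `fields_for` helper are hypothetical and do not reflect Regoxa's actual configuration.

```python
from typing import Dict, List

# Hypothetical mapping from a classified type to the fields its extractor looks for.
FIELDS_BY_TYPE: Dict[str, List[str]] = {
    "invoice": ["invoice_number", "invoice_total"],
    "insurance_policy": ["policy_number"],
    "shipping_notice": ["shipment_date"],
}


def fields_for(doc_type: str) -> List[str]:
    """Return the fields to extract for a classified type, or fail loudly."""
    try:
        return FIELDS_BY_TYPE[doc_type]
    except KeyError:
        # An unknown type should not silently fall through to the wrong extractor.
        raise ValueError(f"no extraction model registered for {doc_type!r}")


invoice_fields = fields_for("invoice")
```

Failing on an unknown type, as this sketch does, keeps a misclassified or unsupported document from being processed by an extractor that was never meant for it.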
How does the human-in-the-loop process improve classification over time?
When a classification model makes an error, a human reviewer corrects it. That correction is not simply a one-off fix; it becomes training data that updates the model's understanding of how that document type should be recognized. Repeated across many documents and variations, this feedback loop produces a model that grows progressively more accurate, requiring less human intervention as it matures.
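The feedback loop can be sketched as a review step that turns each correction into a new labeled example. This is a simplified illustration under stated assumptions: the `record_review` helper and the list-of-pairs training set are hypothetical, and retraining itself is left out.

```python
from typing import List, Tuple

LabeledExample = Tuple[str, str]  # (document text, document type)


def record_review(
    training_set: List[LabeledExample],
    text: str,
    predicted: str,
    reviewed: str,
) -> bool:
    """Append the reviewer's label as training data when it differs from the prediction.

    Returns True if a correction was recorded. Hypothetical helper for illustration.
    """
    if reviewed == predicted:
        return False  # the model was right, so there is nothing to correct
    training_set.append((text, reviewed))
    return True


training_set: List[LabeledExample] = []
record_review(training_set, "Statement of account, March", "invoice", "bank_statement")
record_review(training_set, "Invoice 1001, total due", "invoice", "invoice")
```

After these two reviews the training set holds only the corrected bank statement, which would be included the next time the model is retrained.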
Can Regoxa classify documents that are specific to my industry or organization?
Yes. Beyond the pre-trained models that cover common document types, Regoxa allows you to build custom classification models tailored to the documents your business actually handles. You provide a representative set of examples for each document type, and the model learns to distinguish them by their layout, textual content, and visual characteristics. Custom models can be deployed quickly and continue to improve as more documents are processed.
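As a rough illustration of learning from labeled examples, the toy classifier below counts the words seen for each document type and scores new text by word overlap. It is a deliberately simplified stand-in, not Regoxa's model: it uses text only (no layout or visual features), and the `train` and `classify` functions and sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def train(examples: Dict[str, List[str]]) -> Dict[str, Counter]:
    """Build one word-frequency table per document type from example texts."""
    return {
        label: Counter(word for text in texts for word in text.lower().split())
        for label, texts in examples.items()
    }


def classify(model: Dict[str, Counter], text: str) -> Tuple[str, float]:
    """Return the best-matching type and its share of the total overlap score."""
    words = text.lower().split()
    scores = {label: sum(counts[w] for w in words) for label, counts in model.items()}
    total = sum(scores.values())
    best = max(scores, key=lambda label: scores[label])
    confidence = scores[best] / total if total else 0.0
    return best, confidence


model = train({
    "invoice": ["invoice total amount due", "invoice number and total"],
    "resume": ["work experience and education", "skills and experience"],
})
label, confidence = classify(model, "total amount due on this invoice")
```

Text that shares no words with any example gets a confidence of 0.0 in this sketch, which is the kind of result a review threshold would send to a person.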
How does Regoxa handle documents in multiple languages?
Classification and splitting operate across more than 200 languages without requiring separate models for each one. The underlying combination of image processing, natural language processing, and multimodal machine learning enables the system to identify document types by their visual structure and semantic content regardless of the language in which they are written. Multilingual documents, where content appears in more than one language within a single file, are also handled correctly.
How much technical expertise is required to set up and maintain classification models?
Regoxa's low-code platform is designed so that business users without deep technical backgrounds can build, train, and refine classification models. Providing labeled examples, reviewing flagged results, and adjusting confidence thresholds are all straightforward operations within the interface. For organizations that need more advanced customization, the platform also supports neural network-based model training, which developers can configure to recognize highly specific document boundaries and structures.
Frequently asked questions