Barentz is a global distributor of life science ingredients in the chemical industry, operating in over 70 countries with a complex supply chain and numerous ERP systems.

Barentz receives product information from suppliers as PDF files and currently relies on manual data entry to get it into its ERP systems. The company is looking for an automated solution that accurately extracts and standardizes product data across the various document formats, in order to reduce manual work, ensure consistency, and support better data-driven decisions. The project converts the unstructured data in these PDF files into structured data for an enhanced product taxonomy, so that Barentz can maintain well-organized product databases. The primary goals are to establish an effective data model, exploring both hierarchical and relational structures, and to evaluate various information extraction (IE) techniques, with a particular focus on zonal optical character recognition (OCR) and regular expressions.

The implementation consists of a pure regular expression algorithm, which is tested alongside other text mining and information extraction methods. The algorithms are evaluated with quantitative metrics and qualitative assessments, focusing on accuracy, consistency, and efficiency. Results indicate that the pure regular expression algorithm performs well: it is highly accurate, runs quickly, and captures the key attributes efficiently. However, it offers limited flexibility and does not scale easily to diverse document formats and languages.
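To illustrate the general idea of regular-expression-based extraction, here is a minimal sketch in Python. It is not the algorithm developed in the project: the attribute names, the label wording in the patterns, and the sample text are all hypothetical, and the sketch assumes the text has already been extracted from the PDF.

```python
import re

# Hypothetical attribute patterns. The labels ("Product name", "CAS No.",
# "Shelf life") are assumptions about how a supplier document might be worded.
PATTERNS = {
    "product_name": re.compile(r"^Product\s+name\s*:\s*(.+)$", re.I | re.M),
    "cas_number": re.compile(
        r"\bCAS(?:\s+(?:No\.?|Number))?\s*:\s*(\d{2,7}-\d{2}-\d)\b", re.I
    ),
    "shelf_life_months": re.compile(
        r"^Shelf\s+life\s*:\s*(\d+)\s*months?\b", re.I | re.M
    ),
}


def extract_attributes(text):
    """Return one structured record; attributes that are not found are None."""
    record = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        record[name] = match.group(1).strip() if match else None
    return record


# Invented sample standing in for text extracted from a supplier PDF.
SAMPLE_TEXT = """\
Product name: Example Citric Acid
CAS No.: 77-92-9
Shelf life: 36 months
"""

if __name__ == "__main__":
    print(extract_attributes(SAMPLE_TEXT))
```

A document that words its labels differently, or is written in another language, would yield `None` for every attribute until new patterns are added, which is one way the flexibility limitation noted above shows up in practice.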