I was doing some work for a client helping them implement Continia Software’s Document Management solution; this is a widely used extension for Microsoft Dynamics 365 Business Central which scans invoices, and other documents, and uses OCR (Optical Character Recognition) to match the scanned invoice with the Business Central transaction.
While preparing for a training session, I discovered that one of the supplier invoices was not being read correctly. There was a Product Code heading followed by rows of item numbers; below these, just outside of the lines table, were the payment terms, which were being picked up as an item number.
Fortunately, Document Capture allows you to add Rules to fields, which we could use to stop it reading the payment terms as an item number.
The item numbers in this case all followed a pattern which was fairly similar. Below are examples of the two formats of item number used by the supplier:
- AA1 – two letters followed by up to three numbers.
- AA1A – two letters followed by up to three numbers followed by a letter.
The rules allowed by Document Capture use regular expressions (commonly shortened to “regex”); a regular expression is a sequence of characters that specifies a match pattern in text.
The regex required for this supplier invoice is as below:
[A-Z]{2}[0-9]{1,3}[A-Z]{0,2}
To break this down:
- [A-Z]{2} – exactly two alpha characters.
- [0-9]{1,3} – between 1 and 3 digits.
- [A-Z]{0,2} – none, one or two alpha characters.
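For anyone wanting to try the pattern outside of Document Capture, here is a minimal sketch in Python. It writes the pattern as described (two letters, one to three digits, then up to two optional letters) using the range quantifiers {1,3} and {0,2}. Python's re module is only a stand-in for whichever regex engine Document Capture uses, the helper name is_item_number and the sample values are hypothetical, and matching the whole value with fullmatch() is my assumption about how the rule should behave rather than documented product behaviour.

```python
import re

# Two letters, one to three digits, then zero to two letters.
ITEM_NUMBER = re.compile(r"[A-Z]{2}[0-9]{1,3}[A-Z]{0,2}")


def is_item_number(value: str) -> bool:
    """Return True when the whole value looks like a supplier item number.

    fullmatch() is an assumption: it requires the pattern to cover the
    entire value, so trailing text or extra digits cause a rejection.
    """
    return ITEM_NUMBER.fullmatch(value.strip()) is not None


# Hypothetical values as they might be captured under the Product Code heading.
captured = ["AB1", "AB123", "AB12C", "30 DAYS NET", "AB1234"]

# Keep only the values that look like item numbers.
accepted = [value for value in captured if is_item_number(value)]
print(accepted)  # ['AB1', 'AB123', 'AB12C']
```

The payment terms text is rejected because it does not start with two letters followed by a digit, and AB1234 is rejected because a fourth digit is not allowed by the pattern.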
Once we put the rule in place and clicked the Recognise Fields button on the action bar, the invoice was correctly read and the payment terms were no longer read as items.
I am not a regex expert, and while the above does work, I’d be happy to hear any suggestions to make it more efficient.