PDF OCR - reading content of a PDF - VisualCron - Forum

Community forum

chris.hill@adsigroup.co.uk
2019-02-22T16:08:24Z
Hi

Has anyone had any luck using VC, perhaps with a third-party bit of software, to read (OCR) a PDF file?

I need to extract details from the PDF and rename the file based on those values. We currently use a package that sits outside VisualCron, but for the benefit of the overall process it would be ideal to handle one file at a time within my VC loop.

Thanks
Gary_W
2019-02-22T16:25:13Z
Does the tool you currently use have a command line interface? If so you could do it.
chris.hill@adsigroup.co.uk
2019-02-22T16:46:49Z
It does, but we seem to have some license issues as it's installed on another server. It's not the greatest software anyway, so if we can get a smaller utility that can be used by VC, or some .NET code within VC, that would be cool.
Support
2019-02-22T19:28:05Z
Did you test the PDF Tasks? How far did you get?
Henrik
Support
http://www.visualcron.com 
Please like  VisualCron on facebook!
thomas
2019-02-25T08:20:19Z
We use iTextSharp in .NET code to extract text from PDFs. I believe iText7 is the new version of this library, but I haven't tested it yet.
https://www.nuget.org/packages/iTextSharp/ 

We have the code in a .NET assembly and call it from VC.
chris.hill@adsigroup.co.uk
2019-02-25T08:34:56Z
Originally Posted by: Support 

Did you test the PDF Tasks? How far did you get?



Hi - which of the PDF tasks would actually do OCR - I couldn't see which one if any could do that?
chris.hill@adsigroup.co.uk
2019-02-25T08:36:47Z
Originally Posted by: thomas 

We use iTextSharp in .NET code to extract text from PDFs. I believe iText7 is the new version of this library, but I haven't tested it yet.
https://www.nuget.org/packages/iTextSharp/ 

We have the code in a .NET assembly and call it from VC.



Thanks for that - does it let you target specific areas of the PDF and look for placeholders and the values that follow them? For example, we want to look in the top right of a page, say 400 x 400 pixels, for a string like "Account:" and then read the value after it on the same line.
thomas
2019-02-25T09:20:21Z
To be honest I am not sure. The trouble with PDF is that it depends a lot on how the file was created. We receive PDFs that were created from Excel, i.e. the users have a spreadsheet and convert it to PDF. The lines and columns will not always be aligned the same way every time (if the user changes the Excel sheet). It basically sucks to extract anything from PDF. So I tend to convert to a text file, and use C# and regex to extract what I need.

If it is a well-made PDF, it could work with iTextSharp. Here is an example where they retrieve the fields (columns) from a PDF. You should be able to use the column position to get the value in the next line.

https://stackoverflow.co...rm-data-using-itextsharp 
chris.hill@adsigroup.co.uk
2019-02-25T10:39:56Z
Originally Posted by: thomas 

So I tend to convert to a text file, and use C# and regex to extract what I need.



Ah I hadn't thought of that - could be a great route to try. Will give it a go and see.

Do you have any examples of your RegEx that you could share?

Many Thanks

Support
2019-02-25T11:01:10Z
Originally Posted by: chris.hill@adsigroup.co.uk 

Originally Posted by: Support 

Did you test the PDF Tasks? How far did you get?



Hi - which of the PDF tasks would actually do OCR - I couldn't see which one if any could do that?



We do not use OCR, but the PDF Tasks can extract text or images. Do you have an example document showing what you are trying to do?
Henrik
thomas
2019-02-25T11:34:05Z
Here is some C# code to read PDF content into a list. It's very simple, as you can see. As far as regex is concerned, I am not an expert. There are many on this site who can help you if you give them some concrete examples.

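using System.Collections.Generic;
using System.Text.RegularExpressions;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Reads every page of the PDF and returns its text as a list of lines.
// SourceFile is assumed to be a string member of the enclosing class
// that holds the path to the PDF.
private List<string> ReadPdfFile()
{
    var result = new List<string>();
    using (var reader = new PdfReader(SourceFile))
    {
        // iTextSharp page numbers start at 1.
        for (var page = 1; page <= reader.NumberOfPages; page++)
        {
            // Extract the page text, then split it into individual lines.
            var pageText = PdfTextExtractor.GetTextFromPage(reader, page);
            var lines = Regex.Split(pageText, "\n");
            result.AddRange(lines);
        }
    }

    return result;
}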
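
For the "Account:" case mentioned earlier in the thread, a regex over a list of lines like the one ReadPdfFile returns could look something like the sketch below. This is only an illustration under assumptions: the class and method names (LabelValueExtractor, FindValue) are made up, and it assumes the value sits on the same line as the label and contains no spaces.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of iTextSharp or VisualCron.
public static class LabelValueExtractor
{
    // Returns the first whitespace-delimited token that follows the label
    // on the same line, or null when no line contains the label.
    public static string FindValue(IEnumerable<string> lines, string label)
    {
        // Regex.Escape keeps characters such as ':' or '.' in the label literal.
        var pattern = new Regex(
            Regex.Escape(label) + @"\s*(?<value>\S+)",
            RegexOptions.IgnoreCase);

        foreach (var line in lines)
        {
            var match = pattern.Match(line);
            if (match.Success)
            {
                return match.Groups["value"].Value;
            }
        }

        return null;
    }
}
```

Calling LabelValueExtractor.FindValue(lines, "Account:") on a line such as "Customer: ACME   Account: AB-12345  Page 1" would give "AB-12345". If the values can contain spaces, the \S+ part would need to be changed to suit the actual documents.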
chris.hill@adsigroup.co.uk
2019-02-26T14:05:56Z
Thanks all - getting good results just converting to txt file, and then extracting the values using some C# code
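
In case it helps anyone following the same route, the renaming step could be sketched roughly as below. This is not the exact code in use here: PdfRenamer, ToSafeFileName and RenameByValue are made-up names, and it assumes the extracted value should become the whole file name, with the original extension kept.

```csharp
using System;
using System.IO;

// Hypothetical helper for renaming a PDF after a value extracted from it.
public static class PdfRenamer
{
    // Replaces characters that are not allowed in file names with '_'.
    public static string ToSafeFileName(string value)
    {
        foreach (var c in Path.GetInvalidFileNameChars())
        {
            value = value.Replace(c, '_');
        }

        return value.Trim();
    }

    // Renames the file to "<value><original extension>" in the same folder
    // and returns the new full path.
    public static string RenameByValue(string sourcePath, string value)
    {
        var folder = Path.GetDirectoryName(sourcePath) ?? "";
        var target = Path.Combine(
            folder,
            ToSafeFileName(value) + Path.GetExtension(sourcePath));

        // File.Move throws an IOException if the target already exists.
        File.Move(sourcePath, target);
        return target;
    }
}
```

Inside a loop that handles one file at a time, each file would be passed to RenameByValue together with the value extracted from its text. Two PDFs that yield the same value would clash, so a real job would probably want a check or a suffix for that case.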
Support
2019-02-26T16:09:52Z
Originally Posted by: chris.hill@adsigroup.co.uk 

Thanks all - getting good results just converting to txt file, and then extracting the values using some C# code



So basically you want to get the raw text on one or more pages?
Henrik
thomas
2019-02-27T08:56:36Z
Good point. There is already a task for converting pdf to txt. I hadn't seen that.

Support
2020-05-22T14:47:53Z
In 9.2.0 we introduced the Scan document Task which can automate this. See video here: https://www.visualcron.com/tutorials.aspx 
Henrik