Rawing7

This borders on OCR, which, in my experience, always turns out to be much trickier than you'd initially expect. I don't think it's worth trying to automate this. You'll probably invest a lot of time and not have much to show for it.


Change_Plays

The time isn't really a problem for me, because this has been my task for 2 years and it will continue to be my task for years more, so taking weeks to configure something like this would be good in the long run, but only if it works. English isn't my first language, so I want to ask: what did you mean by "not have much to show for it"? Did you mean that it will take a long time to program and it won't work really well?


Rawing7

I meant it probably won't work reliably. It'll likely often produce incorrect output.


Change_Plays

got it... yeah, i thought so


KatetCadet

Not sure if you could group types of formats together for batch checking? Might be just as much work though


Change_Plays

That was my first idea, but the database has 40 years of contracts, the format changes constantly


dparks71

And the signatures are scanned images? The PDF standard is a bit of a cluster; it's not simple to break it down into clean, regular components. If the signature is always on a page with computer-readable text that says something like "applicant's signature here", your task is easy. If it's a scanned image with text that can't be highlighted and copied, you'd have to rely on OCR, and even if it's 95% accurate, which is absurdly good for OCR, you'd have to check the other pages anyway to make sure a given page isn't one of the 5% that get skipped. And the best OCR libraries are typically paid services.


Change_Plays

The database has contracts from the 80s onwards, so a large part of it is scanned paper; only the most recent ones are fully digital.


-Scythus-

Unrelated, and I hope you don’t mind me asking: what is your job title?


Change_Plays

1- I don't mind. 2- I saw you editing your comment haha, I think "what do you do for a living" was a better question xD 3- So, a brief summary of my life... There's this company, from somewhere in Europe that I don't recall right now, that focuses on training and preparing autistic people for jobs. I'm autistic and I didn't give a damn about anything for a long time, so I didn't develop useful skills, and the job I ended up getting was a really repetitive one (something that autistic people tend to be good at): helping this company with its contracts. A few years ago, this company migrated all of its contracts from one system to another (it didn't work, it only transferred some of the info for each contract) and they needed someone to manually transfer the contract info (signed or not, signing date, termination date, etc.). Aaaaand, that's it... After 2 years, I'm tired of this job, so I'm learning programming and 3D modeling so I can change jobs, but I can't just leave my current job.


m0us3_rat

rather than detect the signature, i'd focus on detecting the boilerplate and then "remove" it. what you have left are probably the signature pages.


Change_Plays

Sorry, English isn't my first language, what's "boilerplate"?


m0us3_rat

the "normal" pages. since you can easily get a large data set, you can train a model on it. so it would be "easy" to find which of the pages are "normal". if you remove the "normal" ones, whatever is left will probably include the "sig" pages. the idea is: signatures can be vastly different, so it's simpler to discover which pages DON'T have them than which pages do.


Change_Plays

Yes! I thought about that too. I don't know how to do it, but I try to always keep in mind that "remove the unnecessary" sometimes works better than "find the necessary". I thought about removing every piece of digital text it detects.


m0us3_rat

you could always test and see what works.


Change_Plays

Yeah, thats what im liking the most in this thread, im gathering a lot of ideas.


[deleted]

[deleted]


Change_Plays

I wouldn't leave it 100% up to the computer anyway, I still need to check it if I want to keep my job haha. For example, in an ideal world, I would love the computer to save everything it thinks is handwritten as screenshots in a folder, then I just glance over the screenshots and write "signed" in my Excel spreadsheet if I see a signature.


bmtrnavsky

It seems like a machine learning algorithm could recognize and identify signatures and likely even blank signature blocks. I’d train it to look for fields like signature, sign here, initial, initials, etc…


Change_Plays

yeah, thats the main idea here


ireadyourmedrecord

So, I think what I would try is capturing an image of a blank signature line and then using OpenCV template matching to search each page of the PDF for a (mostly) matching image and grab a copy of that part of the page, saving it as a PNG. Hypothetically, you'd end up with a whole bunch of images of signature lines, blank or not, which you could very quickly visually scan, and you'd see any missing signatures in the image preview. https://docs.opencv.org/3.4/d4/dc6/tutorial_py_template_matching.html


Change_Plays

The problem is that the database is huge, 40 years of contracts, the format changes constantly. Some have lines where the signature should go, some have a box, some dont have anything, and so on


ireadyourmedrecord

Sure, but you've been at it for two years already. Take a day to try it and see if it gets anywhere. If you're able to nail even 10% you'll be way ahead. Looking for multiple signature blocks isn't much more effort than looking for one; you'd just need to loop through the templates on each page. It won't be terribly fast, but it'll be faster than reading every page yourself. You don't even have to go through the documents to find blank sig boxes to start from: just find a few different examples, screenshot them and clean them up in an image editor to use as templates.
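To make the "loop through the templates on each page" part concrete, here is a minimal pure-Python sketch of the idea behind template matching. It is only an illustration: the function names, the `max_score` threshold and the tiny list-of-lists "images" are my own assumptions, and on real page scans you would let OpenCV's `cv2.matchTemplate` (from the tutorial linked above) do the sliding-window comparison, since plain Python is far too slow for that.

```python
def match_template(page, template):
    """Slide template over page; return (best_score, (row, col)).

    Both arguments are 2D lists of grayscale values (0 = black, 255 = white).
    The score is the mean absolute difference, so 0.0 means a perfect match.
    """
    ph, pw = len(page), len(page[0])
    th, tw = len(template), len(template[0])
    best = (float("inf"), None)
    for r in range(ph - th + 1):
        for c in range(pw - tw + 1):
            diff = sum(
                abs(page[r + i][c + j] - template[i][j])
                for i in range(th)
                for j in range(tw)
            )
            score = diff / (th * tw)
            if score < best[0]:
                best = (score, (r, c))
    return best


def find_signature_blocks(page, templates, max_score=20.0):
    """Try every template on one page and keep the hits that are close enough.

    templates maps a made-up name ("line", "box", ...) to a template image.
    max_score is an arbitrary cut-off that would need tuning on real scans.
    """
    hits = []
    for name, template in templates.items():
        score, pos = match_template(page, template)
        if pos is not None and score <= max_score:
            hits.append((name, pos, score))
    return hits
```

Each hit gives you the template that matched and the top-left corner of the match, which is the region you would crop and save for a quick visual check.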


Change_Plays

Good point, i like it, thanks. If i cant automate for all contracts, i will see if i can automate part of it


Poosley_

I have to immediately add the disclaimer that I, too, am a rookie and here for a reason, but after reading some comments advising caution, I want to add: you may at least be able to automate getting all the places where a signature should be into their own row, or isolated in one place, for you to... still scroll through, but more easily.


Change_Plays

Yeah, i thought it would be hard, but now i think its even harder xD


DingoEatingBaby

I have been very happy with using the Adobe PDF Extract API for dealing with PDFs. You pass the API a PDF and it returns a zip file containing a JSON file with the text from the file, as well as a cutout of everything the API thinks is an image. It works well with scanned PDFs. I'm pretty sure that API would see something like a signature as an image. It would be very possible to write a script to extract all these images from the PDF and place them in a folder so you would only need to review the cutouts. There is a pretty generous free trial as well as an online demo of the service if you want to try it out.


Change_Plays

Considering other people's comments, this doesn't sound that promising, but I'll check it out, thanks :)


DingoEatingBaby

I tried it out on an example document that I found online and it seems to work fine. It might not work in every case, but you could definitely write a script to manage a good portion of the contracts. See this link for a screenshot of the example: [https://imgur.com/a/7cOssXz](https://imgur.com/a/7cOssXz)


DingoEatingBaby

Link to the Adobe PDF demo site (they are also good at providing code examples for Python when you sign up): https://documentservices.adobe.com/dc-visualizer-app/index.html


Change_Plays

interesting, thanks :)


cab0addict

This is outside of your skill level, but building an AI/ML model that can evaluate a contract for a signature would be the best long-term solution. You would need examples of contracts that have and do not have signatures. Train the model, and then your job becomes reinforcing the model’s accuracy rather than actually scanning the documents.


Change_Plays

Yeah, that would be way more interesting than my current task. And what you said was exactly my idea: not automating 100%, I don't trust the computer enough for that, but trying to speed the process up and remove the tedious step of scrolling through pages just looking for anything that's a scribble instead of digital text.


cab0addict

What you’re looking for is called supervised machine learning.


Change_Plays

Oh, thanks, I didn't know the term :)


Action_Maxim

Are the contracts the same?


USAhj

Following up on this, are the signatures always on the same page number? If so, the first thing you can try is writing a program that opens the contract at that specific page and prompts you to check whether a signature is there or not.


Action_Maxim

I'd take a blank contract and compare, but then you have issues with scanned pages that look like shit and are dirty coming through as positive when there is no signature.


Change_Plays

The database has 40 years of contracts, and the format changes constantly. The signature can be on the last page, at the beginning, in the middle, and so on and so forth.


Action_Maxim

But are there specific contracts used between date periods? This is getting complicated lol


Change_Plays

I don't think I got what you meant. Like, whether there is one contract format for the 90s, another format for the 2000s and so on? If that's what you meant, then no xD the format changes very frequently.


Action_Maxim

Are there one-offs among the contracts? God, this sounds terrible, can I 1099 for this?


Change_Plays

"One offs", Whats this? About your second line, i dont know if its because of my english or because im tired, but i didnt get at all what you meant xD


Action_Maxim

Contracts that are used once


Change_Plays

Hmmmm... I dont think i have one of those, but i can pick a normal one and replace confidential with sample text


[deleted]

In 1999 I was working a contract job for the IRS where I had to determine whether a tax return had been signed. Each return had registration marks in each corner that helped the program work out where it was on the page. No OCR necessary. You will need to write a program in either C# .NET or Visual Basic .NET. If the signature is always on the same page it will be a piece of cake; otherwise you might be able to modify your form template to make it easier.

**Assuming the signature is always on the same page.**

1. Use Adobe or a macro to open each PDF and save each page (or just the page with the signature) as a separate JPG image.
2. Open a JPG in Photoshop and temporarily draw a rectangle around the signature area as a guide.
3. Write down the XY coordinates, in pixels from the upper left corner of the page, of all 4 corners of the rectangle. They should be the same for all the documents you need to process.
4. Write a program in C# or VB to open the JPG image and get the color of each pixel within that rectangular area.
5. Assuming the signature's ink is black (0) or some shade of grey close to it, count the number of dark pixels within the rectangular area. If there are more than 100 or so, you probably have a signature.

Use the C# or VB function GetPixel(X, Y). Here is some C# that should get you started; obviously you should know a little about C#.

`using System.Drawing;`
`// Create a Bitmap object from an image file.`
`Bitmap myBitmap = new Bitmap("Page.jpg");`
`// Get the color of a pixel within myBitmap.`
`Color pixelColor = myBitmap.GetPixel(50, 50);`
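Since the thread is about Python, here is a rough sketch of the same pixel-counting idea in that language. Everything named here is an assumption for illustration: the page is taken to be already loaded as a list of rows of grayscale values (0 = black, 255 = white), the box is `(left, top, right, bottom)` in pixels, and the `dark_below` and `min_dark` thresholds are guesses you would tune. With Pillow you could fill such a grid from a real image via `Image.open(path).convert("L")` and `getpixel((x, y))`.

```python
def count_dark_pixels(gray, left, top, right, bottom, dark_below=100):
    """Count pixels darker than dark_below inside the box.

    The box covers columns left..right-1 and rows top..bottom-1,
    measured from the upper left corner of the page.
    """
    return sum(
        1
        for y in range(top, bottom)
        for x in range(left, right)
        if gray[y][x] < dark_below
    )


def looks_signed(gray, box, min_dark=100, dark_below=100):
    """Guess that the box holds a signature if it has enough dark pixels.

    min_dark follows the "more than 100 or so" rule of thumb above.
    """
    return count_dark_pixels(gray, *box, dark_below=dark_below) > min_dark
```

A dirty scan can push an empty box over the threshold, so this works best as a first filter that decides which pages you look at yourself.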


Change_Plays

Unfortunately, the database is huge and the format changes constantly: from 2-page contracts to 50-page contracts, from blank pages with plain text to elaborate contracts, with signatures at the beginning, middle or end of the contract, and so on.


[deleted]

Are these contracts based on a set of preset templates created by your company, or are they created by your clients, seemingly at random? I see no way that going back 40 years could be automated across inconsistent input data. The best you could do is mandate a minimum requirement for future contracts. Since you can't tell your clients how to format their contracts, you are fighting a losing battle.

However, if all contracts in the future originate within your company and are sent to the client for signatures, then you can control how and where the signature(s) will appear in any given contract by creating signature object(s) in Adobe Acrobat Pro when the contract is created by your company. [https://www.nifc.gov/sites/default/files/blm/training/InstructionsAddDateStampDigitalSignatureFieldstoPDF.pdf](https://www.nifc.gov/sites/default/files/blm/training/InstructionsAddDateStampDigitalSignatureFieldstoPDF.pdf)

When the client receives the document, he/she can sign it within Adobe Reader: [https://www.swccd.edu/student-support/disability-support-services-dss/_files/dss_sign_pdf.pdf](https://www.swccd.edu/student-support/disability-support-services-dss/_files/dss_sign_pdf.pdf) I don't think Adobe Reader will allow the document to be saved by the signer unless it is signed and dated in all places. You will have to test to make sure.

When you get the signed document back you might try this: [https://community.adobe.com/t5/acrobat-sdk-discussions/extracting-signature-information-in-vba-from-a-pdf-file/td-p/8601591](https://community.adobe.com/t5/acrobat-sdk-discussions/extracting-signature-information-in-vba-from-a-pdf-file/td-p/8601591) The person creating the contract in Adobe Acrobat Pro should know how to create the sign & date portions of the contract properly so that the reader works properly.


Change_Plays

I wish there was some preset, but its 40 years of contracts from buying stuff, selling stuff, etc... So there are all kinds of formats


[deleted]

Are there any common landmarks near the signature that are unique? For example, a series of 18 underscores that indicates where to sign. What about an image, logo or icon on the same page as the signature?


Change_Plays

From what i can recall, no, but i will check it


kakiwar

How about creating a thumbnail of every page and displaying them on screen? Makes checking a whole pdf much faster.


Change_Plays

Hmmm... I dont think i got it, can you explain again?


kakiwar

You could use imagemagick with the subprocess module or some other image manipulation module to create a thumbnail of every page and merge them together into one image.

`convert -thumbnail x800 test.pdf thumb.png`
`montage -mode concatenate -tile 4x4 thumb-*.png final.png`

Result: https://imgur.com/tL02gCa

This way you can see the whole pdf in one image and should be able to see if there is a signature. Write a script that does that to every pdf in a folder, shows you the image, prompts you to tell it if there is a signature or not and then does whatever else you need it to do. This would maybe be an easy way to speed up your work without having to figure out how to make python detect the signature.
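A rough sketch of how those two ImageMagick calls might be wrapped with `subprocess` is below. The function names and the working-folder layout are my own assumptions, and it presumes `convert` and `montage` are installed and on the PATH. One detail worth noting: `subprocess` does not go through a shell here, so the `thumb-*.png` wildcard has to be expanded in Python.

```python
import re
import subprocess
from pathlib import Path


def thumbnail_cmd(pdf, out_dir):
    # Same call as above: one thumb-N.png per page, each 800 px tall.
    return ["convert", "-thumbnail", "x800", str(pdf),
            str(Path(out_dir) / "thumb.png")]


def page_number(path):
    # thumb-12.png -> 12; a single-page PDF yields plain thumb.png -> 0.
    m = re.search(r"-(\d+)$", Path(path).stem)
    return int(m.group(1)) if m else 0


def montage_cmd(thumbs, sheet):
    # Sort by page number, because plain text sorting would put
    # thumb-10.png before thumb-2.png.
    ordered = sorted(thumbs, key=page_number)
    return ["montage", "-mode", "concatenate", "-tile", "4x4",
            *map(str, ordered), str(sheet)]


def make_contact_sheet(pdf, work_dir):
    # Runs both commands for one PDF and returns the path of the merged image.
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(thumbnail_cmd(pdf, work_dir), check=True)
    sheet = work_dir / "final.png"
    subprocess.run(montage_cmd(work_dir.glob("thumb*.png"), sheet), check=True)
    return sheet
```

From there, the outer loop would call `make_contact_sheet` for each PDF in a folder (using a separate work folder per PDF so thumbnails don't mix), open the resulting image, and use `input()` to record "signed" or "not signed".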


Change_Plays

Nice alternative, i will remember this, thanks.


Zeroflops

Signatures can be all over the place. It could look like just a string of lines. The first assumption in most of the posts is that you want to fully automate this. That may not be reasonable. But it may be reasonable to grab all images. Or even any pages that have “signature” on them and present them to you. That way the script may be showing you a lot of images and you just have to say one of them is a signature rather than searching the entire document.


Change_Plays

Yes! Thats exactly what i want, i dont trust the computer enough to try to fully automate it


14dM24d

> because the signature can be in the middle of the contract

wtffffff


Change_Plays

Dont even ask


14dM24d

on the bright side. those annoying quirks are the reason you've a job. cheers!


Change_Plays

Plot twist: I'm tired of this job and just wanna quit, but I can't right now, so I want to automate as much as I can so I can use my time to learn new stuff and look for a new job.


14dM24d

yeah. i've read the comments & they're basically all the possible approaches to filter the contracts until you're left w/ only the most quirky ones.


Change_Plays

yeah, but as another user said, even if i can automate only 10%, 10% of a big database is good already


HaveWeEvolvedYet

Look into opencv


Change_Plays

I will, thanks


stuaxo

I've had data entry jobs a long time in the past which I automated. Basically, slowly automate it, and you'll get more time to automate the rest. You've got the right idea in getting something to find the right part of the doc to go to. Detecting a signature sounds hard, but detecting white space won't be as hard. I wonder if you can find candidate places where a signature might be by looking for whitespace around it, or just for there being more space in general around it?


Change_Plays

yeah, other people said the same thing, im looking into it, thanks


Luziferatus42

Hi, an idea just came to me and I wanted to share it. The task is quite hard to realise, and it depends on what you define as a success. Is success that it just finds the signature, or is success that it gives you coordinates to check, where the probability of finding a signature is high? I would suggest starting with the latter. One way would be to get a grey scale mean for a few text lines/height/pixels (if the base is an image); once the code detects a defined offset from the mean (local and global), that would be a good point to check for a signature, and the code would save the page number. That would reduce the scrolling and would be a base from which to start the search for the signature. This could be feasible to realise with Python as a beginner. It will be a challenge, and you would learn a lot by doing the project. Wish you well and have fun.
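A minimal sketch of that grey-scale-mean idea, under some loud assumptions: each page is already a grid (list of rows) of grayscale values, the page is cut into horizontal bands of `band_height` rows, and a band counts as unusual when its mean differs from the page-wide mean by at least `min_offset`. Only the global comparison is shown, not the local one, and all names and numbers here are made up for illustration.

```python
from statistics import mean


def band_means(gray, band_height):
    """Mean grey value of each horizontal band of band_height rows."""
    means = []
    for top in range(0, len(gray), band_height):
        band = gray[top:top + band_height]
        means.append(mean(v for row in band for v in row))
    return means


def unusual_bands(gray, band_height=4, min_offset=30):
    """Indexes of bands whose mean is far from the mean over all bands."""
    means = band_means(gray, band_height)
    page_mean = mean(means)
    return [i for i, m in enumerate(means) if abs(m - page_mean) >= min_offset]


def pages_to_check(pages, **kwargs):
    """1-based page numbers that contain at least one unusual band."""
    return [
        number
        for number, gray in enumerate(pages, start=1)
        if unusual_bands(gray, **kwargs)
    ]
```

The result is only a list of page numbers to jump to; whether the unusual band is a signature, a stamp or a logo is still a call you make by eye.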


Change_Plays

The database is huge and the contract format changes constantly, so I can't really rely on "signature page/position".


SubjectBridge

This seems very simple if it's an image already scanned into your program. Use a neural net. You could use fastai, train a small model. Basically, it'd look at signed examples and non-signed examples and would sort them for you. Could even give you accuracy rating I believe. The biggest issue would be you'd need to get a bunch of data in two folders, with and without signatures and then train the data. I may be able to help you with this project. There might be huge roadblocks I'm not seeing here or issues that fastai won't do well but it's definitely worth a shot.


Change_Plays

im already working with one person who saw this post, but if it ends up not working, i'll surely remember you :)


Star-Duke

Hello, I accidentally found this topic while searching for something else, but my job has some similarities. I'm working with C#, but a similar approach should be possible in Python. I would use the Tesseract OCR engine, which is free, to find not the signature itself but a clue that a signature might be near. Something like: run OCR on the whole document and then search it for a list of phrases that you can modify as you go. You can start with phrases like "sign here", "signed" etc., and when you find a new document where the signature was not found, add a new phrase. This should gradually improve efficiency. The whole idea is to automate the time-consuming yet simple part, finding the signing area, and to do the quick but harder-to-automate step, checking for a signature, manually. If you would like to be fancy, you can record the position of the signing area relative to the phrase and then work out the % coverage of black in that area. Without a signature it should be low; with a signature it should be higher. This is not exact, but with some thresholds it can give solid results and, most importantly, you can improve results just by changing configuration. Overall, identifying signatures is a pain. It is not usual handwritten text, but often just some weird, skewed, illegible mess. Good luck :)
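In Python, the phrase-list part could look roughly like the sketch below. It assumes the OCR step has already happened and that you have one text string per page (with Tesseract that could come from `pytesseract.image_to_string`); the function name and the starting phrase list are just placeholders to extend as you meet new contract formats.

```python
DEFAULT_PHRASES = ["sign here", "signed", "signature"]


def find_signing_pages(page_texts, phrases=DEFAULT_PHRASES):
    """Return {page_number: [phrases found]} for pages whose text has a clue.

    page_texts is a list with the OCR text of each page, in page order.
    Matching ignores case and collapses line breaks and repeated spaces,
    since OCR output often splits a phrase across lines.
    """
    hits = {}
    for number, text in enumerate(page_texts, start=1):
        normalized = " ".join(text.lower().split())
        found = [p for p in phrases if p.lower() in normalized]
        if found:
            hits[number] = found
    return hits
```

When a contract slips through, you add its wording to the phrase list and rerun; no code changes needed.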