In my project, I need to compare a large number of PDF files. I could not find a good free library that works out of the box for this. I did not want a text-only comparison; I was looking for something that can compare PDFs pixel by pixel to find all the differences. The libraries that can do this are not free.
So I have come up with a simple Java library (built on Apache PDFBox, which is licensed under the Apache License, Version 2.0) that can compare PDF documents in text or image mode and highlight the differences, extract images from PDF documents, save PDF pages as images, and so on.
Maven Dependency:
Include the below dependency in your POM file.
Download:
PDF compare utility with all the dependencies.
taguru-pdf-utility-v1.1.zip
Github:
The source code for this project is here.
Usage:
- To get page count
import com.testautomationguru.utility.PDFUtil;
PDFUtil pdfUtil = new PDFUtil();
pdfUtil.getPageCount("c:/sample.pdf"); //returns the page count
- To get page content as plain text
//returns the pdf content - all pages
pdfUtil.getText("c:/sample.pdf");
// returns the pdf content from page number 2
pdfUtil.getText("c:/sample.pdf",2);
// returns the pdf content from page number 5 to 8
pdfUtil.getText("c:/sample.pdf", 5, 8);
- To extract attached images from PDF
//set the path where we need to store the images
pdfUtil.setImageDestinationPath("c:/imgpath");
pdfUtil.extractImages("c:/sample.pdf");
// extracts and saves the images from page number 3
pdfUtil.extractImages("c:/sample.pdf", 3);
// extracts and saves the images from page 2 only (start page 2, end page 2)
pdfUtil.extractImages("c:/sample.pdf", 2, 2);
- To store PDF pages as images
//set the path where we need to store the images
pdfUtil.setImageDestinationPath("c:/imgpath");
pdfUtil.savePdfAsImage("c:/sample.pdf");
- To compare PDF files in text mode (faster, but it does not compare the formatting, images etc. in the PDF)
String file1="c:/files/doc1.pdf";
String file2="c:/files/doc2.pdf";
// compares the pdf documents and returns a boolean
// true if both files have same content. false otherwise.
pdfUtil.compare(file1, file2);
// compare the 3rd page alone
pdfUtil.compare(file1, file2, 3, 3);
// compare the pages from 1 to 5
pdfUtil.compare(file1, file2, 1, 5);
- To exclude certain text while comparing PDF files in text mode
String file1="c:/files/doc1.pdf";
String file2="c:/files/doc2.pdf";
//pass all the possible texts to be removed before comparing
pdfUtil.excludeText("1998", "testautomation");
//pass regex patterns to be removed before comparing
// \\d+ removes all the numbers in the pdf before comparing
pdfUtil.excludeText("\\d+");
// compares the pdf documents and returns a boolean
// true if both files have same content. false otherwise.
pdfUtil.compare(file1, file2);
// compare the 3rd page alone
pdfUtil.compare(file1, file2, 3, 3);
// compare the pages from 1 to 5
pdfUtil.compare(file1, file2, 1, 5);
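To make the idea behind excluding text more concrete, here is a minimal sketch in plain Java. It is not PDFUtil's actual implementation; the class `ExcludeTextSketch` and its methods are hypothetical names used only for illustration. It assumes the page text has already been extracted as strings, removes every match of the given patterns, and then compares what is left.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFUtil: illustrates stripping
// excluded patterns from already-extracted text before comparing it.
class ExcludeTextSketch {

    // Removes every match of each regex, then collapses runs of whitespace
    // so that the gaps left behind do not cause false differences.
    static String strip(String text, String... regexes) {
        String result = text;
        for (String regex : regexes) {
            result = Pattern.compile(regex).matcher(result).replaceAll("");
        }
        return result.replaceAll("\\s+", " ").trim();
    }

    // True if both texts are equal once the excluded patterns are removed.
    static boolean sameAfterExclusions(String first, String second, String... regexes) {
        return strip(first, regexes).equals(strip(second, regexes));
    }

    public static void main(String[] args) {
        String page1 = "Invoice 1998 generated by testautomation";
        String page2 = "Invoice 2024 generated by testautomation";
        // The year differs, so a plain comparison reports a difference.
        System.out.println(sameAfterExclusions(page1, page2));
        // Excluding all digits makes the two texts equal.
        System.out.println(sameAfterExclusions(page1, page2, "\\d+"));
    }
}
```

Because every argument is treated as a regular expression in this sketch, a literal such as "1998" and a pattern such as "\\d+" go through the same code path.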
- To compare PDF files in visual mode (slower; compares PDF documents pixel by pixel, highlights the differences and stores the result as an image)
String file1="c:/files/doc1.pdf";
String file2="c:/files/doc2.pdf";
// compares the pdf documents and returns a boolean
// true if both files have same content. false otherwise.
// Default is CompareMode.TEXT_MODE
pdfUtil.setCompareMode(CompareMode.VISUAL_MODE);
pdfUtil.compare(file1, file2);
// compare the 3rd page alone
pdfUtil.compare(file1, file2, 3, 3);
// compare the pages from 1 to 5
pdfUtil.compare(file1, file2, 1, 5);
//if you need to store the result
pdfUtil.highlightPdfDifference(true);
pdfUtil.setImageDestinationPath("c:/imgpath");
pdfUtil.compare(file1, file2);
For example, I have two PDF documents with exactly the same content except for some differences in the charts.
For these documents, PDFUtil produces a result image with the differences highlighted (in magenta by default; the color can be changed).
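A pixel-by-pixel comparison with highlighting can be sketched roughly as follows. This is only an illustration using the JDK's `BufferedImage`, not the library's actual code; `PixelDiffSketch` and `highlightDifferences` are hypothetical names, and the sketch assumes both pages have already been rendered to images of the same size.

```java
import java.awt.Color;
import java.awt.image.BufferedImage;

// Hypothetical sketch, not PDFUtil's actual implementation.
class PixelDiffSketch {

    // Paints every pixel of 'actual' that differs from 'expected' in the
    // highlight color and returns how many pixels differed.
    static int highlightDifferences(BufferedImage expected, BufferedImage actual, Color highlight) {
        if (expected.getWidth() != actual.getWidth()
                || expected.getHeight() != actual.getHeight()) {
            throw new IllegalArgumentException("Page images must have the same size");
        }
        int differing = 0;
        for (int y = 0; y < expected.getHeight(); y++) {
            for (int x = 0; x < expected.getWidth(); x++) {
                if (expected.getRGB(x, y) != actual.getRGB(x, y)) {
                    actual.setRGB(x, y, highlight.getRGB());
                    differing++;
                }
            }
        }
        return differing;
    }

    public static void main(String[] args) {
        // Two small stand-ins for rendered pages (both start out all black).
        BufferedImage page1 = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        BufferedImage page2 = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        page2.setRGB(1, 2, Color.WHITE.getRGB()); // simulate one changed pixel
        int differing = highlightDifferences(page1, page2, Color.MAGENTA);
        System.out.println("Differing pixels: " + differing);
    }
}
```

In this sketch the pages count as equal when the returned count is zero; the modified second image can then be written to disk as the result image.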
Features to be added soon:
- While comparing PDFs in VISUAL_MODE, ignore certain areas.
- While comparing PDFs in VISUAL_MODE, return true / false based on a configurable threshold / sensitivity.
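Since these two features are not in the library yet, the following is only a guess at how they might work, written as a plain Java sketch. The class `TolerantCompareSketch`, the `ignored` rectangle and the `maxRatio` parameter are all assumptions made for illustration, not a planned API.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

// Hypothetical sketch of the planned features; none of these names
// exist in PDFUtil.
class TolerantCompareSketch {

    // Fraction of compared pixels that differ, skipping pixels inside
    // the 'ignored' rectangle (pass null to compare the whole image).
    static double differenceRatio(BufferedImage first, BufferedImage second, Rectangle ignored) {
        if (first.getWidth() != second.getWidth()
                || first.getHeight() != second.getHeight()) {
            throw new IllegalArgumentException("Page images must have the same size");
        }
        long compared = 0;
        long different = 0;
        for (int y = 0; y < first.getHeight(); y++) {
            for (int x = 0; x < first.getWidth(); x++) {
                if (ignored != null && ignored.contains(x, y)) {
                    continue; // pixel lies in the ignored area
                }
                compared++;
                if (first.getRGB(x, y) != second.getRGB(x, y)) {
                    different++;
                }
            }
        }
        return compared == 0 ? 0.0 : (double) different / compared;
    }

    // True if the share of differing pixels stays within the threshold.
    static boolean matches(BufferedImage first, BufferedImage second,
                           Rectangle ignored, double maxRatio) {
        return differenceRatio(first, second, ignored) <= maxRatio;
    }

    public static void main(String[] args) {
        BufferedImage page1 = new BufferedImage(10, 10, BufferedImage.TYPE_INT_RGB);
        BufferedImage page2 = new BufferedImage(10, 10, BufferedImage.TYPE_INT_RGB);
        page2.setRGB(0, 0, 0xFFFFFF); // one differing pixel in the top-left corner
        System.out.println(matches(page1, page2, null, 0.0));
        System.out.println(matches(page1, page2, null, 0.05));
        System.out.println(matches(page1, page2, new Rectangle(0, 0, 2, 2), 0.0));
    }
}
```

A ratio-based threshold is only one possible notion of sensitivity; a per-pixel color tolerance would be another reasonable choice.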