Read a PDF File in Java

PDF (Portable Document Format) is a popular file format for sharing documents. We will use Apache PDFBox to read text from a PDF and to save its pages as images in Java.

Apache PDFBox is an open-source Java library that provides a wide range of features for working with PDF documents. To get started, we need to add PDFBox as a dependency in our Java project.

In the pom.xml file, add the dependency:

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.32</version>
</dependency>

Read text from a PDF file:

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

public class PdfTextExtractor {
    public static void main(String[] args) {
        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = PDDocument.load(new File("example.pdf"))) {
            PDFTextStripper textStripper = new PDFTextStripper();
            String text = textStripper.getText(document);
            System.out.println(text);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

In this example, we load a PDF document, create a PDFTextStripper object, and use it to extract text from the PDF. The extracted text is then printed to the console.

We can also read the text page by page, or select a range of pages by setting the start and end page (page numbers are 1-based and the end page is inclusive):

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

public class PdfPageTextReader {
    public static void main(String[] args) {
        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = PDDocument.load(new File("example.pdf"))) {
            PDFTextStripper textStripper = new PDFTextStripper();

            for (int page = 1; page <= document.getNumberOfPages(); page++) {
                // Restrict the stripper to a single page
                textStripper.setStartPage(page);
                textStripper.setEndPage(page);
                String text = textStripper.getText(document);

                // Process or display text from the current page
                System.out.println("Page " + page + ":\n" + text);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
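
The loop above reads one page at a time. To read a range of pages in a single call, a minimal sketch could look like the following. It assumes PDFBox 2.0.x and a file named example.pdf with at least three pages; `extractRange` is a hypothetical helper written for this post, not part of PDFBox.

```java
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

public class PdfPageRangeReader {

    // Hypothetical helper: returns the text of pages firstPage..lastPage
    // (1-based, inclusive). lastPage is capped at the document's page count.
    public static String extractRange(PDDocument document, int firstPage, int lastPage)
            throws IOException {
        int pageCount = document.getNumberOfPages();
        if (firstPage < 1 || lastPage < firstPage || firstPage > pageCount) {
            throw new IllegalArgumentException(
                    "Invalid page range " + firstPage + "-" + lastPage
                            + " for a document with " + pageCount + " pages");
        }
        PDFTextStripper textStripper = new PDFTextStripper();
        textStripper.setStartPage(firstPage);
        textStripper.setEndPage(Math.min(lastPage, pageCount));
        return textStripper.getText(document);
    }

    public static void main(String[] args) throws IOException {
        // Assumed input file; the range 2-3 is only an illustration
        try (PDDocument document = PDDocument.load(new File("example.pdf"))) {
            System.out.println(extractRange(document, 2, 3));
        }
    }
}
```

Validating the range up front gives a clear error for a request such as pages 5 to 6 of a three-page file, instead of silently returning an empty string.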

Render PDF pages as images:

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

public class PdfImageExtractor {
    public static void main(String[] args) {
        // try-with-resources closes the document even if rendering fails
        try (PDDocument document = PDDocument.load(new File("example.pdf"))) {
            PDFRenderer pdfRenderer = new PDFRenderer(document);
            // PDFRenderer page indexes are 0-based
            for (int page = 0; page < document.getNumberOfPages(); ++page) {
                BufferedImage image = pdfRenderer.renderImageWithDPI(page, 300);
                ImageIO.write(image, "PNG", new File("page-" + (page + 1) + ".png"));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

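In this code, we load a PDF document, create a PDFRenderer object, and then iterate through the pages of the PDF. For each page, we render it as an image at 300 DPI and save it as a PNG file. Note that this rasterizes each whole page, text included, rather than extracting the individual images embedded in the PDF.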
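
For the embedded images themselves, one possible approach is to walk each page's resources and collect the image XObjects. The sketch below assumes PDFBox 2.0.x and the same example.pdf; `imagesOnPage` is a hypothetical helper, and the output file names are only an illustration. It looks at top-level page resources only, so images nested inside form XObjects and inline images are not picked up.

```java
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class PdfEmbeddedImageExtractor {

    // Hypothetical helper: returns the images referenced directly by a page's resources.
    public static List<BufferedImage> imagesOnPage(PDPage page) throws IOException {
        List<BufferedImage> images = new ArrayList<>();
        PDResources resources = page.getResources();
        if (resources == null) {
            return images; // a page without resources has no images
        }
        for (COSName name : resources.getXObjectNames()) {
            PDXObject xObject = resources.getXObject(name);
            if (xObject instanceof PDImageXObject) {
                images.add(((PDImageXObject) xObject).getImage());
            }
        }
        return images;
    }

    public static void main(String[] args) throws IOException {
        try (PDDocument document = PDDocument.load(new File("example.pdf"))) {
            int pageNumber = 1;
            for (PDPage page : document.getPages()) {
                int imageNumber = 1;
                for (BufferedImage image : imagesOnPage(page)) {
                    // Assumed naming scheme: page-<page>-image-<index>.png
                    File out = new File("page-" + pageNumber + "-image-" + imageNumber + ".png");
                    ImageIO.write(image, "PNG", out);
                    imageNumber++;
                }
                pageNumber++;
            }
        }
    }
}
```

Unlike page rendering, this keeps each image at its stored pixel size, independent of any DPI setting.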
