I am reading text from PDF documents using the iText library. However, some PDF documents might have an image embedded within them in addition to text. I'm wondering whether there is any way, through iText or something else, to determine if a PDF document contains an image?
asked Jun 20, 2013 at 20:58
Look here: stackoverflow.com/questions/7007917/… Use the same basic steps to see if one exists.
Commented Jun 20, 2013 at 21:05
If you don't want to switch to PDFBox as suggested by @Phil's reference, you can use the iText classes from the parser package for bitmap image extraction, too.
Commented Jun 20, 2013 at 22:10
I came across this link; however, I need to find out whether an image even exists in the PDF. itextpdf.com/examples/iia.php?id=284
Commented Jun 20, 2013 at 22:17
In that case, simply create your own image render listener. If it is only to check for the existence of an image, it'll be much simpler than the one used in that sample.
Commented Jun 20, 2013 at 22:35
You can do a correct and 100% reliable check using a PDF library.
However, you can probably do a fairly reliable check just by reading the PDF as text and processing it that way. You first need to check that it is a PDF by looking for the PDF header at the start, which begins with %PDF-.
Then scan through looking for the phrase,
/XObject
When you hit this tag, you need to check backwards and forwards in the data to the << and >> dictionary boundaries to pull out the full XObject dictionary. There may be nested << >> pairs, so you might want to check back to the 'obj' keyword and forwards to the 'stream' keyword instead. Anyhow, you'll end up with the text of that object's dictionary.
The thing you need to check here is that there is a /Subtype entry followed by /Image, separated by optional whitespace. If you hit that, then you have an image.
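The scan described above can be sketched roughly as follows, using only the Java standard library. This is a minimal sketch rather than a tested production check: the class name `PdfImageHeuristic` and the method `probablyHasImage` are made up for illustration, the whole file is assumed to fit in memory, and the window around each /XObject hit is simply cut at the nearest 'obj' before it and the nearest 'stream' or 'endobj' after it.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper, not part of iText or any other library.
final class PdfImageHeuristic {
    // /Subtype followed by /Image; whitespace between the two names is optional.
    // The lookahead rejects longer names that merely start with "Image".
    private static final Pattern IMAGE_SUBTYPE =
            Pattern.compile("/Subtype\\s*/Image(?![A-Za-z0-9])");

    /** Returns true if the bytes look like a PDF with at least one image XObject dictionary. */
    static boolean probablyHasImage(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data cannot break decoding.
        String text = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        if (!text.startsWith("%PDF-")) {
            return false; // no PDF header at the start
        }
        int pos = text.indexOf("/XObject");
        while (pos >= 0) {
            // Look back to the nearest 'obj' and forward to the nearest 'stream' or 'endobj'.
            int start = Math.max(0, text.lastIndexOf("obj", pos));
            int end = text.length();
            int stream = text.indexOf("stream", pos);
            int endObj = text.indexOf("endobj", pos);
            if (stream >= 0) end = Math.min(end, stream);
            if (endObj >= 0) end = Math.min(end, endObj);
            if (IMAGE_SUBTYPE.matcher(text.substring(start, end)).find()) {
                return true;
            }
            pos = text.indexOf("/XObject", pos + 1);
        }
        return false;
    }
}
```

You would call it with the raw file contents, for example `PdfImageHeuristic.probablyHasImage(Files.readAllBytes(path))`. It shares the limits discussed below: it only sees dictionaries stored as plain text in the file, not anything inside compressed content streams.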
So what are the limits of this approach?
Well it is possible to embed an image in the document but not use it. That would result in a false positive. I think this is pretty unlikely though. It would be very inefficient to do so and only a really skanky producer would do it.
Images can be embedded inline in page content streams, as mentioned by Hugo above. That would result in a false negative. These are pretty uncommon though. It's one of those bits of the spec which was never a good idea and it's not widely used. If you have documents from a single producer (as is often the case), it will become apparent very quickly whether it does this or not. At a guess, I can't imagine that more than 1% of wild PDFs would contain this construct.
It is possible to embed these XObject tags as references rather than direct objects. But I think you can completely discount that. While legal, it would be absolutely bizarre. I don't think you'll ever see that.
The correct way involves scanning and parsing all the content streams in the PDF. It's what we do in ABCpdf (which I work on), but it is a lot more work and takes a lot more processing power. It could take many seconds on a large document.
Think about whether 99% reliability is going to be good enough. :-)