[personal profile] rhn_journal

Cropping the margins of a PDF sounds easy, but it is surprisingly difficult to achieve. There are plenty of programs that do this on Linux: the pdfcrop script is mentioned in various places, and there are also PDFEdit, PoDoFo, and assorted scripts scattered all over the place.

All the tools I checked work fine - as long as you don't print the document (to a printer, to PDF, to PS). For some reason, the cropped boundaries get lost in that operation.

Perhaps PDFEdit could do the job - after all, it advertises itself as an authoring tool, therefore it should do a deep reencoding. However, I didn't want to waste time manually adjusting the margins in millimeters.

Surely there must be an easy way to crop margins automatically. KDE's Okular can do it, so can I!

Finding the params

First off, if I can't find any automatic tool to give me nice margin values, let's find a semi-automatic one.

GhostScript comes to the rescue:

# gs -dBATCH -dSAFER -dNOPAUSE -dUseCropBox -sDEVICE=bbox doc.pdf
...
Page 2
%%BoundingBox: 110 130 506 723
%%HiResBoundingBox: 110.267997 130.139996 505.781985 722.177978
...

For each page, it outputs the bounding box of the content with the surrounding white space cropped away. Now I can pick some average page and use its values.
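Picking a page by eye can also be automated. Below is a minimal sketch, assuming the gs output has already been captured as text (as far as I know, the bbox device writes it to stderr rather than stdout). The names `parse_bboxes` and `union_box` are made up for this sketch, the page 1 values in the sample are illustrative, and taking the union of all pages is just one reasonable choice instead of an "average" page.

```python
import re

# Matches the integer "%%BoundingBox:" lines only; "%%HiResBoundingBox:"
# lines do not start with "%%BoundingBox" and are therefore skipped.
BBOX_RE = re.compile(
    r"^%%BoundingBox:\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s*$",
    re.MULTILINE,
)


def parse_bboxes(text):
    """Return one (x0, y0, x1, y1) tuple per bounding box line in the text."""
    return [tuple(int(v) for v in m.groups()) for m in BBOX_RE.finditer(text)]


def union_box(boxes):
    """Smallest box that contains the content of every page."""
    # Assumption: blank pages show up as "0 0 0 0" and should be ignored.
    boxes = [b for b in boxes if b != (0, 0, 0, 0)]
    if not boxes:
        raise ValueError("no non-empty bounding boxes found")
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


# Illustrative sample; page 2 is the output quoted above.
sample = """Page 1
%%BoundingBox: 108 127 506 723
%%HiResBoundingBox: 108.000000 127.000000 506.000000 723.000000
Page 2
%%BoundingBox: 110 130 506 723
%%HiResBoundingBox: 110.267997 130.139996 505.781985 722.177978
"""

print(union_box(parse_bboxes(sample)))
```

The union never cuts off content on any page, at the cost of leaving a little more margin on pages with less content.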

The mechanism

Turns out, PDF files have several properties called "boxes" that are used for cropping the margin for different purposes:

  • MediaBox
  • TrimBox
  • CropBox
  • BleedBox
  • ArtBox

Most cropping programs modify the MediaBox parameter. That operation doesn't modify any original content; it only sets the boundaries displayed when viewing the document in digital form. Therefore, it's easy for this value to get reset back to the original - and for the document to print incorrectly.
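As I understand the PDF specification, only MediaBox is required on a page: a missing CropBox falls back to MediaBox, and a missing BleedBox, TrimBox or ArtBox falls back to CropBox. Each box is four numbers, conventionally lower-left x, lower-left y, upper-right x, upper-right y. The sketch below models that lookup with a plain dict standing in for a page; `effective_box` and the dict layout are hypothetical and not any library's API.

```python
def effective_box(page, name):
    """Resolve a page box by name, applying the default fallbacks."""
    if name == "MediaBox":
        return page["MediaBox"]  # required, so there is no fallback
    if name == "CropBox":
        return page.get("CropBox", page["MediaBox"])
    if name in ("BleedBox", "TrimBox", "ArtBox"):
        return page.get(name, effective_box(page, "CropBox"))
    raise KeyError(name)


# An A4-sized page (595 x 842 points) where only CropBox was set.
page = {"MediaBox": (0, 0, 595, 842), "CropBox": (108, 127, 506, 723)}

for name in ("MediaBox", "CropBox", "BleedBox", "TrimBox", "ArtBox"):
    print(name, effective_box(page, name))
```

A tool that reads only MediaBox would still see the full page in this example, which may be one reason why setting a single box is not enough for every consumer.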

The problem might be in the way CUPS handles printing: it converts PDF documents into PS format. Try it yourself: open the document in a viewer of choice and choose the option to print to PS. Result? Original margins restored.

[Image: uncropped] This is how pages printed, and how they ended up after printing to PS.

The conclusion might be: there's something broken in PDF->PS conversion.

PDF to PS

If it's broken, best do it manually. My first attempt was to convert the original document to PS and then crop it, but the tools to do that are harder to find. Therefore, I came up with a different process:

Let's crop the PDF and convert to PS while respecting the boundaries.

It might sound crazy - after all, I just said that printing to PS (a form of conversion) is broken! Well, it's broken in the built-in mechanism. Perhaps there's another program to do it right. Indeed, there is! pdftocairo helped me out.

[Image: cropped] pdftocairo could see my pages without margins!

Putting it all together

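First, use the gs command to find the bounding box of each page:

# gs -dBATCH -dSAFER -dNOPAUSE -dUseCropBox -sDEVICE=bbox doc.pdf

The units are PostScript points (1/72 inch), measured from the bottom-left corner of the page, so the four values are left, bottom, right, top. The croppdf script at the end of the post labels them left, top, right, bottom, but it takes them in this same order. My average page has the dimensions:

108 127 506 723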
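Since a PostScript point is 1/72 of an inch, these values convert to millimeters easily, which helps when comparing them with margins measured on paper. A small sketch (the helper name `box_size_mm` is made up for illustration):

```python
MM_PER_POINT = 25.4 / 72  # one PostScript point is 1/72 inch


def box_size_mm(box):
    """Width and height in millimeters of an (x0, y0, x1, y1) box in points."""
    x0, y0, x1, y1 = box
    return (abs(x1 - x0) * MM_PER_POINT, abs(y1 - y0) * MM_PER_POINT)


width, height = box_size_mm((108, 127, 506, 723))
print("%.1f mm x %.1f mm" % (width, height))  # prints "140.4 mm x 210.3 mm"
```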

Now, set the bounding boxes. I used a custom program that I found randomly and adapted to change all the boxes (because I don't know which one is the correct one, and as long as it works I don't care). The script requires pyPdf and you can find it at the end of the post.

./croppdf -b 108 127 506 723 -i doc.pdf -o marked.pdf

The new marked.pdf document will work on the screen, but still not in print.

Last stage: applying the crop.

pdftocairo -ps marked.pdf cropped.ps

Done! The cropped.ps document is neatly trimmed and ready for printing.
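The mark and convert steps can be chained from a small driver. This is only a sketch under stated assumptions: `build_commands` and `run_pipeline` are hypothetical names, `./croppdf` is the script from this post sitting in the current directory, and pdftocairo is on the PATH. Building the argument lists separately from running them makes it easy to inspect the commands first.

```python
import subprocess


def build_commands(src, box, marked="marked.pdf", out="cropped.ps"):
    """Return the argv lists for the mark and convert steps, in order."""
    mark = ["./croppdf", "-b"] + [str(v) for v in box] + ["-i", src, "-o", marked]
    convert = ["pdftocairo", "-ps", marked, out]
    return [mark, convert]


def run_pipeline(src, box):
    """Run each step in turn, raising on the first one that fails."""
    for argv in build_commands(src, box):
        subprocess.run(argv, check=True)


# Print the commands without running them.
for argv in build_commands("doc.pdf", (108, 127, 506, 723)):
    print(" ".join(argv))
```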

The script

#! /usr/bin/python2

# Originally found on http://www.mobileread.com/forums/showthread.php?t=25565

import argparse, sys
from pyPdf import PdfFileWriter, PdfFileReader

parser = argparse.ArgumentParser(description='Crop/mark pdf to size. Works best with ghostscript.', epilog='Originally by sjvr767. Modified by rhn')
parser.add_argument('-b', metavar=('left', 'top', 'right', 'bottom'), type=int, nargs=4,
                    required=True, help='bounding box')
parser.add_argument('-i', '--input', metavar='input', required=True, help='input file')
parser.add_argument('-o', '--output', metavar='output', help='Specify the name and path of the output file. If none specified, the script appends \'_cropped\' to the input file name (before extension).')
parser.add_argument('--skip', action='store_true', help='skip first page')
args = parser.parse_args()
input_file = args.input
        
if input_file[-4:]!='.pdf':
    print "Input file must be a PDF."
    sys.exit(2) #exit if no appropriate input file

if args.output:
    output_file = args.output
    if output_file[-4:]!='.pdf':
        print "Output file must be a PDF."
        sys.exit(2)
else:
    output_file = "%s_cropped.pdf" %input_file[:-4] #default output


import collections
Box = collections.namedtuple('Box',
                             ('l', 't', 'r', 'b'))

box = Box(*args.b)

input1 = PdfFileReader(file(input_file, "rb"))

output = PdfFileWriter()
outputstream = file(output_file, "wb")

pages = input1.getNumPages()

new_tr = (box.r, box.t)
new_br = (box.r, box.b)
new_tl = (box.l, box.t)
new_bl = (box.l, box.b)

start = 1 if args.skip else 0

for i in range(start, pages):
    page = input1.getPage(i)
    page.mediaBox.upperLeft = new_tl
    page.mediaBox.upperRight = new_tr
    page.mediaBox.lowerLeft = new_bl
    page.mediaBox.lowerRight = new_br
    page.cropBox.upperLeft = new_tl
    page.cropBox.upperRight = new_tr
    page.cropBox.lowerLeft = new_bl
    page.cropBox.lowerRight = new_br

    page.trimBox.upperLeft = new_tl
    page.trimBox.upperRight = new_tr
    page.trimBox.lowerLeft = new_bl
    page.trimBox.lowerRight = new_br
    page.bleedBox.upperLeft = new_tl
    page.bleedBox.upperRight = new_tr
    page.bleedBox.lowerLeft = new_bl
    page.bleedBox.lowerRight = new_br

    output.addPage(page)

output.write(outputstream)
outputstream.close()
