Marcin Borkowski: 2011-08-05 Semi-automatic selection of pages in a pdf (en)

As I mentioned in my previous post, I’d like to show a simple trick which helped me with the following problem.

I had a bunch of pdf files containing issues of a journal, and I wanted to extract the individual papers as separate pdfs. Following this great post, I decided to use the pdfpages package. What’s more, I wanted the pages to be properly numbered in the pdf, so that the physical first page of the resulting pdf could be numbered, say, 217. This can be accomplished with the hyperref package (thanks to Martin Scharrer for answering my question about this on TeX.SX!).

Doing this by hand is trivial, but time-consuming and very error-prone. So I came up with the following trick: I wrote a short tex file which, when processed by LaTeX, produces a pdf consisting of given pages from a given file, where “given” means “given by the filename”. So I only had to copy this file several times under suitable names, and that did the job.

For example, assume that I want to extract pages 217-240 from the file wm-45-2.pdf. This file starts with two pages of table of contents, which do not count in the numbering, followed by the page numbered 171 (since it is not the first issue of the journal in that particular year). So I created a file wm-45-2--217-240.tex with the following contents:

\documentclass[a4paper]{article}

\def\skippages{2}         % The first two pages are the table of
                          % contents and do not count in the numbering.
\def\firstpagenumber{171} % The first numbered page has number 171.

\usepackage{pdfpages} % We need to extract pages from the given pdf...
\usepackage{hyperref} % ...and we need to give them proper numbers in
                      % the resulting pdf.

\def\extractpages #1--#2-#3-{% This macro extracts the file name (#1),
  \def\filename{#1}%         % start page (#2) and end page (#3) from
  \def\startpage{#2}%        % \jobname, i.e. the name of the current
  \def\endpage{#3}%          % file.
}

\expandafter\extractpages \jobname- % The actual extraction is done
                                    % here.

\newcounter{startextractpage} % We will store the physical page
\newcounter{endextractpage}   % numbers to extract in these counters.
\setcounter{startextractpage}{% Actual calculation starts here...
  \numexpr\startpage+\skippages+1-\firstpagenumber\relax
}
\setcounter{endextractpage}{%
  \numexpr\endpage+\skippages+1-\firstpagenumber\relax
}                             % ...and ends here.
\edef\range{%     Here we store the pagerange in the format
            %     "start-end"...
  \arabic{startextractpage}-\arabic{endextractpage}%
}
\edef\doinclude{% ...and here we store the command to do the
                % extraction.  We have to take care of the expansion
                % timing---hence \edef and \noexpand.
  \noexpand\includepdf[pages=\range]{\filename.pdf}%
}

\begin{document}
\setcounter{page}{\startpage} % This (together with hyperref) sets the
                              % page numbers in the pdf file correctly.
\doinclude % And this physically includes the requested page range.
\end{document}

And that’s all.
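
As a quick check of the arithmetic in the two \numexpr lines, here are the values for the example above (\skippages is 2 and \firstpagenumber is 171), written out as a small LaTeX display. The row labels are just my wording.

```latex
% Physical page = journal page + \skippages + 1 - \firstpagenumber
\[
\begin{array}{rcl}
\mbox{first physical page} & = & 217 + 2 + 1 - 171 = 49 \\
\mbox{last physical page}  & = & 240 + 2 + 1 - 171 = 72
\end{array}
\]
```

So for this file \range comes out as 49-72, which is 24 physical pages, matching the 24 journal pages from 217 to 240. The same formula sends journal page 171 to physical page 3, i.e. the first page after the two pages of the table of contents.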

Maybe not the easiest way of doing what I wanted—but much better than doing it manually, and I think that the trick with parsing \jobname might be worth sharing.
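
If you want to play with the parsing on its own, here is a minimal sketch that feeds \extractpages a literal name instead of \jobname, so the test file itself can be called anything. The \typeout line and the one-line document body are additions for illustration only; the macro is the same as above.

```latex
\documentclass{article}

% Same delimited-parameter macro as in the main file:
% #1 ends at the first "--", #2 and #3 each end at the next "-".
\def\extractpages #1--#2-#3-{%
  \def\filename{#1}%
  \def\startpage{#2}%
  \def\endpage{#3}%
}

% A literal name plus the trailing "-" that terminates #3.
\extractpages wm-45-2--217-240-

% Show the parsed pieces on the terminal and in the log file.
\typeout{file: \filename, first page: \startpage, last page: \endpage}

\begin{document}
Parsed \texttt{\filename}, pages \startpage--\endpage.
\end{document}
```

Note that #1 ends at the first double hyphen, so a base file name which itself contains “--” would be split in the wrong place. Single hyphens in the name, as in wm-45-2, are fine.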
