GNU/Linux Desktop Survival Guide
by Graham Williams
Rename File Based on PDF Contents
20201115 A simple use case begins with a collection of
locally saved bank statements, one for each month, each named something like
20140913_kt_odbc_saver.pdf. The aim is
to rename each file by appending the bank statement's final
balance to the filename. For example, append _32k if the
statement's final balance is something like $31,745.34, resulting in
20140913_kt_odbc_saver_32k.pdf.
Using pdf2txt I noticed that the balance is the only dollar amount in the extracted text that starts in column 1. We can therefore build the rename command as a for loop, bracketed by do and done. Within the loop, echo constructs the basic mv command, using basename to strip the .pdf extension from the original filename. To this we append the dollar amount after some processing: egrep extracts the line holding the dollar amount, uniq collapses repeated copies into a single value, tr deletes the dollar sign and commas, numfmt converts large numbers into SI format, a second tr lower-cases the K suffix, and awk prints the amount followed by the .pdf extension. Each line of output is then a fully formed mv command, which is executed by passing it to sh:
for f in *saver.pdf; do echo -n "mv" "$f" "$(basename "$f" .pdf)_"; pdf2txt "$f" | egrep '^\$[1-9]' | uniq | tr -d '$,' | numfmt --to=si --round=nearest | tr 'K' 'k' | awk '{print $1".pdf"}'; done | sh
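To see what the formatting stages of the pipeline do in isolation, they can be wrapped in a small function and fed a sample amount. This is only a sketch: the function name format_balance is hypothetical and is not part of the command above.

```shell
# Hypothetical helper: turn a balance such as $31,745.34 into a short suffix.
format_balance() {
    printf '%s\n' "$1" |
        tr -d '$,' |                        # delete the dollar sign and commas
        numfmt --to=si --round=nearest |    # scale the number and add an SI suffix
        tr 'K' 'k'                          # lower-case the kilo suffix
}

format_balance '$31,745.34'    # prints 32k
```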
Before you run the rename command, check each step along the way. In particular, the pattern used to extract the dollar amount of interest will differ between types of statement. Sometimes the amount might be embedded in a line that begins with the string Your next AutoPay amount of ..., for example.
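One way to check the result before anything is renamed is to print the mv commands rather than pipe them to sh. The sketch below does this under some assumptions. The function name preview_renames and its optional line-pattern argument are hypothetical. An extra grep -o step is added so that an amount embedded in a longer line can still be isolated. head -n 1 is used in place of uniq to keep a single value.

```shell
# Hypothetical dry run: print the mv commands instead of executing them.
# The optional first argument is the pattern selecting the line that holds
# the balance. It is an assumption and will differ between statement layouts.
preview_renames() {
    line_pattern='^\$[1-9]'
    if [ -n "${1:-}" ]; then line_pattern=$1; fi
    for f in *saver.pdf; do
        amount=$(pdf2txt "$f" |
            grep -E "$line_pattern" |
            grep -oE '\$[0-9][0-9,]*(\.[0-9]+)?' |   # keep only the dollar amount
            head -n 1 |                              # keep a single value
            tr -d '$,' |
            numfmt --to=si --round=nearest |
            tr 'K' 'k')
        printf 'mv %s %s_%s.pdf\n' "$f" "$(basename "$f" .pdf)" "$amount"
    done
}
```

Running preview_renames with no argument uses the column 1 pattern, while something like preview_renames '^Your next AutoPay amount of' targets the other layout. Once the printed commands look right, the output can be piped to sh.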