2008-11-01 11:18:23 by abowman in Ravings of a Lunatic
I find myself liking the Ruby programming language more and more. It offers procedural, object-oriented and functional styles all at once. So when I’m writing some code I can choose some combination of the three to solve the problem.
What I’m finding though is that the easy access to multiple paradigms often obscures the easiest/shortest/most elegant solution to a problem. This can be a challenge.
For instance, I spent half of Thursday and most of Friday working on a problem where I had to extract related pages of data from a series of text files, filter the pages down to the ones of interest and then relate the pages back into the appropriate documents and output the documents as PDF files. So it went like this: many files -> many lines per file -> some lines per page -> some pages per document -> only interested in some documents.
I originally approached it with a series of loops over the files and the lines in each file, gathering the lines that made up a page and putting them in an object that represented the page. I then wrote the page objects to an array and, at the end of the file, processed the pages, finding the ones that belonged together and putting them into a document object. I then stored the document objects in an array and moved on to the next file.
The problem is that the files (all containing lines of text 88 characters long) varied in size from 160KB to 48MB. Rough calculation ensues: (48*1024*1024) bytes / 88 characters per line = roughly 571950 lines, and at 66ish lines per page that comes to about 8665 pages in the 48MB file. So consider that I had 39 of these files, and I was only interested in 16 documents consisting of 2 to 4 pages each scattered throughout.
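The rough calculation above can be checked in a few lines of Ruby. This is only a sketch of the arithmetic; the variable names are made up for illustration, and the 88 and 66 are the approximate figures quoted in the text.

```ruby
# Back-of-the-envelope estimate for the largest (48MB) file.
# 88 characters per line and roughly 66 lines per page come from the text.
file_bytes     = 48 * 1024 * 1024
chars_per_line = 88
lines_per_page = 66

lines = file_bytes / chars_per_line   # integer division: whole lines only
pages = lines / lines_per_page        # integer division: whole pages only

puts "#{lines} lines, #{pages} pages"  # prints "571950 lines, 8665 pages"
```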
Initial execution times were abjectly slow. Why? Because I was storing every page. Duh. Lots of memory! I went back to the drawing board since it was so obvious that I had made some bad decisions. I scrapped the code and started from scratch on Friday morning. This time I stuck with as simple a loop as possible.
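hits = []
Dir.glob("*.TXT").each do |file|
  page = String.new                  # lines of the page currently being read
  File.foreach(file) do |line|       # reads line by line and closes the file
    # A line of optional whitespace followed by ## starts a new page,
    # so test the page just finished before starting the next one.
    if line =~ /^\s*\#\#/
      hits << page if page =~ /THE_SEARCH_STRING/
      page = String.new
    end
    page << line
  end
  # The last page in a file has no ## line after it, so test it here.
  hits << page if page =~ /THE_SEARCH_STRING/
end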
The pages split on lines beginning with optional whitespace followed by ##. I substituted the string I was looking for in place of THE_SEARCH_STRING and I was in business. Total execution time was 24 seconds.
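For anyone who wants to try the idea without a pile of 48MB files, here is a minimal sketch of the same loop wrapped in a method and run against an in-memory sample. The method name matching_pages, the sample text and the search pattern are all invented for illustration; only the ## page marker comes from the script above. This version also tests the last page when the input runs out.

```ruby
require "stringio"

# Hypothetical helper: returns the pages of `io` that match `pattern`.
# Only one page is held in memory at a time, plus the hits.
def matching_pages(io, pattern, page_start = /^\s*##/)
  hits = []
  page = String.new
  io.each_line do |line|
    if line =~ page_start
      hits << page if page =~ pattern   # test the page just finished
      page = String.new                 # start collecting the next page
    end
    page << line
  end
  hits << page if page =~ pattern       # the final page has no ## after it
  hits
end

# Made-up sample data: three tiny pages, one of which is of interest.
sample = StringIO.new(<<~TEXT)
  ## PAGE 1
  nothing of interest
  ## PAGE 2
  ACCOUNT 12345
  ## PAGE 3
  more filler
TEXT

hits = matching_pages(sample, /ACCOUNT 12345/)
puts hits.length   # prints 1
```

Because the method takes anything that responds to each_line, the same code should work on a File opened in block form as well as on a StringIO.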
Then I added additional logic to process the pages into documents (10 lines of code) and then create the PDFs (9 lines of code).
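The post does not show how pages were matched up into documents, so the following is only a guess at the shape of that step. It assumes, purely for illustration, that every page carries its document id on a line like "DOC: A17"; the DOC_ID pattern, the method name and the sample pages are all hypothetical. PDF generation is left out because the library used is not named.

```ruby
# Hypothetical convention: each page names its document on a "DOC: <id>" line.
DOC_ID = /DOC:\s*(\S+)/

# Group page strings by document id, keeping the pages in the order found.
def group_into_documents(pages)
  pages.group_by { |page| page[DOC_ID, 1] }
end

# Made-up pages standing in for the hits gathered by the search loop.
pages = [
  "## 1\nDOC: A17\nfirst page\n",
  "## 2\nDOC: B02\nonly page\n",
  "## 3\nDOC: A17\nsecond page\n",
]

documents = group_into_documents(pages)
documents.each do |id, doc_pages|
  puts "#{id}: #{doc_pages.length} page(s)"
end
```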
At this point I was faced with some PDFs that needed to be signed and placed at specific places in a complicated file structure, and a matching text file with descriptive information needed to be generated for each one. I tried it with the first one and it was a pain in the rump. Took a lot of time. So I added 24 more lines of code to handle all of the file renaming and information file generation and, since I was almost there anyway, some code to generate all of the scp commands to move the files to the right places and to update the processing flags on the files.
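Generating the scp commands might look something like the sketch below. The host, the directory layout and the method name are placeholders, since none of them appear in the post; the only real library piece is Ruby's standard shellwords, which quotes each argument for the local shell. It only prints the commands rather than running them.

```ruby
require "shellwords"

# Placeholder destination; the real host and directories are not given.
REMOTE_HOST = "user@archive.example.com"

# Build one scp command line that copies local_path into remote_dir.
def scp_command(local_path, remote_dir)
  target = "#{REMOTE_HOST}:#{File.join(remote_dir, File.basename(local_path))}"
  ["scp", local_path, target].shelljoin   # escapes each argument for the shell
end

# Made-up file list: local PDF => remote directory.
uploads = {
  "out/A17.pdf" => "/docs/2008/11",
  "out/B02.pdf" => "/docs/2008/11",
}

commands = uploads.map { |path, dir| scp_command(path, dir) }
puts commands
```

Printing the commands first and piping them to a shell afterwards keeps the step easy to inspect before anything is actually copied.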
The whole thing took about 40 seconds to run from start to all files in their proper places. scp session setup and tear-down took most of the remaining time above the 24 seconds of file processing.
I could have done this by hand and it would've taken less time, but now the next time we have this problem, all I have to do is get the files, change the search string and run the script. All the rest is handled for me.
The end result is that I spent a lot of time chasing down a solution more complex than what I needed because I had extra paradigms available. Had I done this in Perl, I would have done it right the first time because I never really used Perl's object orientation in my code. So I must keep in mind that sometimes the straight procedural solution is the fastest, most elegant way to solve a problem.