I’m embarrassed! Every now and then — okay, quite frequently — I’m reminded that when I started my I.T. career I had no interest in being a developer. It wasn’t until I got sick of getting certified in this technology or that, and of being married to a pager, that I decided I wanted to learn how to be a developer. I had no official training and am pretty much “self-taught” — and at times, I feel that it shows.
For example, I have a large client with a huge email subscriber list that they send monthly emails to. (Yes, they’re opted in, so don’t go all *spammer* on me.) Occasionally they like to do targeted blasts based on zipcode (for those members we actually have a zipcode for). This past month’s blast had over 28,000 zipcodes in the target list. Due to an oversight, I wound up with a list of zipcodes that overlapped. Basically it came down to having two files, one with 132,277 lines and the other with 113,035 lines. Between those two files were roughly 30,000 overlapping email addresses that would have received both the targeted and the non-targeted blast had I not caught it. *OUCH*, that would NOT have been good.
I decided to parse through the two files with Python, since its syntax is fresh in my mind and I’d have wound up googling too much had I decided to do it in bash, and this was time-sensitive stuff. So I busted out vi and coded up the following (don’t laugh):
INFILE1 = 'all-email.lst'
INFILE2 = 'nofp-list.txt'

destinations = []
dupes = []

for line in open(INFILE1, 'r').readlines():
    for line2 in open(INFILE2, 'r').readlines():
        if line != line2:
            print line2
            destinations.append(line2)
        else:
            dupes.append(line2)
This code subsequently hung my machine as it struggled to loop over that much data. I knew this wasn’t gonna be the final version as I was writing it — I had to get my creative juices flowing first — but I really didn’t expect it to hang my machine. I had to power off my machine and then revise the code once my system came back up. After some initial tweaks I had this (thinking that it was just too much to read all that into memory, and failing to see the real problem for what it was: that nested for loop):
import fileinput

INFILE1 = 'all-email.lst'
INFILE2 = 'nofp-list.txt'

destinations = []
dupes = []

for line in fileinput.input([INFILE1]):
    for line2 in fileinput.input([INFILE2]):
        if line != line2:
            print line2
            destinations.append(line2)
        else:
            dupes.append(line2)
This didn’t work either, as it was giving me an “input() already active” error (fileinput keeps module-level state, so you can’t nest two fileinput.input() loops). Rather than investigate further, I turned around and ultimately ended up with this, and thought to myself, “that was dumb, I KNOW better than that”:
INFILE1 = 'all-email.lst'
INFILE2 = 'nofp-list.txt'

destinations = []
dupes = []

list1 = open(INFILE1, 'r').readlines()
list2 = open(INFILE2, 'r').readlines()

for line in list2:
    if line in list1:
        dupes.append(line)
    else:
        print line
        destinations.append(line)
*DUH*, read them both into lists and use Python’s “in” syntax to look for one in the other. Done. Now, I’m positive there are even more iterations this code could have eventually evolved into, but this got the job done and didn’t suck up my machine’s resources. Since it was a one-off, I stopped here. (NOTE: the code might not be exactly as I had it since this is largely from memory.)
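One such further iteration would have been to swap that first list for a set, since “in” against a list still scans every line, while a set lookup is effectively constant time. A rough sketch of what that might look like (untested, same file names as above):

INFILE1 = 'all-email.lst'
INFILE2 = 'nofp-list.txt'

destinations = []
dupes = []

# A set gives constant-time membership checks, so the "in" test below
# doesn't rescan all 132,277 lines for every address in the second file.
list1 = set(open(INFILE1, 'r').readlines())

for line in open(INFILE2, 'r').readlines():
    if line in list1:
        dupes.append(line)
    else:
        print line
        destinations.append(line)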
In any case — it’s situations like these that I both despise and enjoy at the same time. I despise them because this should be easy, and I’m embarrassed by the fact that my first revision was so *dumb*, as if it lacked any thought. But I enjoy them because it’s a finite problem to solve and lets me exercise my brain a bit. As someone who is starting to spend less time coding and more time in meetings, it feels good to do these exercises.
CLEARLY I’ve uncovered the need to code up a better process for doing these targeted blasts, as it would appear they’re going to be doing more and more of them. I look forward to writing that code so I don’t have to go through something like this, which should have been so very elementary! *Embarrassing!*
I invite you to share your approach for such a situation, A) so I can learn more from it, and B) to find out if I’m really that far off anyway…
:wq!
Tbh, that’s the approach I probably would have taken – that is, start with what looks right at first glance, then test and refine until I come up with code that actually works. There’s a lot of value in not doing it entirely ‘right’ the first time – it’s a learning experience. :)
@Barbara — that actually makes me feel tons better. You’re always one of the people I imagine seeing some of the crap I’ve written and shuddering in disgust :)
I look at you as one of the ‘smart ones’ that I need to try and keep up with.
Uh, nice for this newbie to learn a bit about Python in the real world, but don’t we have databases and SQL to apply this kind of filtering before we fetch and process all that unwanted data?
SELECT … MINUS SELECT …
@Chris — sorry I’m just getting around to this, but yes, SQL would have been a way I could have done it. However, only a portion of these addresses were in a database; the other portion was an Excel spreadsheet. I suppose I could have imported them into a database, but I rarely opt to write SQL for one-time items such as this because a) I hate writing SQL, b) doing it this way helps me keep all of my skills exercised, and c) I then have a re-usable script that can be executed should I run into this again. It would also help to cleanse that data and possibly prep it for a clean import. I’d rather not use SQL to hide dirty data and instead opt to get it clean. For me, DATA is KING! :) If you have to write SQL to get around dirty data — I feel — you’re doing it wrong.
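That said, if I ever did go the import-into-a-database route, even a throwaway in-memory database would do the trick. A rough sketch with Python’s built-in sqlite3 (hypothetical table and column names; note that SQLite spells the set operator EXCEPT rather than MINUS):

import sqlite3

INFILE1 = 'all-email.lst'
INFILE2 = 'nofp-list.txt'

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE all_email (addr TEXT)')
conn.execute('CREATE TABLE nofp (addr TEXT)')

# Load each file into its own one-column table.
conn.executemany('INSERT INTO all_email VALUES (?)',
                 [(line.strip(),) for line in open(INFILE1, 'r')])
conn.executemany('INSERT INTO nofp VALUES (?)',
                 [(line.strip(),) for line in open(INFILE2, 'r')])

# Addresses in the second list that are NOT in the first, i.e. the clean destinations.
for (addr,) in conn.execute('SELECT addr FROM nofp EXCEPT SELECT addr FROM all_email'):
    print addr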