View Full Version : Converting junk to data base
teach1st
09-09-2001, 02:55 PM
An alleged friend just sent me her mailing list, done as a text file, and wants it converted to DB form. The list has two thousand entries in this form:
Name
Aaddress 1
Address 2
City, State Zip
There is nothing separating entries, so that it looks like this:
Zircon Enterprises
Zircon Memorial Building
222 Dogs Knee Way
Stained Shirt, Idaho 99999
Scrubbed Clean, Inc
Laundromat Division
66 Washtub Way
Suds, Wash, 11111
etc
etc for 8000 lines
(and once in a while there is a space between addresses)
Anyway, I can separate the City State Zip once I get it into some sort of workable 4 field data base. Problem is, I don't know how to do that. Any hints? Thanks!
Arthur
09-09-2001, 03:37 PM
That's not a simple task (but you knew that already :) ). But with the help of some regular expressions it shouldn't be too hard. I'd recommend doing it with UltraEdit (http://www.ultraedit.com)
Are all addresses 4 lines? Or do they at least all end with a zip code? With the zip code you can split the addresses up, because it's the only line that ends on a digit. Use a regular expression that looks for a digit followed by a newline ("^([0-9]^)^p") replace it with the digit and two newlines ("^1^p^p") or another character (e.g. @ - "^1@"). Replace all single newlines with a semi-colon and then replace for instance the @ by a newline.
I doubt that you can still follow me, but if you need help or if you'd like me to have a go at converting, just send me a personal message.
Arthur
sheila
09-09-2001, 03:40 PM
Originally posted by teach1st:
Name
Aaddress 1
Address 2
City, State Zip
There is nothing separating entries, so that it looks like this:
Zircon Enterprises
Zircon Memorial Building
222 Dogs Knee Way
Stained Shirt, Idaho 99999
Scrubbed Clean, Inc
Laundromat Division
66 Washtub Way
Suds, Wash, 11111
etc
etc for 8000 lines
(and once in a while there is a space between addresses)
Alleged friend is right!
I would want to use some sort of script to convert the file into another format, possibly a pipe-delimited flatfile database. From there, you should be able to work with it easily.
In other words, a form like:
Zircon Enterprises|Zircon Memorial Building|222 Dogs Knee Way|Stained Shirt, Idaho 99999
Scrubbed Clean, Inc|Laundromat Division|66 Washtub Way|Suds, Wash, 11111
One database entry per line, fields separated by some delimiter, such as a pipe. That should be fairly easy to read into a database?
I don't know what scripting languages (if any) you know...
The key seems to be, to recognize the last line in each entry. Is the zip code included in all entries, and is always just five digits? (Or do you have some of those Zip+ entries with the extra digits?)
I'd try to knock of a Python script to reformat it for you, if that would help???
sheila
09-09-2001, 03:43 PM
Originally posted by arthur:
Are all addresses 4 lines? Or do they at least all end with a zip code? With the zip code you can split the addresses up, because it's the only line that ends on a digit. Use a regular expression that looks for a digit followed by a newline ("^([0-9]^)^p") replace it with the digit and two newlines ("^1^p^p") or another character (e.g. @ - "^1@"). Replace all single newlines with a semi-colon and then replace for instance the @ by a newline.
If all addresses are 4 lines, this helps immensely, true.
As far as the zip code line being the only one that ends in a digit, I'm not so sure about that. It's probably possible that a line looks something like this in Address 2 line:
66 Suds Lane, Ste. # 5
Ending in a Suite number, might make it possible there is a digit in the final position of a line that isn't a City/State/Zip Code line.
Arthur
09-09-2001, 03:49 PM
Well, the zip code should always be 5 digits, right? Then you can just search for five digits + a newline. Unless there are suite numbers with 5 digits...
If there are some addresses with 3 lines, that would be a bit of a problem, although if there's only a few, you could correct those by hand.
It all depends on how regular/irregular the original file is...
teach1st
09-09-2001, 03:49 PM
I'd try to knock of a Python script to reformat it for you, if that would help???
Thanks, Sheila, but I'm going to try to learn this. I don't like not being able to do these thing. I don't know any scripting languages, save for the semi-canned scripting of FileMaker Pro and enough CGI to really mess up UBB when I had it.
Arthur, thanks for your input. I'm going to try the regular expressions route. I don't have UltraEdit, but do have NoteTab Pro and TextPad, both of which support regular expressions. I've used them on occassion when formating ugly text (i.e homework e-mailed to me by fifth graders) to HTML.
If I understand this correctly, I replace CR's with a field deliminator and use the numerals in the zip to signify a line break. Then I replace the empty line with a record deliminator.
Thanks for the input!
Arthur
09-09-2001, 03:57 PM
I replace CR's with a field deliminator and use the numerals in the zip to signify a line break. Then I replace the empty line with a record deliminator That's one possibility; replace all line breaks with a delimiter like "|" or ";". Then you get one very big line which you then break up by replacing the five digit zip codes by the same five digits + a line break. There shouldn't be any empty lines then.
Any delimiter should be okay to use, as long as it doesn't appear in any of the address line. Usually "|" or ";" are used though.
sheila
09-09-2001, 03:59 PM
I'm going to try to learn this. I don't like not being able to do these thing.
Well, I can relate to that! Good luck!
teach1st
09-09-2001, 05:40 PM
Got it! :N
Well, except for the few that had more than four lines. I don't know how to work with that, and, in this case, it's easier to reformat those records by hand.
Thanks for the help!
vBulletin® v3.6.8, Copyright ©2000-2009, Jelsoft Enterprises Ltd.