I don't know if it is only relevant to Allegheny County or if it is more universal, but a friend asked me to write a parcel ID (parcel/block/lot) parser to help him out in his real estate searches. It was kind of a fun project (about 3 hours), and it was neat to see how well it worked out in the end.
It takes the stuff on the right and turns it into the stuff on the left:
0318-C-00080-0000-00 <= 318 C 080
0387-S-00002-0000-00 <= 387-S-2 387-M-148
0160-K-00013-0000-00 <= 0160-K-00013-0000-00
0124-P-00095-000A-00 <= 0124-p-00095-000a-00
1213-F-00377-0000-00 <= 1213F00377
0180-B-00041-0000-00 <= 0180-B-00041-0000-00
0495-F-00201-0000-00 <= Lot & Block 495-F-201
0009-S-00305-0000-00 <= 9-S-305
0309-D-00100-0000-00 <= 0309D00100000000
The code follows:
#!/usr/bin/php5
<?php
// output field separator; you can set this to "" if you want
// the numbers to be smushed together in the output
define("OSEP", "-");

// If run as a command line program, you can run it like:
//   ./property.php < filename.txt
// Otherwise, comment out this while loop and just call
// the function below
while ($line = fgets(STDIN)) {
    // drop the trailing newline so the output stays on one line
    $line = rtrim($line, "\r\n");
    $match = guessNumber($line);
    if ($match)
        print "$match <= $line\n";
    else
        print "Invalid lot format: $line\n";
}
// guessNumber is the main entry point; you can pass in
// all sorts of messed up text and it will try to figure it out.
// Anything after a tab character will be ignored.
function guessNumber($line){
    // strip out stuff after the tab character, if there is one
    $tab = strpos($line, "\t");
    if ($tab !== false)
        $line = substr($line, 0, $tab);
    // try each separator in turn, from the literal separators to
    // the more permissive character-class patterns
    foreach (array("-", "", " ", "[X_-]+", "[ ]+", "[ X_-]+") as $isep) {
        $match = guessNumberHelper($line, $isep);
        if ($match)
            return $match;
    }
    return false;
}
function guessNumberHelper($num, $sep){
    // regular expression for matching: block number (up to 4 digits),
    // block letter, lot number (up to 5 digits), and two optional
    // trailing segments
    $pattern = "([0-9][0-9]?[0-9]?[0-9]?)".
        $sep."([A-Z])".
        "((".$sep.")?([0-9][0-9]?[0-9]?[0-9]?[0-9]?)".
        "((".$sep.")?([0-9A-Z][0-9A-Z]?[0-9A-Z]?[0-9A-Z]?)".
        "((".$sep.")?([0-9A-Z][0-9A-Z]?))?)?)?";
    $num = trim($num);
    $num = strtoupper($num);
    $num = str_replace(array("*", "="), "", $num);
    if (ereg($pattern, $num, $regs))
    {
        $str = padZeros($regs[1], 4);
        if(!$regs[2]){
            $regs[2] = 0;
        }
        $str .= OSEP.$regs[2];
        // a lot number is required; give up without one
        if(!$regs[5]){
            return false;
        }
        $str .= OSEP.padZeros($regs[5], 5);
        $str .= OSEP.padZeros($regs[8], 4);
        $str .= OSEP.padZeros($regs[11], 2);
        // check for mostly empty entries
        if(strlen(str_replace(array("0","-"), "", $str)) < 3){
            return false;
        }
        return $str;
    }
    return false;
}
// left-pad $val with zeros out to $length characters
// (equivalent to str_pad($val, $length, "0", STR_PAD_LEFT))
function padZeros($val, $length){
    $ret = "";
    for($len = strlen($val); $len < $length; $len++)
        $ret .= "0";
    $ret .= $val;
    return $ret;
}
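If you would rather call it from another script (with the STDIN loop above commented out, as noted there), a minimal sketch might look like this; the sample inputs are just taken from the examples above:

<?php
// hypothetical wrapper script: include the parser and call
// guessNumber directly on a few strings
require_once("property.php");

$inputs = array("318 C 080", "Lot & Block 495-F-201", "1213F00377");
foreach ($inputs as $input) {
    $match = guessNumber($input);
    if ($match)
        print "$match <= $input\n";
    else
        print "Invalid lot format: $input\n";
}
?>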
Posted by Jon Daley on March 17, 2008, 10:13 pm
Category: Programming
Word significance is often scored using TF-IDF, which is term frequency * inverse document frequency.
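A minimal sketch of that formula in PHP (a hypothetical helper, not LifeType's actual code):

<?php
// hypothetical TF-IDF sketch: $doc is an array of tokens,
// $docs is an array of such token arrays
function tfidf($word, $doc, $docs) {
    // term frequency: share of this document made up of the word
    $counts = array_count_values($doc);
    $tf = isset($counts[$word]) ? $counts[$word] / count($doc) : 0;
    // inverse document frequency: rarer across documents scores higher
    $containing = 0;
    foreach ($docs as $d)
        if (in_array($word, $d))
            $containing++;
    $idf = log(count($docs) / max(1, $containing));
    return $tf * $idf;
}
?>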
(Peter is commenting on my bayesian filter tests, which I figured I would do using an article that no one had previously commented on, so people would be less likely to see it...)
Fortunately, LifeType was smart enough to filter out the hundreds of comments I made while testing, so you didn't see all of those... :)
What prompted my testing is that I've been getting tons of spam on my company blog, and most of it has lots of common spammy terms, so I'm not sure why the bayesian filter isn't filtering it out by itself. I think it is because there are lots of words, and so the really spammy words are getting diluted by the other words.
I think I want a pseudo-blacklist, where I don't have to type in terms myself, but if the spammer uses words the filter already thinks are spammy, those words get weighted more heavily in the score (see the sketch below). Also, the spammy URLs should be caught more easily too.
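Something like this weighting, maybe (a hypothetical sketch of the idea, not the actual filter code; the names and thresholds are made up):

<?php
// hypothetical sketch: boost tokens whose spam probability is already
// high, so a few very spammy words aren't drowned out by neutral ones.
// $probs maps token => estimated spam probability (0..1) from the filter.
function spamScore($tokens, $probs, $threshold = 0.9, $boost = 3) {
    $score = 0;
    $weight = 0;
    foreach ($tokens as $t) {
        if (!isset($probs[$t]))
            continue;
        $p = $probs[$t];
        // emphasize tokens the filter already considers very spammy
        $w = ($p >= $threshold) ? $boost : 1;
        $score += $w * $p;
        $weight += $w;
    }
    return $weight ? $score / $weight : 0.5;
}
?>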
Also, it looks like there might be a problem due to the character set I have for MySQL - a case-insensitive one. There were some weird problems with case-sensitive user names in German a long time ago, and we probably should have looked into it more carefully.