After talking with someone much smarter and more experienced than me, I realized I need to narrow the scope of my project so I can test some of the key back-end pieces of a dispatch mapping platform. At a minimum, the following questions need to be answered:
- Can DeepSpeech be trained to reliably transcribe addresses from scanner audio?
- Can addresses be detected and extracted from the transcribed audio?
- Can those addresses be validated?
So what does the minimum viable product look like?
Scanner audio is fed to DeepSpeech and a list of addresses is produced. The most basic thing I can gain from scanner audio is “a call happened at these addresses”.
To generate good training data, I need transcripts of the audio. Amazon will transcribe audio at about 6 cents a minute, and Google will do it for about 2 cents a minute. Even better, Google Cloud gives free accounts a 300 dollar credit for the first year. Without even signing up, let’s test whether Google’s speech-to-text engine will detect addresses without any fiddling.
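For a rough sense of what those prices mean, here is a small back-of-the-envelope calculation. It only uses the per-minute figures and the credit quoted above, which may have changed since; the 100-hour workload is a made-up number for illustration. Amounts are kept in whole cents to avoid floating-point rounding.

```python
# Prices as quoted in this post, in US cents (assumed, may be out of date).
AMAZON_CENTS_PER_MIN = 6
GOOGLE_CENTS_PER_MIN = 2
GOOGLE_CREDIT_CENTS = 300 * 100  # the 300 dollar first-year credit


def cost_in_dollars(minutes: int, cents_per_min: int) -> float:
    """Cost of transcribing `minutes` of audio at a flat per-minute rate."""
    return minutes * cents_per_min / 100


def minutes_covered(credit_cents: int, cents_per_min: int) -> int:
    """Whole minutes of audio a credit pays for at a flat per-minute rate."""
    return credit_cents // cents_per_min


if __name__ == "__main__":
    free_minutes = minutes_covered(GOOGLE_CREDIT_CENTS, GOOGLE_CENTS_PER_MIN)
    print(f"Credit covers {free_minutes} minutes ({free_minutes // 60} hours)")

    workload_minutes = 100 * 60  # hypothetical: 100 hours of scanner audio
    print("Amazon:", cost_in_dollars(workload_minutes, AMAZON_CENTS_PER_MIN))
    print("Google:", cost_in_dollars(workload_minutes, GOOGLE_CENTS_PER_MIN))
```

At the quoted rate, the credit alone would cover 15,000 minutes, or 250 hours, of audio before anything has to be paid.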
I uploaded this clip:
Here’s what Google’s demo was able to identify out of all that:
The model picked out the address! Now on to question 2: can we pick out an address from a bunch of other garbage? The best tool for this is probably RegEx. I am no RegEx expert, so I did a quick search and found this possibly over-complicated way of detecting most addresses.
The expression:
/(?:(?<=^)|(?<=[;:.,|][ ])|(?<=[[('"]))(?:[)]?P\.?O\.?(?:(?i)[ ]?Box)?[ ]{0,2}(?<PO>\d{1,5})[)]?|(?<HouseNumber>(?>(?:(?<NumberException>(?:19[789]|20[0123])\d)|\d+?(?:[-\\\/]\d{1,3})?)(?=(?:[;,]|[-\\\/]?[A-Za-z]\d?)?\s)))(?:(?<DoorSide>[-\\\/]?[A-Za-z]\d{0,2}))?,?\s{0,2}(?>(?:(?:^|[ ]{1,2})(?<StreetPrefix>AU|EI?|GR|H[AW]|JO|K|M[AEM]|N[EOW]?|O[HLMV]|RD|S[EW]?|TE|W)\b)?)(?:(?:^|[ ]{1,2})(?<StreetName>(?:\p{Lu}[-'\p{L}]*?(?:\.?[ ]{1,2}\p{Lu}[-'\p{L}]*?){0,8}?(?>(?<StreetNameIndicator>(?i)BOULEVARD|PLAZA|ROAD|STR(?:ASSE|EET)|WA(?:LK|Y))?)|(?<StreetOrdinal>\d{1,3}(?:[. ]?(?:°|st|[nr]d|th))))\b))(?:(?:(?>[ ]{1,2}(?i)(?<StreetType>A(?:C(?:CESS|RES)|LLEY|NX|PPROACH|R(?:CADE|TERY)|VE(?:NUE)?)|B(?:A(?:NK|SIN|Y)|CH|E(?:ACH|ND)|L(?:DG|VD)?|O(?:ULEVARD|ARDWALK|WL)|R(?:ACE|AE|EAK|IDGE|O(?:ADWAY|OK|W))?|YPASS)|C(?:A(?:NAL|USEWAY)|ENTRE(?:WAY)?|H(?:A(?:NN?EL|SE)?)?|I(?:R(?:C(?:LET?|U(?:IT|S)))?)?|L(?:B|OSE)?|O(?:MMON|NCOURSE|OP|PSE|R(?:[DK]|NER|S[OT])|UR(?:[VS]E|T(?:YARD)?)|VE)?|R(?:ES(?:CENT|T)?|IEF|OSS(?:ING)?)|T[RS]?|U(?:LDESAC|RVE)|V)|D(?:ALE|EVIATION|I[PV]|M|OWNS|R(?:IVE(?:WAY)?)?)|E(?:ASEMENT|DGE|LBOW|N(?:D|TRANCE)|S(?:PLANADE|T(?:ATE|S))|X(?:P(?:(?:(?:RESS)?WA)?Y)|T(?:ENSION)?))|F(?:AIRWAY|I(?:ELDS?|RETRAIL)|L(?:DS?|S)|O(?:LLOW|R(?:D|MATION))|R(?:D|EEWAY|ONT(?:AGE|ROAD)?))|G(?:A(?:P|RDENS?|TE(?:S|WAY)?)|L(?:ADE|EN)|R(?:ANGE|EEN|O(?:UND|V(?:ET?)?))?)|H(?:AVEN|BR|E(?:ATH|IGHTS)|I(?:GHWAY|LL)|L|OUSE|TS|UB|WY)|I(?:NTER(?:CHANGE)?|SLAND)|J(?:C|UNCTION)|K(?:EY|NOLL)|L(?:A(?:NE(?:WAY)?)?|DG|IN(?:E|K)|N|O(?:O(?:KOUT|P)|WER)?)|M(?:A(?:LL)?|DWS?|E(?:A(?:D|NDER)|WS)|L|NR|OT(?:EL|ORWAY))|NO(?:OK)?|O(?:L|UTLOOK|V(?:ERPASS)?)|P(?:A(?:R(?:ADE|K(?:LANDS|WAY)?)|SS|TH(?:WAY)?)?|DE|I(?:ER|[KN]E)|KW?Y|L(?:A(?:CE|ZA)|Z)?|O(?:CKET|INT|RT)|RO(?:MENADE|PERTY)|T|URSUIT)?|QUA(?:D(?:RANT)?|YS?)|R(?:A?(?:MBLE|NCH)|DG?|E(?:ACH|S(?:ERVE|T)|T(?:REAT|URN))|I(?:D(?:E|GE)|NG|S(?:E|ING))|O(?:AD(?:WAY)?|TARY|U(?:ND|TE)|W)|R|UN)|S(?:CH|(?:ER(?:VICE)?WAY)|IDING|LOPE|MT|P(?:PGS|UR)|Q(?:UARE)?|T(?:A(?:TE)?|CT|EPS
|HY|PL|RAND|R(?:EET|IP)|TER)?|UBWAY)|T(?:ARN|CE|E(?:R(?:RACE)?)?|HRO(?:UGHWAY|WAY)|O(?:LLWAY|P|R)|R(?:A(?:CK|IL)|FY|L)?|URN)|UN(?:DERPASS|IV)?|V(?:AL(?:E|LEY)|I(?:EW|S(?:TA)?)?|L(?:GS?|Y))|W(?:A(?:L[KL](?:WAY)?|Y)|HARF|YND)|XING)\b\.?){1,2})??(?>(?:[ ]{1,2}(?<StreetSuffix>E|N[EW]?|S[EW]?|W)\b)?))(?:(?:^|[ ]{1,2}|[;,.]\s{0,2}?)(?i)(?<Apt>(?:[#]?\d{1,5}(?:[. ]{0,2}(?:°|st|[nr]d|th))?[;,. ]{0,2})?(?:(?:(?>(?:A|DE)P(?:AR)?T(?:MENT)?S?|B(?:UI)?LD(?:IN)?G?|FL(?:(?:OO)?R)?|HA?NGS?R|LOT|PIER|RM|S(LIP|PC|T(E|OP))|TRLR|UNIT|(?=[#]))(?:[ ]{1,2}[#]?\w{1,5})??|BA?SE?ME?N?T|FRO?NT|LO?BBY|LOWE?R|OF(?:C|FICE)|P\.?H|REAR|SIDE|UPPR)){1,3}(?:[#;,. ]{1,3}(?:[-.]?[A-Z\d]){1,3})?)[;,.]?)?)(?<CityState>[-;,.[(]?\s{1,4}(?<City>[A-Z][A-Za-z]{1,16}[.]?(?:[- ](?:[A-Z][A-Za-z]{0,16}|[a-z]{1,3})(?:(?:[- ][A-Za-z]{1,17}){1,7})?)?)(?<!\s[ACDF-IK-PR-W][AC-EHI-PR-Z])[)]?(?>(?<State>[-;,.]?\s{1,4}[[(]?(?<StateAbbr>A[LKSZRAP]|C[AOT]|D[EC]|F[LM]|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEHINOPST]|N[CDEHJMVY]|O[HKR]|P[ARW]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])\b[])]?|[-;,.]\s{0,3}[ ][[(]?(?=[A-Z])(?<StateName>(?i)Ala(?:bam+a|[sz]ka)|Ari[sz]ona|Arkan[sz]as|California|Colorado|Con+ec?t+icut+|Delaw[ae]re?|Flori?da|Georgia|Haw+ai+|Idaho|Ill?inois|Indiana|Iowa|Kansas|Kentu[ck]+[iy]|Louis+ian+a|Ma(?:ine|r[iy]land|s+achuset+s)|Mi(?:chigan|n+es+ot+a|s+is+ip+i|s+ouri)|Montana|Ne(?:bra[sz]ka|vada|w[ ]?(?:Hamp?shire|Jerse[iy]|Mexico|York))|[NS](?:o[ru]th|[.])[ ]?(?:Carolina|Dakota)|Ohio|Oklahoma|Oregon|Pen+s[iy]lvan+[iy]a|Rh?oa?de?[ ]?Island|Ten+es+e+|Texas|Ut+ah?|Vermont|Washington|(?:W(?:est|[.])?[ ]?)?Virginia|Wi[sz]cou?nsin|W[iy]om[iy]+ng?)[])]?)?)(?(State)|(?:(?<=[)])|(?! [A-Z]))))?(?>(?:[-;,.\s]{0,4}(?:^|[ ]{1,2})[[(]?(?<ZipCode>(?!0{5})\d{5}(?:-\d{4})?)[])]?)?)(?(State)|(?(ZipCode)|(?(City)(?!)|(?(PO)|(?(NumberException)(?!)|(?(StreetNameIndicator)|(?(StreetType)|(?(StreetPrefix)|(?!)))))))))(?=[]).?!'"\s]|$)(?![ ]+\d)/gmx
I tried this expression against the output of Google’s speech-to-text demo on regex101.com and it correctly identified the address:
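The expression above relies on PCRE-style syntax such as `\p{Lu}`, which Python's built-in `re` module does not support, so it cannot be pasted into a Python script as-is. As a much simpler stand-in (not the expression above), here is a minimal sketch that looks for "house number, optional direction, one to three words, street type" in a lowercase transcript. The street-type list, the `find_addresses` name, and the sample transcript are all made up for illustration, and a pattern this loose will miss addresses and produce false positives.

```python
import re
from typing import List

# A deliberately small, hypothetical pattern; real dispatch audio needs more care.
SIMPLE_ADDRESS = re.compile(
    r"""
    \b(?P<number>\d{1,6})\s+                               # house number
    (?:(?P<prefix>north|south|east|west|[nsew])\.?\s+)?    # optional direction
    (?P<name>(?:[a-z0-9']+\s+){1,3}?)                      # 1-3 street-name words
    (?P<type>street|st|avenue|ave|road|rd|drive|dr|
             boulevard|blvd|lane|ln|court|ct|highway|hwy)\b  # street type
    """,
    re.IGNORECASE | re.VERBOSE,
)


def find_addresses(transcript: str) -> List[str]:
    """Return every substring of the transcript that looks like a street address."""
    return [match.group(0) for match in SIMPLE_ADDRESS.finditer(transcript)]


if __name__ == "__main__":
    sample = "engine 12 respond to 1234 north main street for a reported structure fire"
    print(find_addresses(sample))  # ['1234 north main street']
```

Note that the unit number "12" is skipped only because no street type follows within three words; a phrase like "12 units on the road" would be reported as an address, which is the kind of noise a validation step would have to filter out.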
From these quick and dirty tests, I am confident that Google is the best and cheapest option for generating training data, and that addresses can be identified in a string of text. The next steps toward a minimum viable product are to learn how Google’s speech-to-text API works and to start generating text that will be the starting point for DeepSpeech training data.
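To make "starting point for DeepSpeech training data" a bit more concrete, here is a rough sketch of turning (audio file, transcript) pairs into a training manifest. It assumes the CSV layout I understand DeepSpeech's training scripts to expect (`wav_filename`, `wav_filesize`, `transcript`) and an alphabet of lowercase letters, apostrophe, and space; both should be checked against the DeepSpeech docs for the version in use. The function names are my own. Because that alphabet has no digits, a house number like "1234" would have to be spelled out first, so this sketch refuses transcripts that still contain digits instead of silently dropping them.

```python
import csv
import os
import re
from typing import Iterable, Tuple

# Anything outside the assumed alphabet (a-z, apostrophe, space) becomes a space.
_DISALLOWED = re.compile(r"[^a-z' ]+")


def normalize_transcript(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace in a transcript."""
    if any(ch.isdigit() for ch in text):
        raise ValueError("spell out numbers before normalizing: " + text)
    cleaned = _DISALLOWED.sub(" ", text.lower())
    return " ".join(cleaned.split())


def write_manifest(clips: Iterable[Tuple[str, str]], csv_path: str) -> int:
    """Write (wav_path, transcript) pairs to a CSV manifest; return rows written."""
    written = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav_path, transcript in clips:
            text = normalize_transcript(transcript)
            if not text:
                continue  # nothing usable in this clip
            writer.writerow([wav_path, os.path.getsize(wav_path), text])
            written += 1
    return written
```

The transcripts would come from Google's speech-to-text output and the paths from the matching scanner audio clips, ideally after a human has spot-checked them.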