Archive for the ‘Programming’ Category

I’ve got a new article up at The Code Project about how the internal keyword in C# can be used safely while preserving encapsulation. In a nutshell, it is about using internal interfaces instead of giving internal access directly to properties or fields, and how this can help maintain strict encapsulation while nevertheless granting special access to service layers. Give it a read and don’t forget to vote!

Currency Parsing, Regular Expressions

ASP.Net 2.0 has some very powerful client-side web page validation features, including classes that emit JavaScript validation code to the user’s web browser. The CompareValidator allows one to test whether the value entered into a text box is convertible to a given data type. All one has to do is add the CompareValidator tag to the markup and set its properties, and ASP.Net does the rest. But don’t expect it to handle a dollar sign: use a regular expression for that!
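
To give a feel for the kind of regular expression meant here, below is a minimal sketch in Python. The pattern is my own illustration, not the one from the full post: it assumes US-style amounts with an optional leading dollar sign, optional comma thousands separators, and an optional two-digit cents part. It uses only constructs that JavaScript and .Net regular expressions also support, so something similar could probably be adapted for a validator, but test it against your own inputs first.

```python
import re

# Hypothetical pattern (an assumption for illustration):
#   \$?                       optional leading dollar sign
#   \d{1,3}(,\d{3})* | \d+    digits with or without comma grouping
#   (\.\d{2})?                optional cents, exactly two digits
CURRENCY_PATTERN = re.compile(r"\$?(\d{1,3}(,\d{3})*|\d+)(\.\d{2})?")


def is_currency(text: str) -> bool:
    """Return True if the whole string looks like a currency amount."""
    return CURRENCY_PATTERN.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["$1,234.56", "1234", "$12,34", "1234.5", "$"]:
        print(repr(sample), is_currency(sample))
```

The grouped and ungrouped digit forms are separate alternatives so that a misplaced comma such as “$12,34” is rejected rather than silently accepted.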

Pre-scoring Candidates for String Matching

Keywords: Levenshtein, String Matching, Optimization

Introduction

In many service-oriented businesses, the need comes up to search your customer list for names that are published on a government watchlist. The names on either list may contain typographical errors, so a fault-tolerant matching algorithm must be run on each name pair. A model algorithm to consider for this is the Levenshtein string matching algorithm. Levenshtein is a nice choice because it is well behaved, easy to understand and work with, and unbiased in the sense that it is not parametric and not tuned for “English” sounding names only. The Levenshtein algorithm is (somewhat whimsically) explained in another posting here, with references.

The time to compare all possible combinations of names varies linearly with the length of each input list, and quadratically if both lists are growing. Typical string matching algorithms that tolerate typographical errors run more slowly than linear time in the length of the strings. Levenshtein is itself quadratic in the length of the strings it is comparing, so it makes sense to look for a fast, linear-time pre-scoring algorithm that rejects as many comparisons as possible beforehand. Furthermore, and perhaps just as importantly, it is desirable to find a pre-scoring strategy that is separable over the two strings being matched, so that the pre-score can be calculated “offline” and stored with each string for quick comparison later.
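
As a rough illustration of what a separable pre-score can look like, here is a minimal Python sketch. It is not necessarily the strategy described in the full post; it swaps in a character-count bound (sometimes called the “bag distance”), which is a lower bound on the Levenshtein distance. The function names `profile`, `prescore` and `may_match` are hypothetical. The profile of each name depends on that name alone, so it can be computed offline and stored, and comparing two profiles takes time linear in the string lengths.

```python
from collections import Counter


def profile(name: str) -> Counter:
    """Per-string part of the pre-score: a character histogram.

    It depends on one string only, so it can be computed once and stored.
    """
    return Counter(name)


def prescore(pa: Counter, pb: Counter) -> int:
    """Lower bound on the Levenshtein distance between the two strings.

    Each insert, delete or substitute can remove at most one surplus
    character from either side, so the larger surplus is a lower bound.
    """
    surplus_a = sum((pa - pb).values())  # characters a has that b lacks
    surplus_b = sum((pb - pa).values())  # characters b has that a lacks
    return max(surplus_a, surplus_b)


def may_match(pa: Counter, pb: Counter, max_distance: int) -> bool:
    """Fast rejection: False means the pair cannot be within max_distance."""
    return prescore(pa, pb) <= max_distance


if __name__ == "__main__":
    a, b = profile("Hugo Chavez"), profile("Huge Shavez")
    print(prescore(a, b), may_match(a, b, 2))
```

Only pairs that pass `may_match` would then go on to the full quadratic comparison. A pair that passes is not guaranteed to be a match; the bound only rules pairs out.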

String Matching and Zaxxon

Keywords: Levenshtein, Wagner-Fischer, Zaxxon

It frequently comes up in the service sector that you need to check your customer names against some list provided by regulatory agencies, and the checking has to be tolerant of slightly mistyped names. So for example, if “Hugo Chavez” is on the list and “Huge Shavez” is in your list of customers, you might want to at least flag it as something to investigate further to see if it is a real match. One of the classic methods for evaluating near matches for words and names is the Levenshtein algorithm, or its closely related cousin Wagner-Fischer. But how does it work? And what does it have to do with Zaxxon?
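
For readers who want to see the mechanics first, here is a minimal Python sketch of the textbook dynamic-programming (Wagner-Fischer style) computation of the Levenshtein distance. It is a generic version written for this excerpt, not the code from the full post, and it assumes unit cost for every insertion, deletion and substitution. Only two rows of the table are kept in memory at a time.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b with unit costs (an assumption)."""
    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b; row 0 is just 0..len(b) insertions.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning i chars of a into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) ca
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("Hugo Chavez", "Huge Shavez"))
```

For the example above, the two names differ by two substitutions, so a threshold of two edits would flag the pair for further investigation.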

Keywords: Open Office, Ghostscript, Do It Yourself Convert Microsoft Word, Excel, Powerpoint to PDF

I’ve written a small program that uses Open Office to open and save different kinds of Microsoft Office files to PDF, and optionally merge them into a single output PDF file using GPL Ghostscript. I posted the code and article at the Code Project: http://www.codeproject.com/KB/java/PDFCM.aspx.

It’s a command line program, and we’re using a simplified version of it in production to do back-office conversions and merges of office files, some of which come from forms we fill out internally and others from customers. There are potentially many documents, and they can vary in size, so it is very cumbersome to cut, paste, print and scan everything to PDF (which is what our staff were doing when I started this project).

Fortunately, it turns out that (1) one can use PRNADMIN.DLL with a Postscript Printer driver and an ActiveX IE browser to render a web page to Postscript, (2) Open Office can batch convert Microsoft Office files (and many more) to PDF, and (3) Ghostscript will merge Postscript and PDF on the command line.

Open Office and Plain Text Files

Keywords: Open Office, MSDOS line endings, a .txt file opens in Open Office Calc

I noticed the following peculiarity recently. When Open Office opens a plain text file with the .txt extension, either on Windows with MSDOS \r\n “CRLF” line endings or on Unix with Unix \n “LF” line endings, Open Office Writer opens the file correctly without complaint. However, when opening a plain text file with MSDOS “CRLF” line endings on Unix, Open Office tries to import the text file into a spreadsheet and pops up a dialog. I don’t know if that is intended behavior or not, but it is useful to know if you’re processing text files in a Unix environment. This was observed in Open Office 2.4.

Update, June 3, 2008: I reported this to Open Office with some extra details. Again, I don’t know if it is a bug, but you should probably be aware of text file line endings, and have tools ready to change them if necessary.
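
As one example of such a tool, here is a minimal Python sketch that reports and converts line endings. The function names are hypothetical, and it works on raw bytes so that it makes no assumption about the text encoding beyond the line-ending bytes themselves. Existing utilities such as dos2unix do the same job.

```python
def detect_line_ending(data: bytes) -> str:
    """Report "CRLF", "LF", or "none" for the given file contents."""
    if b"\r\n" in data:
        return "CRLF"
    if b"\n" in data:
        return "LF"
    return "none"


def to_lf(data: bytes) -> bytes:
    """Convert MSDOS CRLF line endings to Unix LF."""
    return data.replace(b"\r\n", b"\n")


def to_crlf(data: bytes) -> bytes:
    """Convert to MSDOS CRLF, normalizing first so CRLF is not doubled."""
    return to_lf(data).replace(b"\n", b"\r\n")


if __name__ == "__main__":
    sample = b"first line\r\nsecond line\r\n"
    print(detect_line_ending(sample), to_lf(sample))
```

Reading and writing the file in binary mode (`open(path, "rb")` and `open(path, "wb")`) keeps Python from translating the line endings on its own.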

Keywords: .Net, Email, Attachments, Where are all the open file handles coming from?

Sending an email message in .Net is easy using the System.Net.Mail namespace. Just make sure that if you use attachments, you dispose of them when you’re done. The reason is that creating and adding an attachment from a filename silently opens a file on your system and holds it open either until the attachment is disposed or until the entire message is disposed. The timing of opening and closing the attachment file has nothing to do with when you actually send the message. If you never dispose of them, the file stays open until the garbage collector gets around to it, and given that the garbage collector is usually configured for best response time, that could very well be long after your message has gone out of scope; i.e., until your process dies or is bounced by a web server.
