
Archive for the ‘Programming’ Category

Database access is challenging to unit test because the database exists separately from the application code under test.  Any unit test involving database access is therefore really an integration test.  The problems, however, go beyond mere semantics.  The following extra work has to be done just to run such a test:

  • Maintaining A Database: In order to run the test, a working database has to be set up and accessible during the test.
  • Maintaining/Accessing State: In order to test expectations, the database must be brought to a known state before every test.

While not insurmountable, the extra work adds up.  Testing reliability also suffers because there are now more reasons for tests to fail that have nothing to do with the code under test.  (This is a consequence of the second law of thermodynamics!)

Mocking frameworks like NSubstitute can help by mocking your database connection.  We simply provide mocks for key interfaces that return expected test case data and substitute those mocks for real database connections in the test.  When using a mocking framework in place of your database there is no need to maintain a separate database installation just for unit testing application code, and tests won’t fail because of connection failure or unknown initial database state.

In the following, I will present mocks for the .NET interfaces IDbConnection, IDbCommand, and IDataReader.  Database connection interfaces are remarkably similar across languages, so you should be able to apply these techniques in Java or similar as well.  Demonstration code is up on GitHub in my MockDbConnections project.

Preparing to Mock the Database

Some preparation may be in order before the database can be mocked; this falls under the topic of writing testable code.  Many data access implementations rely on connection objects instantiated deep in encapsulated layers of code, often just before database access is made.  Though this is usually benign (unless you are migrating your database platform!), it is a violation of the Single Responsibility Principle, because the data access code is responsible both for creating a database connection and for using it.  It also makes it hard to substitute a mock for the connection.  To make such code cleaner and more testable, it is better to have it depend on a Factory Method that provides the needed data access objects.  Using a PostgreSQL driver as an example, instead of hiding the creation of the connection instance like this:

public class MyDataAccess {
    public string CxnString { get; set; }
    // ...
    public void SomeDataAccessMethod() {
        using (var cxn = new NpgsqlConnection(CxnString) ) {
            // ...
            var cmd = cxn.CreateCommand();
            // ...
        }
    }
}

create the connection using a factory method that can be injected at runtime, like this:

public interface IDbConnectionFactory { 
    IDbConnection CreateConnection(string cxnString);
    IDbCommand CreateCommand(string cmdText, IDbConnection cxn);
}

public class MyDataAccess {
    public string CxnString { get; set; }
    public IDbConnectionFactory CxnFactory { get; set; }
    // ... 
    public void SomeDataAccessMethod() { 
        using (var cxn = CxnFactory.CreateConnection(CxnString) ) { 
            // ... build the command text strCmd ...
            var cmd = CxnFactory.CreateCommand(strCmd, cxn);
        } 
    } 
}

Now we can happily inject an NpgsqlConnectionFactory into the data access object at runtime, and an alternate factory that produces mocks at test time.
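
For reference, the runtime factory can be a thin wrapper over the driver.  A minimal sketch, assuming the Npgsql driver (the class name here just mirrors the one mentioned above):

using System.Data;
using Npgsql;

public class NpgsqlConnectionFactory : IDbConnectionFactory {
    // Production implementation: simply forwards to the real driver objects
    public IDbConnection CreateConnection(string cxnString) {
        return new NpgsqlConnection(cxnString);
    }
    public IDbCommand CreateCommand(string cmdText, IDbConnection cxn) {
        return new NpgsqlCommand(cmdText, (NpgsqlConnection)cxn);
    }
}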

Mocking the Common Database Interfaces

The first thing to note about NSubstitute (and many mocking frameworks) is that it can only provide mocks for methods and properties that are virtual.  If you try to provide a mock for a non-virtual method of a concrete class instance, it will run the actual implementation instead of the mock!  So usually you’ll see mocks written to interfaces.
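
A quick illustration of the difference (SomeConcreteReader is a hypothetical class, not part of the project):

// Interface members are freely mockable:
var reader = Substitute.For<IDataReader>();
reader.GetString(0).Returns("canned value");   // the test now gets "canned value" back

// Substitute.For<SomeConcreteReader>() would only intercept virtual members;
// a non-virtual GetString() would run the real implementation, which is rarely
// what a test wants.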

In the following, we’re actually going to create mocks from abstract classes that inherit the common database interfaces.  This is both because the classes need some state in order to collaborate, and because we’d like to be able to inject expected results for the database operations.  The following abstract classes are implemented in the MockDbConnections project on GitHub.

  • ADbConnection implements IDbConnection.  It provides a concrete implementation of the ConnectionString property.
  • ADbCommand implements IDbCommand.  It provides concrete implementations for the Connection property, the CommandText property, and the Execute*() methods (ExecuteNonQuery(), ExecuteReader(), ExecuteScalar()).  There are properties that can be used to set the expected return values of those methods.
  • ADataReader implements IDataReader.  It provides concrete implementations for the Read() method, the row indexer method, and the various typed getters for the fields in the current row like GetString() or GetDouble().  There is a property that can be used to set the RowSet.
  • Although I didn’t provide it for this example, a mockable ADbTransaction that implements IDbTransaction could be provided in the same way if needed.

All of these implementations are simple, relying only on basic types and simple generic collections.  With these in hand, we can mock IDbConnectionFactory to produce a collaborating set of mocks that can effectively run test data through a data access class.
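
To give a feel for how simple that is, here is the essential state machine behind ADataReader reduced to a standalone sketch (illustrative only; the real class implements IDataReader, presumably leaving the members it doesn’t need abstract for NSubstitute to fill in):

using System.Collections.Generic;

public class RowSetReaderSketch {
    // The injected test data: one object[] per row
    public List<object[]> RowSet { get; set; } = new List<object[]>();
    private int _current = -1;

    public bool Read() { return ++_current < RowSet.Count; }
    public object this[int i] { get { return RowSet[_current][i]; } }
    public string GetString(int i) { return (string)RowSet[_current][i]; }
    public decimal GetDecimal(int i) { return (decimal)RowSet[_current][i]; }
}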

In the example code, there is a simple data access class for a fictional bank that has accounts, owners, and portfolios.  In this toy model a portfolio can have many accounts, and each account belongs to exactly one portfolio and has exactly one owner.   The data access class has a method which returns all of the accounts in a portfolio identified by name.  The method has to do two things:  it translates the portfolio name into an id, and then it returns a selection of accounts that match the portfolio id.   If an error happens, it produces a log statement.  (The example code also mocks the logger so that log strings get added to a local List that can be inspected during the test.)
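
To make the tests below easier to follow, here is a sketch of the kind of method under test.  It is illustrative only: the Account type, the Logger, and the exact SQL text are my assumptions, and the real GetAccounts class is in the MockDbConnections project.

public List<Account> GetAccountsForPortfolio(string portfolioName) {
    var retval = new List<Account>();
    using (var cxn = CxnFactory.CreateConnection(CxnString)) {
        cxn.Open();

        // Step 1: translate the portfolio name into an id
        var idCmd = CxnFactory.CreateCommand(
            $"select id from portfolio where name = '{portfolioName}';", cxn);
        var idReader = idCmd.ExecuteReader();
        if (!idReader.Read()) {
            Logger.Error($"Could not find portfolio id for '{portfolioName}'");
            return retval;
        }
        int portfolioId = idReader.GetInt32(0);

        // Step 2: select the accounts that match the portfolio id
        var acctCmd = CxnFactory.CreateCommand(
            $"select id, portfolio_id, owner_id, account_number, balance from account where portfolio_id = {portfolioId};", cxn);
        var acctReader = acctCmd.ExecuteReader();
        while (acctReader.Read()) {
            try {
                retval.Add(new Account {
                    Id = acctReader.GetInt32(0),
                    PortfolioId = acctReader.GetInt32(1),
                    OwnerId = acctReader.GetInt32(2),
                    AccountNumber = acctReader.GetString(3),
                    Balance = acctReader.GetDecimal(4)
                });
            } catch (Exception ex) {
                // A bad row is logged and skipped rather than derailing the other rows
                Logger.Error($"Caught parse exception: {ex.Message}");
            }
        }
    }
    return retval;
}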

The mock method for CreateConnection() simply takes the connection string and sets it on the mock connection.

// Mock a connection factory
var provider = Substitute.For<IDbConnectionFactory>();

provider.CreateConnection(Arg.Any<string>()).Returns(
    x =>
    {
        var retval = Substitute.For<ADbConnection>();
        retval.ConnectionString = (string)x[0];
        return retval;
    });

The code for creating a command is more interesting.  The mock can inspect the given command text sql string and provide a different result depending on the parameters.  So for example, let’s say I have a list of accounts in the test case.

// A set of test accounts 
// id, portfolio id, owner id, acct number, balance
List<object[]> testCase = new List<object[]>();
testCase.Add(new object[] { 1, 1, 2, "ABC123", 100.00m });
testCase.Add(new object[] { 2, 2, 3, "DEF456", 200.00m });
testCase.Add(new object[] { 3, 3, 2, "GHI789", 300.00m });
testCase.Add(new object[] { 4, 2, 3, "JKL012", 400.00m });
testCase.Add(new object[] { 5, 1, 3, "MNO345", 500.00m });

And the data access class under test issues an SQL query that returns a subset of the accounts based on the portfolio id.  The mock CreateCommand() method can detect the SQL and return the correct response based on the data in the test case.

provider.CreateCommand(Arg.Any<string>(), Arg.Any<IDbConnection>()).Returns(
    x =>
    {
        var retval = Substitute.For<ADbCommand>();
        retval.CommandText = (string)x[0];
        retval.Connection = (IDbConnection)x[1];

        // condition the result based on the SQL statement

        // Is it asking for a list of accounts?
        var regexGetAcct = new Regex(@"select id, portfolio_id, owner_id, account_number, balance from account where portfolio_id = (\d+);");
        if ( regexGetAcct.IsMatch(retval.CommandText))
        {
            var mtGetAcct = regexGetAcct.Match(retval.CommandText);
            var readerResult = Substitute.For<ADataReader>();
            readerResult.RowSet = testCase.Where(y => ((int)y[1]) == int.Parse(mtGetAcct.Groups[1].Value)).ToList();
            retval.ReaderResult = readerResult;
        }

        // Other conditions...
 
        return retval;
    });

When this factory instance is passed into the toy data access class, the above code will set up a RowSet that matches the input.

// Create the object under test
GetAccounts getAccounts = new GetAccounts(provider, "ConnectionString");

// Do a bunch of positive test cases
var accts = getAccounts.GetAccountsForPortfolio("One");
Debug.Assert(accts.Count == 2);
accts.ForEach(x => Debug.Assert(x.PortfolioId == 1));

accts = getAccounts.GetAccountsForPortfolio("Two");
Debug.Assert(accts.Count == 2);
accts.ForEach(x => Debug.Assert(x.PortfolioId == 2));

accts = getAccounts.GetAccountsForPortfolio("Three");
Debug.Assert(accts.Count == 1);
accts.ForEach(x => Debug.Assert(x.PortfolioId == 3));

This is pretty simple so far, but things get more interesting when we introduce an error condition.

// Do a negative test case
accts = getAccounts.GetAccountsForPortfolio("Four");
Debug.Assert(accts.Count == 0);
Debug.Assert(loggerStatements.Count == 1);
Debug.Assert(loggerStatements[0] == @"Could not find portfolio id for 'Four'");
loggerStatements.Clear();
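
For completeness, the loggerStatements list referenced above is populated by mocking the logger, as mentioned earlier.  Here is a sketch of that wiring (the ILogger interface and its Error() method are assumptions; the project’s actual logging interface may differ):

var loggerStatements = new List<string>();
var logger = Substitute.For<ILogger>();
logger.When(l => l.Error(Arg.Any<string>()))
      .Do(ci => loggerStatements.Add((string)ci[0]));
// the substitute logger is then handed to the GetAccounts object under test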

We’re able to test that an appropriate error condition is raised.  Finally, we can easily test what happens when the database returns an unexpected value.  This is easy to do with our mock framework since the RowSet it returns is just a list of object[];  in the following I’ve just set a string “BadValue” in a slot where a decimal was expected.

// Do an exception case - should skip the bad record and return the others
testCase.Insert(0, new object[] { 6, 1, 3, "PQR678", "BadValue" });
accts = getAccounts.GetAccountsForPortfolio("One");
Debug.Assert(accts.Count == 2);
Debug.Assert(loggerStatements.Count == 1);
Debug.Assert(loggerStatements[0].StartsWith("Caught parse exception:"));

In this case, a parse error in a single row should create a log statement and not derail processing of other rows.  This is actually hard to do in a real DBMS since it is difficult to set up a bad value in a column of definite type.  Yet bad data happens, especially if you’re dealing with third party data, and it is important to test that your data access layer can deal with it.

Conclusion

In this post, we saw an example of using a mocking framework to help test a data access layer.  This has some distinct advantages: we were able to create unit tests for database access code without the extra work of setting up and preparing a database.  Such tests are not vulnerable to external points of failure like a server being down or a database left in an unknown initial state.  And we were furthermore able to test error conditions that are not easily set up in a real database.

And yet, what are we really testing?  This kind of framework is useful for testing behavior in the data access layer.  If the data access layer is simply mapping database fields to object properties, then it’s probably better to use an automated framework to generate integration tests with random data.  The added value of using a mocking framework is that we could test data access layer behavior without setting up a database.  For the project I am working on it’s critical to be able to do this:  working on a homebrew replication system for PostgreSQL, it’s incredibly useful to be able to test application behavior in a failover event without setting up multiple databases.

 


Read Full Post »

Writing Testable Code

One hears the phrase “writing testable code” fairly often, and it sounds like a good thing, but what does it actually mean and why do we care?

Answering the second question first, we care because we’d like to be able to test software and catch errors before our audience does.  It’s embarrassing to say the least when a bug is released into production code, and potentially disastrous to the bottom line in rework and lost customers.  But as software systems have gotten larger and more feature rich, the possibilities for errors have grown exponentially.

Automated build and test tools such as JUnit/NUnit and Jenkins were introduced to take the manual labor out of running unit tests, but the task of creating unit tests is still largely an expensive, manual one.

Writing testable code then is ultimately about reducing the cost of writing unit tests.   So what design choices lead to expensive unit tests?

Code that violates the single responsibility principle (SRP) is difficult to test because functionality under test cannot be isolated.  If a module that implements dual responsibilities fails a test, was that a failure of one or the other part of the implementation mediated through shared state?  Someone has either to spend time tracing the error manually to determine the root cause, or to spend time writing more complex unit tests that can discern among all of the possible permutations of bugs.  Avoiding this source of complexity is a matter of refactoring code until it no longer violates SRP.

Also, code that gets its dependencies using either the new operator or static references to concrete implementations is difficult to test because the functionality under test cannot be easily isolated from its dependencies.   Developers often punt on this problem and back themselves into creating integration tests.  But now somebody is spending lots of time maintaining copies of production services in a dev/integration environment so that the integration tests can run.  Furthermore, such integration tests take longer to set up and run.  The problem is compounded: more time and expense per test implies that there will be fewer tests, poorer coverage, and more bugs released to production.

Obviously code needs access to its dependencies, so how can one write code in such a way that one avoids spending extra time maintaining integration tests?

Dependency Injection

If dependencies are abstracted by interfaces, and implementations are injected into code at runtime, then the problems mentioned above go away.  Alternate, controlled implementations of the dependencies are injected into code during a unit test.  This nicely isolates the code functionality under test and avoids having to spend time and money on maintaining specially controlled, production-like services in a dev or integration environment just for integration testing.

A common scenario where this problem arises is in code that accesses a database in order to do its work.  Let’s say we have a method that returns the Users who have accessed a website on a given date by querying a database.  The code might look something like this (for C# and a PostgreSQL database):

public class DataAccess {
  // ...
  public string ConnectionString { get; set; }
  // ...
  public List<string> GetTodaysUsers(DateTime dt) {
    using (var cxn = new NpgsqlConnection(ConnectionString)) {
      cxn.Open();
      var cmd = cxn.CreateCommand();
      cmd.CommandText = $"select user_name from application.usage where last_date = '{dt.ToShortDateString()}';";

      var retval = new List<string>();
      var reader = cmd.ExecuteReader();
      while (reader.Read()) {
        retval.Add(reader.GetString(0));
      }

      return retval;
    }
  }
}

It’s impossible to test this code without either breaking encapsulation or spending time automating a database setup: standing up a usage table, populating it with reproducible test data, possibly running setup and teardown around every test, and accepting tests that can’t be run in parallel.

An alternate implementation that uses dependency injection doesn’t suffer from these problems:

public interface IDbConnectionProvider {
  IDbConnection GetConnection(string cxnString);
  IDbCommand GetCommand(string cmd, IDbConnection cxn);
}

public class NpgsqlProvider : IDbConnectionProvider {
  public IDbConnection GetConnection(string cxnString) {
    return new NpgsqlConnection(cxnString);
  }
  public IDbCommand GetCommand(string cmd, IDbConnection cxn) {
    var retval = new NpgsqlCommand();
    retval.Connection = (NpgsqlConnection)cxn;
    retval.CommandText = cmd;
    return retval;
  }
}

public class DataAccess {
  // ...
  public IDbConnectionProvider DbConnectionProvider { get; set; } = new NpgsqlProvider();

  public string ConnectionString { get; set; }
  // ...
  public List<string> GetTodaysUsers(DateTime dt) {
    using (var cxn = DbConnectionProvider.GetConnection(ConnectionString)) {
      cxn.Open();
      var cmd = DbConnectionProvider.GetCommand(
           $"select user_name from application.usage where last_date = '{dt.ToShortDateString()}';",
           cxn);

      var retval = new List<string>();
      var reader = cmd.ExecuteReader();
      while (reader.Read()) {
        retval.Add(reader.GetString(0));
      }

      return retval;
    }
  }
}

This is now testable code.  At test time, a mockup implementation of IDbConnectionProvider can be provided to DataAccess that monitors the incoming SQL queries and sends back canned responses without hitting a database at all.  As many mockup implementations as needed can be created to run tests in parallel, saving still more time.

In this case, the mockup needs to collaborate with mockups of IDbConnection, IDbCommand, and IDataReader.  Most languages already provide interfaces similar to these .NET examples, and they are fairly easy to mock up quickly with standard collections.  Many excellent mocking frameworks, such as NMock or Mockit, also exist to automate the process of providing mock implementations of interfaces, with complete control over inputs and responses and support for many languages.
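
For a taste of what that looks like, here is a rough sketch using NSubstitute (the framework used in the mocking post above; other frameworks read similarly).  The canned reader hands back two user names without any database involved:

var reader = Substitute.For<IDataReader>();
reader.Read().Returns(true, true, false);     // two rows, then stop
reader.GetString(0).Returns("alice", "bob");  // canned user names

var cmd = Substitute.For<IDbCommand>();
cmd.ExecuteReader().Returns(reader);

var provider = Substitute.For<IDbConnectionProvider>();
provider.GetConnection(Arg.Any<string>()).Returns(Substitute.For<IDbConnection>());
provider.GetCommand(Arg.Any<string>(), Arg.Any<IDbConnection>()).Returns(cmd);

var dataAccess = new DataAccess { DbConnectionProvider = provider, ConnectionString = "unused" };
var users = dataAccess.GetTodaysUsers(DateTime.Today);   // never touches a real database
Debug.Assert(users.Count == 2);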

Conclusion

In order to write testable code, stick to the Single Responsibility Principle (SRP) at design time, and make use of dependency injection so that unit tests can use mocked-up dependencies to isolate functionality.  While this doesn’t eliminate the need for a test environment mirroring production for overall system and deployment testing, it does save the time and expense otherwise devoted to maintaining production-like services under special conditions just for integration testing.

Read Full Post »

I’ve spent many hours becoming (more or less) proficient in the ceremony around PostgreSQL functions;  hours that hopefully you won’t have to spend after reading this!  Here is a one-minute list.

  1. Functions and Stored Procedures are all “Functions” in PostgreSQL, so don’t waste time searching for syntax related to “stored procedures” in the PostgreSQL manual.
  2. Use the opening phrase “create or replace function …” so that you can quickly reload the code while debugging.
  3. A function name may be schema qualified (e.g. “my_schema.my_function”).  If it is not schema qualified, it will go into the first schema on your search_path (usually public).
  4. The argument list is in parentheses and is a comma-separated list of names and simple PostgreSQL types like int, text, or varchar.
  5. I usually prefix a “p_” to all my parameter names to make the following code more readable.  You should too; you’ll thank me later.
  6. PostgreSQL can overload functions by method signature, so if you change the argument list, you will be loading a different, distinct function.  Beware of this and always remember to drop a function explicitly if you didn’t intend for an overload to be floating around (see the drop example below).
  7. The return value can be void (if your function is like a stored procedure in other SQL variants);  it can be any PostgreSQL data type like int, text, or varchar; it can be a TABLE definition if it is returning a query; or it can be a special data type like TRIGGER.
  8. Don’t forget the “AS” keyword.
  9. The text of the function appears after the AS, between double-dollar-sign delimiters, $$.  These are just text delimiters; there is no special voodoo in them.
  10. The text delimiter dollar signs may have other identifying characters in between, like $aa$ or $bbb$. These make different delimiters.
  11. If you mark the beginning of the text with one delimiter, close it with the same delimiter, like “$aa$ blah; blah; blah; $aa$”.  “$aa$ blah blah blah $bbb$” won’t work.
  12. If you need to declare local variables, include a DECLARE block first.  Local variables can be declared with any of the simple PostgreSQL types like int, text or varchar in addition to some special types.
  13. I usually prefix a “l_” to all my local variable names to make the following code more readable.  You should too; you’ll thank me later.
  14. The DECLARE block does not need its own END statement.
  15. The next block is delimited by BEGIN … END;.  Your SQL code goes in there.
  16. Assigning values to local variables?  Use the “:=” operator and not “=”.  Also, the expression on the right of a “:=” should go into “()” parentheses.
  17. If your function returns anything other than void, you need to have a RETURN statement in there.
  18. After the final $$ text delimiter, include the string “language plpgsql”. There are other languages; you will probably never use them.

So here’s an example of a wrapper function which returns the time with a hello message.

CREATE OR REPLACE FUNCTION public.hello_time(p_hello_msg varchar) RETURNS varchar AS
$$
DECLARE
    l_tmp_str varchar;
BEGIN
    l_tmp_str := ( cast ( now() as varchar) );
    return p_hello_msg || ' ' || l_tmp_str;
END;
$$ LANGUAGE plpgsql;

You can execute it like this:

> select hello_time('hello');
hello 2018-08-15 12:24:34.0764339-05
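
And per point 6 above, if you change the argument list while debugging, the old signature stays loaded; drop it explicitly by signature:

DROP FUNCTION IF EXISTS public.hello_time(varchar);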

Happy coding! Look for more TL; DR; guides in the future at The Solarium’s TL; DR; Guides.

Read Full Post »

Installing a new PostgreSQL database AND being able to connect to it can be a daunting task at times.  When you use an installer (Windows) or a package manager (Linux), the installation usually comes with all of the sensible defaults turned on and ready to go.  But, if you are dealing with a “special” custom installation, you may run into some connection issues.  Ain’t that special?

In this blog post, I will assume that you have already installed a PostgreSQL database instance on Linux.  I am working with PostgreSQL 10.4 and CentOS 7, but these tips should be general enough to help you troubleshoot on other platforms.  I am also assuming, if you’re reading this, that you have a client program like DBeaver (https://dbeaver.io/) on Windows, or command line psql on Linux running in a terminal, AND that your client cannot make a connection to your brand spanking new PostgreSQL database instance.  Without going into thousands of pages of reference material, here are four of the most common things to check up on when troubleshooting PostgreSQL connection issues.

  1. Is there network connectivity between client and server?
  2. Is a firewall on the server blocking your connections?
  3. Is PostgreSQL listening to the correct network interface?
  4. Is the client allowed by pg_hba.conf?

Network Connectivity

It sounds dumb, but the first thing to check is the network connection between the client and server machines.  Everything is plugged in, right?  Then try to ping the server from the client and vice versa.  If the machines are not ping-able, then you most likely have a network connectivity problem.  Here are some general troubleshooting ideas for network connectivity.

  • Check that you can ping other sites, like http://www.google.com. If you can, that points to a specific problem between your client and server.
  • You just installed PostgreSQL on the server; is the server machine itself also freshly installed?  Make sure it has the correct networking parameters, and that the network interface is configured to come up at boot time.
    • On CentOS, the network configuration is kept in a file in /etc/sysconfig/network-scripts, named something like ifcfg-ifacename.  Check that the contents of this file are correct for your network interface and that ONBOOT=yes is set, then run systemctl restart network
  • If your server is running in a Virtual Machine guest OS and the database client is running on the host, check the virtual machine manager and make sure that networking is enabled from host to guest.
  • If you work in a secure corporate “double-secret probation” network environment, it may also be that ICMP is blocked and that ping won’t work.  Submit a ticket to your networking team to find out if that is an issue and for help troubleshooting connectivity.

Firewall Issues

A firewall on the server may be blocking communication to the port that PostgreSQL is listening on.  On the server machine, you can create custom firewall rules to allow network traffic to that port.  Unless you changed it in the postgresql.conf configuration file, the default port that PostgreSQL uses is 5432.  On CentOS, the firewall service is firewalld.  You can check to see if it is running by issuing the command

systemctl status firewalld

It will tell you if the service is disabled/inactive, stopped, or running.  As a matter of fixing firewall issues, I don’t recommend stopping/disabling the firewall entirely.  Certainly do not do this in production, although it can be OK on an isolated testbench network segment, like a private Hyper-V virtual network.  It is much better to add a custom firewall rule to the CentOS firewalld service for PostgreSQL:

  • Look up the port number in postgresql.conf file in the PostgreSQL data directory for your database instance.  (The default value for PostgreSQL is 5432.)
  • For each host or network segment you want to allow through the firewall, run the following command
    firewall-cmd --add-rich-rule='rule family=ipv4 source address=192.168.10.100/32 port port=5432 protocol=tcp accept' --permanent
    This will allow connections from the host with the exact IP address 192.168.10.100 to make connections to the server machine on port 5432. Your parameters will vary of course.
  • Yes, the word “port” appears twice in the above command.  For historical reasons, with firewall rules on CentOS, the port is actually called the “portport”.  (Just kidding, but not about the double occurrence of the word “port” – you really do need it there twice.)
  • Don’t forget to reload the firewall after creating new rules. Run the command
    firewall-cmd --reload
    to do this.

Listener Network Interface

In the PostgreSQL postgresql.conf configuration file there are many settings, including the listen_addresses setting.

#listen_addresses = 'localhost'

PostgreSQL provides configuration files with many settings like this one:  the setting is commented out, but it is set to the default value.  This particular setting means that PostgreSQL is by default only listening for connections that come over the localhost network interface, i.e. local connections.  If your client is trying to connect from another machine, then it won’t be able to connect to PostgreSQL.  To fix this, uncomment the setting and change it to

listen_addresses = '*'

With this value, PostgreSQL will listen for connections on any interface.  Like with the firewall, before the setting takes effect you need to reload the server configuration.  You can reload the server configuration without restarting the server.  If the data folder for your instance is in the location /usr/data/pgdata, you can reload configuration with the command

pg_ctl -D /usr/data/pgdata reload

(Remember to substitute the location of the PostgreSQL data directory on your machine.)  If you have only one network card in your machine, it makes sense to just listen on all interfaces.  If you have multiple network cards, it is best practice to listen only on the interfaces over which you expect PostgreSQL clients to connect, to prevent spoofed connection attempts.  More details about this setting can be found in the manual.

pg_hba.conf

This file provides another layer of connection security based on the database user that you specified in the client.  Fortunately, if this is the problem, the client will have received and displayed an error message to the effect that there is no pg_hba.conf entry for your host, user, and database.  Pretty.  Stinking.  Obvious.

So to fix this issue, it is sufficient to add an entry to the end of this file and reload the server configuration as above.  The file is pretty well commented, but briefly, each entry consists of a line

host database_name user_name ip_address authentication_method

So if I had a client connecting with user name “tom_bombadil” to the database “old_forest” from a single computer with IP address 192.168.10.100 and we want to require a password, the line would be

host old_forest tom_bombadil 192.168.10.100/32 md5

There is a complete discussion of these parameters at the PostgreSQL site: https://www.postgresql.org/docs/current/static/auth-pg-hba-conf.html

Conclusion

Connecting to a brand new PostgreSQL instance can be mildly challenging if you’re installing from source or if you have non-standard configuration parameters or went around the installer in some way.  Not all of these issues necessarily involve PostgreSQL either.  Hopefully this list of common issues helped get you up and running!  If you had any other experiences that you think could be helpful, feel free to relate those in the comments!

Read Full Post »

I’ve got a new article up at The Code Project about how the internal keyword in C# can be used safely while preserving encapsulation. In a nutshell, it is about using internal interfaces instead of giving internal access directly to properties or fields, and how this can help maintain strict encapsulation while nevertheless granting special access to service layers. Give it a read and don’t forget to vote!

Read Full Post »

Currency Parsing, Regular Expressions

ASP.Net 2.0 has some very powerful client-side web page validation features, including classes that emit JavaScript validation code to the user’s web browser.  The CompareValidator allows one to test whether the value entered into a text box is convertible to a given data type.  All one has to do is add the CompareValidator tag to the markup, set the properties, and ASP.Net does the rest.  But don’t expect it to handle a dollar sign: use a regular expression for that!
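
For illustration, a pattern along these lines can be handed to a RegularExpressionValidator (a sketch only; the full post may use a different expression):

// Optional dollar sign, optional comma grouping, optional cents
string currencyPattern = @"^\$?\d{1,3}(,?\d{3})*(\.\d{2})?$";
bool ok = System.Text.RegularExpressions.Regex.IsMatch("$1,234.56", currencyPattern);   // true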

(more…)

Read Full Post »

Pre-scoring Candidates for String Matching

Keywords: Levenshtein, String Matching, Optimization

Introduction

In many service-oriented businesses, the problem comes up of screening your customer names against names published on a government watchlist.  The input names on either list may contain typographical errors, so a fault-insensitive matching algorithm must be run on each name pair.  A model algorithm to consider for this is the Levenshtein string matching algorithm.  Levenshtein is a nice algorithm to consider because it is well behaved, easy to understand and work with, and unbiased in the sense that it is not parametric and tuned for “English”-sounding names only.  The Levenshtein algorithm is (somewhat whimsically) explained in another posting here, with references.

The time to compare all possible combinations of names varies linearly with the length of each input list, and quadratically if both lists are growing.  Typical string matching algorithms that are insensitive to typographical errors run more slowly than linear time in the string length; Levenshtein itself is quadratic in the length of the strings it compares.  So it makes sense to look for a cheap, linear-time pre-scoring algorithm that allows as many comparisons as possible to be rejected outright.  Furthermore, and perhaps as importantly, it is desirable to find a pre-scoring strategy that is separable over the two strings being matched, so that the pre-score can be calculated “offline” and stored with each string for quick comparison later.
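
As a sketch of the idea (one possible pre-score, not necessarily the one developed in the full post): precompute each name’s length and a 26-bin letter histogram once, store them with the name, and use them at compare time as a cheap lower bound on the Levenshtein distance.  Pairs whose lower bound already exceeds the match threshold never reach the quadratic algorithm.

using System;

public static class PreScore {
    public static int[] Histogram(string s) {
        var h = new int[26];
        foreach (char c in s.ToUpperInvariant())
            if (c >= 'A' && c <= 'Z') h[c - 'A']++;
        return h;
    }

    // Every edit changes the length by at most 1 and the histogram L1-distance by
    // at most 2, so Levenshtein(a, b) >= max(|lenDiff|, ceil(histDiff / 2)).
    public static int LowerBound(string a, int[] ha, string b, int[] hb) {
        int histDiff = 0;
        for (int i = 0; i < 26; i++) histDiff += Math.Abs(ha[i] - hb[i]);
        return Math.Max(Math.Abs(a.Length - b.Length), (histDiff + 1) / 2);
    }
}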

(more…)

Read Full Post »

Older Posts »