Telephone: +44 7973 504232


Many to one issues


Real world data can be difficult to model without adding complexity that would be out of place in a simple cloud system.  This article describes one method to shoehorn a many-to-one relationship into a simple database, and what happens when it goes wrong. 


Imagine you are the purveyor of an email marketing system which is effectively a flat file (think spreadsheet like), rather than a more sophisticated database management tool.  This is fine, simple in concept and implementation, but with limitations that you can’t really have a clean many-to-one relationship.  You could have multiple fields if you want to record, say, colour options.  You could have a colour1 field, colour2 field etc., or just a logical field for each colour, so as many fields as there are options.  You could have one large field and put all the selected options in separated by some character, perhaps a pipe or semicolon.  All these might work, but there are non-trivial issues with consistent data input/import and with selections (queries) that then return the right data. 

 

If the data consists of logical values (yes/no, on/off, 1/0), then it is relatively simple to store the values as a string, so if the options are Red, Orange, Yellow and Blue, you could, for example, store this as “1001” if Red and Blue have been selected, and Orange and Yellow have not.  You would also need a lookup table somewhere to map selections to their place in this string, but overall this is space efficient and the queries will run very fast.  Even better, this can be stored as a bit string, which will use about one eighth of the space and run even faster. 


Slightly technical paragraph follows: In this instance, our binary value for Red would be binary 1, (2 to the power 0) and the 1000 for Blue would have a value of 8 (2 to the power 3), so our storage of Red and Blue, 1001, would look like the decimal value 9 (8+1). Given that we can assign a value to our offsets in the bit string, we can actually pass the value for a particular bit to the check box in the HTML web form used for either data input or creating a selection query, and then receive that value on submit if the checkbox is checked.  So red’s return value would be 1, Orange 2, Yellow 4, and Blue 8 (they don’t return anything if they are not checked).  Minimal coding required, and all this can be hidden from the end-user. 


As one database field can be used to store a group of selectable values, we can have multiple fields and groups, so one for colour, another for size, perhaps another for location, or any group of selectable values the user cares to add as they customise their flat file. A standard integer field may be 32 bits long, which should be enough for almost all purposes. A web page with a group of more than 32 checkboxes is too much. 

Updates to such fields should be quick, we just have to amend bit values to 1 for those selected, and to 0 for those not selected.  Our query form will use the lookup table to create a mask that will pick out the selected values most efficiently.  Computers are good at binary.  Many to one, done.

 

Problems start to arise when the user wants to change their group of values.  They wish to add Green and Violet, and nobody is buying Yellow, so they want to take that out.  You could have implemented the lookup table as just a simple list, which would be appealing as you could then use the same list order as the presentation order of the options on the screen.   Removing yellow, insertions of new options or a change in order is now a tricky whole file operation, as you need to amend that group field in every record and shuffle the subsequent bits along.  If you have implemented the lookup as an index with a numeric offset into the bitstring then you could just leave yellow to fester by removing the lookup entry, but you may still wish to bite the bit (ha ha) and tidy up any unused options less you run out of space later.  The programmer behind all this has to cope with fields being removed, fields being added, possibly some name (field label) changes and changes of presentation order.  Doubtless this will be well tested, because get it wrong and all those user’s customer’s preferences will be mis-mapped.  


Get it wrong they did.  I have rectified this situation for a customer who was fortunate to have copies of the data before and after it all got a bit mixed up when they added some new choices in a location option group.  Menorca was added as the fourth option in the middle of the location set, but all the existing bit values were not adjusted and Menorca ended up pointing at what should be the New York values instead.  All the rest of the location values were similarly misplaced down to Zurich, which was then seemingly shunned by all the customers.  Their group of language choices were also shuffled.


The purveyor of the system declined to accept that their system had done this, but then they were about to sell the company.  They may also have broken data protection law, at least in the UK.  I couldn’t replicate the issue, so presumably they had fixed the bug, if not the data.  What was interesting was the minimal level of on-line support, and that there was, apparently, no community in which you could ask if others had seen this happen.  It was as if the application was so good that such facilities were not required. 


Here are some broad lessons one could draw from this:


·         When evaluating systems, check out the web support pages.  If there is no sign of bugs or feature requests, no community pages with problems and work-arounds, then caveat emptor.

·         Take and keep your own backups.  Insidious issues can damage your data for some time before you notice. You may need to repair your data using those aged backups. 

·         Normalisation is data science, not rocket science.  Nevertheless, the effort and complexity involved will be too much for someone who just wants to get their mailing out.  When we cheat we create an anomaly somewhere, and we have probably moved the complexity and risk into some program code to try to manage this.



John.Davis@cranfieldsoftware.co.uk