Monthly Archives: August 2017

Data Management Primer

data_analysisI spent the better part of 40 years working with data. As a scientific researcher and then as a software developer, my work was all about acquiring, managing, analyzing, and reporting data. That experience taught me lots of lessons. I shared many of those lessons by teaching database design and development at the college level and by writing books on the topic.

Chances are that as a professional in the information age, you do a lot of work with data as well. Here’s the most important thing to know right off… this doesn’t come naturally. Collecting and storing data properly is deceptively difficult. I know because as a software consultant I was called in to fix innumerable systems that were failing catastrophically because they were grown organically by clever, smart people using spreadsheets with complete confidence they were doing a great job.

Even good data gets ruined if it is badly stored.  This applies to small projects as much as large ones. Poorly stored data becomes the “garbage in” for your subsequent analyses and reports. Therefore I thought I would share a brief summary of some of the most important “do’s and don’ts” when I comes to collecting and storing your data. I won’t try to explain or justify all of these rules, but trust me, if you follow them your work will be far more efficient and effective:

  1.  Always create a key field. This field should uniquely identify each record. I recommend you identify each record with a simple incrementing number, one that has no real-word meaning and that you never report to external users of the data.
  2. Don’t create derived fields. It is generally not good practice to store fields of data that are derived or calculated from other fields. It is better to compute these when needed to ensure the values are current.
  3. Make sure every column or field of data is atomic. That is, each field value should contain only one irreducible piece of information. You never want to put say, two phone numbers in one field. Store only one piece of information per field by creating separate columns if necessary.
  4. Think ahead carefully about how atomic a field needs to be.  Will I need to break up phone numbers into separate parts? You have to consider how the data will likely be used and the best choice is still not always obvious. Do you ever need to know the street name in an address? If so, you might consider storing the house number and street in separate fields as this is extremely difficult to parse. Will you need the month from a date? You probably still don’t need to store that separately since it is extremely easy to extract months from dates when needed.
  5. Avoid series of columns. You don’t want Phone1, Phone2, and Phone3 fields. Better to have a separate “phone numbers” table with a record for each phone number along with the type of phone linked through a key field. A relational table like this is much more flexible, efficient, maintainable, and expandable.
  6. Don’t stick data where it doesn’t belong. When you don’t have a column, it is tempting to just jam some values into a column where they do not really belong. For example, you have a pregnancy field that does not apply to men. So why not stick prostate indicator into this column in records for males? This is a major no no. Every value in a column should describe only one attribute.
  7. Don’t append or prepend. Always avoid appending or prepending data values to add further information. For example, adding “fax” or “mobile” after phone numbers. Instead add a separate phone type column or better yet create a separate relational table to store these values.
  8. Don’t vary record types in a given table. You can have a column that indicates whether the record is a Parent or Child, but you should not then change the meaning of subsequent fields depending on whether the record belongs to a parent or a child. If parents and children require different information, create separate tables.
  9. Never use “special” values. Never put in a reserved value like “NONE” or “1/1/1999” “to indicate some special condition. This is a very bad practice that inevitably results in errors, hair-pulling, and tooth-grinding.
  10. Give your columns clear and meaningful names. Don’t use cryptic names. If the column contains first names, simply call the column “First Name.” This makes it completely unambiguous, reduces errors in analysis, and makes reporting clear, consistent, and easy
  11.  Format your data consistently. For example, store SSN with hyphens or without as you prefer, but don’t mix these. With hyphens is preferable since it is best to store data so that it does not require formatting upon reporting.
  12. Use Null and Empty values correctly. -Assuming your data storage system supports these, use them correctly. Null means that value was never entered. Empty means it was entered, and it was explicitly entered as empty.
  13. Allow Null and Empty responses on your forms. Forcing users to put in an answer they don’t know yet just to save the form only opens the door to garbage data. If you force an answer too early, you may be forcing the user to make something up just to move on, and later there is no way to spot this as “fake” data. Better to wait until the last possible point prior to finalization before verifying that all data entry is complete.
  14. Don’t allow bad values to be entered. Provide dropdown or selection lists wherever possible rather than text data entry. Text (freeform) data values are usually garbage data values that are very problematic to search, analyze, and report.
  15. Don’t duplicate any information. No information should be repeated in rows. For example, you should have one parent table with customers and another child table with orders records for each customer. You should not have one “flat” table which repeats the same customer information on every order record for that customer. This is one key difference between simple “flat table” data storage and a normalized relational database.
  16. Break the rules when needed. The only thing worse than not following these rules is mindlessly following them. Break these rules only when it makes good sense to break them. You can’t make that assessment if you do not understand them, and the reasons for them, intimately.
  17. Think long-term. Don’t assume you’ll never need your data again or that you will be the only one to ever look at it. You may know how you violated these rules, but others might need to understand this data in the future and they will not. Following these rules will ensure that this valuable data you collected is not wasted just because you could not anticipate how they might be used in the future.

Some of these rules simply cannot be achieved using a single (denormalized) table like you typically create with Excel. They require multiple tables following a normalized design structure of parent and child tables.

If you don’t see any way you can avoid breaking these rules, then your data storage requirements probably exceed the limitations of a simple table. If that is the case, the fact that you cannot implement these rules should alert you that you need to consult a data management professional to produce an efficient relational database schema using SQL Server or some other professional database.