A Non-Developer’s Guide to Relational Databases
This document will introduce the basic concepts of relational databases and the basic processes and best practices for the modeling of logical domains of data.
How This Document Came About
I am a developer. I have been working in application development and databases for almost two decades.
Corey is an employee of my company. He is not a developer. His title is “User Interaction Strategist,” which is a fancy title for The Guy Who Helps You Figure Out What You Really Need.
What Corey does is help you take your very abstract idea of “we need a new application” and figure out what that actually means. Why do you think you need a new application? What audiences will this application serve? What do those people want from it? What specific tasks are they trying to accomplish? What do we need to deliver to these people to make them happy?
Corey takes in all this information, synhtesizes it, and outputs a plan which tells a developer, “this is what this person really wants to do.” (In the enterprise, Corey would likely be called a Systems Analyst.)
One day, Corey and I met with a company who really had very little tracking around the core function of their job. This company made deliveries of a sort, and they had a handful of old applications to help them work out various aspects of these deliveries:
- One application would help them determine sales tax for a specific type of delivery.
- Another application would help them write a quote for a delivery.
- Another application would help them track the equipment used in a delivery.
And so on. These applications were completely isolated, knew nothing about each other, and had been abandoned by their IT departments. The client wanted something integrated so they could get a better view of their work.
As the client was talking about their problem, Corey was sifting through screenprints of the various applications they provided, and I was drawing boxes and arrows in my notebook. The client would say something, and I would draw another box, or cross a box out, and replace it with an arrow. Corey would drop one piece of paper and pick up another one.
When the meeting was done, I don’t know how Corey felt about it, but the problem was crystal clear to me. As the client had been talking, I had been distilling their problem down to pure data concepts.
This was quite different from how Corey was looking at it:
I was analyzing the problem at the data level. The doodles in my notebook were actually a crude form what what’s called an entity relationship diagram , which visualizes how different pieces of data fit together.
Corey was analyzing this problem from the user interface level, by examining each of the the different screens and interfaces that a user would touch to do parts of their job.
After the meeting, it occurred to me that the right way to search for a solution is to examine both sides of the problem at the same time.
While the user interface is what the user is interacting with at any given moment, a user’s higher level understanding of the data they work with is rooted in how the designer of an application has modeled that data. If the person who designed an application centered it around a specific type of data, then the interfaces for that application, and many of the ways the application will behave will be centered around that piece of data.
There’s a common premise that gets thrown around a lot in the design world:
“The interface is your application.”
The idea behind that is that it doesn’t matter what your application does, only how the user interacts with it. This is all they see, so to them this is the application.
I don’t subscribe to this. Users work with interfaces only to achieve a higher-level goal. Yes, that mortgage application is pretty and very intuitive, but what I’m really trying to do is input the data necessary to start getting a mortgage. I love all the design work that went into that car-selling application, but what I need is a clear understanding of how to form this search to find that Porsche 911 I’ve been dreaming about since I was a kid.
So, to this end, I offer this:
“The data is your application.”
Interfaces manipulate data. Get the data wrong, all the design in the world isn’t going to solve your problem and the entire application will suffer for it. In this sense, the interface level of an application is simply a reflection of how the data for that application has been modeled.
These interfaces can certainly be done well or poorly – and a visual designer is critical in this process – but if you tell a designer to give you an interface for working with a widget, and what the user really needs to work with is a doodad, then all the Photoshop in the world isn’t going to work that problem out.
I realize, of course, that I may have ignited a Holy War with this comment. Before I get any impeccably designed hate mail, let me say that my point is not to denigrate design, but at least elevate the data modeling behind an application to the same level as the pretty face that gets put on it.
It was right about this moment that I realized that Corey need to understand the concepts of a relational database so that he could develop the ability to examine problems from both a pure data perspective and from an interaction perspective at the same time.
Later on, while writing this document, I came to acknowledge that relational databases were simply a convenient medium through which to teach basic concepts of data modeling. The term “databases” encompases a large variety of information storage methods, but they’re all centered around rules which govern the management of data and require you to think in pure data concepts. Relational databases provide a very good introduction to thinking this way.
This document is a wild expansion from the original set of notes I created to teach Corey how relational databases worked.
Why Should a Non-Developer Care About Relational Databases?
If there’s anything non-sexy in the world of information, it’s probably databases.
Correction: If there’s anything non-sexy in the world, it’s probably databases.
There’s likely no drier subject – nothing deeper, darker, and more under-the-covers than systems which do nothing but hold...data. In fact it can almost be said that databases are designed to make data boring. Because “boring” is often a euphemism for “stable, clear, and unexciting.”
And this is this what you want with your business-critical data:stability, clarity, and a distinct lack of excitement.
So, as a non-developer, why should you spend any time learning any more than you have to about them? These are for people with neck beards, right?
Here’s why –
Because databases are pervasive
The fact is, you interact with databases all the time. Unless you never use anything but entirely transient applications that don’t need to persist data longer than the time you spend actively working with them, then that data has to be stored somewhere
This data is the only thing that gives that application any context or continuity from session to session, and it’s likely stored in a database somewhere. If not for this data and our ability to manage it, all your web applications would simply forget you from visit to visit, and where would that leave us?
Databases are really the collective memory of the enterprise and of the Internet as a whole. Everywhere you go, you leave digital footprints in a database somewhere, and it’s largely because of these that information technology doesn’t forget everything it knows from second to second.
Databases, therefore, are perhaps the most core part of an IT infrastructure after the computing and networking hardware itself. Almost every application in the enterprise uses a database in some form, and this is the backbone of the application’s functionality. Everything many applications do is simply create and manipulate data in a database somewhere.
Most web applications, in fact, are simply wrappers around a database that let users view the data in different ways in an attempt to draw some context out of it and have it makes sense to their current task.
Many enterprises are purely based around the data in their databases, and the management and security of this data is synonymous with the business itself. Banks have all their customer information and balances in databases. No data means no business.
Never forget this –
That balance you think you have in your checking account? It’s just a record in a database somewhere, and the only reason you “have” that money is because the database says you do.
So, sleep well tonight.
Because they affect almost every development project
For those of you in the creative industry, you’ll come into contact with databases more often than you think, often in the context of a development project. This will bring you into contact with database architects and engineers and require you to speak their language.
When they start throwing around terms like “referential integrity” and “normal forms,” you need to be able to understand what applies to you and what doesn’t.
Perhaps more important than speaking their language, however, you need to be able to think like they do and put yourself in their shoes. You might propose things that they’re uncomfortable with, and when they resist, you need to understand why so you can discuss and resolve the problem the best possible way for both sides. You need to have a basic understanding of the pressures they’re under to perform in their roles, so you don’t inadvertently push them to do things that cause them to push back because of some reason four levels removed from what you’re thinking.
Why are they so uptight about something called a “primary key”? Why do they always have to know if this piece of data is an exception, or if there are going to be more like it? Why do they want our logs written to files instead of a database table?
As we said before, databases are the core of many applications. Building an application without understanding how an absolutely key piece of infrastructure works is like building a house with no understanding of how a nail holds two pieces of wood together.
Because reification is a core information skill
Databases the most primary example of the need to operationalize logical concepts. Databases are where vague ideas are solidified into actual, manageable units of information, and the skills around doing this are critical for anyone working in IT.
In many projects, goals are discussed in very conceptual and abstract ways.
“You know what we need about here? A database of all our of press releases!”
While this is certainly a noble goal in the abstract, to address this need in practical terms, you need to operationalize this information – you need to drag the concept of a press release down to some manageable format.
To do this, you need to be able to distill a press release into a pure information unit, then dissect it into its constituent parts, define which combination of those parts represents logical objects, and determine how those objects relate to each other.
The proper name for this process is reification , which is literally translated from Latin as “to make a thing.” It refers to the process of taking an abstract concept and turning it into a concrete thing.
The more common name for this is data modeling , which is an extremely valuable skill, both in the direct sense of what you do with it, but also in the indirect sense of the secondary effects this process will have on your project.
Nothing helps define a project more clearly than asking hard questions about very abstract concepts in order to model these concepts in some structured format, like that which a database requires. It’s at this stage of a project where the needs become clearer, the goals become sharper, and the scope starts to take shape. If your overall goal is unreachable, it’s often at this point where that becomes clear. I would wager that more projects have been scrapped at the data modeling phase than in any other, because it’s at this moment that absurdities stick out like sore thumbs.
Databases are the gears through which abstract ideas and pleasant notions are ground through in order to see if they still make sense in the stark, functional arena of tables, rows and columns.
The ability to understand and manage the tools to accomplish this process is a core skill of working with information in any capacity.
History, Purpose, and Usage
As a generic term, a “database” is any structured source of data which can be manipulated and searched. A directory of text files could be considered a database, as long as they were in such a structure that we could search through them and find a particular piece of information we needed.
A “relational database” is a source of data in a specific format involving interelated sets of tabular data. The core concepts of the relational database were invented by Edgar Codd of IBM. In 1970, he published a paper entitled “A Relational Model of Data for Large Shared Data Banks” which established how a relational database should work in theory. All of today’s relational databases have been based on these theories. (Codd is often referred to as “the father of the relational database.”)
There are other kinds of databases – XML databases have been around for a while, and so-called “object databases” are all the rage right now – but in 99% of cases, when someone in this industry says “database,” they mean a standard relational database. For the purposes of this document, “database” and “relational database” are equivalent terms.
Very lightweight consumer-oriented databases can run completely on your local machine. The most common PC database is Microsoft Access (part of the Office suite). For Macs, it’s Filemaker.
However, at Blend, when we say “database,” we’re usually always talking about a system that runs in the background of a centralized server. These are systems that can manage data for multiple different applications. You may have applications on a dozen different servers all talking to the same database server.
The nomenclature can a little tricky here – a database server is a process running in the background of a server. It usually always manages multiple databases , even hundreds at the same time. These databases can be for completely separate things and have nothing to do with each other (much like a web server can serve hundreds of independent web sites).
Databases are the most industrial strength example of enterprise software. Very few other types of software are expected to support the sustained load and data integrity needs of databases. In many organizations, every piece of information the company needs to do business is in a database somewhere. No other type of software has been optimized, refined, tested, and developed to the level that database servers have been. Some of most cutting-edge research in computing and some of the most arcane computing architectures have been developed to make database servers perform faster and more reliably.
Companies take their databases very seriously, and spend massive amounts of money installing and maintaining them. There’s an entire job description dedicated to databases : the DBA (“Database Administrator”). All these DBAs do all day is plan, design, manage, and monitor the various databases of a company. Experienced Oracle DBAs, in particular, are likely the highest paid people in IT, short of C-level management.
(In IT lore, DBAs are also considered the crankiest, most easily irritated type of employee. Think of a troll trying to guard his treasure, and you have a rough idea of the perception. Generally speaking, no one enjoys a conversation with a DBA. And if you screw up their data, God help you.)
The central database server of a larger organization will traditionally be load-balanced across several servers (even an entire data center), and replicated in real-time to multiple physical locations to guard against data loss. For a company like a bank, the records in their databases is the business. The company doesn’t exist without it.
Microsoft’s commercial database system is called SQL (“sequel”) Server. It only runs on Windows servers. SQL Server is extremely common in the enterprise, and most every company with a Windows server architecture (90% of all U.S. companies, at least) will be running one or more instances of SQL Server.
The big competitor to SQL Server is Oracle , which runs on a variety of platforms. Oracle is vastly more expensive. It’s licensed by processor, and you can easily pay $100,000/year for an Oracle license on a machine with many processing cores. You see Oracle less and less these days as SQL Server has gotten better and better. Blend only has one client running Oracle (Imation).
The open-source system for Linux that most apps use is called MySQL (we say “my-sequel,” though you might hear “my-ess-cue-elle” too). Another common one on Linux is PostgreSQL (commonly just “Postgres”). These two systems offer less functionality than the commercial options, but are often considered good enough. (Additionally, MySQL is well-known for being extremely fast, as it’s considerably stripped down.)
Some other less common database platforms are DB2 , Firebird , Informix , SQLite , Sybase , and Ingres. (Trivia: the Firebird database server is why Mozilla’s browser was forced to change its name to “Firefox.”)
Database servers have no user interface – they exist as services that a server provides. You have a program on your local machine to connect to the database server to query or manipulate the databases residing on it – much like a web browser provides a temporary connection to a web site.
For instance, SQL Server runs on your server, and on your local desktop machine you install SQL Server Management Studio which connects to a SQL Server instance somewhere out on your network and lets you manage it. Oracle has something called Toad (for “Tool for Oracle Application Developers”). For MySQL, there are many tools, but the common one in use at Blend is called Navicat.
Additionally, most all platforms have a command line tool as well, which is text only and will only execute SQL (see below for more on SQL). With this, you can manage a database server with nothing but a command prompt which is helpful if your only interaction with a server is through a text-based connection like SSH or Telnet.
(For consumer databases like Filemaker and Access, the database itself and the user interface are combined together.)
All of these systems operate in fundamentally the same ways, as defined by Codd in his 1970 paper.
Databases today are a lot like spreadsheets – you can count on them to have a core set of functionality, like all spreadsheets have worksheets, rows, columns, functions, etc., though they will also each have their own idiosyncrasies and flavors to them. One system might have features that another does not, but they all implement the core principles in much the same way.
When you build an application, you have to carefully decide if you’re going to use a certain idiosyncrasy of a specific database platform. If you do, then your application is now forced to use that specific database platform – you now depend on that idiosyncrasy. Usually, it’s smarter to just use the core features, so that your application can use multiple different databases depending on what the customer owns.
Tables, Columns, and Rows
It’s important to understand that a database is not a single thing as much as it’s a container for many different objects.
The core object type is the table. This is where all of the actual data resides. All other database objects are in service of tables and the data they contain.
A database may have a single, dozens, or hundreds of tables. An average might be 7-10. A database with 30 tables would be considered big, and 50 tables would often be considered excessive. (Bigger systems do exist, however – the Oracle Financials product uses a database with almost 6,000 tables, though this is the largest I’ve seen by several orders of magnitude.)
Every table has a unique name. This should describe the data it contains. Table names traditionally do not have spaces (though some systems will allow it), but naming standards vary widely. You can see “AccountReceivable” “accounts_receivable,” “ar,” “accountsreceivable,” etc.
A table is like an spreadsheet – it has rows and columns. Columns are set up beforehand, and form the “structure” of the table. Rows are inserted and deleted as necessary and form the “data” of the table.
Columns are alternately called “fields.” Rows are alternately called “records.” I will use these terms interchangeably throughout this document.
A table represents a type of object – “Customers,” for example. A row in that table represents an instance of that type of object – a specific customer: “Bob Johnson.” You would have as many rows in that table as you have customers.
Columns represent some aspect of the object in each row. In our Customers table, we would have a column for each piece of data we wanted to track about each customer – “FirstName,” “LastName,” “BusinessName,” etc.
Every column has a name, unique to that table. This describes the data that’s in it. “FirstName,” “City,” “AccountNumber,” etc. This is very similar to the first row of a spreadsheet which contains the column headings (though, unlike a spreadsheet, the column headings do not form a row by themselves).
The same naming standards (or lack thereof) for tables also apply to columns. Some organizations include a letter prefix to specify what’s in it (“dHireDate” where the “d” means it contains a date; “iAge” where the “i” means it contains an integer), but this trend is falling out of favor. One very common convention that persists is that boolean fields (see below) will begin with “Is” or “Has”: “IsUpdated,” “HasNotification,” etc.
After a database is created, the columns don’t change unless you’re restructuring your data. Rows, on the other hand, are very volatile. Most database systems are designed to insert, change, or delete tens of thousands of rows every second, if required.
A table might have no rows (it’s considered “empty”; this is a temporary condition, as a table that was always empty would have no reason for existing) or it might have a billion rows. Some reporting databases in large companies have row counts into the trillions. (We maintain an analytics database at First Premier than has about 20 million rows and adds 30,000 a day – about nine rows every second during peak business hours.)
Columns are ordered in a table from left to right. There’s no reason for this except to make the data easier for humans to understand. The database itself doesn’t care about the order of columns.
Columns have datatypes. This limits the data that can go in that column. Some common datatypes (the actual names for these vary widely from platform to platform):
Text (called a “string” or “varchar” for “variable character”). This is usually limited to a few thousand characters.
Long Text. Will accept theoretically unlimited amounts of text.
Number. There are usually multiple specific datatypes for this, depending on the size of the integer you expect to store – TINYINT, BIGINT, etc. – and whether you expect it to have decimal or negative values.
Date or Datetime.
Boolean or Bit. A yes/no, true/false value.
Blob. For “binary large object.” This is for binary encoded data so you can store actual files in the database, like images. (Whether or not this is a good idea is a subject of a great debate.)
Depending on the particular database product you’re working with, there are anywhere from 10 - 20 different datatypes you can use on your columns.
Columns often also have a size or precision. For most datatypes, you need to tell the database the maximum size of data that can go into it (though, as noted above, this limit is sometimes built-into the datatype itself). This is mostly to allow the database to plan file usage – it needs to know how to structure its data files on disk, and how much space to reserve for each record. If you make a column too large, the database will reserve more space than is necessary. This might seem minor, but if you multiply this by a million rows, it can cause issues.
Many database systems also allow you define custom constraints as well. If your product IDs are a letter followed by four numeric digits (“A3287”), most systems will allow you to create a constraint to enforce that all values in that column conform to that pattern (or, again, the database will refuse to insert the row).
You cannot put conflicting data into a column. If you specify a column as an integer, and try to put the word “gumby” in it, it will refuse to insert your row. Databases are designed to be protective of their data – they will actively reject data that does not conform to the established parameters of what is acceptable.
(However, understand that a database can only be as protective as you tell it to be. It only knows what to accept because you designed it that way. Poorly designed databases will take anything. If you define every single column as a string, then you’re effectively not enforcing anything because any value can technically be a string.)
Columns may or may not be allowed to be null. A null value means “no value.” This is important, because a column like a number would otherwise have no way to specify that it has no value (even zero is a value, after all). Important columns should never allow nulls. For example, if you have a database of people, there’s no reason the columns “FirstName” and “LastName” should ever be null, so you should disallow this as a value.
(What gets weird is that an empty string is not a null value. So “” is technically a value – it’s an empty, zero-length string, and it can still go into a column which doesn’t allow nulls. This makes sense technically, but not logically. It would be bad form to put an empty string in a field where the logical value should really be null.)
Every table should have a special column designated as its primary key. This is a unique value for each row that you can use to identify that row. No table can have two records with the same primary key. Primary keys can be explicit or derived.
A derived primary key is a value that serves another purpose, but also happens to be unique. If you have a table of people, social security number might be a good one (so long as they are all living U.S. citizens). If you’re the phone company and have a table of assigned phone numbers, the phone number itself is an obvious key.
If explicit primary key is a column which exists for no other reason than to be a primary key. These are usually what are called identity columns. They are auto-incrementing numbers. The first row gets a “1,” the second row gets a “2,” etc. This column is usually always called “ID” and is usually always the first column in the table. The number is this field is assigned by the database, and has no actual relation to any of the content in the row – it’s just an identifier for that specific row.
Most primary keys are explicit. It’s quite common to have an auto-incrementing integer column called “ID” in every single table.
Sometimes, a table can have a composite key , or a key that spans more than one column. This is used when there is no single unique column, but the combination of one or more columns is guaranteed to be unique. (This will make more sense in the context of one of the type of table relationships, described below.)
At their basics, tables are very simple. They get much more powerful when you start to combining them tables together.
The Art of Data Modeling
Databases have to be designed to manage a particular type of data. The tables have to be created with the correct columns and relationships to other tables to accurately represent the data we’re trying to store and manipulate.
This is done in various degrees of success. A database can be well-designed so that it supports the needs of the data, or poorly-designed so that it becomes a never-ending source of problems and actually damages the data that it stores.
Designing a database is therefore an art form. There are some core concepts that can be taught, but a lot of valuable lessons simply come from designing a lot of databases and learning from a lot of mistakes.
The design of a database is often called a database schema. (Though, Oracle actually gives special meaning to the word “schema,” which confuses things a bit when working with that platform.)
The most important aspect of designing a database is to correctly divide up your logical concept (the real-world thing you’re trying to represent) into appropriate tables and columns (and appropriate relationships – see the next section).
A table should represent a single logical type of object. The trick is reviewing your domain of content and determining what object types exist. Once you start listing the attributes of an item (the information you want to track), it becomes fairly clear which ones are different from the others and need their own tables, but be prepared to make mistakes here – you will often look at a set of columns and have the sudden realization that they don’t fit in the table you’re creating, and actually represent an entirely different thing that needs its own table.
The cardinal rule of table design is that a single column should represent a single thing, in the smallest reasonable unit. You can divide a piece of information into multiple “chunks.” How small these chunks get is known as granularity.
Consider the Address field in a database of our customers. We could create a string field simply called “Address,” and store this value in it:
231 S. Phillips Ave., Suite 200, Sioux Falls, SD 57104
This is not very granular – everything is thrown into one field. If we only ever want to use this data in the exact format above, then this would theoretically work. However, the real-world usage of this is poor.
- What if we wanted all customers with an address in South Dakota?
- What if we wanted all customers sorted by zip code?
These are simple, common questions. But by combining all that address data in a single field, we’ve limited our ability to work with the component parts of it. Things like the state (“SD”) and the zip code (“57104”) are “trapped” inside that single field. We can’t filter or sort on these values.
At the opposite extreme, let’s take the address to its most granular form:
- Street Number (“231”)
- Street Direction (“S”)
- Street Name (“Phillips”)
- Street Suffix/Modifier (“Ave.”)
- Partition Name (“Suite”)
- Partition Number (“200”)
- City (“Sioux Falls”)
- State (“SD”)
- Postal Code (“57104”)
This represents the highest level of usability and accessibility of this data. With the data modeled this way, you can ask questions like:
- How many addresses are odd numbers on the second floor of a city in northeast region of the country?
- How many addresses are within two blocks of the city center of all cities in Ohio?
These are very granular questions, and our proposed table design will support those answers. But this also introduces some complications, because the users of your application likely aren’t going to enter their addresses like that, so you’re going to have to break them down manually. And what if you have a legitimate address that doesn’t fit in that format (what’s known as an edge case )?
The fact is, you’ve likely made your table definition too granular, in this case. You’ve introduced more cost than benefit, and the benefits would only bear fruit in extremely specific situations which aren’t likely to occur.
(This depends on your intended usage, of course. If you are the U.S. Postal Service, and this is the central database to track all deliverable mail addresses in the country, then this is an appropriate level of detail. However, if you’re just maintaining a list of your customers, then it’s overkill.)
In reality, you would usually break and address up into fields like this:
- Line 1
- Line 2
- Postal Code
This gives you a nice balance between granularity and usability. And this is one of the core decision points of designing a database – _how granular should you model the data? _Many situations are obvious, others are less obvious. Learning how to balance granularity is a skill that only comes with experience.
Sadly, mistakes made in this stage are very dangerous. Imagine if you modeled all your addresses in a single field as shown in the first example, and then decided later on (after 10,000 records) than you wanted to change to the correct format in the last example? How are you going to pull apart all that data and get it in the correct fields? There are very few good options for this.
Another key aspect of designing a table is determining when a column doesn’t store a simple value, but rather stores an entirely different object, that should be managed in its own table.
This is the subject of our next section.
The One-to-Many Relationship
In the real world, objects are related to one another in many different ways. One of the most common is a parent-child relationship, where one object “owns” another object.
- A father and his children.
- A shopping cart and its items.
- A building and its rooms.
In all of these cases, the relationship is called a "one-to-many” relationship because the relationship is single in one direction, but multiple in others.
- A father can have many children, but a child can only have one father.
- A shopping cart can contain many items, but a specific item can only be in one cart.
- A building can contain many rooms, but a room can only be in one building.
This is the core type of relationship – one of parentage. One object is a parent of multiple other objects, and each one of those child objects only has a single parent.
Let’s demonstrate this by creating a table to store all the buildings on a fictional college campus. First, we’ll do it the wrong way, then we’ll correct it.
Building ID (integer, not null, auto-increment, primary key) Name (string, not null) Rooms (string, null)
In the Rooms field, I will just enter my rooms, separated by a comma, like this:
Bob's Office, The Main Theater, The Broom Closet
This demonstrates the problem of granularity that we discussed previously. I am using one column (“Room”) to represent multiple things – three different rooms. Thus, it would be hard to ask a question like this:
How building on campus has the most rooms?
What am I going to do...count the commas in the Room field? And what happens when my boss wants me to also store the date each room was last cleaned? How could I possibly do that?
Maybe I can make it a little better by changing my table to look like this:
Building ID (integer, not null, auto-increment, primary key) Name (string, not null) Room1 (string, null) Room1LastCleaned (date, null) Room2 (string, null) Room2LastCleaned (date, null) Room3 (string, null) Room3LastCleaned (date, null) Room4 (string, null) Room4LastCleaned (date, null)
I hope the problem with this is clear. I am limited on the number of rooms, or else I have to keep creating columns ad infinitum – in this situation my number of columns has to be twice the number of rooms of my largest building, which just isn’t manageable. (And finding out the building with the most rooms is still very difficult under this design.)
The key here is that a “room” is actually its own object, much like a building. It needs its own table, with each room record clearly “owned” by a parent record in the Building table.
In a database, this is represented by relating two tables together. A record in one table can “own” multiple records in another table.
Let’s consider a building and its rooms. In this scenario, the building is the parent, and the room is the child.
To represent this, we have two tables.
Building ----- ID (integer, not null, auto-increment, primary key) Name (string, not null) Room ----- ID (integer, not null, auto-increment, primary key) Name (string, not null) LastCleaned (datetime, null)
To tie these two tables together, we add a column to the child table – the Room table. This column is “special” because we tell the database that it’s a foreign key – a reference to a primary key in another table. So, a column on the child table references the ID of a record on the parent table.
Here’s our new Room table:
Room ----- ID (integer, not null, auto-increment, primary key) _BuildingID (integer, not null, FK to Building/ID)_ Name (string, not null) LastCleaned (datetime, null)
When we specify that our BuildingID column is a foreign key to the ID column on the Building table, this tells the database that every row in the Room table must tie back to a record in the Building table.
You can name a foreign key field anything (it’s just another column, after all), but it’s conventional to name it for the table and column it references. Hence, referencing the “ID” field of the “Building” table results in a name of “BuildingID.”
When ordering columns, foreign key columns are often lined up directly to the right of the ID column – after it, in sequential order.
Note that our BuildingID foreign key is not nullable – there has to be a value, and the database will ensure that the value is only one of the values that appears in the ID column of the Building table.
(Can you have a nullable foreign key? Yes, but it’s rare. There are limited instances where you want to represent that an object may be a child of another object, but might stand on its own with no parent. In this case, BuildingID could be null, but if it has a value, the value has to tie back to a Building record. This is not common.)
With this design, a Building can have multiple room records. It might have none (so no Room records contain a Building’s ID in their BuildingID column), though this would logically be odd (what kind of building has no rooms?). On the other extreme, the single record for The Pentagon or The MGM Grand in the Building table might have 5,000 child records in the Room table. With this model, the “depth” of the child records in the relation to the parent – the number of rooms a building can have – is theoretically unlimited.
So what happens if we try to delete a Building record while there are Room records tied back to it? The database will not let us. This is called referential integrity – the database protects its data by making sure the relationships or references stay valid. To delete a Building which still “owned” Rooms would mean “orphaning” Rooms, and that would be data anomaly , which the database will prevent. To delete a record in the Building table, you have to delete all the related records in the Room table first.
Conversely, you cannot insert a Room record until you have its parent Building record. To insert a Room record, the database is going to require a valid Building ID, so you need that first.
Foreign keys can necessarily only reference primary keys, since the child value they has to be able to reference a single record in the parent table, and the only way to reliably reference a single record is by using a primary key. (Hence the name “foreign key” – it’s a primary key in a foreign table.)
While foreign keys are usually always integers (because auto-incrementing integers make handy ID columns), they don’t have to be. For example, you could have a table of States where the state abbreviation is the primary key (it’s unique, after all). Then the foreign key in the City table would be a string field of the state abbreviation. The database doesn’t care, but the datatypes between the two columns have to match.
A table can have multiple foreign keys. For instance, lets say we want to specify the type of a room. A room can be one of the following types:
A “room type” is a managable object (we might have other types in the future, or we may want to delete a type). So, let’s create a table for it.
RoomType ----- ID (integer, not null, auto-increment, primary key) Name (string, not null)
Now let’s change our Room table to add another foreign key:
Room ----- ID (integer, not null, auto-increment, primary key) BuildingID (integer, not null, FK to Building/ID) RoomTypeID (integer, not null, FK to RoomType/ID) Name (string, not null) LastCleaned (date, null)
Now, the database will enforce referential integrity to both Building and RoomType. The RoomType of “Office” can be considered to “own” all the office records in the Room table.
Now, to insert a record in Room, we have to first have both a Building ID and a RoomType ID in hand.
This is a lot of work, so why didn’t we just create a “RoomType” column as a string, then insert the word “Conference” or “Auditorium” in it?
You could do this, and for such a trivial example, it might be a reasonable solution. But the problem comes when upper management decides that “Offices” should now be called “Workspaces.” If we have 10,000 records in the Room table with the word “Office” in their RoomType field, we have to go back to update all of these records. Alternately, if we did this with a foreign keyed table, we just have to update one column – the Name column of the RoomType table which currently contains the value “Office.”
Can we do this? Sure. And, again, for this example, it’s perhaps not a big deal. But what a foreign key allowed us to do was link to the concept of something, not the specific label for something. What we wanted to represent with our RoomType table was the concept of a space where one person worked at a desk. We created a row in the RoomType table to represent that concept, to the exclusion of other concepts, and we just happened to label it “Office.” If we foreign keyed correctly, we can change the label to “Workspace” without disturbing the fact that 10,000 Room records are representative of that concept. Correctly foreign-keyed, the current label becomes incidental to the concept, which is the goal.
As databases get bigger and bigger, you get more and more foreign keys to the point where very few tables are “orphans” – every table is, in some way, related to one or more other tables.
The process of discovering and creating these relationships is called database normalization.
Other Relationship Types
So far we’ve discussed the one-to-many relationship, which is, by far, the dominant relationship type. But there are two others to be aware of.
A one-to-one relationship is rare, but it happens occasionally. In this case, one record in a table is related to exactly one record in another table.
Why is this rare? Because if those two records in those two different tables are so closely related, why not just add the columns from one of the tables into the other? It’s a valid question, and if you find yourself creating a one-to-one relationship, you need to ask it.
However, an instance when you might not want to do this is when relatively few of the records in one of the tables needs the “extra” data provided by the other table.
Going back to our Room table, let’s say management wants to track which service rooms have a floor drain and which ones have a faucet. We could just add HasFloorDrain and HasFaucet columns onto our Room table, but if only 3% of our records are for service rooms, then we end up with extra columns in service of very few rows – columns that will be null (or false) in 97% of records.
Is this a big deal? Honestly, probably not, and in most cases you would just add the columns. But in some edge cases – like tables with billions of rows – this is a legitimate issue in term of data storage size and other considerations. And say management wants to track SeatingSize and HasSoundSystem for auditorium rooms. Now you have two more columns in service of another small group of rows. This scenario could play out over and over.
Additionally, you may not have the authority to change the Room table. Perhaps that table is part of a commercial software package you can’t modify, or modifying it would cause other problems.
Permissions might be another reason. Perhaps there’s a security consideration in allowing someone to know a room has a floor drain, and you don’t want people who have access to the Room table to have access to this other information.
So, we’ll create another table:
ServiceRoomFeatures RoomID (int, not null, primary key, FK to Room/ID) HasFaucet (boolean) HasFloorDrain (boolean)
Note that this table has no auto-incrementing primary key. It doesn’t need one. We can guarantee that there will only be one record in this table for (at most) each record in Room, so the RoomID should remain unique. To put it conversely, we won’t have two records in ServiceRoomFeatures with the same RoomID (what would be the point? To say that it has a faucet...twice?).
This table can be looked at as an extension of the Room table. For selected records of Room, it “tacks on” a couple fields at the end.
(PostgreSQL has an interesting feature involving one-to-one relationships. You can “inherit” tables, so that ServiceRoomFeatures that be explicitly specified as an extension to Room, and the data is viewed, it automatically adds all the fields and data from Room onto ServiceRoomFeatures to present a “virtual” table, as if the two tables were permanently joined on that relationship. While a neat feature, it’s specific to Postgres, so it’s not something used by many applications, lest they get locked into Postgres.)
What’s interesting is that the only enforcement of the one-to-one relationship comes from the fact that RoomID is both a foreign key and a primary key. It’s a foreign key to Room, so it has to exist in Room, and it’s a primary key on its own table so it has to be unique to the ServiceRoomFeatures table. Those two things combined will ensure this there is only a single record in this table for a companion record in the Room table.
(If this example seems contrived, it is. One-to-ones are rare, and I think in my entire career, I can count the number of times I’ve done this on one hand.)
The last relationship type is also the trickiest – the many-to-many relationship. This is where a record in Table A can “own” many records in Table B, but the record in Table B can also “own” many records in Table A. This is fairly common.
One college professor has many students, and each student has many professors.
One category can hold many news articles, and each news article can be in many different categories.
One project can have many different team members, and one team member can work on many different projects.
Consider if we tried to model the first example as a one-to-many relationship instead: One college professor has many students, and each student can only have one professor. Obviously, that’s wrong. How about, one student can have many professors, but each professor only has one student? Wrong again. The fact is, this relationship is “many” on both sides.
Many-to-many relationships are complicated by the fact you need another table to make them work. This is called an interstitial table , and it sits between the two other tables in the relationship. An interstitial table is a normal table like any other, but it serves the purpose of “gluing” two other tables together, hence it’s referred to as “interstitial.”
Back to our building and room database, management has now decided that a room can actually serve several purposes. You can take an office and use it as a conference room, so they want to know all the possible usages for a space, not just the primary usage.
This is clearly a many-to-many relationship:
One Room can have many different RoomTypes, and each RoomType can be assigned to many different Rooms.
To create a many-to-many relationship, you put a table “between” the two tables in the relationship, and fro it, you form one-to-many relationships to both of the other tables.
RoomTypeAssignment RoomID (int, not null, partial primary key, FK to Room/ID) RoomTypeID (int, not null, partial primary key, FK to RoomType/ID)
This table has two rows, both foreign keys to other tables. Each row in this table represents the “marriage” of a row from the Room table and a row from the RoomType table.
It’s also representative of the composite primary key we talked about before. The primary key for this table doesn’t exist in a single column. If a particular Room has more than one RoomType, then its ID will appear more than once. If a RoomType is assigned to more than one Room, then its ID will also appear more than once. What becomes unique is the combination of the two – there should be no row with the same combination of RoomID and RoomTypeID (what would be the point? To specify that this room is an office...twice?)
(Could you add an auto-incrementing RoomTypeAssignmentID field and use that as a primary key? Sure, but it’s bad form, and it opens the possibility that you could get a record with a duplicate combination of RoomID and RoomTypeID, which would be a data anomaly, so it makes sense to use a composite key here to ensure uniqueness.)
Interstitial tables are often named in a such a way to indicate that their purpose is completely interstitial. One convention you often see the the two other table names with a numeric “2” between them. In this case, “Room2RoomType.” Literally, this means “this table connects a Room to a RoomType.” This convention is a dead giveaway that this is an interstitial table.
What’s implied in that naming convention, however, is that the only purpose of an interstitial table is that it connects two other tables, and this isn’t always true. An interstitial table can also have its own data that informs or describes the relationship.
Let’s say that we want to include a note with our room type assignments. So that if someone wants to use an office for an auditorium, they can find out where the sound system is kept, or how to lower the projection screen.
Where does this data belong? On the Room table? No, because a Room can have more than one RoomType, so it would need more than one note. On the RoomType table? No, because a note is not related to a single type, but rather to the usage of that type in a particular room.
Our note, in fact, is specific to the relationship of a Room and a RoomType. The interstitial table models this relationship, so we put the note there.
RoomTypeAssignment RoomID (int, not null, partial primary key, FK to Room/ID) RoomTypeID (int, not null, partial primary key, FK to RoomType/ID) Note (string, null)
Now we can include a note that says, effectively: “When using this room for this purpose, here is some information that would be helpful.”
In this case, the table is can be thought of as another object in itself – a “room type assignment” (hence the name). This assignment has a room, a type, and some extra information, but each record is a clear instance of this type of object. This assignment can be created and deleted without affecting either of its constituent parts.
Finally, it’s important to note something subtle which you may have already noticed – a many-to-many relationship doesn’t really exist. What we think of as a many-to-many relationship is really just two one-to-many relationships off of the same interstitial table.
A database really has no concept of a “many-to-many relationship.” It can’t enforce them and gives them so special consideration or handling. As far as your database is concerned, the RoomTypeAssignment table just has foreign keys to two other tables, which represent two one-to-many relationships.
The concept of a many-to-many relationship, therefore, is purely logical. It’s fine to say “one room can have many types, and one type can be assigned to many rooms...therefore this is a many-to-many relationship!” However, in a technical sense, when implemented, this is represented by multiple one-to-many relationships and a composite key.
Once you have data in a database, you need to get it out, and there are countless ways to filter, order, and present it. The tool for this is a language called SQL (usually pronounced “sequel,” though many people will pronounce the letters, as “ess-cue-elle”). It stands for “Structured Query Language”, though you’ll occasionally hear claims that the “S” stands for “Standard” or “Simple” (it doesn’t, for the record).
SQL was developed to provide a common method to access data in relational databases.
SQL has gone through several standard revisions by the American National Standards Institute (ANSI). A very large revision to the standard occurred in 1992, hence most databases claim to be “ANSI-92 compliant,” which means they theoretically support all of the functionality in the 1992 specification. There have been five revisions since 1992, so there are a lot of further standards that may not be supported in an ANSI-92 compliant database.
Like different databases have different functionality outside of the core, they also support different SQL syntax outside the core to support their non-standard functionality. A large portion of the SQL you will ever need is standard across all databases, but slight differences do exist.
Some database vendors give their interpretations of SQL a name, mainly for branding purposes and to signify that their version of SQL includes proprietary features specific to their platform. Microsoft calls theirs “Transact-SQL” or “T-SQL.” Oracle calls their “Procedural Language/Structured Query Language” or “PL/SQL.” The only time you’re likely to hear these names used for SQL is from the vendors themselves or in the DBA job market.
SQL is a language in which you can write a query to retrieve a dataset from one or more tables. SQL is quite readable, even to the untrained eye.
A simple SQL statement looks something like this:
SELECT ID, Name FROM Rooms WHERE BuildingID = 4 ORDER BY Name
It should be fairly apparent what this does, simply from reading it. It looks in the Rooms table for any record with BuildingID field equal to 4. Then it retrieves the ID and Name columns, orders those alphabetically by the Name column, and returns them as a two-column dataset.
A query always returns a dataset, which is simply data in the form of a table. In the case of the above, that table is going to have two columns, named ID and Name, and as many rows as there records are with a BuildingID of 4 – this could be zero, one, or one million (in fact, you may perform this query specifically to find this out).
By default, columns in the returned dataset are named for the underlying column they were retrieved from. They can be renamed and it’s possible to return a column in a dataset that doesn’t actually map directly to a column in a table. For instance:
SELECT COUNT(\*) FROM Room WHERE BuildingID = 4
This will find all the rows where BuildingID = 4, but instead of returning them, it will count them up and return how many it found as a single number (the COUNT function, just after SELECT). This number doesn’t map directly to any column in the database.
When this happens, the data still comes back as a dataset, but it will simply be a one column, one row dataset. This is a bit of a cheat, but it lets SQL always return a table-like structure, even when it’s a single value. The name of the single column varies depending on the database product. In some, it will have no column name at all. In others, it might be named “COUNT(*)”.
So far, we’ve only talked about a SELECT statement, which is how you get data out of a database. However, there are two other major uses for SQL:
- Modifying data, using an INSERT, UPDATE, or DELETE statement.
- Modifying the structure, settings, and core function of the database itself (adding, changing, or deleting columns), using the ALTER and CREATE statements, and any number of other statements specific to your platform.
In practice, changes to a database’s structure are usually made using a visual tool – like SQL Server Management Studio – but in most databases, you can do all database creation and management using SQL, if you so choose. Few people do, but some grizzled veteran DBAs might. Additionally, it allows a database server to be managed by any tool so long as that tool can execute SQL.
The INSERT, UPDATE, and DELETE statements are fairly simple, so for the purposes of this document, we’re going to concentrate on the SELECT statement and all its permutations, of which there are many.
A Basic SELECT Statement
To get data out of a database, you write a SQL query that begins with the word SELECT. The most simple SELECT statement we can write looks something like this:
SELECT \* FROM Buildings
This will return a dataset consisting of all the rows in the Buildings table. The dataset will have the same columns as those in the Buildings table, in the same order, and as many rows as in the Buildings table. For all intents and purposes, the returned dataset will be an exact duplicate of the Buildings table.
Incidentally, whitespace doesn’t matter. This statement is the same thing:
SELECT \* FROM Buildings
Capitalization of keywords or reserved words like SELECT and FROM doesn’t matter either. This statement is also the same thing:
SELECT \* from buildings
However, it’s common to use all caps for keywords like SELECT and FROM to make it easier to pick them out from the names of database objects like tables.
Capitalization of table names may matter, depending on your database server and, in some cases, the operating system on which is is running. On SQL Server, for instance, capitalization of table names does not matter. WIth MySQL running on Windows, it doesn’t matter either. However, MySQL on Linux is case sensitive, as is Oracle (on any operating system).
The simplest SQL statement possible essentially breaks down into two parts:
- What I want (“SELECT *”)
- From where (“FROM Buildings”)
The asterisk is shorthand for “all columns, in their default order.”
If you only wanted the Name column, then you’d write this:
SELECT Name FROM Buildings
This would give you a one-column dataset. If you wanted the ID too, you can separate multiple columns with commas:
SELECT ID, Name FROM Buildings
This would give you a two-column dataset.
If, for some reason, you wanted the column names in the returned dataset to be different than the columns they were taken from, you can use the AS statement to rename them:
SELECT Name AS [Building Name] FROM Buildings
This will give you a one-column dataset of the building names, but the column will be named “Building Name.” Note the space, which was allowed because we surrounded the phrase “Building Name” with square brackets in the SQL.
In most databases, square brackets let you use words and phrases which you wouldn’t normally be allowed to use (like column names with spaces). For instance, say we wanted to return a column named “Select.” I would do this:
SELECT Name as [Select] FROM Building
If I didn’t do this, the database would get confused because “SELECT” is a reserved word – a word SQL uses as an instruction. By surrounding it in brackets, we’re saying “this is not an instruction in this case, but rather a word we want to use to name something.”
Using the ORDER BY statement at the end of of SQL lets us define a specific order for the records in the dataset. The sort is based on datatype, but should be fairly obvious:
- Text columns sort in alphabetical order.
- Date columns sort in chronological order.
- Numeric columns sort in numeric order.
Your database knows how to sort because of its collation settings, which dictates how different values are ordered. While dates and numbers are universal, text may be ordered differently based on the language and other factors, such as:
- For instance, does an uppercase “A” come before a lowercase “a”?
- If text has a space has its first character, do we consider that when sorting?
- Into what order do we put text that starts with various punctuation symbols?
All of these questions are answered by configuring what collation algorithm your database will use. This is set by your DBA when creating the database, and can be difficult to change after the fact. Thankfully, you shouldn’t have to give this much thought.
For these examples, we’re going to expand our Building table to have a couple of extra columns:
Building ID (integer, not null, auto-increment, primary key) Name (text, not null) NumberOfFloors (integer, not null) DateBuilt (date, not null)
Here’s an example of the ORDER BY statement at work:
SELECT \* FROM Building ORDER BY Name
This will return all the records, sorted by their Name in alphabetical order. To reverse the order, add the DESC clause (for “descending”) to the end.
SELECT \* FROM Building, ORDER BY Name DESC
(Technically, there’s an ASC modifier to represent ascending order, but since this is the default sort direction anyway, it only gets used in the case of severe anal-retentivity.)
Say we have two buildings named the exact same thing. In this case, we want them to be ordered by DateBuilt. So, we want to sort by Name first, and DateBuilt second. To do this, separate the desired fields by commas.
SELECT \* FROM Building ORDER BY Name, DateBuilt
Each field can have its own sort direction, so we can sort by Name alphabetically in ascending order, then for buildings with the same name, we can sort them by DateBuilt in descending order.
SELECT \* FROM Building ORDER BY Name, DateBuilt DESC
In this case, the DESC clause only applies to the DateBuilt column.
The ORDER BY statement comes in handy when we want to find out the top X items in a list. Say we wanted to find out the top three Buildings when ordered by Name. We could execute the SQL listed above, and get the entire list back and just count off the top three, or we could explicitly limit our returned dataset like this:
SELECT TOP 3 \* FROM Building ORDER BY DateBuilt
The TOP X statement will do just want it appears to do – take the first X rows to form the dataset, and ignore the rest. In this case it gives us the oldest three buildings in the Building table.
This also gives us a chance to demonstrate syntax differences across platforms, because this functionality isn’t very standard. TOP is valid for SQL Server, but MySQL would do the same thing by using a LIMIT clause on the end.
SELECT \* FROM Building ORDER BY DateBuilt LIMIT 3
Oracle uses yet a different method by returning a pseudo column called ROWNUM on which you can use a WHERE clause (more on that below). This column doesn’t really exist in the data, but is something Oracle adds to the dataset after sorting it precisely so you can use it to limit the resulting rows.
SELECT \* FROM Building WHERE ROWNUM < 4 ORDER BY DateBuilt
(So, what’s the standard here? In the 2008 revision, ANSI specified this syntax:
SELECT \* FROM Building ORDER BY DateBuilt FETCH FIRST 3 ROWS ONLY
DB2 used this prior to the 2008 revision. Virtually no other vendor has implemented it.)
If you don’t specify an ORDER BY, how is the returned dataset sorted? It depends on the product, but most databases will sort by the primary key field, which is often a numeric ID column.
In most of our previous examples, we returned all the rows in a table. In many cases, however, we want to find out what number of rows match some specific criteria. To do this, we use the WHERE clause.
In its simplest form, a WHERE clause looks like this:
SELECT \* FROM Buildings WHERE Name = 'The Pentagon'
The returned dataset will contain only the rows in which the name exactly matches “The Pentagon”.
This filtering is based on the part of the SQL statement which comes after the keyword WHERE, in the format “[column name] [comparison type] [value]”. In our case:
- Column Name: Name
- Comparison Type: equals (represented by “=”)
- Value: The Pentagon
Notice that the value is in single quotes. For text or dates, always enclose the value in single quotes so that your database knows this is a value. For numbers, you don’t need to use quotes, but many database platforms will let you, which means you can enclose all values in quotes without a problem, giving you one less exception to remember.
Multiple filters can be created using the AND or OR connectors:
SELECT \* FROM Buildings WHERE Name = 'The Pentagon' OR Name = 'The White House'
The returned dataset in this case will contain rows in which the Name column equals “The Pentagon” or “The White House”.
Previously we said that capitalization of keywords didn’t matter, however whether or not the capitalization of data values matters depends on your particular database server and how it was configured.
For instance, would the SQL above return a row in which the name column contained “the white house”? By default, SQL Server would return this and so would MySQL; however, Oracle would not. The former are case insensitive by default, while the latter is case sensitive by default. However, all platforms can be configured to support or ignore case sensitivity as desired as part of their collation settings.
In our sample above, “=” is called an operator. Operators allow the comparison of values. Some other common operators are:
- < for “less than”
- <= for “less than or equal to”
- > for “greater than”
- <= for “greater than or equal to”
- <> for “not equal to”
- “LIKE” for “containing this text string”
To use each of them, we need to follow the column-operator-value format:
SELECT \* FROM Building WHERE DateBuilt >= '2012-01-01' AND DateBuilt <= '2012-12-31' AND NumberOfFloors > 10
This would return all buildings built in 2012 with more than 10 floors. Note that we used “<=” and “>=” in both date comparisons (meaning “greater/less than or equal to). We had to because “<” and “>” would not have included the date listed, and would have instead returned buildings built from January 2 to December 30.
Also note the date formatting. Each platform is different, but YYYY-MM-DD should be fairly consistent among them.
Note too that the precision of the date column matters here. If our DateBuilt column was instead a datetime column, then it would include a time after it. In this case, not including the time usually means “00:00:00”. In this case, the first clause works fine, since we’ll get anything after 00:00:00 on January 1, which is to say the whole day.
However, the second clause might be a problem, since it wouldn’t include anything after the very start of December 31, effectively excluding that entire day. We would need to change that clause to:
DateBuilt <= '2012-12-31 23:59:59'
This is an instance where too much precision is a problem. Since the time of day doesn’t really apply to DateBuilt (how would you decide what time of the day a building was completed?), we should ensure this is a date column without the additional precision of a time.
LIKE is a special operator that will match columns with text that contains specified text inside it, subject to wildcards. Consider:
SELECT \* FROM Building WHERE Name LIKE '%Plaza'
The percent sign (“%”) is a wildcard, which means any text string. So the SQL above would return any building where the name ended with “Plaza.” To return a building where the name starts with “Plaza”, you would put the percent sign on the other end.
SELECT \* FROM Building WHERE Name LIKE 'Plaza%'
To return any building where the text “Plaza” appears anywhere inside the name, use percent signs on both sides.
SELECT \* FROM Building WHERE Name LIKE '%Plaza%'
If you didn’t use the percent sign in the value, LIKE works exactly like equals.
(It’s tempting to think that this is a great way to build a search engine, but it’s really not. First, the LIKE operator is terribly inefficient, as the database has to search through and break apart the text of each and every record. Consequently, queries using LIKE are very slow compared to other queries, and on tables with millions of rows, they would quickly bring the database to a standstill.
Second, searching for a specific text string is usually less than helpful. For instance, searching for “my cat” will only return rows with that exact six-character combination in that order. It won’t find “my old cat” or “our cat,” which is sometimes what the searcher wants. Modern search engines are designed to find words even if they’re not adjacent, and to look for alternate forms of words that mean roughly the same thing. LIKE includes none of this logic.)
Finally, WHERE clauses can be parenthetical. Any clauses in parentheses will be evaluated for truth or falsity as a set. For example:
SELECT \* FROM Buildings WHERE DateBuilt <= '1999-12-31' OR ( DateBuilt >= '2000-01-01' AND NumberOfFloors > 10 )
This will return all buildings built in the 20th century or buildings built in the 21 century that have more than 10 floors. So a nine-floor building built in 2003 would not be returned, because it fails on one of the clauses in the parenthetical.
Other Database Objects
Remember when we said that a database is a container that holds lots of different objects? Until this point, we’ve been talking about nothing but the table. This is appropriate, because the table is the core object in any database, and remember that it’s the only place where data is actually stored.
However, there are other types of objects, several that you should know about. All of these are very common, and should be supported by most any database system.
Remember how SQL returns a dataset, and how a dataset looks a lot like a table? Wouldn’t it be neat if you could do a query on a dataset? It would be essentially querying the results of a query, and that might be handy.
Well, you can do this by using a view. A view is simply “pre-recorded SQL” which generates a pseudo-table which can then be queried like an actual table. Views have names, which have to be unique. Once created, the names of views can simply be substituted for the names of tables in SQL.
Say we create a view using this SQL:
CREATE VIEW BuildingsBuiltIn2010 AS SELECT \* FROM Buildings WHERE DateCompleted >= '2010-01-01' AND DateCompleted <= '2010-12-31'
After this SQL is run, there will be a new object in the database – a view called “BuildingsBuiltIn2010.” This view will act like a table which stories a subset of records from the Building table. And it can be queried like a table, so we could execute this:
SELECT \* FROM BuildingsBuiltIn2010 WHERE NumberOfFloors > 1
This would give us the number of multi-story buildings built in 2010. The WHERE clause in the SQL we just ran will filter out the single-story buildings, and the view itself will only return buildings built in 2010.
It’s important to note that views do not store any data. If you open a view in your database management program (or run a “SELECT * FROM...” statement) you will see data, but this data doesn’t reside in the view. The data is in an underlying table, and the view simply generates data real-time, at the second you look at it. It’s like a “window” into a table.
You can delete a view, and this won’t cause you to lose any data, you just won’t be able to query the view any longer.
Note too that the SQL used to create the view can return a dataset that doesn’t look anything like the underlying table. The SQL can can be advanced as you like – you can query more than one table, rename columns, pivot data, etc. Any dataset you can create with SQL you can turn into a view and query as if it was a table.
Why would you use a view? A few reasons –
- To encapsulate complicated SQL that you might use a lot in different parts of your application. If you’re always filtering a table a certain way, then modifying it slightly for different reasons, you can abstract out the common aspects of it into a view, then just query the view in different ways.
- To ease understanding of complicated data. If someone needs to view data in a certain way, you can create a view for them to use and don’t have to bother explaining how the view works. For all intents and purposes, the view is the data to the person who uses it and they wouldn’t have any idea that the data is stored differently than they’re seeing it (they might not even know they were looking at a view instead of a table unless you told them).
- To restrict data from people who shouldn’t see it. If someone needs to query the master employee table, but they shouldn’t be able to see salaries, create a view with every column except Salary, and give them access to the view. They would never even know the Salary column existed in the underlying table.
So far, we’ve only been using SQL to filter data using a SELECT statement. However, SQL is actually a Turing-complete programming language, which means it’s a fully-functional language in which complete programs can be written. SQL has if/then statements, flow control, variables, etc.
These language features can be used to do very sophisticated manipulation of data when necessary. Stored procedures can be written and either run on-demand or scheduled to run at specific times to do maintenance tasks on your data.
Unlike SQL, stored procedures often use aspects of SQL that are specific to a certain flavor of SQL. Since you’re using more features of the language that are subject to less standardization, it becomes tougher to drop a procedure from, say, Oracle into SQL Server and expect it to run without error.
Stored procedures are a good way to “roll up” multiple SQL statements that a performed a group to accomplish some task.
Stored procedures can also be used to limit data access for people who don’t need it or shouldn’t have it. For instance, if a certain user only needs to insert building records and nothing, you could write a stored procedure that looks like this:
CREATE PROCEDURE InsertNewBuilding @Name varchar(50), @Floors int, @DateCompleted date INSERT INTO Buildings (Name, Floors, DateCompleted) VALUES (@Name, @Floors, @DateCompleted) SELECT @@IDENTITY
That SQL creates a stored procedure that takes in three pieces of information, inserts the data in the database behind the scenes, then returns the ID of the newly-created record.
Your user could now run this SQL like this:
EXEC 'The Empire State Building', 102, '1931-12-31'
The stored procedure would execute, and return a one row, one column dataset with the ID of the record it inserted.
(Note: the @@IDENTITY statement used there is specific to Transact-SQL in SQL Server. It returns the last identity column inserted into the database. Depending on the volume of your database insertions, this may or may not work well. There is a slight chance that another record could insert between the INSERT statement and the SELECT statement above, which would give you the wrong identity value back.)
Since our user can only use that stored procedure, he has no idea what the underlying table looks like. He might not even know that he is inserting into a single table. If we decide, for instance, that every building has a lobby, perhaps we want to modify our stored procedure to always add a “Lobby” record to the Rooms table whenever a new building is added.
CREATE PROCEDURE InsertNewBuilding @Name varchar(50), @Floors int, @DateCompleted date INSERT INTO Buildings (Name, Floors, DateCompleted) VALUES (@Name, @Floors, @DateCompleted) SELECT @NewBuildingId = @@IDENTITY INSERT INTO Rooms (BuildingId, Name) VALUES (@NewBuildingId, 'Lobby') SELECT @NewBuildingId
Our user would still call this stored procedure the same way, but now it does several things:
- Inserts a new record in the Building table.
- Gets the ID for that new building back.
- Uses the ID to insert a record in the Rooms table called ‘Lobby’.
- Records the ID for the Building.