The ultimate spreadsheet guide
This guide is aimed towards journalists who use Excel or similar software to find news in data. It covers the most common tasks involved in cleaning up and analyzing spreadsheets, as well as some ‘best practices’. There will be errors and omissions. Please let us know as you find them, using the Trello board! From Journalism++ with love!
This is a dynamic guide. Content will change as you select different combinations in the form above.
Table of contents
- What software should I use?
- Guiding principles
- Cleaning up your sheet
- Analyzing: Pivot tables
- Combining data
- Splitting columns
- Find the winner
What software should I use?
A spreadsheet program, like Excel, acts as both a datastore (think of it like a simple database), and an interface for “interviewing” the data. The available programs all have their pros and cons. These are some of the differences that matter most to data journalists.
- Has the most powerful pivot table features
- Good performance on large sheets, when used with the Power Pivot add-in
- Google Sheets
- Allows you to do collaborative research
- Does web scraping
- User friendly interface
- OpenOffice/LibreOffice Calc
- LibreOffice is the currently maintained clone of the now discontinued OpenOffice. When they share version number, they are nearly identical.
- Good for importing and exporting
- This is basically just LibreOffice for MacOS, with some extra integration to other Apple services.
- Good for importing and exporting
- Python + Pandas
- If you know a bit of programming, Python with the Pandas library can replace a spreadsheet program.
- Reusable analysis
- Transparent analysis
- Works on large datasets
It is worth having more than one of these tools in your toolbox, e.g. using Google Sheets when working together, LibreOffice when importing esoteric file formats or encodings, and Excel when analyzing data on your own.
While Google Sheets is both poweful and easy to use, remember that it is a cloud service. You do not control the servers. Be careful not to upload sensitive data.
This guide is designed to follow a few principles of sane data handling. It all boils down to transparency and reproducibility. Transparency allows our editors (and readers) to fact check what we have been doing. Reproducibility allows us to easily rerun to whole analyzis when new data is available, or the original data contained an error.
Never alter data. We avoid using features like
Text to Columns or
Search and Replace to clean up data. Instead, we rely on functions, like
Never mix data. We keep each dataset in a separate tab.
Make the analysis readable. We avoid overly complex nested formulas, and prefer doing the analysis in smaller steps, to make it easier for others (and ourselves!) to follow what we have done.
Cleaning up your sheet
So you have downloaded some data from a statistics site? These are the first things that you do. (Example)
Let your table start in cell A1
Make sure to remove all headers, footers and comments. Your data sheet should contain only data, and it should start in cell
Remove all aggreggates
You do not want any rows with totals and sub totals in your data. Every row should represent the same thing. In a sheet containing municipality populations, each row should represent a municipality. Rows with county population do not — they belong in a different sheet and should be deleted.
Fill out row labels
Sometimes only the first in a series of rows is labeled. We want to explicitly label reach and every row. You can achieve this with a clever little
=IF() formula. This formula will check if column A is empty in the example pictured, and if so, take the value from the previous cell:
=IF(A2="", D1, A2). Finally autofill to the bottom.
If you want to sort, or otherwise reorder this sheet, you have to copy and paste values only to remove the formulas, as they will break when the row order changes.
Remove any merged cells
Simply never use them.Select all cells by Ctrl-A, right click, chose
Format Cellsand make sure that the box
Merge Cellsis unticked.
Make sure numbers are numbers and text is text
Any spreadsheet software, much lika a programming language, needs to know if a numeric value is supposed to be treated as a text string or a number. Otherwise it has no clue adding 1 and 1 together should equal 11 or 2 – both answers make sense, in different context. And computers are not very good at context.
Most of your values will be either text or numbers, and you need to make sure that each column is of the right type. A numerical municipality code, for instance, is most likely something you want to store as text, not as a number. You are not going to make maths with it, and storing it as text makes sure that Excel does not remove a leading zero, or does other funny stuff with it.
By default numbers are right aligned, and text is left aligned. In the spreadsheet below, the municipality code i stored as text, and therefore left aligned, just like the municipality name. Population, on the other hand, is a number, and right aligned.
To change the value type of a column, select the whole column and use the value types dropdown menu, where by default it says
General, Excel tries to guess how to best represent your data).
Using only number and text formats is often enough. Unless you really need all the extra functionality of the date/time value types, stay with text. Strings like "2017-01-21" sort and compare well, so you can do quite a lot of calculations with them as they are. Storing dates as text also makes it easy to export you data, avoiding a few pitholes.
If Excel refuses to recognize you number as text, you will have to use a formula to convert it. There are many ways to represent a number textually (1102 could be written ”1,102”, ”1 102,00”, ”001102”, ”١١٠٢”, ”MCII”, etc), and because of this one-to-many relationship, you have you instruct your computer on exactly how you want it. The formula
=TEXT() takes two argument: The number to convert to text, and the format you want the text in. In cell
=TEXT(A2, "0") tells Excel to just output digits; no decimal part, no separators, or anything else. In cell
Optionally add an index column
Adding an extra column with a unique number for each row is convenient not only for undoing any sorting (in case the order of your rows matter), but having that extra column also provides an extra visual clue as to the sanity of your data. By sorting on the index column, it will be easier to spot missing rows or forgotten filters early.
Insert an extra, leftmost column, give it a name, add a “1” to cell
Analyzing: Pivot tables
Pivot tables let analyze you data by turning it around, and rearranging it in different ways. Given a table with number of students per school, each row representing one school, we can create a new table giving the number of students per municipality instead:
Creating a pivot table
Start by selecting all data (click anywhere in you data and press Ctrl-A).
Insert > Pivot table or click Pivot table at the Insert toolbar tab. Click ok to continue.
You will now end up with an empty pivot table, and a pivot table editor where you get to select what each row in your pivot table should represent, and what values to display for each row. Click and drag a field to the rows box to make each row represent that field. In the example above, ”municipality” would be in the rows box. Then drag additional fields to the values box (in our case ”students”). You pivot table will update immediately.
Building a pivot table with formulas
While pivot tables are supposed to automatically reflect any changes in the original data, there are still cases where you might prefer to mimic the behaviour of a pivot table with regular formulas only. However, the only software that currenty allows you to do that in a simple way is Google Sheets. For Excel, you will have to resort to a rather scary looking formula, as outlined in this Stack Overflow answer.
- First make a new column in your data, that checks if the value is the last of its kind in the column:
=IF(EXACT(COUNTIF($B2:B$994, B2), 1), 1, 0)
- Then find the position of these values, using a combination of
=INDIRECT()(as pictured here) or
- Finally use the values on those rows to build your emulated pivot table:
=SUMIF(B:B, "="&G2, C:C)
You could also use
=COUNTIF() in column
This chapter will guide yout through combining data from two tables with the
Given a table of Finnish MP's, and their party name, we want to add a third column separating them into opposition and government politicians. This information can be looked up in the table to the right. We want to look up the party name from
=VLOOKUP(B2, E$2:F$9, 2), where
B2 is the value to look up, the
E$2:F$9, the table to fetch data from, and
2 meaning ”second column”. An optional fourth argument should, if included, be
FALSE in this case.
Before autofilling this formula to the bottom, you need to make sure that the row numbers in the reference to the table you are getting data from is locked, by putting $-signs in front of the row number. Otherwise the row number will be increase as you autofill down, move off the table boundaries. An alternative approach is to use whole columns in the reference, like this:
=VLOOKUP(B2, E:F, 2). If you have a clean sheet, this should normally work fine (except for the edge case when the column title can also appear as a value in that column.)
In the illustration above, the tables we want to combine are place next to each others, for illustrative purposes. Usually it is a much better idea to put the tables you want to combine on different sheets, in the same file. You can switch sheets while writing a formula, to drag and select a cell reference area on a different sheet. Just be careful to finnish the formula before going back, or you will very likely end up polluting it with accidental cell references. Moving the second table a new sheet called
parties, our formula would look like this:
=VLOOKUP(B2, parties!A:B, 2), assuming we put our table to the left (columns A and B) in the new sheet.
2 still refers to the second column, that is now column B.
Finding missing values
Let's say you have a dataset with municipality names in one column, and want to check if any municipality is not present. You can use a pivot table to count the number of municipalities, but how do you find which ones are missing? A quick way is by doing using a
=VLOOKUP() function “in reverse”, starting from the list of municipalities, and checking for each municipality in the original data.
Having one municipality too much? Try applying the same logic, but the other way around, using a
=VLOOKUP() from the data sheet to the sheet with the list of municipalities.
Finding the pair with the biggest difference
When you have geographical data, it is ofter worth comparing neighbouring regions. If two neighbouring municipalities display very different values, you might have a story, or at the very least an interesting place to look for a case! For this to work, you will need a sheet with all neighbouring pairs, one pair on each column. Creating that sheet is outside the scope of this guide, but you can with relative ease use any GIS software, e.g. the open source QGIS, to produce such a table. You will find that this is a very useful sheet to have laying around.
=VLOOKUP() to fetch the value for both municipalities in the row. In the example below, we have the data in a sheet called
data, and can fetch it with
=VLOOKUP(A2, data!A:B, 2, 0) and
=VLOOKUP(B2, data!A:B, 2, 0) respectively. Finally we calculate the difference, simply writing
=C2-D2 in the last column (if your table has to rows for each pair; both A, B and B, A), or
=ABS(C2-D2) (to make all differences positive number) if your table has only one row for each pair of neighbours. Finally sort the sheet on the last column, to find out that Orust and Stenungsund are the two neighbouring municipalities with the greatest difference between them.
=VLOOKUP() formula can be used for finding the nearest match. This is useful for categorizing data. In the example below, we want to translate the average score from an icecream test to labels according to the table to the right. By setting the fourth argument to “TRUE” in column C, we allow non-exact matches, making the formula pick the nearest value:
=VLOOKUP(B2, E2:F6, 2, TRUE).
Do you have more than one piece of information ín one column? Maybe a municipality number and name, like
0182 Stockholm; or first name and last name, like
Lovelace, Ada; or text formated date, like
2016-12-24, where you want a separate column for month and year respectively? Here's how to extract that data, without destroying your original columns.
Using a delimitor
For the case when data is seperated by a certin character (a space in
0182 Stockholm, a comma and a space in
Lovelace, Ada and a dash in
2016-12-24), we can combine
=RIGHT() to extract the part we are after.
=FIND() gives us the position of one string in another, so that
=FIND("+", "Journalism++") returns
11, as the first ‘+’ sign is at position 11.
=MID() gives us a part of a string, so that
=MID("Journalism++", 11, 2) returns
++, starting at position 11, and extracting 2 characters from there.
=LEFT("Journalism++", 10) gives us the 10 leftmost characters,
=RIGHT("Journalism++", 2), as you might expect, gives us the 2 rightmost,
Journalism. Now we put it all together, extracting the names and country codes from thiss list of athletes banned for doping.
Note that we subtract 1 from the position return by
=FIND(), to end before the delimitor, and in the same way add 2 to start after it (we got the position of the space, the first character in our delimiter, and don't want that or the parantheses in the nationality code).
Using a fixed with
Let's use a similar logic, but to extract the year from a text formatted ISO date (YYYY-mm-dd), and this time with some foolproofing! From the same dataset, here are the athletes with the end date of their penalties. The year, of course, is the first four characters, so we don't even need
=FIND(), we already know that the dash will be the fifth character. Except for when the content of the cell is not a year, but something else!
In the spreadsheet below, we extract the four leftmost characters, and then check if they form a number or not. If they do, we add them to column
To do the number checking, we use
=VALUE() tries converting something to a numerical value, and
=ISNUMBER() returns TRUE if it succeeded, or FALSE if it didn't. We then use an
=IF() formula to do different things depending on the outcome.
Find the winner
Given a set of local election results, how do you find the winner in each locality? This is one of many cases where it's useful to be able to look up the column header of the column matching a certain criteria (largest value, in this example).
Slight variations of this method can be used to do pretty advanced lookups, that would typically be done in a relational database like SQL.
Finding the “winner” for each row is a three-step process: We first find the maximum value with
=MAX(), then we find the (first) position of that value in the row using
=MATCH(), and finally we fetch the column header of that position using
Note that this formula will fetch only one value in case of a draw (the first).
You can obviously replace
=MAX() with e.g.