The ultimate spreadsheet guide

This guide is aimed towards journalists who use Excel or similar software to find news in data. It covers the most common tasks involved in cleaning up and analyzing spreadsheets, as well as some ‘best practices’. There will be errors and omissions. Please let us know as you find them, using the Trello board! From | (en) with love!

This is a dynamic guide. Content will change as you select different combinations in the form above.

Table of contents

  1. What software should I use?
  2. Guiding principles
  3. Cleaning up your sheet
  4. Analyzing: Pivot tables
  5. Combining data
  6. Splitting columns
Back to top ↑

What software should I use?

A spreadsheet program, like Excel, acts as both a datastore (think of it like a simple database), and an interface for “interviewing” the data. The available programs all have their pros and cons. These are some of the differences that matter most to data journalists.

Excel
  • Has the most powerful pivot table features
  • Good performance on large sheets, when used with the Power Pivot add-in
Google Sheets
  • Allows you to do collaborative research
  • Does web scraping
  • User friendly interface
  • Free
OpenOffice/LibreOffice Calc
LibreOffice is the currently maintained clone of the now discontinued OpenOffice. When they share version number, they are nearly identical.
  • Good for importing and exporting
  • Free
NeoOffice
This is basically just LibreOffice for MacOS, with some extra integration to other Apple services.
  • Good for importing and exporting
Python + Pandas
If you know a bit of programming, Python with the Pandas library can replace a spreadsheet program.
  • Reusable analysis
  • Transparent analysis
  • Works on large datasets

It is worth having more than one of these tools in your toolbox, e.g. using Google Sheets when working together, LibreOffice when importing esoteric file formats or encodings, and Excel when analyzing data on your own.

While Google Sheets is both poweful and easy to use, remember that it is a cloud service. You do not control the servers. Be careful not to upload sensitive data.

Back to top ↑

Guiding principles

This guide is designed to follow a few principles of sane data handling. It all boils down to transparency and reproducibility. Transparency allows our editors (and readers) to fact check what we have been doing. Reproducibility allows us to easily rerun to whole analyzis when new data is available, or the original data contained an error.

Never alter data. We avoid using features like Text to Columns or Search and Replace to clean up data. Instead, we rely on functions, like =LEFT() or =MID().

Never mix data. We keep each dataset in a separate tab.

Make the analysis readable. We avoid overly complex nested formulas, and prefer doing the analysis in smaller steps, to make it easier for others (and ourselves!) to follow what we have done.

Back to top ↑

Cleaning up your sheet

So you have downloaded some data from a statistics site? These are the first things that you do. (Example)

Let your table start in cell A1

Make sure to remove all headers, footers and comments. Your data sheet should contain only data, and it should start in cell A1, with the first column name. You want column names on row 1, and values in the remaining rows. Any footnotes or comments that you might need later can be dropped in a separate sheet in your workbook, or in a text document in the same folder.

Bad
Good

Remove all aggreggates

You do not want any rows with totals and sub totals in your data. Every row should represent the same thing. In a sheet containing municipality populations, each row should represent a municipality. Rows with county population do not — they belong in a different sheet and should be deleted.

Bad. “Män” and “kvinnor” are both genders, but “totalt” is the sub-total of them both. Those rows do not belong here.
Good

Fill out row labels

Sometimes only the first in a series of rows is labeled. We want to explicitly label reach and every row. You can achieve this with a clever little =IF() formula. This formula will check if column A is empty in the example pictured, and if so, take the value from the previous cell: =IF(A2="", D1, A2). Finally autofill to the bottom.

Loading spreadsheet ...

If you want to sort, or otherwise reorder this sheet, you have to copy and paste values only to remove the formulas, as they will break when the row order changes.

Remove any merged cells

Simply never use them.

Select all cells by Ctrl-A, right click, chose Format Cells and make sure that the box Merge Cells is unticked.

Make sure numbers are numbers and text is text

Any spreadsheet software, much lika a programming language, needs to know if a numeric value is supposed to be treated as a text string or a number. Otherwise it has no clue adding 1 and 1 together should equal 11 or 2 – both answers make sense, in different context. And computers are not very good at context.

Most of your values will be either text or numbers, and you need to make sure that each column is of the right type. A numerical municipality code, for instance, is most likely something you want to store as text, not as a number. You are not going to make maths with it, and storing it as text makes sure that Excel does not remove a leading zero, or does other funny stuff with it.

By default numbers are right aligned, and text is left aligned. In the spreadsheet below, the municipality code i stored as text, and therefore left aligned, just like the municipality name. Population, on the other hand, is a number, and right aligned.

Loading spreadsheet ...

To change the value type of a column, select the whole column and use the value types dropdown menu, where by default it says General (with General, Excel tries to guess how to best represent your data).

Using only number and text formats is often enough. Unless you really need all the extra functionality of the date/time value types, stay with text. Strings like "2017-01-21" sort and compare well, so you can do quite a lot of calculations with them as they are. Storing dates as text also makes it easy to export you data, avoiding a few pitholes.

If Excel refuses to recognize you number as text, you will have to use a formula to convert it. There are many ways to represent a number textually (1102 could be written ”1,102”, ”1 102,00”, ”001102”, ”١١٠٢”, ”MCII”, etc), and because of this one-to-many relationship, you have you instruct your computer on exactly how you want it. The formula =TEXT() takes two argument: The number to convert to text, and the format you want the text in. In cell B2 and B3 below, the zero in =TEXT(A2, "0") tells Excel to just output digits; no decimal part, no separators, or anything else. In cell B5 and B6 we've used four zeros to say that we want a four characters long string, adding leading zeros if needed.

Loading spreadsheet ...

Optionally add an index column

Adding an extra column with a unique number for each row is convenient not only for undoing any sorting (in case the order of your rows matter), but having that extra column also provides an extra visual clue as to the sanity of your data. By sorting on the index column, it will be easier to spot missing rows or forgotten filters early.

Insert an extra, leftmost column, give it a name, add a “1” to cell A2, a “2” to cell A3, and then autofill to the bottom.

Back to top ↑

Analyzing: Pivot tables

Pivot tables let analyze you data by turning it around, and rearranging it in different ways. Given a table with number of students per school, each row representing one school, we can create a new table giving the number of students per municipality instead:

With a pivot table, we can summarize the number of students by municipality.

Creating a pivot table

Start by selecting all data (click anywhere in you data and press Ctrl-A).

Chose Insert > Pivot table or click Pivot table at the Insert toolbar tab. Click ok to continue.

You will now end up with an empty pivot table, and a pivot table editor where you get to select what each row in your pivot table should represent, and what values to display for each row. Click and drag a field to the rows box to make each row represent that field. In the example above, ”municipality” would be in the rows box. Then drag additional fields to the values box (in our case ”students”). You pivot table will update immediately.

Pivot tables with columns

TBD

Building a pivot table with formulas

While pivot tables are supposed to automatically reflect any changes in the original data, there are still cases where you might prefer to mimic the behaviour of a pivot table with regular formulas only. However, the only software that currenty allows you to do that in a simple way is Google Sheets. For Excel, you will have to resort to a rather scary looking formula, as outlined in this Stack Overflow answer.

  1. First make a new column in your data, that checks if the value is the last of its kind in the column: =IF(EXACT(COUNTIF($B2:B$994, B2), 1), 1, 0)
  2. Then find the position of these values, using a combination of =MATCH() and either =INDIRECT() (as pictured here) or =INDEX().
  3. Finally use the values on those rows to build your emulated pivot table: =SUMIF(B:B, "="&G2, C:C)
Loading spreadsheet ...

You could also use =COUNTIF() in column H, to get the number of values, rather than the sum.

Back to top ↑

Combining data

This chapter will guide yout through combining data from two tables with the =VLOOKUP() function.

Merging tables

Given a table of Finnish MP's, and their party name, we want to add a third column separating them into opposition and government politicians. This information can be looked up in the table to the right. We want to look up the party name from B2 in the table to the right, and insert the value from the second column in that table. This is written =VLOOKUP(B2, E$2:F$9, 2), where B2 is the value to look up, the E$2:F$9, the table to fetch data from, and 2 meaning ”second column”. An optional fourth argument should, if included, be 0 or FALSE in this case.

Loading spreadsheet ...

Before autofilling this formula to the bottom, you need to make sure that the row numbers in the reference to the table you are getting data from is locked, by putting $-signs in front of the row number. Otherwise the row number will be increase as you autofill down, move off the table boundaries. An alternative approach is to use whole columns in the reference, like this: =VLOOKUP(B2, E:F, 2). If you have a clean sheet, this should normally work fine (except for the edge case when the column title can also appear as a value in that column.)

In the illustration above, the tables we want to combine are place next to each others, for illustrative purposes. Usually it is a much better idea to put the tables you want to combine on different sheets, in the same file. You can switch sheets while writing a formula, to drag and select a cell reference area on a different sheet. Just be careful to finnish the formula before going back, or you will very likely end up polluting it with accidental cell references. Moving the second table a new sheet called parties, our formula would look like this: =VLOOKUP(B2, parties!A:B, 2), assuming we put our table to the left (columns A and B) in the new sheet. 2 still refers to the second column, that is now column B.

Loading spreadsheet ...

Finding missing values

Let's say you have a dataset with municipality names in one column, and want to check if any municipality is not present. You can use a pivot table to count the number of municipalities, but how do you find which ones are missing? A quick way is by doing using a =VLOOKUP() function “in reverse”, starting from the list of municipalities, and checking for each municipality in the original data.

Loading spreadsheet ...

Having one municipality too much? Try applying the same logic, but the other way around, using a =VLOOKUP() from the data sheet to the sheet with the list of municipalities.

Finding the pair with the biggest difference

When you have geographical data, it is ofter worth comparing neighbouring regions. If two neighbouring municipalities display very different values, you might have a story, or at the very least an interesting place to look for a case! For this to work, you will need a sheet with all neighbouring pairs, one pair on each column. Creating that sheet is outside the scope of this guide, but you can with relative ease use any GIS software, e.g. the open source QGIS, to produce such a table. You will find that this is a very useful sheet to have laying around.

Use =VLOOKUP() to fetch the value for both municipalities in the row. In the example below, we have the data in a sheet called data, and can fetch it with =VLOOKUP(A2, data!A:B, 2, 0) and =VLOOKUP(B2, data!A:B, 2, 0) respectively. Finally we calculate the difference, simply writing =C2-D2 in the last column (if your table has to rows for each pair; both A, B and B, A), or =ABS(C2-D2) (to make all differences positive number) if your table has only one row for each pair of neighbours. Finally sort the sheet on the last column, to find out that Orust and Stenungsund are the two neighbouring municipalities with the greatest difference between them.

Loading spreadsheet ...

Categorizing

The =VLOOKUP() formula can be used for finding the nearest match. This is useful for categorizing data. In the example below, we want to translate the average score from an icecream test to labels according to the table to the right. By setting the fourth argument to “TRUE” in column C, we allow non-exact matches, making the formula pick the nearest value: =VLOOKUP(B2, E2:F6, 2, TRUE).

Loading spreadsheet ...
Back to top ↑

Splitting columns

Do you have more than one piece of information ín one column? Maybe a municipality number and name, like 0182 Stockholm; or first name and last name, like Lovelace, Ada; or text formated date, like 2016-12-24, where you want a separate column for month and year respectively? Here's how to extract that data, without destroying your original columns.

Using a delimitor

For the case when data is seperated by a certin character (a space in 0182 Stockholm, a comma and a space in Lovelace, Ada and a dash in 2016-12-24), we can combine =FIND() with =MID(), =LEFT() and =RIGHT() to extract the part we are after.

=FIND() gives us the position of one string in another, so that =FIND("+", "Journalism++") returns 11, as the first ‘+’ sign is at position 11.

=MID() gives us a part of a string, so that =MID("Journalism++", 11, 2) returns ++, starting at position 11, and extracting 2 characters from there. =LEFT("Journalism++", 10) gives us the 10 leftmost characters, Journalism, and =RIGHT("Journalism++", 2), as you might expect, gives us the 2 rightmost, Journalism. Now we put it all together, extracting the names and country codes from thiss list of athletes banned for doping.

Loading spreadsheet ...

Note that we subtract 1 from the position return by =FIND(), to end before the delimitor, and in the same way add 2 to start after it (we got the position of the space, the first character in our delimiter, and don't want that or the parantheses in the nationality code).

Using a fixed with

Let's use a similar logic, but to extract the year from a text formatted ISO date (YYYY-mm-dd), and this time with some foolproofing! From the same dataset, here are the athletes with the end date of their penalties. The year, of course, is the first four characters, so we don't even need =FIND(), we already know that the dash will be the fifth character. Except for when the content of the cell is not a year, but something else!

In the spreadsheet below, we extract the four leftmost characters, and then check if they form a number or not. If they do, we add them to column E. If not, we'll leave that cell blank. There is no function like “LOOKS_NUMERIC” in Excel, but we can still check if something looks like a number, by first converting it to a value, and check if we succeeded. This works well even in some weird edge cases. You still risk a false positive if your date column contain other values starting with four digits, of course, but in this case that seems unlikely.

To do the number checking, we use =ISNUMBER(VALUE(C2)), where =VALUE() tries converting something to a numerical value, and =ISNUMBER() returns TRUE if it succeeded, or FALSE if it didn't. We then use an =IF() formula to do different things depending on the outcome.

Loading spreadsheet ...