Results tagged “journalism”

Consistency and simplicity are keys to better databases (and spreadsheets) for journalists

DSCF0712.jpg

This post originally appeared on PBS MediaShift Idea Lab

I'm working on a fun little project at the moment that involves pulling data out of a Google Spreadsheet that is being curated by a team of journalists. The interesting thing about this project is that it so clearly illustrates the difference in thinking between people who regularly work with databases (herein called programmers) and those who use them less frequently (herein called journalists). In this particular case, the spreadsheet was initially developed by a more technically inclined editor, but -- nonetheless -- it exhibits some humorous and illustrative shortcomings, which we'll explore here.

What is a column, what is a row

In one of the earliest books that I read on databases design (and still a personal favorite), "Database Design for Mere Mortals," under the heading of "Objectives of good design," it says:

Each table in the database represents a single subject, is composed of relatively distinct fields, keeps redundant data to an absolute minimum, and is identified throughout the database by a field with unique values.

In fact, most day-to-day use of SQL-type databases utilizes terminology that matches a spreadsheet quite nicely, specifically: tables, rows, and columns (or, as C.J. Date would insist in his excellent book, "Database in Depth: Relational Theory for Practitioners": relations, tuples, and attributes).

Thus, we can think of a Google Spreadsheet as a simple database of sorts that describes a single domain, for example, an upcoming election. Within the spreadsheet there are worksheets (those little tabs along the bottom that I encourage you to use!), and each worksheet represents a table that describes "only one kind of thing, an entity, with its (non-repeating) attributes. A common way to name each table is with a plural noun name, e.g., Customers, Products, Aircraft, etc."

So, extending that logic to our example domain, an upcoming election, in our Google Spreadsheet we might want to have:

  • worksheets for the various "subjects" that are often part of an election, like Districts, Candidates, Incumbents, and so on.
  • In each of these worksheets, the columns would represent attributes of that type of thing: For example, a Candidates worksheet might have a column for party affiliation, website, and so on.
  • Each row would represent one instance of that type of thing, a candidate, and would contain all of the various attributes that relate to that one candidate.

My sense is that most journalists who've worked with a spreadsheet -- knowingly, or unconsciously -- kind of "get" the concepts presented above. However, where the conceptual meets the practical, there can be some bumpy landings.

It's all in the name

The bumpy landing that I experience most often, and most viscerally, is the seemingly innocuous worksheet column called "Name." How often have I stared in dismay, dumbfounded even, and grumbled to myself "You didn't really put all of that into one column, did you!?"

Names are surprisingly tricky things, programatically, because they're made up of several component parts like the first name, middle name, and last name, not to mention the honorific, suffix, and so on. And, in today's multicultural world, even these distinctions are not always straightforward or obvious.

Take for example a name like "Jenny Wai Ching Kwan." What is the first, middle, and last name? For a computer program, names like that can be difficult to guess (or to "parse") from the individual words. This is where the editorial judgment trumps robot journalism: A subject matter expert needs to break that name up into its component parts so that it's easy to work with.

Thus, as a rule of thumb, it's often best not to exclusively use "Name" as a column in your worksheet, and instead to separate that into "First name" and "Last name," and -- if you need more information about the name -- create more columns.

Another common challenge with names is that they can be represented in all kinds of short forms -- for example, Caitlin/Kate/Katie, Michael/Mike, Robert/Bob, and so on. Back to our running example, an election, even though a candidate might prefer to be called Kate and might publicly promote herself as Kate in her material or on her website and so on, that doesn't mean you should enter "Kate" in the "First name" column for her row in the worksheet. Why not? Well, technically speaking, that's not her proper name -- the one most likely used on the government website, and in other online databases and data sources -- and that could mean missed opportunities.

Just one quick example: If you happen to have a fantastic Open Data non-profit in your country and they happen to provide a great data source for accessing basic information about elected representatives, like photos, contact information, and so on, you might want to use that service instead of collecting the information yourself because it would save a lot of time. However, being good open data citizens, this data service has used the proper names for each representative because that's how it was presented on the government website where the data was pulled from.

So, in your worksheet you have "Kate" and you do a lookup for "Kate Smith," and sadly the fantastic Web service responds with "No results." Why? Well, because the lookup you needed to use was "Caitlin Smith." Enough said about that. If you want to use preferred names or nicknames in your worksheet, then -- that's right, you guessed it! -- create more columns.

A little consistency please

So, at this point, hopefully it has hit home that it's really helpful to have each distinct "bit" of information in it's own field, and each field under a column that clearly identifies that bit of information as one single attribute of the subject. That said, should you choose to ignore this guideline -- which no doubt you will at some point -- there is one other guideline that can help to mitigate the potential problem: consistency.

For example, if for some unknown reason, it was agreed upon to put three distinct bits of information into one field under one column heading, it should be entered the same way -- exactly -- every time it's entered. So, continuing with our running example of an election, maybe you've decided to list all of the candidate's social media links in one column (why you would do this, I have no idea!) like so:

    @somecandidate,
    candidatewebsite.com,
    facebook.com/somecandidate,
    instagram.com/somecandidate


On it's own, that field or cell would actually be fairly easy to work with, because the individual items are ordered (Twitter, website, Facebook, Instagram) and separated by a comma. Phew!

However, if the next cell looks like this, which it often does, things start getting a bit more complicated:

   @someothercandidate,
   othercandidatewebsite.com
   instagram.com/someothercandidate


At first glance, that might look pretty much the same, but it's not consistent with the first. For starters, one of the links is not present (Facebook), which means that the order of the information is not something that can be relied on. Second, the comma is no longer consistently used to separate the distinct parts.

Now your poor overworked programmer has a lot of edge cases to sort through to turn this into meaningful data that can be presented to your readers. At this point, no doubt, any lazy programmer is simply going to add some columns to your spreadsheet and ask you to separate the data (which, per above, you should have done in the first place). Or, because programmers love a good challenge, she will develop some complex logic to deal with all of the various ways that data could be entered in that column (and your project will launch late).

OK, so there's a simple lesson here: One column should contain just one piece of information (a single attribute of the subject at hand), and keep it really, really consistent. (For example, in Google Spreadsheets, you can use the comments feature to leave formatting guidance on the column headings).

Don't overwork the data

OK, those considerations out of the way, let's move on to working with the data, i.e., calculations, sorting, and so on. Looking at the inspiration for this post and our running example, our basic infrastructure is a Google Spreadsheet and a "Web application" that will display the data to users in a friendly format. Here our much-maligned journalist has not one, but two, tools in the mix for doing data manipulation: the ever-powerful spreadsheet and the even-more-powerful programmer. For this reason, there really is no (good) reason to try to do these manipulations manually.

So, continuing with our example of an election, the spreadsheet might contain a worksheet for a previous election, and that worksheet might have columns for district, candidate, party, and votes received. With that information, it's very, very easy to determine who the "winner" of a given race was, and who was in second, third, fourth place, and so on.

It's also easy to determine the total number of votes (assuming that there's a row for each and every one of the candidates who ran and not just the first five or something) and to indicate the percentage of the total votes that each candidate won. However, if you don't want to list all 10 candidates who ran, you'll also need to store the total number of votes cast, somewhere, so that an accurate calculation can be made.

In general, I'd suggest just including a row for each candidate that ran -- then your data is 100% complete and you can avoid the question of where to put the "Total votes counted in this district" information. If you need to include it, because you only want to research the top five candidates, it's probably best to add a column for "Total votes counted in this district" and add that number to every row, according to what election the row is referring to. This introduces some redundancy in the data, which is a trade-off for less data entry.

So, just to be clear, there's really no need to manually indicate in the spreadsheet with columns like "XYZ Election Winner Name" and "XYZ Election Winner Runner Up." This can, and probably should, be done either by using functions in your spreadsheet (which enables you to check your data entry vs. the official record), or in the Web application that presents the data (again, letting you double-check the numbers and helping to catch data entry problems).

SUMing it all up

That was a lot of information to digest, no doubt, so let's conclude this with a quick review:

  • Spreadsheets map to databases quite nicely (when used efficiently): They have the equivalent of tables, columns, and rows that can describe a subject like people, places, or things.
  • Worksheets, collectively, can describe a broader knowledge domain, like an election, or a parliamentary session, or city contracts.
  • Using these efficiently means keeping each distinct "bit" of data in its own field, under a descriptive column heading.
  • When in doubt, create a new column.
  • Be consistent. If the data is being entered by more than one person, then use built-in features like comments to provide guidance on the format that should be used.
  • Names of people and places should be as accurate, and proper, as possible. (Think about the data sources you might be referencing and cross-check the entry with that source.)
  • Keep the data clean, concise, and descriptive, and leave the sorting, joining, calculating and manipulation for later.

No doubt you -- as a journalist, or a programmer -- have some of your own tip, tricks, and pet peeves when working with spreadsheets. I hope that you'll take a moment to share the love (or the pain) here in the comments, or on Twitter.

P.S. Do you have a great example of a Google Spreadsheet-powered Web app? Please drop a link in the comments!

This post originally appeared on PBS MediaShift Idea Lab, please post any comments on the original post!

Leave a comment

TrackBack URL: http://www.phillipadsmith.com/trackback/2879

Investigating the Technology Underlying Top News Organizations

Cross-posted from PBS MediaShift Idea Lab

Phillip Smith's personal "technology stack"

“We Like Lists Because We Don’t Want to Die” — Umberto Eco

Ever since PBS MediaShift contacted me about contributing to their Idea Lab site again (past posts here), I’ve been taking the opportunity to ponder a new “beat” of sorts, one that would be both interesting to read, and also a contribution in some way to the larger conversation about shifts that are underway in news, media, and technology.

What I’ve decided to try is an experiment: a structured investigation into the technology behind some of the Web operations that are generally considered to be “top news organizations,” something that I’ve been following casually for quite a while. The experiment is to see if I can bring some thoroughness to this line of investigation, and not simply cherry-pick the good technology stories that advance my own views. (It’s easy to discover that I’m a shameless promoter of open-source software and open access.)

This will also be an investigation to explore the who, what, when, where, and why of news operations that are putting their money where their mouth is (or not). For example, do they claim to be in support of user privacy but continue to quietly collect and share users’ information? Do they promote their use of open-source stacks for news application development but run proprietary software to handle day-to-day workflow and content management needs, and — more generally — where do these news organizations fall on a range of technology choices like Web and e-mail service providers, analytics tools, commenting systems, and so on? Hopefully, the answers we find will be informative and useful to other news operations that are facing the same questions.

The first order of business, then, is to come up with an initial set of data to work with. A list of news organizations’ Web sites, if you will, that we can peer into and glean some information.

Building a list (or two)

As a starting point, I looked at lists of top news sites by traffic on the Guardian UK as well as the 2012 report from the Nielsen Group on “Top U.S. Web brands and news Web sites.”

I also scanned a few other lists, which added sites like Mail Online, and then I removed aggregators and sites without much original content, specifically Yahoo News, Google News, Drudge Report, and Reddit.

The final list I plan to work with for “top news sites” is the following:

I also thought it would be fun to track some niche lists — here are a couple that came immediately to mind:

The Open News partner news organizations:

  • Al Jazeera English
  • BBC
  • Boston Globe/Boston.com
  • Guardian UK
  • Zeit Online
  • New York Times
  • Spiegel Online
  • ProPublica
  • La Nacion

And a “Semi-comprehensive list of newsrooms doing news applications”:

  • Associated Press
  • Chicago Tribune
  • Cincinnati Enquirer
  • CQ-Roll Call Group
  • Dallas Morning News
  • Los Angeles Times
  • New York Times
  • ProPublica
  • St. Petersburg Times
  • The Texas Tribune
  • Washington Post
  • MSNBC.com

I have a few ideas for other collections of news sites to look at, e.g., Canadian news sites (because I’m a Canuck), online-only news operations (a personal passion), and perhaps the independent newsrooms that make up The Media Consortium membership.

If you have some ideas for other list of news operations and news Web sites to investigate, please drop me a line (Twitter, or the comments here work nicely) or — better yet — just edit the list in Google Spreadsheets directly.

Where we go from here

I’ve asked the folks at BuiltWith — a Web site profiler, competitive analysis, and business intelligence tool — to help provide the basic information for this series. They’ve generously agreed to provide access to their API for the purpose of this column. Using BuiltWith, we’ll have a look at the collections of sites above, explore what technology stacks they are using, and start digging for trends or interesting outliers.

From there, I hope to jump off and dig into the initiatives that aren’t so easily compared such as the technology underlying these news organizations’ mobile and tablet strategies, technical approaches to pay walls, authentication and identity management, and how each organization engages with readers, and so on.

Much of this information is outside the reach of simple profiling tools, so — when I hit those limits — I’ll do something very unnatural for a technically inclined person: I will pick up the phone and start asking questions. All that to say, when the phone rings and it’s me on the other end of the line, I hope you’ll take a moment to field my inquiries.

If there are specific questions that you have for your own news operations, which might be influenced by knowing how other organizations are approaching a specific technical challenge, please drop me a line or leave a comment below. I’d be encouraged to have some reader input on what questions are the most interesting to answer.

Here we go.

Cross-posted from PBS MediaShift Idea Lab. Have comments? Please post them on the original post

Leave a comment

TrackBack URL: http://www.phillipadsmith.com/trackback/2878

Commentary: Do we really need state-funded news entities any more?

Yesterday, Mathew Ingram asked “Do we really need state-funded news entities like the BBC any more?.

After sifting through the commentary, and introducing some of my own, it appears that the crux of Mathew’s argument is about the size of the BBC budget, which is presented as 5 billion (dollars?).

Being a Canadian, and given that the article references the CBC, I immediately see this argument in the context of Canada’s national broadcaster (whose budget has been questioned constantly over the last five years of conservative government).

Let’s put this in perspective: the UK has a population of roughly 63 million. Canada roughly 33 million. The highest level of funding the CBC has reportedly received in the last ten years is 1.25 billion. If that was doubled for the population difference, the funding would probably be 2.5 billion. So, in theory, the CBC receives roughly half of the funding that the BBC does — and, if you asked many Canadians, I suspect they would say that was obvious, i.e., that the BBC outpaces the CBC as a global news operation.

So, on the surface, it would appear that the CBC is greatly underfunded, if we looked to the UK as a model.

There’s a catch-22 with “state-funded news entities,” however, and that is that state-funded media can become like a giant tree in a forest that prevents light from reaching the ground, and consequently makes it more difficult for private media to grow. If you’ve ever worked in private media in Canada, you’ve probably lamented the false sense of security that Canadians have about the health of their media ecosystem (sadly, one of the most concentrated in the world) because the presence of a large state-funded media organization makes everything seem okay (and by “everything seem okay,” I’m specifically referring to approvals for further concentration of media ownership in every market and every medium).

My arguments for state-funded media are straightforward: market failures would be likely to occur if a nation’s news coverage was left entirely to private enterprise. Those failures would likely include just about everything that doesn’t make money, like covering remote areas, the concerns of under-represented people, and — as we’ve already seen cut from many private media initiatives as they struggle financially — expensive and important investigative journalism.

The other more interesting point that Mathew touches on briefly (that I wish he would explore a bit further), is about the opportunity to explore some innovative funding models around state-funded media. The question is, however, is it even possible for a government to undertake such innovation? And, if so, what could it look like?

The Canadian media environment needs more innovation, there’s no doubt. There probably should be more news start-ups and more rapid reinvention in the established media space. But, for me, this should not come at the expense of a well-funded national broadcaster. It should — instead — be part of what Canada invests in to ensure that not all information that citizens receive is signed, sealed, and delivered by national media conglomerates.

Leave a comment

TrackBack URL: http://www.phillipadsmith.com/trackback/2856


2 3 4 5 6 7 8