Remind me again...

...why we're supposed to trust governments rather than corporations with our private data?

Some scientists have done a study of the forwarding patterns of Internet chain letters and come up with two interesting findings:

1) Ninety percent of the time, when someone forwards a chain letter to a group of people, only one person out of that group will forward it on.

2) At the median, when a person receives an Internet chain letter it will have been forwarded three hundred times before it reaches them.

I'm not sure what any of this means yet, but I find it intriguing. I'm particularly curious how this compares to pure information, as opposed to chain letters. I can't imagine that a piece of information would go through three hundred episodes of telephone before it reaches a person . . . would it?

Full paper here.

The Things One Can Discover...

...with enough data and the right algorithm:

Falling Coca-Cola sales in a specific region of Africa are an excellent indicator of civil unrest, famine, or some other problem in that region.

Now That's What I'm Talking About!

A system that analyzes the content of books and suggests new books for you to read based on the amounts of dialog and action, the density, the pacing, and a couple of other factors. The books currently in the beta are heavily skewed towards science fiction, which is a genre I haven't really read in since I was 16, so I can't say how well it works based on what's in there now. But kudos for the general concept!

The Slashdot story has links to more background on the project and the person behind it.

Be a Data Geek, Win a Prize!

The British government is going to give 20,000 pounds to the person who comes up with the best idea for mashing up and re-using the reams of data collected by said government.

I checked the rules, and it's not limited to British citizens. The deadline is in September. Full details here.

It took me a while to decide whether this new Skewz site is brilliant or one of the signs of the apocalypse, but I think I've come down on the side of "brilliant."

Basically, the point of Skewz is to use the wisdom of crowds to make explicit the bias that exists implicitly in the media, while also functioning sort of like Digg, aggregating stories that people find interesting. People submit stories and then get to vote on how much and in which direction they think the stories are skewed.

But Skewz isn't designed to let people only see the stories that they agree with ideologically—the pages are divided up into one column for "liberally skewed" stories and one for "conservatively skewed" stories, which lets readers see both sides of the news right next to each other. Unfortunately Skewz doesn't seem to actively pair conservative and liberal stories on the same topic, but still, it's useful to be able to see what both sides are talking about on any given day—especially since quite a bit of media bias isn't in how a story is covered but in what people think is newsworthy in the first place.

And, one of the best parts: it aggregates people's skew ratings for each story and uses them to create a giant chart showing the ideological skew of quite a few of the major newspapers and blogs on 20 different issues.

So of course this comes out after I quit editing the books where I had to pair opposing viewpoints and was always running around trying to find somebody arguing the other side of some mildly obscure issue....

Hat tip: Marginal Revolution (again!)

Well That's New and Interesting

While doing a search in Google today I got the following line at the bottom of the first page of results:

"In response to a complaint we received under the US Digital Millennium Copyright Act, we have removed 1 result(s) from this page. If you wish, you may read the DMCA complaint that caused the removal(s) at"

I wonder if Google has been doing this for a while and today was simply the first time I ran a search that hit one of these, or if this is a new thing?

There Has Got to Be an Easier Way

Tyler Cowen at Marginal Revolution has a post up on how he finds new books to read. (If you're not reading Marginal Revolution yet—and you should be!—Prof. Cowen reads scarily massive numbers of books on wildly divergent topics.) Here's just a partial list of the things he does to find new books (bracketed expansions of acronyms are mine): “visit Borders every Tuesday to look for new books, go to a local public library every other day and scan the new books section, subscribe to TLS [Times Literary Supplement], London Review of Books, New York Review of Books, noting that you should spend more time with the ads than the book reviews, read the blogs Bookslut and Literary Saloon, read the new magazine BookMark (recommended), read the NYT [New York Times], FT [Financial Times], and Guardian and their books sections....” (FYI, he's a professor of economics, not English, so it's not like keeping up on new fiction is part of his job description.)

This reminded me of a post I've been meaning to write for a while about the person I know who had the most trouble finding new books to read: my grandmother. Well, no, let me rephrase that slightly. She outsourced the job of finding books for her to read to my mother and me, so really it was us who had the book-finding problem. My grandmother went through a book about every day and a half, so every two weeks, when my mother and I went to the library, we had to find about 10 books for her. And she was a picky reader. She liked love stories best, but only if there was no sex or bad language in them; she would read lighthearted mysteries, like the Mrs. Pollifax and Cat Who series, but nothing violent or dark or scary.... It was really all that my mother and I (and the wonderful librarians at the Middletown Public Library) could do to keep her in books. Our saving grace was that, once three or so years had passed, she would forget that she had read a book, so we could give it to her again.

How did we keep track of which books she had already read? At that time the library kept a card in the back of each book with the library card numbers of everyone who had checked out the book, along with their respective due dates. So we just had to look for our library card number in the list and see how long it had been since we'd last checked that book out. "Horrors!" I can hear the librarians out there thinking. "Freedom to read! It is unethical for the library to keep records like that of who has checked out a specific book! What if the government subpoenas those records?" I understand the logic behind that position (although I also don't think that my grandmother really would have cared if the government found out about her love of the novels of Janette Oke), but at the same time, I can't even imagine how we would have kept my grandmother in books without those records. We lived out in the boondocks, so it wasn't like we could make a mid-week run to the library and get more books for her if we accidentally brought home a pile of books that she had read recently. The library was too small for us to give her exclusively new books, and ILL was no solution: in the dark days before Amazon, it was hard to get much information about the content of a book without holding it in your hands. (Library of Congress subject headings don't really tell you things like, "How graphic are the murders in this murder mystery?" or "Is there out-of-wedlock sex in this romance novel?")

So when I say that libraries really ought to consider doing something Amazon-like to help people find books that they might like, I'm thinking of all of the time that my mother and I spent over the years flipping through books to decide if my grandmother might like them or not and scrutinizing the little card in the back to see how recently she had read them. If you estimate that we spent half an hour every two weeks doing that for probably thirty years (well, I only participated in this process for about ten years, but I think my mother did it for close to thirty)...that's a lot of time that could have been saved if the library catalog had had some system for saying, "People who like the kind of books that you like also like these new books." And that's one of the ethical values of librarianship too, right? Saving the time of the reader?
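For a rough sense of scale, the estimate above can be worked through in a few lines of Python. This is only a back-of-the-envelope sketch: the half hour per trip, the two-week interval, and the thirty years come from the paragraph above, while the 52-week year and the function name are assumptions made for illustration.

```python
# Back-of-the-envelope estimate of the time spent screening books by hand.
HOURS_PER_TRIP = 0.5       # "half an hour every two weeks"
WEEKS_BETWEEN_TRIPS = 2
YEARS = 30                 # "probably thirty years"
WEEKS_PER_YEAR = 52        # assumption: a plain 52-week year


def total_screening_hours(hours_per_trip=HOURS_PER_TRIP,
                          weeks_between_trips=WEEKS_BETWEEN_TRIPS,
                          years=YEARS,
                          weeks_per_year=WEEKS_PER_YEAR):
    """Return the total hours spent, given the per-trip cost and the schedule."""
    trips_per_year = weeks_per_year / weeks_between_trips
    return hours_per_trip * trips_per_year * years


if __name__ == "__main__":
    hours = total_screening_hours()
    print(f"Roughly {hours:.0f} hours over {YEARS} years")
```

Under those assumptions the total comes to 390 hours, or nearly ten 40-hour work weeks, spent on a task that a recommender in the catalog could plausibly have shortened.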

(Oh, and for another argument for why libraries ought to be taking a page from Amazon's book, check out the comments to Prof. Cowen's post. Count the number of people who advise using various Amazon features to get good book recommendations, and the number who know enough about how those methods work to recommend ways to improve the Amazon recommendations. Now count the number of people who say anything at all about libraries/librarians as a way of finding new books. By my count, the ratio is 7 to 0.)

Another Benefit of E-books/E-articles/etc.

They're much easier to move.

I'm just about finished packing. Just out of curiosity, I counted the boxes of information that I'm moving—not counting documents that I need to keep for legal or record-keeping reasons or anything like that, just books and papers that I'm keeping purely for the value of the information in them.

The tally:

  • 4 milk-crate-sized boxes of notebooks/papers/printed articles/photocopies/etc. from my undergraduate and graduate courses

  • 1 milk-crate-sized box of coursepacks from my undergraduate courses

  • 17 smallish boxes of books

  • A half-box of cookbooks

  • Two years' worth of American Libraries and Information Technology in Libraries

I hereby resolve to think about moving 22-boxes-plus worth of paper next time I'm tempted to print an article to write notes on it or to buy a paper book that I could get as a (not DRMed-to-death) e-book. Henceforth (or at least until I'm settled into someplace that I have no intention of moving out of ever again) all of my information is going to be electronic.

Don't Trust Everything You Read in Books, Part 274

A new biography of King Louis XIV's mistress, Madame de Maintenon, got the whole way through the editorial process at Bloomsbury (one of the more prestigious British publishing houses) without anybody noticing that one of its sources, a “diary” supposedly written by the king, was actually historical fiction.

Stories like this (as well as 6.5 years of working in the publishing industry) are a big part of why I worry that traditional information literacy instruction does a disservice to students by encouraging them to rely on external authority cues (Was it published by a reputable publisher and/or in a peer-reviewed journal?) rather than on internal accuracy cues (Do their numbers add up? Can you track down and verify their sources?) when evaluating information. Not everything that's made it through the editorial process is true, and not everything on the Web is false, and it seems like students would be much better served by learning how to evaluate the truth of the message rather than the “trustworthiness” of the medium.

More on Twine

I'm still playing with Twine, and it's growing on me. It's like on steroids.

So, I've started a twine called The Examined Web, where I'm going to collect stories about the social, economic and political impacts of Web 2.0 / the Semantic Web / [insert other Web-related buzzwords here]. That means that I probably won't be doing Link Roundups here anymore, unless I've got a group of stories that I want to comment on and not just point people to. So if you're interested in that kind of stuff, join the twine! You have to join Twine itself first, but I've got 10 invitations for it, so leave a comment here or e-mail me and I will make sure you get one if you need it.

And by the way, if you're interested in the technical side of the Semantic Web and you're planning on joining Twine, you should check out Apps :: On Semantic Web & Related Applications. They've got some great stuff.

(Update: links to the twines added, although I'm not sure what happens if you click on those links and you haven't joined Twine yet....)

The Cult of the Amateur Blogger Makes a Comeback!

I know I've argued that blogs deserve more respect than they get from some people, but even I think that this is going a bit too far.

The Cult of the Amateur Blogger No More?

Some people have been arguing that blogging is going to kill traditional journalism because free bloggers will undercut paid journalists.

Today, Megan McArdle points out that most of the good amateur bloggers have now been hired by one media corporation or another.

Anecdotes != data, but it's an interesting anecdote nonetheless.

Six months ago I posted about how Twine was going to make my life easier, once I got my beta invitation.

Well, it finally came today, and I've spent the past hour and a half poking around in it.

It's definitely still in beta (real beta, not Google's "beta-in-perpetuity" beta), and the algorithms they're using to pull metadata out of free text still need some work, but I can definitely see the potential in it. But I'm not sure that I see as much potential as the stories from 6 months ago were promising. For what I'd primarily be using it for (organizing my personal research/bookmarks), Zotero has it beat by a mile at this point—and that will jump to about 10 miles once Zotero gets around to launching the server sync and recommender services that they're promising.

Ah well. I will probably play around with it a little further, as it moves out of beta into something that's actually supposed to be fully functional, to see how it winds up.

Access to Information in the Third World

There's a nice article in this weekend's New York Times Magazine about how enhanced access to information improves living conditions in the third world. "Ah hah," I'm sure all of you librarians out there are now thinking, "Further proof of how wonderful libraries are!" Actually, no: the story is about how cellphones allow the global poor easy access to information that was either completely unavailable or prohibitively expensive before. (It's an interesting mental exercise for a librarian, actually, to read this article and try to figure out how a library could function to meet the sorts of information needs that are featured in it.)

By the way, the article mentions in passing the story of the fishermen of Kerala, India, who provided some of the first evidence of how important access to information is for the global poor. That story by itself is fascinating. If you're interested, here are a paper from the Quarterly Journal of Economics and a Washington Post article about the research that has been done on these fishermen.

(Yes, I am aware that it's been almost a month since I updated this blog. Yes, I am aware that this makes me a bad blogger. Blogging will become more regular once I finish packing up all of the junk I've acquired in the past 6.5 years and moving it 500 miles.)

Another Advertising-Related Link Roundup

There's been a big dust-up in Britain over the past couple of weeks about targeted advertising. As somebody who thinks that advertising-supported content is a good thing (and yes, I still do at some point intend to do a long and thoughtful post about why I think that), I don't necessarily find all of the arguments against targeted advertising to be persuasive. But I still think they're interesting and worth listening to.

So, here are a few of the better entries in the debate:

A Liberal Democrat politician says, in relation to targeted advertising on MySpace, “I think it's absolutely wrong if you haven't been notified and given the opportunity to opt out.” My take: Notification is definitely a good thing, but opt-out is a very different question. MySpace can only afford to give you a free account because advertisers are willing to give them money to show you advertisements. Why should you be able to say to MySpace, “I want you to give me my free account, but I refuse to help you make the money to pay for that account”? It seems to me that the opt-out option is, “If you don't like MySpace's advertising practices, don't use MySpace.”

I don't know much about the details of this Phorm system, but it's interesting that there's one story praising its privacy-protecting features, another a few days later saying that there isn't enough information to know if it's acceptable privacy-wise or not, and another one saying it's illegal. (More on Phorm.) And Tim Berners-Lee has come out against not only Phorm, but all systems that track online activity in order to provide targeted advertisements.

Yahoo! Search Is Going Semantic

Or so the Yahoo! Search Blog said yesterday, anyway. Details are still a little thin, but Dublin Core is first on the list of metadata schemas that will be supported....

Friday, March 7, 2008

More on Privacy, Data and the Government

Apropos of the comments I made a couple of weeks ago about privacy and the government, there's a nice article in Wired right now that addresses some of the same issues.

(Hat tip: Slashdot, where the comments are actually pretty interesting too.)

Tuesday, March 4, 2008

How Did I Not Know about This?

There is an entire blog dedicated to quirky numeric data, with a strong emphasis on visualizations thereof.

I am in heaven.

Thanks to Marginal Revolution for pointing this blog out and making my day.

Link Roundup

The Atlantic has a very interesting article on Internet censorship in China. (And there's also a Web-only interview with the author.)

The Encyclopedia of Life, an open encyclopedia that aims eventually to have comprehensive entries on every single species of living thing, has launched.

A Personal Announcement

As you've probably noticed by now, this isn't exactly a "personal" blog. But kindly indulge me as I make one personal announcement.

I am no longer a job-hunting librarian! As of July 1, I will be the data services librarian at Grinnell College.

And now back to our regularly scheduled blogging.

Another Book for My To-Read List

Chris Anderson (author of The Long Tail) has a new book coming out soon: Free: Why $0.00 Is the Future of Business. There's a long excerpt of it up on the Wired site now. I think he's overselling some of his points a little bit (talk to me about computing power being cheap and abundant enough to waste after you've tried running statistical tests on multi-gig datasets for a while), but it's still worth reading.

Link Roundup

Does researching in a Dewey decimal library lead to finding different information than researching in an LoC library? (Hat tip: The Monkey Cage) I haven't had a chance to read this article yet, seeing as how I'm currently between academic institutions and therefore don't have access to much of anything, but the abstract (at the link) looks quite interesting. (But I think the real question is, why is this research appearing in the Journal of Theoretical Politics?)

Another freely downloadable book (and one I've actually been intending to read!): Daniel Solove's The Future of Reputation: Gossip, Rumor and Privacy on the Internet.

Another filtering tool for managing RSS overload, called Persai. I haven't experimented with this one, but between the linked article and their blog, it sounds like it might be worth a look.

The internal dynamics of Wikipedia and Digg. Hint: they're not as democratic as you think (but that's not necessarily a bad thing).

Best cartoon ever. Seriously, this is going on my bulletin board. (Mouse over for the punchline.)

Reuters Gives the Semantic Web a Boost

Reuters recently announced that they're opening up their Calais service to the world.

Why should you care? Because Open Calais takes unstructured text documents, analyzes them, and automatically attaches semantically rich metadata. Fast. For free. It's an open API—any Web applications developer with half a clue can start using it right now.

I'm already plotting ways to use this for one of my freelance gigs. I get paid to read great gobs of news every day (like, the entire output of a couple of wire services type of great gobs), looking for news about business and political leaders that's noteworthy enough to justify updating their biographies in a certain biographical database. So, basically, I'm looking for stories about people getting hired/promoted/fired/elected/un-elected, retiring, or getting involved in legal cases. Which is, oh, maybe 5 or 10 percent of the stories on the wires; most of the stories on the financial side, for example, are about companies releasing their quarterly results, mergers and acquisitions, and other stuff that's about companies rather than people. So if Open Calais is good enough to reliably separate out the stories about people from the stories about companies, I should theoretically be able to set up an RSS feed that just contains stories about people and cut down on the number of headlines I have to skim by 90 percent. And I suspect I might even be able to use the Open Calais-generated metadata to write a script to run the names mentioned in the story against this biographical database automatically, so I won't have to check by hand to see if the people are included in the database or not. Yes, I'm definitely seeing a whole lot of ways that this makes my life easier.
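The filtering idea described above might look something like the following Python sketch. It is deliberately not tied to the real Open Calais API: the story dictionaries, the "entities" list, the "Person" and "Company" type labels, and all of the names and headlines are hypothetical stand-ins for whatever metadata the service actually returns, and the "database" is just a set of names.

```python
# Hypothetical, simplified metadata shape. This is NOT the real Open Calais
# response format; it only illustrates filtering stories by entity type.
SAMPLE_STORIES = [
    {
        "headline": "Acme Widgets names Jane Example as chief executive",
        "entities": [
            {"type": "Company", "name": "Acme Widgets"},
            {"type": "Person", "name": "Jane Example"},
        ],
    },
    {
        "headline": "Acme Widgets reports quarterly results",
        "entities": [{"type": "Company", "name": "Acme Widgets"}],
    },
    {
        "headline": "John Placeholder retires from Initech board",
        "entities": [
            {"type": "Person", "name": "John Placeholder"},
            {"type": "Company", "name": "Initech"},
        ],
    },
]

# Stand-in for the biographical database: just a set of known names.
KNOWN_NAMES = {"Jane Example"}


def people_in(story):
    """Return the names of all entities tagged as people in one story."""
    return [e["name"] for e in story.get("entities", [])
            if e.get("type") == "Person"]


def people_stories(stories):
    """Keep only the stories that mention at least one person."""
    return [s for s in stories if people_in(s)]


def names_in_database(story, known_names):
    """Return the people in a story who already have a database entry."""
    return [name for name in people_in(story) if name in known_names]


if __name__ == "__main__":
    for story in people_stories(SAMPLE_STORIES):
        matches = names_in_database(story, KNOWN_NAMES)
        print(story["headline"], "->", matches or "no existing entry")
```

A filter this crude would still let through company stories that happen to quote an executive, so in practice it would probably need a second pass on the kind of event being reported (hiring, retirement, lawsuit) before it cut the skimming load by anything like 90 percent.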

And that's not even getting into the potential library applications of this. Automated or semi-automated subject analysis of electronic documents, without having to shell out thousands of dollars for the commercial indexing software that's generally used for that task in special libraries, anyone?

Told You...

...that a rainbow of books would be pretty spiffy.

Proof (found via LISNews).

Substantive blogging has been derailed by the strep infection from Hades, but theoretically at some point I should be feeling better and able to string coherent thoughts together again . . . right? In the meantime, enjoy the pretty pictures.

Free Books

There's a pretty detailed article in the current First Monday about the experiences of a group of scholars who published an open-access book series.

And Tor is giving away free e-books if you're willing to give them your e-mail address. Good free e-books, from the looks of things! (It's been a while since I've had the time to read fiction, but I've been hearing good things about Scalzi.)

Monday, February 4, 2008

Speaking of Suppressing Information...

... of the New York City government apparently want to get in on the action. (Not suppressing eBay user ratings. Suppressing information that might “lead to excessive false alarms and unwarranted anxiety.”)

EBay Suppresses Information

EBay will no longer allow sellers to give negative feedback on buyers. I understand their stated rationale—that buyers will give more honest feedback if sellers can't retaliate by giving bad feedback to critical buyers—and it may well be the case that eBay is making a logical decision from a business standpoint. (Far be it from me to be too critical of the business logic of a company that's raking in billions of dollars a year!) But from a librarian's perspective, I find it unfortunate: sellers have a legitimate need for information about potential buyers, and eBay is about to take away the method that sellers have been using to fulfill that information need.

Of course, I also suspect that it will take about a week for the sellers to band together and start an independent site where they will share notes on bad buyers. Information has a habit of being difficult to suppress....

Microsoft is seriously considering a hostile takeover of Yahoo.

I'm still on the road and don't have time to do any real research into this right now, but my gut feeling is that this will not end well for consumers.

Monday, January 28, 2008

Seeing Like a Librarian

Blogging will be light for the next two weeks, as I will be traveling. However, I'm going to take advantage of being trapped in airports, airplanes and other Internet-free places to slack off on my paying work and do something I've been meaning to do for a while: re-read James C. Scott's Seeing Like a State: How Certain Schemes to Improve the Human Condition Have Failed. I read this book 9.5 years ago as an undergrad and haven't re-read it since, but conclusions from it keep popping into my head in relation to the library-land discussions on privacy and on tagging/folksonomies. At its heart the book is a critique of the high modernist tendency in 20th century global politics, but (as I remember) it has some really interesting things to say about both the dangers of governmental attempts to keep close statistical tabs on citizens and the dangers of ignoring local folk knowledge when constructing a view of the world. So, hopefully I'll be back soon with something insightful to say about all of that.

Feel free to get the book yourself and read along if you're so inclined. It's quite accessible to people who aren't into political theory, and it's actually really interesting.

More Great Free Ad-Supported Content

The Atlantic has joined the trend: they will now make all of their content available free online.

Corporate Trust and Business Information

Apparently I'm in good company when I trust Google more than the government: a new study shows that amongst the college-educated elite in 18 countries, trust in business is higher than trust in government or the media. (Link goes to the Financial Times, so it will go behind a pay-wall at some point.)

Another interesting tidbit from the article: “When young US opinion leaders were asked to choose the most credible source for corporate information, a surprising 55 per cent mentioned Wikipedia.”

Why I Trust Google

Siva Vaidhyanathan, proprietor of the blog The Googlization of Everything, has a brief comment today on Google's open-source data project that I blogged about yesterday. As usual, I completely disagree with his comment, but it got me thinking about something.

Vaidhyanathan seems to think that it would have been better if the government had set up something like this as a public service, rather than having a company such as Google do it. The broader version of this sentiment (that the government is more trustworthy than corporations) is pretty common, I think, and I find it very curious. The vast majority of the time, the government poses a far bigger danger to you than any corporation ever could, simply because the government has the guns and the jails and the authority to use them on you: “a monopoly on the legitimate use of violence,” to use the political science term for it. Corporations that make use of data mining or collaborative filtering or other tools that let them learn about your tastes and habits may be able to annoy you with eerily targeted advertisements, but only the government can take that data and decide that the pattern indicates that you're a criminal and ruin your life with it.

The exception to that dichotomy, of course, is the corporations that collaborate with the government in the “deciding that you're a criminal and ruining your life” department: think the RIAA, MPAA, and Microsoft. But for the most part the corporations that collaborate with the government in that way are the corporations that sell digital products directly to consumers and that need the threat of government punishment to keep consumers from pirating their stuff. Google (wisely, I think, for reasons that I will discuss in a subsequent post) has realized that more money can be made more easily by selling advertisements to third parties rather than selling stuff to customers. In that situation there's no reason for Google and its users to have an antagonistic relationship with each other, because economically they're both on the same side: both benefit when the users get what they want, which is free and copious access to Google's content and services.

So that's the short version of why I trust Google with my data: because they don't themselves have the power to harm me with it, and because they have no incentive to collaborate with governmental agencies that could harm me with it. The fact that Google has a history of going to court to fight the government when it tries to get data from them shows, I think, that Google itself knows on which side its bread is buttered.

Along those lines, I'm interested to see how Google's expansion of its DC lobbying operations plays out. I'm mildly concerned that Google might wind up a little too cozy with the government, but I'm more intrigued by the possibility of having Google running around DC fighting on behalf of its users (who, remember, are on the same side as Google) against the government and the corporations that collaborate with it.

Another Good Thing from Google

I have forgiven Google for (possibly) mining my e-mail to recommend new blogs to me. (Well, I was never really all that upset with them about it in the first place.) (It's still only recommending library-related blogs in its top 3 recommendations, by the way.)

Why is Google back in my good graces? Because they're opening a new service that will host large scientific datasets for free access. Being a data geek, I'm all in favor of having more data to slurp up and play with. Although, I wonder if they're ever going to have data in the social sciences or psychology? I suspect not, unfortunately, because of the potential human subjects/ethics problems. But still, this is very cool!

Learn Something New Every Day

Yesterday's new thing that I learned: the technical term for the sort of recommender tools I blogged about a couple of weeks ago is “collaborative filtering.”

Today's new thing that I learned: Google Reader is now using collaborative filtering to recommend new blogs to me. Curiously, the top three recommendations that it had for me were all library blogs, despite the fact that library-related blogs are a distinct minority in my blog subscriptions. However, my Gmail account is absolutely overflowing with library-related stuff, because I subscribe to a whole bunch of library-related listservs. So now I'm wondering . . . is Google using information skimmed from my e-mail to suggest blogs to me? The Google Reader FAQs say no, but I remain suspicious—and really curious about the data and algorithms they're using.
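For readers who haven't met the term before, here is a minimal Python sketch of one simple flavor of collaborative filtering: score each blog you don't yet read by how similar its subscribers' reading lists are to yours. Nothing here reflects how Google Reader actually works (the post doesn't know, and neither does this sketch); the subscription data, the blog names, and the choice of Jaccard similarity are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical subscription data: reader -> set of blogs they follow.
SUBSCRIPTIONS = {
    "me": {"econ-blog", "data-blog", "library-blog-a"},
    "reader1": {"econ-blog", "data-blog", "viz-blog"},
    "reader2": {"library-blog-a", "library-blog-b"},
    "reader3": {"cooking-blog"},
}


def jaccard(a, b):
    """Overlap between two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend(target, subscriptions, top_n=3):
    """Suggest blogs the target doesn't follow, weighted by reader similarity."""
    mine = subscriptions[target]
    scores = Counter()
    for reader, feeds in subscriptions.items():
        if reader == target:
            continue
        similarity = jaccard(mine, feeds)
        if similarity == 0:
            continue  # readers with nothing in common contribute nothing
        for feed in feeds - mine:
            scores[feed] += similarity
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [feed for feed, _ in ranked[:top_n]]


if __name__ == "__main__":
    print(recommend("me", SUBSCRIPTIONS))
```

In this toy data the library blog ranks below the visualization blog because only one of the target's three subscriptions is library-related. A recommender that put library blogs in all of its top slots for such a reader would have to be drawing on some other signal, which is exactly the suspicion raised above.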

I admit to being a critic of the cataloging status quo much of the time, but this has me even more baffled than usual. Does anybody know why the Library of Congress is reclassifying Scottish literature as English literature? I mean, not only is this problematic for all of the reasons listed in the article, but it is distinctly possible that Scotland will become independent from the U.K. in the next 10-15 years. (Not inevitable, but distinctly possible—more likely than Quebec becoming independent, but less likely than Kosovo, let's say.) Which seems rather problematic, no?