Sunday, July 25, 2010

Don't just rip a book, don't just rip a bookshelf - rip the library

The BookLiberator, another cheap DIY book scanning kit, has just been announced. There's even a site dedicated to DIY book scanners.

The relevance of mass digitisation for libraries has been highlighted by a report commissioned by CLIR: "On the Cost of Keeping a Book," by Paul Courant and Matthew "Buzzy" Nielsen, contained within the document "The Idea of Order: Transforming Research Collections for 21st Century Scholarship" published in June 2010.

Courant and Nielsen analyse the full costs of storing physical books, and conclude that for a typical book held by a US research library in an open stack, the "fully loaded" cost is $US4.26 per annum. Moving the book to a high-density storage facility after 20 years reduces this cost to $US1.99 pa in perpetuity, assuming very low usage of the information it contains (that is, very low circulation).

Drawing on the experience of the Hathi Trust, Courant and Nielsen estimate that the comparable "fully loaded" cost of storing a black-and-white ebook in a mirrored digital archive with tape backup is less than $US0.15 pa, rising to $US0.40 pa for a full-colour ebook with a third mirror site added.
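To put the figures quoted above side by side, here is a rough sketch of the arithmetic. The dollar amounts are the ones cited from Courant and Nielsen; the dictionary and function names are my own, purely for illustration, and the black-and-white ebook figure is treated as an upper bound because the report says "less than".

```python
# Per-book annual storage costs in US dollars, as quoted from Courant and Nielsen.
ANNUAL_COST_USD = {
    "open stack": 4.26,
    "high-density storage": 1.99,
    "black-and-white ebook": 0.15,  # upper bound: the report says "less than"
    "full-colour ebook, three mirrors": 0.40,
}


def cost_ratio(physical: str, digital: str) -> float:
    """How many times more the physical option costs per year."""
    return ANNUAL_COST_USD[physical] / ANNUAL_COST_USD[digital]


def annual_saving(physical: str, digital: str) -> float:
    """Dollars saved per book per year by holding the digital copy instead."""
    return ANNUAL_COST_USD[physical] - ANNUAL_COST_USD[digital]


if __name__ == "__main__":
    for physical in ("open stack", "high-density storage"):
        for digital in ("black-and-white ebook", "full-colour ebook, three mirrors"):
            print(
                f"{physical} vs {digital}: "
                f"{cost_ratio(physical, digital):.1f}x the cost, "
                f"${annual_saving(physical, digital):.2f} saved per book per year"
            )
```

On these numbers an open-stack book costs roughly 28 times as much to keep as a black-and-white ebook, and even high-density storage costs about 5 times as much as a full-colour ebook held on three mirrors.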

A very significant difference, but the real "total societal" cost differential is far greater.

Consider the 3.1 million monographs held by the National Library of Australia in Canberra. Less than 2% of the Australian population has convenient physical access to the NLA, so getting access to a book at the NLA for a typical Australian is very expensive if the person needs to travel to the book, and expensive and slow if they are fortunate enough to be able to get the book to travel to them through an inter-library loan.

The 2001 inter-library loan benchmark by the NLA revealed the average cost to the participating libraries of a loan was over $A49, and the average delay between the request being made and the reader being informed that their requested book was waiting for them at their library was over 11 days.

But a scanned and OCR'ed version of the book, an ebook, could be delivered almost instantly and for negligible cost to the reader (or, if the reader did not have a means of receiving or reading it, to their library). It could be delivered in a format which was easier to read (reader selectable font) and which supported searching, annotation, copying and pasting and hyperlinking. It could be made available to 2 or more readers simultaneously.

But the primary benefit to libraries is the greatly reduced cost of storage and access.

The Internet Archive claim it costs about $US30 to digitise a book and store it in perpetuity using their widely deployed hardware and software. That's much less than the cost of one inter-library loan.

The Internet Archive recently announced a new digital lending library service with 3 categories of books, two of which are not controversial:

  • downloading public domain books
  • linking to the commercial OverDrive service for "current" books made available through that service by the reader's library

But a third, much smaller selection of out-of-print but in-copyright books has been scanned and made available for anyone to freely download and read for 2 weeks. After 2 weeks, the downloaded copy can no longer be read and the ebook becomes available for someone else to download.

There are currently fewer than 200 books in this third category. Eric Hellman speculates that the Internet Archive's founder, Brewster Kahle, must have expensive legal advice, and that perhaps this ploy is bait for the publishers.

And 200 books won't change much; they are just noise compared with the number of in-print and in-copyright books which have been "liberated" and circulate on peer-to-peer networks.

It is very likely that very soon, Google Editions will begin making cloud-hosted versions of in-copyright books available for an average price of $US6, of which about $US3.80 will be made available to the Book Rights Registry for distribution to the rights holders.

Libraries are funded to preserve and circulate books. They perform an essential role in enriching our society by making information and entertainment available to all. By storing and circulating ebooks rather than physical books, libraries can probably save around $4 per book per year and simultaneously provide a better service to their readers. Even a system which allowed just one electronic copy of each in-copyright book to circulate would provide a better service and be much cheaper than the current physical storage and circulation system.

But what if the savings made from going "e" were made available to purchase additional "copies" for simultaneous circulation? Or what if rights holders could be compensated according to the circulation of their creations?

For decades, the Australian Government has run a public lending right program which makes payments to creators and publishers based on the physical holdings of their books in Australian libraries.

It's now time for libraries to provide a better service for their readers and reduce their own costs by digitising their collections.

It's time for libraries to build on the technology of their new commercial competitors and to spend fewer resources on shuffling and storing blocks of paper and more on encouraging and rewarding those who produce the content.

Unlike the costs of storing and circulating books, the benefits of a better informed citizenry are incalculable.

From the Internet Archive Digital Lending Library announcement:

"As the first American library to lend books, we believe it is only
fitting that we extend and upgrade this basic, yet crucial service in
the digital age," said Tom Blake, Digital Projects Manager, Boston
Public Library. "We hold the third largest research collection in the
country, much of which is available at our buildings only during
business hours. Digital lending allows us to circulate these rare,
precious, and unique holdings into our local neighborhoods and
beyond – anytime, anywhere, free to all."

Saturday, February 28, 2009

Google Book Settlement doesn't address the hard problem

The Google Book Settlement (GBS) defines an arrangement between the Association of American Publishers (AAP), the Authors Guild and Google which allows Google to digitise and sell access to out-of-print books which are still subject to copyright, and to share the proceeds with the rights holders.

It's easy to see what AAP and the Authors Guild were thinking: books, like all information, are going 'e'; let's monetise these lazy assets; and if you can't beat them, join them. But AAP and the Authors Guild are "joining" Google like the Celtic Gauls joined the Roman Empire.

It probably costs Google about $90 to digitise each book covered by the GBS: $60 in up-front payment to the Book Rights Registry (BRR) and around $30 to perform the digitisation (*).

I'm not sure what it costs to author, edit, lay out, proof-read and index the typical book, but I've seen estimates that it's typically many tens of thousands of dollars. That is, the difference between the costs of digitisation and production is around 3 orders of magnitude.

But the split between Google and the BRR is 37%:63%. That is, despite costs hundreds or thousands of times higher, rights holders get less than twice Google's share of the income produced.

At an average sales price of $6, Google need only $90 / $6 / 0.37 ≈ 41 sales of a title to recoup their costs (**). Rights holders need more like 15000 sales to recoup theirs. The risk/reward split looks lopsided and hence unstable.
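For anyone who wants to check or vary the break-even figures, here is a minimal sketch. The $90 cost, $6 price and 37%:63% split come from the text above; the function names are hypothetical, and the production cost is not an independent estimate but is simply backed out of the 15000-sale figure.

```python
import math

AVERAGE_PRICE = 6.00    # average sale price per title, in dollars
GOOGLE_SHARE = 0.37     # Google's share of each sale
RIGHTS_SHARE = 0.63     # share passed to the Book Rights Registry
GOOGLE_COST = 90.00     # $60 up-front payment plus about $30 to digitise


def break_even_sales(cost: float, price: float, share: float) -> int:
    """Smallest whole number of sales whose share of revenue covers the cost."""
    return math.ceil(cost / (price * share))


def income(sales: int, price: float, share: float) -> float:
    """Income received by one party from a given number of sales."""
    return sales * price * share


if __name__ == "__main__":
    print("Google breaks even at",
          break_even_sales(GOOGLE_COST, AVERAGE_PRICE, GOOGLE_SHARE), "sales")
    # Production cost implied by a 15000-sale break-even for rights holders.
    implied_cost = income(15000, AVERAGE_PRICE, RIGHTS_SHARE)
    print(f"Rights holders' implied production cost: ${implied_cost:,.0f}")
    print(f"Google's income from 100 sales: "
          f"${income(100, AVERAGE_PRICE, GOOGLE_SHARE):.2f}")
```

The 15000-sale figure corresponds to a production cost of a little under $57,000 per title, which sits comfortably within "many tens of thousands of dollars".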

Maybe that's fine - after all, these books are out-of-print, and the rights holders have presumably already got all the revenue they can from these works? Well, no. All we can deduce from the fact that a book is out-of-print is that it is no longer commercially advantageous, given the high costs of producing, moving and selling physical books, to bother printing, distributing and selling it. Old books have to make way for the new on the book-store shelf.

The problem with the settlement is that given the reality of inevitable piracy of digitised books, the interests of rights holders and Google are seriously misaligned. Google has little incentive to be very worried about piracy, and in any case, they're smart enough to know there's nothing they can do about it. All they need is to sell 40-odd copies (or get equivalent per-book institutional subscription revenue to their book database) and they're in the black. If they sell 100, they've made back nearly two and a half times their investment, whereas the rights holders haven't even covered the costs of the layout artist.

Digitised books from the Google repository will be pirated and there's nothing that can be done about it. DRM won't help a bit: copies will be untraceable and watermarks will be removed (***).

In the short term, those lucky enough to have personal wealth or an institutional affiliation (or those happy to use pirated copies) will enjoy previously unimaginable access to our written culture, albeit on terms set by a for-profit corporation. But in the long term, we'll all suffer as the incentives to produce are reduced by uncontrollable piracy.

As Kevin Kelly says:
The internet is a giant copy machine ... a super-distribution system, where once a copy is introduced it will continue to flow through the network forever, much like electricity in a superconductive wire. We see evidence of this in real life. Once anything that can be copied is brought into contact with internet, it will be copied, and those copies never leave. Even a dog knows you can't erase something once it's flowed on the internet.

As Paul Krugman says:
Bit by bit, everything that can be digitized will be digitized, making intellectual property ever easier to copy and ever harder to sell for more than a nominal price. And we’ll have to find business and economic models that take this reality into account.

The Google Books Settlement does not take this reality into account. Rather, it is a short term commercial play which helps to cement Google's pre-eminent position in the information business.

Google isn't being evil, or even tricky; it's just being rational. I assert that what the AAP, the Authors Guild and our society are really looking for is what Krugman describes as "business and economic models that take this reality into account", and sustainable ones at that.

One attempt to come up with a model that fits Krugman's specification does so by incorporating compulsory licensing with free, easy and anonymous access and downloading of digitised materials administered by commercially disinterested parties and funded by general taxation. More details are here.


* The Internet Archive asserts it costs them around $30 to digitise a typical book by scanning from paper and store the digitised copy.

** The settlement claims that about half the books covered will be offered for sale for $5.99 or less.

*** Copies will be downloaded by individuals with access to large institutional subscriptions (e.g., university students using their library's access) and programmatically combined with other copies to locate and remove or blur watermarks. The costs of piracy are near zero as everything can be automated (see, for example, the Google Book Downloader, which automates the process of creating a local PDF copy of books on Google whose pages can be viewed). The music industry has learnt that neither DRM nor attacking P2P networks materially helps.