Universal Access to All Knowledge
Universal access to all knowledge, declared Brewster Kahle, Digital Librarian, Director, and Co-founder of the Internet Archive, will be one of humanity’s greatest achievements. We are already well on the way. “We’re building the Library of Alexandria, version 2. We can one-up the Greeks!”
Start with what the ancient library had—books. The Internet Archive already has 3 million books digitized. With its Scribe Book Scanner robots—29 of them around the world—it is churning out a thousand books a day, digitized into every handy ebook format, including robot-read audio for the blind and dyslexic. Even modern, heavily copyrighted books are being made available for free as lending-library ebooks you can borrow from physical libraries—100,000 such books so far. (Kahle announced that every citizen of California is now eligible to borrow online from the Oakland Library’s “ePort.”)
Around 300 BC, the Library of Alexandria was founded for a similar purpose: to store all human knowledge. It was based on a technological advance, namely papyrus scrolls. Before that, knowledge was stored on clay tablets, which were hardly a compact way to store things. With scrolls, you could actually store hundreds of thousands of pieces of information in one place. We have had a similar technological breakthrough in recent decades with computers and the Internet: computers can store vast amounts of data, and the Internet can make it available.
“What is the Library of Alexandria most famous for?” Kahle asked. “For burning! It’s all gone!” To maintain digital archives, they have to be used and loved, with every byte migrated forward onto new media every five years. For backup, the whole Internet Archive is mirrored at the new Bibliotheca Alexandrina in Egypt and in Amsterdam. (“So our earthquake-zone archive is backed up in the turbulent Mideast and in a flood zone. I won’t sleep well until there are five or six backup sites.”)
We also have the political will to complete such a project. We want to live in an open society, where the idea of education and access to knowledge is supported. Sure, there are forces trying to restrict openness, but “by and large the idea is supported.”
So, for the first time in history, we’ve got all three things required: the storage, the network and distribution mechanism, and the political will. It’s uncertain how long this opportunity will last, so we’ve got to act fast.
To answer the question of whether such a project is indeed doable, we need to find answers to four questions:
- Should we do it?
- Can we?
- May we?
- Will we?
So, should we?
The hand-waving answer: Yes, we should.
You could dwell on the issue for as long as you like, weighing the pros and cons, never coming to a conclusive argument. But it’s a project that might be up there with humanity’s greatest achievements, matching or even surpassing the dream of the Library of Alexandria. Right now we have a window of opportunity to complete the project, and the window might close in a few years. So let’s give this project a shot; if it turns out to be a bad idea, let’s find that out.
The other open question is “Will we?” Brewster left the answer to that as “an exercise for the reader.” He and a bunch of other people will devote their time and money to the project, but their resources are limited. Volunteers are needed.
As for the questions “can we?” and “may we?”: there are obstacles, but they are not insurmountable, as we shall see.
We can divide these questions further by the kind of data we need to archive. The categories are text, audio, moving images, software, and the web. Let’s go through these one at a time.
Text, text, text
So, how many books are there?
The Library of Congress, which has the largest collection of books, holds around 24 million books. One book takes about one megabyte of storage. So digitally storing all the books in the Library of Congress would take a whopping 24 terabytes of storage space. (If you also need to store images of the pages, it takes a bit more, but that’s still within our reach.)
You can store 24 terabytes of data in a stack of Linux boxes that takes about a square meter of floor space and costs approximately $60,000. All in all a doable number, nothing out of our reach.
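As a rough sanity check on those figures, here is a back-of-the-envelope calculation; the per-book size and the $60,000 price tag are simply the numbers quoted in the talk, not independently verified:

```python
# Back-of-the-envelope check of the Library of Congress figures above.
# Assumptions (straight from the talk): ~24 million books, ~1 MB of text per
# book, and ~$60,000 for the commodity Linux boxes holding it all.
books = 24_000_000
bytes_per_book = 1_000_000            # ~1 MB of plain text per book

total_tb = books * bytes_per_book / 1e12
cost_per_tb = 60_000 / total_tb

print(f"Total text: ~{total_tb:.0f} TB")             # ~24 TB
print(f"Storage cost: ~${cost_per_tb:,.0f} per TB")  # ~$2,500/TB at those prices
```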
Then there’s the trouble of converting all those books into a digitized format: you need to scan them. There are projects already doing this in which The Internet Archive is involved, such as the Million Book Project and Project Gutenberg.
There are a couple of ways to do the scanning. One is to do it by hand: a person flips through the book and “takes a picture” of each page. This has been going on in India, where it costs about ten dollars a book.
In Western countries, where labor is more expensive, they have set up so-called in-library scanning systems, where local people do the scanning, as well as automated systems consisting of robots and custom-made rigs. The price per book is still a bit higher, but they are making progress and building on the experience gained in India.
What about making the books available? Reading books on computer screens kind of sucks, so TIA has a project called the Bookmobile. Bookmobiles are trucks with a satellite dish on top and printing facilities inside. One Bookmobile costs around $15,000, including the vehicle, which is not that much. A Bookmobile downloads a book from the archive over the satellite connection, then prints and binds it. The whole thing costs about $1 per book (assuming the book is black and white and around 100 pages). Local people can run the whole operation with just a week’s training. They also found that lending a book from the Harvard library costs, all told, around $2, so it might actually be cheaper to print and bind a book and give it away. (However, see Chris Lightfoot’s notes on the subject.) So we can conclude that making the books available is also well within our grasp.
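To make the “$1 per book” claim concrete, here is a rough cost sketch; the per-page and binding figures are assumptions for illustration, and only the two totals come from the talk:

```python
# Rough cost comparison for the Bookmobile figures above.
# The per-page and binding costs are assumed for illustration; the ~$1
# print-and-bind total and the ~$2 Harvard loan cost are from the talk.
pages = 100               # black-and-white book, as assumed in the talk
cost_per_page = 0.009     # assumed paper + toner cost per page
binding = 0.10            # assumed binding materials per book

print_and_bind = pages * cost_per_page + binding
harvard_loan = 2.00

print(f"Print and give away: ~${print_and_bind:.2f} per book")  # ~$1.00
print(f"Lend from Harvard:   ~${harvard_loan:.2f} per loan")
```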
May we scan the books and make them available?
Imaging, i.e. the actual scanning of the books, is allowed as long as you own a copy of the book.
But making them available is a bit more complicated. First, there are the in-print books, hundreds of thousands of them, which are rather strictly regulated. But there is also a huge body of copyrighted books that are permanently out of print, the so-called “orphans.” Here the librarians ran into the tragedy of screwed-up laws. Under current legislation, you just can’t make them available, even though the books are never going to be in print again. (Libraries live by giving access to these out-of-print books.)
To clarify the copyright issue with orphans, TIA filed a lawsuit. (That’s the way things are clarified in the USA.) So there’s an ongoing case, Kahle v. Ashcroft, dubbed “Free the Orphans,” which will hopefully bring some sense to the issue.
But we can conclude that text and books are doable.
Audio and music
How much is there?
As for music, Kahle noted that the 2-3 million records ever made are intensely litigated, so the Internet Archive offered music makers free unlimited storage of their works forever, and the music poured in. The Archive audio collection has 100,000 concerts so far and a million recordings, with three new bands uploading every day.
Almost all published audio is music: some 2-3 million titles (78s, LPs, CDs). Storage-wise, that’s a doable number, but published music is currently a heavily litigated and restricted area, so we just can’t go ahead and give access to it.
Instead, TIA has started working on other areas of music. They’ve been working with live concert tapers. There’s an ongoing tradition among jam bands, started by the Grateful Dead, that you are allowed to tape and trade recordings of live performances as long as you don’t profit from it. The trading has moved to the Internet, but the traders have had problems with bandwidth: one concert takes about a gigabyte, losslessly compressed, so it’s not light on downloading or uploading.
So the people at The Internet Archive offered the live tapers “unlimited storage, unlimited bandwidth, forever, for free,” which is just what the traders wanted. They have now archived over 12,000 concerts, mostly from the “guys with guitars” genre and lately also from the “guys with mandolins” genre. The only requirement is that the recordings be under a Creative Commons license. If you are in a band, or know people who are, join the action; uploading gigabytes of concerts is no sweat for The Internet Archive. Upload away.
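For a sense of scale, the storage this implies is modest; the figures below are just the ones quoted above:

```python
# Quick sizing of the live-concert archive described above.
# Assumptions (from the talk): ~12,000 concerts, ~1 GB each losslessly compressed.
concerts = 12_000
gb_per_concert = 1.0

total_tb = concerts * gb_per_concert / 1000
print(f"Concert archive: ~{total_tb:.0f} TB")  # ~12 TB: roughly a dozen terabyte-class Linux boxes
```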
Legislation concerning recorded performances in the US is really hard to figure out. But in Europe there’s a 50-year (in some cases 70-year) rule: recordings enter the public domain 50 years after they were made. This basically implies that the contents of 78 rpm records are public domain, so if you have any 78s, rip them and send the contents to The Internet Archive. It would be kind of fun to have hip-hop songs built out of samples scratched from 78s.
Archive.org also hosts around 50 netlabels. These netlabels keep their content on the archive and link straight to the downloadable files, so you don’t have to go through archive.org’s branding or anything.
So, all in all, archiving music and making it available is also possible.
Moving images
The 150,000 commercial movies ever made are tightly controlled, but 2 million other films are readily available and fascinating—600,000 of them are accessible in the Archive already. In the year 2000, without asking anyone’s permission, the Internet Archive started recording 20 channels of TV all day, every day. When 9/11 happened, they were able to assemble an online archive of TV news coverage all that week from around the world (“TV comes with a point of view!”) and make it available just a month after the event, on Oct. 11, 2001.
Most people first think of the big theatrical releases, of which there are around 100,000-200,000. (Half of them are estimated to be Indian.) These are heavily litigated and restricted. However, there are some public domain feature films too. One collector has gathered around 600 public domain feature films, and The Internet Archive is in the process of digitizing them. The legendary Night of the Living Dead, for example, is a public domain film!
But there are also other interesting films that are less restricted. It used to be that you had to put an explicit copyright notice in the film; otherwise the film was automatically in the public domain. (This is what Kahle called the “Ben Franklin way of viewing the world.”) This means that there are tons of old films that are in the public domain.
Clips of such films are used in various interesting ways now: government films, educational films, social guidance stuff, home art flicks, etc. Brewster invites artists to use the public domain material to create new art. And if you have your own videos and want to share them under a free license (for example, a Creative Commons license), The Internet Archive offers you unlimited storage, unlimited bandwidth, forever, for free; just upload away.
The Internet Archive and the Television Archive started archiving TV broadcasts a little late, in 1999. They record 20 channels (a Russian channel, a Japanese channel, CNN, BBC, etc.) 24 hours a day at DVD quality. Broadcasts are stored on offline hard drives. The recording takes around 20 terabytes of storage space a month.
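The 20-terabytes-a-month figure is consistent with a modest MPEG-2 bitrate; the roughly 3 Mbit/s per channel below is an assumption used only to check the arithmetic:

```python
# Sanity check of the "~20 TB a month" figure for 20 TV channels.
# Assumption: "DVD quality" here means roughly 3 Mbit/s of video per channel.
channels = 20
mbit_per_second = 3.0                       # assumed average bitrate per channel

bytes_per_month = channels * (mbit_per_second * 1e6 / 8) * 3600 * 24 * 30
print(f"~{bytes_per_month / 1e12:.0f} TB per month")   # ~19 TB, matching the claim
```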
Due to legal issues, these TV archives are not publicly available by default. However, they did make the archives available once, for a short period: on October 11, 2001, they published the TV archives spanning September 11, 2001 to September 18, 2001.
The idea was to look back at what the world actually saw, because 9/11 was essentially a television event. They wanted to show how people perceived it. The showcase made it evident that the media is just a player, not the truth: just one point of view.
But as far as archiving is concerned, moving images are very much doable. We can archive the whole dang thing, and we could make it available.
Software, or, run Lotus 1-2-3 as it was originally written
It’s estimated that there have been around 50,000 packaged and released software titles. Technically, archiving the software is doable: we can rip it and run it in emulators. You can replay all that amazing stuff from the early days: software for the early Mac, the Sol, the Commodore 64, etc.
However, they ran into a major obstacle: the Digital Millennium Copyright Act (DMCA), which, as Kahle puts it, is an “amazing piece of Soviet-era legislation where everything is illegal unless we give permission,” and which is “the antithesis of what the United States used to stand for.”
The DMCA implies that ripping (i.e. making a digital copy of) the software is illegal. You are not allowed to break the copy protection, which a lot of early software had, which effectively means you can’t archive it.
The librarians went to the Copyright Office with a lot of preparation, briefs, and so on. They waved a floppy in their hands and said, paraphrased: “Here’s Lotus 1-2-3 on a floppy, rotting away. We are ready to make digital copies of this stuff and we want to be allowed to do so.”
The librarians won. They got a copyright exemption for two years. What The Internet Archive now needs is donations of the physical objects, floppies for example, and digital copies of the software. People need to act fast on this: the exemption lasts two years, and the window of opportunity may shut. When the legal hearings come around again, we can show that “the world didn’t crater as the software industry thought.”
The exemption says that you are allowed to rip old software if you are “an agent for archive.org.” Only ripping is allowed; you can’t publish the stuff on the net.
The web itself
When the Internet Archive began in 1996, there were just 30 million web pages. Now the Wayback Machine copies every page of every website every two months and makes them time-searchable from its 6-petabyte database of 150 billion pages. It has 500,000 users a day making 6,000 queries a second.
The Internet Archive is perhaps best known for archiving the web. They have been doing so since 1996, taking a full snapshot of the publicly available web, all the text and all the pictures, every two months. Each snapshot takes about 20 terabytes of storage compressed, around 50-60 terabytes uncompressed.
All that data is available through the Wayback Machine. With the Wayback Machine you can (try to) “surf the web as it was.” It is visited by 150,000 people a day and gets 8 million hits a day. The database is on the order of 300-400 terabytes compressed, over a petabyte uncompressed.
It might be one of the largest databases out there, even though it doesn’t run on Oracle. Instead, it runs on a cluster of Linux machines, and the data is stored as flat files on the filesystem, which is the only way they’ve found to scale.
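To make the flat-files-instead-of-a-database idea concrete, here is a minimal sketch of the general approach; this is not the Archive’s actual code, and the file names and layout are invented for illustration. Captures are appended to a large flat file, and a small index maps a URL and timestamp to an offset and length:

```python
# Minimal sketch of storing web captures as flat files plus a tiny index.
# NOT the Internet Archive's actual code; names and layout are hypothetical.
import time

ARCHIVE_FILE = "captures-000001.dat"    # one big append-only flat file
index = {}                              # (url, timestamp) -> (offset, length)

def store_capture(url: str, body: bytes) -> int:
    """Append a captured page to the flat file and record where it lives."""
    ts = int(time.time())
    with open(ARCHIVE_FILE, "ab") as f:
        offset = f.tell()
        f.write(body)
    index[(url, ts)] = (offset, len(body))
    return ts

def load_capture(url: str, ts: int) -> bytes:
    """Seek straight to the recorded offset; no database needed."""
    offset, length = index[(url, ts)]
    with open(ARCHIVE_FILE, "rb") as f:
        f.seek(offset)
        return f.read(length)

ts = store_capture("http://example.com/", b"<html>hello</html>")
assert load_capture("http://example.com/", ts) == b"<html>hello</html>"
```

In the real system the index itself would also live on disk and be sharded across the cluster, but the point stands: plain files plus seek offsets scale where a conventional database would not.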
Archiving the web is doable, at least as well as The Internet Archive is doing it right now. And they are getting better at it all the time.
Preservation and access
“Don’t just have one copy of the data”
The storage has to be very cost-effective. Tapes are too expensive, so they’re storing the data on hard drives. They put four 300-gigabyte disks in a Linux box, yielding around a terabyte of storage per machine, and stack lots of these Linux boxes in a rack.
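For scale, here is the rack math those numbers imply; the one-petabyte target is the figure mentioned below for the European mirror, and everything else comes from the description above:

```python
# Rough rack math for the hardware described above.
# Assumptions: 4 x 300 GB disks per Linux box; ~1 PB target (the figure
# mentioned below for the planned European archive).
disks_per_box = 4
gb_per_disk = 300
tb_per_box = disks_per_box * gb_per_disk / 1000     # ~1.2 TB raw per box

target_tb = 1000                                     # one petabyte
boxes = target_tb / tb_per_box
print(f"{tb_per_box:.1f} TB per box, ~{boxes:.0f} boxes for a petabyte")
```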
The only full copy of the data is in San Francisco, also known for its massive earthquakes, which might not prove to be a good idea. The lesson to learn from the original Library of Alexandria is, of course, that it burned down. What’s now left of all that ancient knowledge is allegedly eight pieces of papyrus. So, don’t just have one copy of the data.
That’s why they’ve been swapping data with the New Library of Alexandria. Anything The Internet Archive gets goes to the New Library of Alexandria and vice versa. The New Library of Alexandria has a few hundred terabytes now, and by the end of the year it will have the remaining 700 terabytes.
The Internet Archive is also kicking off a European archive in Amsterdam, the Netherlands. (Brewster himself and his family are living in Amsterdam right now.) The first installment in Amsterdam is a 100-terabyte machine, which is meant to show that they are serious about the project. They plan to roll out the full petabyte within 12 months. XS4ALL (Internet Access For All) provides free hosting and free bandwidth for the European archive. The Internet Archive is looking for a technical director for the European Archive.
The Internet Archive now serves several gigabits per second of bandwidth, hoping to extend that to 10 gigabits per second, which would be a meaningful percentage of the public web’s traffic.
But the most important thing that libraries do is provide an interface to the knowledge, i.e. search engines. One search engine for the Internet Archive’s data was built by one woman in two years, working part time. Her search engine indexes 11 billion web pages, four times the size of Google in terms of pages. So the idea is to make it possible to “build a Google in a month.”
Public access to the Internet is quite horrible in the States. Bandwidth isn’t growing, and the phone companies want to raise prices. The Internet Archive is providing wireless access in the San Francisco area. Bandwidth isn’t the problem there; the main cost is technical support.
We’ve got to start doing a better job as far as bandwidth is concerned. Right now, most videos on the Internet are pathetically small. In the future, people just won’t turn to the Internet if we can’t serve proper video.
So, will we do this?
Finally, Brewster returns to the question of whether we are going to make this project happen. There are forces working against it, but on the other hand, there are a lot of people already doing it. They’ve got funding: TIA’s budget is around 5 million dollars a year, and they are trying to raise that to 10 million dollars a year.
In his speech, Brewster argued that universal access to all human knowledge is possible. If it succeeds, the project might be one of humanity’s greatest achievements, up there with the myth of the Library of Alexandria: not only storing all human knowledge, but providing access to it, even to a child in rural Uganda. All we need is a few years of really hard-core work, and we might actually pull this off.
Speaking of institutional longevity, Kahle noted during the Q&A that nonprofits demonstrably live much longer than businesses. It might be because they have softer edges, he surmised, or because they’re free of the grow-or-die demands of commercial competition. Whatever the cause, they are proliferating.
Let’s get on with it.
Compiled from various sources