GENERAL TODO:

  * The examples are all over the place. They need to be more consistent.
  * Check that x-archive-queue-derive header. I just skimmed it and it doesn't seem right.
  * Investigate getting an "ias3support@archive.org" address for support requests
  * Some of the standard metadata fields are repeatable, some are not. State this in the descriptions.
  * Excellent Hank idea: Quick Start (TL;DR) section to avoid all the gory details
  * Dang, but this damn thing is hard to read. Will that get better when it gets converted to the PHP wrapper? I have my doubts. May need some quick George love to give tips for better readability.
  * All the other 'foo' (read: green) bits below

# Internet Archive's S3-like Server API

Last Updated: $Date: 2011-10-06 +0000 (Thu, 06 Oct 2011) $

  * NOTA BENE
  * Introduction
  * What the IAS3 API Allows You To Do
  * How IAS3 Differs From Amazon S3
  * System Requirements
  * Using S3 Clients to Access IAS3
  * Passing Authorization Credentials to IAS3
  * Commonly Used Amazon S3 Headers
  * x-amz-auto-make-bucket
  * Internet Archive-specific IAS3 Headers
  * x-archive-cascade-delete
  * x-archive-ignore-preexisting-bucket
  * x-archive-keep-old-version
  * x-archive-meta-*
  * x-archive-queue-derive
  * x-archive-size-hint
  * IAS3 Identifiers
  * Setting Metadata Values via Headers
  * Standard Internet Archive Metadata Fields
  * hidden
  * identifier
  * title
  * creator
  * mediatype
  * collection
  * description
  * date
  * subject
  * licenseurl
  * pick
  * noindex
  * publicdate
  * addeddate
  * adder
  * uploader
  * updater
  * updatedate
  * notes
  * rights
  * contributor
  * publisher
  * language
  * coverage
  * credits
  * Custom Metadata Fields
  * Repeating Metadata Fields
  * Setting Metadata Values via Files
  * IDENTIFIER_marc.xml
  * IDENTIFIER_meta.mrc
  * How These Metadata Files Are Processed
  * Special Files
  * IDENTIFIER_meta.xml
  * IDENTIFIER_files.xml
  * IDENTIFIER_rules.conf
  * Specific File Formats
  * Only 'lossy' File Formats
  * All Derivatives
  * Troubleshooting
  * Viewing a log of your IAS3 object
  * My file isn't appearing in the item.
  * Is there a sandbox I can use for testing IAS3?
  * What happens to my item/file after uploading?
  * Is there any way to control how files derive?
  * Downloading via IAS3
  * Code Examples
  * curl
  * Text item (a PDF will be OCR'd)
  * Movie item (Will get video player on details page)
  * Uploading a file to an existing item
  * Destroy and respecify the metadata for an item
  * A Movie example with subject keywords, and creative commons license
  * Perl
  * An extract of a script for uploading multiple files via IAS3 using LWP
  * Other Languages
  * Support
  * Appendices
  * Terminology
  * Internet Archive's Item Structure (in brief)
  * IAS3 HTTP Return Codes
  * Error Messages
  * Default Metadata Values
  * Example IDENTIFIER_marc.xml file
  * Example IDENTIFIER_meta.mrc file

## NOTA BENE

This document is very, very much a work in progress. It's so in-progress it's not even a first draft. Please do not assume this document is definitive until it gets committed and pushed live to archive.org (wrapped in appropriate look and feel, etc.). Until then, feel free to reference this, but the official IAS3 documentation can still be found at [http://archive.org/help/abouts3.txt](http://archive.org/help/abouts3.txt).

## Introduction

This document covers the technical details of using Internet Archive's S3-like server API, aka "IAS3." The intended audience is a technical user, ideally one who is comfortable in the Linux/UNIX command line environment.

IAS3 is an [API](http://en.wikipedia.org/wiki/Api) based upon [Amazon's](http://amazonwebservices.com) [Simple Storage Service (aka S3)](http://docs.amazonwebservices.com/AmazonS3/latest/dev/). Whereas Amazon's S3 API allows you to store items in the Amazon S3 cloud storage service, the IAS3 API allows you to create items on and upload data to [Internet Archive](http://archive.org).
Because of its similarities to Amazon's S3, **_please familiarize yourself with the [Amazon S3](http://docs.amazonwebservices.com/AmazonS3/latest/dev/) documentation before using Internet Archive's IAS3._**

## What the IAS3 API Allows You To Do

foo: Check with Sam re: the examples marked below; they aren't working as expected. Also: don't like the section title.

In Internet Archive terminology, an item maps directly onto the Amazon S3 concept of a [bucket](http://docs.amazonwebservices.com/AmazonS3/latest/dev/Introduction.html). IAS3 allows you to create items (buckets, in S3 terms), populate them with files, and maintain the metadata for the item. You can also use IAS3 to control certain elements of file processing behavior. Internet Archive currently does not support file-level metadata.

Because Internet Archive items are analogous to Amazon S3 buckets, they can be accessed using similar URL addresses. Items are typically accessed on Internet Archive using the IA-specific details/IDENTIFIER format. For instance:

    http://www.archive.org/details/Sita_Sings_the_Blues

The link above will present the _details page_ for the item on Internet Archive. This same item is also available in an S3-like format of:

    http://s3.us.archive.org/Sita_Sings_the_Blues

Or:

    http://Sita_Sings_the_Blues.s3.us.archive.org/

These URLs will return XML containing information about the item. Each file contained in an item can similarly be used as an S3-like key in a URL:

    http://Sita_Sings_the_Blues.s3.us.archive.org/Sita_Sings_the_Blues_small.mp4

Performing a PUT on the Internet Archive equivalent to an S3 endpoint will result in the creation of a new item in Internet Archive. Files may be added to the item in the same manner. Both of these operations may be combined in a single PUT command. For example, using curl:

    curl --location --header 'x-amz-auto-make-bucket:1' \
         --header 'x-archive-meta01-collection:opensource' \
         --header 'x-archive-meta-mediatype:texts' \
         --header 'x-archive-meta-sponsor:Andrew W. Mellon Foundation' \
         --header 'x-archive-meta-language:eng' \
         --header "authorization: LOW $accesskey:$secret" \
         --upload-file /home/samuel/public_html/intro-to-k.pdf \
         http://s3.us.archive.org/sam-s3-test-08/demo-intro-to-k.pdf

## How IAS3 Differs From Amazon S3

IAS3 differs from Amazon's S3 API in several significant ways:

  * IAS3 does not allow DELETE for buckets, only for files. Attempting to DELETE a bucket will result in a _Not Authorized_ error.
  * IAS3 supports the HTTP 1.1 REST interface for S3 but **not** the SOAP interface.
  * IAS3 is much more likely to issue HTTP 307 Location redirects than Amazon S3; it is therefore advised that you use an S3-compatible client with good HTTP 100 Continue support (for example, curl version 7.19 and higher).
  * Amazon S3 allows users to [set ACLs for buckets and objects](http://docs.amazonwebservices.com/AmazonS3/latest/API/index.html?RESTBucketPUTacl.html). IAS3 does not. Instead, all items are created with ACLs of world readable and item uploader writable.
  * Amazon S3's POST and COPY are not implemented in IAS3.
  * IAS3 ignores HTTP 1.1 Range headers.

IAS3 also supports several of its own headers. These are discussed in more detail below.

## System Requirements

In order to use IAS3 to upload to Internet Archive, you must have:

  * An internet connection
  * An Internet Archive [patron account](http://www.archive.org/account/login.createaccount.php)
  * [API keys](http://www.archive.org/account/s3.php) for IAS3
  * Client code which supports the Amazon S3 API. Most examples in this document use [curl](http://curl.haxx.se/) for its simplicity. If you choose to use curl or libcurl to interface with IAS3, please be sure you are using version 7.19 or higher. These versions have excellent HTTP 100 Continue support.

## Using S3 Clients to Access IAS3

Internet Archive strives to make IAS3 compatible with current Amazon S3 client code.
Ideally, running the following command on your S3 client code, which replaces amazonaws.com with us.archive.org, would allow you to use IAS3 with no further changes:

    perl -pi -e 's/amazonaws\.com/us.archive.org/g' *

Some Amazon S3 clients obey configuration files, many of which will allow you to define the preferred S3 hostname. Setting this hostname to s3.us.archive.org in the configuration file should allow the client code to upload to Internet Archive with no further changes. For instance, adding the following to your ~/.s3cfg configuration file for [s3cmd](http://s3tools.org/s3cmd), a popular Amazon S3 client, will allow you to connect to IAS3:

    [default]
    access_key = YOUR-ACCESS-KEY
    secret_key = YOUR-SECRET-KEY
    host_base = s3.us.archive.org
    host_bucket = %(bucket)s.s3.us.archive.org

Replace YOUR-ACCESS-KEY and YOUR-SECRET-KEY with your [IAS3 API keys](http://www.archive.org/account/s3.php).

## Passing Authorization Credentials to IAS3

[Authorization credentials](http://www.archive.org/account/s3.php) may be passed to IAS3 by your Amazon S3-compatible client via configuration file (see above). In addition, there is a clear-text mode. To use this mode, pass your access and secret keys as values to the Authorization header:

    Authorization: LOW $accesskey:$secret

This is the authorization method shown in most of the examples in this document.

## Commonly Used Amazon S3 Headers

foo: are there any more of these? only the one ever appears in the examples

Most Amazon S3 headers can also be used with IAS3. This section briefly discusses the most commonly used Amazon S3 headers.

### x-amz-auto-make-bucket

The x-amz-auto-make-bucket header allows you to both create an item and upload directly to it with a single command.

To enable this option, pass the x-amz-auto-make-bucket header with a value of 1. If you do not specify this value you **must** create an item before you attempt to upload to it.

The default value for this header is 0.

This header only works when PUTting to IAS3.
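For readers scripting uploads outside of curl, the following is a minimal sketch of how the clear-text Authorization value and the x-amz-auto-make-bucket header (as named in this section's heading and used in the curl examples) fit together in a single PUT. The function names `auth_header` and `upload_request` are hypothetical helpers, not part of any IAS3 client. The sketch only assembles the URL and headers; sending the request is left to an HTTP client that handles redirects and HTTP 100 Continue well, as advised above.

```python
from urllib.parse import quote

# Endpoint used by the curl examples in this document.
S3_ENDPOINT = "http://s3.us.archive.org"


def auth_header(access_key, secret_key):
    """Return the clear-text 'LOW' Authorization header value."""
    return f"LOW {access_key}:{secret_key}"


def upload_request(identifier, filename, access_key, secret_key,
                   auto_make_bucket=False):
    """Return (url, headers) for a PUT of `filename` into item `identifier`.

    Hypothetical helper: it builds the request description only and
    performs no network I/O.
    """
    url = f"{S3_ENDPOINT}/{quote(identifier)}/{quote(filename)}"
    headers = {"Authorization": auth_header(access_key, secret_key)}
    if auto_make_bucket:
        # Create the item and upload to it in the same PUT.
        headers["x-amz-auto-make-bucket"] = "1"
    return url, headers


if __name__ == "__main__":
    url, headers = upload_request(
        "sam-s3-test-08", "demo-intro-to-k.pdf",
        "ACCESSKEY", "SECRET", auto_make_bucket=True,
    )
    print("PUT", url)
    for name, value in headers.items():
        print(f"{name}: {value}")
```

Leaving `auto_make_bucket` at its default omits the header entirely, which matches the documented default value of 0.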
## Internet Archive-specific IAS3 Headers

foo: I really don't like the formatting here. Maybe add a standard table to each header, listing where it can be used (PUT/GET/DELETE, etc.), valid values, default value?

Internet Archive has implemented specialized headers for controlling certain operations upon objects and files via IAS3.

### x-archive-cascade-delete

A normal DELETE operation removes only the specified file. The x-archive-cascade-delete header allows you to delete not only a file but also all derivative and original files associated with it. The [Internet Archive derivatives help page](http://www.archive.org/help/derivatives.php) provides additional information about the files which may be deleted in this operation.

To enable this option, pass the x-archive-cascade-delete header with a value of 1.

The default value for this header is 0.

This header only works when DELETING a file within an item.

**Nota bene:** DELETE is not allowed for items (buckets) in IAS3. You may only DELETE a file and its derivatives.

### x-archive-ignore-preexisting-bucket

A normal PUT operation including x-archive-meta-* headers will not overwrite an existing IDENTIFIER_meta.xml file. The x-archive-ignore-preexisting-bucket header will instead overwrite the existing IDENTIFIER_meta.xml file with the x-archive-meta-* header values passed in the same PUT command.

To enable this option, pass the x-archive-ignore-preexisting-bucket header with a value of 1.

The default value for this header is 0.

This header only works when PUTting to IAS3.

### x-archive-keep-old-version

A normal PUT operation will overwrite a file when it is used to upload a file of the same name. A normal DELETE operation will remove the specified file. The x-archive-keep-old-version header will rename the specified file, prepending the filename with .~~ before proceeding with the PUT or DELETE operation.

To enable this option, pass the x-archive-keep-old-version header with a value of 1.
The default value for this header is 0.

**Caution!** This header is experimental. Its use could produce unexpected results if interleaved with PUTs which do not use this header.

This header works for both PUT and DELETE for IAS3.

### x-archive-meta-*

The x-archive-meta-* header is used for setting metadata values for an item. This header is discussed in detail below.

### x-archive-queue-derive

Normal operation after a file has been PUT into an item is to queue it for [derivation to other file formats](http://www.archive.org/help/derivatives.php). PUTting either a very large file or a large number of files can bog down the derivation process and slow system performance. In these instances it is preferable to disable automatic derive queueing.

Please note: Files may be queued for derivation following upload. To queue an individual file, navigate to the item detail page on Internet Archive and click the _Edit Item!_ link at the top. If you have several files which need to be queued, [contact Internet Archive](mailto:info@archive.org?subject=[Queue for Derive]) for assistance.

To disable automated creation of derivative files, pass the x-archive-queue-derive header with a value of 0.

The default value for this header is 1.

This header works only when PUTting to IAS3.

### x-archive-size-hint

If the total size of files in your item will exceed 10 gigabytes, Internet Archive recommends you declare the size at the time of bucket creation. This allows the Internet Archive catalog to more easily place the item for storage, facilitating a potential speed boost to the upload.

To enable this option, pass the x-archive-size-hint header with a value of the file size **in bytes**.

If this header is not defined, IAS3 will attempt to default to the value in the content-length header.

This header works only when PUTting to IAS3.

## IAS3 Identifiers

Each item at Internet Archive has an identifier.
An identifier is composed of any unique combination of alphanumeric characters, underscore (_) and dash (-). While there are no official limits, it is strongly suggested that identifiers be between 5 and 80 characters in length.

Identifiers must be unique across the entirety of Internet Archive, not simply unique within a single collection.

Once defined, an identifier **cannot** be changed. It will travel with the item or object and is involved in every manner of accessing or referring to the item.

In IAS3, identifiers are defined implicitly in the target URL. For example:

    curl --location --header 'x-amz-auto-make-bucket:1' \
         --header "Authorization: LOW $accesskey:$secret" \
         --header "x-archive-meta-collection:test_collection" \
         --upload-file /Users/archive/Desktop/The_Open_Source_Way_03.pdf \
         http://s3.us.archive.org/vmb_tosw_trial_upload_03/The_Open_Source_Way_03.pdf

The identifier in this command is vmb_tosw_trial_upload_03.

The item may be viewed at its _details_ page. The details page for any item is simply http://archive.org/details/ followed by the identifier. The details page for this example is:

    http://archive.org/details/vmb_tosw_trial_upload_03

## Setting Metadata Values via Headers

The x-archive-meta-* header is used to set metadata values for items. _At this time Internet Archive does not support file-level metadata. Metadata may only be defined at an item level._

All metadata fields are defined as key-value pairs passed via headers. The header format is:

    x-archive-meta-FIELDNAME:FIELDVALUE

For instance, if you are using curl you may set a value for the title metadata field using this header:

    --header "x-archive-meta-title:John Muir on Hetch Hetchy" \

Alternatively, you may use the Amazon S3 standard x-amz-meta-FIELDNAME:FIELDVALUE header for setting metadata.

Metadata headers are sorted prior to processing.
This sorting includes the x-amz- or x-archive- header prefixes; therefore, if you use both of these prefixes when setting metadata values, the fields set with x-amz- will be processed first and may cause unexpected behavior. To avoid potential problems it is advised that you use either the x-archive- or the x-amz- header prefix when setting metadata, not both.

All metadata header values are interpreted as UTF-8 encoded characters.

### Standard Internet Archive Metadata Fields

There are several standard metadata fields recognized for Internet Archive items. All metadata fields except identifier are optional.

foo: alphabetize these

foo: standardize wording; it's all over the place

foo: field or tag? Pick a term and stick with it

#### hidden

foo: what's this do? It's admin/owner-only and doesn't appear on editxml.php

#### identifier

Each item at Internet Archive has an identifier. An identifier is composed of any unique combination of alphanumeric characters, underscore (_) and dash (-). While there are no official limits, it is strongly suggested that identifiers be between 5 and 80 characters in length.

An identifier **cannot** be defined via metadata header. Instead, identifiers are defined implicitly in the target URL. Please see IAS3 Identifiers above for additional information.

#### title

The title for the item. This appears in the header of the item's detail page on Internet Archive.

If a value is not specified for this field it will default to the identifier for the item.

#### creator

An entity primarily responsible for creating the files contained in the item.

#### mediatype

The primary type of media contained in the item. While an item can contain files of diverse mediatypes, the value in this field defines the appearance and functionality of the item's detail page on Internet Archive. In particular, the mediatype of an item defines what sort of online viewer is available for the files contained in the item.
The mediatype metadata field recognizes a limited set of values:

  * audio

    The majority of audio items should receive this mediatype value. Items for the [Live Music Archive](http://www.archive.org/details/etree) should instead use the etree value.

  * data

    This is the default value for mediatype. Items with a mediatype of data will be available in Internet Archive but you will not be able to browse to them. In addition there will be no online reader/player for the files.

  * etree

    Items which contain files for the [Live Music Archive](http://www.archive.org/details/etree) should have a mediatype value of etree. The Live Music Archive has very specific upload requirements. Please consult the [documentation](http://www.archive.org/about/faqs.php#Live_Music_Archive) for the Live Music Archive prior to creating items for it.

  * image

    Items which predominantly consist of image files should receive a mediatype value of image. Currently these items will not be available for browsing or online viewing in Internet Archive, but they will require no additional changes when this mediatype receives additional support in the Archive.

  * movies

    All videos (television, features, shorts, etc.) should receive a mediatype value of movies. These items will be displayed with an online video player.

  * software

    Items with a mediatype of software are accessible to browse via Internet Archive's [software collection](http://www.archive.org/details/software). There is no online viewer for software but all files are available for download.

  * texts

    Items with a mediatype of texts will appear with the [online bookreader](https://github.com/openlibrary/bookreader). Internet Archive will also attempt to OCR files in these items.

  * web

    The web mediatype value is reserved for items which contain web archive [WARC](http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml) files.

If the mediatype value you set is not in the list above it will be saved but ignored by the system.
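Because an unrecognized mediatype is saved but silently ignored, it can be useful to check the value on the client before uploading. The following is a minimal sketch of such a check, built only from the list above; the helper name `mediatype_header` is hypothetical, and raising an error on an unrecognized value is a client-side design choice for this sketch, not IAS3 behavior.

```python
# Values taken from the list of recognized mediatypes above.
RECOGNIZED_MEDIATYPES = frozenset(
    {"audio", "data", "etree", "image", "movies", "software", "texts", "web"}
)
DEFAULT_MEDIATYPE = "data"  # documented default


def mediatype_header(value=None):
    """Return the mediatype metadata header as a (name, value) pair.

    Hypothetical helper: falls back to the documented default when no
    value is given, and fails fast on a value IAS3 would store but ignore.
    """
    mediatype = DEFAULT_MEDIATYPE if value is None else value
    if mediatype not in RECOGNIZED_MEDIATYPES:
        raise ValueError(
            f"unrecognized mediatype {mediatype!r}; "
            f"expected one of {sorted(RECOGNIZED_MEDIATYPES)}"
        )
    return ("x-archive-meta-mediatype", mediatype)


if __name__ == "__main__":
    name, value = mediatype_header("texts")
    print(f"{name}:{value}")
```

A stricter or looser policy (for example, warning instead of raising) would work equally well; the point is only to catch typos such as "video" or "text" before the item is created.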
This field may be modified only by an administrator or the owner of the item.

If a value is not specified for this field it will default to data.

#### collection

A collection is a specialized item used for curation and aggregation of other items. Assigning an item to a collection defines where the item may be located by a user browsing Internet Archive.

To assign an item to a collection, pass the collection's identifier as the value for an x-archive-meta-collection header. For example, if you are using curl you can assign an item to the _Community Texts_ collection (identifier: _opensource_) with the following header:

    --header 'x-archive-meta-collection:opensource' \

A collection **must** exist prior to assigning any items to it. Currently collections can only be created by Internet Archive staff members. Please [contact Internet Archive](mailto:info@archive.org?subject=[Collection Creation Request]) if you need a collection created.

#### description

A description of the item. The value of this metadata field may contain HTML.
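To tie the fields above together, here is a minimal sketch that turns a dictionary of metadata fields into headers following the x-archive-meta-FIELDNAME:FIELDVALUE format described earlier, and renders them as curl `--header` arguments. The helper names `metadata_headers` and `curl_header_args` are hypothetical. The sketch rejects identifier because it is defined by the target URL rather than by a header, and it rejects line breaks because an HTTP header value cannot contain them; both checks are assumptions about sensible client behavior, not documented IAS3 validation.

```python
META_PREFIX = "x-archive-meta-"


def metadata_headers(fields):
    """Map {fieldname: value} to {header name: value} for an item.

    Hypothetical helper based on the header format described above.
    """
    headers = {}
    for name, value in fields.items():
        if name == "identifier":
            # The identifier comes from the target URL, never from a header.
            raise ValueError("identifier cannot be set via a metadata header")
        if "\r" in value or "\n" in value:
            raise ValueError(f"value for {name!r} must not contain line breaks")
        headers[META_PREFIX + name] = value
    return headers


def curl_header_args(headers):
    """Render headers as a flat list of curl --header arguments."""
    args = []
    for name, value in sorted(headers.items()):
        args += ["--header", f"{name}:{value}"]
    return args


if __name__ == "__main__":
    example = metadata_headers({
        "title": "John Muir on Hetch Hetchy",
        "mediatype": "texts",
        "collection": "opensource",
        "description": "A talk on <b>Hetch Hetchy</b>",  # HTML is allowed here
    })
    print(curl_header_args(example))
```

The resulting list can be passed to `subprocess.run` alongside the other curl arguments shown in this document, which avoids shell-quoting problems with values that contain spaces or HTML.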