From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jeff Garzik
Subject: Re: tabled test corpus?
Date: Fri, 05 Mar 2010 16:28:05 -0500
Message-ID: <4B917765.3040706@garzik.org>
References: <4B9123EF.5000108@garzik.org>
 <4B914E77.5090806@garzik.org>
 <436f52801003051315j179e7f98nc0632fba6714cac2@mail.gmail.com>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Return-path:
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
 h=domainkey-signature:received:received:sender:message-id:date:from
 :user-agent:mime-version:to:cc:subject:references:in-reply-to
 :content-type:content-transfer-encoding;
 bh=dXGn3DSONdyfb9RI0Z1ClIJC83SLUryIoB0Spj4PPXg=;
 b=L4KlQ40FwFAseXPIoKkJI3uUVVnxmwsGO95iRmtVhv+iJGroSdE5vvqGSuCGJDagIg
 cu4MoCUgOXQN4R6HeSwCJyo6LysHBRdHTjRcfXGD3UxM0CdSwRd099cgD5LMp0ThaD+K
 TTZFgfX2Xy4JfBFvG89bvfiZWGJJCwVCaAMIs=
In-Reply-To: <436f52801003051315j179e7f98nc0632fba6714cac2@mail.gmail.com>
Sender: hail-devel-owner@vger.kernel.org
List-ID:
Content-Type: text/plain; charset="us-ascii"; format="flowed"
To: Colin McCabe
Cc: Project Hail, Pete Zaitcev

On 03/05/2010 04:15 PM, Colin McCabe wrote:
> Random thoughts:
>
> Maybe something like a freely available dictionary would work, with
> the key as the word, and the value as the definition.
>
> You could grab git commits from the Linux kernel and make the key the
> SHA, and the value the patch.
>
> There's a lot of text in Project Gutenberg. I guess you'd have to
> decide what you want your average key / value lengths to be-- I think
> most books there are longer than 16K. Maybe you could make the key
> (book, page_number).

Yeah, I am definitely looking for something much larger than 16K. S3
values can run into the gigabytes per value...

	Jeff