We use a Hadoop cluster to roll up registration and view data each night.
Our cluster has 10 1U servers, each with 4 cores, 4 GB RAM, and 3 drives.
Each night we run 112 Hadoop jobs.
It is roughly 4x faster to export the transaction tables from each of our reporting databases, transfer the data to the cluster, perform the rollups there, and import the results back into the databases than it is to perform the same rollups in the databases themselves.
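As a rough illustration of this kind of nightly rollup (a minimal sketch only; the input layout, field positions, and paths below are assumptions, not details of the deployment described above), a Hadoop Streaming job in Python can re-key exported rows and sum them per item and day:

```python
#!/usr/bin/env python
"""Minimal Hadoop Streaming rollup sketch (illustrative only).

Assumed input layout (hypothetical): date<TAB>item_id<TAB>view_count
Run the same script as mapper and reducer, e.g. (paths hypothetical):

  hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
      -input /exports/transactions -output /rollups/views \
      -mapper "python rollup.py map" -reducer "python rollup.py reduce" \
      -file rollup.py
"""
import sys


def do_map():
    # Re-key each row on a composite "date|item_id" key so the shuffle
    # delivers all rows for one item/day to the same reducer, sorted together.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed rows
        date, item_id, views = fields[0], fields[1], fields[2]
        print("%s|%s\t%s" % (date, item_id, views))


def do_reduce():
    # Input arrives sorted by key, so a simple running sum per key suffices.
    current, total = None, 0
    for line in sys.stdin:
        key, views = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = key, 0
        total += int(views)
    if current is not None:
        print("%s\t%d" % (current, total))


if __name__ == "__main__":
    do_map() if sys.argv[-1] == "map" else do_reduce()
```

The rolled-up output is plain tab-separated text, which is straightforward to import back into a reporting database.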
We use Hadoop and HBase in several areas, from social services to structured data storage and processing for internal use.
We currently have about 30 nodes running HDFS, Hadoop, and HBase in clusters ranging from 5 to 14 nodes, in both production and development. We are planning a deployment on an 80-node cluster.
We constantly write data to HBase and run MapReduce jobs to process it, then store the results back in HBase or export them to external systems.
Our production cluster has been running since October 2008.
A 15-node cluster dedicated to processing various kinds of business data dumped out of databases and joining them together. These data are then fed into iSearch, our vertical search engine.
We use Hadoop for a variety of things, ranging from ETL-style processing and statistics generation to running advanced algorithms for behavioral analysis and targeting.
The cluster that we use mainly for behavioral analysis and targeting has 150 machines (dual-processor, dual-core Intel Xeons), each with 16 GB RAM and an 800 GB hard disk.
ARA.COM.TR - Ara Com Tr - Turkey's first and only search engine
We built the Ara.com.tr search engine using Python tools.
We have been running our cluster with no downtime for over 2½ years and have successfully handled over 75 million files on a 64 GB NameNode, with 50 TB of cluster storage.
We are heavy MapReduce and HBase users and use Hadoop with HBase for semi-supervised Machine Learning, AI R&D, Image Processing & Analysis, and Lucene index sharding using katta.
We use Hadoop to summarize users' tracking data and to run analytics on it.
Brilig - Cooperative data marketplace for online advertising
We use Hadoop/MapReduce and Hive for data management, analysis, log aggregation, reporting, ETL into Hive, and loading data into distributed key/value stores.
Our primary cluster has 10 nodes; each member has 2x4 cores, 24 GB RAM, and 6x1 TB SATA drives.
We also use AWS EMR clusters for additional reporting capacity on 10 TB of data stored in S3, usually with 60-100 m1.xlarge nodes.
We use Hadoop for batch-processing large RDF datasets, in particular for indexing RDF data.
We also use Hadoop for executing long-running offline SPARQL queries for clients.
We use Amazon S3 and Cassandra to store input RDF datasets and output files.
We've developed RDFgrid, a Ruby framework for map/reduce-based processing of RDF data.
We primarily use Ruby, RDF.rb and RDFgrid to process RDF data with Hadoop Streaming.
We primarily run Hadoop jobs on Amazon Elastic MapReduce, with cluster sizes of 1 to 20 nodes depending on the size of the dataset (hundreds of millions to billions of RDF statements).
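As a general illustration of this kind of Hadoop Streaming workflow (shown in Python purely as a sketch; the actual pipeline above is built on Ruby, RDF.rb, and RDFgrid), a mapper can read N-Triples input and count statements per predicate using Hadoop Streaming's built-in aggregate reducer:

```python
#!/usr/bin/env python
# predicate_count_mapper.py -- illustrative sketch only; not part of RDFgrid.
# Reads N-Triples lines from stdin and emits one count per predicate in the
# "LongValueSum:" form understood by Hadoop Streaming's built-in Aggregate
# reducer (run the job with: -mapper predicate_count_mapper.py -reducer aggregate).
import sys

for line in sys.stdin:
    line = line.strip()
    if not line or line.startswith("#"):
        continue                      # skip blank lines and comments
    parts = line.split(None, 2)       # subject, predicate, rest of triple
    if len(parts) < 3:
        continue                      # skip malformed statements
    predicate = parts[1]
    print("LongValueSum:%s\t1" % predicate)
```

The job output is plain tab-separated text: one predicate per line with its total statement count.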
We use Hadoop to create our indexes of deep web content and to provide a high availability and high bandwidth storage service for index shards for our search cluster.
We use Hadoop and Nutch to research data on programming-related websites, such as looking for current trends, story originators, and related information.
We're currently using three nodes, with each node having two cores, 4GB RAM, and 1TB storage. We'll expand these once we settle on our related technologies (Scala, Pig, HBase, other).
We use Hadoop in a Data-Intensive Computing capstone course. The course projects cover topics like information retrieval, machine learning, social network analysis, business intelligence, and network security.
The students use on-demand clusters launched using Amazon's EC2 and EMR services, thanks to its AWS in Education program.
We are using Hadoop in a course that we are currently teaching: "Massively Parallel Data Analysis with MapReduce". The course projects are based on real use-cases from biological data analysis.
We use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.
Currently we have 2 major clusters:
A 1100-machine cluster with 8800 cores and about 12 PB raw storage.
A 300-machine cluster with 2400 cores and about 3 PB raw storage.
Each (commodity) node has 8 cores and 12 TB of storage.
We are heavy users of both streaming and the Java APIs. We have built a higher-level data warehousing framework called Hive using these features. We have also developed a FUSE implementation over HDFS.
We, the Japanese company Freestylers, have been using Hadoop since April 2009 to build the image-processing environment for our image-based product recommendation system, mainly on Amazon EC2.
Our Hadoop environment produces the original database used for fast access from our web application.
We also use Hadoop to analyze similarities in user behavior.
G
GBIF (Global Biodiversity Information Facility) - a nonprofit organization that focuses on making scientific data on biodiversity available via the Internet
18 nodes running a mix of Hadoop and HBase
Hive ad hoc queries against our biodiversity data
Regular Oozie workflows to process biodiversity data for publishing
All work is open source (e.g. Oozie workflows, Ganglia)
We use a customised version of Hadoop and Nutch in a currently experimental cluster of 6 dual-core nodes.
We crawl our clients' websites and, from the information we gather, fingerprint old and outdated software packages in that shared hosting environment. After matching a signature against a database, we can then inform our clients that they are running old, outdated software. With that information we know which sites require patching, which we offer as a free courtesy service to protect the majority of users. Without Nutch and Hadoop this would be a far harder task to accomplish.
We are using Hadoop and Nutch to crawl Blog posts and later process them. Hadoop is also beginning to be used in our teaching and general research activities on natural language processing and machine learning.
We use Hadoop for information retrieval and extraction research projects. We are also working on MapReduce scheduling research for multi-job environments.
Our cluster sizes vary from 10 to 30 nodes, depending on the jobs. The nodes are heterogeneous: most are Quad 6600s with 4 GB RAM and 1 TB of disk per node, plus some dual-core and single-core configurations.
Rather than put ads in or around the images it hosts, Levin is working on harnessing all the data his service generates about content consumption (perhaps to better target advertising on ImageShack or to syndicate that targeting data to ad networks). Like Google and Yahoo, he is deploying the open-source Hadoop software to create a massive distributed supercomputer, but he is using it to analyze all the data he is collecting.
We also use Hive to access our trove of operational data to inform product development decisions around improving user experience and retention, as well as meeting revenue targets.
Our data is stored in S3 and pulled into our clusters of up to 4 m1.large EC2 instances. Our total data volume is on the order of 5 TB.
Using Hadoop MapReduce to analyse billions of lines of GPS data to create TrafficSpeeds, our accurate traffic speed forecast product.
K
Kalooga - Kalooga is a discovery service for image galleries.
Uses Hadoop, HBase, Chukwa and Pig on a 20-node cluster for crawling, analysis and event processing.
Katta - Katta serves large Lucene indexes in a grid environment.
Uses Hadoop FileSystem, RPC and IO
Koubei.com - Large local community and local search in China.
Using Hadoop to process Apache logs, analyzing users' actions and click flow, the links clicked on any specified page of the site, and more. Also using Hadoop MapReduce to process all of the price data that users input.
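As a rough sketch of this kind of click-flow analysis (the log layout and field positions are assumptions for illustration, not details of the deployment above), a Hadoop Streaming job in Python can key Apache log lines by client IP so the reducer can reconstruct each visitor's page-to-page transitions:

```python
#!/usr/bin/env python
"""Illustrative click-flow sketch for Hadoop Streaming; the log format and
field layout are assumptions. Run as `python clickflow.py map` / `... reduce`."""
from datetime import datetime
from itertools import groupby
import re
import sys

# Matches the start of an Apache combined-log line: ip ... [time] "METHOD path ..."
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(?:GET|POST) (\S+)')


def do_map():
    # Emit client_ip<TAB>timestamp<TAB>path; the IP becomes the shuffle key.
    for line in sys.stdin:
        m = LOG_RE.match(line)
        if m:
            print("%s\t%s\t%s" % m.groups())


def do_reduce():
    # Lines arrive grouped by IP; order each visitor's hits by time and
    # count page-to-page transitions (kept in memory -- fine for a sketch).
    transitions = {}
    rows = (line.rstrip("\n").split("\t", 2) for line in sys.stdin)
    for ip, hits in groupby(rows, key=lambda f: f[0]):
        ordered = sorted(
            hits, key=lambda f: datetime.strptime(f[1], "%d/%b/%Y:%H:%M:%S %z"))
        for prev, cur in zip(ordered, ordered[1:]):
            key = (prev[2], cur[2])
            transitions[key] = transitions.get(key, 0) + 1
    for (src, dst), count in sorted(transitions.items()):
        print("%s -> %s\t%d" % (src, dst, count))


if __name__ == "__main__":
    do_map() if sys.argv[-1] == "map" else do_reduce()
```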
This is the cancer center at UNC Chapel Hill. We are using Hadoop/HBase for databasing and analyzing Next Generation Sequencing (NGS) data produced for the Cancer Genome Atlas (TCGA) project and other groups. This development is based on the SeqWare open source project which includes SeqWare Query Engine, a database and web service built on top of HBase that stores sequence data types. Our prototype cluster includes:
Multiple alignment of protein sequences helps to determine evolutionary linkages and to predict molecular structures. The dynamic nature of the algorithm coupled with data and compute parallelism of Hadoop data grids improves the accuracy and speed of sequence alignment. Parallelism at the sequence and block level reduces the time complexity of MSA problems. The scalable nature of Hadoop makes it apt to solve large scale alignment problems.
Our cluster size varies from 5 to 10 nodes. Cluster nodes range from 2950 quad-core rack servers with 2x6 MB cache and 4x500 GB SATA hard drives to machines with E7200/E7400 processors, 4 GB RAM, and 160 GB HDDs.
Hardware: 35 nodes (2x4 CPU cores, 10 TB disk, and 16 GB RAM each)
We intend to parallelize some traditional classification and clustering algorithms, such as Naive Bayes, k-means, and EM, so that they can deal with large-scale data sets.
We use Hadoop, Pig, and MapReduce to process extracted SQL data and generate JSON objects that are stored in MongoDB and served through our web services (a rough sketch of this step follows below).
We have two clusters with a total of 40 nodes, with 24 cores at 2.4 GHz and 128 GB RAM per node.
Each night we run over 160 Pig scripts and 50 MapReduce jobs that process over 600 GB of data.
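As a rough sketch of the JSON-generation step mentioned above (the field names, input layout, and document shape are hypothetical, not taken from this deployment), a streaming reducer can fold tab-separated rows extracted from SQL into one JSON document per key, in the newline-delimited form that mongoimport accepts:

```python
#!/usr/bin/env python
# json_reducer.py -- illustrative sketch only; the field names and layout
# (user_id<TAB>event<TAB>count, sorted by user_id) are assumptions.
# Each output line is a standalone JSON document suitable for `mongoimport`.
import json
import sys


def flush(user_id, events):
    if user_id is not None:
        print(json.dumps({"_id": user_id, "events": events}))


current, events = None, {}
for line in sys.stdin:
    user_id, event, count = line.rstrip("\n").split("\t")
    if user_id != current:
        flush(current, events)          # emit the previous user's document
        current, events = user_id, {}
    events[event] = events.get(event, 0) + int(count)
flush(current, events)                  # emit the final user's document
```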
We are using Hadoop on 17-node and 103-node clusters of dual-core nodes to process and extract statistics from over 1000 U.S. daily newspapers as well as historical archives of the New York Times and other sources.
Collection and analysis of Log, Threat, Risk Data and other Security Information on 32 nodes (8-Core Opteron 6128 CPU, 32 GB RAM, 12 TB Storage per node)
We use Hadoop to store and process tweets, log files, and many other types of data generated across Twitter. We use Cloudera's CDH2 distribution of Hadoop, and store all data as compressed LZO files.
We use both Scala and Java to access Hadoop's MapReduce APIs.
We use Pig heavily for both scheduled and ad-hoc jobs, due to its ability to accomplish a lot with few statements.
We employ committers on Pig, Avro, Hive, and Cassandra, and contribute much of our internal Hadoop work to open source (see hadoop-lzo).
For more on our use of Hadoop, see the following presentations: Hadoop and Pig at Twitter and Protocol Buffers and Hadoop at Twitter
Our goal is to develop techniques for the Semantic Web that take advantage of MapReduce (Hadoop) and its scaling-behavior to keep up with the growing proliferation of semantic data.
RDFPath is an expressive RDF path language for querying large RDF graphs with MapReduce.
PigSPARQL is a translation from SPARQL to Pig Latin that allows SPARQL queries to be executed on large RDF graphs with MapReduce.
A 30-node cluster (Xeon quad-core 2.4 GHz, 4 GB RAM, 1 TB storage per node). We use Hadoop to facilitate information retrieval research and experimentation, particularly for TREC, using the Terrier IR platform. The open source release of Terrier includes large-scale distributed indexing using Hadoop MapReduce.
We are one of six universities participating in IBM/Google's academic cloud computing initiative. Ongoing research and teaching efforts include projects in machine translation, language modeling, bioinformatics, email analysis, and image processing.
We currently run one medium-sized Hadoop cluster (1.6PB) to store and serve up physics data for the computing portion of the Compact Muon Solenoid (CMS) experiment. This requires a filesystem which can download data at multiple Gbps and process data at an even higher rate locally. Additionally, several of our students are involved in research projects on Hadoop.
We run a 16 node cluster (dual core Xeon E3110 64 bit processors with 6MB cache, 8GB main memory, 1TB disk) as of December 2008. We teach MapReduce and use Hadoop in our computer science master's program, and for information retrieval research.
uses Hadoop as a component in our Scalable Data Pipeline, which ultimately powers VisibleSuite and other products. We use Hadoop to aggregate, store, and analyze data related to in-stream viewing behavior of Internet video audiences. Our current grid contains more than 128 CPU cores and in excess of 100 terabytes of storage, and we plan to grow that substantially during 2008.
We use Hadoop for our webmaster tools. It allows us to store, index, and search data much faster. We also use it for log analysis and trend prediction.