Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Platform" by Abinasha Karana

Searching Information Inside Hadoop Platform
Abinasha Karana
Director-Technology
Bizosys Technologies Pvt Ltd.
abinash@bizosys.com
www.bizosys.com

To search a large dataset inside HDFS and HBase, at Bizosys we started with Map-Reduce and Lucene/Solr.

Map-Reduce: what didn't work for us
Result not in a mouse click.

Lucene/Solr: what didn't work for us
It required vertical scaling with manual sharding, and subsequent resharding as data grew.

What we did
We built a new search for the Hadoop platform (HDFS and HBase).

In the next few slides you will hear about my learning from designing, developing and benchmarking a distributed, real-time search engine whose index is stored and served out of HBase.

Key Learning
1. Using SSD is a design decision.
2. Methods to reduce HBase table storage size
3. Serving a request without accessing ALL region servers
4. Methods to move processing near the data
5. Byte block caching to lower network and I/O trips to HBase
6. Configuration to balance network vs. CPU vs. I/O vs. memory

1. Using SSD is a design decision
SSD improved HSearch response time by 66% over SATA. However, SSD is costlier.
In the HBase table schema design we considered "Data Access Frequency", "Data Size" and "Desired Response Time" for selective SSD deployment.

.. Our SSD-friendly schema design
Keyword: reads all for a query.
Document: reads 10 docs / query.
Keyword + Document in 1 table: SSD deployment is all or none.
Keyword + Document in 2 tables: SSD deployment is only for the Keyword table.

2. Methods to reduce HBase table storage size
Storing a 4 byte cell requires >27 bytes in HBase.
Key length: 4 bytes
Value length: 4 bytes
Row length: 2 bytes
Row bytes: variable
Family length: 1 byte
Family bytes: variable
Qualifier bytes: variable
Timestamp: 8 bytes
Key type: 1 byte
Value bytes: 4 bytes
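The arithmetic behind that figure can be checked with a small sketch. It assumes the standard KeyValue field widths (4 + 4 + 2 + 1 + 8 + 1 bytes of fixed overhead); the class and method names are hypothetical and not part of HBase. With a one-byte row key, family and qualifier the total is 27 bytes, and any realistic row key pushes it higher.

```java
// Sketch: serialized size of one HBase KeyValue from its field widths.
final class KeyValueSize {
    // Fixed-width fields: key length (4) + value length (4) + row length (2)
    // + family length (1) + timestamp (8) + key type (1) = 20 bytes.
    static final int FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1;

    /** Total bytes for one cell, given the lengths of its variable parts. */
    static int serializedSize(int rowLen, int familyLen, int qualifierLen, int valueLen) {
        return FIXED_OVERHEAD + rowLen + familyLen + qualifierLen + valueLen;
    }

    public static void main(String[] args) {
        // Smallest case: 1-byte row key, family and qualifier, 4-byte value.
        System.out.println(serializedSize(1, 1, 1, 4));    // 27
        // Hypothetical 16-byte row key with longer family and qualifier names.
        System.out.println(serializedSize(16, 8, 12, 4));  // 60
    }
}
```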

.. to 1/3rd
Stored large cell values by merging cells.
Reduced the family name to 1 character.
Reduced the qualifier name to 1 character.
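Why merging helps can be sketched as follows: the per-cell overhead is paid once for a merged byte block instead of once per value. This is a minimal illustration, not HSearch's actual encoding; the class name, the 8-byte row key and the 1,000 integer values are assumptions, and the resulting numbers are illustrative rather than the talk's measurement.

```java
import java.nio.ByteBuffer;

// Sketch: pack many 4-byte values into one cell value and compare storage.
final class MergedCells {
    // Fixed KeyValue fields: 4 + 4 + 2 + 1 + 8 + 1 bytes.
    static final int FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1;

    static int cellSize(int rowLen, int familyLen, int qualifierLen, int valueLen) {
        return FIXED_OVERHEAD + rowLen + familyLen + qualifierLen + valueLen;
    }

    /** Pack the values back to back into a single byte block. */
    static byte[] merge(int[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES);
        for (int v : values) buf.putInt(v);
        return buf.array();
    }

    /** Recover the original values from a merged block. */
    static int[] split(byte[] block) {
        ByteBuffer buf = ByteBuffer.wrap(block);
        int[] values = new int[block.length / Integer.BYTES];
        for (int i = 0; i < values.length; i++) values[i] = buf.getInt();
        return values;
    }

    public static void main(String[] args) {
        int[] values = new int[1000];
        for (int i = 0; i < values.length; i++) values[i] = i * 7;
        byte[] block = merge(values);

        // Assumed: 8-byte row key, 1-character family and qualifier.
        long separate = (long) values.length * cellSize(8, 1, 1, Integer.BYTES);
        long merged = cellSize(8, 1, 1, block.length);
        System.out.println("separate cells: " + separate + " bytes"); // 34000
        System.out.println("merged cell:    " + merged + " bytes");   // 4030
    }
}
```

The trade-off is that a merged block has to be read and decoded as a whole, so this suits values that are normally fetched together.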

3. Serving a request without accessing ALL region servers
Consider a 100-node HBase cluster where a single search request needs to access all of them.
Bad design.. clogged network.. no scaling

And our solution…
The index table was divided on column family into separate tables.
[Diagram: with Family A and Family B in one table, row ranges 0-1 M through 4-5 M are spread over Machines 1 to 5, so scanning "Family A" hits 5 machines. With Table A and Table B as separate tables, scanning Table A hits 3 machines.]

4. Methods to move processing near the data
Sent filtered rows over network.
    public class TermFilter implements Filter {
        // Runs on the region server for each KeyValue of a scanned row.
        public ReturnCode filterKeyValue(KeyValue kv) {
            boolean isMatched = isFound(kv);
            if (isMatched) return ReturnCode.INCLUDE;
            return ReturnCode.NEXT_ROW;  // no match: skip the rest of this row
        }
        …
E.g. matched rows for a keyword.
Sent relevant fields of a row over network.
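The effect of the row-level filtering in TermFilter can be sketched without an HBase cluster. The example below is a stand-in: an in-memory sorted map plays the role of a table, the enum mirrors only the two return codes used on the slide, and the names `ServerSideRowFilter`, `contains` and `sampleTable` are hypothetical. It shows that only rows accepted by the filter would be shipped to the client.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.function.Predicate;

// Sketch: simulate a scan where the filter runs next to the data.
final class ServerSideRowFilter {
    enum ReturnCode { INCLUDE, NEXT_ROW }

    /** Hypothetical stand-in for the slide's isFound(kv). */
    static Predicate<byte[]> contains(String term) {
        return value -> new String(value, StandardCharsets.UTF_8).contains(term);
    }

    /** Returns only the rows the filter includes; the rest never leave the "server". */
    static Map<String, byte[]> scan(SortedMap<String, byte[]> table, Predicate<byte[]> isFound) {
        Map<String, byte[]> shipped = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> row : table.entrySet()) {
            ReturnCode rc = isFound.test(row.getValue()) ? ReturnCode.INCLUDE : ReturnCode.NEXT_ROW;
            if (rc == ReturnCode.INCLUDE) shipped.put(row.getKey(), row.getValue());
        }
        return shipped;
    }

    /** Made-up rows used only for this illustration. */
    static SortedMap<String, byte[]> sampleTable() {
        SortedMap<String, byte[]> table = new TreeMap<>();
        table.put("doc1", "hbase index".getBytes(StandardCharsets.UTF_8));
        table.put("doc2", "solr shard".getBytes(StandardCharsets.UTF_8));
        table.put("doc3", "hbase filter".getBytes(StandardCharsets.UTF_8));
        return table;
    }

    public static void main(String[] args) {
        Map<String, byte[]> hits = scan(sampleTable(), contains("hbase"));
        System.out.println(hits.keySet()); // [doc1, doc3]
    }
}
```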

Sent relevant section of a field over network.
    public class DocFilter implements Filter {
        // Replaces the row's KeyValues with a single one holding only the needed piece.
        public void filterRow(List<KeyValue> kvL) {
            byte[] val = extractNeededPiece(kvL);
            kvL.clear();
            kvL.add(new KeyValue(row,fam,,val));
        }
        ….
E.g. computing a best-match section from within a document for a given query.
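The slide does not show `extractNeededPiece`. A minimal sketch of the idea, under assumptions, is given below: it takes the field bytes directly (not the `List<KeyValue>` used on the slide), and instead of a real best-match computation it simply returns a window around the first case-insensitive occurrence of the query term. The class name and the `radius` parameter are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Sketch: return only a small section of a field instead of the whole value.
final class SectionExtractor {
    /**
     * Returns up to `radius` characters on each side of the first match of `term`,
     * or an empty array when the term does not occur.
     */
    static byte[] extractNeededPiece(byte[] field, String term, int radius) {
        String text = new String(field, StandardCharsets.UTF_8);
        int hit = -1;
        // Case-insensitive search that keeps indices valid for the original text.
        for (int i = 0; i + term.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, term, 0, term.length())) {
                hit = i;
                break;
            }
        }
        if (hit < 0) return new byte[0];
        int start = Math.max(0, hit - radius);
        int end = Math.min(text.length(), hit + term.length() + radius);
        return text.substring(start, end).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] doc = "HBase stores the index and serves search requests"
                .getBytes(StandardCharsets.UTF_8);
        byte[] piece = extractNeededPiece(doc, "index", 4);
        System.out.println(new String(piece, StandardCharsets.UTF_8)); // the index and
    }
}
```

Run inside a server-side filter, a function like this means only the short section crosses the network, not the full document.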

5. Byte block caching to lower network and I/O trips to HBase
Object caching: with a growing number of objects we encountered 'Out of Memory' exceptions.
HBase commit: frequent flushing to HBase introduced network and I/O latencies.
Converting objects to intermediate byte blocks increased record processing by 20x in 1 batch.
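One way to picture the byte block approach is a buffer that appends fixed-width records into a single byte array and commits the whole block at once, instead of keeping one Java object per record or flushing per record. The sketch below is an assumption-laden illustration, not HSearch's code: the record layout (int document id plus short weight), the class name and the in-memory `committed` list standing in for HBase writes are all hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: accumulate records as raw bytes and commit them one block at a time.
final class ByteBlockBuffer {
    static final int RECORD_BYTES = Integer.BYTES + Short.BYTES; // docId + weight

    private final ByteBuffer block;
    // Stand-in for HBase: each entry represents one committed block.
    final List<byte[]> committed = new ArrayList<>();

    ByteBlockBuffer(int recordsPerBlock) {
        block = ByteBuffer.allocate(recordsPerBlock * RECORD_BYTES);
    }

    /** Append one record; commits the current block first if it is full. */
    void add(int docId, short weight) {
        if (block.remaining() < RECORD_BYTES) flush();
        block.putInt(docId).putShort(weight);
    }

    /** Commit whatever is buffered as a single block. */
    void flush() {
        if (block.position() == 0) return;
        committed.add(Arrays.copyOf(block.array(), block.position()));
        block.clear();
    }

    public static void main(String[] args) {
        ByteBlockBuffer buffer = new ByteBlockBuffer(1000);
        for (int i = 0; i < 2500; i++) buffer.add(i, (short) (i % 100));
        buffer.flush();
        System.out.println(buffer.committed.size() + " commits for 2500 records"); // 3 commits for 2500 records
    }
}
```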

6. Configuration to balance network vs. CPU vs. I/O vs. memory
In a single machine
[Diagram: Disk I/O, Memory, CPU and Network as the four resources, annotated with Block Caching, Compression, Aggressive GC, IPC Caching and Compression.]

… and its settings
Network: increased IPC cache limits (hbase.client.scanner.caching)
CPU: JVM aggressive heap ("-server -XX:+UseParallelGC -XX:ParallelGCThreads=4 -XX:+AggressiveHeap")
I/O: LZO index compression ("Inbuilt oberhumer LZO" or "Intel IPP native LZO")
Memory: HBase block caching (hfile.block.cache.size) and overall memory allocation for data-node and region-server.

.. and parallelized to multi-machines
HTable.batch (Get, Put, Deletes)

FUTURE: coprocessors (HBase 0.92 release).
Allocating appropriate resources: dfs.datanode.max.xcievers, hbase.regionserver.handler.count and dfs.datanode.handler.count
