Stream computing research is moving from terascale to petascale levels. It aims to rapidly analyze data as it streams in from many sources and to make decisions with high speed and accuracy in fields as diverse as security surveillance and financial services, including stock trading. We specifically consider real-time text indexing and search with high input data rates (10 GB/s or more) and small index age-off (expiry) times. This makes it necessary to sustain maximal indexing rates for large volumes of data and minimal indexing latency (the time between the start of indexing for a document and its availability for search), while maintaining very low search response times. In addition, future massively parallel architectures with storage-class memories will enable high-speed in-memory real-time indexing, where the index can be stored entirely in high-capacity storage-class memory. In this paper, we present the design of distributed data structures and a distributed real-time text indexing algorithm for parallel systems with a large number (thousands to hundreds of thousands) of cores/processors, while simultaneously providing acceptable search performance [1]. The inherent trade-offs among index space, indexing throughput, and search response time make this problem particularly challenging. Our algorithm uses group-based index construction and leverages novel index data structures that reduce load imbalance and make the text indexing and merge processes more scalable and efficient. We show analytically that the asymptotic parallel time complexity of our distributed indexing algorithm is better than that of typical indexing approaches by at least a factor of Omega(log(P)), where P is the number of indexing nodes in a group. We further demonstrate the performance and scalability of our distributed indexing algorithm on an MPP architecture (Blue Gene/L) using actual IBM intranet data. We achieved a high indexing throughput of around 312 GB/min on an 8K-node Blue Gene/L machine. This is 3x-7x better than parallel indexing implemented using typical approaches such as CLucene [2]. To the best of our knowledge, this is the first published result on indexing throughput at such a large scale with sustained search performance. We further show that our approach scales to 128K nodes, giving an estimated indexing throughput of 5 TB/min. We also achieved indexing latency around 10x better than that of typical indexing approaches.
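
To make the idea of group-based construction, and the source of the log(P) term, concrete, the following is a minimal single-process C++ sketch of one indexing group of P nodes: each node builds a local in-memory inverted index over its document shard, and the P partial indexes are then combined by a pairwise (tree) merge of depth ceil(log2 P). This is an illustration under simplifying assumptions only; the PartialIndex type, the buildLocalIndex/mergeIndexes helpers, and the term-to-sorted-doc-id postings layout are hypothetical and are not the paper's actual data structures or distributed implementation.

    // Illustrative sketch only: a single-process simulation of group-based
    // index construction. The real system distributes this work across P
    // indexing nodes; all names and the data layout here are assumptions.
    #include <cstddef>
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    // Simplified postings: term -> sorted list of document ids.
    using PartialIndex = std::map<std::string, std::vector<int>>;

    // Each "node" tokenizes its share of documents into a local in-memory index.
    PartialIndex buildLocalIndex(const std::vector<std::pair<int, std::string>>& docs) {
        PartialIndex idx;
        for (const auto& [docId, text] : docs) {
            std::istringstream tokens(text);
            std::string term;
            while (tokens >> term) {
                auto& postings = idx[term];
                if (postings.empty() || postings.back() != docId) postings.push_back(docId);
            }
        }
        return idx;
    }

    // Merge two partial indexes; doc ids stay sorted because documents are
    // assigned to nodes in contiguous, increasing ranges in this sketch.
    PartialIndex mergeIndexes(PartialIndex a, const PartialIndex& b) {
        for (const auto& [term, postings] : b) {
            auto& dst = a[term];
            dst.insert(dst.end(), postings.begin(), postings.end());
        }
        return a;
    }

    int main() {
        // P per-node document shards within one indexing group (here P = 4).
        std::vector<std::vector<std::pair<int, std::string>>> shards = {
            {{0, "stream computing at petascale"}, {1, "real time text indexing"}},
            {{2, "distributed index data structures"}, {3, "text search latency"}},
            {{4, "group based index construction"}, {5, "parallel merge of indexes"}},
            {{6, "storage class memory indexing"}, {7, "search response time"}},
        };

        // Step 1: every node builds its local index independently (fully parallel).
        std::vector<PartialIndex> partials;
        for (const auto& shard : shards) partials.push_back(buildLocalIndex(shard));

        // Step 2: pairwise (tree) merge; with P partial indexes this takes
        // ceil(log2 P) rounds on the critical path.
        int rounds = 0;
        while (partials.size() > 1) {
            std::vector<PartialIndex> next;
            for (std::size_t i = 0; i + 1 < partials.size(); i += 2)
                next.push_back(mergeIndexes(partials[i], partials[i + 1]));
            if (partials.size() % 2 == 1) next.push_back(partials.back());
            partials = std::move(next);
            ++rounds;
        }

        std::cout << "merge rounds: " << rounds << "\n";
        for (const auto& [term, postings] : partials.front()) {
            std::cout << term << " ->";
            for (int d : postings) std::cout << ' ' << d;
            std::cout << "\n";
        }
    }

In this simplified picture, the tree merge keeps only on the order of log(P) merge rounds on the critical path for a group of P nodes, which is the kind of gap the analytical Omega(log(P)) comparison against typical indexing approaches refers to; the actual algorithm and its analysis are given in the body of the paper.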