Baidu is not only the largest search engine in China; it’s also a vast big data analytics machine, consuming huge datasets and delivering insights in real time. That’s a technological challenge that Baidu chose to solve by developing its own “fast data” solution called OceanDB.
Dr. Shiming (Simon) Zhang has been the Principal Data Scientist in the Big Data Lab at Baidu Research China and Silicon Valley since 2014 and was one of the keynote presenters at the “Advancing Analytics” Conference in 2016. Simon led the development of the distributed in-memory relational database system, OceanDB. It was built for fast big data pipeline appliances, and it’s 100+ times faster than conventional database systems.
Simon shared his view on streaming analytics, fast data and the requirements for real-time analytics architectures at the 2016 Conference.
According to Simon, real-time data and analytics, and the datasets they produce, will become increasingly important to business success across industries – from banking to retail – and to monetisation efforts.
“Baidu collects data from websites, mobile and apps, and supplements it with GDP and geo-location data. That provides a great dataset for data analytics and uncovering ways to monetise the data,” noted Simon.
The move to real-time big data and analytics also has its challenges – from potential data leaks to storage approaches to simply extracting a useful real-time insight from the (big) data. According to Simon, it was this last challenge, delivering real-time analytics at the speed of big data, that drove the development of OceanDB.
Even with analysis moving at big data speed, attention to governance and privacy issues is still necessary.
“We’ve found we don’t have to capture an individual’s profile to understand the trend,” said Simon. “By aggregating groups of people we can capture the trend without involving the individual.”
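The idea of reading a trend from groups rather than individuals can be illustrated with a small sketch. This is not Baidu’s or OceanDB’s method: the event layout (user id, region, hour), the function name and the minimum group size are all assumptions made for illustration. Events are counted per group, user identifiers are dropped from the output, and any group with too few distinct users is suppressed.

```python
from collections import defaultdict


def aggregate_trend(events, min_group_size=3):
    """Count events per (region, hour) group without exposing individuals.

    `events` is an iterable of (user_id, region, hour) tuples (hypothetical
    layout). Groups with fewer than `min_group_size` distinct users are
    dropped so that a small group cannot point back to one person.
    """
    users_per_group = defaultdict(set)
    counts = defaultdict(int)
    for user_id, region, hour in events:
        key = (region, hour)
        users_per_group[key].add(user_id)  # used only to size the group
        counts[key] += 1
    # The result holds group-level counts only; no user ids are returned.
    return {
        key: count
        for key, count in counts.items()
        if len(users_per_group[key]) >= min_group_size
    }


events = [
    ("u1", "north", 9), ("u2", "north", 9), ("u3", "north", 9),
    ("u1", "north", 9), ("u4", "south", 9),
]
trend = aggregate_trend(events)
print(trend)
```

Here the “south” group contains a single user, so it is left out of the result, while the “north” group is reported as a count only.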
As the number of Internet of Things (IoT) devices explodes, the ability to handle big data analytics in real time – or fast data – may become a requirement for organisations a fraction of the size of Baidu. According to Simon, embracing fast data involves five steps:
1. Identify your fast data opportunity
• There are four commonalities of fast data applications:
• Need to respond in real-time to streams of data events
• Not a dashboard or look-up app
• Emphasis on real-time action
• Usually part of operations
2. Assess existing infrastructure
• Leverage systems that work well, quickly and reliably for the jobs for which they were designed, e.g. data warehouses
• Avoid complexity, keep it simple – but no simpler than necessary
• Know when best of breed beats DIY
3. Agree on success
• Define success: narrow to pain points that need to be solved
• Describe the project’s primary purpose
• Scope the project: keep it to three use cases
• Identify sub-projects and build a timetable
• Identify risks to existing production systems and isolate them
• Plan for test & QA
• Choose components that are real solutions – set goals
• Build a success profile
4. Understand the business and technology implications
• Are you solving an analytics problem or a transactional problem?
• Real-time or batch? Can you afford to wait for correct answers?
• Analytics or operational?
• Is data integrity important? Is correct data in real-time a must-have, not a nice-to-have?
• Best-of-breed vs DIY Apache stack?
• Do your people have necessary skill sets?
• Don’t forget a reference check.
5. Prototype, pilot, refine
• Don’t waste resources on a PoC – prototype an MVP
• Be realistic about resources and timelines
• Download software and pilot-test against your use cases
• Decide: open source or commercially supported?
• Benchmark and test to success criteria
• Move to production – or refine assumptions
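As an illustration of the traits listed under step 1 (respond to a stream of events with an action, rather than feed a dashboard), here is a minimal sketch. The class name, the ten-second window and the threshold of three events are hypothetical choices, not part of OceanDB or of Simon’s talk. Each incoming event is checked against a sliding time window, and the caller is told immediately whether to act.

```python
from collections import deque


class SlidingWindowTrigger:
    """Signal an action when enough events arrive within a time window."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.timestamps = deque()

    def on_event(self, timestamp):
        """Record one event; return True if the action should fire now."""
        self.timestamps.append(timestamp)
        # Evict events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        return len(self.timestamps) >= self.threshold


trigger = SlidingWindowTrigger(window_seconds=10, threshold=3)
# Timestamps in seconds, assumed to arrive in order.
decisions = [trigger.on_event(t) for t in [0, 4, 8, 30, 31, 32]]
print(decisions)
```

The decision is made per event as it arrives, which is the difference from a batch or look-up approach: nothing waits for a later query.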
The 2017 IAPA National Conference on 18 October is set to provide more unique views on data sharing, approaches to break down silos, machine learning in health, preparing an organisation for AI and more.