Adrian Colyer (Venture Partner for Accel Partners)
Adrian is the author of 'The Morning Paper' where he reviews an interesting CS paper every weekday. When he's not writing about distributed systems he also serves as a Venture Partner for Accel Partners in London, and advises a number of Accel's portfolio companies. Prior to joining Accel, Adrian held CTO roles at Pivotal, VMware, and SpringSource. Catch him on Twitter at @adriancolyer.
NoSQL matters, on that much I'm sure we can all agree. But if we take a closer look, what really matters when it comes to choosing a data store and/or a data processing platform? What really matters when it comes to getting the most out of that platform? And what is really going to matter as we take things to the next level?
Chris Ward (Crate.IO)
Chris Ward is a developer advocate for Crate.IO. He has worked in open source communities for 15 years and has always loved to inform and help those working within them. He has spoken at meet-ups and conferences around the world on a wide variety of topics. He is originally from London, spent many years in Melbourne, Australia and now lives in Germany.
Abstract: Understanding databases for distributed Docker applications
In this talk we'll focus on the use of Crate alongside Weave in Docker containers, the technical challenges, best practices learned, and getting a big data application running alongside it. You'll learn about the reasons why Crate.IO is building "yet another NoSQL database" and why it's unique and important when running web-scale containerized applications. We'll show why the shared-nothing architecture is so important when deploying large clusters in containers and how it addresses the issues and fears of a Docker-based persistence layer. You will learn how to deploy a Crate cluster in the cloud within minutes using Docker, some of the challenges you'll encounter, and how to overcome them in order to scale your backends efficiently. We focused on super simple integration with any cloud provider, striving to make it as turnkey as possible with minimal up-front configuration required to establish a cluster. Once established, we'll show how to scale the cluster horizontally by simply adding more nodes. The session will also give you examples of when you should use Crate compared to other similar technologies such as MongoDB, Hadoop, Cassandra or FoundationDB. We'll talk about this approach's strengths and what types of applications are well-suited for this type of data store, as well as which are not. Finally we'll outline how to architect an application that is easy to scale using Crate and Docker.
Akmal Chaudhri (Independent IT Consultant)
Akmal B. Chaudhri is an Independent Consultant, specializing in Big Data, NoSQL and NewSQL database technologies. He has over 25 years' experience in IT and has previously held roles as a developer, consultant, product strategist and technical trainer. He has worked for several blue-chip companies, such as Reuters and IBM, and also for the Big Data startups Hortonworks (Hadoop) and DataStax (Cassandra NoSQL Database). He has worked extensively on worldwide developer relations programs at IBM and worldwide university relations and academic initiative programs at Informix and IBM. He has regularly presented at many international conferences and served on the program committees for a number of major conferences and workshops. He has published and presented widely on Java, XML and Database systems and edited or co-edited 10 books. He holds a BSc (1st Class Hons.) in Computing and Information Systems, an MSc in Business Systems Analysis and Design and a PhD in Computer Science. He is a Member of the British Computer Society (MBCS) and a Chartered IT Professional (CITP).
Abstract: How to Build Streaming Data Applications: Evaluating the Top Contenders
Building applications on streaming data has its challenges. If you are trying to use programs such as Apache Spark or Storm to build applications, this presentation will explain the advantages and disadvantages of each solution and how to choose the right tool for your next streaming data project. Building streaming data applications that can manage the massive quantities of data generated from mobile devices, M2M, sensors and other IoT devices is a big challenge that many organizations face today. Traditional tools, such as conventional database systems, do not have the capacity to ingest data, analyze it in real time, and make decisions. New technologies such as Apache Spark and Storm are now coming to the forefront as possible solutions to handling fast data streams. Typical technology choices fall into one of three categories: OLAP, OLTP, and stream-processing systems. Each of these solutions has its benefits, but some choices support streaming data and application development much better than others. Employing a solution that handles streaming data, provides state, ensures durability, and supports transactions and real-time decisions is key to benefitting from fast data. During this presentation you will learn:
- The difference between fast OLAP, stream-processing, and OLTP database solutions.
- The importance of state, real-time analytics and real-time decisions when building applications on streaming data.
- How streaming applications deliver more value when built on a super-fast, in-memory SQL database.
Damien Krotkine (Booking.com)
Damien Krotkine is a software engineer at Booking.com (the world’s leading online hotel and accommodation reservations company). He currently works on the events subsystem, where he helps gather, store, manage and analyze large quantities of data in real time. Previously, he worked in various fields such as Linux distributions, e-commerce and online real-time advertising. He's an active member of the Perl community, maintaining some NoSQL-related modules (Redis driver, Riak client, Bloomd client ...).
Abstract: Events storage analysis with Riak at Booking.com
At Booking.com, we have a constant flow of events coming from various applications and internal subsystems. This critical data needs to be stored for real-time, medium and long term analysis. Events are schema-less, making it difficult to use standard analysis tools. This presentation will explain how we built a storage and analysis solution based on Riak. The talk will cover: data aggregation and serialization, Riak configuration, solutions for lowering the network usage, and finally, how Riak's advanced features are used to perform real-time data crunching on the cluster nodes.
Dan Sullivan (DS Applied Technologies)
Senior systems architect specializing in data analytics, data mining, statistics, data modeling and data warehousing, cloud computing, bioinformatics and computational biology. Skills include: Oracle, SQL Server, MySQL, NoSQL, R, Python, statistics. Extensive writing experience in topics including: cloud computing, project management and systems management. Author of NoSQL for Mere Mortals (Addison Wesley, forthcoming) and numerous articles for TechTarget (http://www.techtarget.com/contributor/Dan-Sullivan) and Tom's IT Pro (http://www.tomsitpro.com/tags/?author=dan+sullivan).
Abstract: Data Analytics and Text Mining with MongoDB
Data analysis is an exploratory process that requires a variety of tools and a flexible data store. Data analysis projects are easy to start but quickly become difficult to manage and error-prone when depending on file-based data storage. Relational databases are poorly equipped to accommodate the dynamic demands of complex analysis. This talk describes best practices for using MongoDB for analytics projects. Examples will be drawn from a large-scale text mining project (approximately 25 million documents) that applies machine learning (neural networks and support vector machines) and statistical analysis. Tools discussed include R, Spark, the Python scientific stack, and custom pre-processing scripts, but the focus is on using these with the document database.
Felipe Hoffa (Google)
In 2011 Felipe Hoffa moved from Chile to San Francisco to join Google as a Software Engineer. Since 2013 he's been a Developer Advocate on big data, inspiring developers around the world to leverage the Google Cloud Platform tools to analyze and understand their data in ways they never could before. You can find him in several YouTube videos, blog posts, and conferences around the world.
Abstract: Analyzing 50 billion Wikipedia pageviews in 5 seconds
Wikipedia has been publishing its pageviews for years (http://dumps.wikimedia.org/other/pagecounts-raw/). This is deeply interesting data, which some services use to help people understand and analyze it (http://stats.grok.se/). But how can we give users a way to answer arbitrary questions, without needing hours to download and load this data into their own clusters? In this talk I'll show and share how I use Google BigQuery to answer ad-hoc questions in seconds. Even better, Google makes this data and engine available for everyone and with a monthly free processing quota. I'll also feature interesting queries and results, with some jaw-dropping examples (I promise).
Mark Harwood (Elastic Co.)
Mark is a developer at Elasticsearch and long-time contributor to Lucene. Prior to joining Elasticsearch, Mark was Chief Scientist at BAE Systems Detica and designed search and visualization systems on multi-billion document solutions for analysts in commercial and government clients.
Abstract: Building Entity Centric Indexes
Sometimes we need to step back and take a look at the bigger picture - not just counting huge piles of individual log records, but reasoning about the behaviors of the people who are ultimately generating this firehose of data. While your DevOps folks care deeply about log records from a machine utilization perspective, marketing wants to know what these records tell us about the customers' needs. Elasticsearch Aggregations are a great feature but are not a panacea. We can happily use them to summarise complex things like the number of web requests per day broken down by geography and browser type on a busy website, but we would quickly run out of memory if we tried to calculate something as simple as a single number for the average duration of visitor web sessions when using the very same dataset. Why does this occur? A web session duration is an example of a behavioural attribute not held on any one log record; it has to be derived by finding the first and last records for each session in our weblogs, requiring some complex query expressions and a lot of memory to connect all the data points. We can maintain a more useful joined-up picture if we run an ongoing background process to fuse related events from one index into "entity-centric" summaries in another index, e.g.:
• Web log events summarised into "web session" entities
• Road-worthiness test results summarised into "car" entities
• Reviews in a marketplace summarised into a "reviewer" entity
Using real data, this session will demonstrate how to incrementally build entity-centric indexes alongside event-centric indexes by using simple scripts to uncover interesting behaviours that accumulate over time. We'll explore:
• Which cars are driven long distances after failing roadworthiness tests?
• Which website visitors look to be behaving like "bots"?
• Which seller in my marketplace has employed an army of "shills" to boost his feedback rating?
Attendees will leave this session with all the tools required to begin building entity-centric indexes and using that data to derive richer business insights across every department in their organization.
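The session-duration example from this abstract can be sketched in a few lines of Python. This is only an illustration of the idea of folding event records into per-entity summaries; it is not the speaker's scripts and does not use any Elasticsearch API. The field names `session_id` and `timestamp` and the functions `update_sessions` and `average_duration` are hypothetical.

```python
from statistics import mean


def update_sessions(sessions, events):
    """Fold a batch of log events into per-session summaries (entities).

    `sessions` maps a session id to its summary and is updated in place,
    so the function can be called again whenever a new batch arrives.
    """
    for event in events:
        sid = event["session_id"]  # hypothetical field name
        ts = event["timestamp"]    # hypothetical field: seconds since epoch
        summary = sessions.setdefault(
            sid, {"first_seen": ts, "last_seen": ts, "events": 0}
        )
        summary["first_seen"] = min(summary["first_seen"], ts)
        summary["last_seen"] = max(summary["last_seen"], ts)
        summary["events"] += 1
    return sessions


def average_duration(sessions):
    """Average session duration, derived from the summaries alone."""
    return mean(s["last_seen"] - s["first_seen"] for s in sessions.values())


# Two batches of events, as an ongoing background process might see them.
sessions = {}
update_sessions(sessions, [
    {"session_id": "a", "timestamp": 100},
    {"session_id": "b", "timestamp": 105},
    {"session_id": "a", "timestamp": 160},
])
update_sessions(sessions, [
    {"session_id": "b", "timestamp": 125},
    {"session_id": "a", "timestamp": 220},
])
print(average_duration(sessions))
```

Because each summary already holds the first and last timestamp of its session, the average is computed over one small record per session rather than over every raw log event.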
Abstract: NoSQL meets Microservices
Just a few years ago all software systems were designed to be monoliths running on a single big and powerful machine. But nowadays most companies desire to scale out instead of scaling up, because it is much easier to buy or rent a large cluster of commodity hardware than to get a single machine that is powerful enough. In the database area scaling out is realized by utilizing a combination of polyglot persistence and sharding of data. On the application level scaling out is realized by microservices. In this talk I will briefly introduce the concepts and ideas of microservices and discuss their benefits and drawbacks. Afterwards I will focus on the point of intersection of a microservice-based application talking to one or many NoSQL databases. We will try and find answers to these questions: Are there differences compared to a monolithic application? How to scale the whole system properly? What about polyglot persistence? Is there a data-centric way to split microservices?
Nathan Ford (Information Mosaic)
Nathan is an engineer at Information Mosaic here in Dublin. In his spare time he is a graph database enthusiast; his main area of interest is graph-based software analytics.
Abstract: Divination of the Defects (Graph-Based Defect Prediction through Change Metrics)
While metrics generated by static code analysis are well established as predictors of possible future defects, there is another untapped source of useful information, namely your source code revision history. This presentation will discuss converting this revision information into a graph representation, various defect prediction models and how to generate their related change metrics through graph traversal, as well as the potential applications and benefits of these graph enabled prediction models.
Peter Bakas (Netflix)
Peter leads the Events and Data Pipeline team at Netflix. His team is responsible for building common infrastructure to collect, transport, aggregate, process and visualize over 400 billion events a day. Prior to Netflix, Peter led the Platform Engineering and Infrastructure teams at Ooyala where his teams were responsible for the development and implementation of Ooyala’s PaaS and IaaS solutions that enabled the delivery of 1 billion videos to 200 million unique monthly users in 130 countries and processed 60 billion analytic events per month. Peter also advises several startups including an early stage, stealth startup building products and services on Spark, Cassandra & Kafka.
Abstract: Zero to Insights - Real time analytics with Kafka, C*, and Spark
In this talk, Peter will cover his experience using Spark, Cassandra & Kafka to build a real-time analytics platform that processed billions of events a day. He will cover the challenges of turning all those raw events into actionable insights. He will also cover scaling the platform across multiple regions, as well as across multiple cloud environments.
Philipp Krenn (ecosio)
Philipp Krenn runs everything database-related and the cloud infrastructure of the Vienna-based B2B startup ecosio. When not fighting MongoDB, MySQL, Jenkins, or AWS, he gives NoSQL and cloud computing training. In his spare time he organizes the ViennaDB and Papers We Love Vienna meetups.
Abstract: Host your database in the cloud, they said...
More than two years ago we faced the decision whether to run our MongoDB database on Amazon's EC2 ourselves or to rely on a Database as a Service provider. Common wisdom told us that a well known provider, focusing all its knowledge and energy on running MongoDB, would be a better choice than us trying it on the side. Well, this talk describes what can go wrong, since we have seen a lot of interesting minor and major hiccups — including stopped instances, broken backups, a major security incident, and more broken backups. Additionally, we discuss some reasons why a hosted solution is not always the better choice and which new challenges arise from it.
Prassnitha Sampath (Groupon)
Prassnitha Sampath is a senior engineer in the Deal Personalization Team at Groupon. She designed and delivered the real time analytics infrastructure that powers personalization across web, mobile and email platforms using big data technologies such as HBase, Storm and Kafka. Before personalization infrastructure work, Prassnitha worked on Seller Services at Amazon. Prassnitha holds degrees from Portland State University and Madras University.
Abstract: Real Time Big Data Analytics with Kafka, Storm & HBase
Relevance and personalization are crucial to building a personalized local commerce experience at Groupon. This talk gives an overview of the real-time analytics infrastructure that handles over 3 million events/second and stores and scales to billions of data points. It covers how our Kafka -> Storm -> Redis/HBase pipeline is used to generate real-time analytics for hundreds of millions of Groupon users. It also covers various architecture design choices and tradeoffs, including some interesting algorithmic choices such as Bloom filters and HyperLogLog. Attendees can take away lessons from our real-life experience that can help them understand various tuning methods and their tradeoffs, and apply them in their own solutions.
Stephan Hochdoerfer (bitExpert AG)
Stephan Hochdörfer currently holds the position of Head of Technology at bitExpert AG, a company specializing in software and mobile development. His primary focus is everything related to web development as well as automation techniques ranging from code generation to deployment automation.
Abstract: The NoSQL Store everyone ignores: PostgreSQL
PostgreSQL is well known as an object-relational database management system. At its core PostgreSQL is schema-aware, dealing with fixed database tables and column types. However, recent versions of PostgreSQL made it possible to deal with schema-free data. Learn which new features PostgreSQL supports and how to use those features in your application.
Zigmars Rasscevskis (Clusterpoint)
Zigmars left a senior engineering position at Google to join Clusterpoint, seeing the importance of NoSQL database technology that is available to developers worldwide. Zigmars worked across different areas of engineering at Google, including Cluster Management and Product Search, most recently as engineering manager of the Websearch backend team in Zurich. The experience of building the largest and most scalable search engine in the world has made him believe in the utility of distributed systems. At Clusterpoint Zigmars is happy to lead a team focused on NoSQL database technology that makes a difference for developers and organisations by taking care of the complexity of data management. The key factor in Zigmars' decision to join Clusterpoint was the team's commitment to innovation in database technology.
Abstract: Why NoSQL database-as-a-service (DBAAS) is game changing for cloud computing?
Conventionally, cloud computing is viewed as renting hardware instead of owning it. This talk attempts to explain why distributed databases can serve as a solid foundation for computing that is massively parallel and instantly scalable. Massive and instant scalability can be achieved with effective workload scheduling and prioritization algorithms within the database software stack. The talk will cover some practical matters of large-scale distributed system design relevant for the database field:
1. Failure tolerance of distributed systems. By design, distributed systems with state replication are resistant against most forms of single machine failures. However, the real key in distributed system design is avoiding correlated failures due to a common cause, which can be fatal for an entire system.
2. Difficulties of performance guarantees in a multi-tenant environment. Effective workload scheduling and low-level resource prioritization and isolation are key techniques to achieve predictable performance.
3. For many years, ACID-compliant transactions have been a challenging topic for NoSQL because of the distributed architecture. A brief illustration of how Clusterpoint solved the issue of handling multi-document updates consistently in a distributed environment.
Stefan Edlich
Prof. Dr. Stefan Edlich is a senior lecturer at the Beuth University of Applied Sciences Berlin. He wrote two of the world’s first NoSQL books and twelve other IT books for publishers such as Apress, O’Reilly, Spektrum/Elsevier, Hanser and others. In 2008 he created the ICOODB conference series, which has run in Berlin, ETH Zürich and Frankfurt. Finally, he runs the NoSQL Archive and organizes NoSQL events worldwide. The variety of topics that surrounds the work of Stefan Edlich makes him the perfect candidate to chair the NoSQL conference program committee.
Michael Hausenblas
Michael works at MapR Technologies as Chief Data Engineer where he helps people to tap the potential of Big Data. His background is in large-scale data integration research and development, advocacy and standardisation. He has experience with NoSQL databases and the Hadoop ecosystem. Michael speaks at events, blogs about Big Data, and writes articles and books on this topic. Michael contributes to Apache Drill, a distributed system for interactive analysis of large-scale datasets. Michael lives in Ireland and is therefore perfectly suited for being part of the NoSQL matters Dublin program committee.
Frank Celler
As head of Dr. Celler Cologne Lectures, Frank Celler is the host of the NoSQL matters conference series as well as of the NoSQL Cologne User Group. He has worked in the software business for 20 years and entered the world of NoSQL more than 13 years ago. Working for different companies, he discovered the potential of high-performance databases early on. Today he is passionate about promoting the importance of NoSQL to the world. Together with Stefan Edlich and Marc Planagumà, he chairs the NoSQL matters 2014 program committee to select the finest talks for the conference’s agenda.