Andrew Geweke

The last few years have seen an explosion of interest in NoSQL data-storage layers, and then some retrenchment as the limitations of these systems became increasingly apparent. (It turns out they’re not magic, after all!) Today we seem faced with a choice. On one hand, we can reach for some of the potential “big wins” of NoSQL systems, but many of them are still relatively immature — at least when compared to the RDBMS — and the things we give up (transactionality, durability, manageability) we often discover to be very painful losses. On the other hand, we can reach for the security of a traditional RDBMS; we get incredibly well-understood, robust, durable, manageable systems…but we often sacrifice a lot of potential future growth.

Michael Kjellman

I won’t lie (or conveniently fail to mention it): I have lost many nights of sleep due to Cassandra. I’ve certainly reflected and asked myself, “Was it really worth it?” Some of the sleepless nights were due to encountering previously unknown bugs, which have since been fixed. Others were caused by bad and misinformed decisions my co-workers and I made while performing various C* operations. Implemented correctly, distributed computing brings a lot of potential to your application. You can improve performance by distributing work across many physical (and inexpensive!) machines. Additionally, a database like Cassandra was designed from the beginning with replication in mind. Ensuring there are multiple copies of a dataset across multiple nodes and datacenters in distant geographical regions is not an afterthought (unlike MySQL replication). However, the many advantages and benefits of distributed computing come with the tradeoff of increased complexity.

Patrick McFadin

Three years ago, I was stuck trying to fit a use case into my Oracle database. It was getting expensive fast and I was running out of budget. A friend suggested I try Apache Cassandra for the task, and my time series use case was a perfect fit for it. It's not a perfect database: it was really hard to get my head around the data model, and driver support was scattered. There were a few points where I was ready to just give up and pay Oracle, but I stuck with it. Cassandra was the solution that fit my problem, and after a long uphill climb, it worked better than I'd expected.

Michael Kjellman

So far, I've explained why you shouldn't migrate to C* and covered its origins and key terms. Now, I'm going to turn my attention to how Cassandra stores data.

Cassandra nodes, clusters, rings


At a very high level, Cassandra operates by dividing all data evenly around a cluster of nodes, which can be visualized as a ring. Nodes generally run on commodity hardware. Each C* node in the cluster is assigned, and is responsible for, a token range: essentially a range of hash values defined by a partitioner (Murmur3Partitioner by default in C* v1.2+). With Murmur3Partitioner, token values range from -2^63 to 2^63-1; the older RandomPartitioner, the default before v1.2, uses a range of 0 to 2^127-1.
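To make the ring idea concrete, here is a minimal sketch in Python, not Cassandra's actual implementation. It assumes one token per node (no virtual nodes), no replication, and a signed 64-bit token space like the one Murmur3Partitioner uses. The function names (`token_for`, `assign_tokens`, `owner`) are hypothetical, and because Murmur3 is not in the Python standard library, a truncated MD5 digest stands in for the real hash.

```python
import bisect
import hashlib

# Assumed token space: signed 64-bit, as used by Murmur3Partitioner.
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def token_for(key: str) -> int:
    """Map a partition key to a signed 64-bit token.

    Stand-in hash: the first 8 bytes of an MD5 digest, NOT Cassandra's Murmur3.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def assign_tokens(nodes):
    """Give each node one evenly spaced token.

    A node owns the range that ends at its own token (inclusive) and starts
    just after the previous node's token.
    """
    step = 2**64 // len(nodes)
    return sorted(
        (MIN_TOKEN + (i + 1) * step - 1, name) for i, name in enumerate(nodes)
    )


def owner(ring, key):
    """Return the node with the first token >= the key's token, wrapping around."""
    tokens = [token for token, _ in ring]
    idx = bisect.bisect_left(tokens, token_for(key))
    return ring[idx % len(ring)][1]


if __name__ == "__main__":
    ring = assign_tokens(["node-a", "node-b", "node-c", "node-d"])
    for key in ("user:1", "user:2", "video:42"):
        print(key, "->", owner(ring, key))
```

Because ownership depends only on the key's hash and the sorted token list, any node holding the ring layout can work out where a key lives without consulting a central directory.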

Unknown author

Matt Jurik (Software Developer, Hulu) gave an excellent talk at Cassandra Day Silicon Valley about Hulu's migration to Cassandra. The talk features awesome diagrams of Hulu's architecture with a focus on the Hugetop service. Hugetop tracks users' progress in content. Hulu has been able to scale this service to accommodate over 400 million monthly plays. Here are my favorite snapshots from the talk.

Matt Jurik

Hulu users view 400 million videos and 2 billion advertisements each month. Hugetop is the service that allows users to track their progress in video content. The Hulu engineering team switched to a Cassandra-based architecture in the wake of unbounded data growth, MySQL servers that were running out of space, and the horrors of manual resharding.

Al Tobey

As we move into the world of big data, the system architectures and data models we've relied on for decades are hindering growth. At the core of the problem is the read-modify-write (RMW) cycle. In this talk, Al Tobey (Open Source Mechanic, DataStax) explains how to build systems that don't rely on RMW, with a focus on Cassandra. For those times when RMW is unavoidable, he covers how and when to use Cassandra's lightweight transactions and collections.

Michael Kjellman

A new class of databases (sometimes referred to as “NoSQL”) has been developed and designed with 18+ years' worth of lessons learned from traditional relational databases such as MySQL. Cassandra and other distributed or “NoSQL” databases aim to make the “right” tradeoffs to ultimately deliver a database that provides the scalability, redundancy, and performance needed in today's applications. Although MySQL may have performed well for you in the past, new business requirements and/or the need to both scale and improve the reliability of your application might mean that MySQL is no longer the correct fit.

Russell Jurney

In this talk, Russell Jurney (author of Agile Data) shows how to rapidly prototype analytics applications using the Hadoop stack, as a way to return to agility in light of the ever-deepening analytics stack. This presentation uses Hadoop, Pig, NoSQL stores and lightweight web frameworks to rapidly connect end-users to real insights. This talk was recorded at the SF Data Mining meetup at Trulia.

Fangjin Yang

In this talk, Fangjin Yang of MetaMarkets talks about their motivations for building Druid, its architecture, how it works, and its real-time capabilities. Druid is open source infrastructure for Real-time Exploratory Analytics on Large Datasets. The system uses an always-on, distributed, shared-nothing architecture designed for real-time querying and data ingestion. It leverages column-orientation and advanced indexing structures to allow for cost-effective, arbitrary exploration of multi-billion-row tables with sub-second latencies. This talk was recorded at the SF Data Engineering meetup at Square.

Jon Hyman

In this talk, "MongoDB, Analytics, and Flexible Schemas," Jon Hyman, CTO and co-founder of Appboy, discusses how Appboy takes advantage of MongoDB's schemaless data modeling for analytic pre-aggregation. Jon also discusses how Appboy uses the aggregation framework and statistical analysis to estimate results of ad-hoc queries over tens of millions of database documents. He also covers other tips and hints that he learned from growing a MongoDB setup to support thousands of writes per second. This talk was recorded at the New York MongoDB user group meetup at eBay NYC.

Matt Story

About the talk: NoSQL databases seem to be everywhere you look these days, whether it's 10gen becoming MongoDB, AWS exposing DynamoDB as a service, or a heated argument overheard at a meetup pitting Riak against Voldemort. In all the hubbub, there is one key-value store replete with name-spacing support, backed by an open standard and supporting a robust and battle-tested authorization scheme that is consistently overlooked -- POSIX filesystems.

Radu Gheorghe

In this talk, Radu Gheorghe, from Sematext, gives an Introduction to Elasticsearch. Radu starts out by explaining what Elasticsearch is and how it can act as your NoSQL data-store while providing quick, flexible and scalable search, for example by indexing logs or storing product information so that customers can search on them. Radu also delivers a demo of the most important functions of Elasticsearch. Key talking points include indexing and searching for documents, text analysis for tweaking the relevance of your searches, facets for pulling statistics out of documents, and scaling out for more capacity and fault tolerance. He also touches on performance tuning for indexing, as well as monitoring and administering your cluster in production. This talk was recorded at the NYC Search, Discovery and Analytics meetup at Gilt.

Paul Dix

In this presentation, Paul Dix from Errplane gives an introduction to InfluxDB, an open source distributed time series database that he created. Paul talks about why one would want a database that's specifically for time series and also covers its API as well as some of the key features of InfluxDB, including:

Patrick McFadin

In this introduction to Cassandra, Patrick McFadin, Chief Evangelist for Apache Cassandra at DataStax, presents on why Cassandra is a key player in database technologies. Large and small companies alike choose Apache Cassandra as their database solution, and Patrick explains why they made this choice. He also discusses Cassandra's architecture, including data modeling, time-series storage and replication strategies, providing a holistic overview of how Cassandra works and the best way to get started. This talk was recorded at the Big Data Gurus meetup at Samsung R&D.

Ben Engber

Ben Engber, CEO and founder of Thumbtack Technology, discusses how to perform tuned benchmarking across a number of NoSQL solutions. He describes a NoSQL Database Comparison across Couchbase, Aerospike, MongoDB, Cassandra, HBase, and others in a way that does not artificially distort the data in favor of a particular database or storage paradigm. This includes hardware and software configurations, as well as ways of measuring to ensure repeatable results. This talk was recorded at the Scale Warriors of NYC meetup at adMarketplace.

Alex Baranau

HBase

In this talk from the HBase NYC group, hear Alex Baranau, Software Engineer at Sematext International, give an Introduction to HBase.

This presentation consists of two parts, covering the "Introduction to HBase" and "Introduction to HBase Internals" topics. In the first part you'll hear about the key features of HBase and their importance, what HBase setups look like, HBase usage patterns, and when to choose HBase. The second part covers some aspects of HBase's underlying architecture, as well as some schema design insights. Understanding this will help HBase users make better use of this powerful database technology and avoid common mistakes.

