This survey course is aimed at both technical and non-technical people who want to understand the emerging world of Big Data, with a specific focus on Cassandra. In each sub-topic, the instructor will provide links and resource recommendations for students who want to explore that area further (for example, YouTube videos, books, blog posts). Students will be given a ~150-page PDF slide deck which can be used as reference material after the course. PDFs will also be given out for the 5 labs in the course.
This three-day Cassandra course is a DevOps-style course: a hybrid of developer and operations training. The class is 60% lecture and 40% labs.
Objectives
- Explain how to choose the correct use cases for Cassandra
- Introduce students to the core concepts of the operations side of the Cassandra database
- Deep dive into the critical architecture paths of Cassandra: Bloom filters, Block Indexes, SSTables, etc.
- Give each student access to a 3-node Cassandra cluster in Rackspace to run through some hands-on labs
- Teach the fundamentals of how to write Java code to interact with Cassandra
- Provide links to the best books, blog posts and videos for students to learn more about Cassandra on their own
Audience
Engineers, Programmers, Networking specialists, Managers
Duration
3 Days
Outline for Cassandra DevOps Training
1. Intro to Cassandra
- How to pick a NoSQL category
- Brief use case discussion of: Key/Value, Key/Document, Column Family, Graph, Real-Time
- Structured vs. Unstructured data
- Cassandra Origins: Amazon Dynamo, Google BigTable and Facebook
- So, what’s Cassandra good for? Use Cases.
- Hardware recommendations (Spinning disks vs SSD, CPU/RAM/Network requirements, etc)
- Cassandra versions
- Quick Vendor Discussion (Vanilla Apache, DataStax, Acunu)
- Book, YouTube & Blog recommendations for learning more about Cassandra
Lab #1: Install DataStax Community Edition (w/ Cassandra 2.0) and OpsCenter on one VM in Rackspace
2. Cassandra Architecture Fundamentals and Intro to CQL
- Peer to peer design
- Logical Data Model: Keyspace, Column Family/Table, Rows, Columns
- Traditional Ring design vs. VNodes
- Partitioners: Murmur3, Random (MD5) and ByteOrdered
- Gossip communications
- Coordinator node
- Seed nodes
- Write/Read consistency levels: Any, One, Two, Three, Quorum
- Snitches: Dynamic snitching, Simple Snitch, Rack Inferring Snitch, Property File Snitch, Gossiping Property File Snitch
- Routing Client requests
- How a table is flushed from Memtable onto disk into SSTable files
- Compaction fundamentals: merging SSTable data files to reduce their number
- Nodetool commands: gossipinfo, cfstats, describering
- YAML file fundamentals
- OpsCenter GUI
- Stress testing Cassandra
- CQL command fundamentals
Lab #2: Run Cassandra commands and explore OpsCenter (Create a new Keyspace and table, write data to the table, flush the table to SSTable on disk, learn how to run compaction, run nodetool commands, explore the OpsCenter web GUI, benchmark the one node by inserting and reading 100,000 rows)
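For students who want a preview of the ring and partitioner topics in this section, the following is a minimal Python sketch of how a partition key could be hashed to a token and mapped to replica nodes by walking the ring clockwise. It is an illustration under assumptions, not Cassandra's actual implementation: the token space, the node names, the evenly spaced tokens and the helper names `token_for` and `replicas_for` are all hypothetical, and MD5 is used only because the Random partitioner is MD5-based.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 127  # illustrative token space, not Cassandra's exact range


def token_for(partition_key: str) -> int:
    """Hash a partition key to a token (simplified MD5-based sketch)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def replicas_for(partition_key: str, ring, replication_factor: int):
    """Return the nodes that would hold a key.

    `ring` is a list of (token, node_name) pairs sorted by token.
    The first replica is the node with the first token >= the key's token
    (wrapping around the ring); further replicas are the next distinct
    nodes clockwise.
    """
    tokens = [token for token, _ in ring]
    distinct_nodes = len({name for _, name in ring})
    wanted = min(replication_factor, distinct_nodes)
    index = bisect.bisect_left(tokens, token_for(partition_key)) % len(ring)
    replicas = []
    while len(replicas) < wanted:
        node = ring[index % len(ring)][1]
        if node not in replicas:
            replicas.append(node)
        index += 1
    return replicas


# Three hypothetical nodes with evenly spaced tokens.
ring = [(i * RING_SIZE // 3, f"node{i + 1}") for i in range(3)]

if __name__ == "__main__":
    print(replicas_for("user:42", ring, replication_factor=2))
```

With a replication factor equal to the number of nodes, every node holds every key, which is the situation on a 3-node cluster with three replicas.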
3. Scaling Cassandra, Advanced CQL and Advanced YAML file
- Best practices for scaling a Cassandra cluster
- Managing a Cassandra cluster across data centers (additional write/read consistency levels: Local Quorum, Each Quorum, All, Serial)
- Deeper dive into the YAML file settings
- Advanced CQL concepts
Lab #3: Grow the cluster size to 3 nodes (Install Cassandra on 2 additional nodes in Rackspace and edit the YAML files to configure the 3-node cluster)
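The quorum-based consistency levels covered in this section come down to simple arithmetic on the replication factor. The sketch below, in Python, shows that arithmetic; the function names and the data center names `dc1` and `dc2` are hypothetical and chosen only for illustration.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that form a quorum: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def reads_see_latest_write(write_acks: int, read_acks: int,
                           replication_factor: int) -> bool:
    """True when the read and write replica sets must overlap (R + W > RF)."""
    return write_acks + read_acks > replication_factor


# Hypothetical two-data-center keyspace with three replicas in each.
rf_by_dc = {"dc1": 3, "dc2": 3}

# Quorum: majority of all replicas across every data center.
cluster_quorum = quorum(sum(rf_by_dc.values()))

# Local Quorum: majority of the replicas in the coordinator's data center.
local_quorum = quorum(rf_by_dc["dc1"])

# Each Quorum: a majority in every data center.
each_quorum = {dc: quorum(rf) for dc, rf in rf_by_dc.items()}

if __name__ == "__main__":
    print(cluster_quorum, local_quorum, each_quorum)
```

On the 3-node lab cluster with a replication factor of 3, writing and reading at Quorum means 2 acknowledgements each, and 2 + 2 > 3, so a read overlaps with the most recent successful write.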
4. Database Internals
- Deep dive into the Write path
- In-memory structures for each SSTable: partition index, partition summary, bloom filter
- Fsync settings for the commit log
- How inserts, updates and deletes are treated by Cassandra
- Hinted Handoffs
- Deletes and Tombstone fundamentals
- Advanced Compaction concepts
- Deep dive into the Read path: Row cache, partition key cache, partition summary, bloom filters, etc
- Off-heap components in Cassandra
- Compression concepts
- Lightweight Transactions
- Snapshots
Lab #4: Advanced Cassandra commands (query the system table, take a snapshot, decommission a node, rejoin the same node back into the cluster)
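As a companion to the read path topics above, here is a minimal Python sketch of a Bloom filter, the structure that lets a read skip SSTables that cannot contain the requested partition key. It is a teaching sketch under assumptions: the class name, the sizes and the use of SHA-256 are illustrative stand-ins, and Cassandra's own hash functions and bit layout differ.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key: str):
        # Derive several bit positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for position in self._positions(key):
            self.bits[position] = 1

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[position] for position in self._positions(key))


if __name__ == "__main__":
    sstable_filter = BloomFilter()
    for key in ["user:1", "user:2", "user:3"]:
        sstable_filter.add(key)
    # Keys that were added are always reported as possibly present.
    print(sstable_filter.might_contain("user:2"))
```

The useful property for the read path is the one-sided error: a negative answer is always correct, so the SSTable can be skipped without touching disk, while a positive answer only means the index has to be consulted.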
5. Java API
- Different ways to programmatically query Cassandra: Thrift, Hector, Astyanax, DataStax DevCenter, DataStax Java driver, Python with pycassa, DataStax C# driver, ODBC for Hive, plus others
- Writing your first client application
- Connecting to the Cassandra cluster programmatically
- Using a session to execute CQL commands
- Asynchronous I/O to Cassandra cluster
- Node discovery
- Automatic failover
- Modifying cluster configuration programmatically
Lab #5: Java API lab (learn how to programmatically insert and read data from a Cassandra cluster using the DataStax Java API)
6. Advanced Concepts
- JVM performance tuning fundamentals
- JConsole vs jmxterm
- Tools to monitor/test Cassandra clusters: disk i/o (hdparm, iostat), memory analysis, visualization with D3.js and OpsCenter
- Logging in Cassandra (log4j)
- Security: SSL encryption for client-to-node and node-to-node
- Security: Authentication and Authorization fundamentals
- Security: Firewall ports
- DataStax Enterprise: Running Hadoop with Cassandra
- DataStax Enterprise: Running Solr with Cassandra