
Difference between partition key, composite key and clustering key in Cassandra?

I have been reading articles around the net to understand the differences between the following key types, but I still find them hard to grasp. Examples would definitely help.

primary key,
partition key, 
composite key 
clustering key
I found this article which contains many detailed explanations around these concepts.
This article also points out clearly these terms.
@duong_dajgja URL shared by you above is broken, can you please edit comment with valid/useful URL?
@realPK The link somehow has gone. But I found another link for you here quora.com/….

Federico Navarrete

There is a lot of confusion around this, so I will try to make it as simple as possible.

The primary key is a general concept to indicate one or more columns used to retrieve data from a Table.

The primary key may be SIMPLE and even declared inline:

create table stackoverflow_simple (
    key text PRIMARY KEY,
    data text
);

That means that it is made up of a single column.

But the primary key can also be COMPOSITE (aka COMPOUND), made up of multiple columns.

create table stackoverflow_composite (
    key_part_one text,
    key_part_two int,
    data text,
    PRIMARY KEY(key_part_one, key_part_two)
);

With a COMPOSITE primary key, the "first part" of the key is called the PARTITION KEY (in this example key_part_one is the partition key) and the second part of the key is the CLUSTERING KEY (in this example key_part_two).

Please note that both the partition key and the clustering key can consist of multiple columns. Here's how:

create table stackoverflow_multiple (
    k_part_one text,
    k_part_two int,
    k_clust_one text,
    k_clust_two int,
    k_clust_three uuid,
    data text,
    PRIMARY KEY((k_part_one, k_part_two), k_clust_one, k_clust_two, k_clust_three)
);

Behind these names ...

The Partition Key is responsible for data distribution across your nodes.

The Clustering Key is responsible for data sorting within the partition.

The Primary Key is equivalent to the Partition Key in a single-field-key table (i.e. Simple).

The Composite/Compound Key is just any multiple-column key.
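The two roles above can be sketched with a toy model in Python. This is only an illustration under stated assumptions: the node names, `node_for`, and `ToyTable` are hypothetical, and the MD5-modulo placement is a stand-in for Cassandra's real partitioner and token ring, which work differently.

```python
import hashlib
from collections import defaultdict

NODES = ["node-a", "node-b", "node-c"]  # hypothetical three-node cluster


def node_for(partition_key: tuple) -> str:
    """Pick a node from a hash of the partition key (toy stand-in for a token ring)."""
    digest = hashlib.md5(repr(partition_key).encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return NODES[token % len(NODES)]


class ToyTable:
    def __init__(self):
        # node -> partition key -> {clustering key: data}
        self.nodes = defaultdict(lambda: defaultdict(dict))

    def insert(self, partition_key, clustering_key, data):
        # The partition key alone decides which node stores the row.
        self.nodes[node_for(partition_key)][partition_key][clustering_key] = data

    def read_partition(self, partition_key):
        # Rows inside one partition come back ordered by clustering key.
        rows = self.nodes[node_for(partition_key)][partition_key]
        return [(ck, rows[ck]) for ck in sorted(rows)]


table = ToyTable()
table.insert(("ronaldo",), (10,), "ex-football player")
table.insert(("ronaldo",), (9,), "football player")
print(table.read_partition(("ronaldo",)))
```

Both rows share the partition key `("ronaldo",)`, so they land on the same node, and they are returned in clustering-key order regardless of insertion order.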

Further usage information: DATASTAX DOCUMENTATION

insert into stackoverflow_simple (key, data) VALUES ('han', 'solo');
select * from stackoverflow_simple where key='han';

table content

key | data
----+------
han | solo

A COMPOSITE/COMPOUND KEY can retrieve "wide rows" (i.e. you can query by just the partition key, even if you have clustering keys defined).

insert into stackoverflow_composite (key_part_one, key_part_two, data) VALUES ('ronaldo', 9, 'football player');
insert into stackoverflow_composite (key_part_one, key_part_two, data) VALUES ('ronaldo', 10, 'ex-football player');
select * from stackoverflow_composite where key_part_one = 'ronaldo';

table content

 key_part_one | key_part_two | data
--------------+--------------+--------------------
      ronaldo |            9 |    football player
      ronaldo |           10 | ex-football player

But you can also query with all keys (both partition and clustering):

select * from stackoverflow_composite 
   where key_part_one = 'ronaldo' and key_part_two  = 10;

query output

 key_part_one | key_part_two | data
--------------+--------------+--------------------
      ronaldo |           10 | ex-football player

Important note: the partition key is the minimum specifier needed to perform a query using a where clause. If you have a composite partition key, like the following

eg: PRIMARY KEY((col1, col2), col10, col4)

you can only query by passing at least both col1 and col2, since these are the 2 columns that define the partition key. The "general" rule for querying is that you must pass at least all partition key columns, and then you can optionally add each clustering key in the order in which they are declared.

So the valid queries are (excluding secondary indexes):

col1 and col2

col1 and col2 and col10

col1 and col2 and col10 and col4

Invalid:

col1 and col2 and col4

anything that does not contain both col1 and col2
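The rule above can be written down as a small check. The sketch below is a simplification, not how Cassandra validates queries: `is_valid_restriction` is a hypothetical helper that only considers equality restrictions and ignores secondary indexes, range restrictions, `IN`, and `ALLOW FILTERING`.

```python
def is_valid_restriction(partition_cols, clustering_cols, restricted):
    """Return True if equality restrictions on `restricted` follow the rule above."""
    restricted = set(restricted)
    known = set(partition_cols) | set(clustering_cols)
    if not restricted <= known:
        return False  # non-key columns would need a secondary index
    if not set(partition_cols) <= restricted:
        return False  # every partition key column is mandatory
    # Clustering columns must form a prefix of the declared order:
    # once one is skipped, no later one may be restricted.
    seen_gap = False
    for col in clustering_cols:
        if col not in restricted:
            seen_gap = True
        elif seen_gap:
            return False
    return True


# PRIMARY KEY((col1, col2), col10, col4)
partition = ["col1", "col2"]
clustering = ["col10", "col4"]
for cols in (["col1", "col2"], ["col1", "col2", "col10"],
             ["col1", "col2", "col10", "col4"], ["col1", "col2", "col4"], ["col1"]):
    print(cols, is_valid_restriction(partition, clustering, cols))
```

The first three column sets are accepted; `col1, col2, col4` is rejected because it skips `col10`, and `col1` alone is rejected because it does not cover the whole partition key.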


@brain storm: As far as possible, I've added some more information and usage examples. It's not easy to represent data in a single post
To create a secondary index: CREATE INDEX myindex ON mytable(acolumn);
As I wrote: since col10 is defined before col4, you have to pass col10 in order to also query by col4
You can add secondary indexes but that does not mean you can execute "any" cql query -- and more: before creating a secondary index you should count till 10 ... 000 ..... :)
Secondary indexes are implemented as local indexes: they are not distributed across the cluster. Each node of the cluster is responsible for storing the secondary indexes of the data it owns. For this reason, a query on a secondary index might involve all nodes in the cluster
OrangeDog

Adding a summary answer as the accepted one is quite long. The terms "row" and "column" are used in the context of CQL, not how Cassandra is actually implemented.

A primary key uniquely identifies a row.

A composite key is a key formed from multiple columns.

A partition key is the primary lookup to find a set of rows, i.e. a partition.

A clustering key is the part of the primary key that isn't the partition key (and defines the ordering within a partition).

Examples:

PRIMARY KEY (a): The partition key is a.

PRIMARY KEY (a, b): The partition key is a, the clustering key is b.

PRIMARY KEY ((a, b)): The composite partition key is (a, b).

PRIMARY KEY (a, b, c): The partition key is a, the composite clustering key is (b, c).

PRIMARY KEY ((a, b), c): The composite partition key is (a, b), the clustering key is c.

PRIMARY KEY ((a, b), c, d): The composite partition key is (a, b), the composite clustering key is (c, d).
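The parenthesis rule behind these examples can be sketched as a tiny parser. `split_primary_key` is a hypothetical helper written for illustration; it assumes plain, unquoted column names and is not how Cassandra itself parses CQL.

```python
import re


def split_primary_key(clause: str):
    """Split a 'PRIMARY KEY (...)' clause into (partition_key, clustering_key) column lists."""
    match = re.fullmatch(r"\s*PRIMARY\s+KEY\s*\((.*)\)\s*", clause,
                         flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        raise ValueError(f"not a PRIMARY KEY clause: {clause!r}")
    body = match.group(1).strip()
    if body.startswith("("):
        # Inner parentheses enclose a composite partition key.
        close = body.index(")")
        partition = [c.strip() for c in body[1:close].split(",")]
        rest = body[close + 1:]
    else:
        # No inner parentheses: the first column alone is the partition key.
        first, _, rest = body.partition(",")
        partition = [first.strip()]
    # Whatever follows the partition key is the clustering key, in declared order.
    clustering = [c.strip() for c in rest.split(",") if c.strip()]
    return partition, clustering


for clause in ("PRIMARY KEY (a)", "PRIMARY KEY (a, b)", "PRIMARY KEY ((a, b))",
               "PRIMARY KEY (a, b, c)", "PRIMARY KEY ((a, b), c, d)"):
    print(clause, "->", split_primary_key(clause))
```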


This actually answers a question I've had for a while: how do you define a composite (multi-column) partition key without a clustering key? The trick is to use double parentheses! PRIMARY KEY ((a, b))
Federico Navarrete

In Cassandra, the difference between primary key, partition key, composite key and clustering key always causes some confusion, so I am going to explain each of them below and relate them to each other. We use CQL (Cassandra Query Language) for Cassandra database access. Note: this answer applies to recent versions of Cassandra.

Primary Key: In Cassandra there are 2 different ways to declare a primary key.

CREATE TABLE Cass (
    id int PRIMARY KEY,
    name text 
);

Create Table Cass (
   id int,
   name text,
   PRIMARY KEY(id) 
);

In CQL, the order in which columns are defined for the PRIMARY KEY matters. The first column of the key is called the partition key. It has the property that all rows sharing the same partition key (even across tables, in fact) are stored on the same physical node. Also, insertions, updates and deletions on rows sharing the same partition key for a given table are performed atomically and in isolation. Note that it is possible to have a composite partition key, i.e. a partition key formed of multiple columns, by using an extra set of parentheses to define which columns form the partition key.

Partitioning and Clustering: The PRIMARY KEY definition is made up of two parts, the Partition Key and the Clustering Columns. The first part maps to the storage engine row key, while the second is used to group columns in a row.

CREATE TABLE device_check (
  device_id   int,
  checked_at  timestamp,
  is_power    boolean,
  is_locked   boolean,
  PRIMARY KEY (device_id, checked_at)
);

Here device_id is the partition key and checked_at is the clustering key.

Depending on the declaration, we can have multiple clustering keys as well as multiple partition keys.


You could have given some credit to your sources (2013 = older than your post): thelastpickle.com/blog/2013/01/11/primary-keys-in-cql.html
Federico Navarrete

Primary Key: Is composed of partition key(s) [and optional clustering keys (or columns)].

Partition Key: The hash value of the partition key is used to determine the specific node in a cluster that stores the data.

Clustering Key: Is used to sort the data in each of the partitions (or on the responsible node and its replicas).

Compound Primary Key: As said above, the clustering keys are optional in a Primary Key. If they aren't mentioned, it's a simple primary key. If clustering keys are mentioned, it's a Compound primary key.

Composite Partition Key: Using just one column as a partition key might result in wide row issues (depending on the use case and data modeling). Hence the partition key is sometimes specified as a combination of more than one column.

Regarding the confusion about which keys are mandatory and which can be skipped in a query, it helps to imagine Cassandra as a giant HashMap. In a HashMap, you can't retrieve the values without the key.

Here, the partition keys play the role of that key. So each query needs to specify them; without them, Cassandra wouldn't know which node to search.

The clustering keys (columns, which are optional) help in further narrowing your query search after Cassandra finds out the specific node (and its replicas) responsible for that specific Partition key.
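The HashMap analogy can be made concrete with a small sketch. `PartitionMap` is a hypothetical toy class, not Cassandra's storage engine: a dictionary keyed by partition key, where each value is a list of rows kept sorted by clustering key so that a lookup can be narrowed further.

```python
from bisect import bisect_left, bisect_right


class PartitionMap:
    """Toy model: a hash map from partition key to rows kept sorted by clustering key."""

    def __init__(self):
        self._partitions = {}  # partition key -> sorted list of (clustering key, value)

    def put(self, partition_key, clustering_key, value):
        rows = self._partitions.setdefault(partition_key, [])
        keys = [ck for ck, _ in rows]
        i = bisect_left(keys, clustering_key)
        if i < len(rows) and rows[i][0] == clustering_key:
            rows[i] = (clustering_key, value)  # same primary key: overwrite
        else:
            rows.insert(i, (clustering_key, value))

    def get(self, partition_key, clustering_from=None, clustering_to=None):
        # The partition key is mandatory; the clustering bounds are optional.
        rows = self._partitions.get(partition_key, [])
        keys = [ck for ck, _ in rows]
        lo = 0 if clustering_from is None else bisect_left(keys, clustering_from)
        hi = len(rows) if clustering_to is None else bisect_right(keys, clustering_to)
        return rows[lo:hi]


store = PartitionMap()
store.put("device-1", 3, "c")
store.put("device-1", 1, "a")
store.put("device-1", 2, "b")
print(store.get("device-1"))        # whole partition, sorted by clustering key
print(store.get("device-1", 2, 3))  # narrowed by a clustering key range
```

Note that `get` has no way to search by clustering key alone: without the partition key there is nothing to look up in the outer map.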


Federico Navarrete

In brief:

Partition Key is nothing but identification for a row. Most of the time that identification is a single column (called Primary Key), and sometimes a combination of multiple columns (called Composite Partition Key).

Cluster key is nothing but indexing and sorting. The choice of cluster keys depends on a few things:

Which columns you use in the where clause besides the primary key columns.

If you have a very large number of records, how you want to divide the data for easy management. For example, I have 1 million county population records. So, for easy management, I cluster the data by state, then by pincode, and so on.


Partition Key is NOT an identification for A row... it identifies a bunch of rows all of which have the same partition key
@wmac, if there is no clustering key and there is just one partition key, then isn't the partition key the identification of the row?
kboom

It is worth noting that you will probably use these a lot more than the similar concept in the relational world (composite keys).

Example: suppose you have to find the last N users who recently joined user group X. How would you do this efficiently, given that reads are predominant in this case? Like this (from the official Cassandra guide):

CREATE TABLE group_join_dates (
    groupname text,
    joined timeuuid,
    join_date text,
    username text,
    email text,
    age int,
    PRIMARY KEY ((groupname, join_date), joined)
) WITH CLUSTERING ORDER BY (joined DESC);

Here, the partition key is itself composite and the clustering key is the join time (joined). The reason for clustering on the join time is that results are stored already sorted, which makes lookups fast. But why do we use a composite key for the partition key? Because we always want to read as few partitions as possible. How does putting join_date in there help? Users from the same group with the same join date now reside in a single partition. This means we always read as few partitions as possible (start with the newest, then move to older ones, rather than jumping between them).

In fact, in extreme cases you would also need to use a hash of the join_date rather than the join_date alone, so that if you often query for the last 3 days, those days share the same hash and are therefore available from the same partition.
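The read pattern described above (newest partition first, then older ones) can be sketched in Python. This is a toy model under stated assumptions: `last_joined` is a hypothetical helper, `partitions` is a plain dict standing in for the table, join_date is assumed to be stored as an ISO date string, and each list is assumed to be already ordered newest first, as the clustering order would guarantee.

```python
from datetime import date, timedelta


def last_joined(partitions, groupname, today, n, max_days=30):
    """Collect the n most recent joiners by reading one (groupname, join_date) partition per day."""
    found = []
    for offset in range(max_days):
        day = (today - timedelta(days=offset)).isoformat()
        # Each lookup stands in for a single-partition query such as:
        #   SELECT username FROM group_join_dates
        #   WHERE groupname = ? AND join_date = ? LIMIT ?
        rows = partitions.get((groupname, day), [])
        found.extend(rows[: n - len(found)])
        if len(found) >= n:
            break  # stop as soon as enough rows were read
    return found


# (groupname, join_date) -> usernames, newest first within each day
partitions = {
    ("devs", "2024-03-03"): ["carol"],
    ("devs", "2024-03-02"): ["bob", "alice"],
    ("devs", "2024-03-01"): ["zed"],
    ("ops", "2024-03-03"): ["olga"],
}
print(last_joined(partitions, "devs", date(2024, 3, 3), 2))
```

With N = 2, only the two newest partitions of the group are read; older days and other groups are never touched.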


Ejaz Ahmed

Disclaimer: This answer is specific to DynamoDB; however, the concepts apply to Cassandra as well, since both are NoSQL databases.

When you create a table, in addition to the table name, you must specify the primary key of the table. The primary key uniquely identifies each item in the table, so that no two items can have the same key.

DynamoDB supports two different kinds of primary keys:

Partition key – A simple primary key, composed of one attribute known as the partition key.

DynamoDB uses the partition key's value as input to an internal hash function. The output from the hash function determines the partition (physical storage internal to DynamoDB) in which the item will be stored.

In a table that has only a partition key, no two items can have the same partition key value.

Partition key and sort key – Referred to as a composite primary key, this type of key is composed of two attributes. The first attribute is the partition key, and the second attribute is the sort key.

DynamoDB uses the partition key value as input to an internal hash function. The output from the hash function determines the partition (physical storage internal to DynamoDB) in which the item will be stored. All items with the same partition key value are stored together, in sorted order by sort key value.

In a table that has a partition key and a sort key, it's possible for two items to have the same partition key value. However, those two items must have different sort key values.

A composite primary key gives you additional flexibility when querying data. For example, if you provide only the value for Artist, DynamoDB retrieves all of the songs by that artist. To retrieve only a subset of songs by a particular artist, you can provide a value for Artist along with a range of values for SongTitle.

Note: The partition key of an item is also known as its hash attribute. The term hash attribute derives from the use of an internal hash function in DynamoDB that evenly distributes data items across partitions, based on their partition key values.

The sort key of an item is also known as its range attribute. The term range attribute derives from the way DynamoDB stores items with the same partition key physically close together, in sorted order by the sort key value.

Reference - https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.CoreComponents.html#HowItWorks.CoreComponents.PrimaryKey