
Many resources on the internet contain conflicting information regarding read/write logic for RAID chunks.

This answer contains the following (seemingly conflicting) pieces of information:

A 512 KB chunk size doesn't require the system to write e.g. 512 KB for every 4 KB write or to read 512 KB of device surface for a 4 KB application read.

[When writing a 16-KiB block to a RAID with a 64-KiB chunk size] the RAID will perform a read/modify/write operation when writing that 4-KiB file/16-KiB block because the RAID's smallest unit of storage is 64-KiB.

On the other hand, this resource contains the following piece of information:

For example, if you have a 10 KB text file and the chunk size is 256 KB, then that 10 KB of data is stored in a 256 KB block, with the rest of the block left empty. Conversely, with 16 KB chunks, there is much less wasted space when storing that 10 KB file.

In particular, I have the following questions:

  1. When reading/writing some unit of data smaller than the RAID chunk size using a scheme without parity, does this require a read/modify/write operation for the entire chunk, or only the part of the chunk that is modified?
  2. When using a RAID scheme with parity, does this change anything in the answer to question 1?
  3. As alluded to in the second reference, does writing a unit of data smaller than the RAID chunk somehow leave the rest of the RAID chunk empty? This seems incorrect to me, but I wanted to clarify as this resource quite unambiguously states this.
  4. Do any of these answers change depending on the RAID implementation (Linux kernel, hardware RAID, etc.)?

If possible, providing some sort of authoritative reference (some RAID specification, source code, etc.) would be awesome.

Thanks in advance!

  • Please edit your question to clarify whether you're asking about mirrored/striped RAID configurations (RAID0, RAID1, or RAID10), or about checksummed RAID configurations (RAID4, RAID5, RAID6). The presence/absence of checksum blocks will very likely affect how much of the chunk is written/rewritten on the physical disks.
    – Sotto Voce
    Commented Jul 9 at 1:09
  • I've updated my question to clarify the difference between items 1. (without parity) and 2. (with parity).
    – quixotrykd
    Commented Jul 9 at 2:16
  • @PhilipCouling I didn't get the impression that the OP knew the difference and wanted answers for both cases. The question was worded as if all RAID types behave the same way with respect to slicing/chunking of reads and writes. I thought the OP had a specific use case in mind (although it was not stated in the question) and desired the answer for that use case.
    – Sotto Voce
    Commented Jul 9 at 3:41
  • @PhilipCouling I don't agree that it's the right thing to explain all possible use cases in the answer to a vague question. I favor asking for clarification and then giving an answer relevant to the clear use case. It looks like you don't agree with me on that point. That's okay, you asked me about it and I answered.
    – Sotto Voce
    Commented Jul 9 at 7:10
  • @SottoVoce I wasn't suggesting enumerating all possible use cases, and this question is not vague. I was saying that if you know a specific detail makes a difference and there are a tiny handful of options, it's better to explain that in an answer than to have the user guess what they need. Since there's now an answer pointing out that the RAID type does not affect the answer, it rather proves my point that the question was not at all vague and did not need this information added. Commented Jul 9 at 11:26

1 Answer


The chunk size primarily determines how data is distributed across devices (i.e. which physical device, and which offset on it, to look at if you want to read byte 1234567890 of your RAID device).
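
As a rough illustration of that mapping (a simplified, RAID 0 style calculation assuming 512 KiB chunks and 6 data devices, ignoring parity rotation and the per-device data offset, so the numbers are only illustrative):

# byte=1234567890 chunk=$((512 * 1024)) ndata=6
# echo "chunk number:  $((byte / chunk))"
chunk number:  2354
# echo "device index:  $((byte / chunk % ndata))"
device index:  2
# echo "device offset: $((byte / chunk / ndata * chunk + byte % chunk))"
device offset: 205914834

Real mdadm layouts (e.g. the default left-symmetric RAID 5/6 layouts) also rotate parity across the devices, so the actual device index differs, but the principle is the same.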

It does not directly affect the RAID algorithm itself, which in the case of RAID 5 is a simple bitwise XOR. Mathematically this operates on individual bits and as such does not depend on bytes, sectors or chunks. RAID 6 is a bit more involved, but still similar enough.
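
For example (a one-byte sketch, not tied to any particular implementation): with one data byte 0x54 ('T') and five data bytes of zero, the XOR parity is 0x54 itself, and updating a single data byte only needs the old data byte and the old parity byte:

# printf '%02x\n' $(( 0x54 ^ 0x00 ^ 0x00 ^ 0x00 ^ 0x00 ^ 0x00 ))   # parity of 'T' and five zero bytes
54
# printf '%02x\n' $(( 0x54 ^ 0x54 ^ 0x41 ))   # new parity = old parity ^ old data ^ new data ('T' -> 'A')
41

This is why a small write only needs a read/modify/write of the affected data and parity sectors, never of the whole chunk or stripe.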

As such there is no need to process the entire chunk.

For Linux mdadm RAID, you can try to verify it experimentally:


Create some virtual drives (using sparse files):

# truncate -s 1G {1..8}.img
# for img in {1..8}.img; do losetup --find --show "$img"; done
/dev/loop{1..8}

Put a mdadm RAID 6 on top (using --assume-clean so it writes nothing, other than metadata):

# mdadm --create --assume-clean --level=6 --raid-devices=8 --data-offset=2048 /dev/md42 /dev/loop{1..8}
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md42 started.
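
Since the whole question is about chunk size, it is worth confirming what chunk size the array actually uses; mdadm defaults to a 512 KiB chunk when none is given (the exact formatting of the output may differ between versions):

# mdadm --detail /dev/md42 | grep -i chunk
     Chunk Size : 512K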

Small write to a random offset:

# blockdev --getsize64 /dev/md42
6429868032
# echo $((SRANDOM % 6429868032))
2931013558
# echo -n TEST | dd bs=1 seek=2931013558 of=/dev/md42
# sync
# echo 3 > /proc/sys/vm/drop_caches

Result:

# filefrag {1..8}.img
1.img: 1 extent found
2.img: 1 extent found
3.img: 2 extents found
4.img: 1 extent found
5.img: 2 extents found
6.img: 2 extents found
7.img: 1 extent found
8.img: 1 extent found

All images have at least 1 extent (the mdadm metadata), which you can ignore. Only 3 images have a second extent (one data chunk plus the two parities P and Q), so only these three have been written to.

# filefrag -v -e 3.img 5.img 6.img
File size of 3.img is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        1..       1:     301057..    301057:      1:          1: merged
   1:   119739..  119739:     318395..    318395:      1:     420795: last,merged
3.img: 2 extents found

As you can see, the data extent is a single 4096-byte block, not an entire chunk.
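
As a sanity check (simple arithmetic on the filefrag output above), logical block 119739 of the backing file corresponds to byte offset:

# printf '0x%x\n' $(( 119739 * 4096 ))
0x1d3bb000

which is exactly the 4 KiB block containing the 0x1d3bb7b0 offsets where the TEST bytes show up in the hexdumps below.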

Raw data looks like this:

# hexdump -C 3.img
*
1d3bb7b0  00 00 00 00 00 00 54 45  53 54 00 00 00 00 00 00  |......TEST......|
1d3bb7c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
# hexdump -C 5.img
*
1d3bb7b0  00 00 00 00 00 00 54 45  53 54 00 00 00 00 00 00  |......TEST......|
1d3bb7c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
# hexdump -C 6.img
*
1d3bb7b0  00 00 00 00 00 00 29 24  59 29 00 00 00 00 00 00  |......)$Y)......|
1d3bb7c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*

Because all other data in the stripe is zero, the RAID 5 style XOR (P) parity of TEST is still TEST. The RAID 6 (Q) parity turned TEST into )$Y).
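
You can even reproduce the Q bytes by hand. Linux computes the RAID 6 Q syndrome by multiplying the byte of the i-th data slot by g^i, where g = 2 is the generator of GF(2^8) with polynomial 0x11d (see H. Peter Anvin's "The mathematics of RAID-6" paper for an authoritative reference). Working backwards from the output, the TEST chunk apparently sits on data slot 4 of its stripe, so multiplying each byte by g four times should reproduce what 6.img contains (a quick bash sketch):

# for b in 0x54 0x45 0x53 0x54; do for i in 1 2 3 4; do b=$(( b << 1 )); (( b & 0x100 )) && b=$(( b ^ 0x11d )); done; printf '%02x ' "$b"; done; echo
29 24 59 29

which is exactly )$Y) in ASCII.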


You can extend this experiment by filling the array with random data, then writing only {4, 16, 512, 4096, 16384} bytes of zeroes at/around the targeted offset of only one or several of these 3 devices, and then repeating the experiment.

This way you can determine that mdadm does not operate at single-byte resolution (but it still does not go beyond sectors, let alone entire chunks).
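
Another way to observe the actual write granularity, independent of file contents, is to compare the per-device I/O counters in /proc/diskstats before and after a small write (a rough sketch, assuming 3.img, 5.img and 6.img really are backed by loop3, loop5 and loop6 as above; the sectors-written column should grow by a handful of sectors, nowhere near a full chunk):

# grep -E ' loop[356] ' /proc/diskstats    # the 10th column is sectors written
# echo -n TEST | dd bs=1 seek=2931013558 of=/dev/md42
# sync
# grep -E ' loop[356] ' /proc/diskstats    # compare with the first snapshot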

You might also notice that it writes wrong parity if the parity was already wrong before the write: the parity is updated from the old parity (old parity XOR old data XOR new data) instead of being recalculated from the data of the whole stripe.

  • This possibly misses one subtle point on RAID[4,5,6]. The block/stripe size can affect the quantity of parity data being written. Using the term "block" to mean data on a single disk and "stripe" to mean the collection of blocks across all disks: if the data being written is no more than one block, then double the data will be written in RAID[4,5], once for the data and once for the parity. But if more than one block's worth of data is written within the same stripe, then proportionally less parity is written. E.g. for RAID 5 this could be only 50% parity for 3 disks, or 25% for 5 disks. Commented Jul 10 at 6:26
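
To make those ratios concrete (a small illustrative calculation, not part of the comment above): a full-stripe write on an n-disk RAID 5 writes n-1 data blocks plus 1 parity block, so the parity overhead is 1/(n-1):

# for disks in 3 5; do echo "RAID 5 with $disks disks: $((disks - 1)) data + 1 parity block per full stripe ($((100 / (disks - 1)))% parity overhead)"; done
RAID 5 with 3 disks: 2 data + 1 parity block per full stripe (50% parity overhead)
RAID 5 with 5 disks: 4 data + 1 parity block per full stripe (25% parity overhead)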

