
OSY 1.0 Thread Viewer

Thread #: 1054

Pizza: help, back to caching questions

Socrates

Thu Oct 11 09:46:42 2001

OK, Pizza -- you who handed me my ass on caching, and on the OS's effect on secondary CPU cache.

I now have another area of caching in mind, and I'm wondering how this works in 2K, Linux, whatever.

In short, the argument goes like this:

The 8 MB cache on the WD 1000BB is supposed to give this drive a great boost in performance over any other ATA drive; in fact, it supposedly makes it even faster than X15 SCSI drives, since WD's algorithms are optimized to take advantage of the 8 MB of cache on the new drive.

I have no clue how the OS and the algorithms in the drive's firmware interact, or how the firmware uses the drive's cache.  Could someone explain what happens?

Thanks in advance, Pizza.

Socrates

vc

Thu Oct 11 12:21:11 2001

I seem to remember that the big 181 GB mother of a drive had a 16 MB cache; I don't think it was IDE, though.

DrPizza

Thu Oct 11 14:34:02 2001

Well, I'm not sure, but I can explain a few things.

Caches can be used for two things: speeding up reads, and speeding up writes.

Speeding up writes is fairly self-explanatory -- the cache operates faster than the drive mechanism, so it can be written to at the full speed of the interface.  This means that short writes complete at the speed of the interface, not of the drive mechanism; the drive then commits them to the platters at its leisure.  (This assumes a write-back cache; often, especially in RAID arrays, you switch the cache to write-through behaviour, where data is written to the disk straight away, to ensure data integrity.)  So for writes (and it shouldn't really matter whether they're sequential or random), a larger cache means a faster disk.
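
As a rough sketch (toy C I'm adding purely as an illustration -- nothing like any drive's real firmware), the difference between the two behaviours looks like this:

    /* Toy model of a drive's write cache: write-back acknowledges the host as
       soon as the data is in cache; write-through waits for the mechanism. */
    #include <stdio.h>
    #include <string.h>

    #define CACHE_SIZE (8u * 1024 * 1024)   /* pretend 8 Mbyte on-drive cache */

    static char cache[CACHE_SIZE];
    static size_t cache_used = 0;

    /* The slow path: in real life this is seek + rotational latency + write. */
    static void write_to_platters(size_t len)
    {
        printf("  ...mechanism writes %lu bytes (milliseconds)\n", (unsigned long)len);
    }

    /* Write-back: the host sees completion at interface speed. */
    static int write_back(const char *data, size_t len)
    {
        if (cache_used + len > CACHE_SIZE)
            return -1;                      /* cache full: the host has to wait */
        memcpy(cache + cache_used, data, len);
        cache_used += len;
        printf("write-back: %lu bytes acknowledged immediately\n", (unsigned long)len);
        return 0;                           /* the platter write happens later, at leisure */
    }

    /* Write-through: the host waits for the mechanism every time. */
    static void write_through(const char *data, size_t len)
    {
        (void)data;
        write_to_platters(len);
        printf("write-through: %lu bytes acknowledged only after the mechanism\n", (unsigned long)len);
    }

    int main(void)
    {
        char block[4096] = {0};
        write_back(block, sizeof block);
        write_through(block, sizeof block);
        return 0;
    }

A real drive obviously does much more (reordering, flushing on idle, honouring cache-flush commands), but that's the shape of it.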

This is why RAID controllers often have large caches: they can cover up RAID 5's slower write times, because writes go to the (normally battery-backed) controller cache rather than to the drives (which have their own caches set to write-through).  RAID 5 writes are slower because changing even a single byte requires updating a minimum of two spindles -- the one holding the data, and the one holding the parity -- which lengthens average seek times.

With reads, the situation is a bit more complicated.  If there were no other factors coming into play, the read performance would be limited by the drive mechanism.  However, there are a few things that make this no longer the case.

All transfers have to go through the cache (both reads and writes) -- it's also used as a buffer joining the high-speed SCSI bus to the lower-speed internal bus.  If the cache is empty, no read can occur; if the cache is full, no write can occur.

Drive controllers can't burst unlimited amounts of information over their bus.  Each transfer has a fixed upper limit (and it's not that big, of the order of a few hundred kilobytes).  In between these transfers, the drive is inactive.

Further, programs don't often read large sequential blocks of data as fast as possible.  They'll tend to read a bit of a file, then process that, then read a bit more, then process it, then a bit more, and so on.

So on-drive caches can be used to take advantage of these things.

Making the cache bigger reduces the situations where a read or write can't take place because the cache is empty or full (respectively).

When reading, rather than only reading enough to satisfy a single transfer, the drive can read until its buffer is filled.  If it then gets another request for data that's already in the cache, it can service it from the cache -- so you get better read performance.  There are various algorithms for calculating how best to read ahead.  The OS does a lot of work in this area too, and will direct the drive to read bits of a file that the user hasn't yet requested, if it thinks there's a good chance that the user /will/ request them.
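
On the Linux side, for what it's worth, the usual way for an application to feed that machinery a hint is posix_fadvise().  A minimal sketch (the path is made up, and the call is purely advisory):

    /* Sketch: hinting the kernel's read-ahead on Linux/POSIX systems. */
    #define _POSIX_C_SOURCE 200112L
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(void)
    {
        int fd = open("/data/bigfile.dat", O_RDONLY);   /* hypothetical file */
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* "I'm going to read this front to back": encourages aggressive read-ahead. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

        /* "I'll want the first 2 Mbytes soon": the kernel may start fetching now. */
        posix_fadvise(fd, 0, 2 * 1024 * 1024, POSIX_FADV_WILLNEED);

        /* ...subsequent read() calls have a better chance of hitting the OS cache... */

        close(fd);
        return 0;
    }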

Another way the cache can be optimized is by partitioning it.  If you have, say, a 4 Mbyte cache, and you're doing long sequential reads, you might prefer to have two ~2 Mbyte segments.  Because each transfer must be serviced by the cache -- and by a single segment -- larger segments permit larger single transfer operations (you can transfer ~2 Mbytes at a time; if the cache were split into 4 x ~1 Mbyte, you'd only be able to transfer ~1 Mbyte at a time).  This improves your data:SCSI-arbitration ratio.

On the other hand, if you were doing lots of random access, with smaller reads, you might prefer more segments.  Each segment contains an amount of "context" information (describing where it's located on the disk), and more segments mean that you can cache data from more areas on the disk (albeit in smaller amounts).
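
Back-of-the-envelope, the trade-off looks like this (just illustrative arithmetic for the 4 Mbyte example above, not any drive's actual segmentation):

    /* More segments = more distinct disk regions cached, but each segment is
       smaller, so the largest transfer a single segment can service shrinks. */
    #include <stdio.h>

    int main(void)
    {
        const unsigned cache_kb = 4 * 1024;              /* the 4 Mbyte cache above */
        const unsigned counts[] = { 2, 4, 8, 16 };
        unsigned i;

        for (i = 0; i < sizeof counts / sizeof counts[0]; i++)
            printf("%2u segments: %2u disk regions cached, "
                   "largest single cached transfer ~%4u Kbytes\n",
                   counts[i], counts[i], cache_kb / counts[i]);
        return 0;
    }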

So WD presumably tune this kind of thing depending on how you use the disk.  They have smart read-ahead algorithms (you read ahead differently if the file is randomly accessed than if it's sequentially accessed), and they have smart partitioning mechanisms.

The OS does a level of caching above this, and makes similar decisions.  Some OSes (e.g. Windows) let you hint how you're going to use the file when you open it (you can specify whether to optimize the caching for sequential transfers or for random access), which lets them pick a caching mechanism appropriate to what you're doing.  The OS has a larger cache to play with (it uses system memory), but does much the same thing.
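
On Windows the hint is given when the file is opened.  A sketch (file names made up, error handling trimmed):

    /* Sketch: telling the Windows cache manager how a file will be accessed. */
    #include <windows.h>

    int main(void)
    {
        /* Sequential hint: the cache manager reads ahead aggressively and
           discards already-read pages sooner. */
        HANDLE hSeq = CreateFileA("C:\\capture\\video.avi", GENERIC_READ,
                                  FILE_SHARE_READ, NULL, OPEN_EXISTING,
                                  FILE_FLAG_SEQUENTIAL_SCAN, NULL);

        /* Random hint: read-ahead is scaled back, since prefetching is
           likely to be wasted. */
        HANDLE hRnd = CreateFileA("C:\\db\\records.dat", GENERIC_READ,
                                  FILE_SHARE_READ, NULL, OPEN_EXISTING,
                                  FILE_FLAG_RANDOM_ACCESS, NULL);

        if (hSeq != INVALID_HANDLE_VALUE) CloseHandle(hSeq);
        if (hRnd != INVALID_HANDLE_VALUE) CloseHandle(hRnd);
        return 0;
    }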

Socrates

Thu Oct 11 20:56:04 2001

Ok:  Thanks, that really helped.

How do caching algorithms and firmware work with different OSes?
Could you write a firmware package for, let's say, XP, that would optimize the disk cache for that OS, say for average usage by a workstation?

If you did this, would you sacrifice performance in *nix-based OSes?

How do the different firmware packages interact with the different OSes?  When Seagate makes a Cheetah, what determines
which OS they optimize the firmware for?

*nix, 2K, XP, etc.?

Can they do this, or is it beyond the capabilities of the firmware to optimize for a certain OS?

In other words, if you look at the seek patterns of the different OSes, could you optimize the firmware to read ahead in a pattern that's consistent with 2K, and how would that differ from Linux?

Thanks again, for all your help,

Socrates

Socrates

Thu Oct 11 23:28:37 2001

Another question.  This WD is probably using the 8 MB cache to store operating functions.  Couldn't Windows do the same thing, by increasing the LargeCacheFile setting from 4 MB to something like 16 MB?

Why haven't they, in these days of cheap RAM?

Socrates

DrPizza

Fri Oct 12 00:52:04 2001

from Socrates posted at 9:56 pm on Oct. 11, 2001

Ok:  Thanks, that really helped.

How do caching algorithms and firmware work with different OSes?
Could you write a firmware package for, let's say, XP, that would optimize the disk cache for that OS, say for average usage by a workstation?

If you did this, would you sacrifice performance in *nix-based OSes?

How do the different firmware packages interact with the different OSes?  When Seagate makes a Cheetah, what determines which OS they optimize the firmware for?


They don't.  The access pattern is determined by the applications you're using -- databases requiring random access, or video editing requiring sequential access, or whatever -- not the OS.

Some drives (e.g. some IBM drives) let you specify how the cache is carved up -- so you can pick, say, 4 x ~1 Mbyte or 2 x ~2 Mbyte or whatever -- depending on what you're going to use the drive for.

But in principle, the drive should work it out for itself, and will have some clever heuristic for doing so.  Or, it'll just pick a good set of defaults and stick with them.

*nix, 2K, XP, etc.?

Can they do this, or is it beyond the capabilities of the firmware to optimize for a certain OS?


If there were a significant point to it, it'd be possible -- but really, there isn't.  There is a point to optimizing for particular usage patterns, and some (higher end) drives will let you do this.

In other words, if you look at the seek patterns of the different OSes, could you optimize the firmware to read ahead in a pattern that's consistent with 2K, and how would that differ from Linux?

I think that there are a number of problems with this.  First, a given OS will have multiple algorithms (Win2K certainly does) and will switch between them (depending on whether it thinks random or sequential access is the better algorithm).  I'd imagine Linux and so on would be similar.  Second, these algorithms are prone to change over time.  Third, the type of behaviour you see will be a feature of the applications you're running, not the OS.

Another question.  This WD is probably using the 8 MB cache to store operating functions.  Couldn't Windows do the same thing, by increasing the LargeCacheFile setting from 4 MB to something like 16 MB?

Why haven't they, in these days of cheap RAM?


(1) Windows does just that.

(2) Windows (the NT family, at least) will use as much RAM as you have for its cache.  Currently, my cache (it's not solely for files, but mostly) is sitting at ~300 of 640 Mbytes.

Windows 95 had a problem where it could dynamically grow its cache during a session but couldn't then shrink it again when it needed to; I believe this was resolved in 98.  Windows 95 had another problem in that its file cache resided in a portion of virtual memory that was fixed in size (at 1 Gbyte) and which had to contain many other things besides the file cache.  Because of the former, it became common to restrict the size of the cache, to stop it growing uncontrollably.  Because of the latter, it is occasionally necessary to restrict the size of the cache to prevent out-of-memory errors on machines with a lot of memory (and, importantly, a lot of memory on their video cards).

Socrates

Fri Oct 12 01:16:45 2001

OK:
If 2K is currently holding 12700 K of physical memory for system cache (per Task Manager), then what is being stored in the buffer that would have such a large impact on performance?

For a workstation, you don't have much but OS functions and the occasional pagefile hit.

My workstation system just sequentially loads the files of the programs I use, and then, unless a program calls for pagefile memory, everything just sits there.
Right now my memory usage history looks like I have a dead computer, with almost no CPU usage, 1-2%.

So these IPEAK tests and such: maybe they call for stuff that is likely to be stored in the buffer, but what would that be, if the OS is pretty much in RAM?

Perhaps they have segmented the buffer into many very small partitions, ensuring that any system call will be either in the buffer or in the cache?

Don't know..

Socrates

seeking wisdom

DrPizza

Fri Oct 12 01:54:12 2001

from Socrates posted at 2:16 am on Oct. 12, 2001

OK:
If 2K is currently holding 12700 K of physical memory for system cache (per Task Manager),

Oh, I forgot to mention: by default, NT/2K/etc. won't create a vast cache -- there's no point for normal usage.  There's a registry setting to switch it to large-cache mode; one of the networking settings also configures this behaviour (when you switch between "save memory", or whatever it's called, and "optimize for file serving").
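
If you want the specifics, the switch lives (as far as I know) under HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management as the LargeSystemCache value.  A quick, purely illustrative sketch of reading it:

    /* Sketch: reading the NT/2K "large system cache" switch from the registry.
       0 = workstation-style cache, 1 = large/fileserver-style cache
       (value name and location to the best of my knowledge).
       Link with advapi32.lib. */
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        HKEY hKey;
        DWORD value = 0, size = sizeof value;

        if (RegOpenKeyExA(HKEY_LOCAL_MACHINE,
                          "SYSTEM\\CurrentControlSet\\Control\\Session Manager\\Memory Management",
                          0, KEY_READ, &hKey) == ERROR_SUCCESS) {
            if (RegQueryValueExA(hKey, "LargeSystemCache", NULL, NULL,
                                 (LPBYTE)&value, &size) == ERROR_SUCCESS)
                printf("LargeSystemCache = %lu\n", (unsigned long)value);
            RegCloseKey(hKey);
        }
        return 0;
    }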

then what is being stored in the buffer that would have such a large impact on performance?

The disk buffer?  Dunno.  Probably whatever you last read or wrote to the disk.  The system cache?  Probably the same (only more of it).

But in that situation, it won't have a huge impact on performance.  Only when doing something with the disk.

For a workstation, you don't have much but OS functions and the occasional pagefile hit.

My workstation system just sequentially loads the files of the programs I use, and then, unless a program calls for pagefile memory, everything just sits there.
Right now my memory usage history looks like I have a dead computer, with almost no CPU usage, 1-2%.


Yeah, it will do.

So these IPEAK tests and such: maybe they call for stuff that is likely to be stored in the buffer, but what would that be, if the OS is pretty much in RAM?

I'm not sure what these tests you speak of are.  If you're testing OS-cached access to a file, the performance will be as good as main memory (or even processor cache).  If you're testing disk-cached access to a file, the performance will be as good as the disk interface (as the cache should be able to saturate the interface).  If you're testing uncached access (not cached by the OS and not currently in the disk cache), the performance will be as good as the disk mechanism itself.

The disk cache will contain the last read/written data.
The OS cache will contain the files that were last read/written; it's just that it can store more of them.

Perhaps they have segmented the buffer into many very small partitions, ensuring that any system call will be either in the buffer or in the cache?

Well, the average disk transfer is only of the order of a few tens of kilobytes.  But you probably don't want to use tiny segments of cache; they'll really hurt your performance for larger transfers, and they won't give any appreciable gain for smaller ones.

The details of the disk buffer aren't [generally] controllable by the OS -- it's possible, I think, that one could write some way of controlling the disk buffer through software, but its configuration should be considered static.

If your system isn't hitting the disk that often -- that is, heavy I/O is limited to, say, OS boots, and program loads -- then the disk speed isn't a critical factor in the performance; it just doesn't use the disk all that much (particularly not if you rarely reboot and rarely quit applications).  A fast disk will speed up those I/O-limited tasks -- but they're sufficiently rare that it's probably not worth the money.

If, however, you do lots of disk I/O (for instance, using databases, or video files, or are extremely RAM limited so do lots of paging) then speeding the disk is more valuable, as it becomes a serious issue.

Socrates

Fri Oct 12 01:56:48 2001

Pizza, I love you.  Your comments helped me figure out, I think, why this WD drive does so well in the StorageReview tests.
The testbed they use has only 128 MB of RAM, and runs 2K.
My comments:
"RAM: 128MB PC100 SRAM DIMM
Our decision to implement application-level benchmarks drove us to briefly flirt with 256 megs of RAM for the testbed. We faced a tough decision. On the one hand, we want the testbed to be fully representative of a typical reader's system over the next two years. On the other, with today's high-level benchmarks, 256 megs of RAM would drown out a significant amount of the hard disk's contribution towards performance, perhaps excessively so. We eventually decided on just 128 megs. "
(The above is StorageReview's explanation for the testbed using 128 MB of RAM.)
I've long been perplexed by the WD numbers.  If the machine were running 2K and had sufficient RAM, the system would take as much RAM as it needs for system cache, and then what would the OS need the drive's buffer for?
My current system uses 128288 K of physical memory for cache, so, from watching Task Manager, this system almost never goes outside the cached system memory.
The memory history looks like a dead person's heartbeat line.

But with just 128 MB of RAM, the system can't take all the physical memory it wants.  So perhaps the high scores of the WD are due to the buffer caching system functions that, on machines with larger RAM allocations, would be in the system's physical memory cache.

Perhaps the buffer picks up either kernel functions or OS functions that, on a system with sufficient RAM, would be in the system cache.

So rather than the drive being all that fast, it's just that the test machine is really susceptible to being influenced by the drive's cache size and excellent algorithms.

Perhaps the selection of 128 MB of RAM makes the machine highly susceptible to drive-cache influences, since many of the OS or kernel functions are now in the buffer.

Perhaps a better reflection of the performance of the drive would be a system with 64 MB of RAM.

Since 32 MB would be about the total size of the kernel on 2K, perhaps that would be the best size to test with.

From these observations, I strongly suspect that if you take the WD drive and put it in a system with 384-512 MB of RAM, the caching functions that make for such a huge increase in performance under IPEAK and WinBench would be superseded by the system using physical RAM for its cache.
Is this what you are saying, Davin and Eugene, with the following quote:
"On the other, with today's high-level benchmarks, 256 megs of RAM would drown out a significant amount of the hard disk's contribution towards performance, perhaps excessively so. We eventually decided on just 128 megs. "

Pizza: Am I close?

Socrates

DrPizza

Fri Oct 12 03:38:23 2001

I think so.

On a machine with little RAM, it becomes more likely that the OS won't have cached something it needs, so it will have to go to the drive -- which might (a) already be caching what it needs, and (b) have higher burst reads thanks to the cache.

But with ample RAM, it's more likely that any read will be serviced from a file already in RAM.  Burst performance will still be improved by the drive's cache, but the random access that crops up from normal usage (which is dominated by seek times, not transfer rates) won't receive the same kind of advantage (comparing large disk cache to small disk cache), because it just won't have to hit the disk so often.  By reducing this kind of I/O -- seek-intensive, with small transfers -- you'll generally get far more performance than by making bursts much faster.

If you're benching real-world performance of the drive, you'll tend to get more differentiation between drives if you use a machine with relatively little memory.  It will page to disk more (which means more disk I/O), and it will cache less, so the faster disk wins doubly.  Paging is the kind of thing where the cache will probably help more, as it tends to be somewhat sequential.

(Edited by DrPizza at 4:44 am on Oct. 12, 2001)

Socrates

Fri Oct 12 03:56:19 2001

Pizza:  Thanks.  You are awesome.

Sincerely,

Socrates

Socrates

Fri Oct 12 04:17:00 2001

How could you configure 2K's cache, pagefile, and memory to test the drives the hardest?

Use 32 MB of RAM and set the pagefile high, encouraging the system to page?  That would put a premium on cache size and function?

Still, I don't see a perfect solution.

Any ideas?

Socrates

DrPizza

Fri Oct 12 06:07:43 2001

Depends what you want to test.

If it's theoretical performance, there's no problem -- you can tell Windows not to use its own buffering at all, so it gets bottlenecked solely on the drive/interface.  I would imagine that any competent drive performance benchtest will do this.
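
For what it's worth, the Windows mechanism for that is opening the file unbuffered.  A rough sketch of what a benchmark might do (path and sizes are made up; with no buffering, reads must be sector-aligned):

    /* Sketch: bypassing the Windows file cache so every read hits the drive. */
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        HANDLE h = CreateFileA("D:\\bench\\testfile.bin", GENERIC_READ,
                               0, NULL, OPEN_EXISTING,
                               FILE_FLAG_NO_BUFFERING, NULL);
        if (h == INVALID_HANDLE_VALUE) {
            printf("open failed: %lu\n", (unsigned long)GetLastError());
            return 1;
        }

        /* The buffer address and transfer size must both be multiples of the
           sector size; VirtualAlloc returns page-aligned memory, which is enough. */
        DWORD  size = 64 * 1024;
        void  *buf  = VirtualAlloc(NULL, size, MEM_COMMIT, PAGE_READWRITE);
        DWORD  got  = 0;

        if (buf && ReadFile(h, buf, size, &got, NULL))
            printf("read %lu bytes straight off the drive/interface\n", (unsigned long)got);

        if (buf) VirtualFree(buf, 0, MEM_RELEASE);
        CloseHandle(h);
        return 0;
    }

A synthetic benchmark does essentially this so that the OS cache doesn't mask the drive.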

If it's real-world performance, well... if you want to test true real-world performance, you have to give the machine a reasonable amount of RAM and so on (in which case many requests will be serviced from RAM anyway).  The situation is such that realistic real-world tests won't highlight the drive's performance, except for very specific tests (e.g. boot speed, hibernate speed, video capture performance, etc.) that are really I/O intensive.  If you really cripple the machine you'll stress the drive more, but it's not real world because most machines now have an adequate amount of memory.

I would be inclined to do realistic real-world tests (even if it won't stress the drives that much), and then synthetic benchmarks (which, if properly written, will take measures to ensure that OS caching is turned off for their test dataset).

Socrates

Fri Oct 12 07:12:16 2001

StorageReview: the worst of both worlds.

They have managed to pick a RAM configuration that is hypersensitive to a particular drive characteristic: buffer size.

It makes the tests worthless as drive tests, since they skew in favor of cache rather than drive performance; and in a real-world situation, there would be enough RAM to diminish the impact of the drive's cache.  Weird.

I guess you should be designing their testbeds, because they think they're making great choices.

I really have to take the time to thank you for the effort you have put into educating me.  Considering our prior conflicts on Ars, I'm not sure that, if I were in your position, I would be so helpful.  I'm amazed and thankful that we can have these dialogues, and I really value your knowledge and expertise.

Thanks again

Socrates