This repository was archived by the owner on Jan 13, 2022. It is now read-only.

Commit da61639 (1 parent: 5d28979)

    Skip sequential IO (optional, defaults to off, sysctl'able).

9 files changed: +274 −21 lines

doc/flashcache-doc.txt (+18 −2)

@@ -40,8 +40,9 @@ block. Note that a sequential range of disk blocks will all map onto a
 given set.
 
 The DM layer breaks up all IOs into blocksize chunks before passing
-the IOs down to the cache layer. Flashcache caches all full blocksize
-IOs.
+the IOs down to the cache layer. By default, flashcache caches all
+full blocksize IOs, but can be configured to only cache random IO
+whilst ignoring sequential IO.
 
 Replacement policy is either FIFO or LRU within a cache set. The
 default is FIFO but policy can be switched at any point at run time
@@ -164,6 +165,19 @@ In spite of the limitations, we think the ability to mark Direct IOs
 issued by a pid will be valuable to prevent backups from wiping out
 the cache.
 
+Alternatively, rather than specifically marking pids as non-cacheable,
+users may wish to experiment with the sysctl 'skip_seq_thresh', which
+disables caching of IO determined to be sequential, above a configurable
+threshold of consecutive reads or writes. The algorithm to spot
+sequential IO has some ability to handle multiple 'flows' of IO, so
+it should, for example, be able to skip caching of IOs from two
+flows of sequential reads or writes, but still cache IOs from a third,
+random IO flow. Note that multiple small files may be written to
+consecutive blocks. If these are written out in a batch (e.g. by
+an untar), this may appear as a single sequential write, hence these
+multiple small files will not be cached. The categorization of IO as
+sequential or random occurs purely at the block level, not the file level.
+
 (For a more detailed discussion about caching controls, see the SA Guide).
 
 Futures and Features :
@@ -298,3 +312,5 @@ Acknowledgements :
 I would like to thank Bob English for doing a critical review of the
 design and the code of flashcache, for discussing this in detail with
 me and providing valuable suggestions.
+
+The option to detect and skip sequential IO was added by Will Smith.
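
The multi-flow detection described in the doc text above can be sketched in a few lines. This is a hypothetical userspace simplification (a Python list standing in for the flow table, 512-byte sectors, and made-up names like `SequentialSpotter` and `should_skip`), not the kernel implementation:

```python
# Hypothetical sketch of the sequential IO spotter described above:
# remember a few recent IO 'flows'; an IO that continues a known flow
# past the threshold is skipped (not cached). Not the kernel code.

SECTORS_PER_KB = 2  # 512-byte sectors

class SequentialSpotter:
    def __init__(self, thresh_kb, depth=4):
        self.thresh_kb = thresh_kb  # 0 means cache everything
        self.depth = depth          # how many flows we remember
        self.flows = []             # (expected_next_sector, seq_kb), MRU first

    def should_skip(self, sector, size_kb):
        """Return True if this IO looks sequential past the threshold."""
        for i, (expected, seq_kb) in enumerate(self.flows):
            if expected == sector:  # IO continues a known flow
                seq_kb += size_kb
                self.flows.pop(i)   # move flow to MRU head
                self.flows.insert(0, (sector + size_kb * SECTORS_PER_KB, seq_kb))
                return self.thresh_kb > 0 and seq_kb > self.thresh_kb
        # Unknown flow: record it at the MRU head, evict the LRU tail.
        self.flows.insert(0, (sector + size_kb * SECTORS_PER_KB, size_kb))
        del self.flows[self.depth:]
        return False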

doc/flashcache-sa-guide.txt (+42 −2)

@@ -199,6 +199,7 @@ dev.flashcache.ram3+ram4.pid_expiry_secs = 60
 dev.flashcache.ram3+ram4.max_pids = 100
 dev.flashcache.ram3+ram4.do_pid_expiry = 0
 dev.flashcache.ram3+ram4.io_latency_hist = 0
+dev.flashcache.ram3+ram4.skip_seq_thresh = 0
 
 Sysctls for a writeback mode cache :
 cache device /dev/sdb, disk device /dev/cciss/c0d2
@@ -218,6 +219,7 @@ dev.flashcache.sdb+c0d2.dirty_thresh_pct = 20
 dev.flashcache.sdb+c0d2.stop_sync = 0
 dev.flashcache.sdb+c0d2.do_sync = 0
 dev.flashcache.sdb+c0d2.io_latency_hist = 0
+dev.flashcache.sdb+c0d2.skip_seq_thresh = 0
 
 Sysctls common to all cache modes :
 
@@ -243,13 +245,19 @@ dev.flashcache.<cachedev>.do_pid_expiry:
 	Enable expiry on the list of pids in the white/black lists.
 dev.flashcache.<cachedev>.pid_expiry_secs:
 	Set the expiry on the pid white/black lists.
+dev.flashcache.<cachedev>.skip_seq_thresh:
+	Skip (don't cache) sequential IO larger than this number (in kb).
+	0 (default) means cache all IO, both sequential and random.
+	Sequential IO can only be determined 'after the fact', so
+	this much of each sequential IO will be cached before we skip
+	the rest. Does not affect searching for IO in an existing cache.
 
 Sysctls for writeback mode only :
 
 dev.flashcache.<cachedev>.fallow_delay = 900
 	In seconds. Clean dirty blocks that have been "idle" (not
-	read or written) for fallow_delay seconds. Default is 60
-	seconds.
+	read or written) for fallow_delay seconds. Default is 15
+	minutes.
 	Setting this to 0 disables idle cleaning completely.
 dev.flashcache.<cachedev>.fallow_clean_speed = 2
 	The maximum number of "fallow clean" disk writes per set
@@ -350,13 +358,17 @@ not cache the IO. ELSE,
 2) If the tgid is in the blacklist, don't cache this IO. UNLESS
 3) The particular pid is marked as an exception (and entered in the
    whitelist, which makes the IO cacheable).
+4) Finally, even if the IO is cacheable up to this point, skip it if it
+   is sequential and the sysctl is configured to skip sequential IO.
 
 Conversely, in "cache nothing" mode,
 1) If the pid of the process issuing the IO is in the whitelist,
    cache the IO. ELSE,
 2) If the tgid is in the whitelist, cache this IO. UNLESS
 3) The particular pid is marked as an exception (and entered in the
    blacklist, which makes the IO non-cacheable).
+4) Anything whitelisted is cached, regardless of whether the IO is
+   sequential or random.
 
 Examples :
 --------
@@ -480,6 +492,34 @@ agsize * agcount ~= V
 
 Works just as well as the formula above.
 
+Tuning Sequential IO Skipping for better flashcache performance
+===============================================================
+Skipping sequential IO makes sense in two cases:
+1) The sequential write speed of your SSD is slower than
+   the sequential write or read speed of your disk. In
+   particular, for implementations with RAID disks (especially
+   modes 0, 10 or 5) sequential reads may be very fast. If
+   'cache_all' mode is used, every disk read miss must also be
+   written to the SSD. If you notice slower sequential reads and
+   writes after enabling flashcache, this is likely your problem.
+2) The 'resident set' of disk blocks that you want cached, i.e.
+   those that you would hope to keep in cache, is smaller
+   than the size of your SSD. You can check this by monitoring
+   how quickly your cache fills up ('dmsetup table'). If this
+   is the case, it makes sense to prioritize caching of random IO,
+   since SSD performance vastly exceeds disk performance for
+   random IO, but is typically not much better for sequential IO.
+
+In the above cases, start with a high value (say 1024k) for the
+sysctl dev.flashcache.<device>.skip_seq_thresh, so that only the
+largest sequential IOs are skipped, and gradually reduce it if
+benchmarks show it is helping. If it does not help, return it to 0
+(the default) rather than leaving it at a very high value, since
+there is some overhead in categorizing IO as random or sequential.
+
+If neither of the above holds, continue to cache all IO (the
+default); you will likely benefit from it.
+
 
 Further Information
 ===================
flashcache-wt/README (+3 −1)

@@ -1,7 +1,9 @@
 flashcache-wt is a simple, non-persistent write-through and write-around
 flashcache.
 
-It is a separate code base from flashcache (which is write back only).
+It is a separate code base from flashcache. Note that flashcache itself, which
+is more configurable, now has options for writeback, writethrough and writearound
+caching.
 
 Notes :
 -----

src/flashcache.h (+25 −1)

@@ -173,6 +173,24 @@ struct flashcache_stats {
 	unsigned long clean_set_ios;
 };
 
+/*
+ * Sequential block history structure - each one
+ * records a 'flow' of i/o.
+ */
+struct sequential_io {
+	sector_t most_recent_sector;
+	unsigned long sequential_count;
+	/* We use LRU replacement when we need to record a new i/o 'flow' */
+	struct sequential_io *prev, *next;
+};
+#define SKIP_SEQUENTIAL_THRESHOLD 0	/* 0 = cache all, >0 = don't cache sequential i/o more than this (kb) */
+#define SEQUENTIAL_TRACKER_QUEUE_DEPTH 32	/* How many i/o 'flows' to track (random i/o will hog many).
+						 * This should be large enough so that we don't quickly
+						 * evict sequential i/o when we see some random,
+						 * but small enough that searching through it isn't slow
+						 * (currently we do a linear search; we could consider hashing). */
+
 /*
  * Cache context
  */
@@ -275,6 +293,12 @@ struct cache_c {
 	int sysctl_cache_all;
 	int sysctl_fallow_clean_speed;
 	int sysctl_fallow_delay;
+	int sysctl_skip_seq_thresh;
+
+	/* Sequential I/O spotter */
+	struct sequential_io seq_recent_ios[SEQUENTIAL_TRACKER_QUEUE_DEPTH];
+	struct sequential_io *seq_io_head;
+	struct sequential_io *seq_io_tail;
 };
 
 /* kcached/pending job states */
@@ -333,7 +357,7 @@ enum {
 #define DIRTY	0x0040	/* Dirty, needs writeback to disk */
 /*
  * Old and Dirty blocks are cleaned with a Clock like algorithm. The leading hand
- * marks DIRTY_FALLOW_1. 60 seconds (default) later, the trailing hand comes along and
+ * marks DIRTY_FALLOW_1. 900 seconds (default) later, the trailing hand comes along and
 * marks DIRTY_FALLOW_2 if DIRTY_FALLOW_1 is already set. If the block was used in the
 * interim, (DIRTY_FALLOW_1|DIRTY_FALLOW_2) is cleared. Any block that has both
 * DIRTY_FALLOW_1 and DIRTY_FALLOW_2 marked is considered old and is eligible
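
The comment on SEQUENTIAL_TRACKER_QUEUE_DEPTH says random IO will hog tracker slots, so the queue must be deep enough that an active sequential flow survives bursts of random IO. A tiny simulation (an illustrative Python stand-in for the seq_recent_ios[] LRU with made-up parameters, not the kernel code) demonstrates the tradeoff:

```python
# Illustrative simulation of the flow-tracker depth tradeoff: every new
# (random) IO claims an LRU slot, so a sequential flow is forgotten when
# more than `depth` distinct IOs arrive between its accesses.

def max_seq_count(depth, randoms_between):
    """Longest run length one sequential stream achieves in the tracker."""
    flows = []   # (expected_next_sector, count), most recently used first
    best = 0
    sector = 0
    for step in range(8):            # 8 accesses of one sequential stream
        for i, (expected, count) in enumerate(flows):
            if expected == sector:   # stream continues a tracked flow
                flows.pop(i)
                flows.insert(0, (sector + 8, count + 1))
                best = max(best, count + 1)
                break
        else:                        # flow was evicted (or is new)
            flows.insert(0, (sector + 8, 1))
            del flows[depth:]
            best = max(best, 1)
        sector += 8                  # 4 kb IOs: 8 sectors each
        # one-off random IOs between stream accesses, each claiming a slot
        for r in range(randoms_between):
            flows.insert(0, (10_000 + 100 * step + r, 1))
            del flows[depth:]
    return best
```

A deep queue keeps recognizing the stream; a shallow one evicts it between accesses, so the stream is repeatedly misclassified as new (and would never cross skip_seq_thresh).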

src/flashcache_conf.c (+16 −1)

@@ -1147,7 +1147,18 @@ flashcache_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 	dmc->sysctl_cache_all = 1;
 	dmc->sysctl_fallow_clean_speed = FALLOW_CLEAN_SPEED;
 	dmc->sysctl_fallow_delay = FALLOW_DELAY;
-
+	dmc->sysctl_skip_seq_thresh = SKIP_SEQUENTIAL_THRESHOLD;
+
+	/* Sequential i/o spotting */
+	for (i = 0; i < SEQUENTIAL_TRACKER_QUEUE_DEPTH; i++) {
+		dmc->seq_recent_ios[i].most_recent_sector = 0;
+		dmc->seq_recent_ios[i].sequential_count = 0;
+		dmc->seq_recent_ios[i].prev = (struct sequential_io *)NULL;
+		dmc->seq_recent_ios[i].next = (struct sequential_io *)NULL;
+		seq_io_move_to_lruhead(dmc, &dmc->seq_recent_ios[i]);
+	}
+	dmc->seq_io_tail = &dmc->seq_recent_ios[0];
+
 	(void)wait_on_bit_lock(&flashcache_control->synch_flags, FLASHCACHE_UPDATE_LIST,
 			       flashcache_wait_schedule, TASK_UNINTERRUPTIBLE);
 	dmc->next_cache = cache_list_head;
@@ -1295,13 +1306,15 @@ flashcache_dtr_stats_print(struct cache_c *dmc)
 	DMINFO("conf:\n" \
 	       "\tvirt dev (%s), ssd dev (%s), disk dev (%s) cache mode(%s)\n" \
 	       "\tcapacity(%luM), associativity(%u), data block size(%uK) metadata block size(%ub)\n" \
+	       "\tskip sequential thresh(%uK)\n" \
 	       "\ttotal blocks(%lu), cached blocks(%lu), cache percent(%d)\n" \
 	       "\tdirty blocks(%d), dirty percent(%d)\n",
 	       dmc->dm_vdevname, dmc->cache_devname, dmc->disk_devname,
 	       cache_mode,
 	       dmc->size*dmc->block_size>>11, dmc->assoc,
 	       dmc->block_size>>(10-SECTOR_SHIFT),
 	       dmc->md_block_size * 512,
+	       dmc->sysctl_skip_seq_thresh,
 	       dmc->size, dmc->cached_blocks,
 	       (int)cache_pct, dmc->nr_dirty, (int)dirty_pct);
 	DMINFO("\tnr_queued(%lu)\n", dmc->pending_jobs_count);
@@ -1503,6 +1516,8 @@ flashcache_status_table(struct cache_c *dmc, status_type_t type,
 		       dmc->size*dmc->block_size>>11, dmc->assoc,
 		       dmc->block_size>>(10-SECTOR_SHIFT));
 	}
+	DMEMIT("\tskip sequential thresh(%uK)\n",
+	       dmc->sysctl_skip_seq_thresh);
 	DMEMIT("\ttotal blocks(%lu), cached blocks(%lu), cache percent(%d)\n",
 	       dmc->size, dmc->cached_blocks,
 	       (int)cache_pct);
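
The init loop above calls seq_io_move_to_lruhead, which is not shown in this hunk. A plausible reading, given the prev/next pointers and the head/tail fields, is a standard move-to-head on a doubly linked list; the sketch below is a Python guess at that semantics (field names mirror the patch, the logic itself is an assumption, not the actual helper):

```python
# Hypothetical sketch of the move-to-LRU-head operation implied by the
# init loop: detach a node from the list and splice it in at seq_io_head.
class SeqIO:
    def __init__(self):
        self.most_recent_sector = 0
        self.sequential_count = 0
        self.prev = None
        self.next = None

def seq_io_move_to_lruhead(dmc, node):
    if dmc.seq_io_head is node:
        return                      # already most recently used
    # Unlink from the current position (a no-op for a fresh node).
    if node.prev is not None:
        node.prev.next = node.next
    if node.next is not None:
        node.next.prev = node.prev
    if dmc.seq_io_tail is node and node.prev is not None:
        dmc.seq_io_tail = node.prev
    # Splice in at the head (most recently used end).
    node.prev = None
    node.next = dmc.seq_io_head
    if dmc.seq_io_head is not None:
        dmc.seq_io_head.prev = node
    dmc.seq_io_head = node

class Cache:
    """Mimics the flashcache_ctr init loop for the flow tracker."""
    def __init__(self, depth):
        self.seq_io_head = None
        self.seq_io_tail = None
        self.nodes = [SeqIO() for _ in range(depth)]
        for n in self.nodes:
            seq_io_move_to_lruhead(self, n)
        self.seq_io_tail = self.nodes[0]   # first node pushed ends up last
```

This matches the loop's final `dmc->seq_io_tail = &dmc->seq_recent_ios[0]`: pushing each element to the head leaves element 0 at the tail.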
