Tuesday 22 November 2016

linux-4.8-ck8, MuQSS version 0.144

Here's a new release to go along with and commemorate the 4.8.10 stable release (they're releasing stable releases faster than my development code now.)

linux-4.8-ck8 patch:
patch-4.8-ck8.lrz

MuQSS by itself:
4.8-sched-MuQSS_144.patch

There are a small number of updates to MuQSS itself.
Notably there's an improvement in interactive mode when SMT nice is enabled and/or realtime tasks are running, or there are users of CPU affinity. Previously, tasks stuck behind one of those as the highest priority task on a CPU would transiently be refused scheduling on that CPU.
The old hacks for CPU frequency changes from BFS have been removed, leaving the tunables to default as per mainline.
The default of 100Hz has been removed, but in its place a new and recommended 128Hz has been implemented - this is just a silly micro-optimisation to take advantage of the fast shifts that /128 has on CPUs compared to /100, and is close enough to 100Hz to behave otherwise the same.

For the -ck patch only, I've reinstated updated and improved versions of the high resolution timeouts. These improve the behaviour of userspace that is inappropriately Hz dependent, so that low Hz choices do not affect latency.
Additionally by request I've added a couple of tunables to adjust the behaviour of the high res timers and timeouts.
/proc/sys/kernel/hrtimer_granularity_us
and
/proc/sys/kernel/hrtimeout_min_us

Both of these are in microseconds and can be set from 1-10,000. The first is how accurate high res timers will be in the kernel and is set to 100us by default (on mainline it is Hz accuracy).
The second is how small to make a request for a "minimum timeout" generically in all kernel code. It is set to 1000us by default (on mainline it is one tick).

I doubt you'll find anything useful by tuning these but feel free to go nuts. Decreasing the second tunable much further risks breaking some driver behaviour.
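
For anyone who does want to experiment, here is a minimal Python sketch for reading and setting the two tunables. The two paths and the 1-10,000us range come from the description above; the helper names (check_range, read_tunable, write_tunable) are made up for illustration, and writing needs root on a kernel that carries these -ck timer patches.

```python
from pathlib import Path

# Paths named in the post; they only exist on kernels with the -ck timer patches.
TUNABLES = {
    "hrtimer_granularity_us": Path("/proc/sys/kernel/hrtimer_granularity_us"),
    "hrtimeout_min_us": Path("/proc/sys/kernel/hrtimeout_min_us"),
}

# Range stated in the post: 1 to 10,000 microseconds.
MIN_US, MAX_US = 1, 10_000


def check_range(value_us):
    """Return value_us if it lies in the documented range, else raise ValueError."""
    if not MIN_US <= value_us <= MAX_US:
        raise ValueError(f"{value_us}us is outside {MIN_US}-{MAX_US}us")
    return value_us


def read_tunable(path):
    """Return the current value in microseconds, or None if the file is unavailable."""
    try:
        return int(path.read_text().strip())
    except (FileNotFoundError, PermissionError):
        return None


def write_tunable(path, value_us):
    """Write a range-checked value (hypothetical helper; requires root)."""
    path.write_text(f"{check_range(value_us)}\n")


if __name__ == "__main__":
    for name, path in TUNABLES.items():
        print(name, "=", read_tunable(path))
```

On a kernel without the patches, read_tunable simply returns None for both entries.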

Enjoy!
お楽しみ下さい
-ck

115 comments:

  1. cyclictest (cyclictest -N -S -p 80) avg times increased by a factor of 10 with this version.

    duud

    Replies
    1. Try it with interactive disabled.

    2. Same results with interactive disabled.

      Numbers are in ns.
      T: 0 ( 491) P:80 I:1000 C: 2045 Min: 3989 Act: 71984 Avg: 66114 Max: 80756
      T: 1 ( 492) P:80 I:1500 C: 1363 Min: 4347 Act: 103537 Avg: 95088 Max: 113803
      T: 2 ( 493) P:80 I:2000 C: 1022 Min: 3128 Act: 134320 Avg: 119873 Max: 146422
      T: 3 ( 494) P:80 I:2500 C: 818 Min: 4053 Act: 165569 Avg: 130382 Max: 176960

    3. On CFS:

      T: 0 ( 545) P:80 I:1000 C: 817 Min: 2209 Act: 9607 Avg: 9385 Max: 16567
      T: 1 ( 546) P:80 I:1500 C: 545 Min: 2293 Act: 2635 Avg: 7862 Max: 16549
      T: 2 ( 547) P:80 I:2000 C: 408 Min: 3506 Act: 9911 Avg: 9822 Max: 10986
      T: 3 ( 548) P:80 I:2500 C: 327 Min: 2180 Act: 3270 Avg: 9018 Max: 17319
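
To put a number on the "factor of 10", the two result sets above can be compared with a short Python sketch. The parser is an assumption based only on the line shape pasted in this thread, not on cyclictest's documented output format, and the names parse_cyclictest and mean_avg are invented for illustration.

```python
import re

# Field labels as they appear in the pasted output lines.
FIELDS = ("P", "I", "C", "Min", "Act", "Avg", "Max")
LINE_RE = re.compile(
    r"T:\s*(?P<T>\d+)\s*\(\s*(?P<tid>\d+)\)"
    + "".join(rf"\s*{name}:\s*(?P<{name}>\d+)" for name in FIELDS)
)


def parse_cyclictest(text):
    """Return one dict of integer fields per thread line found in text."""
    return [{k: int(v) for k, v in m.groupdict().items()}
            for m in LINE_RE.finditer(text)]


def mean_avg(rows):
    """Mean of the per-thread Avg latency values (ns)."""
    return sum(row["Avg"] for row in rows) / len(rows)


# Numbers copied from the two comments above (later traced to the 128Hz option).
MUQSS_RUN = """
T: 0 ( 491) P:80 I:1000 C: 2045 Min: 3989 Act: 71984 Avg: 66114 Max: 80756
T: 1 ( 492) P:80 I:1500 C: 1363 Min: 4347 Act: 103537 Avg: 95088 Max: 113803
T: 2 ( 493) P:80 I:2000 C: 1022 Min: 3128 Act: 134320 Avg: 119873 Max: 146422
T: 3 ( 494) P:80 I:2500 C: 818 Min: 4053 Act: 165569 Avg: 130382 Max: 176960
"""
CFS_RUN = """
T: 0 ( 545) P:80 I:1000 C: 817 Min: 2209 Act: 9607 Avg: 9385 Max: 16567
T: 1 ( 546) P:80 I:1500 C: 545 Min: 2293 Act: 2635 Avg: 7862 Max: 16549
T: 2 ( 547) P:80 I:2000 C: 408 Min: 3506 Act: 9911 Avg: 9822 Max: 10986
T: 3 ( 548) P:80 I:2500 C: 327 Min: 2180 Act: 3270 Avg: 9018 Max: 17319
"""

ratio = mean_avg(parse_cyclictest(MUQSS_RUN)) / mean_avg(parse_cyclictest(CFS_RUN))
print(f"MuQSS/CFS mean Avg ratio: {ratio:.1f}")
```

On these figures the ratio comes out at roughly 11, in line with the reported factor of 10.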

    4. Con just puts the userspace programmers' hardcoded 'dodgy' things back on the table. Although it seems buggy, the factor of 10 is nicely significant. Maybe, cyclictest isn't MuQSS-ready(tm) ;-)
      @duud: Any issues or slowdowns?
      BR, Manuel Krause

    5. Okay let's take a step back. What's cyclic test and where is it? Need to know if it's benchmarking something valid first before we consider if this is relevant. Also what muqss exactly are you benchmarking? Muqss by itself, with -ck, what Hz config?

    6. I tried cyclictest, but it was broken with older MuQSS releases.
      When run with the histogram option, the latency counters were always overflowing. But this seems fixed with MuQSS144.

      This presentation gives useful info on cyclictest:
      http://events.linuxfoundation.org/sites/events/files/slides/cyclictest.pdf

      Pedro

    7. I'd have to see the actual code to pass judgement, but the fact that it was even possible for the latency counters to overflow, regardless of the CPU scheduler it was run on, makes it hard to put any great value on it. It's not unusual that these latency measurement tools depend on kernel design/APIs that may or may not exist in muqss or have a totally different meaning. The same thing happened with lattest.

    8. ^^ That first cyclictest looks really horrible.
      Must be some bad driver or bad code in userspace or something.
      Latencies should be spread almost even across cores.
      Something clearly isn't right. It also shows on CFS but not that extreme though.

      Cyclictest running fine as it should here on Core 2 3Ghz. (MuQSS)

      T: 0 ( 1510) P:80 I:1000 C: 11657 Min: 1041 Act: 1223 Avg: 1319 Max: 5217
      T: 1 ( 1511) P:80 I:1500 C: 7771 Min: 1167 Act: 1390 Avg: 1478 Max: 6242

      1.3-1.5µs on Core 2 :)
      Factor 2-3 lower than CFS, like it should be. But then again cyclictest isn't saying too much also.

      I use a custom kernel though with "everything" ripped out except hardware drivers for the machine I am compiling on.

    9. Cyclictest on W3690 hexacore, latest ck, kernel 4.8.10.

      T: 0 ( 2036) P:80 I:1000 C: 9970 Min: 1123 Act: 1357 Avg: 1455 Max: 4912
      T: 1 ( 2037) P:80 I:1500 C: 6647 Min: 1113 Act: 1467 Avg: 1504 Max: 5313
      T: 2 ( 2038) P:80 I:2000 C: 4985 Min: 1189 Act: 1367 Avg: 1453 Max: 4933
      T: 3 ( 2039) P:80 I:2500 C: 3988 Min: 1195 Act: 1345 Avg: 1449 Max: 4778
      T: 4 ( 2040) P:80 I:3000 C: 3323 Min: 1224 Act: 1384 Avg: 1397 Max: 3064
      T: 5 ( 2041) P:80 I:3500 C: 2848 Min: 1170 Act: 1312 Avg: 1370 Max: 8951

    10. I usually run a quick cyclictest to check for scheduling overhead.

      Now I did some tests. It's the new 128HZ option - nobody tested it? Setting back to 100HZ yields low numbers again.

      duud

    11. That reminds me...
      when I was experimenting with higher Hz than 1000 I noticed the same regarding uneven Hz numbers.
      They must be divisible by 2 and 10 and such (no idea exactly), otherwise performance suffers somehow (maybe drivers?).
      As I use 1000 Hz normally I didn't notice that 128 Hz thing.

    12. What a fascinating discovery. Now the real question is - is there some in-kernel requirement of it being that multiple that makes it actually misbehave, OR is it just a sampling error that occurs as a result of that multiple and it's reporting bad where in fact it's performing fine?

    13. All I can say performance really suffered.
      Like above on that cyclictest CFS/MuQSS (128Hz) comparison.
      Some code seems to be multiple of ? Hz-dependent it seems.
      Maybe in the kernel, maybe in the drivers, maybe in userspace...
      ... no idea...

    14. Well I'm saying that I'm not sure that performance actually is suffering and that there may be a reporting error from the clock with unusual Hz values. Either way, Hz 128 isn't proving to be beneficial so that gets the boot too next release...

    15. From testing performance really suffered, I didn't measure anything though since it was obvious.
      I think it was 864 Hz or something...
      and figured maybe it was a rounding or accuracy error or something and checking the next value if it was "even".
      Eventually I came up with 1/1250 which is 0.0008.
      I changed Hz to 1250 then and went straight to the next compilation.
      No problems anymore.

    16. Probably just your hardware - maybe it makes the tsc unstable. It behaves fine on mine. Don't get anything like your results now that I've tried cyclictest (was getting empty directory when trying to git clone it before.)

    17. Also bear in mind that cyclictest should be run with sudo to run realtime. It might think it's running realtime when running on MuQSS without sudo, but it's actually running SCHED_ISO, which is nothing like running SCHED_FIFO.

    18. The hardware argument makes sense since I need to enable "Enable PCI quirk workarounds" in the kernel also to get low latency.

    19. So, summarising several of the reports of this blog entry, choosing one of the more traditional HZ values (and NOT 128) would be more reasonable (more compatible, safer, etc. for drivers, kernel and userspace) ?
      Thanks and best regards, Manuel Krause

    20. Here is something that I'm getting only on 128HZ

      APIC calibration not consistent with PM-Timer: 93ms instead of 100ms
      APIC delta adjusted to PM-Timer: 1312496 (1230459)

      I took a very quick look into the code but I can't see how this might be HZ related. Ideas?

      duud

    21. This might be interesting:

      #define LAPIC_CAL_LOOPS (HZ/10)

      duud

    22. ^ Good find. There might be more...

    23. @duud:
      I also got these with 128HZ, second line differed a bit, by "... 1662498 (1558584)". Core2duo cpu, Florian reported this first on this blog page.
      After recompiling same setup with 100HZ no such messages occur. But then, my FF forking issue is appearing again (vs. 128HZ).
      I haven't seen slowdowns with 128HZ on my system. Maybe Con is on the right way with this path, but not touching all corner cases so far.

      What is the "#define LAPIC_CAL_LOOPS (HZ/10)" thing about? The possibly needed fraction of 10?

      Manuel Krause

    24. @Manuel:
      Yes, it's an integer division

    25. Short excerpt of a single file (not complete)

      /arch/x86/include/asm/apb_timer.h

      Line 44 : (loops_per_jiffy * (HZ/4)));
      Line 78 : #define MIN_SPU_TIMESLICE max(5 * HZ / (1000 * SPUSCHED_TICK), 1)
      Line 79 : #define DEF_SPU_TIMESLICE (100 * HZ / (1000 * SPUSCHED_TICK))
      Line 133: timer64_config(TIMER64_RATE / HZ);
      Line 192: bogosum/(500000/HZ), bogosum/(5000/HZ) % 100);
      Line 470: (loops_per_jiffy/(500000/HZ)),
      Line 539: schedule_timeout(HZ/10);
      Line 553: mod_timer(&timer_virt_cntr, jiffies + HZ / 10);
      Line 667: mod_timer(&timer_spu_event_swap, jiffies + HZ / 25);
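
The LAPIC_CAL_LOOPS (HZ/10) find can be sanity-checked with a little arithmetic. The sketch below is a simplified model, not the kernel's code: it assumes the calibration window is LAPIC_CAL_LOOPS ticks long and is meant to last 100ms, and the function name calibration_window_ms is invented here.

```python
def calibration_window_ms(hz):
    """Length in ms of a (HZ/10)-tick window, using C-style truncating division."""
    loops = hz // 10          # LAPIC_CAL_LOOPS as quoted above; the remainder is dropped
    return loops * 1000 / hz  # each tick lasts 1000/hz milliseconds


# HZ values mentioned in this comment section.
for hz in (100, 128, 160, 200, 240, 250, 256, 512, 1000):
    print(f"HZ={hz:4d}  loops={hz // 10:3d}  window={calibration_window_ms(hz):.2f} ms")
```

Under that assumption 128Hz gives 12 ticks, or 93.75ms, which matches the "93ms instead of 100ms" message, and 256Hz gives 25 ticks, or about 97.66ms, which matches the 97ms reported further down. Every multiple of 10 gives exactly 100ms. 512Hz lands within half a percent of 100ms, which may be why no message was reported for it; the tolerance the kernel applies is not stated in this thread.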

  2. on 4.8.10 kernel, 4.8-ck8 patchset

    patch -p1 < ../4.8-ck8/patches/0006-Implement-min-and-msec-hrtimeout-un-interruptible-sc.patch
    patching file include/linux/sched.h
    Hunk #1 FAILED at 437.
    1 out of 1 hunk FAILED -- saving rejects to file include/linux/sched.h.rej
    patching file kernel/time/hrtimer.c
    Hunk #1 FAILED at 1788.
    1 out of 1 hunk FAILED -- saving rejects to file kernel/time/hrtimer.c.rej

    Replies
    1. ^^ Also there are two 0001-... patches, maybe this is the problem?

    2. Yes that was it. Uploading new tarballs. The second 0001 was not meant to be there.

    3. Works now and running great on a W3690 hexacore (4.8.10 + 4.8-ck8 patchset).

      Didn't notice any major increase regarding cyclictest^^.

      Also I like the implementation of tunables although I didn't have time to play around with them yet.

      Best regards,
      Anonymouse :)

  3. Hi,

    I just updated to MuQSS v0.144 and for the first time I saw this kernel message in my Arch syslog during boot:

    kernel: APIC calibration not consistent with PM-Timer: 93ms instead of 100ms

    My HZ-config on my Core2 Duo:

    CONFIG_HZ_PERIODIC=y
    # CONFIG_NO_HZ_IDLE is not set
    # CONFIG_NO_HZ_FULL is not set
    # CONFIG_HZ_100 is not set
    CONFIG_HZ_128=y

    Does this have anything to do with HZ_128=y? Do I have to optimize one of the following options:

    /proc/sys/kernel/hrtimer_granularity_us is defaulting to 100 and
    /proc/sys/kernel/hrtimeout_min_us defaults to 1000.

    Thanks,

    Florian.

    Replies
    1. Neither the Hz value nor those tunables should cause or have any effect on that so I don't really know why, except that not all hardware has stable TSC to be able to use them for the apic timers. I don't believe you need to do anything as the kernel will automatically pick the fastest stable clock it can.

    2. @Florian & @ck:
      I get the same message (same cpu type), but it seems to be only informative. Second line after this is on my system:
      APIC delta adjusted to PM-Timer: 1662498 (1558584)
      So, kernel seems to know how to deal with. I don't observe anomalies on my system.
      BR, Manuel Krause

  4. Hello.
    Here come the usual benchmarks. The kernel configuration is Archlinux's 4.8.7 one. Intel-pstate+powersave frequency governor is used.

    CFS vs MuQSS144
    http://openbenchmarking.org/result/1611224-LO-CFSVSMUQS05

    MuQSS140 vs MuQSS144
    http://openbenchmarking.org/result/1611224-LO-MUQSS140164

    There is some small improvement with MuQSS144+interactive=1, notably on ebizzy.

    Pedro

    Replies
    1. Nice, but too bad those benchmarks don't show the major gain in responsiveness compared to CFS.

    2. There are some more benchmarks, to those who are interested: https://docs.google.com/spreadsheets/d/1EayezAsGlJdXjZbS3b9m7YtvtRF-DJ3xrT3hYCvfymQ/edit?usp=sharing

      Br, Eduardo

  5. Hey,

    Just wanted to let you know that I still have issues with the spotify scrolling even with this new release (Same workload as I described in my email).
    I have tested this with 1kHz so far. I will test the new 128Hz in a moment.

    Replies
    1. Is there anything else I can do to help tracking down this problem?

    2. Seems that 128Hz didn't make a difference.

    3. As I said to you in the email, I did not specifically address your issue and was hoping the fixes in 144 helped but I guess not. I'm still scratching my head on that one for the reasons I've mentioned. Perhaps send me a fresh 'top' output again with the affected workload and make sure it shows threads please.

    4. Hey,

      Sorry I couldn't use my email because I am currently out of town, so obviously I won't be able to test the nicing stuff (rcu_preempt) until Sunday.
      Despite that, I can at least give you my grep RCU .config result:

      # RCU Subsystem
      CONFIG_PREEMPT_RCU=y
      # CONFIG_RCU_EXPERT is not set
      CONFIG_SRCU=y
      # CONFIG_TASKS_RCU is not set
      CONFIG_RCU_STALL_COMMON=y
      # CONFIG_TREE_RCU_TRACE is not set
      # CONFIG_RCU_EXPEDITE_BOOT is not set
      # RCU Debugging
      # CONFIG_PROVE_RCU is not set
      # CONFIG_SPARSE_RCU_POINTER is not set
      # CONFIG_RCU_PERF_TEST is not set
      # CONFIG_RCU_TORTURE_TEST is not set
      CONFIG_RCU_CPU_STALL_TIMEOUT=60
      # CONFIG_RCU_TRACE is not set
      # CONFIG_RCU_EQS_DEBUG is not set

  6. @ck:
    In my first quick test I see progress between 140 and 144 regarding my FF "issue".
    Running 4.8.10 with the -ck8 timer patches at default 128Hz. And very low system base load. Very nice :-)
    BR, Manuel Krause

  7. Hi Con.
    I've some questions on interbench.

    I ran it several times and found that sometimes there are big variations with CFS in 'Max Latency', '% Desired CPU' and '% Deadlines Met'. Average latencies are more consistent, but still vary.
    I tried with both intel-pstate performance and powersave. Doesn't make a difference.
    I tried running interbench longer (-t 90), and it is better.

    You can see the results here:
    https://docs.google.com/spreadsheets/d/163U3H-gnVeGopMrHiJLeEY1b7XlvND2yoceKbOvQRm4/edit?usp=sharing

    The colors mean:
    blue = within +- 10% of the reference
    green = better
    red = worse
    The reference is the first run on the left.

    I wonder if such variations are expected.
    If it is so, how to do a fair comparison between schedulers ?

    Interestingly, there are fewer variations with MuQSS (maybe due to the heuristics in CFS?).

    Pedro

    Replies
    1. Yes, CFS is a big state machine and that's what happens when there are heuristics to predict how to be interactive - there is wild variation in behaviour. MuQSS is deterministic in its behaviour and has no heuristics so the results should be mostly consistent apart from system load events (kernel threads, interrupts etc.)
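
The colour scheme Pedro describes for his spreadsheet (blue within +-10% of the reference run, green better, red worse) can be written down as a small Python sketch. The function name classify and the example numbers in the comments are hypothetical; whether "better" means higher or lower depends on the metric, so that is passed in explicitly.

```python
def classify(value, reference, higher_is_better, tolerance=0.10):
    """Label a result relative to the reference run.

    Returns 'blue' when value is within +-tolerance of reference,
    otherwise 'green' (better) or 'red' (worse).
    """
    if reference == 0:
        raise ValueError("reference must be non-zero")
    change = (value - reference) / abs(reference)
    if abs(change) <= tolerance:
        return "blue"
    improved = change > 0 if higher_is_better else change < 0
    return "green" if improved else "red"


# Hypothetical figures: latency (lower is better) and % deadlines met (higher is better).
print(classify(105, 100, higher_is_better=False))  # within 10% of the reference
print(classify(80, 100, higher_is_better=False))   # lower latency than the reference
print(classify(80, 100, higher_is_better=True))    # fewer deadlines met than the reference
```

Repeating each run several times and classifying against the mean of the reference runs, rather than a single run, would be one way to reduce the effect of the run-to-run variation discussed above.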

  8. Thanks a lot, this release seems to alleviate TF2's startup time & fps problems. I also haven't run into any crashes or issues in an hour of testing.

    Subjectively, my whole system seems to be more responsive and also boot up a little quicker but that could be placebo as I haven't done scientific testing. Although I could run a youtube video in the background while gaming without noticing any changes in input responsiveness (which I usually do in that case) so I guess something in this release does improve that.

    ~ kiwii, the anon who filed the TF2 bug report

    Replies
    1. Thanks for the feedback. Yes the timer changes in -ck are specifically designed to work around userspace coding errors which make it inappropriately Hz dependent. Additionally I've noticed that the boot process is Hz dependent too so you're right that the boot is quicker from both kernel code and system(d)/init.

    2. Yep, the boot process got a lot faster for me as well. Also, I think there's something about nvidia-drivers that is HZ dependent as well. GDM on my laptop using Intel drivers was much, much faster than my desktop with a nvidia card, and my desktop is A LOT faster than my laptop.

    3. So is the GDM slowdown fixed for you now then?

    4. Yes, it seems to be as fast as a kernel previously compiled with 1000hz. I'll make some tests with a 1000hz kernel to confirm later.

  9. With the concerns regarding security and privacy, distributions like Debian have started to release grsecurity in their repos. Sid only for now.

    any plans to make CK compatible with grsec (in the [near] future)?

  10. Thanks a lot. Very responsive on a i7-870 quadcore (oc) even more than CFS. I also used "KBUILD_CFLAGS += -falign-functions=1 -falign-jumps=1 -falign-loops=1 -falign-labels=1 -fno-builtin -pipe" in Line 200 of /arch/x86/Makefile which seems to give a nice boost.

    Replies
    1. Any source from where you got it? And, may there be a whitespace typo in it before "-pipe"?
      Do you use full -ck8 or plain MuQSS 0.144, and which HZ value have you chosen?
      Thanks in advance, BR, Manuel Krause

    2. Full ck8, 1000Hz. I was testing a lot of compiler options, those proved to increase performance significantly.
      Those alignment options I got from there > http://stackoverflow.com/questions/19470873/why-does-gcc-generate-15-20-faster-code-if-i-optimize-for-size-instead-of-speed

      -fno-builtin is a recommendation by Agner Fog.

      " -pipe" is extra. It pipes output rather than using temp files which speeds up the compilation.

    3. Is that supposed to affect the kernel at runtime or only the compile-time? My Makefile already has -falign-jumps=1 and -falign-loops=1 set for the 64bits architecture before line 200, but none of the others.

    4. Thanks for your reply! Forget my last posting for the moment. Why still 1000Hz?
      And when speaking so offtopic about gcc compiler options: Wasn't there a Makefile option for it to compile for more performance? I don't remember any more.

    5. ^^ It affects runtime of the kernel, it will be faster (on Intel, can't speak for AMD since I don't own any AMD box).
      Out of laziness I always use the whole "block" and copy-paste it anywhere. I didn't pay any attention to whether it's already there.
      1000 Hz feels more responsive.
      Yes, but it is plain -O2 as opposed to -Os which is for small binary size.
      ^^^ These add some options on top of -O2.

    6. Okay, thank you for your detailed info so far. I'm at 250Hz at the moment, which doesn't cause trouble like 100 or 128Hz did. Reboot is pending. :-)) I'll come back to this in some hours.
      BR, Manuel Krause

    7. I gave it a little longer to prove it's no Fata Morgana. I'm quite excited about the scope of the effects: interactivity increased while not slowing down other subsystems (display/window refresh, disk i/o, eth transmission) -- these even seem to benefit too. No benchmarks done, but subjectively with these options the 250Hz kernel feels superior to a 1000Hz one without. And, no errors occurred.
      Great thanks for sharing and BR,
      Manuel Krause

    8. You're welcome.
      Nice it worked for you as well.

    9. You said you've tested many compiler options. If you keep up to date in this area in future, please don't hesitate to publish your findings on here. Although a bit offtopic, I can imagine many people on here who may want to benefit from your time consuming testing work.
      I can also only speak for my intel core2duo, hopefully testers with other brands confirm the positive effects as well.
      BR, Manuel Krause

    10. If you're finding 250Hz works best that's almost certainly because mainline's default is 250 and no doubt there is code which was developed and optimised at that value and no one ever bothered to test other values to see if they're problematic. Many of the /10 divisions in the code that have been pointed out are still harmless even if they round down.

    11. I came back to

      CONFIG_NO_HZ_COMMON=y
      # CONFIG_HZ_PERIODIC is not set
      CONFIG_NO_HZ_IDLE=y

      and

      CONFIG_HZ_250=y

      after I had some problems while compiling palemoon browser a week ago with 100% CPU usage and my system nearly got frozen, no responsiveness, windows weren't redrawn etc.

      No problems with the current 4.8.11-1-ck but I still don't understand one thing: now WITHOUT periodic timers I have about 10% MORE CPU usage than before. This is shown by htop and other utilities. Is this real or some kind of difference in measurement (periodic timer vs. tickless idle)? I have no problems with programs stuttering, but see higher CPU usage even when the system is idle (4-5 % instead of 0-1 %).

    12. That's the kind of tip I was searching for! Thanks and, as Manuel Krause wrote, these little off-topic tips are great! I read from Agner Fog:

      "The first thing that you can do to improve the performance is to drop the builtin versions of memory and string functions. The speed can be improved by up to a factor 5 in some cases by compiling with -fno-builtin. The builtin version is never optimal, except for memcpy in cases where the count is a small compile-time constant so that it can be replaced by simple mov instructions."

      Two weeks ago I compiled the current gcc 6.2.1-1 version myself (it took about 4 hours, but I wanted to test whether this can improve my system without any cross compiling). As described, I edited [...]/arch/x86/Makefile and, except for the one usual message about a declared but unused variable, had no other compiler warnings or messages during the kernel compile. I don't know if the reason is the kernel code or the magic -fno-builtin option. I've never had just this one message when compiling my kernels (normally there are a few more warnings).

      I have no benchmarks or other proving data but it seems to be an improvement, as my Core2 Duo Arch Linux system runs smoothly and I didn't observe any difficulties with the current 4.8.11-1-ck today. Great! :-)

    13. @ck & @Florian:
      The reason to test the 250Hz again, was the imagined promise of higher throughput vs. 1000Hz, with the now obtained extra interactivity by the above-mentioned compiler configuration addons. Without the latter, I'd still prefer 1000Hz on my system when aiming at interactiveness.
      Over the weekend I've started a round of comparative tests with 250Hz, 200Hz and 160Hz, inspired by the division (by 2/4/10) talks on here, mainly to try to pin down my Firefox forking issue privately on my own. Unfortunately, regarding this goal, my results are inconsistent (meaning: unexpected) and may need more rounds.
      What I can say: All three down to the 160Hz version, lowest tested atm., don't throw out the APIC timer confusion+ correction message (reported above).
      For the moment I'm quite confident with the 160Hz version (but it can also be a 'one-boot-wonder' ;-)).
      Con, can you please tell, maybe again, what reason led you to choose 128Hz?

      BR, Manuel Krause

    14. Sure. Division on a CPU is a relatively expensive process in terms of how many cycles are used to perform it. In the kernel, the expression X/HZ is used quite a lot as a macro which would be converted to an actual division for all the normal values of HZ used in kernel configuration. The value 128, on the other hand, means the macro X/HZ can be converted into a logical shift operation of X right shifted by 7 bits (X >> 7), which is an extremely fast operation in a CPU by comparison. I chose the lowest value that is still in the 100-1000 range since values outside this range are known to break code. However this is truly a micro-optimisation and if code expects values to be a multiple of 10 for other reasons it would cause macro-breakage that greatly offsets any micro-improvement.

    15. @ck:
      Thank you, Con. Your explanation was quite programmer-oriented, but after reading it twice... ;-) there still remain questions:
      Will using 256Hz take advantage of shift operation too, meaning here X >> 8, etc. with power of 2 (512, 1024), and would also be beneficial? What in fact happens to the divisions /10: do the above mentioned binary shift ops do their job in place anyways, and only the /10 parts suffer when called?
      I've now a 256Hz kernel running, and the aforementioned APIC messages came up again, just for info (not with 100, 160, 200, 240, 250, 1000 -- but with 128 and this 256 Hz one):
      [ 0.058593] APIC calibration not consistent with PM-Timer: 97ms instead of 100ms
      [ 0.058593] APIC delta adjusted to PM-Timer: 1662496 (1623516)
      But no issues observed.

      Best regards, and thanks in advance, if you find time to answer some of my questions,
      Manuel Krause

    16. Yes 256 is also a fast shift instead of a slow division. However the kernel code is using something divided by 10, and you'll always have rounding down unless you use a multiple of 10. There is no power of 2 that is also a multiple of 10 anywhere between 100 and 1000 so you cannot get both.

    17. @ck:
      Does the "rounding down", in some /10 cases, eat up the advantage of fast shifts? Above, you've written about macro-breakage for /10 cases, and I've not fully understood the circumstances.
      I've taken the 256Hz for this test, as I assumed it to round down to 250 as next lower decimal for related /10 division ops, and given your words that 250Hz is a(n old and so far unquestioned mainline) kernel "icon", that many drivers may rely on.
      In my everyday usage experience this now advances to my best choice.

      BR, Manuel Krause
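
Con's two points in this sub-thread, that X/HZ becomes a shift when HZ is a power of two and that no power of two in the usable range is also a multiple of 10, can be checked with a few lines of Python. This is only an arithmetic illustration (the kernel is C, where the compiler does the transformation), and the helper name is_power_of_two is invented here.

```python
def is_power_of_two(n):
    """True when n is a positive integer with exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


# For non-negative integers, dividing by 128 equals shifting right by 7 bits,
# and dividing by 256 equals shifting right by 8 bits.
assert all(x // 128 == x >> 7 for x in range(100_000))
assert all(x // 256 == x >> 8 for x in range(100_000))

# Powers of two in the 100-1000 range, and those that are also multiples of 10.
candidates = [hz for hz in range(100, 1001) if is_power_of_two(hz)]
both = [hz for hz in candidates if hz % 10 == 0]
print(candidates)
print(both)

# What HZ/10 truncates to for each candidate, as a tick count.
for hz in candidates:
    print(f"HZ={hz}: HZ/10 -> {hz // 10} ticks")
```

The candidates are 128, 256 and 512 and none of them is a multiple of 10, so a /10 on any of them truncates: 256/10 yields 25 ticks at 256Hz, which is not the same as running at 250Hz.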

  11. Thanks for the tip concerning

    -falign-functions=1 -falign-jumps=1 -falign-loops=1 -falign-labels=1 -fno-builtin -pipe

    I used to do "KBUILD_CFLAGS += -march=native -mtune=native -pipe" before and am now testing whether I notice any speed boost on my Core2 Duo. It compiles and starts without any issues; perhaps I'll discover some milliseconds of speed boost. ;-)

    Replies
    1. -O3 -march=native -mtune=generic -falign-functions=1 -falign-jumps=1 -falign-loops=1 -falign-labels=1 -fno-builtin -pipe works well here. Thanks.

    2. Did some more "experiments" and came up with:

      KBUILD_CFLAGS += -O3 -march=native -mtune=generic -falign-functions=1 -falign-jumps=1 -falign-loops=1 -falign-labels=1 -mno-mmx -mno-sse -mno-sse2 -mno-sse3 -mno-ssse3 -mno-sse4.1 -mno-sse4.2 -mno-sse4 -mno-avx -mno-aes -mno-sse4a -mno-3dnow -fno-builtin -pipe

    3. Atm. I don't understand why you disable cpu specific enhancements (-mno-mmx -mno-sse...). Can you explain the reason, please?

      BR, Manuel Krause

    4. From my test runs it speeds up the kernel considerably. Maybe those enhancements cause latency.

    5. What system do you run? More details, please.
      Unfortunately, I've had to take a break from this Makefile testing. "-O3" was prone to runtime errors in former compiler/kernel days, but that was months or years ago.
      I needed to make sure that none of these optimisations were the reason my Firefox doesn't play back any Flash videos anymore. It's a pity that I can't roll back the recent MESA updates, which I'd hold responsible for my current issue.

      BR, Manuel Krause

    6. The CFLAGS chosen in the kernel makefile are often carefully selected based on regression testing and low level assembly errors discovered with higher levels of optimisation so they appear to be relatively conservative intentionally. Using custom CFLAGS is likely to lead to subtle low level bugs which is why I've never included any in my kernels nor included the option to do so.

    7. Lenovo Thinkstation S20, Xeon W3520 2.66 GHz quadcore, 6GB RAM, 250GB SAMSUNG 850 EVO, 500GB Seagate HDD, NVIDIA Quadro 4000, Intel EXPI9301 ethernet, NEC USB 3.0, no issues.
      Regarding conservative cflags, even gcc will sometimes fail to build a proper kernel (microbugs) even when using just the standard -O2.
      But yes -O2 is more safe than -O3.

    8. I've reinstalled a very very old MESA known-good backup now. Same problem with flash.
      Either the source server is misbehaving since some days or the recent firefox-esr update is trash.
      The CFLAGS changes are not relevant for my problems, cross tested, but maybe for this subthread's original poster.

      @ck & @Florian:
      Atm. I'm using a 512Hz kernel, which allows the virtualbox modules to compile and, astonishingly, doesn't cause the APIC timer issues (as with 128, 256).

      BR, Manuel Krause

    9. The "-O3" option still remains erratic, as Con wrote above about low level bugs.
      On my system, reboots or powerdowns hang before completing.
      BR, Manuel Krause

    10. Xeon W3520 quadcore 2.66 GHz, 6 GB 1066 MHz RAM, nvidia quadro 4000, samsung evo 850 250GB SSD, seagate 500GB HDD. Old Lenovo Thinkstation S20 but still good (enough).

      I suggest switching from O3 to O2 and all will be fine, well most of the time.
      I had gcc produce microbugs even with plain O2 in rare cases.

      The O3 adventure went fine here, no problems (and faster in some cases) although I reverted to the above with O2 instead of O3 on the server for reliability.

  12. Thanks for the patches.
    Hello Linux Desktop ;).
    Had to revert to 4.8(.0) though since 4.8.7-4.8.11 were too slow for my taste.

    Replies
    1. Did some testing using full ck8 patchset.
      Kernel 4.8 got gradually slower starting from 4.8.0. Slight slowdown from 4.8.0 to 4.8.1. Major slowdown from 4.8.1 to 4.8.2. At that point it was already too slow and I stopped.

    2. more details, please

      That's too vague :/

      In what use-cases

    3. Low latency desktop, gaming, input lag, ...

    4. 4.8.0-ck8 is the best?

    5. The best... I don't know. But the fastest and most responsive.

  13. Con, is the scheduler responsible for interaction with workqueues ?

    Just got the 53 second lockup while browsing with chromium, having compiz active


    afaik I got more of these in the past few days, X was frozen but it could be rebooted via Magic SYSRQ Key,

    didn't know that it would take a minute or longer for it to "pass", otherwise I would have waited longer and reported here earlier ...


    http://pastebin.com/tdeKZ9ai

    [more than 4096 chars]

    Replies
    1. okay, screw that - it's regressions galore with the nvidia proprietary driver ;)

      https://devtalk.nvidia.com/default/topic/977518/linux/problems-with-multiple-opengl-applications-running-simultaneously-with-375-20-on-a-gtx970/1

      https://devtalk.nvidia.com/default/topic/977518/linux/problems-with-multiple-opengl-applications-running-simultaneously-with-375-20-on-a-gtx970/post/5024978/#5024978

    2. @kernelOfTTruth:
      Thank you for finding the culprit. I was already getting anxious.
      BR, Manuel Krause

    3. Had to downgrade to 370.28,

      so yeah, it seemingly was the proprietary nvidia-drivers,

      I'm suspicious however that it also could be the suggested optimization flags ...

      so far it's stable

    4. Your first, scheduler-related question and your last one regarding the compiler flags were my main concerns, since I don't use the nvidia drivers.
      Have you also seen good results with the suggested compiler options on your system? For my system I'm still convinced of their usefulness.
      BR, Manuel Krause

  14. 375.xx drivers are riddled with bugs, for the last month or so. I wouldn't get anxious until I saw the next major update. Even Folding@Home is crippled in the newer drivers, so we have to stay with 343.xx.

    Replies
    1. err...make that 373.xx, the last driver without major bugs.

  15. Is there a way to configure the kernel with something like CONFIG_SCHED_BFS_AUTOISO, but for MuQSS?

    Replies
    1. There was never such an option even for BFS in any of my kernels.

  16. I see, I assumed that https://github.com/zen-kernel/zen-kernel/blob/4.7/master/init/Kconfig#L75 was your work.

  17. Having said that, is there a way to use an automatic SCHED_ISO policy for X with MuQSS?

    Replies
    1. just add schedtool -I `pidof Xorg` to rc.local

    2. ^ Sorry, I mean the autostart of the desktop environment.
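
The suggestion in this reply could be wrapped in a small autostart script. This is only a sketch: the function name `iso_xorg` is made up for illustration, it assumes `pidof` and `schedtool` are installed, and it assumes the running kernel supports SCHED_ISO (as -ck and MuQSS kernels do).

```shell
#!/bin/sh
# Hypothetical helper: move every running Xorg process to SCHED_ISO.
iso_xorg() {
    # pidof exits non-zero when no Xorg process is found; bail out in that case.
    pids=$(pidof Xorg) || return 1
    # pidof may print several PIDs; schedtool accepts a list of PIDs,
    # so $pids is deliberately left unquoted to allow word splitting.
    schedtool -I $pids
}
```

Calling `iso_xorg` from the desktop environment's autostart, after X is already up, avoids the problem of rc.local running before Xorg exists.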

  18. @ck:
    Quite a nice one from Virtualbox after trying 1024Hz:
    /tmp/vbox.1/r0drv/linux/the-linux-kernel.h:332:3: error: #error "HZ is not a multiple of 1000, the GIP stuff won't work right!"
    # error "HZ is not a multiple of 1000, the GIP stuff won't work right!"

    BR, Manuel Krause

    Replies
    1. @ck:
      Addendum: none of the values I tested and mentioned above led to this message; 1000 seems to be the boundary. I find it funny that VirtualBox complains only above 1000.
      BR, Manuel Krause
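
Before building out-of-tree modules such as VirtualBox's, it can help to confirm which tick rate a kernel was actually built with. The sketch below is not from the thread: the helper name `config_hz` is hypothetical, and it assumes the kernel config text contains a `CONFIG_HZ=<number>` line.

```shell
#!/bin/sh
# Hypothetical helper: print the tick rate found in kernel config text on stdin.
config_hz() {
    # Match only the numeric CONFIG_HZ= line, not CONFIG_HZ_1000=y and friends.
    sed -n 's/^CONFIG_HZ=\([0-9][0-9]*\)$/\1/p'
}
```

Depending on how the kernel was built and installed, the config may be readable with `zcat /proc/config.gz | config_hz` or `config_hz < "/boot/config-$(uname -r)"`.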

  19. Hi ck, a while back you offered a Ubuntu 4.8.7-ck7 kernel. That is running ever so smoothly that I took to building a more recent 4.8.12 kernel, incl. your latest MuQSS patches. Builds fine, but I must be missing an important part of the puzzle, as I can't get it to boot. Would you consider posting your Ubuntu kernel build script here (if you use one..)?

  20. osu! still crashes, hangs, and locks up (the entire system) for me even with the workaround mentioned in earlier post comments. After several tries with ck and ck-ivybridge, I did get a different result in dmesg:

    snd_hda_intel 0000:00:1b.0: IRQ timing workaround is activated for card #0. Suggest a bigger bdl_pos_adj

    CPU: Intel i5-3317u
    RAM: 7853MiB
    GPU: Intel HD4000/NVIDIA 640M LE
    WM (tested): bspwm/i3
    Dist: Arch Linux (Reinstalled twice)
    Device: Dell 3421

    Replies
    1. For GPU testing, I ran through using intel only, nvidia through bumblebee and with nvidia only configured through xorg.conf.

      For runtime testing with osu! on wine (staging), the notable environment changes were primusrun with bumblebee and RT priority (STAGING_RT_PRIORITY_SERVER=90 STAGING_RT_PRIORITY_BASE=90) in staging. Either one randomly freezes, crashes, or locks up the system. Lockups always happen after setting the staging RT priority environment variables.

      I've had similar results with vanilla wine and wine-rt, but not thoroughly tested in this case.

    2. Did you try muqss by itself?

    3. Not yet. I will compile it now and see how it goes.

    4. I have removed the ck and commented out the gcc optimization patch and added MuQSS: https://gitlab.com/tom81094/pkgbuild-edits/raw/master/linux-ck-MuQSS.

      Still locks up the system. I did notice that setting a higher buffer size in Cadence for jackd doesn't make it crash until much later or when auto-playing very complex maps. Turning it off and using Pulseaudio locks up on launch.

      A few things to note off the top of my head:
      - I get very measurable (200+) xruns for jackd (-S/hpet) with a 128-frame buffer size and 3 periods per buffer on linux-rt and linux stable. It doesn't happen on linux-ck.
      When playing osu! with these settings, it crashes or locks up pretty quickly during song selection or at the beginning of songs. Raising the buffer size to 256 frames, with 3 periods per buffer, delays this lockup significantly.

      - Regardless of whether I set wine-staging RT priorities, from htop specifically only osu!.exe gets RT priority.
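
For checking which tasks actually hold a realtime policy without htop, a sketch along these lines might help. It assumes procps `ps`, whose `cls` column shows `FF` for SCHED_FIFO and `RR` for SCHED_RR; the helper name `rt_tasks` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: filter `ps -eo pid,cls,rtprio,comm` output (on stdin)
# down to tasks in a realtime scheduling class.
rt_tasks() {
    # Skip the header line; field 2 is the scheduling class.
    awk 'NR > 1 && ($2 == "FF" || $2 == "RR")'
}
```

Running `ps -eo pid,cls,rtprio,comm | rt_tasks` lists process-level entries; adding `-L` to `ps` would show individual threads as well.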

    5. --> osu!.exe doesn't crash on linux-rt and linux stable on any Pulseaudio/Cadence setting ... is what I forgot to clearly specify.

    6. Ah, if you're getting realtime priority then it's highly likely that it's related to RT capabilities and the CPU caps imposed on RT tasks in mainline kernels that aren't there in MuQSS. Try sysrq-N when the machine locks up to see if it unlocks it. Make sure you have built support for it in your kernel config (under kernel hacking) and that it's enabled by setting the value of /proc/sys/kernel/sysrq to 1. Then, when it hangs, try the sysrq-N combination, which converts realtime priority tasks to SCHED_NORMAL.
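
The enablement check described here can be sketched in shell. This is a minimal sketch, not part of the original reply: the helper name `sysrq_allows_rt_nice` is made up for illustration, and it relies on the documented meaning of `/proc/sys/kernel/sysrq`, where 1 enables every function and, for larger values, bit 256 allows nicing of all RT tasks.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given kernel.sysrq value permits
# the "nice all RT tasks" (sysrq-N) function.
sysrq_allows_rt_nice() {
    value=$1
    # 1 means "all functions enabled"; otherwise the value is a bitmask
    # in which 256 (0x100) covers nicing of all RT tasks.
    [ "$value" -eq 1 ] || [ $(( value & 256 )) -ne 0 ]
}
```

As root, something like `sysrq_allows_rt_nice "$(cat /proc/sys/kernel/sysrq)" || echo 1 > /proc/sys/kernel/sysrq` would enable it. Writing `n` to `/proc/sysrq-trigger` requests the same action as the key combination, which may help if the keyboard is unresponsive but a remote shell still works.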

    7. I never knew how cool SysRq was until now.

      Unfortunately, most of the time osu! locks up the system entirely and none of the shortcuts work. When the time comes when it hangs and only hangs, iamready.

  21. Thanks for 4.8-ck8.
    How is 4.9 coming along? :)

    Replies
    1. Just how impatient can someone be???

    2. Sorry,
      obviously very impatient :/
