May 28, 2014
Participants: Andy Lutomirski, Davidlohr Bueso, Greg KH, Jan Kara, Josh Triplett, and Li Zefan.
People tagged: Fengguang Wu, Ingo Molnar, Jens Axboe, Jiri Kosina, Josef Bacik, Khalid Aziz, Mel Gorman, and Peter Zijlstra.
Chris Mason
is in the middle of upgrading a bunch of systems from 2.6.38 and 3.2
to 3.10 and higher, and previous experiences of this sort have uncovered
a number of performance regressions in the 10%-30% range.
Chris expects to be done with the current effort in August, and would
like to report results to Kernel Summit.
Chris is also interested in controlling preemption from user applications
without having to move to the -rt kernel in order to reduce the high
context-switch rate, and plans to experiment with improved preemption
controls and userspace RCU.
Davidlohr Bueso
expressed interest in this topic and a willingness to present his
experience with various performance issues, as did
Jan Kara.
Greg KH
disagreed, arguing that this discussion had taken place at many Kernel Summits
over the years, but without any useful effect.
Chris Mason
countered that Intel had in fact listened and had a huge positive impact.
Chris also found it interesting that even with large performance improvements
in many areas, regressions were still the order of the day when upgrading
large workloads.
In some cases, small .config
changes took care of things,
while in other cases improvements in one area partially masked regressions
in other areas.
Chris also noted that while he could not make any promises, he hoped to
be able to tease out new benchmarks that could be run regularly.
Jan Kara
also argued that it is valuable to learn what in particular regressed
this time.
In addition, Jan said that SUSE was looking at doing more continuous
testing, so learning what others are testing would be useful.
Josh Triplett asked if the regression-triggering workloads could be automated as useful benchmarks, then added to automated patch checkers such as Fengguang Wu's 0day setup. Davidlohr Bueso liked the idea of adding automated tests to 0day, proposing some from perf-bench, but noted that some regressions are triggered by proprietary software, by unusual hardware, and by workloads that are difficult to convert to benchmarks.
Li Zefan
wondered if the regressions fixed by small .config
changes
were suppressing new features, which led him to ask if they were really
regressions.
Chris
agreed, except in the case where the default .config
choice
slowed things down.
Andy Lutomirski wondered how much of the context-switch overhead was due to the actual context switch (and any associated interrupts) and how much was due to caching effects. Andy recommended a simple benchmark for the former case.
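One common shape for such a benchmark is a pipe "ping-pong" between a parent and a child process, where each round trip forces at least two context switches. The sketch below (an illustration, not the specific benchmark Andy had in mind) measures an approximate per-switch cost on Linux; note that the number it reports also includes pipe-syscall overhead, so it only bounds the pure switch cost.

```python
import os
import time

def pingpong(iterations=2000):
    """Approximate the cost of a context switch by bouncing one byte
    between parent and child over a pair of pipes.  Each round trip
    involves at least two switches (parent -> child -> parent)."""
    p2c_r, p2c_w = os.pipe()   # parent-to-child pipe
    c2p_r, c2p_w = os.pipe()   # child-to-parent pipe
    pid = os.fork()
    if pid == 0:
        # Child: echo every byte straight back.
        for _ in range(iterations):
            os.read(p2c_r, 1)
            os.write(c2p_w, b"x")
        os._exit(0)
    start = time.perf_counter()
    for _ in range(iterations):
        os.write(p2c_w, b"x")
        os.read(c2p_r, 1)
    elapsed = time.perf_counter() - start
    os.waitpid(pid, 0)
    # Two switches per round trip; result is seconds per switch.
    return elapsed / (2 * iterations)

if __name__ == "__main__":
    print(f"approx. {pingpong() * 1e6:.2f} us per context switch")
```

Pinning both processes to the same CPU (for example with `taskset`) keeps the measurement focused on switch cost rather than cross-CPU wakeup latency, which bears on Andy's second question about caching effects.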