NONMEM Users Network Archive

Hosted by Cognigen


From: Bob Leary <bleary>
Date: Mon, 4 Dec 2006 14:06:49 -0500

Mark -
It's got nothing to do with the compiler, but rather with the non-uniform
memory access (NUMA) hardware characteristics of a cache-based machine
architecture (virtually all common machines, including PCs, use such an
architecture). NUMA means that the computational cost of a memory reference
depends on where it is in memory or one of the caches, particularly in
relation to where previous references have been. It is very much faster, for
example, to fetch a value from the L1 cache than an uncached value in main
memory. In fact, it is impossible to get a value from main memory without
first bringing it into the L1 cache, which is a fast storage area that can
feed the arithmetic units.

The fastest way to reference memory is consecutively, as in a 1-d array
x(1), x(2), x(3), ... Thus, for example, in a column-major language like
Fortran, it is usually best to operate on the columns of a matrix, since
successive column elements are consecutive in memory, rather than the rows,
where successive row elements are a fixed, non-unit distance apart. Going
along successive row elements is always slower than going down a column
(though often this cannot be avoided), but distances (strides) that are
powers of 2, or multiples of powers of 2, are particularly bad. Even for
'good' strides, the effects can be very large - for example, the nested
set of loops to add two arrays

      do i = 1, n
         do j = 1, n
            a(i,j) = b(i,j) + c(i,j)
         end do
      end do

is much slower than the same loop with the indices reversed, regardless of
how a, b, and c were originally dimensioned.

An initial memory reference to a 1-d array element x(1) needs to fetch a
value from main memory by first routing it into the L1 cache. This is slow.
But it causes a certain number of successive elements x(2), x(3), ..., x(N)
(determined by the length of the 'cache line') to be placed in the L1 cache,
from which successive references are much faster. There may be a hierarchy
of successive caches (I think PCs have 2 or 3), which get bigger but slower,
but are still faster than going out to main memory. Cache and cache line
sizes are usually powers of 2, and there are many very strong effects on
performance due to how the caches and main memory are mapped to each other
and the patterns of memory reference.

Due to the way that caches and cache lines work (see
oper/books/OrOn2_PfTune/sgi_html/ch06.html for a discussion), striding
through memory at a moderately large (e.g. 1024) power of 2 words between
memory references (or a multiple of a power of 2) can result in cache
thrashing (somewhat analogous to page faulting in disk I/O), where lots of
time is spent updating caches with values that are in fact never used.
Even smaller powers of 2 like 16 and 32 (note MAXID=10000 is a multiple
of 16) can cause significant performance degradation.

Suppose we have an array that is dimensioned x(1024,1024). Then if we access
array elements in a loop along a row, say x(1,1), x(1,2), x(1,3), ...,
x(1,1024), each successively accessed element occurs 1024 words after the
previous one, leading to thrashing. It is much, much better to dimension
x as x(1025,1024) - under some circumstances the performance difference
can be two orders of magnitude or more.

Cache thrashing can also occur even with 1-d arrays - suppose x(1024) and
y(1024) are consecutive arrays in memory and we are executing a loop like
z(i) = x(i) + y(i). We get a similar thrashing effect (though it turns out
this is not as bad as the 2-d array case).

Quoting from the document at the above Silicon Graphics web site:

"There are two ways to repair cache thrashing:

Redimension the vectors so that their size is not a power of two. A new =
size that spaces the vectors out in memory so that a(1), b(1), c(1) and =
d(1) all map to different locations in the cache is ideal. For example, =
max = 1024 x 1024 + 32 would offset the beginning of each vector 32 =
elements, or 128 bytes. This is the size of an L2 cache line, so each =
vector begins at a different cache address. All four values may now =
reside in the cache simultaneously, and complete cache line reuse is =

For two-dimensional arrays, it is sufficient to make the leading =
dimension an odd number, as in the following:

    dimension a(1024+1,1024)

For multidimensional arrays, it is necessary to change two or more =
dimensions, as in the following:

    dimension a(64+1,64+1,64)

Eliminating cache thrashing makes the loop at least 100 times faster. =

-----Original Message-----
From: Mark Sale [mailto:msale2
Sent: Friday, December 01, 2006 2:45 PM
To: Bob Leary; Nick Holford
Subject: RE: [NMusers] MAXIDS=10000

Interesting, thanks Bob, did not know that. Most of the
arrays are, I think, single dimension, and the others
are mostly X(N,N) (do those count as leading
dimension?). I know that Fortran stores arrays
differently than C, but I hadn't heard of this bug
(sounds like a bug - is it?). And true for all ANSI
77 Fortran compilers? Something else to benchmark.

--- Bob Leary <bleary

> Mark and Nick,
> If any of these buffer sizes are used internally as
> the leading dimension of a Fortran array with 2 or
> more indices, then to avoid thrashing the various
> caches it is usually a good idea to make the size an
> odd number (the point is to avoid numbers which have
> a factor of 2, or even worse 4, or even worse 8, etc.).
> An unfortunate memory access pattern in such an
> array can severely impact performance.
> Bob Leary
> -----Original Message-----
> From: owner-nmusers
> [mailto:owner-nmusers
> Mark Sale - Next Level
> Solutions
> Sent: Friday, December 01, 2006 7:31 AM
> To: Nick Holford
> Cc: nmusers
> Subject: RE: [NMusers] MAXIDS=10000
> Nick,
> You make a good point about LIM1, and the others.
> Interestingly, I
> tried this, maybe 10 years ago, thinking I might be
> able to improve
> performance (reduced I/O). The good news is that
> the memory footprint
> of NONMEM is, as you point out, really small, so
> increasing buffer sizes
> really doesn't hurt; they will all fit easily in
> memory. The bad news
> is that it didn't help at all. I was told (and it
> makes sense to me),
> that modern operating systems are very, very good at
> figuring out what
> to buffer from disc. So, even if you tell NONMEM
> only to keep a small
> part of the data, the OS keeps pretty much all of it
> in memory, when
> you have 1 GB of memory and only 800 KB of data, not a
> challenge. If you
> look at the task manager, you'll almost always see
> NONMEM at 100%,
> meaning it really isn't waiting for any disc I/O.
> But your point is good - the days when memory was
> an issue for NONMEM are long past, but my experience
> is that it doesn't make much difference, at least in
> Windows (and I'd guess Linux is at least as good at
> memory management). It may be time to try it again -
> did you benchmark your version with the large LIM1?
> Mark Sale MD
> Next Level Solutions, LLC
> > -------- Original Message --------
> > Subject: Re: [NMusers] MAXIDS=10000
> > From: Nick Holford <n.holford
> > Date: Fri, December 01, 2006 4:59 am
> > To: nmusers <nmusers
> >
> > Mark,
> >
> > Thanks for suggesting I look at LIM6.
> >
> > I've struggled again to comprehend Guide III
> Installation (the NMVI version is not changed from
> NMV).
> > These are the key words I think:
> >
> > "The size of buffer 1 is related to the number,
> LIM1, of data records stored in memory at
> > any one time. A large proportion of data sets will
> consist of no more than 400 data
> > records. Consequently, the size of buffer 1 has
> been set to allow LIM1=400 data records.
> > The least number of data records allowable must
> exceed the largest number of data
> > records used with any one individual, which rarely
> will be as large as 400."
> >
> > The size of buffer 2 has been set to allow
> LIM2=400 residual records.
> > The least number of residual records allowable
> must exceed the largest number of data
> > records used with any one individual, which rarely
> will be as large as 400.
> >
> > The size of buffer 6 has been set to allow
> LIM6=200 PREDdefined
> > records. The least number of PRED-defined records
> allowable must exceed the
> > largest number of data records used with any one
> individual, which rarely will be as large
> > as 200."
> >
> > It seems that the values for LIM1, LIM2 and LIM6
> should be no less than the maximum number of data
> records in any one individual. The way I interpret a
> data record, it means any kind of record,
> observation, dose, other event, etc. (i.e. EVID 0 to
> 4). But for LIM1 it may be helpful to increase its
> value up to the maximum total number of records in
> the data set so that as many records as possible
> stay in memory (or at least in virtual memory).
> >
> > LIM2 and LIM6 need only be the size of the largest
> number of data records used with any one individual.
> But I don't understand why the "rarely will be as
> large" example for LIM2 is 400 and for LIM6 is only
> 200.
> >
> > I increased LIM1 to 2500000 (I have just under 2.5
> million recs) but Windows 2003 Server with 1 GB RAM
> wouldn't start the executable - I suspect because it
> wanted too much initial memory. However with
> LIM1=1000000 and LIM2 and LIM6 set to 5000 (I have up to
> 3200 recs/subject) then NONMEM started. The NONMEM
> executable uses 90 MB of actual memory and 711 MB of
> virtual memory and there are no page faults reported
> by the Task Manager. I assume this means that most
> of the data is in actual memory.
> >
> > There doesn't seem to be any good reason to have
> LIM1, LIM2 and LIM6 set to the same small default
> values of 400. I think LIM2 and LIM6 should usually
> be the same and LIM1 some multiple of LIM2 (or LIM6)
> reflecting the number of individuals in the typical
> data set.
> >
> > C Altered on installation by NMQual (copyright
> 2006.12.01.2124
> > C PARAMETER (LIM1=400)
> > PARAMETER (LIM1=1000000)
> > C Altered on installation by NMQual (copyright
> 2006.12.01.2124
> > C PARAMETER (LIM2=400)
> > PARAMETER (LIM2=5000)
> > PARAMETER (LIM3=200)
> > PARAMETER (LIM5=200)
> > C Altered on installation by NMQual (copyright
> 2006.12.01.2124
> > C PARAMETER (LIM6=400)
> > PARAMETER (LIM6=5000)
> >
> > Mark Sale - Next Level Solutions wrote:
> > >
> > > Nick,
> > > There is a buffer 6 size parameter (LIM6) in
> NSIZES in v5 and in
> > > SIZES in v6. I don't recall what array(s) are
> dimensioned with this.
> > > But, looks like the default size in v5 is 600
> and default is 400 in v6.
> > > Another upward compatibility problem, perhaps.
> Do you have an
> > > individual with between 400 and 600
> observations?
> > >
> > > Mark Sale MD
> > > Next Level Solutions, LLC
> > >
> > >
> >
> > --
> > Nick Holford, Dept Pharmacology & Clinical
> Pharmacology
> > University of Auckland, 85 Park Rd, Private Bag
> 92019, Auckland, New Zealand
> > email:n.holford
> tel:+64(9)373-7599x86730 fax:373-7556
> >

Received on Mon Dec 04 2006 - 14:06:49 EST

The NONMEM Users Network is maintained by ICON plc. Requests to subscribe to the network should be sent to:

Once subscribed, you may contribute to the discussion by emailing: