Return of the deadlock-at-exit problem

Return of the deadlock-at-exit problem

Julian Seward-2

Just before 2.4.0 went out, there was a long discussion about how
thread exiting should work, and possible deadlocking that could
result.  In the end we settled on an inherently deadlockful scheme
(master thread waits for everybody else) but modified the getppid
wrapper so as to sidestep the deadlocks.

Unfortunately the problem is back, in a hard-to-reproduce way.  It
afflicts both 2.4.0 and the 3 line.  Reproducing it requires
a machine with a Quadrics Elan3 network card and the relevant
user-space driver and (presumably) kernel module.

When a program using this driver starts up, it creates a child
thread using clone.  No problem.  The child hangs around and
basically doesn't do anything much (purpose is unclear, but that
doesn't matter).  It calls a custom ioctl which communicates with
the Elan3 kernel module.  The ioctl doesn't return until (I assume)
the parent thread tells the kernel module that it is done with the
card.  The ioctl returns and the child exits.

Hence the child waits for the parent to exit, then exits itself.

Running on V, the result is a deadlock at exit since now we also
have that the parent, being the master thread, is waiting for the
child to exit.
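
The circular wait described above can be sketched as a tiny wait-for
graph.  This is only an illustrative model, not Valgrind code: the names
has_wait_cycle, PARENT and CHILD are made up for the example, and each
thread is assumed to be blocked on at most one other thread.

```c
#include <stddef.h>

/* waits_for[i] is the index of the thread that thread i is blocked on,
 * or -1 if thread i is free to make progress.  Hypothetical model only. */
int has_wait_cycle(const int *waits_for, size_t n)
{
    for (size_t start = 0; start < n; start++) {
        int cur = waits_for[start];
        size_t steps = 0;
        /* Follow the chain of waits; n steps are enough to revisit start. */
        while (cur >= 0 && (size_t)cur < n && steps < n) {
            if ((size_t)cur == start)
                return 1;               /* start is waiting on itself */
            cur = waits_for[cur];
            steps++;
        }
    }
    return 0;
}

enum { PARENT = 0, CHILD = 1 };

/* "Master thread waits for everybody else" plus a child whose ioctl
 * only returns once the parent has gone. */
int elan_exit_deadlocks_under_master_wait(void)
{
    int waits_for[2];
    waits_for[PARENT] = CHILD;
    waits_for[CHILD]  = PARENT;
    return has_wait_cycle(waits_for, 2);
}

/* Natively the parent simply exits, so only the child waits. */
int elan_exit_deadlocks_natively(void)
{
    int waits_for[2];
    waits_for[PARENT] = -1;
    waits_for[CHILD]  = PARENT;
    return has_wait_cycle(waits_for, 2);
}
```

In this model the master-waits scheme produces a cycle and the native
case does not, which matches the behaviour reported above.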

Suggestions on how to fix this?

I've been playing with a hacked version of the head, which implements
the "last-one-out" strategy we discussed before.  I haven't got it
working reliably yet, though.

J


_______________________________________________
Valgrind-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/valgrind-developers

Re: Return of the deadlock-at-exit problem

jeremy (Bugzilla)
Julian Seward wrote:

>When a program using this driver starts up, it creates a child
>thread using clone.  No problem.  The child hangs around and
>basically doesn't do anything much (purpose is unclear, but that
>doesn't matter).  It calls a custom ioctl which communicates with
>the Elan3 kernel module.  The ioctl doesn't return until (I assume)
>the parent thread tells the kernel module that it is done with the
>card.  The ioctl returns and the child exits.
>
>Hence the child waits for the parent to exit, then exits itself.
>  
>
How does the parent thread tell the kernel it is done with the module?
By closing all the file descriptors?  It must be some implicit
mechanism, because if it did an explicit ioctl() or something, we would
do the same.

    J




Re: Return of the deadlock-at-exit problem

Ashley Pittman
On Wed, 2005-05-25 at 11:36 -0700, Jeremy Fitzhardinge wrote:

> Julian Seward wrote:
>
> >When a program using this driver starts up, it creates a child
> >thread using clone.  No problem.  The child hangs around and
> >basically doesn't do anything much (purpose is unclear, but that
> >doesn't matter).  It calls a custom ioctl which communicates with
> >the Elan3 kernel module.  The ioctl doesn't return until (I assume)
> >the parent thread tells the kernel module that it is done with the
> >card.  The ioctl returns and the child exits.
> >
> >Hence the child waits for the parent to exit, then exits itself.
> >  
> >
> How does the parent thread tell the kernel it is done with the module?
> By closing all the file descriptors?  It must be some implicit
> mechanism, because if it did an explicit ioctl() or something, we would
> do the same.

It's somewhat complicated...

The parent thread calls elan3_detach (an ioctl) and the device driver
sets some state and wakes up the kernel thread sitting in the lwp ioctl.
This thread then returns done and the lwp exits.  Other than that the
lwp only returns to user-space to take signals.

That's the theory anyway, it's complicated by the fact that we have
kernel patches (not just modules) to provide "ptrack" functionality,
basically the job starts in a container and when the job finishes all
processes (and sys-v stuff) created in that container also get
destroyed.

If you are using rms/pdsh/slurm to start jobs then you will be using the
ptrack code (it's done by the open source "rms" kernel module), if you
are just running your programs by hand then you won't have the ptrack
stuff.

The lwp's purpose is to make syscalls on behalf of the NIC: the C code on
the NIC sets up a descriptor and generates an interrupt, which the lower
half forwards on to the lwp kernel thread.  This thread then makes syscalls
back into the kernel from the top half as the appropriate user with
suitable permissions.

Ashley,



Re: Return of the deadlock-at-exit problem

jeremy (Bugzilla)
Ashley Pittman wrote:

>It's somewhat complicated...
>  
>
Er, yep.

>the parent thread calls elan3_detach (an ioctl) and the device driver
>sets some state and wakes up the kernel thread sitting in the lwp ioctl.
>This thread then returns done and the lwp exits.  Other than that the
>lwp only returns to user-space to take signals.
>  
>
So what makes it return done?  What triggers that event?

>That's the theory anyway, it's complicated by the fact that we have
>kernel patches (not just modules) to provide "ptrack" functionality,
>basically the job starts in a container and when the job finishes all
>processes (and sys-v stuff) created in that container also get
>destroyed.
>  
>
Is this some extra kernel state which Valgrind needs to understand to do
a correct emulation?  How are these containers created?  In this case,
would the program running under valgrind create a new container which is
expected to mop up all the threads when the main thread exits?  How is a
"job" defined?

>If you are using rms/pdsh/slurm to start jobs then you will be using the
>ptrack code (it's done by the open source "rms" kernel module), if you
>are just running your programs by hand then you won't have the ptrack
>stuff.
>
>Its purpose is to make syscalls on behalf of the nic, the c code on the
>nic sets up a descriptor, generates an interrupt which the lower half
>forwards onto the lwp kernel thread.  This thread then makes syscalls
>back into the kernel from the top half as the appropriate user with
>suitable permissions.
>  
>
Hm, I think I follow, but I don't see at what point it depends on the
initial thread terminating before one of the child threads (or when they
should all exit together).

In the 2.6 NPTL thread model, exit_group() terminates all threads in the
thread group atomically, so there's no waiting around for things to
terminate (or dependence on termination order).  Is this running in a
2.4 thread model, or a 2.6 one?  It sounds like the container machinery
has an atomic group termination property similar to exit_group().

    J





Re: Return of the deadlock-at-exit problem

Ashley Pittman
On Sat, 2005-05-28 at 13:17 -0700, Jeremy Fitzhardinge wrote:

> Ashley Pittman wrote:
>
> >It's somewhat complicated...
> >  
> >
> Er, yep.
>
> >the parent thread calls elan3_detach (an ioctl) and the device driver
> >sets some state and wakes up the kernel thread sitting in the lwp ioctl.
> >This thread then returns done and the lwp exits.  Other than that the
> >lwp only returns to user-space to take signals.
> >  
> >
> So what makes it return done?  What triggers that event?

Either the elan3_detach ioctl or the close of the fd at program exit
causes a bit to be set and the extra thread then wakes up, notices the
bit, returns to user-space where the thread exits.  The code in question
looks like this:

        if (--ctxt->LwpCount != 0) /* Still other LWPs running */
        {
            spin_unlock_irqrestore (&dev->IntrLock, flags);
            return;
        }

        kcondvar_wakeupall (&ctxt->LwpWait, &dev->IntrLock); /* Wakeup anyone waiting on LwpCount */

I'm not really a kernel programmer though; I can go over it again or
forward this on to someone who understands it better if you need more
detail.

I'd be surprised if many programs actually call elan3_detach() though;
there are no hooks from MPI_Finalize, so it probably never gets
called.

> >That's the theory anyway, it's complicated by the fact that we have
> >kernel patches (not just modules) to provide "ptrack" functionality,
> >basically the job starts in a container and when the job finishes all
> >processes (and sys-v stuff) created in that container also get
> >destroyed.
> >  
> >
> Is this some extra kernel state which Valgrind needs to understand to do
> a correct emulation?

Possibly but hopefully not.

> How are these containers created?  In this case,
> would the program running under valgrind create a new container which is
> expected to mop up all the threads when the main thread exits?  How is a
> "job" defined?

These containers are created by the rms kernel module (the kernel
module is open-source; RMS the application is not, though there is
open-source software which uses the kernel module).  Typically to run a
"job" over, say, four cpus you would type "prun -n4 mping", which would
start four programs, each of which would be expected to call elan_init().
Each program in this job would have a "vp" or virtual process number from
0 to N-1 (in MPI terms this is called "rank").  Each of these four
processes is kept inside its own container, and the rms kernel module
keeps track of any child processes and/or sys-v objects made and ensures
that they are all torn down properly at program exit.

There is another way of running programs outside of this mechanism
though, it's kind of messy and we don't recommend using it for anything
other than fine-grained bug-hunting but it does work so I suspect the
above may be a red herring.

> In the 2.6 NPTL thread model, exit_group() terminates all threads in the
> thread group atomically, so there's no waiting around for things to
> terminate (or dependence on termination order).  Is this running in a
> 2.4 thread model, or a 2.6 one?  It sounds like the container machinery
> has an atomic group termination property similar to exit_group().

It does sound similar, it works across child programs though, not just
thread groups.  Probably not relevant to this bug however.

Going back to the original question, the thread should implicitly be
woken and then die when the parent thread terminates, hence the deadlock
if the parent thread isn't exiting.  How does V work WRT any other
blocking syscall being in progress at program exit?

Ashley,



Re: Return of the deadlock-at-exit problem

jeremy (Bugzilla)
Ashley Pittman wrote:

>I'd be surprised if many programs actually call elan3_detach() though,
>there are no hooks from MPI_Finalize through so it probably never gets
>called.
>  
>
So it's probably the result of an explicit close()?

>>In the 2.6 NPTL thread model, exit_group() terminates all threads in the
>>thread group atomically, so there's no waiting around for things to
>>terminate (or dependence on termination order).  Is this running in a
>>2.4 thread model, or a 2.6 one?  It sounds like the container machinery
>>has an atomic group termination property similar to exit_group().
>>    
>>
>
>It does sound similar, it works across child programs though, not just
>thread groups.  Probably not relevant to this bug however.
>
>Going back to the original questions, the thread should implicitly be
>woken and then die when the parent thread terminates, hence the deadlock
>if the parent thread isn't exiting.  How does V work WRT any other
>blocking syscall being in progress at program exit?
>
If a thread calls exit_group(), Valgrind hits any thread blocked in a
syscall with a signal to get it out of the kernel, and tells all threads
to terminate; once they're all dead the process exits. Normally this
happens more or less instantaneously, but if a thread refuses to come
out of the kernel for some reason it will hold things up.  That's the
2.6/NPTL thread model.
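
As an illustration of the "hit it with a signal" step, the sketch below
shows a blocked read() being kicked out of the kernel with EINTR.  It is
not Valgrind's code: a repeating SIGALRM timer stands in for the signal
Valgrind would send, and the function names are invented for the
example.  The handler is installed without SA_RESTART so that the
syscall is not transparently restarted.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

/* Empty handler: its only job is to interrupt the blocked syscall. */
static void wake_handler(int sig)
{
    (void)sig;
}

/* Blocks in read() on a pipe nobody writes to, and lets a timer signal
 * interrupt it.  Returns the errno seen by read(), 0 if read() did not
 * fail, or -1 if the setup itself failed. */
int read_interrupted_by_signal(void)
{
    int fds[2];
    struct sigaction sa, old_sa;
    struct itimerval timer, off;
    char byte;
    ssize_t n;
    int saved_errno;

    if (pipe(fds) != 0)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = wake_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;                    /* deliberately no SA_RESTART */
    if (sigaction(SIGALRM, &sa, &old_sa) != 0)
        return -1;

    /* Fire every 50 ms, so a signal that arrives before read() has
     * actually blocked is followed by another one. */
    memset(&timer, 0, sizeof timer);
    timer.it_value.tv_usec = 50000;
    timer.it_interval.tv_usec = 50000;
    if (setitimer(ITIMER_REAL, &timer, NULL) != 0)
        return -1;

    n = read(fds[0], &byte, 1);         /* blocks until a signal arrives */
    saved_errno = errno;

    memset(&off, 0, sizeof off);        /* disarm the timer and clean up */
    setitimer(ITIMER_REAL, &off, NULL);
    sigaction(SIGALRM, &old_sa, NULL);
    close(fds[0]);
    close(fds[1]);

    return (n == -1) ? saved_errno : 0;
}
```

A syscall that sleeps uninterruptibly in the kernel would not return
this way, which is the "refuses to come out of the kernel" case.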

In the 2.4/LinuxThreads case, the threads library coordinates the
process termination by getting each thread to explicitly call exit().
There are some tricky edge cases depending on whether the manager thread
or the initial thread is the last to exit.  Again, Valgrind only exits
once all threads have terminated.

Now, your elan thread is created by a native clone() rather than via
pthread_create, right?  Are you creating the thread in the same thread
group as the rest of the program, or in a separate thread group?  If the
main program terminates with exit_group, but the elan thread is not in
the thread group, then Valgrind will not attempt to kill it, but will
still wait around for it to exit; if the elan thread is waiting for the
Valgrind thread to exit, then we're in a deadlock.  I guess that's
what's happening.  There are two fairly easy solutions:

   1. change the elan driver to create the thread in the same thread
      group as the rest of the process, so exit_group() does the
      expected thing, or
   2. hack exit_group() so it just kills all threads in the process
      rather than just the thread group

Option 2 might be preferred.  It isn't strictly correct, but using
multiple thread groups within a process is pretty rare, except in the
degenerate case where every thread is in its own group (as you get with
LinuxThreads).  You could make it a --weird-hack
(exit-nukes-everything?) specifically for this case.

Or perhaps the alternative is to explicitly get the elan thread to
terminate as part of the program's cleanup/shutdown actions (i.e., do the
appropriate call in MPI_Finalize).

    J





Re: Return of the deadlock-at-exit problem

Ashley Pittman
On Wed, 2005-06-15 at 12:24 -0700, Jeremy Fitzhardinge wrote:
> Ashley Pittman wrote:
>
> >I'd be surprised if many programs actually call elan3_detach() though,
> >there are no hooks from MPI_Finalize through so it probably never gets
> >called.
> >  
> >
> So it's probably the result of an explicit close()?

Probably the result of an implicit close().  Very few programs call
detach or close, so it will come from the fd being closed on program
teardown.

> >Going back to the original questions, the thread should implicitly be
> >woken and then die when the parent thread terminates, hence the deadlock
> >if the parent thread isn't exiting.  How does V work WRT any other
> >blocking syscall being in progress at program exit?
> >
> If a thread calls exit_group(), Valgrind hits any thread blocked in a
> syscall with a signal to get it out of the kernel, and tells all threads
> to terminate; once they're all dead the process exits. Normally this
> happens more or less instantaneously, but if a thread refuses to come
> out of the kernel for some reason it will hold things up.  That's the
> 2.6/NPTL thread model.

This should work: any signal will cause it to return to userspace
briefly, and if it's SIGTERM then the thread should exit whilst it's
there.

> In the 2.4/LinuxThreads case, the threads library coordinates the
> process termination by getting each thread to explicitly call exit().
> There are some tricky edge cases depending on whether the manager thread
> or the initial thread is the last to exit.  Again, Valgrind only exits
> once all threads have terminated.

This isn't going to happen until what I guess you are referring to as
the initial thread has exited; couldn't deadlock also happen this way?
How exactly does it do this coordination?

> Now, your elan thread is created by a native clone() rather than via
> pthread_create, right?  Are you creating the thread in the same thread
> group as the rest of the program, or in a separate thread group?

I'm not familiar enough with low-level threads to tell; I assume it's in
the same thread group, as we don't do anything special to request its
own.  Here is the code in question:

      if ((res = __clone (elan3_lwp, stack + ELANLWP_STACK_SIZE,
                          CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND,
                          (void *) ctx)) == -1)

> If the
> main program terminates with exit_group, but the elan thread is not in
> the thread group, then Valgrind will not attempt to kill it, but will
> still wait around for it to exit; if the elan thread is waiting for the
> Valgrind thread to exit, then we're in a deadlock.  I guess that's
> what's happening.  There are two fairly easy solutions:
>
>    1. change the elan driver to create the thread in the same thread
>       group as the rest of the process, so exit_group() does the
>       expected thing, or
>    2. hack exit_group() so it just kills all threads in the process
>       rather than just the thread group
>
> Option 2 might be preferred.  It isn't strictly correct, but using
> multiple thread groups within a process is pretty rare, except in the
> degenerate case where every thread is in its own group (as you get with
> LinuxThreads).

I can certainly try option 1 if required, but option 2 would be
preferred.  Generally people with sizable clusters value stability
above all else, and lead times for pushing out new software releases can
be extensive.

> You could make it a --weird-hack
> (exit-nukes-everything?) specifically for this case.

There is already a command line option to allow this thread to be
created so extending it to cover this shouldn't cause any problems.

I assume Valgrind itself runs in the same thread as (and shares fds
with) the main thread of the host program?

> Or perhaps the alternative is to explicitly get the elan thread to
> terminate as part of the programs cleanup/shutdown actions (ie, do the
> appropriate call in MPI_Finalize).

This isn't a very good solution; it would rely on programs being
"well-behaved" for V to run correctly.  There is no equivalent for shmem
(the CRAY api) and the elan_fini() function is only used in a
scattering of cases.

Ashley,



Re: Return of the deadlock-at-exit problem

Julian Seward-2

Jeremy, Ashley,

I appreciate you both looking into this.  I'm unclear as to whether
you grokked that I changed the exit semantics in the 3 line a couple
of weeks back to use the "last-one-out-turn-out-the-lights" semantics.
As a result (following some further GDT-copying entertainment) the
Elan3 driver now runs fine on Valgrind, and nothing else appears to
be broken as a result.
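
For readers unfamiliar with the idea, "last-one-out-turn-out-the-lights"
can be sketched as an atomic count of live threads: each exiting thread
decrements it, and whichever thread takes it to zero does the final
shutdown work.  This is a minimal model under that assumption, not the
actual Valgrind change; exit_tracker and its functions are hypothetical
names.

```c
#include <stdatomic.h>

/* Hypothetical bookkeeping: how many threads are still alive. */
typedef struct {
    atomic_int live_threads;
} exit_tracker;

void tracker_init(exit_tracker *t, int nthreads)
{
    atomic_init(&t->live_threads, nthreads);
}

/* Called by each thread as it exits.  Returns 1 only for the thread
 * that takes the count from 1 to 0; that thread is "last one out" and
 * is the one expected to run the final cleanup.  No thread ever waits
 * for a particular other thread, so exit order does not matter. */
int tracker_thread_exit(exit_tracker *t)
{
    return atomic_fetch_sub(&t->live_threads, 1) == 1;
}
```

With three threads, the first two calls return 0 and the third returns
1, whichever threads happen to make them.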

J


On Thursday 16 June 2005 10:56, Ashley Pittman wrote:

> On Wed, 2005-06-15 at 12:24 -0700, Jeremy Fitzhardinge wrote:
> > Ashley Pittman wrote:
> > >I'd be surprised if many programs actually call elan3_detach() though,
> > >there are no hooks from MPI_Finalize through so it probably never gets
> > >called.
> >
> > So it's probably the result of an explicit close()?
>
> Probably the result of an implicit close().  Very few programs call
> detach or close so it will come from the fd being closed on program
> teardown.
>
> > >Going back to the original questions, the thread should implicitly be
> > >woken and then die when the parent thread terminates, hence the deadlock
> > >if the parent thread isn't exiting.  How does V work WRT any other
> > >blocking syscall being in progress at program exit?
> >
> > If a thread calls exit_group(), Valgrind hits any thread blocked in a
> > syscall with a signal to get it out of the kernel, and tells all threads
> > to terminate; once they're all dead the process exits. Normally this
> > happens more or less instantaneously, but if a thread refuses to come
> > out of the kernel for some reason it will hold things up.  That's the
> > 2.6/NPTL thread model.
>
> This should work, any signal will cause it to return to userspace
> briefly and if it's sigterm then the thread should exit whilst it's
> there.
>
> > In the 2.4/LinuxThreads case, the threads library coordinates the
> > process termination by getting each thread to explicitly call exit().
> > There are some tricky edge cases depending on whether the manager thread
> > or the initial thread is the last to exit.  Again, Valgrind only exits
> > once all threads have terminated.
>
> This isn't going to happen until what I guess you are referring to as
> the initial thread has exited, couldn't deadlock also happen this way?
> How exactly does it do this coordination?
>
> > Now, your elan thread is created by a native clone() rather than via
> > pthread_create, right?  Are you creating the thread in the same thread
> > group as the rest of the program, or in a separate thread group?
>
> I'm not familiar enough with low-level threads to tell, I assume it's in
> the same thread group as we don't do anything special to request its
> own.  Here is the code in question:
>
>       if ((res = __clone (elan3_lwp, stack + ELANLWP_STACK_SIZE,
>                           CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND,
>                           (void *) ctx)) == -1)
>
> > If the
> > main program terminates with exit_group, but the elan thread is not in
> > the thread group, then Valgrind will not attempt to kill it, but will
> > still wait around for it to exit; if the elan thread is waiting for the
> > Valgrind thread to exit, then we're in a deadlock.  I guess that's
> > what's happening.  There are two fairly easy solutions:
> >
> >    1. change the elan driver to create the thread in the same thread
> >       group as the rest of the process, so exit_group() does the
> >       expected thing, or
> >    2. hack exit_group() so it just kills all threads in the process
> >       rather than just the thread group
> >
> > Option 2 might be preferred.  It isn't strictly correct, but using
> > multiple thread groups within a process is pretty rare, except in the
> > degenerate case where every thread is in its own group (as you get with
> > LinuxThreads).
>
> I can certainly try option 1 if required but option 2 would be
> preferred.  Generally people with sizable clusters regard stability
> above all else and lead times to pushing new software releases on can be
> extensive.
>
> > You could make it a --weird-hack
> > (exit-nukes-everything?) specifically for this case.
>
> There is already a command line option to allow this thread to be
> created so extending it to cover this shouldn't cause any problems.
>
> I assume Valgrind itself runs in the same thread as (and shares fds
> with) the main thread of the host program?
>
> > Or perhaps the alternative is to explicitly get the elan thread to
> > terminate as part of the programs cleanup/shutdown actions (ie, do the
> > appropriate call in MPI_Finalize).
>
> This isn't a very good solution, it would rely on programs being
> "well-behaved" for V to run correctly.  There is no equivalent for shmem
> (the CRAY api) and the elan_fini() function is only used in a
> scattering of cases.
>
> Ashley,

Re: Return of the deadlock-at-exit problem

Ashley Pittman
On Thu, 2005-06-16 at 12:04 +0100, Julian Seward wrote:
> Jeremy, Ashley,
>
> I appreciate you both looking into this.  I'm unclear as to whether
> you grokked that I changed the exit semantics in the 3 line a couple
> of weeks back to use the "last-one-out-turn-out-the-lights" semantics.
> As a result (following some further GDT-copying entertainment) the
> Elan3 driver now runs fine on Valgrind, and nothing else appears to
> be broken as a result.

No, I hadn't spotted this, I've been away for a couple of weeks and am
still catching up.  I'm glad this is all working now.

Is it true to say that elan3 over Valgrind is ready for prime-time now?

Ashley,



Re: Return of the deadlock-at-exit problem

jeremy (Bugzilla)
In reply to this post by Ashley Pittman
Ashley Pittman wrote:

>Probably the result of an implicit close().  Very few programs call
>detach or close so it will come from the fd being closed on program
>teardown.
>  
>
There won't be an implicit close until the process actually exits
though, so this isn't relevant to the "won't exit under Valgrind" problem.

>>In the 2.4/LinuxThreads case, the threads library coordinates the
>>process termination by getting each thread to explicitly call exit().
>>There are some tricky edge cases depending on whether the manager thread
>>or the initial thread is the last to exit.  Again, Valgrind only exits
>>once all threads have terminated.
>>    
>>
>
>This isn't going to happen until what I guess you are referring to as
>the initial thread has exited, couldn't deadlock also happen this way?
>How exactly does it do this coordination?
>  
>
Well, the client's initial thread may exit, but Valgrind itself will
keep using that thread to wait for other threads to exit; if the elan
thread is waiting for the initial thread to exit, there's the deadlock.

>      if ((res = __clone (elan3_lwp, stack + ELANLWP_STACK_SIZE,
>                          CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND,
>                          (void *) ctx)) == -1)
>  
>
Counter-intuitively, you need to request to be part of the thread group,
with CLONE_THREAD, so this call will create a new thread group.
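
A small experiment along these lines shows the effect.  It is a sketch
rather than the elan code: it uses glibc's clone() wrapper with the same
four flags, adds SIGCHLD as the exit signal purely so the parent can
waitpid() for the child (an assumption not present in the quoted call),
and the function names and stack size are made up.  The child records
the process ID the kernel reports for it, which is its thread group ID.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define DEMO_STACK_SIZE (64 * 1024)     /* arbitrary size for the demo */

/* Runs in the cloned task: store the thread group ID the kernel
 * reports.  The raw syscall avoids any caching in the C library. */
static int record_tgid(void *arg)
{
    *(volatile pid_t *)arg = (pid_t)syscall(SYS_getpid);
    return 0;
}

/* Returns 1 if a task cloned without CLONE_THREAD ends up in a thread
 * group of its own, 0 if it shares the caller's, -1 on setup failure. */
int clone_without_clone_thread_makes_new_group(void)
{
    volatile pid_t child_tgid = 0;
    char *stack = malloc(DEMO_STACK_SIZE);
    pid_t child;
    int status;

    if (stack == NULL)
        return -1;

    /* The stack grows down, so pass the top of the allocation. */
    child = clone(record_tgid, stack + DEMO_STACK_SIZE,
                  CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD,
                  (void *)&child_tgid);
    if (child == -1) {
        free(stack);
        return -1;
    }
    if (waitpid(child, &status, 0) != child) {
        free(stack);
        return -1;
    }
    free(stack);

    /* CLONE_VM means the child's store is visible here. */
    return child_tgid == child && child_tgid != getpid();
}
```

Adding CLONE_THREAD to the flags would place the new task in the
caller's thread group instead, in which case it could no longer be
reaped with waitpid().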

    J




Re: Return of the deadlock-at-exit problem

jeremy (Bugzilla)
In reply to this post by Julian Seward-2
Julian Seward wrote:

>Jeremy, Ashley,
>
>I appreciate you both looking into this.  I'm unclear as to whether
>you grokked that I changed the exit semantics in the 3 line a couple
>of weeks back to use the "last-one-out-turn-out-the-lights" semantics.
>As a result (following some further GDT-copying entertainment) the
>Elan3 driver now runs fine on Valgrind, and nothing else appears to
>be broken as a result.
>

I still think that's a bad idea in general.  It probably won't make much
practical difference in normal NPTL programs, since all threads exit
simultaneously on exit_group, and if the initial thread is the one doing
the exit(), it will generally be the last one out.  But if you have a
program which exit()s from another thread, and you use leak checking or
some other time-consuming post-exit processing, there'll be a window in
which the program has appeared to exit but output hasn't been produced.

    J




Re: Return of the deadlock-at-exit problem

Julian Seward-2

> >of weeks back to use the "last-one-out-turn-out-the-lights" semantics.
>
> I still think that's a bad idea in general.

I know.  I do appreciate the design tradeoffs here and I think it's
a difficult call.  Perhaps last-one-out will indeed fare badly in
practice; we shall see -- if that's so, a Plan C will have to be
devised.

J

