Segfault at exit

Segfault at exit

Jim Henderson-4
As I prepare to take some time off work over the holidays, I thought I
might try to track down a minor issue that's been bugging me for a little
bit now - when I exit Pan, I get a segfault, and it appears to happen at
a time that stops the app from writing all of its "last read" information
out.

It's been happening for a while now, with various builds, using the data
I have right now.  I'm guessing that the issue isn't a code bug, but
rather something wrong in my data files.  I'd rather not recreate my
config from scratch, so I'm trying to track down the problematic config
file so I can fix it.

Call it a "foxhunt" if you like. Something of a puzzle to be solved. :)

I built a debug build and did a backtrace in gdb, and it points to pan.cc
line 1140.  That seems to tie to the process of freeing up secured
passwords in memory, so I thought it might be something in servers.xml,
but I don't see anything obvious in that file that's a problem (other
than perhaps a missing CR/LF at the end of the file, but I tried adding
that and the behavior didn't change).

Any ideas on where I should start?

Jim
--
 Jim Henderson
 Please keep on-topic replies on the list so everyone benefits


_______________________________________________
Pan-users mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/pan-users

Re: Segfault at exit

Duncan-42
Jim Henderson posted on Thu, 22 Dec 2016 23:23:09 +0000 as excerpted:

> As I prepare to take some time off work over the holidays, I thought I
> might try to track down a minor issue that's been bugging me for a
> little bit now - when I exit Pan, I get a segfault, and it appears to
> happen at a time that stops the app from writing all of its "last read"
> information out.

While I don't see your segfault (which does support your theory that it's
a problem with your config), pan does have a workaround for crashes
preventing read-message, etc, writeouts, that I've been using for years
now.  IIRC I reinforced the habit (which I /think/ I had even before
that, but that reinforced it) back when I was having problems not with
pan itself, but with xorg and kwin, back when the composite extension
(and kde/kwin's use thereof) was new and had a leak that would regularly
crash xorg... of course taking pan down along with it, without a chance
to write out its current state.  Yes, it was /that/ long ago.
Anyway, by doing this, I kept the lost data to some level I considered
reasonably manageable.

The key here is to realize that pan writes out the per-newsgroup data
when it switches groups, so in busy groups with enough unread messages
that I didn't want to risk losing at least semi-current read-messages
tracking, I developed the habit of deliberately clicking to some other
group and back every N messages or so.  On busy mostly binary groups with
thousands of unread messages, N would be 500 messages or so, a couple
times per thousand messages; on more technical groups where I went
slower, N might be 50 or 100 messages.

That seemed to address the problem rather nicely, even in the above case
where it wasn't pan, but X, taking pan with it, that was crashing.  I
keep up with the habit today, tho I'm probably less religious about it
than I was when X was regularly crashing, so one might say pan has
trained me well. =:^)

Of course if you find your specific problem and develop a patch to
prevent pan crashing, nobody's going to complain, but this should make
dealing with the problem a bit easier than it was.  Just switch
newsgroups immediately before quitting pan, and/or do the periodic switch-
out-and-back that I've learned to do, if pan (or X) is crashing somewhat
unpredictably on you.

As for the last few messages since the last group-switch, at least in
groups where you download to read on-demand and where cache size isn't a
factor (thus more commonly in text groups), it's typically easy to see
where you were even if pan crashed and lost the read-status, because
those messages will already be downloaded, while the others aren't.

> It's been happening for a while now, with various builds, using the data
> I have right now.  I'm guessing that the issue isn't a code bug, but
> rather something wrong in my data files.  I'd rather not recreate my
> config from scratch, so I'm trying to track down the problematic config
> file so I can fix it.
>
> Call it a "foxhunt" if you like. Something of a puzzle to be solved. :)
>
> I built a debug build and did a backtrace in gdb, and it points to
> pan.cc line 1140.  That seems to tie to the process of freeing up
> secured passwords in memory, so I thought it might be something in
> servers.xml, but I don't see anything obvious in that file that's a
> problem (other than perhaps a missing CR/LF at the end of the file, but
> I tried adding that and the behavior didn't change).
>
> Any ideas on where I should start?

First thing I'd do is isolate whether the problem only triggers when pan
connects to the server, and not when you're local-only.  Either set pan
offline (I'm not sure whether that setting sticks across pan restarts),
or if necessary, toggle the "get new headers" options for startup and for
entering a group (in preferences, behavior, groups) to OFF, so you can
start pan and switch groups without pan fetching headers.  Then restart
pan, browse some already locally cached headers and messages without
doing anything that triggers pan to connect to the server, and see if
the problem still occurs.

That alone should help confirm whether it's password and/or server
related.

Then, if it's still happening when pan isn't network connecting, it makes
this next step easier.

Use the old bisect method on pan's data dir, first ensuring that the
problem disappears with a clean config, then try bisecting the problem
down to a single file, testing a theoretical half of the problem space at
a time.  Of course this is dramatically easier if the test set of files
remain static, thus the reason to test without network activity and
downloads going on, if possible.
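
To make the halving concrete, here is a rough Python sketch of that
bisect loop.  It is only an illustration: crashes() is a hypothetical
callback you would write yourself (copy the given files from the old data
dir over a clean config, run pan, report whether it segfaulted at exit),
and the loop assumes a single bad file and a crash that reproduces every
time.

```python
from typing import Callable, Sequence


def bisect_files(files: Sequence[str],
                 crashes: Callable[[Sequence[str]], bool]) -> str:
    """Narrow a list of suspect files down to the one that triggers the crash.

    `crashes(subset)` is a hypothetical, user-supplied test: it should
    return True if the crash occurs when only `subset` is taken from the
    old config and everything else comes from a clean one.
    Assumes exactly one file is at fault and the crash is reproducible.
    """
    candidates = list(files)
    if not candidates:
        raise ValueError("no files to test")
    # Sanity check: the full set must reproduce the problem.
    if not crashes(candidates):
        raise ValueError("crash does not reproduce with the full set")

    while len(candidates) > 1:
        half = len(candidates) // 2
        first, second = candidates[:half], candidates[half:]
        # Test one half; if it is clean, the culprit is in the other half.
        candidates = first if crashes(first) else second
    return candidates[0]
```

After the initial confirmation run, this needs roughly log2(N) test runs
for N files, so around 30 files take about five restarts of pan.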

I'd probably test right away with a clean article cache, to see if it's
that, and particularly if the problem only happens with a server
connection, I'd test right away with a clean ssl_certs dir and let pan
redownload the certs, of course comparing them to the previous certs
manually.

You've long since updated pan and the certs from back when pan was
writing corrupt binary files instead of the ascii-based cert files it
should have been writing and writes in current versions, right?

The groups dir and newsrc files are also suspect since it's writing them
out that's failing.

And IIRC I had a problem with a corrupt tasks.nzb at one point, tho that
should be regularly updated, so I wouldn't expect it to be the problem in
this case as it has been an ongoing problem for you for some time, and
that was a more urgent "pan won't work at all" problem for me, when it
got corrupted.
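
Since servers.xml and tasks.nzb are both XML (nzb is an XML format), one
cheap first pass is to check that they still parse at all.  A minimal
Python sketch, assuming you pass it the paths yourself; it only catches
syntax-level corruption, not values that are well-formed but wrong:

```python
import xml.etree.ElementTree as ET


def find_malformed_xml(paths):
    """Return (path, error message) pairs for files that fail to parse as XML."""
    problems = []
    for path in paths:
        try:
            # ET.parse raises ParseError on malformed XML,
            # and OSError if the file cannot be opened.
            ET.parse(str(path))
        except (ET.ParseError, OSError) as err:
            problems.append((str(path), str(err)))
    return problems
```

For example, calling it with the servers.xml and tasks.nzb paths from
your data dir returns an empty list if both are well-formed, and
otherwise the parser's message with the line and column of the first
error.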

--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Segfault at exit

Jim Henderson-4
On Sat, 24 Dec 2016 07:49:09 +0000, Duncan wrote:

> While I don't see your segfault (which does support your theory that
> it's a problem with your config), pan does have a workaround for crashes
> preventing read-message, etc, writeouts, that I've been using for years
> now.  IIRC I reinforced the habit (which I /think/ I had even before
> that, but that reinforced it) back when I was having problems not with
> pan itself, but with xorg and kwin, back when the composite extension
> (and kde/kwin's use thereof) was new and had a leak that would regularly
> crash xorg... of course taking pan down along with it, without a chance
> to write out its current state.  Yes, it was /that/ long ago.
> Anyway, by doing this, I kept the lost data to some level I considered
> reasonably manageable.
>
> The key here is to realize that pan writes out the per-newsgroup data
> when it switches groups, so in busy groups with enough unread messages
> that I didn't want to risk losing at least semi-current read-messages
> tracking, I developed the habit of deliberately clicking to some other
> group and back every N messages or so.  On busy mostly binary groups
> with thousands of unread messages, N would be 500 messages or so, a
> couple times per thousand messages; on more technical groups where I
> went slower, N might be 50 or 100 messages.
>
> That seemed to address the problem rather nicely, even in the above case
> where it wasn't pan, but X, taking pan with it, that was crashing.  I
> keep up with the habit today, tho I'm probably less religious about it
> than I was when X was regularly crashing, so one might say pan has
> trained me well. =:^)

Yeah, I'd observed the writing out using strace as I used it.  The last
thing it does before the segfault, according to strace, is ... well,
here's the strace output:

--- snip ---

eventfd2(0, O_NONBLOCK|O_CLOEXEC)       = 28
write(28, "\1\0\0\0\0\0\0\0", 8)        = 8
write(8, "\1\0\0\0\0\0\0\0", 8)         = 8
futex(0x17f2bd0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x13fc8d0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x17e90b8, FUTEX_WAKE_PRIVATE, 1) = 1
poll([{fd=28, events=POLLIN}], 1, 25000) = 1 ([{fd=28, revents=POLLIN}])
poll([{fd=28, events=POLLIN}], 1, 25000) = 1 ([{fd=28, revents=POLLIN}])
read(28, "\1\0\0\0\0\0\0\0", 16)        = 8
poll([{fd=28, events=POLLIN}], 1, 25000) = 1 ([{fd=28, revents=POLLIN}])
read(28, "\1\0\0\0\0\0\0\0", 16)        = 8
write(28, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x23bdc00, FUTEX_WAKE_PRIVATE, 2147483647) = 0
close(28)                               = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=0} ---
+++ killed by SIGSEGV +++

--- snip ---

Before that, it was deleting cache files.  But as you can see here, the
eventfd2() file descriptor (28) closes successfully.
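
For longer traces, picking out the last call before the signal by eye
gets tedious.  Here is a small Python sketch that does it, assuming the
plain single-process strace line format shown above (no PID or timestamp
prefixes).  Keep in mind strace only logs system calls and signals, so
this shows where the process last entered the kernel, not which line of
pan's code faulted; the gdb backtrace is still what answers that.

```python
import re

# "name(args)   = return value", as in the trace above
SYSCALL_RE = re.compile(r"^(?P<name>\w+)\((?P<args>.*)\)\s+= (?P<ret>.+)$")
# "--- SIGSEGV {si_signo=SIGSEGV, ...} ---"
SIGNAL_RE = re.compile(r"^--- (?P<sig>SIG\w+) \{(?P<info>.*)\} ---$")


def last_syscall_before_signal(trace, signal="SIGSEGV"):
    """Return (syscall name, return value) for the last call logged before
    `signal`, or None if the signal never appears or nothing precedes it."""
    last = None
    for line in trace.splitlines():
        line = line.strip()
        sig = SIGNAL_RE.match(line)
        if sig and sig.group("sig") == signal:
            return last
        call = SYSCALL_RE.match(line)
        if call:
            last = (call.group("name"), call.group("ret"))
    return None
```

Fed the snippet above, it returns ("close", "0"), matching what the trace
shows by inspection.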

The interesting thing is that even switching groups doesn't get the
message counters updated properly.  But as I think about what I do in my
setup, I wonder if fuse might be a factor here.

I may need to test that.

See, what I do with my pan installation is store the config files in an
encfs container.  I mount the containers (one for .pan2 and one for News)
prior to launching pan, and unmount them after pan exits.  I had toyed
around with adding a delay to the 'fusermount -u' commands, but that
didn't make a difference on the segfault (which makes sense, since pan
has to exit before the volumes are unmounted).

(Mostly as a point of interest - I do this because I have secure access
to a couple NNTP servers that hold sensitive information, and I sync the
config between multiple systems using Dropbox - but I don't want the
passwords stored in a format that can be read by anyone who happens to
hack my Dropbox account for some reason.)

But maybe I need to sync before unmounting the encfs containers, if that's
corrupting data in some way that's causing the segfault.
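
Roughly, the wrapper I have in mind would look like the Python sketch
below: mount, run pan, sync, then unmount in reverse order.  The paths in
the usage note are made up, the `run` parameter exists only so the
sequence can be tested without touching real mounts, and whether the
extra sync changes anything is exactly what I'd be testing.

```python
import subprocess


def run_session(mounts, app=("pan",), run=subprocess.run):
    """Mount encfs containers, run the app, sync, then unmount.

    `mounts` is a list of (encrypted_dir, mount_point) pairs; these
    paths are placeholders to be supplied by the caller.
    Returns the app's exit status.
    """
    mounted = []
    try:
        for encrypted_dir, mount_point in mounts:
            run(["encfs", encrypted_dir, mount_point], check=True)
            mounted.append(mount_point)
        # check=False: a crash at exit must not skip the cleanup below.
        # With subprocess.run, a negative return code means the process
        # was killed by that signal number (SIGSEGV is 11 on Linux).
        result = run(list(app), check=False)
        return result.returncode
    finally:
        # Flush pending writes before the FUSE volumes go away.
        run(["sync"], check=False)
        for mount_point in reversed(mounted):
            run(["fusermount", "-u", mount_point], check=False)
```

Called as, say, run_session([("/path/to/enc-pan2", "/home/me/.pan2"),
("/path/to/enc-News", "/home/me/News")]), a return value of -11 would
confirm that pan itself died with SIGSEGV before anything was unmounted.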

>> I built a debug build and did a backtrace in gdb, and it points to
>> pan.cc line 1140.  That seems to tie to the process of freeing up
>> secured passwords in memory, so I thought it might be something in
>> servers.xml, but I don't see anything obvious in that file that's a
>> problem (other than perhaps a missing CR/LF at the end of the file, but
>> I tried adding that and the behavior didn't change).
>>
>> Any ideas on where I should start?
>
> First thing I'd do is isolate whether the problem only triggers when pan
> connects to the server, and not when you're local-only.  Either set pan
> offline (I'm not sure whether that setting sticks across pan restarts),
> or if necessary, toggle the "get new headers" options for startup and
> for entering a group (in preferences, behavior, groups) to OFF, so you
> can start pan and switch groups without pan fetching headers.  Then
> restart pan, browse some already locally cached headers and messages
> without doing anything that triggers pan to connect to the server, and
> see if the problem still occurs.

That's a good idea - I hadn't thought about that as a way of isolating
online vs. offline behaviour.  I don't think that setting is persistent
across restarts, as it's not a preference, but I can check that easily
enough.  I do have it configured to clear cache (to reduce data storage
needs in Dropbox).

> That alone should help confirm whether it's password and/or server
> related.
>
> Then, if it's still happening when pan isn't network connecting, it
> makes this next step easier.
>
> Use the old bisect method on pan's data dir, first ensuring that the
> problem disappears with a clean config, then try bisecting the problem
> down to a single file, testing a theoretical half of the problem space
> at a time.  Of course this is dramatically easier if the test set of
> files remain static, thus the reason to test without network activity
> and downloads going on, if possible.
>
> I'd probably test right away with a clean article cache, to see if it's
> that, and particularly if the problem only happens with a server
> connection, I'd test right away with a clean ssl_certs dir and let pan
> redownload the certs, of course comparing them to the previous certs
> manually.
>
> You've long since updated pan and the certs from back when pan was
> writing corrupt binary files instead of the ascii-based cert files it
> should have been writing and writes in current versions, right?

None of the servers I use use SSL (which is silly, given the nature of
some of the data hosted on them), so that won't be an issue, but
bisecting the issue is easy enough - I can break out the multiple newsrc
files from other configs and see what happens.  My guess is that it's a
newsrc file that's having a problem (I've historically had problems with
message counters getting corrupted, though that hasn't happened in a
while now).

> The groups dir and newsrc files are also suspect since it's writing them
> out that's failing.
>
> And IIRC I had a problem with a corrupt tasks.nzb at one point, tho that
> should be regularly updated, so I wouldn't expect it to be the problem
> in this case as it has been an ongoing problem for you for some time,
> and that was a more urgent "pan won't work at all" problem for me, when
> it got corrupted.

Also good to know.  I don't tend to use nzb files, but if that's a
standard behaviour, then that could well be something to look at.

Thanks for the ideas!

Jim

--
 Jim Henderson
 Please keep on-topic replies on the list so everyone benefits



Re: Segfault at exit

Duncan-42
Jim Henderson posted on Sat, 24 Dec 2016 17:56:17 +0000 as excerpted:

>> And IIRC I had a problem with a corrupt tasks.nzb at one point, tho
>> that should be regularly updated, so I wouldn't expect it to be the
>> problem in this case as it has been an ongoing problem for you for some
>> time, and that was a more urgent "pan won't work at all" problem for
>> me, when it got corrupted.
>
> Also good to know.  I don't tend to use nzb files, but if that's a
> standard behaviour, then that could well be something to look at.

Standard behavior indeed, as tasks.nzb is how pan stores downloads it
hasn't completed yet when it shuts down.

One thing Charles did a good job on in the pan C++ rewrite is choosing to
use established standard solutions whenever possible, even when it meant
some extra work, as it did with the newsrc files: they are only
single-server, and Charles worked hard to make the pan rewrite
transparently multi-server (and did a good job at it, if I do say so!)

And since he was adding nzb support already, I guess he decided to reuse
that code to store unfinished tasks over a shutdown, as well.  Which is
genius in a way, as the nzb code gets far more routine use on a far
broader set of systems than it otherwise would, that way.

But it threw me for a loop when I had problems with it as well, because
to my knowledge I wasn't doing anything with nzbs, and at the time I had
no idea pan was actually using the file, so it was the /last/ thing I
expected to be the problem.  Of course once I found out it was and I
opened it and saw the corruption, I gained both a better understanding of
how pan works, and a new appreciation for Charles' wisdom and genius in
using an established standard in a way I certainly hadn't expected.

--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Segfault at exit

Petr Kovar-2
In reply to this post by Jim Henderson-4
Hi Jim,

On Thu, 22 Dec 2016 23:23:09 +0000 (UTC)
Jim Henderson <[hidden email]> wrote:

> As I prepare to take some time off work over the holidays, I thought I
> might try to track down a minor issue that's been bugging me for a little
> bit now - when I exit Pan, I get a segfault, and it appears to happen at
> a time that stops the app from writing all of its "last read" information
> out.
>
> It's been happening for a while now, with various builds, using the data
> I have right now.  I'm guessing that the issue isn't a code bug, but
> rather something wrong in my data files.  I'd rather not recreate my
> config from scratch, so I'm trying to track down the problematic config
> file so I can fix it.
>
> Call it a "foxhunt" if you like. Something of a puzzle to be solved. :)
>
> I built a debug build and did a backtrace in gdb, and it points to pan.cc
> line 1140.  That seems to tie to the process of freeing up secured
> passwords in memory, so I thought it might be something in servers.xml,
> but I don't see anything obvious in that file that's a problem (other
> than perhaps a missing CR/LF at the end of the file, but I tried adding
> that and the behavior didn't change).

Yep, I can confirm that issue. It's only present when you compile in support
for gnome-keyring (HAVE_GKR). Patches welcome. :)

Cheers,
pk


Re: Segfault at exit

Duncan-42
Petr Kovar posted on Fri, 30 Dec 2016 01:04:31 +0100 as excerpted:

> Hi Jim,
>
> On Thu, 22 Dec 2016 23:23:09 +0000 (UTC)
> Jim Henderson wrote:
>
>> I thought I might try to track down a minor issue that's been bugging
>> me for a little bit now - when I exit Pan, I get a segfault, and it
>> appears to happen at a time that stops the app from writing all of its
>> "last read" information out.

>> I built a debug build and did a backtrace in gdb, and it points to
>> pan.cc line 1140.  That seems to tie to the process of freeing up
>> secured passwords in memory [... .]
>
> Yep, I can confirm that issue. It's only present when you compile in
> support for gnome-keyring (HAVE_GKR).

That would explain why I've not seen it.  I'm a kde guy, and gentoo
exposes that option as a USE flag which I of course have off, so I never
saw the bug as I was building both the feature and the bug out...

--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Segfault at exit

Jim Henderson-4
In reply to this post by Duncan-42
On Sun, 25 Dec 2016 14:33:21 +0000, Duncan wrote:

> Jim Henderson posted on Sat, 24 Dec 2016 17:56:17 +0000 as excerpted:
>
>>> And IIRC I had a problem with a corrupt tasks.nzb at one point, tho
>>> that should be regularly updated, so I wouldn't expect it to be the
>>> problem in this case as it has been an ongoing problem for you for
>>> some time, and that was a more urgent "pan won't work at all" problem
>>> for me, when it got corrupted.
>>
>> Also good to know.  I don't tend to use nzb files, but if that's a
>> standard behaviour, then that could well be something to look at.
>
> Standard behavior indeed, as tasks.nzb is how pan stores downloads it
> hasn't completed yet when it shuts down.

Good to know. :)

> One thing Charles did a good job on in the pan C++ rewrite is choosing
> to use established standard solutions whenever possible, even when it
> meant some extra work, as it did with the newsrc files: they are only
> single-server, and Charles worked hard to make the pan rewrite
> transparently multi-server (and did a good job at it, if I do say so!)

That is a pretty good decision. :)

> And since he was adding nzb support already, I guess he decided to reuse
> that code to store unfinished tasks over a shutdown, as well.  Which is
> genius in a way, as the nzb code gets far more routine use on a far
> broader set of systems than it otherwise would, that way.
>
> But it threw me for a loop when I had problems with it as well, because
> to my knowledge I wasn't doing anything with nzbs, and at the time I had
> no idea pan was actually using the file, so it was the /last/ thing I
> expected to be the problem.  Of course once I found out it was and I
> opened it and saw the corruption, I gained both a better understanding
> of how pan works, and a new appreciation for Charles' wisdom and genius
> in using an established standard in a way I certainly hadn't expected.

He certainly did an outstanding job getting this program up and running -
and converted to OOP. :)

Jim

--
 Jim Henderson
 Please keep on-topic replies on the list so everyone benefits



Re: Segfault at exit

Jim Henderson-4
In reply to this post by Petr Kovar-2
On Fri, 30 Dec 2016 01:04:31 +0100, Petr Kovar wrote:

> Hi Jim,
>
> On Thu, 22 Dec 2016 23:23:09 +0000 (UTC)
> Jim Henderson <[hidden email]> wrote:
>
>> As I prepare to take some time off work over the holidays, I thought I
>> might try to track down a minor issue that's been bugging me for a
>> little bit now - when I exit Pan, I get a segfault, and it appears to
>> happen at a time that stops the app from writing all of its "last read"
>> information out.
>>
>> It's been happening for a while now, with various builds, using the
>> data I have right now.  I'm guessing that the issue isn't a code bug,
>> but rather something wrong in my data files.  I'd rather not recreate
>> my config from scratch, so I'm trying to track down the problematic
>> config file so I can fix it.
>>
>> Call it a "foxhunt" if you like. Something of a puzzle to be solved. :)
>>
>> I built a debug build and did a backtrace in gdb, and it points to
>> pan.cc line 1140.  That seems to tie to the process of freeing up
>> secured passwords in memory, so I thought it might be something in
>> servers.xml, but I don't see anything obvious in that file that's a
>> problem (other than perhaps a missing CR/LF at the end of the file, but
>> I tried adding that and the behavior didn't change).
>
> Yep, I can confirm that issue. It's only present when you compile in
> support for gnome-keyring (HAVE_GKR). Patches welcome. :)

Good to know - since I don't tend to use gnome-keyring with Pan, I'll
just work around it by disabling the option. :)

Jim

--
 Jim Henderson
 Please keep on-topic replies on the list so everyone benefits

