Re: Filtering

Duncan-42
DLSauers posted on Fri, 16 Sep 2016 14:18:38 +0000 as excerpted:

> In one particular group I haunt, a lot of cruft gets crossposted in
> for unrelated topics...
>
> I heavily filter this group, but could probably cut down on adding
> filters daily and/or the existing ones if I could just get pan to filter
> out things.
>
> What I am after is something along...
>
> Lets say the group is x
>
> If the post has MORE than group X and contains *.politics.* etc... mark
> it -99999999999999999999999999999999999999999999999999999 or what ever..
>
> None of the options for scoring rules seems to allow this or work, the
> only way to filter this stuff is set up a lot of rules like
>
> contains Hillary contains Trump contains gay contains ......
>
> Just being able to if post is xposted to more than 1 group ie X mark it
> -9999 would nuke a lot of stuff....
>
> Or is pan not able to setup such advanced scoring filters via the GUI
> and/
> or otherwise????
>
> This group is rather problematic, and always has been.. It has the
> biggest filtering/killfile and well the only filtering and killfile I've
> used on Usenet in 30+ years!
>
> Any hints on getting more advanced filtering done???

First the general stuff.  You didn't say whether you already know this.
If you're a list regular you might, as I've posted it here many times
over the years.  Otherwise you likely won't, unless you've used the other
clients mentioned below and compared their scorefiles with pan's.

Pan's scorefile format is in general a less advanced implementation of
SLRN's scorefile format (without the fancy stuff such as includes...):

http://slrn.sourceforge.net/docs/score.txt

... but with the case insensitivity (but not the other changes) of xnews
(my link for that one is dead, but slrn is primary, so it's not worth
trying to google or otherwise resurrect the xnews one).

Here's the abridged version of the format description I keep as comments
in my own scorefile:

% [newsgroup.*] wildcard (not regex) format (~ negates).
% header lines regex. (~ negates).
% Score conditions, single : and, double :: or.
% Expires: immed. below score if present.
% Leading % indicates comment
% Leading whitespace and blank lines ignored.
% Regex and newsgroup matches case insensitive with
% keyword:, sensitive with keyword=.
% Newsgroup change delimits section,
% Score delimits "rule", multiple rules per section allowed.
% Comment after score becomes rule "name".

% Score levels: <=-9999 kill, -9998 to -1 low,
%               0, 1 - 4999 med, 5000 - 9998 high, >=9999 watch
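
As a quick illustration of the score levels listed in those comments, here is a minimal Python sketch.  The function name and the band labels are made up for this example, and it only restates the thresholds above; it is not pan's own code.

```python
def score_category(score: int) -> str:
    """Map a numeric score to the band named in the scorefile comments."""
    if score <= -9999:
        return "kill"
    if score < 0:          # -9998 to -1
        return "low"
    if score == 0:
        return "zero"
    if score <= 4999:      # 1 to 4999
        return "medium"
    if score <= 9998:      # 5000 to 9998
        return "high"
    return "watch"         # 9999 and above


# Print the band on each side of every boundary.
for s in (-10000, -9999, -9998, -1, 0, 1, 4999, 5000, 9998, 9999):
    print(s, score_category(s))
```

The boundary values are the ones worth remembering: -9998 is still only "low", and 9998 is still only "high".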


** EXCEPT: Unfortunately, the last time I investigated, pan's scoring had
a bug and would **NOT** do logical AND -- the single : was treated as OR
(::) regardless.  Fortunately, most of my scoring (and I'd guess pretty
much anyone else's) is single-shot OR logic anyway, so that's not as big
a deal as it would be if OR were broken instead of AND.  But it /does/
rather kill a direct implementation of your AND test above... if the bug
still exists, which I suppose it does, tho I haven't tested recently.

However, it's /somewhat/ possible to work around that limitation by
judicious use of additive scoring.  As an example, use two rules that
each add -5000, so together they reach -10000 and trigger the kill level.
If other rules add, say, +100 and a message triggers those as well, it
ends up at -9900 and escapes the kill, but that's a good thing, as it
makes the scheme far more flexible.  Make the two rules -9998 each if you
want each one to /almost/ kill on its own, so that no trivial +100 can
undo the two combined; or make them both -4950 if you want a further
trivial -100 to be required before the kill triggers; or...
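
The arithmetic is easy to check with a few lines of Python.  This is only a sketch of the additive idea: it assumes every matching rule simply adds its value (that is, none of them uses the leading = that sets a score outright), and the names KILL_THRESHOLD and is_killed are invented for the example.

```python
KILL_THRESHOLD = -9999  # "<=-9999 kill" from the score levels above


def total_score(matched_rule_scores):
    """Sum the values of every rule a message matched (additive scoring)."""
    return sum(matched_rule_scores)


def is_killed(matched_rule_scores):
    """True if the combined score reaches the kill level."""
    return total_score(matched_rule_scores) <= KILL_THRESHOLD


# Each case lists the rule values one message triggered.
cases = {
    "two -5000 rules": [-5000, -5000],
    "two -5000 rules plus a +100": [-5000, -5000, 100],
    "only one -9998 rule": [-9998],
    "two -9998 rules plus a +100": [-9998, -9998, 100],
    "two -4950 rules": [-4950, -4950],
    "two -4950 rules plus a -100": [-4950, -4950, -100],
}

for label, scores in cases.items():
    print(f"{label}: total {total_score(scores)}, killed={is_killed(scores)}")
```

Neither rule kills on its own in any of these cases; only the combination does, which is what approximates the missing AND.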


The other thing that should stand out from the format notes above, once
you understand that a leading % indicates a comment and then look at the
rules pan writes when you create them via its GUI, is that:

** Most of the lines pan adds to the scorefile are simply extra
explanatory comments -- they don't actually affect the rules at all and
deleting many of them can help massively shrink your scorefile without
affecting actual scorefile logic at all.
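
To see how much of a scorefile is comment-only, something like the following Python sketch would do.  The sample text is hypothetical (it is not real pan output), and the function name is made up.  It treats a line as a comment only when % is its first non-blank character, so a trailing comment on a Score line, which serves as the rule name, is left alone.

```python
def split_comments(scorefile_text: str):
    """Separate comment-only lines from lines that carry scoring logic."""
    kept, comments = [], []
    for line in scorefile_text.splitlines():
        if line.lstrip().startswith("%"):
            comments.append(line)   # pure comment, no effect on scoring
        else:
            kept.append(line)       # section head, Score line, header test
    return kept, comments


# Hypothetical sample; real GUI-written entries will look different.
SAMPLE = """\
% (stand-in for an explanatory comment line)
[alt.example]
Score: =-9999 % example rule name
   % (another stand-in comment)
   From: spammer
"""

kept, comments = split_comments(SAMPLE)
print(f"{len(comments)} comment-only lines, {len(kept)} other lines")
print("\n".join(kept))
```

Review the comment lines before deleting anything; some of them, such as creation dates on expiring scores, are worth keeping.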


Finally, if you've been using pan's GUI to create most of your scores and
haven't edited or have only lightly edited the scorefile itself, and you
do a LOT of scoring, you should be able to *greatly* optimize things with
some rather more active manual scorefile formatting and editing.  For
instance, a short excerpt from the alt.* spam-kill section of my own
scorefile:

Warning, adult themed example!

%#####################################################################
%#####################################################################
[alt.*]
Score:: =-9999 %Alt kill
        From: Seeking teens
        From: teens seeker
        From: ^LoLiTa <
        From: ^GOBLIN <
        From: sex coed
        From: NudeGirls
        From: voyeur only
        From: amateur
        From: SEXmag
        From: teens
        From: intermixed
        From: rectal

        Subject: adult movies
        Subject: dupped
        Subject: ^\([-0-9/]*\)
        Subject: Use critical pack from Microsoft Corporation
        Subject: R/-\\PE
        Subject: R/-\|PE
        Subject: Horny mom
        Subject: rectal exam
        Subject: body cavity
        Subject: mature women
        Subject: candid voyeur
       

Just imagine how many lines that would take if they were each
individually added as separate rules, complete with multiple comment
lines each, by pan's GUI.  Here, they're both easily human-read, and far
easier and more efficient for pan to parse.

The down side to this level of scorefile editing, of course, is that in
order to maintain it, you pretty much have to either add new entries
manually, or pretty regularly go in and reoptimize all the entries you've
added via the pan GUI since the last time you cleaned up.

The up side is of course that once you have it cleaned up, it's dead easy
to manually add an additional single-line entry.

Meanwhile, a few hints:

* Set a pan hotkey for the "articles > edit article's watch/ignore score"
function.  From there you can hit the close and rescore button to rescore
based on any manual edits you just made to the scorefile.  That's the
easiest way I've come up with to get pan to reapply freshly edited
scores.

* Use %#### or similar comment lines to visually separate sections, as I
did in the example above.

* Consider whether you want an expiring or permanent score.  Permanent
scores can be easily added to the nicely edited groups manually, while
it's tougher to group expiring scores since the expires line will differ,
so adding these via the pan gui works well enough.

* Consider adding a %### separator line or two at the bottom of your
permanent scores, so pan can append the expiring scores you add via the
GUI, and it's easier to go in and clean up later since you know where the
new ones start.  Talking about which...

* Pan doesn't clean up expired scores on its own.  You'll have to go thru
and weed them out once in a while.  (After doing so a few times, you may
find yourself not adding so many expiring scores, choosing instead to
either add a permanent one or simply skip it, so you don't have to clean
up the expired score later.  But if you're like me you'll still add a
few, for people irritating enough to want to score down temporarily, but
who you think might still learn some maturity, in say a year or so, so
you don't want to make it permanent just yet.)

* For expiring scores, I've found it helpful to keep pan's "created by
Pan on <date>" comments, as that way I not only know when it expires, but
I know when it was created, and thus have some idea of how irritated I
was when I created the entry, based on how long I set it to last before
expiring.

*** Pan can score based on any header, not just the ones the GUI allows
you to score.  However, headers that aren't in the overviews as sent from
the server won't apply until the message is actually downloaded to cache,
making them much less efficient since you won't be able to see the effect
until the message is already downloaded and in cache.  That's a
limitation of the protocol (and overviews) that pan can't do anything
about, but sometimes, having to download a message before it can be
killed is still better than having to actually read it.

*** The above should let you manually add scores based on either the
Newsgroups header (as opposed to the newsgroup you're actually in at the
time, which is what the [*] section head specifies) or the Xref header.
Both contain the list of cross-posted groups.  The Xref header lists only
the groups carried on that server, along with the message number in each
of them, while the Newsgroups header lists all the groups the message was
posted to, whether that server carries them or not.  However, I'm not
sure whether these rules will apply before or after download, due to the
overviews issue mentioned above.


Those last two hints should allow you to score based on crossposting to
N+ groups, provided you know enough about the crossposted group names in
advance to create a score for them.  Alternatively, scoring on Xref and
counting the number of colons (one per group:number entry) should allow
you to score on a message posted to N+ groups regardless of name,
provided the server carries that many of the groups.  But again, without
actually testing it I don't know whether such scores can be applied
before download, with only the overviews information available, or only
after download.  Either way it should be possible, but one will obviously
be far more convenient than the other.
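
Here is a small Python sketch of the colon-counting idea, for trying patterns out before putting them in a scorefile.  The header values are made up, the function names are hypothetical, and Python's re module is only standing in for pan's regex engine.  Whether pan applies such a pattern to Xref, and at which stage, is exactly the untested part.

```python
import re


def xref_group_count(xref_value: str) -> int:
    """Count group:number entries in an Xref header value.

    The first token is the server name, so it is skipped.
    """
    entries = xref_value.split()[1:]
    return sum(1 for entry in entries if ":" in entry)


def newsgroups_count(newsgroups_value: str) -> int:
    """Count the comma-separated groups in a Newsgroups header value."""
    return len([g for g in newsgroups_value.split(",") if g.strip()])


def at_least_n_colons(n: int) -> str:
    """Build a regex that matches only text containing at least n colons."""
    return r"(:[^:]*){%d}" % n


# Hypothetical header values for one crossposted message.
xref = "news.example.invalid alt.test:101 misc.test:202 talk.politics.misc:303"
newsgroups = "alt.test,misc.test,talk.politics.misc,alt.not.carried"

print(xref_group_count(xref))        # groups this server carries
print(newsgroups_count(newsgroups))  # all groups the message was posted to
print(bool(re.search(at_least_n_colons(2), xref)))   # crossposted at all?
print(bool(re.search(r"\.politics\.", newsgroups, re.IGNORECASE)))
```

The Xref count comes out lower than the Newsgroups count here because one of the made-up groups is not carried by the made-up server, which is the caveat noted above.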

And again, as I said above, I believe the AND logic bug will prevent
combining an N-newsgroups test and a subject-line test into a single rule
requiring both.  But by using multiple scoring rules and adjusting the
score each applies, you should be able to approximate the same thing.

--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


_______________________________________________
Pan-users mailing list
[hidden email]
https://lists.nongnu.org/mailman/listinfo/pan-users

Re: Filtering

Dave-58
On Tue, 27 Sep 2016 18:19:33 +0000, DLSauers wrote:

> Hopefully maybe some one will work on the scoring features to improve
> them for non regex speaking users.

Back in the days when I used Windows, the GUI filtering for Forte Agent
was OK but it was very much worthwhile learning at least the basics of
regex to get the full benefit, especially if one had more complex
requirements.

--
Climate Change may be raising the sea levels, but the gene pool
seems to be drying up.



Re: Filtering

Jim Henderson-4
On Tue, 27 Sep 2016 19:29:01 +0000, Dave wrote:

> On Tue, 27 Sep 2016 18:19:33 +0000, DLSauers wrote:
>
>> Hopefully maybe some one will work on the scoring features to improve
>> them for non regex speaking users.
>
> Back in the days when I used Windows, the GUI filtering for Forte Agent
> was OK but it was very much worthwhile learning at least the basics of
> regex to get the full benefit, especially if one had more complex
> requirements.

FWIW, I would agree with this assessment.  The regex requirements for
this are simpler than full JavaRE (for example), because we're not doing
replacements, which is where a lot of regex complexity comes in - so
grouping and forward/backwards references aren't needed.

But a switch to go between simple wildcard settings and full regex would
be nice.   Similarly, tools that would enable/disable regex flags
(another area that can get confusing for novices) would be good - so case
sensitivity can be enabled/disabled using a checkbox rather than having
to remember to prepend (?i) to the expression.

But the essentials of regex are useful:

"." = any single character
"*" = repeat the previous character (or group) zero or more times
"[A-Z]" and other ranges = ranges of characters.
"^" = start of line
"$" = end of line

And of course "\" to escape special characters used in the above. :)
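
For anyone who wants to experiment with those essentials, here is a minimal Python sketch.  It uses Python's re module as a stand-in (pan's own engine may differ in finer points), turns on case insensitivity to mirror pan's default, and borrows two patterns from the scorefile excerpt earlier in the thread.  The helper name and the sample strings are made up.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Case-insensitive search, roughly like a keyword: test in a scorefile."""
    return re.search(pattern, text, re.IGNORECASE) is not None


# "." stands for exactly one character.
print(matches("t.ens", "teens"))   # True
print(matches("t.ens", "tens"))    # False: nothing there for "." to match

# "*" repeats the previous item zero or more times.
print(matches("lo*l", "ll"), matches("lo*l", "loool"))

# "[0-9]" is a range; "^" and "$" anchor to the start and end.
print(matches("^GOBLIN <", "goblin <someone@example.invalid>"))
print(matches("^GOBLIN <", "The GOBLIN <someone@example.invalid>"))
print(matches("part [0-9]$", "holiday pictures part 3"))

# "\" escapes: literal parentheses around digits, dashes and slashes.
print(matches(r"^\([-0-9/]*\)", "(12/34) holiday pictures"))
print(matches(r"^\([-0-9/]*\)", "Re: (12/34) holiday pictures"))
```

The last pair shows why the anchor matters: the same text with a "Re: " prefix no longer matches, because "^" pins the pattern to the start of the line.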

Jim
--
 Jim Henderson
 Please keep on-topic replies on the list so everyone benefits



Re: Filtering

Duncan-42
Jim Henderson posted on Tue, 27 Sep 2016 20:30:55 +0000 as excerpted:

> On Tue, 27 Sep 2016 19:29:01 +0000, Dave wrote:
>
>> On Tue, 27 Sep 2016 18:19:33 +0000, DLSauers wrote:
>>
>>> Hopefully maybe some one will work on the scoring features to improve
>>> them for non regex speaking users.
>>
>> Back in the days when I used Windows, the GUI filtering for Forte Agent
>> was OK but it was very much worthwhile learning at least the basics of
>> regex to get the full benefit, especially if one had more complex
>> requirements.
>
> FWIW, I would agree with this assessment.  The regex requirements for
> this are simpler than full JavaRE (for example), because we're not doing
> replacements, which is where a lot of regex complexity comes in - so
> grouping and forward/backwards references aren't needed.

But... some people's minds simply don't wrap around regex.  I get that.

FWIW my mind doesn't wrap around sports.  Luckily for me, not much I
really want to do involves sports, so I don't have to worry about it
/too/ much.

But it would have been /nice/ had I had just enough more "common sense"
in the area to realize, at one point many years ago, that footballs
aren't "Nerf" inside -- they're inflated.  As it happens I was /way/
short on sleep that day and that probably had something to do with it as
well, but...  I was working dorm reception that day at college, and
someone came in to drop off a football for someone, and I *stapled* the
guy's name to it! =:^(

So yeah, I definitely understand not "getting it" with regard to some
element of something that's just assumed to be common sense knowledge...
it _might_ be "common sense" within that domain, but that's the point,
not everybody has that sort of domain knowledge, or even cares to have it.

OTOH, it could be argued that there are simply certain things that you
can't do if you don't know how, that they /do/ require some level of
domain knowledge, and that it's simple fact, not /bad/, that it is so.  
That both allows and encourages people to specialize, and ideally, to
help each other.

Which is what I'm going to try to do here, devising a set of scores and
actions to allow DLS to do what he set out to do.

But I believe I'm going to punt for a few days and will have to come back
to it, for reasons I'll put in a different, somewhat OT, post, following
up on a thread from a few months ago...  IRL has me pretty busy ATM and
I'm afraid I can't properly focus on this ATM, but it's a /good/ busy.
=:^)

So DLS, please followup with a reminder in a couple weeks if I've not
gotten back to this by then (and if no one else has beat me to it), as I
still think it's possible, but I just can't think about it that hard ATM
as there's too much else going on IRL.

> But a switch to go between simple wildcard settings and full regex would
> be nice.   Similarly, tools that would enable/disable regex flags
> (another area that can get confusing for novices) would be good - so
> case sensitivity can be enabled/disabled using a checkbox rather than
> having to remember to prepend (?i) to the expression.


This bit doesn't require thinking so hard, tho, so...

* FWIW, pan scoring is case insensitive by default, and I'd consider that
a good thing.  No having to set that specifically. =:^)

* Keep in mind that at least some of the reason pan's scoring works the
way it does is because of the common slrn scorefile format Charles chose
to use for it.  Since that's a format common to multiple clients, it's
worth keeping, and at least here, I'd find it a sad day if pan were to
move away from that format, because there *is* value in keeping to a
common format like that.

* Which means if the tools are adjusted to make more advanced usage
simpler for those who don't know regex, it'll have to be the GUI front-
end only -- they'll have to do the translating to the common format,
including regex, as appropriate, and store the score in the same common
format they do now.

* Which brings up another point -- those regex helping tools don't
necessarily have to be part of pan itself.  Indeed qt, for instance, has
regex editing tools, and I'd imagine gtk does as well, tho I'm not
familiar enough with it to be sure.  In the qt4/kde4 era I believe they
were actually part of kde4.  In the current qt5/kde5/plasma5/frameworks5
era it took some time to bring them back, but I believe they're available
now and are more qt-generic than kde dependent, altho they may still
require a kde framework or two.

* More generically, I haven't googled it, but I'd be /very/ surprised if
there weren't regex translation websites out there, that allowed you to
plug in conditions via some GUI, and spit out the regex as output.


--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Filtering

Jim Henderson-4
On Wed, 28 Sep 2016 03:18:38 +0000, Duncan wrote:

> But... some people's minds simply don't wrap around regex.  I get that.

Oh, yeah, I get that too, which is why I suggested a "simple wildcard"
method as well.  But for those who do speak regex, having that grammar
available is incredibly useful.

Jim

--
 Jim Henderson
 Please keep on-topic replies on the list so everyone benefits

