Author Topic: Unexpected SMTP delay or failure with some domains  (Read 96429 times)

BGSWebDesign

  • Posts: 625
    • View Profile
    • http://www.bgswebdesign.com
Unexpected SMTP delay or failure with some domains
« on: April 13, 2006, 09:58:45 pm »
Hi DW,

PLEASE READ ALL OF THIS MESSAGE - 2 updates - something caused resume.php to stall...

Ok, I've upgraded to 1.86, ran into the same problem as others with the 'addopts' - but then restarted the machine to clear the cookie and all is ok - you should tell everyone to make sure they log OUT before uploading the install...

Now, a question on the resume.php script: I'm trying to install it in CRON, so I'm testing it from the command line over SSH. I keep getting strange results and do NOT know if it's running; maybe you can help?

I set debug=true at the top, and tried running with the sample command line you gave:
Code: [Select]
*/15 * * * * /usr/bin/wget -O /dev/null -T 0 http://example.com/mail/resume.php?pw=YourDailyMailPass  

When I do that I keep getting an error:
Code: [Select]
wget: timeout: Invalid time period `O'

So I took out the -T 0 from the command line giving me this:
Code: [Select]
*/15 * * * * /usr/bin/wget -O /dev/null  http://example.com/mail/resume.php?pw=YourDailyMailPass  

When I do that I get:
Code: [Select]

HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

    [ <=>                                                            ] 42            --.--K/s

00:49:56 (410.16 KB/s) - `/dev/null' saved [42]


Here's another one:
Code: [Select]

HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

    [<=>                                                             ] 0             --.--K/s             a    [ <=>                                                            ] 42            --.--K/s


Any idea what is going on here?  Is resume.php running properly or is there an error on line 42?   Is this in one of the includes - config.php or admin.php or is it in resume.php?

Let me know what you think?

UPDATE
======
It seems that it may be working correctly, this morning I had many followups go out and they did not stall this morning - so - is it working?

SECOND UPDATE
===========
No, it's not running. I just noticed it stalled at 1802 messages left, and it's been stalled like that for the last 1-2 hours. Unless the system was re-set, the CRON is not running properly - no, CRON is still running, I just checked.

Here is what I found in lm_sendp, 2 entries, 1 is for DailyMail Report, the other is my batch which reads:
Code: [Select]
batid=cab744, qtype=1, formid=2006041401253, started=2006-04-14 01:25:43, lastact=2006-04-14 01:26:09, report=(empty), completed=1


I'm going to reset the completed to '0' to see if it will re-start (resume).  Nope, that didn't help - I wonder if it has choked on an Email address of someone that was deleted -  I know in the past that could be a problem?  Nope that's not it, I tracked down the userid (uid) for the first user in the lm_sendq table and all is fine, then I clicked the 'Resume' button and it started to resume the mailing - very strange - any ideas DW?

When I look in lm_sendq I see 1802 records, with this:
Code: [Select]
bat=e18c3a, battype=2, mtype=2, mid=85

The user ids are there for the records.  This one is stalled...

THIRD UPDATE
==========
Now - here is where it gets really weird!  I just came back and looked at my 'resumed' mailing, the one I started by clicking the 'Resume' button. That one had stalled out too, with an error - 'Server said goto config'. Usually that would mean a stalled queue again, but this time when I clicked a link to go back to the main LMP menu I saw that the mailing had completed. So something must have happened - I'm guessing the 'resume.php' script completed the mailing?

What exactly is going on here DW?  It looks like the queue was loaded at 1:25:43, and then finished at 1:26:09, is that right?   What about the mailing then - if it stalls doesn't resume.php get it started again?  If so, why is completed set to '1'?   And what about lastact, does that mean it did NOT have to re-start it at all throughout the early morning?  If I sent out as many emails as I think I did, then that would not be true - it must have restarted it many times, then the big question becomes - why did it stall at 1802 left?   How did this record get flagged as 'completed' (1)?

Any ideas - DW?
Thanks,
-Brett
http://www.bgswebdesign.com/Contact-Us.php

*** I do custom List Mail Pro installations ***
Contact me through my website (above)

DW

  • Administrator
  • Posts: 3787
    • View Profile
    • https://legacy.listmailpro.com
« Reply #1 on: April 14, 2006, 12:47:29 pm »
Brett,
Quote
wget: timeout: Invalid time period `O'

Seems to be a typo.  The time period should be set to 0 (Zero) not the letter O.  Failing to specify an unlimited time limit (-T 0) could result in unexpected behaviour but shouldn't cause any undesired email.
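For anyone who wants to catch this kind of typo before it goes into the crontab, here is a minimal sketch in Python. The function name `wget_timeout_problem` is made up for illustration, and it only checks the separated `-T value` form used in the sample command, not `-T0` or `--timeout=0`.

```python
import shlex


def wget_timeout_problem(command):
    """Return a description of a bad -T value in a wget command, or None if it looks fine.

    Hypothetical helper: it only understands the separated "-T value" form.
    """
    args = shlex.split(command)
    for i, arg in enumerate(args):
        if arg != "-T":
            continue
        if i + 1 >= len(args):
            return "-T is missing its value"
        value = args[i + 1]
        try:
            # wget expects a number of seconds here; the letter O is not one.
            float(value)
        except ValueError:
            return f"-T value {value!r} is not a number"
        return None
    # No -T option at all: nothing to complain about in this check.
    return None
```

Pass it only the command part of the cron line (everything after the five schedule fields). A result of None means the timeout value parsed as a number.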
Quote
00:49:56 (410.16 KB/s) - `/dev/null' saved [42]

This looks like it could be the 'normal' result from the script running successfully.
Quote
I'm going to reset the completed to '0' to see if it will re-start (resume).

I cannot recommend doing this as, without some testing, I have no idea what behaviour it might cause.
Quote
Here is what I found in lm_sendp, 2 entries, 1 is for DailyMail Report, the other is my batch which reads:

The Dailymail entry should contain the report and queue data for the dailymail execution.  The other entry (from another mailing, NOT the dailymail resume?) should have everything the same except qtype will be different and the report will be empty.

I notice that you mention 2 different batch ids "cab744" and "e18c3a" - could you have upgraded while a mailing was in progress?
Quote
What exactly is going on here DW?  It looks like the queue was loaded at 1:25:43, and then finished at 1:26:09, is that right?

I wish I could tell you but I'm a little confused, too.  I just verified that the script does set "last active" at the end of mailing when it also sets completed to 1.  Completed should only be set to 1 when the mailing is finished without error or interruption by the qfinish() function in admin.php.

I wonder if the confusion is caused by the different batch ids?

I can provide a text file containing several thousand test email addresses you can use if you want to do some further testing.  Please let me know if you have any more updates/information.

Regards
Dean Wiebe
ListMailPRO Author & Developer - Help | Support | Hosting

BGSWebDesign

« Reply #2 on: April 14, 2006, 03:04:43 pm »
Hi DW,

Quote
Seems to be a typo. The time period should be set to 0 (Zero) not the letter O.


I see that is a problem, I fixed that and the script runs properly, still that doesn't explain why it hung up and then was able to re-start later on?

Also, I don't think you need the -O /dev/null since I don't get any output if I include the 1> /dev/null 2> /dev/null on the end.

Also, I did not update in the middle of a mailing; this is from the mailing following the update. I also don't think the batch ids were different - I could have made a copying error, since I was hand-typing what I saw in the database.

Thanks, I'll keep you posted if I notice anything else...
Thanks,
-Brett

BGSWebDesign

« Reply #3 on: April 17, 2006, 06:17:12 am »
Hi DW,

Ok, it's stalled again - I can't get it to move with over 30,000 messages in the queue!   Should I click 'resume'?

When I change 'debug' to 'true' I NEVER see any output here - even if I remove the /dev/null - how can I tell what is going on?

Update
=====
Here's some more information, I looked in lm_sendp, it does NOT have an active (stalled queue) listed in it, ONLY Dailymail reports from the last 4 days, and the last queue from 4/14/2006!   Can you tell me WHEN lm_sendp is wiped and when the new queue record is added when a new mailing starts?   I don't see any new record in here, AND I have a stalled queue.  - Oops check that, I DID find a record for the queue - read below - but there were still OLD records hanging around in here with completed=1 when are these removed?

2nd UPDATE
=========
OK - here's what I see in lm_sendp - there is ONE record that seems to be for this queue. It includes a report, 'Dailymail report for 2006-04-17', so I did not think this was the queue but just the report. It does seem to be the queue, though, since it has the same queue ID as the records that are in lm_sendq ('7afbbf'), and I notice that lastact is getting updated frequently - as if email is being sent out - BUT - the 'Mail Sent' counter is NOT updating in the header of LMP?  Here's what is in this lm_sendp record:
Code: [Select]

id=5, batid=7afbbf, qtype=2, formid=, started=2006-04-17 04:15:15, lastact=2006-04-17 09:29:27, report=Dailymail Report for 2006-04-17 04:15:15Totals: ..., completed=0


3RD UPDATE
========
Ok, this time I deleted the top record in lm_sendq - since usually when it chokes it's because the top record is bad (bad address/or bad deleted record) and NOW the Mail Sent is being updated, it's running again - BUT, here's something for you DW - it seems that you need to add this check to resume.php - look and see if the NUMBER of records in lm_sendq is the SAME as it was last time you ran - if it is, and it stays that way for 10 minutes or longer you need to DELETE the top record in lm_sendq and then re-set your counter again and see if the number of records change next time in (indicating that the mailing is going out properly).

It seems that a bad record in lm_sendq (previously deleted or bad email address) is stalling out the resume function (resume.php).

What do you think DW?  - More info below - but this was BEFORE I knew what seems to be going on - so can probably be disregarded...

-------------------------------------------------------------------------

Any ideas WHAT is going on here? It seems lastact is getting updated but I DO NOT see that any emails are going out - at least the number of records in lm_sendq is always the same - as if NO email is going out??? Is that the way it should be?

I also get this message:
Code: [Select]
This mailing appears to be in the process of sending normally. It has responded within 1 minute

How does it respond? I don't see that the numbers in 'Mail Sent' are going down, so how is it responding - and there is no record in the lm_sendp table?

It seems there is a problem here of some kind?   Does the 'resume' automatically update the records in the queue and delete them as they are sent out - does it update the 'Messages Left' displayed in the header as I update my LMP page?   I don't see that it's doing anything - it's stalled, what can I do to help you figure this out???
Thanks,
-Brett

DW

« Reply #4 on: April 17, 2006, 08:56:24 am »
Brett,

Quote
I notice that lastact is getting updated frequently - as if email is being sent out - BUT - the 'Mail Sent' counter is NOT updating in the header of LMP?

This does seem strange.  Normally when a queue is processing it will be sending and removing email from the queue (lm_sendq) and updating last active.
Quote
Ok, this time I deleted the top record in lm_sendq - since usually when it chokes it's because the top record is bad (bad address/or bad deleted record) and NOW the Mail Sent is being updated, it's running again

Yes, it's starting to make some sense.  If ListMail were to be caught in a loop on a single user it would mean no other rows are processed and deleted from the queue table.
Quote
Does the 'resume' automatically update the records in the queue and delete them as they are sent out - does it update the 'Messages Left' displayed in the header as I update my LMP page?

Each individual email is entered as a row in the lm_sendq table.  As the mailing is sent to the server entries are removed one by one from the lm_sendq table.  The counter in the ListMail header is based on the row count for each batid in the lm_sendq table.
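To make the counter logic concrete, here is a minimal sketch in Python. It uses an in-memory SQLite table as a stand-in for the real MySQL lm_sendq table; the `bat` and `uid` column names come from what Brett reported earlier in the thread, while the `id` column, the function names and the sample rows are assumptions for illustration only.

```python
import sqlite3


def messages_left(conn):
    """Return {batch id: rows still queued}, i.e. a row count per batch in lm_sendq."""
    rows = conn.execute(
        "SELECT bat, COUNT(*) FROM lm_sendq GROUP BY bat"
    ).fetchall()
    return dict(rows)


def demo_connection():
    """Build a throwaway in-memory queue table (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE lm_sendq (id INTEGER PRIMARY KEY, bat TEXT, uid INTEGER)"
    )
    conn.executemany(
        "INSERT INTO lm_sendq (bat, uid) VALUES (?, ?)",
        [("7afbbf", 1), ("7afbbf", 2), ("7afbbf", 3), ("e18c3a", 4)],
    )
    return conn
```

If the count for a batch never drops between page refreshes, no rows are being removed, which fits a queue that is stuck on a single row.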

If deleting a row in the sendq table allows the mailing to run you may want to note the userid so we can cross-reference the user data.  If you did this already did you notice anything strange?

I also recommend enabling the $smtp_debug var in admin.php so we can see what's happening with SMTP.

Regards
Dean Wiebe

BGSWebDesign

« Reply #5 on: April 17, 2006, 12:36:30 pm »
Hi DW,

Quote from: "DW"
Yes, it's starting to make some sense.  If ListMail were to be caught in a loop on a single user it would mean no other rows are processed and deleted from the queue table.


This must be happening, it just stalled again - I deleted the top record and it's running fine again - here's the email address domain for the user I deleted from lm_sendq:
Code: [Select]
cprk.com.my

I don't think that's a valid domain - so that must be what caused it to stall out this time?  The point is that you need to write something into resume.php that can handle this. Like I said, keep a count of the number of records in lm_sendq; if it is still the same 10 minutes later (or whatever interval the CRON is set at), then there is a problem and you'll need to delete the top record in lm_sendq - BUT keep a log of what is going on so we can see how many of these users are bad - write it out as a text file or something.

Quote
I also recommend enabling the $smtp_debug var in admin.php so we can see what's happening with SMTP.


Do you still want me to do this?  It seems to me it's bad email addresses that are causing the problem, right, or do you still want to see what SMTP is saying?  Let me know if you do, next time it chokes I'll put it in and then watch what it says - is that what you want?
Thanks,
-Brett

mr.trevor

  • Posts: 125
    • View Profile
« Reply #6 on: April 17, 2006, 01:57:54 pm »
Presumably you could manually add invalid addresses to your list to check this happening? Maybe this could be a bit dangerous with large mailings but if it was a small 'special' list and you were watching for it then it could be conclusive.
I, for one, appreciate all the extra work that is done by others to clear up these glitches.
Thank you for this work Brett. (and DW of course...)
TrevorW

DW

« Reply #7 on: April 17, 2006, 03:12:15 pm »
Instead of covering up the issue I would really like to get to the bottom of it.  I will try some tests with invalid addresses, but I fear this problem may only be reproducible on a server running particular mail server software.

I think that yes, Brett, you should enable the SMTP log.  It's possible that an SMTP response is not being interpreted properly, causing a loop.

If the logs don't yield any definitive information is there any way you can set me up on a test list on your server?  You might set up a 2nd bare-bones installation for this.

Regards
Dean Wiebe

BGSWebDesign

« Reply #8 on: April 17, 2006, 07:01:16 pm »
Hi DW,

Quote
Instead of covering up the issue I would really like to get to the bottom of it. I will try some tests with invalid addresses, but I fear this problem may only be reproducible on a server running particular mail server software.

I think that yes, Brett, you should enable the SMTP log. It's possible that an SMTP response is not being interpreted properly, causing a loop.


Sure, I'll turn it on - but ONLY when it stalls again - let us know what you find with your own testing...

Regarding the problem with the server causing looping - if this will be a problem for others - why not just write it into resume.php to check for these types of bad addresses and delete them so the mailing can continue?   Apparently if I have the problem others will also, but let us know what you find...
Thanks,
-Brett

DW

« Reply #9 on: April 17, 2006, 09:16:21 pm »
First I'll need to determine what, exactly, makes the email address 'bad'.  I believe it may be a server DNS/mailer issue dealing with non-existent domains.  The SMTP logs should give us an indication of what is going on.

Unfortunately I have never had this problem on my own servers or other servers I have had access to.  I don't think I'll be able to recreate this one without access to an installation on your particular server.  If you set up a 2nd ListMail install (just basic table setup and SMTP settings) on your site and provide access (plus FTP, if possible) I can promise you that I won't email any legitimate addresses that are not my own.  I might be able to figure something out with just a couple emails to fake / non-existent domains.

Regards
Dean Wiebe

BGSWebDesign

« Reply #10 on: April 18, 2006, 05:26:38 am »
Hi DW,

Ok, it happened again, the domain: goldcountry.bc.ca is also server not found, so I suspect that is the problem...(no, that's not it - it happens every time the email address has 2 periods in it - it did NOT do that in version 1.84 so the problem lies there).

You should be able to test this on your end, or are you saying that if you run email addresses at these bad servers your system runs fine?  Just use these two email addresses.

test1@cprk.com.my
test2@goldcountry.bc.ca

Also note how it seems to be addresses that contain two '.'s?   Not sure if that means anything...

2ND UPDATE
=========
It just happened again stalling with this domain: col.com.np, here's another test address for you:
test3@col.com.np

3RD UPDATE
========
Here's another one:
test4@net-rosas.com.br

Do you see a pattern here?  It always seems to be with addresses with 2 periods in them ('.').  Another thing to keep in mind it did NOT stall like this on addresses like this in version 1.84, so there's something going on here between that version and this one!

I'll be happy to load these two addresses to a test mail list and turn on SMTP debugging and try a mailing - but I've got to get my daily mailing out for today first - and that will take all day from what's left stalled in the queue!

I'll let you know tomorrow what I find, if you don't find something yourself...

Still you should think about providing an option in resume.php to DOUBLE CHECK the reccount() in lm_sendq - if it does NOT change within 15 minutes - obviously something is wrong and the queue is stalled - at that point delete the top record (write the address to a log first) and then go on your merry way - why not?   Or tell me how to include this - or I'll hack it myself - since apparently I'm going to need it on this server...
Thanks,
-Brett

DW

« Reply #11 on: April 18, 2006, 11:12:49 am »
This is good information.  I do not believe this is as simple as "email addresses with two periods", however.  I did some tests and it seems that all domains can have quick or delayed DNS replies.  This is probably due to nameserver response time.

I'm remembering something from a while ago... On some servers I noticed that SMTP communication can be delayed by these types of DNS issues.  In my experience it was not fatal, however - after about 10-20 seconds max the queue resumed at a normal speed until another one of these slow addresses was hit.  When sending, if you only get one delayed address in a block of 50 (LMP counter update interval) you might be able to notice that this is true.

Another possibility is that the delays are caused by the receiving mail server trying to determine if your domain exists (DNS lookup) before accepting the message.  This way your host's delays or the distance between servers might come into play...

The problem seems to be that your server waits for the DNS lookup before accepting the message for delivery.  My servers, running qmail and just tested with all of those addresses you provided, accept all messages instantly to the queue, and attempt all delivery in the background.
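One rough way to see whether name resolution is slow from a given server is to time the lookups directly. The sketch below is in Python and is only an approximation: `socket.getaddrinfo` resolves address (A/AAAA) records, whereas real mail delivery normally looks up MX records first. The function name `time_lookups` and its `resolve`/`clock` parameters are assumptions for illustration; they exist so the timing logic can be exercised without touching the network.

```python
import socket
import time


def time_lookups(domains, resolve=socket.getaddrinfo, clock=time.monotonic):
    """Return {domain: (resolved_ok, seconds_taken)} for each domain.

    Approximation only: this times an address lookup, not an MX lookup.
    """
    results = {}
    for domain in domains:
        start = clock()
        try:
            resolve(domain, 25)  # port 25 = SMTP
            ok = True
        except OSError:  # socket.gaierror is a subclass of OSError
            ok = False
        results[domain] = (ok, clock() - start)
    return results
```

Called as `time_lookups(["cprk.com.my", "goldcountry.bc.ca", "col.com.np", "net-rosas.com.br"])` on the sending server, it would show which of the domains from this thread resolve slowly or not at all from there. The results will depend on the server's nameservers at the time it is run.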

Shortly after delivery (<10s), I saw this in the logs indicating the destination server could in fact be reached.. Nothing yet for the other addresses, though:
Quote
Apr 18 11:08:02 serv2 qmail: 1145383682.898725 delivery 16100729: failure: 219.94.65.171_does_not_like_recipient./
Remote_host_said:_550_<test1@cprk.com.my>:_Recipient_address_rejected:_User_unknown_in_virtual_alias_table/Giving_up_on_219.94.65.171./

The rest returned something like this:
Quote
Apr 18 11:11:59 serv2 qmail: 1145383919.164225 delivery 16100735: deferral: Sorry,_I_couldn't_find_any_host_by_that_name._(#4.1.2)/

Either way, my server should send me a bounce so I can remove these users.

Your suggestion to scan the queue to see if it's changed might be a viable quick-fix for you, but I would rather not implement it hastily into ListMail as it could remove legitimate emails that have a chance of delivery.  If you give me the exact specs for your custom script (ie. scan queue table every 1 minute while mailings are active within 1 minute) I'll set it up for you then do more research on this.

I am not sure (yet) how to avoid this issue.  It could be as simple as a single config file change on your server.  Can you tell me what mail software your server uses?  This information might be available at the very top of any ListMail-created SMTP log in the 'greeting' for the connection.
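As a small illustration of reading that greeting, here is a sketch in Python that splits the first SMTP reply line into its reply code, the host name the server announces, and the free-text remainder, which is where many servers name their software. The function name `parse_greeting` is made up for this example, and no particular server output is assumed.

```python
def parse_greeting(line):
    """Split an SMTP reply line like '220 host text' into (code, host, text)."""
    line = line.strip()
    # A reply line is three digits followed by a space (or '-' on multi-line replies).
    if len(line) < 4 or not line[:3].isdigit() or line[3] not in " -":
        raise ValueError(f"not an SMTP reply line: {line!r}")
    code = int(line[:3])
    rest = line[4:].split(None, 1)
    host = rest[0] if rest else ""
    text = rest[1] if len(rest) > 1 else ""
    return code, host, text
```

The greeting can be copied from the top of the SMTP log. It can also be fetched live with Python's standard `smtplib`, whose `SMTP.connect()` method returns the reply code and message.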

Note: I have moved this post to development and changed its subject.

Regards
Dean Wiebe

BGSWebDesign

« Reply #12 on: April 18, 2006, 12:50:56 pm »
Hi DW,

Quote
If you give me the exact specs for your custom script (ie. scan queue table every 1 minute while mailings are active within 1 minute) I'll set it up for you then do more research on this.


That would be great - let's say scan every 80 seconds while active and if the same reccount() then delete the top record...
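A rough sketch of that check, in Python rather than as a change to resume.php itself. It uses an in-memory SQLite table as a stand-in for the real MySQL lm_sendq table; the `id` column, the class name `QueueWatchdog` and the helper `make_demo_queue` are assumptions for illustration, and the "while active" condition is left out. As DW cautioned, this approach can drop a legitimate address that merely happened to be slow, so every removed row is logged first.

```python
import sqlite3


class QueueWatchdog:
    """Removes the top queue row when the row count has not changed since the last check."""

    def __init__(self, conn, log):
        self.conn = conn
        self.log = log          # list standing in for a log file of removed rows
        self.last_count = None  # row count seen on the previous check

    def check(self):
        """Run once per interval; returns the removed row, or None if nothing was removed."""
        count = self.conn.execute("SELECT COUNT(*) FROM lm_sendq").fetchone()[0]
        removed = None
        if count > 0 and count == self.last_count:
            # Queue looks stalled: log the top row, then delete it.
            row = self.conn.execute(
                "SELECT id, bat, uid FROM lm_sendq ORDER BY id LIMIT 1"
            ).fetchone()
            self.log.append(row)
            self.conn.execute("DELETE FROM lm_sendq WHERE id = ?", (row[0],))
            self.conn.commit()
            count -= 1
            removed = row
        self.last_count = count  # reset the counter for the next check
        return removed


def make_demo_queue():
    """Build a throwaway in-memory queue with three rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE lm_sendq (id INTEGER PRIMARY KEY, bat TEXT, uid INTEGER)"
    )
    conn.executemany(
        "INSERT INTO lm_sendq (bat, uid) VALUES (?, ?)",
        [("7afbbf", 10), ("7afbbf", 11), ("7afbbf", 12)],
    )
    return conn
```

In practice `check()` would be called in a loop with an 80 second sleep between calls, or from its own cron entry with the previous count saved somewhere between runs.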

Let me know when you have it ready, this is causing me to 'baby-sit' the mailing - exactly what I didn't want to have to do...
Thanks,
-Brett

mike2

  • Posts: 193
    • View Profile
« Reply #13 on: April 18, 2006, 01:12:47 pm »
This almost sounds like a broken SMTP program to me, if indeed this is what is happening.

If you are running Sendmail in interactive mode, I could see this as a problem.  

My suggestion for this is to run it in background mode.  What program are you using, if you don't mind me asking?

BGSWebDesign

« Reply #14 on: April 18, 2006, 01:21:55 pm »
Hi Mike,

No, I'm not running sendmail, I'm running LMP's own SMTP mailer. From the LMP Configuration page:

Code: [Select]
SMTP Server (recommended):
settings - port: 25, reconnect every 249 emails.


Let me know if you have any other ideas - this thing seems to be stalling constantly - about every 30-45 minutes!  

Is the 'reconnect every 249 emails' too low?  I used to have it at 499, but lowered it when I started having problems.
Thanks,
-Brett