Author Topic: List fails partway through, but cannot resume!  (Read 3761 times)

jason.holloway

  • Posts: 5
List fails partway through, but cannot resume!
« on: June 28, 2007, 11:39:06 am »
Hi,

We've been sending out our first mailshot with LMP today, after spending a few days setting up a brand new server and LMP installation.

I'd run a series of tests on a small list (12 addresses) and it was all looking in good shape, and this afternoon (only two days late!) we started our main run of ~63k addresses. We're configured to use sendmail and reconnect every 1000.

Looked good from the outset, server load very reasonable, throughput OK, but failed in the low thousands. Restarted, and a few more 1000s before failing. Repeat several times. The errors reported at the failure were:
  • no message given
  • 250 2.1.5 ... Recipient ok
  • Called domail with no messages in queue, aborting.

Restarting, while annoying, is doable. However, LMP now thinks that the list has finished sending, and no longer gives an option to resume. "Sent Messages" gives 62564 as the number of emails sent. I can also see in the lm_sendp table:
Quote
started=2007-06-28 15:16:35
lastact=2007-06-28 17:23:33
completed=1

How can we restart this?

I've processed /var/log/maillog to grab a likely list of the 4452 addresses used so far, but trying to filter those out of the current list and re-do will cause havoc with users trying to unsubscribe. We considered creating a new list, but that will cause similar fragmentation problems which we'll have to re-merge - and what if the same failure occurs? We could end up with 10-20 lists to merge.
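For reference, a minimal Python sketch of how such an address list could be pulled from the mail log. It assumes the usual sendmail delivery-line layout (`to=<address>, ... stat=Sent`); the regular expression, the function name `sent_addresses` and the sample lines are illustrative assumptions, not output from this server, so check them against real log lines first.

```python
import re

# Matches the recipient and status fields of a sendmail delivery log line.
# The exact layout is an assumption; compare with real /var/log/maillog lines.
DELIVERY_RE = re.compile(r"\bto=<?([^>,\s]+)>?,.*\bstat=(\w+)", re.IGNORECASE)


def sent_addresses(lines):
    """Return the lowercased set of recipients whose log line reports stat=Sent."""
    sent = set()
    for line in lines:
        match = DELIVERY_RE.search(line)
        if match and match.group(2).lower() == "sent":
            sent.add(match.group(1).lower())
    return sent


# Made-up sample lines in the assumed format.
sample = [
    "Jun 28 15:16:40 mail sendmail[4321]: l5SEGZ01: to=<alice@example.com>, "
    "delay=00:00:02, mailer=esmtp, stat=Sent (ok)",
    "Jun 28 15:16:41 mail sendmail[4322]: l5SEGZ02: to=<bob@example.org>, "
    "delay=00:00:05, mailer=esmtp, stat=Deferred: Connection timed out",
    "Jun 28 15:16:42 mail sendmail[4323]: l5SEGZ03: from=<news@example.net>, "
    "size=48211, nrcpts=1",
]
print(sorted(sent_addresses(sample)))
```

Deferred or queued recipients are deliberately left out, since sendmail may still retry them on its own.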

Any thoughts at this point on how to recover, how to determine the cause, and how to prevent recurrence, would be most welcome!

cheers,
James Beckett
(Consultant, on behalf of registered user)

DW

  • Administrator
  • Posts: 3787
    • https://legacy.listmailpro.com
« Reply #1 on: June 28, 2007, 04:21:48 pm »
Hi James,

I haven't seen these symptoms before... Are you sure your entire list was not sent?  The number of rows in the lm_sendq table is the number of messages remaining.  If you do not see the "queue status" header in ListMail with the "Resume" option, there should be no rows in the lm_sendq table.  If some messages remain in lm_sendq for a batch marked completed in lm_sendp, you could try setting completed to NULL, blank or 0, after which you may be able to resume.
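A rough sketch of that check, written in Python against an in-memory SQLite database standing in for the real MySQL one. Only the table names lm_sendq and lm_sendp and the completed column come from this thread; the `id` key on lm_sendp, the `batch_id` parameter and the helper name are assumptions for illustration, and the real schema has more columns. Back up both tables before trying anything like this on a live install.

```python
import sqlite3


def reopen_if_queue_remains(conn, batch_id):
    """Clear the completed flag if queued rows still exist; return the queue size."""
    (remaining,) = conn.execute("SELECT COUNT(*) FROM lm_sendq").fetchone()
    if remaining > 0:
        # Assumption: lm_sendp rows are keyed by an `id` column.
        conn.execute(
            "UPDATE lm_sendp SET completed = 0 WHERE id = ? AND completed = 1",
            (batch_id,),
        )
        conn.commit()
    return remaining


# In-memory stand-in tables with a deliberately simplified, assumed schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lm_sendq (id INTEGER PRIMARY KEY, mid INTEGER)")
conn.execute("CREATE TABLE lm_sendp (id INTEGER PRIMARY KEY, completed INTEGER)")
conn.execute("INSERT INTO lm_sendp (id, completed) VALUES (1, 1)")
conn.executemany("INSERT INTO lm_sendq (mid) VALUES (?)", [(7,), (7,), (7,)])

print(reopen_if_queue_remains(conn, 1))
print(conn.execute("SELECT completed FROM lm_sendp WHERE id = 1").fetchone()[0])
```

If the queue is empty the function changes nothing, which matches the case where there is simply nothing left to resume.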

Some reasons for SMTP stalling can be found by searching the forum for the string "smtp_timeout".  One known issue is most common on Sendmail-based systems.  Try setting this in config.php to increase the default per-message timeout from 9s (this was incorporated due to some servers taking huge amounts of time or crashing when checking a bad domain).
Code: [Select]
$smtp_timeout = 30;
If possible, on-the-fly DNS checking should be disabled, which in itself could solve this:

http://listmailpro.com/forum/index.php?topic=1718.0

To try to gain more information in the future you may want to enable the "Always write SMTP log" option on the Configuration page.

Please let me know if you continue to have troubles :)

Regards
Dean Wiebe
ListMailPRO Author & Developer - Help | Support | Hosting

jason.holloway

  • Posts: 5
« Reply #2 on: June 28, 2007, 05:56:16 pm »
Quote from: "DW"
Are you sure your entire list was not sent?

That, while embarrassing, would be my preferred result! I don't think it's so, though:
  • We had a few 1000s in a few minutes before the first failure, then a few 1000s again in another few minutes, etc. It seems pretty unlikely that in the same sort of timespan the vast bulk of the mailing list would suddenly launch itself at lightspeed off the machine.
  • I started some process and network monitoring before the mailshot. It looks like TX data on veth0 is only about 200MB from then until now (long after the failure), but I'd calculated the total mailshot size as being in the region of 3G of data. Less than 10% would fit with the idea that we've sent about 5k/60k.
  • Total size of /var/log/maillog is 19174 lines, and includes traffic from Sunday onwards. Doubtful that that corresponds to 60k recipients.
... so no, I'm pretty confident the entire list was not sent.
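The bandwidth argument can be checked with a few lines of Python. This is only a back-of-envelope sketch: it uses the round figures quoted above (200MB transmitted, about 3G expected, ~63k recipients), assumes decimal units and equal-sized messages, and the function name is made up for illustration.

```python
def estimate_messages_sent(bytes_transmitted, total_bytes_expected, total_recipients):
    """Estimate messages sent, assuming every message is roughly the same size."""
    bytes_per_message = total_bytes_expected / total_recipients
    return bytes_transmitted / bytes_per_message


MB = 10**6
GB = 10**9

estimate = estimate_messages_sent(200 * MB, 3 * GB, 63000)
fraction = estimate / 63000
print(f"about {estimate:.0f} messages, {fraction:.1%} of the list")
```

The interface counter also includes non-mail traffic, so the estimate is an upper bound, and it sits in the same range as the 4452 addresses found in the mail log.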

Quote from: "DW"
The number of rows in the lm_sendq table is the number of messages remaining.  If you do not see the "queue status" header in ListMail with the "Resume" option, there should be no rows in the lm_sendq table.

lm_sendq is empty; there's no "queue status" header.

Quote from: "DW"
Some reasons for SMTP stalling ...
Code: [Select]
$smtp_timeout = 30;
If possible, on-the-fly DNS checking should be disabled, which in itself could solve this:

http://listmailpro.com/forum/index.php?topic=1718.0
Thanks, I've upped the timeout and will check those references - but they'd only be able to help with the root cause, not the situation I have now.

Quote from: "DW"
To try to gain more information in the future you may want to enable the "Always write SMTP log" option on the Configuration page.

Oh, I've done that now... though I fear the horse has bolted.

We'd really like to resume the mailout. Some thoughts I have now:
  • It seems to me that LMP would attempt sending in a predictable order - if I knew the order of sending, I could perhaps reset a counter and have it "notice" that it hasn't finished yet, allowing me to resume. However, it doesn't look like they went in email address alpha order, id order or uid order - is there any predictable pattern?
  • I have a list which is probably a good match for the addresses to which the mailshot has already been sent (or at least queued). How can I redo the mailshot without these, without disrupting unsubscribe operations?
  • You described lm_sendq above - if I find a way to populate that correctly, and reset the count/status in lm_sendp, would that be enough to resume the failed mailing?

cheers,
James

jason.holloway

  • Posts: 5
« Reply #3 on: June 29, 2007, 02:16:42 am »
Rather than fiddling with the lm_send* I'm going to try the technique suggested in http://listmailpro.com/forum/index.php?topic=1440.0 for a custom user selection, creating a temporary list with the addresses I think succeeded, and re-running the mailing with those filtered out. This seems pretty clean to me.
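The filtering step itself amounts to a case-insensitive set difference. The Python sketch below shows that logic outside LMP; it is not LMP's custom user selection mechanism, and the function name and sample addresses are made up for illustration.

```python
def remaining_recipients(full_list, already_sent):
    """Return addresses from full_list that are not in already_sent, in list order."""
    seen = {addr.strip().lower() for addr in already_sent}
    remaining = []
    for addr in full_list:
        key = addr.strip().lower()
        if key and key not in seen:
            remaining.append(addr.strip())
            seen.add(key)  # also drops duplicates within full_list
    return remaining


full_list = [
    "Alice@Example.com",
    "bob@example.org",
    "carol@example.net",
    "bob@example.org",
]
already_sent = ["alice@example.com"]
print(remaining_recipients(full_list, already_sent))
```

Comparing lowercased addresses avoids re-sending to someone whose address differs only in capitalisation between the list and the mail log.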

I've also put sendmail into queue-only mode as suggested by mike2 in http://listmailpro.com/forum/index.php?topic=1718.0 ...

-jmb

DW

  • Administrator
  • Posts: 3787
    • https://legacy.listmailpro.com
« Reply #4 on: June 29, 2007, 06:22:19 am »
jmb,

This is a tough one.  It does appear that if your messages were not sent they are lost, and we would have to rebuild the lm_sendq table carefully as you described in order to recover somewhat gracefully.

3G for a mailout?  That seems high... are you sending attachments?  I must recommend a link to your file instead as there always seem to be undesired results with attachments to large lists.

You seem to really know what you are doing.  I guess that's why you come to me with the toughest of situations! :D  I suppose what must have happened is ListMail erroneously "skipped past" a number of your remaining emails, most likely due to the smtp timeout issue.  If we had the LM SMTP logs we'd know for sure but unfortunately we do not.  Scanning system logs is likely to be time consuming and somewhat inaccurate, but if that's what's available that's what we'll use, eh? :)  It sounds like you have a plan now that is much easier than rebuilding the lm_send* tables manually... do let me know how it turns out or if I can be of further assistance at all.

Regards
Dean Wiebe
ListMailPRO Author & Developer - Help | Support | Hosting

jason.holloway

  • Posts: 5
« Reply #5 on: June 29, 2007, 06:58:49 am »
Quote from: "DW"
It does appear that if your messages were not sent they are lost

I'd rather come to the same conclusion myself. :(

Quote from: "DW"
3G for a mailout?  That seems high... are you sending attachments?

Yes, the mailout is html with a few inline images. I wanted to run some size reduction but didn't quite have the time.

Quote from: "DW"
I must recommend a link to your file instead as there always seem to be undesired results with attachments to large lists.

We looked at that, and while lower in initial bandwidth (and probably overall as addresses bounce and recipients don't load images) we don't like the "click to load remote images" that you get with most email clients - it just doesn't look professional. Remote images always seem to me to smack of recipient tracking. I hate spam as much as anyone, and tracking always gets my goat.

Quote from: "DW"
I suppose what must have happened is ListMail erroneously "skipped past" a number of your remaining emails, most likely due to the smtp timeout issue.

Seems that way - yes, the SMTP diagnostic would be good (drat, drat and double drat; curse my metal body, I was too slow) but I'm wondering what LMP could have done to record its own action after such a failure mode... silently losing the remainder of a mailout is, mmm, not ideal.

Quote from: "DW"
It sounds like you have a plan now that is much easier than rebuilding the lm_send* tables manually...

It seems to have gone pretty well. We got a sensible looking "remainder to send" list, and LMP has not had any problems sending it - it finished a few minutes ago.

Now comes the problem that the queue seems to be being processed very slowly. Time to kick off a bunch of queue runners in parallel, I think.

Another couple of ideas came to mind:

pre-loading DNS - with a local BIND running and a sizeable cache, one could perhaps make a script that would look up MX records for the domains in the user list, causing them to be cached locally; then when sendmail comes to look them up, they'll already be here. You could also heavily parallelize that, unlike the serial processing of sendmail (smtp or single queue running).
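A hedged sketch of what such a pre-loading script might look like in Python. It assumes the `dig` command is installed and that the machine resolves through the local caching BIND; the function names, the worker count and the stub lookup in the example are all illustrative choices, not anything LMP or sendmail provides.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def domains_of(addresses):
    """Return the unique, lowercased domain parts of the given addresses."""
    domains = {a.rsplit("@", 1)[1].strip().lower() for a in addresses if "@" in a}
    domains.discard("")
    return sorted(domains)


def dig_mx(domain):
    """Ask the local resolver for MX records by running `dig` (assumed installed)."""
    result = subprocess.run(
        ["dig", "+short", "MX", domain],
        capture_output=True, text=True, timeout=15, check=False,
    )
    return result.stdout.strip().splitlines()


def prewarm(addresses, lookup=dig_mx, workers=20):
    """Resolve every domain in parallel; return {domain: records, or None on error}."""
    domains = domains_of(addresses)

    def safe_lookup(domain):
        try:
            return lookup(domain)
        except Exception:
            return None  # one failed lookup should not stop the rest

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(domains, pool.map(safe_lookup, domains)))


# Example with a stub lookup so nothing touches the network.
print(prewarm(["a@example.com", "b@Example.com", "c@example.org"],
              lookup=lambda d: ["10 mx." + d], workers=2))
```

Only the side effect matters here: each successful lookup leaves the answer in the local cache, so the returned dictionary is mainly useful for spotting domains that failed to resolve.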

progress monitoring - I wrote a SQL fragment that pulls out the status of the queue and stores it in another table, where it can be queried to graph the rate of sending, both in real time and after completion. This is quick and dirty, but works:

Code: [Select]
CREATE TABLE jmb_stats_send (
    id        SMALLINT(5) UNSIGNED,   -- per lm_sent.id = mailout identifier
    date      DATETIME,               -- sample time; really, we should have a unique key across id/date
    total     MEDIUMINT(8) UNSIGNED,  -- lm_sent.numsent at sample time
    remaining MEDIUMINT(8) UNSIGNED   -- rows still queued in lm_sendq
);

-- Take one sample per mailout that still has rows in the queue.
INSERT INTO jmb_stats_send
SELECT s.id,
       NOW()       AS date,
       s.numsent   AS total,
       COUNT(q.id) AS remaining
FROM lm_sendq q
JOIN lm_sent s ON q.mid = s.id
GROUP BY s.id;

SELECT * FROM jmb_stats_send;
I stuck the INSERT into a small PHP script and ran it from cron, like the resume script.
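To turn those samples into a sending rate, something like the following Python sketch would do. It assumes the rows for one mailout have been read from jmb_stats_send as (date, remaining) pairs in time order; the function name is hypothetical and the sample figures are invented for illustration, not measurements from this mailout.

```python
from datetime import datetime


def send_rates(samples):
    """Return (time, messages per minute) between consecutive (date, remaining) samples."""
    rates = []
    for (t0, r0), (t1, r1) in zip(samples, samples[1:]):
        minutes = (t1 - t0).total_seconds() / 60
        if minutes > 0:
            rates.append((t1, (r0 - r1) / minutes))
    return rates


# Invented sample rows, five minutes apart.
samples = [
    (datetime(2007, 6, 29, 6, 0), 58000),
    (datetime(2007, 6, 29, 6, 5), 57000),
    (datetime(2007, 6, 29, 6, 10), 55500),
]
for when, rate in send_rates(samples):
    print(when.strftime("%H:%M"), f"{rate:.0f} msgs/min")
```

A rate that drops to zero while remaining is still above zero would flag a stall like the ones described at the start of this thread.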